Introducing Instella-Long: A Fully Open Language Model with Long-Context Capability#

AMD is excited to announce Instella-Long, a long-context language model continually trained from Instella-3B-Instruct on AMD Instinct™ MI300X GPUs. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch that supports long context. Instella-Long supports a 128K context length and achieves competitive performance, outperforming open-weight models such as Phi-3.5-mini [1], Gemma-3-4B [2], and Qwen2.5-3B [3] on long-context benchmarks.
By training Instella with long context extension on Instinct MI300X GPUs, we highlight our hardware’s capability and scalability in handling demanding AI training workloads, offering a viable alternative in the AI hardware landscape. In line with the AMD commitment to open source, we are sharing all the model weights, detailed training configurations, datasets, and code, enabling the AI community to collaborate, replicate, and innovate, thereby accelerating progress.
Key Takeaways#
Announcing Instella-Long, a 3B long-context language model developed by AMD that supports a 128K context length, trained on 64 Instinct MI300X GPUs.
To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch that supports long context. The Hugging Face model, training data, and training code are fully open sourced.
Supported by the AMD ROCm software stack, Instella-Long employs efficient training techniques such as Sequence Parallelism, FlashAttention-2 [4], Torch Compile, and FSDP to distribute model training across 8 MI300X nodes, each with 8 GPUs.
Instella-Long#
Instella-Long is based on the Instella model released in March. Specifically, Instella-Long is continually trained from Instella-3B-Instruct and follows the same model architecture. The training of Instella-Long comprises three stages: 1. Continued Pre-Training, 2. Supervised Finetuning (SFT), 3. Direct Preference Optimization (DPO).
Continued Pre-Training#
Training: We employ a two-phase pre-training starting from Instella-3B-Instruct (4K context length).
Phase 1: We extend the context length from 4,096 to 65,536 tokens and train the model using 20B tokens. We follow the RoPE scaling law to increase the base frequency of RoPE [5] from 10,000 to 514,640.
Phase 2: As indicated by Prolong [6], it is beneficial to train the model on data whose context length is longer than the target context length. In this phase, we train the model on 20B tokens with a maximum context length of 262,144 tokens, i.e., 2× the target context length of 128K. Following the RoPE scaling law, we increase the RoPE base frequency to 3,691,950.
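As an illustration, the sketch below shows how the context window and RoPE base frequency could be raised between phases, using the values reported in this post. It assumes a Hugging Face-style configuration that exposes `max_position_embeddings` and `rope_theta`; the exact attribute names and loading code for the Instella architecture may differ, so treat this as a sketch rather than the actual training setup.

```python
# Minimal sketch (not the actual training code): raising the context window and
# RoPE base frequency for Phase 1, using the values reported in this post.
# Assumes a Hugging Face-style config exposing `max_position_embeddings` and
# `rope_theta`; attribute names for the Instella architecture may differ.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("amd/Instella-3B-Instruct", trust_remote_code=True)

# Phase 1: context 4,096 -> 65,536 tokens, RoPE base 10,000 -> 514,640.
config.max_position_embeddings = 65_536
config.rope_theta = 514_640

# Phase 2 would raise these again to 262,144 tokens and a base of 3,691,950.
```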
Data: Our continued pre-training data originates from the data mix created by Prolong. We use the text data curated by Prolong and tokenize the data with our tokenizer. In each phase of the continued pre-training, we train on a mix of long and short context data. Specific details are outlined as follows:
Training Phase | 64K Long Data | 256K Long Data | Short Data
---|---|---|---
Phase 1 | Code repos (30%), Books (30%), Textbooks (3%) | - | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (3%), OpenWebMath (5%)
Phase 2 | Code repos (10%), Books (15%) | Code repos (20%), Books (15%), Textbooks (2%) | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (4%), OpenWebMath (5%)
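For illustration, here is a minimal sketch of sampling pre-training documents according to a source-weight mix like the Phase 1 row above. The source names and the per-document sampling granularity are hypothetical; the actual Instella-Long data pipeline may pack and batch long sequences differently.

```python
import random

# Hypothetical Phase 1 source weights taken from the table above.
PHASE1_MIX = {
    "code_repos_64k": 0.30, "books_64k": 0.30, "textbooks_64k": 0.03,
    "fineweb_edu": 0.10, "fineweb": 0.10, "stackexchange": 0.04,
    "wikipedia": 0.05, "arxiv": 0.03, "openwebmath": 0.05,
}

def sample_source(mix: dict) -> str:
    """Pick a data source with probability proportional to its mix weight."""
    sources = list(mix)
    weights = [mix[s] for s in sources]
    return random.choices(sources, weights=weights, k=1)[0]
```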
Supervised Finetuning (SFT)#
Training: After continued training on the long-context pre-training data, we perform supervised finetuning on long-context instruction data. We train the model using a 1B-token mixture of short- and long-context instruction data.
Data: Similar to the continued pre-training stage, we train the model on a mixture of short- and long-context instruction data with a ratio of 4 to 6. For short-context instruction data, we use Ultrachat 200K [7], OpenMathInstruct-2 [8], Tülu-3 Instruction Following [9], and the MMLU auxiliary train set [10].

For long-context instruction data, we construct a synthetic long-context instruction dataset due to the lack of long-context SFT data. Specifically, we make use of the long documents from Books, which is part of our continued pre-training data corpus. We select documents with a minimum length of 8K tokens and truncate those exceeding 128K tokens to a maximum length of 128K. Then, we use Qwen2.5-14B-Instruct-1M as a teacher model to synthetically generate question-answer pairs for the documents. To speed up this process, we randomly choose a subpart of the document for the QA generation instead of using the whole document. The length of the subpart is randomly set to be between 2K and 8K tokens. We use the NLTK sentence tokenizer to divide documents into sentences, ensuring that the selected subpart contains only complete sentences. The generated question and answer are appended to the end of the long document, serving as a complete single-round instruction-following data sample.

In addition, we also generate long-context instruction data from short documents to increase the dataset diversity with more data sources. We use arXiv from our continued pre-training corpus and the DCLM subset from Dolmino-Mix-1124 [11]. We first generate a QA pair for each short document following the same pipeline described above. Then, we iteratively concatenate different short documents until the concatenation reaches 128K tokens; it can exceed 128K because we do not truncate the last document. Lastly, we randomly choose the QA pair corresponding to one of the short documents and append it to the end of the concatenated document. The final data mixture for the SFT stage is shown as follows:
Short Data | Long Data
---|---
Ultrachat 200K (25%), OpenMathInstruct-2 (10%), MMLU auxiliary train set (3%), Tülu-3 Instruction Following (2%) | Books (44%), DCLM (10%), ArXiv (6%)
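To make the long-context SFT recipe above concrete, here is a minimal sketch of the subpart selection and QA-appending steps. It assumes a Hugging Face-style tokenizer and uses `generate_qa` as a placeholder for prompting the teacher model (Qwen2.5-14B-Instruct-1M); the function names, prompt formatting, and data layout are illustrative, not the actual pipeline.

```python
import random
from nltk.tokenize import sent_tokenize  # requires the NLTK "punkt" resource

def select_subpart(document: str, tokenizer, min_tokens: int = 2_048,
                   max_tokens: int = 8_192) -> str:
    """Pick a contiguous run of complete sentences whose total length is roughly
    between min_tokens and max_tokens, mirroring the 2K-8K subpart selection
    described above."""
    sentences = sent_tokenize(document)
    target = random.randint(min_tokens, max_tokens)
    start = random.randrange(len(sentences))
    subpart, length = [], 0
    for sentence in sentences[start:]:
        n_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))
        if subpart and length + n_tokens > target:
            break
        subpart.append(sentence)
        length += n_tokens
    return " ".join(subpart)

def build_long_sft_sample(document: str, tokenizer, generate_qa) -> dict:
    """Generate a QA pair from a subpart of `document` and append it to the end
    of the full document as a single-round instruction sample. `generate_qa` is
    a placeholder for prompting the teacher model on the selected subpart."""
    subpart = select_subpart(document, tokenizer)
    question, answer = generate_qa(subpart)
    return {"prompt": f"{document}\n\n{question}", "response": answer}
```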
Direct Preference Optimization (DPO)#
Training: In the last training stage, we perform human preference alignment training using Direct Preference Optimization [12]. We employ the same DPO training recipe and data as Instella-3B-Instruct. Unlike previous training stages, in the DPO stage we train only on short data, with a maximum context length of 2K. Consistent with the findings of other open-weight models, we observe that performing DPO on short data alone continues to improve the model's performance on long-context tasks.
Data: We use the OLMo-2-1124-7B-Preference-Mix [11] dataset as our DPO data which contains 0.76B tokens.
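For reference, below is a minimal sketch of the standard DPO objective [12] on a batch of preference pairs. It illustrates the loss being optimized rather than the actual Instella-Long training code, and the `beta` value here is an assumption.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the policy to widen the log-probability margin
    between chosen and rejected responses relative to the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```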
Sequence Parallelism#
To enable training with extremely long inputs, we implement sequence parallelism based on DeepSpeed Ulysses [13]. Sequence parallelism distributes the attention heads across GPUs during the attention computation, and it requires less GPU communication than Ring-Attention [14]. Due to the long inputs, we use four GPUs as a sequence parallelism group for the Phase 2 continued pre-training and the SFT stage.
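Conceptually, each rank starts with its local slice of the sequence and all attention heads; an all-to-all exchange gives it the full sequence for a subset of heads, attention runs locally, and a second all-to-all restores the original layout. The sketch below illustrates that exchange with `torch.distributed`; it assumes equal shard sizes and a head count divisible by the group size, and is not the actual training implementation.

```python
import torch
import torch.distributed as dist

def ulysses_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Ulysses-style exchange before attention. Input: local sequence shard with
    all heads, shape [seq_local, num_heads, head_dim]. Output: full sequence for
    a shard of heads, shape [seq_local * sp, num_heads // sp, head_dim].
    An analogous exchange after attention restores the original layout."""
    sp = dist.get_world_size(group)
    # Split the head dimension into `sp` chunks, one destined for each rank.
    send = [t.contiguous() for t in x.chunk(sp, dim=1)]
    recv = [torch.empty_like(send[0]) for _ in range(sp)]
    dist.all_to_all(recv, send, group=group)
    # Received chunks are sequence shards (in rank order) for this rank's heads.
    return torch.cat(recv, dim=0)
```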
Results#
We evaluate long-context performance on Helmet [15], a recent comprehensive long-context evaluation benchmark encompassing diverse categories. Helmet shows better consistency with human perception than previous long-context benchmarks.
Instella-3B-Long-Instruct outperforms open-weight models including Phi-3.5-mini-instruct [1], Gemma-3-4B-it [2], Qwen2.5-3B-Instruct [3], and MiniCPM-2B-128k [16] on most tasks of the Helmet benchmark (Table 3).
We also performed a side-by-side comparison with Qwen2.5-3B-Instruct at 8K, 16K, and 32K context lengths, since its maximum context length is 32K. Instella-3B-Long-Instruct outperforms Qwen2.5-3B-Instruct by 2.75% on average (Table 4).
Models | Size | Training Tokens (from scratch) | Natural Questions (RAG) | TriviaQA (RAG) | HotpotQA (RAG) | InfiniteBench QA | InfiniteBench MC | NarrativeQA | NIAH (multi value needles) | Average
---|---|---|---|---|---|---|---|---|---|---
Open Weight Models | | | | | | | | | |
Llama-3.2-3B-Instruct | 3.21B | ~9T | 51.8 | 86.2 | 56.4 | 38.7 | 56.0 | 26.0 | 99.2 | 59.19
Phi-3.5-mini-instruct | 3.82B | - | 41.2 | 78.6 | 48.6 | 24.0 | 55.0 | 27.7 | 87.0 | 51.73
gemma-3-4b-it | 4.3B | ~4T | 47.2 | 76.8 | 45.2 | 21.0 | 49.0 | 20.7 | 74.0 | 47.70
Qwen2.5-3B-Instruct | 3.09B | ~18T | 34.6 | 65.8 | 41.8 | 14.7 | 35.0 | 21.0 | 80.4 | 41.90
MiniCPM-2B-128k | 2.4B | ~1T | 28.4 | 61.6 | 30.8 | 3.7 | 22.0 | 3.3 | 46.6 | 28.06
Fully Open Models | | | | | | | | | |
Instella-3B-Long-Instruct | 3.11B | ~4T | 43.6 | 73.0 | 51.6 | 30.7 | 54.0 | 32.3 | 84.0 | 52.74

Table 3: Results on the Helmet long-context benchmark.
Model | NIAH 8K | NIAH 16K | NIAH 32K | NQ 8K | NQ 16K | NQ 32K | TriviaQA 8K | TriviaQA 16K | TriviaQA 32K | HotpotQA 8K | HotpotQA 16K | HotpotQA 32K | Average
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Instella-3B-Long-Instruct | 98 | 95 | 87 | 53 | 49 | 46 | 79 | 73 | 75 | 59 | 59 | 51 | 68.67
Qwen2.5-3B-Instruct | 95 | 94 | 95 | 48 | 42 | 39 | 77 | 78 | 74 | 51 | 50 | 48 | 65.92

Table 4: Side-by-side comparison with Qwen2.5-3B-Instruct at 8K, 16K, and 32K context lengths. NIAH uses multi-value needles; NQ = Natural Questions; NQ, TriviaQA, and HotpotQA are RAG tasks.
Evaluation Metrics: We use substring exact match (SubEM) for the RAG tasks, including Natural Questions, TriviaQA, and HotpotQA. We use recall for NIAH and exact match for InfiniteBench MC. For InfiniteBench QA and NarrativeQA, where the answers are open-ended, we use gpt-4o-mini to evaluate the answers against the ground truth using the prompt and metric provided by Helmet.
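As a reference for how the RAG tasks are scored, here is a minimal sketch of substring exact match; the exact answer normalization in Helmet's implementation may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def substring_exact_match(prediction: str, gold_answers: list) -> float:
    """SubEM: 1.0 if any normalized gold answer appears as a substring of the
    normalized prediction, else 0.0."""
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in gold_answers))
```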
Models | MMLU | IFEval | MT-Bench | TruthfulQA | Toxigen (↓) | Crows-Pair
---|---|---|---|---|---|---
Instella-3B-Instruct | 58.90 | 71.35 | 7.23 | 55.47 | 57.02 | 58.86
Instella-3B-Long-Instruct | 57.44 | 68.76 | 6.83 | 55.52 | 42.34 | 60.05

Table 5: Short-context benchmark results.
Short-context results: We observe performance drops on some short-context benchmarks compared to Instella-3B-Instruct (Table 5). Interestingly, TruthfulQA remains stable, while Crows-Pair shows a slight improvement, indicating potential gains in certain responsible AI metrics. The reduction in Toxigen (57.02 → 42.34, lower is better) suggests improved toxicity avoidance in the long-context variant. We hypothesize that these results reflect a trade-off between optimizing for longer context lengths and retaining short-context performance, which may be more pronounced at the 3B parameter scale compared to larger models.
Summary#
In this blog, we introduced Instella-Long, a fully open long-context language model trained from scratch on AMD Instinct™ MI300X GPUs, detailing its training methodology, the datasets used, and its benchmark performance.
The release of the Instella-Long model represents a significant stride in advancing open-source AI and demonstrates the capabilities of AMD hardware in language model training. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch that supports long context, while achieving competitive performance compared to open-weight models.
By fully open sourcing the Instella-Long model, including weights, training configurations, datasets, and code, we aim to foster innovation and collaboration within the AI community. We believe that transparency, reproducibility, and accessibility are key drivers of progress in AI research and development. We invite developers, researchers, and AI enthusiasts to explore Instella-Long, contribute to its ongoing improvement, and join us in pushing the boundaries of what is possible with language models.
Resources#
Hugging Face Model Card: amd/Instella-3B-Long-Instruct
Training data: amd/Instella-Long
Training Code: AMD-AIG-AIMA/Instella
Bias, Risks, and Limitations#
The models are released for research purposes only and are not intended for use cases that require high levels of factuality, safety-critical situations, health or medical applications, generating false information, or facilitating toxic conversations.
Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms as per their respective use cases.
It may be possible to prompt the model to generate content that is factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be generated by prompts that were not intended to produce it. Users are thus requested to be aware of this and to exercise caution and responsible thinking when using the model.
The multilingual abilities of the models have not been tested; the models may therefore misunderstand inputs and generate erroneous responses across different languages.
License#
The Instella-3B-Long-Instruct model is licensed for academic and research purposes under a ResearchRAIL license. Refer to the LICENSE and NOTICES files for more information.
The amd/Instella-Long is a collection of pre-training and instruction following data that is used to train Instella-3B-Long-Instruct, and is licensed for academic and research purposes under a ResearchRAIL license. Refer to the LICENSE in the amd/Instella-Long dataset card for more information.
Contributors#
Core contributors: Jialian Wu, Jiang Liu, Sudhanshu Ranjan, Xiaodong Yu, Gowtham Ramesh, Prakamya Mishra, Zicheng Liu
Contributors: Yusheng Su, Ximeng Sun, Ze Wang, Emad Barsoum
Feel free to cite our Instella models:
@misc{Instella,
title = {Instella: Fully Open Language Models with Stellar Performance},
url = {https://huggingface.co/amd/Instella-3B},
author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
month = {March},
year = {2025}
}
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.