
arxiv

Paper 2412.06845

Fully Open Source Moxin-7B Technical Report

Authors: Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Xingchen Xu, Yu Huang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang

Published: 2024-12-08

Abstract:

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, and some use restrictive licenses whilst claiming to be "open-source," which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF), a ranked classification system that evaluates AI models based on model completeness and openness, adhering to principles of open science, open source, open data, and open access. Our model achieves the highest MOF classification level of "open science" through the comprehensive release of pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints. Experiments show that our model achieves superior performance in zero-shot evaluation compared with popular 7B models and performs competitively in few-shot evaluation.

Paper Content:
Fully Open Source Moxin-LLM Technical Report

Pu Zhao1, Xuan Shen1, Zhenglun Kong2, Yixin Shen3, Sung-En Chang1, Timothy Rupprecht1, Lei Lu1, Enfu Nan1, Changdi Yang1, Yumei He4, Xingchen Xu5, Yu Huang6, Wei Wang7, Yue Chen7, Yong He7, Yanzhi Wang1,8

1 Northeastern University, 2 Harvard University, 3 Cornell University, 4 Tulane University, 5 University of Washington, 6 Roboraction.ai, 7 Futurewei Technologies, 8 AIBAO LLC

arXiv:2412.06845v2 [cs.CL] 11 Dec 2024

Homepage: https://github.com/moxin-org/Moxin-LLM
Base model: https://huggingface.co/moxin-org/moxin-llm-7b
Chat model: https://huggingface.co/moxin-org/moxin-chat-7b

1 Introduction

The field of natural language processing has witnessed some of its most exciting discoveries of the last ten years with the emergence of large language models (LLMs). At the forefront of this evolution are LLMs such as GPT-4 [1], Claude [2], and Gemini [3], which have captured the attention of the AI community due to their performance and versatility. Meanwhile, the recent emergence of openly accessible yet highly capable LLMs such as LLaMA [4], Falcon [5], and Mistral [6] allows researchers and practitioners to easily obtain, customize, and deploy LLMs in a wider range of environments and for more diverse use cases. These trends have people eagerly asking what comes next, and some suggest that “a general intelligence” is right around the corner.

Despite the growing influence and accessibility of open-source LLMs, a notable challenge has emerged: many model producers restrict visibility and access to their training, fine-tuning, and evaluation processes, including crucial components such as their training code and data [7]. Some model producers even use restrictive licenses whilst claiming to be “open-source.” This practice creates barriers for the broader AI research community to study, replicate, and innovate upon advanced LLMs. In parallel, it prevents businesses from fully leveraging open-source models for innovative industrial applications, as their commercialization has raised concerns about transparency, reproducibility, and safety.
To unlock the full potential of LLMs and open innovation, we must democratize this research by putting models into the hands of more researchers and making the datasets they are trained on fully open-source. This requires moving beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, which is crucial for fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem [8].

To achieve this goal, we introduce Moxin 7B, a fully open-source LLM developed in compliance with the Model Openness Framework (MOF) introduced in [9]. The MOF provides a systematic ranked classification system to rate AI models based on their completeness and openness, incorporating the principles of open science, open source, open data, and open access. By promoting transparency and reproducibility, the MOF serves as a crucial tool to combat “openwashing” practices and establishes completeness and openness as primary criteria alongside the core tenets of responsible AI. Wide adoption of the MOF will cultivate a more open AI ecosystem, benefiting research, innovation, and adoption of state-of-the-art models.

Our open-source LLM has released pre-training code and configurations, training and fine-tuning data, and intermediate and final checkpoints, as part of a continuing commitment to fully open-source LLMs. Our model achieves the highest MOF classification level of “open science.” It is noteworthy that this commitment to openness has not compromised performance: our base model achieves superior performance in zero-shot evaluation compared with popular 7B models and performs competitively in few-shot evaluation. Our chat model also outperforms 7B baselines such as Llama2-7B-chat. Our homepage is https://github.com/moxin-org/Moxin-LLM.

2 Related Work

2.1 Models, Tokenizers, and Training

Models.
State-of-the-art large language models (LLMs) typically comprise a substantial number of parameters, often approaching or exceeding 100 billion [4, 1, 3]. To facilitate broader accessibility, smaller models with fewer than 20 billion parameters, and even those around 7 billion parameters, have been developed [10, 11, 4, 6, 12, 13]. In addition, efficiency-enhancing techniques, such as implementing MAMBA-based architectures in Jamba, have been employed to optimize performance [12, 13].

Tokenizers. Tokenizers are essential to convert raw data into a suitable format for model processing. Many contemporary models employ Byte-Pair Encoding (BPE) [14], with OpenAI’s tiktoken tokenizer [15] being a notable implementation. However, for languages that handle tokens differently from Romance languages, alternatives such as SentencePiece [16] are utilized, as seen in XLNet [17]. Hugging Face offers an excellent summary of state-of-the-art tokenizers with practical examples [18]. Moreover, tokenization extends beyond text modalities; many foundational models now include multimodal capabilities, processing documents, audio, images, and even videos [19, 20, 21, 22].

Training. To enhance the performance of smaller models beyond their inherent limitations, various training strategies can be employed. A notable example is the application of Mixture of Experts (MoE) training, which has achieved significant success in models like Mixtral [23].

2.2 Data curation methods

Researchers commonly collect large datasets for training language models (LMs) [24] by performing web crawls. However, these datasets often contain undesirable content, necessitating data curation to improve their quality. To enhance model performance [25, 26, 24, 27], several data curation techniques are widely employed. These include filtering by language [28, 29, 30], heuristic-based filtering [25, 31, 32], quality filtering [33, 34, 35], data deduplication [36, 37], and data mixing [38, 39, 40].
2.3 Open-source datasets

As the scale of LMs has increased in recent years [4, 41, 42, 1], the community has correspondingly curated larger datasets to support their training. Early datasets include the C4 dataset, containing 160 billion tokens, and The Pile [32], which comprises 300 billion tokens. More recently, even larger datasets have been introduced: RefinedWeb [25] with 600 billion tokens, Dolma [43] with 3 trillion tokens, FineWeb [44] with 15 trillion tokens, and RedPajama-v2 [45] containing 30 trillion tokens. In addition to these general-purpose datasets, large domain-specific datasets have also been developed. For instance, StackV2 [46], a code-focused dataset, includes 900 billion tokens, and FineWeb-Edu [44], a high-quality filtered educational text dataset, contains 1.3 trillion tokens.

3 Model Training

3.1 Model Architecture

We opt to extend the Mistral model architecture [6] due to its ability to achieve high performance while maintaining efficient inference speeds. The original Mistral 7B model demonstrates superior performance compared to multiple 7B language models and even outperforms larger models on various evaluation benchmarks. Notably, it surpasses the LLaMA 34B model [47] in tasks such as mathematics and code generation.

Table 1: Parameter setting.

Parameter     Value
n_layers      36
dim           4096
head_dim      128
hidden_dim    14336
n_heads       32
n_kv_heads    8

The original Mistral model leverages grouped-query attention (GQA) [48] and sliding window attention (SWA) [49]. GQA reduces memory requirements during decoding, allowing for larger batch sizes and higher throughput, and it significantly accelerates inference speed, an essential factor in real-time applications. Meanwhile, SWA effectively handles long sequences without incurring substantial computational overhead. By incorporating these techniques, the model achieves significant improvements in performance and efficiency, which we have adopted in our extended model.
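To make the GQA memory saving concrete, the short sketch below computes the per-token key/value cache size from the Table 1 settings and compares it with a hypothetical multi-head baseline that keeps one key/value head per query head. The 2-byte element size (fp16/bf16) is an assumption for illustration and is not stated in this report; the function name is likewise illustrative.

```python
# Values taken from Table 1.
N_LAYERS = 36
HEAD_DIM = 128
N_HEADS = 32       # query heads
N_KV_HEADS = 8     # key/value heads under GQA

BYTES_PER_ELEMENT = 2  # assumption: cache stored in fp16/bf16


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to cache keys and values for one token across all layers."""
    # The factor 2 covers one key vector and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


# GQA: 8 key/value heads are shared by the 32 query heads.
gqa_bytes = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEMENT)
# Hypothetical MHA baseline: one key/value head per query head.
mha_bytes = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM, BYTES_PER_ELEMENT)
reduction = mha_bytes // gqa_bytes

print(f"GQA: {gqa_bytes} bytes/token, MHA: {mha_bytes} bytes/token, reduction: {reduction}x")
```

Under these assumptions the cache shrinks by the ratio n_heads / n_kv_heads = 4, which is the source of the larger batch sizes mentioned above.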
Building upon the original Mistral model, which consists of 32 blocks, we have extended the architecture to 36 blocks. Furthermore, we also employ GQA to partition the query heads into multiple groups, each sharing a single key head and value head. This approach interpolates between multi-query attention (MQA) and multi-head attention (MHA) in large language models, striking a balance between the computational speed of MQA and the representational quality of MHA, thereby providing a favorable trade-off. Additionally, our model incorporates a rolling buffer cache with a fixed attention span, effectively limiting cache size and preventing excessive memory usage when processing long sequences.

3.2 Training Data

Data are fundamental to the pre-training of LLMs. Preparing such training data requires careful consideration of multiple challenges, including handling sensitive information, ensuring comprehensive knowledge coverage, and achieving higher efficiency with improved data quality. In this section, we detail the processes of preparing textual data from general domains and coding data related to programming languages.

3.2.1 Text Data

We use a mix of data from SlimPajama [50] and DCLM-BASELINE [38] as our text training data.

During the training of LLaMA, it was demonstrated that the performance of a 7B model continues to improve even after being trained on more than 1T tokens [51]. Given the outstanding performance of LLaMA, its data collection methodology was rapidly replicated, leading to the release of RedPajama, an open-source dataset containing 1.2 trillion tokens [52].

However, subsequent analyses reveal a significant limitation: some corpora within RedPajama contain a large percentage of duplicate content. The deduplication guidelines in RedPajama operate only within individual data sources, leaving inter-source duplicates largely unaddressed.
To improve data quality and training efficiency, SlimPajama (footnote 1) was developed as a refined iteration of RedPajama, offering a cleaned and extensively deduplicated version [50]. SlimPajama implements a rigorous two-stage preprocessing pipeline to enhance data quality. In the first stage, short and low-quality documents are removed from RedPajama. Specifically, documents that have fewer than 200 characters after removing punctuation, space symbols, newlines, and tabs are filtered out, as these documents typically contain only metadata and lack useful information. As a result of this step, 1.86% of RedPajama documents are eliminated. The second stage removes duplicate data, as deduplication enhances training efficiency and reduces memorization, thereby decreasing the likelihood of generating text solely by recalling training data [25, 53, 54, 36, 55]. To perform deduplication, document signatures are created using pre-processed, lower-cased 13-grams. Subsequently, MinHashLSH [56] is employed to identify and eliminate duplicates based on a Jaccard similarity threshold of 0.8. Deduplication is performed both within and across data sources. Overall, by pruning 49.6% of the bytes from the RedPajama dataset, the 627B-token SlimPajama dataset is obtained.

Additionally, we utilize the DCLM-BASELINE [38] dataset (footnote 2), which is derived from CommonCrawl, a web-crawled dataset [57]. The construction of DCLM-BASELINE involves several steps. First, resiliparse is employed to extract text from CommonCrawl. Second, deduplication is performed using MinHash [58] within a suffix array pipeline [59, 36] and near-duplicate Bloom filtering, which enhances the exact document and paragraph deduplication scheme [43]. Third, recent studies [60, 43, 61] demonstrate that utilizing learnable models as quality filters leads to downstream performance improvements. Consequently, DCLM-BASELINE applies a fastText classifier (OH-2.5 combined with ELI5) and uses its score to retain the top 10% of documents.
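The SlimPajama deduplication stage described above (lower-cased 13-gram signatures, MinHash, Jaccard threshold 0.8) can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the SlimPajama implementation: it assumes word-level n-grams, uses salted BLAKE2b hashes as the hash family, treats a similarity equal to the threshold as a duplicate, and replaces the LSH bucketing step with an all-pairs comparison that would not scale to a real corpus. All function names are hypothetical.

```python
import hashlib
import re


def shingles(text, n=13):
    """Set of lower-cased word n-grams after stripping punctuation (assumed word-level)."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    if len(words) < n:
        # Very short documents collapse to a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs, n=13, threshold=0.8, num_perm=64):
    """Keep the first document of every near-duplicate group (all-pairs, no LSH)."""
    kept, kept_signatures = [], []
    for doc in docs:
        shingle_set = shingles(doc, n)
        if not shingle_set:
            continue  # skip documents with no usable text
        signature = minhash_signature(shingle_set, num_perm)
        if any(estimated_jaccard(signature, other) >= threshold for other in kept_signatures):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        kept_signatures.append(signature)
    return kept
```

In a production pipeline, the all-pairs loop would be replaced by locality-sensitive hashing so that only documents falling into the same bucket are compared.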
3.2.2 Coding Data

Programming is crucial for LLMs to support various downstream tasks, such as code completion from natural language descriptions, documentation generation for individual functions, and auto-completion of code snippets. Furthermore, as code is generally better structured and organized than natural language, training on code data may improve the reasoning capabilities of LLMs [62]. Therefore, we use part of the-stack-dedup [63] dataset (footnote 3) during pretraining.

The Stack comprises more than 6TB of permissively-licensed source code files across 358 programming languages [63]. This carefully curated resource was designed to enhance the code generation capabilities of LLMs. It facilitates the synthesis of programs by code-generating AI systems from both natural language descriptions and existing code snippets.

To construct the Stack dataset, 220.92 million active GitHub repositories were collected from event archives published between 2015 and 2022 on GHArchive. Of these repositories, only 137.36 million were publicly accessible on GitHub, resulting in 51.76 billion downloaded files. After initial filtering, 5.28 billion unique files were identified, with an uncompressed size of 92.36 TB.

To ensure data quality, near-deduplication was implemented within the preprocessing pipeline in addition to exact deduplication. Specifically, MinHash with 256 permutations was computed for all documents, and Locality Sensitive Hashing was employed to identify clusters of duplicates. Within these clusters, Jaccard similarities were calculated to detect near-duplicates using a similarity threshold of 0.85. Approximately 40% of permissively licensed files were identified as (near-)duplicates and subsequently removed.

3.2.3 Capability Enhancement

LLMs are expected to demonstrate capabilities such as reasoning, mathematical problem-solving, and knowledge memorization.
However, a significant challenge is that, during pre-training, high-quality capability-related data is sparsely distributed across the corpus, which makes it difficult for models to become proficient at the above-mentioned capabilities. Previous research, such as work on Qwen [10], GLM-130B [64], and Nemotron-4 [65], has tried to incorporate instruction-based or high-quality data during the pre-training stage to enhance these abilities. In our study, we collect open-source data from HuggingFace, primarily utilizing the training datasets of various evaluation benchmarks such as MMLU [66] and HellaSwag [67]. These data are used experimentally to investigate the relationship between high-quality, capability-focused training data and model performance.

1 https://huggingface.co/datasets/cerebras/SlimPajama-627B
2 https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
3 https://huggingface.co/datasets/bigcode/the-stack-dedup

3.3 Training Configuration

The total number of tokens used for pre-training our Moxin-7B model is over 2T, and the pre-training process consists of three phases. In the first phase, we use pre-training corpora with a context length of 2k. In the second phase, we use pre-training corpora with a context length of 4k. In the third phase, we utilize the capability-specific enhancement data. We report model performance after only the first two phases and after all three phases to isolate the effect of the third phase.

We use Colossal-AI [68] as our training framework. Colossal-AI is a unified deep learning system that provides a broad set of acceleration techniques for the AI community. With its modular design, Colossal-AI allows for a free combination of these techniques to achieve the best training speedup. Colossal-AI’s optimized parallelism and heterogeneous training methods are employed to achieve superior system performance compared to baseline systems.
These methods are provided through user-friendly APIs, requiring minimal code modifications.

During training, AdamW [69] with β1 = 0.9, β2 = 0.95, ϵ = 1e−8, and weight decay = 0.1 is used to optimize the model. We use cosine learning rate decay, with the learning rate decaying to 10% of its maximum. The learning rate is set to 2e−6.

3.4 Alignment

Following the pre-training phase, we fine-tune the model into a helpful and harmless AI assistant. In our alignment stage, we mainly use supervised fine-tuning (SFT), during which we fine-tune the model to follow diverse human instructions using high-quality instruction data. We use the Tulu v2 dataset [70] for instruction tuning. The dataset consists of a mix of FLAN, Open Assistant 1, ShareGPT, GPT4-Alpaca, LIMA, and other sources.

3.5 Long-Context

To deal with the long-context problem, our model leverages grouped-query attention (GQA) [48], sliding window attention (SWA) [49], and a rolling buffer cache [6]. GQA reduces the memory requirement during decoding, allowing for larger batch sizes and hence higher throughput. In addition, SWA can handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. SWA exploits the stacked layers of a transformer to attend to information beyond the window size W. At the last layer, with SWA, using a window size of W = 4096, we have a theoretical attention span of approximately 14K tokens or above.

Our model adopts a rolling buffer cache, which limits the cache size by using a fixed attention span. The cache has a fixed size of W, and the keys and values for timestep i are stored in position i mod W of the cache. As a result, when the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing. On a sequence length of 32k tokens, this reduces the cache memory usage by 8×, without impacting the model quality.
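The rolling buffer cache described above can be illustrated with a minimal sketch. It stores plain Python values where a real implementation would store key and value tensors, and the class and method names are hypothetical rather than taken from the Moxin or Mistral code.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for timestep i lives in slot i mod W."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.next_position = 0  # number of timesteps written so far

    def append(self, key_value):
        # Once next_position exceeds the window, this overwrites the oldest entry.
        self.slots[self.next_position % self.window] = key_value
        self.next_position += 1

    def contents(self):
        """Cached entries in chronological order, oldest first."""
        if self.next_position <= self.window:
            return self.slots[:self.next_position]
        start = self.next_position % self.window
        return self.slots[start:] + self.slots[:start]


# Toy example with W = 4: after six timesteps only the last four remain.
cache = RollingBufferCache(window=4)
for timestep in range(6):
    cache.append(timestep)
print(cache.contents())

# Memory ratio quoted in the text: a 32k-token sequence with W = 4096.
memory_reduction = 32768 // 4096
print(memory_reduction)
```

The cache never holds more than W entries, so for a 32k-token sequence with W = 4096 it is one eighth the size of an unbounded cache, matching the 8× figure in the text.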
With the above techniques, our model can support a 32K context length with fast inference and low memory cost.

4 Evaluation

We conducted comprehensive performance comparisons against leading language models of comparable scale, including Mistral-7B [6], LLaMA 2-7B [51], Gemma-7B [41], and Qwen v2-7B [11]. These models were selected based on their demonstrated excellence within the 7B or 8B category and represent diverse development approaches from various research organizations worldwide. To ensure a robust evaluation, we re-run all benchmarks with the same evaluation pipeline for fair comparisons. Specifically, we use lm-evaluation-harness [71] and OpenCompass [72] for evaluation.

lm-evaluation-harness provides a unified framework to test generative language models on a large number of different evaluation tasks. It supports over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented. This framework is versatile as it extends to models implemented through various architectures, including transformers (including quantization via AutoGPTQ [73]), GPT-NeoX [74], and Megatron-DeepSpeed [75], all unified through a flexible, tokenization-agnostic interface. The framework is reliable, as evidenced by its role as the backend for HuggingFace’s popular Open LLM Leaderboard and its use by dozens of organizations, including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.

To complement it, we also employ OpenCompass. This framework performs an in-depth and holistic assessment of large language models structured around eight fundamental dimensions of language model capabilities: language comprehension, knowledge precision, logical deduction, creative ideation, mathematical problem-solving, programming proficiency, extended text analysis, and intelligent agent engagement.

4.1 Evaluation Tasks

We evaluate the model performance on various tasks below.
• AI2 Reasoning Challenge (ARC) [76] - a set of genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set (ARC-C) and an Easy Set (ARC-E), where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
• HellaSwag [67] - a test of commonsense natural language inference, which is easy for humans (about 95%) but challenging for SOTA models. It consists of 70,000 multiple-choice questions. Each question presents a scenario followed by four possible outcomes, asking the model to select the most reasonable conclusion.
• MMLU [77] - a test to measure a text model’s multitask accuracy. The test covers 57 tasks, including elementary mathematics, US history, computer science, law, etc.
• Winogrande [78] - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning. It contains 44,000 multiple-choice questions with two options each. It requires the model to choose the appropriate entity word for the pronoun in the descriptive text based on the scenario.
• PIQA [79] - the task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA). Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language. PIQA focuses on everyday situations with a preference for atypical solutions.

4.2 Evaluation Results

We refer to the initial model as Moxin-7B-original; it is the foundation model before any fine-tuning on the training data of the evaluation datasets. After subsequent partial fine-tuning of Moxin-7B-original on the training data of the evaluation datasets, we obtained Moxin-7B-finetuned, enabling direct assessment of how targeted fine-tuning affects model performance.
4.2.1 Zero-Shot Evaluation

We report the results of base models for zero-shot evaluation in Table 2. The tasks are listed below.

• AI2 Reasoning Challenge (0-shot)
• AI2 Reasoning Easy (0-shot)
• HellaSwag (0-shot)
• PIQA (0-shot)
• Winogrande (0-shot)

After training with the training data of evaluation tasks, our Moxin-7B-finetuned can achieve superior performance compared with state-of-the-art (SOTA) baselines. This significant increase from the base model demonstrates the effectiveness of our fine-tuning approach. The improved performance is particularly notable on complex reasoning tasks like PIQA, where the score increased from 78.07% to 82.24%, matching or exceeding several leading models. Consequently, our models emerge as an excellent candidate for real-world applications.

Table 2: Performance comparison for various models in zero-shot evaluation.

Models               HellaSwag  WinoGrande  PIQA   ARC-E  ARC-C  Ave
Mistral-7B           80.39      73.4        82.15  78.28  52.22  73.29
LLaMA 2-7B           75.99      69.06       79.11  74.54  46.42  69.02
LLaMA 2-13B          79.37      72.22       80.52  77.4   49.06  71.71
LLaMA 3.1-8B         78.92      74.19       81.12  81.06  53.67  73.79
gemma-7b             80.45      73.72       80.9   79.97  54.1   73.83
Qwen v2-7B           78.9       72.38       79.98  74.71  50.09  71.21
internlm2.5-7b       79.14      77.9        80.52  76.16  51.37  73.02
Baichuan2-7B         72.25      67.17       77.26  72.98  42.15  66.36
Yi-1.5-9B            77.86      73.01       80.74  79.04  55.03  73.14
deepseek-7B          76.13      69.77       79.76  71.04  44.8   68.3
Moxin-7B-original    72.06      66.31       78.07  71.47  48.15  67.21
Moxin-7B-finetuned   80.03      75.17       82.24  81.12  58.64  75.44

4.2.2 Few-Shot Evaluation

Table 3 presents our few-shot evaluation results across multiple benchmark tasks. The tasks and their few-shot settings are listed below. Thanks to its rigorous and high-quality training corpus, our model demonstrates a competitive edge in tasks that involve language understanding and knowledge application. Our Moxin-7B-original outperforms LLaMA 2-7B on average in this scenario.
After training with the training data of evaluation tasks, our Moxin-7B-finetuned can achieve competitive performance compared with SOTA baselines. Consequently, our models emerge as an excellent choice for a multitude of real-world applications where the reliance on robust language comprehension and extensive knowledge is paramount.

• AI2 Reasoning Challenge (25-shot)
• HellaSwag (10-shot)
• MMLU (5-shot)
• Winogrande (5-shot)

Table 3: Performance comparison for various models in few-shot evaluation.

Model                ARC-C  HellaSwag  MMLU   WinoGrande  Ave
Mistral-7B           57.59  83.25      62.42  78.77       70.51
LLaMA 3.1-8B         54.61  81.95      65.16  77.35       69.77
LLaMA 3-8B           55.46  82.09      65.29  77.82       70.17
LLaMA 2-7B           49.74  78.94      45.89  74.27       62.21
Qwen 2-7B            57.68  80.76      70.42  77.43       71.57
gemma-7B             56.48  82.31      63.02  78.3        70.03
internlm2.5-7B       54.78  79.7       68.17  80.9        70.89
Baichuan2-7B         47.87  73.89      54.13  70.8        61.67
Yi-1.5-9B            58.36  80.36      69.54  77.53       71.48
Moxin-7B-original    53.75  75.46      59.43  70.32       64.74
Moxin-7B-finetuned   59.47  83.08      60.97  78.69       70.55

4.3 Alignment Evaluation

We evaluate the alignment performance on MTBench [80]. It is a two-round conversation dataset with 80 questions. It covers eight dimensions (reasoning, roleplay, math, coding, writing, humanities, STEM, and information extraction) with 10 questions for each dimension. The model needs to answer the first question and then refine its previous response following additional specific instructions. We use GPT-4 as a judge model to provide scores (between 1 and 10) for the quality of responses. Our Moxin-7B-chat outperforms the 7B baselines (Llama 2 Chat 7B and Vicuna 7B) on MTBench, while the 13B Llama 2 Chat and Vicuna models score slightly higher, as shown in Table 4.

Table 4: Performance for various chat models.

Model              MTbench
Moxin Chat 7B      6.42
Llama 2 Chat 13B   6.65
Vicuna 13B         6.57
Llama 2 Chat 7B    6.27
Vicuna 7B          6.17
Alpaca 13B         4.53

4.4 Generation Examples

We demonstrate some text generation examples from our Moxin Chat 7B and Llama2 Chat 7B in Tables 5 and 6.
Example 1

Moxin Chat 7B

Can you explain the concept of regularization in machine learning?

Regularization is a technique used in machine learning to prevent overfitting by introducing a penalty term in the loss function. The idea behind regularization is to encourage simpler models that are easier to interpret and generalize better to new data. There are two main types of regularization:

1. L1 regularization: also known as Lasso regularization, this method adds a penalty term to the loss function that is proportional to the absolute value of the coefficients in the model. This encourages sparsity, meaning that some coefficients may be exactly zero, which simplifies the model and makes it easier to interpret.
2. L2 regularization: also known as Ridge regularization, this method adds a penalty term to the loss function that is proportional to the square of the coefficients in the model. This encourages shrinkage, meaning that the coefficients are pulled closer to zero, which simplifies the model and reduces the variance of the model estimates.

Regularization can be used in various machine learning models, including linear regression, logistic regression, and neural networks. It is often used in conjunction with other techniques such as cross-validation to find the optimal hyperparameters that balance the bias-variance tradeoff and prevent overfitting.

Llama2 Chat 7B

Can you explain the concept of regularization in machine learning?

Regularization is a technique used in machine learning to prevent overfitting. It involves adding a penalty term to the loss function to discourage large values of the model’s weights. There are several types of regularization, including:

1. L1 regularization (Lasso): This adds a penalty term to the loss function that is proportional to the absolute value of the model’s weights. This encourages the model to have smaller weights, which can help prevent overfitting.
2. L2 regularization (Ridge): This adds a penalty term to the loss function that is proportional to the square of the model’s weights. This also encourages the model to have smaller weights, but it can lead to some weights being set to zero, which can result in a simpler model.
3. Dropout regularization: This is a type of regularization that is applied during training. It involves randomly setting a fraction of the model’s neurons to zero during each training iteration. This helps to prevent overfitting by making the model less reliant on any single neuron.
4. Early stopping: This is a type of regularization that is applied during training. It involves monitoring the validation loss during training and stopping the training process when the validation loss stops improving. This helps to prevent overfitting by stopping the training process before the model has a chance to fit the noise in the training data.

Regularization can be applied to different layers of a neural network, such as the input layer, hidden layers, or output layer. It can also be applied to different parts of a model, such as the weights, biases, or activation functions. The main idea behind regularization is to add a penalty term to the loss function that discourages large values of the model’s weights. This helps to prevent overfitting by encouraging the model to find a simpler solution that generalizes better to new data. Regularization can be used in conjunction with other techniques, such as cross-validation and grid search, to improve the performance of a machine learning model.

In summary, regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function. There are several types of regularization, including L1, L2, dropout, and early stopping, and it can be applied to different layers and parts of a model.
Regularization helps to find a simpler solution that generalizes better to new data and can be used in conjunction with other techniques to improve the performance of a machine learning model. Table 5: Generation Examples. 8 Page 9: Example 2 Moxin Chat 7BCan you explain the concept of normalization in machine learning? Normalization is a technique used to scale the data before using it for training a machine learning model. It is used to make the data more uniform, and to prevent it from being skewed by large or small values. There are several ways to normalize data, including min-max scaling, z-score normalization, and logarithmic scaling. In min-max scaling, the values are scaled to a range between 0 and 1, by dividing them by the maximum value in the dataset. In z-score normalization, the values are scaled to have a mean of 0 and a standard deviation of 1, by subtracting the mean from each value and dividing by the standard deviation. In logarithmic scaling, the values are transformed using the logarithm function, which can be useful for normalizing data with a wide range of values. Normalization is an important step in preparing data for machine learning, as it can help to improve the performance of the model by making the data more uniform and easier to analyze. It can also help to prevent overfitting and improve generalization. Llama2 Chat 7BCan you explain the concept of normalization in machine learning? Normalization is a technique used in machine learning to transform input data into a range of values that are suitable for use in a model. The goal of normalization is to ensure that the data is on the same scale, so that the model can learn meaningful relationships between the features. There are several types of normalization techniques, including: 1. Min-max normalization: This involves scaling the data so that it falls within a specific range, usually between 0 and 1. 2. 
Standardization: This involves subtracting the mean and dividing by the standard deviation of the data, so that the data has a mean of 0 and a standard deviation of 1. 3. Z-score normalization: This involves subtracting the mean and dividing by the standard deviation of the data, so that the data has a mean of 0 and a standard deviation of 1. 4. Log transformation: This involves taking the logarithm of the data, which can help to reduce the effect of extreme values in the data. 5. Square root transformation: This involves taking the square root of the data, which can help to reduce the effect of outliers in the data. Normalization is an important preprocessing step in machine learning, as it can help to improve the performance of the model by reducing the effect of irrelevant features and improving the interpretability of the results. Here are some key points to keep in mind when normalizing data for machine learning: 1. Normalization can improve the performance of the model by reducing the effect of irrelevant features and improving the interpretability of the results. 2. There are several types of normalization techniques, including min-max normalization, standardization, z-score normalization, log transformation, and square root transformation. 3. The choice of normalization technique will depend on the nature of the data and the specific requirements of the model. 4. Normalization can be applied to both continuous and categorical data. 5. It is important to normalize the data before training the model, as normalization can improve the convergence of the model and reduce the risk of overfitting. 6. Normalization can also be used to improve the interpretability of the results by reducing the effect of irrelevant features and improving the scalability of the model. 7. It is important to carefully evaluate the performance of the normalized data and consider the trade-offs between different normalization techniques. 8. 
Normalization can be applied to different features of the data, such as the input features, the target variable, or the output features. 9. Normalization can be used in combination with other preprocessing techniques, such as feature scaling, feature selection, and data augmentation. 10. Normalization is a powerful tool for improving the performance of machine learning models, but it is important to carefully evaluate the performance of the normalized data and consider the trade-offs between different normalization techniques. Table 6: Generation Examples. 5 Conclusion The field of Large Language Models has witnessed a significant shift toward open-source develop- ment, fostering innovation within the AI community. However, a critical challenge emerges: many purportedly open-source models withhold essential components necessary for full understanding and reproducibility, creating barriers that limit both academic advancement and commercial adoption. This does not not only hamper scientific progress but also prevent businesses from fully leveraging these models for innovative applications, ultimately diminishing potential societal benefits and eco- nomic value creation. To address these limitations, we introduce Moxin 7B, a fully open-source language model developed in accordance with the Model Openness Framework (MOF), providing comprehensive access to pre-training code, configurations, training and fine-tuning datasets, and all intermediate checkpoints. Our evaluation results demonstrate that the Moxin 7B achieves superior zero-shot evaluation results compared to popular 7B models while maintaining competitive few-shot capabilities. We wish to see more work that establishes new standard for reproducible research in language model development, fostering a more inclusive and economically vibrant AI ecosystem. 

---