Authors: Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Xingchen Xu, Yu Huang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
Fully Open Source Moxin-LLM Technical Report
Pu Zhao1, Xuan Shen1, Zhenglun Kong2, Yixin Shen3, Sung-En Chang1,
Timothy Rupprecht1, Lei Lu1, Enfu Nan1, Changdi Yang1, Yumei He4,
Xingchen Xu5, Yu Huang6, Wei Wang7, Yue Chen7, Yong He7, Yanzhi Wang1,8
1Northeastern University,2Harvard University,
3Cornell University,4Tulane University,5University of Washington,
6Roboraction.ai,7Futurewei Technologies,8AIBAO LLC
Abstract
Recently, Large Language Models (LLMs) have undergone a significant transfor-
mation, marked by a rapid rise in both their popularity and capabilities. Leading
this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured
widespread attention in the AI community due to their remarkable performance
and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral,
have made great contributions to the ever-increasing popularity of LLMs because they
are easy to customize and deploy across diverse applications. Although
open-source LLMs present unprecedented opportunities for innovation and
research, the commercialization of LLMs has raised concerns about transparency,
reproducibility, and safety. Many open-source LLMs fail to meet fundamental
transparency requirements by withholding essential components like training code
and data, and some use restrictive licenses whilst claiming to be “open-source,”
which may hinder further innovations on LLMs. To mitigate this issue, we
introduce Moxin 7B, a fully open-source LLM developed in accordance with the
Model Openness Framework (MOF), a ranked classification system that evaluates
AI models based on model completeness and openness, adhering to principles of
open science, open source, open data, and open access. Our model achieves the
highest MOF classification level of “open science” through the comprehensive
release of pre-training code and configurations, training and fine-tuning datasets,
and intermediate and final checkpoints. Experiments show that our model achieves
superior performance in zero-shot evaluation compared with popular 7B models
and performs competitively in few-shot evaluation.
Homepage: https://github.com/moxin-org/Moxin-LLM
Base model: https://huggingface.co/moxin-org/moxin-llm-7b
Chat model: https://huggingface.co/moxin-org/moxin-chat-7b
1 Introduction
The field of natural language processing has witnessed the most exciting discoveries of the last ten
years with the emergence of large language models (LLMs). At the forefront of this evolution are
LLMs such as GPT-4 [1], Claude [2], and Gemini [3], which have captured the attention of the AI
community due to their performance and versatility. Meanwhile, the recent emergence of openly
accessible yet highly capable LLMs such as LLaMA [4], Falcon [5], and Mistral [6] allows researchers
and practitioners to easily obtain, customize, and deploy LLMs in more varied environments and for
more diverse use cases. These trends have people eagerly asking what comes next, and some
suggest that “a general intelligence” is right around the corner.
arXiv:2412.06845v2 [cs.CL] 11 Dec 2024
Despite the growing influence and accessibility of open-source LLMs, a notable challenge has emerged:
many model producers restrict visibility and access to their training, fine-tuning, and evaluation
processes, including crucial components such as their training code and data [7]. Some model
producers even use restrictive licenses whilst claiming to be “open-source.” This practice creates
barriers for the broader AI research community to study, replicate, and innovate upon advanced LLMs.
In parallel, it prevents businesses from fully leveraging open-source models for innovative industrial
applications, as the commercialization of these models has raised concerns about transparency,
reproducibility, and safety.
To unlock the full potential of LLMs and open innovation, we must democratize this research
by putting models into the hands of more researchers and making the datasets they are trained on
fully open-source. This requires moving beyond the simple sharing of model weights to embrace
complete transparency in training, datasets, and implementation details, which is crucial for fostering
a more inclusive and collaborative research environment that can sustain a healthy open-source
ecosystem [8].
To achieve this goal, we introduce Moxin 7B, a fully open-source LLM developed by complying with
the Model Openness Framework (MOF) introduced by [ 9]. The MOF provides a systematic ranking
classification system to rate AI models based on their completeness and openness, incorporating the
principles of open science, open source, open data, and open access. By promoting transparency and
reproducibility, the MOF serves as a crucial tool to combat “openwashing” practices and establishes
completeness and openness as primary criteria alongside the core tenets of responsible AI. Wide
adoption of the MOF will cultivate a more open AI ecosystem, benefiting research, innovation, and
adoption of state-of-the-art models.
We have released the pre-training code and configurations, training and fine-tuning
data, and intermediate and final checkpoints, as part of a continuing commitment to fully
open-source LLMs. Our model achieves the highest MOF classification level of “open science.”
It is noteworthy that this commitment to openness has not compromised performance: our base
model achieves superior performance in zero-shot evaluation compared with popular 7B models
and performs competitively in few-shot evaluation. Remarkably, our chat model can outperform 7B
baselines like Llama2-7B-chat. Our homepage is https://github.com/moxin-org/Moxin-LLM .
2 Related Work
2.1 Models, Tokenizers, and Training
Models. State-of-the-art large language models (LLMs) typically comprise a substantial number
of parameters, often approaching or exceeding 100 billion [ 4,1,3]. To facilitate broader acces-
sibility, smaller models with fewer than 20 billion parameters, and even those around 7 billion
parameters, have been developed [ 10,11,4,6,12,13]. In addition, efficiency-enhancing techniques,
such as implementing MAMBA-based architectures in Jamba, have been employed to optimize
performance [12, 13].
Tokenizers. Tokenizers are essential to convert raw data into a suitable format for model processing.
Many contemporary models employ Byte-Pair Encoding (BPE)[ 14], with OpenAI’s tiktoken
tokenizer[ 15] being a notable implementation. However, for languages that handle tokens differently
from Romance languages, alternatives such as SentencePiece [ 16] are utilized, as seen in XLNet [ 17].
Hugging Face offers an excellent summary of state-of-the-art tokenizers with practical examples [ 18].
Moreover, tokenization extends beyond text modalities; many foundational models now include
multimodal capabilities, processing documents, audio, images, and even videos [19, 20, 21, 22].
Training. To enhance the performance of smaller models beyond their inherent limitations, various
training strategies can be employed. A notable example is the application of Mixture of Experts
(MoE) training, which has achieved significant success in models like Mixtral [23].
2.2 Data curation methods
Researchers commonly collect large datasets for training language models (LMs)[ 24] by performing
web crawls. However, these datasets often contain undesirable content, necessitating data curation to
improve their quality. To enhance model performance[ 25,26,24,27], several data curation techniques
are widely employed. These include filtering by language [ 28,29,30], heuristic-based filtering [ 25,
31, 32], quality filtering [33, 34, 35], data deduplication [36, 37], and data mixing [38, 39, 40].
2.3 Open-source datasets
As the scale of LMs has increased in recent years [ 4,41,42,1], the community has correspondingly
curated larger datasets to support their training. Early datasets include the C4 dataset, containing
160 billion tokens, and The Pile [ 32], which comprises 300 billion tokens. More recently, even
larger datasets have been introduced: RefinedWeb [ 25] with 600 billion tokens, Dolma [ 43] with 3
trillion tokens, FineWeb [ 44] with 15 trillion tokens, and RedPajama-v2 [ 45] containing 30 trillion
tokens. In addition to these general-purpose datasets, large domain-specific datasets have also been
developed. For instance, StackV2 [ 46], a code-focused dataset, includes 900 billion tokens, and
FineWeb-Edu [44], a high-quality filtered educational text dataset, contains 1.3 trillion tokens.
3 Model Training
3.1 Model Architecture
We opt to extend the Mistral model architecture [ 6] due to its ability to achieve high performance
while maintaining efficient inference speeds. The original Mistral 7B model demonstrates superior
performance compared to multiple 7B language models and even outperforms larger models on
various evaluation benchmarks. Notably, it surpasses the LLaMA 34B model [ 47] in tasks such as
mathematics and code generation.
Table 1: Parameter setting.
Parameter    Value
n_layers     36
dim          4096
head_dim     128
hidden_dim   14336
n_heads      32
n_kv_heads   8
The original Mistral model leverages grouped-query attention (GQA) [48] and sliding window
attention (SWA) [49]. GQA reduces memory requirements during decoding, allowing for larger batch
sizes and higher throughput, and it significantly accelerates inference speed, an essential factor
in real-time applications. Meanwhile, SWA effectively handles long sequences without incurring
substantial computational overhead. By incorporating these techniques, the model achieves
significant improvements in performance and efficiency, which we have adopted in our extended model.
Building upon the original Mistral model, which consists of 32 blocks,
we have extended the architecture to 36 blocks. Furthermore, we also employ GQA to partition the
query heads into multiple groups, each sharing a single key head and value head. This approach
interpolates between multi-query attention (MQA) and multi-head attention (MHA) in large language
models, striking a balance between the computational speed of MQA and the representational quality
of MHA, thereby providing a favorable trade-off. Additionally, our model incorporates a rolling
buffer cache with a fixed attention span, effectively limiting cache size and preventing excessive
memory usage when processing long sequences.
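As a rough illustration of the grouping described above, the following sketch maps each query head to its shared key/value head using the values in Table 1, and compares the number of cached key/value elements per token with and without GQA. Assigning contiguous query heads to the same key/value head is an assumption made for illustration, and the names `AttentionConfig`, `kv_head_for_query_head`, and `kv_cache_elements_per_token` are hypothetical rather than taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    # Values taken from Table 1.
    n_layers: int = 36
    head_dim: int = 128
    n_heads: int = 32     # query heads
    n_kv_heads: int = 8   # shared key/value heads


def kv_head_for_query_head(q_head: int, cfg: AttentionConfig) -> int:
    """Return the key/value head read by a query head (contiguous grouping assumed)."""
    group_size = cfg.n_heads // cfg.n_kv_heads
    return q_head // group_size


def kv_cache_elements_per_token(cfg: AttentionConfig, n_kv_heads: int) -> int:
    """Count cached key and value elements per token, summed over all layers."""
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim


cfg = AttentionConfig()
mapping = [kv_head_for_query_head(q, cfg) for q in range(cfg.n_heads)]
gqa_elements = kv_cache_elements_per_token(cfg, cfg.n_kv_heads)
mha_elements = kv_cache_elements_per_token(cfg, cfg.n_heads)

print(mapping[:8])                   # first two groups of four query heads
print(mha_elements // gqa_elements)  # cache reduction factor relative to MHA
```

With 32 query heads and 8 key/value heads, each group contains four query heads, so under these assumptions the key/value cache is four times smaller than it would be with full multi-head attention.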
3.2 Training Data
Data are fundamental to the pre-training of LLMs. Preparing such training data requires careful con-
sideration of multiple challenges, including handling sensitive information, ensuring comprehensive
knowledge coverage, and achieving higher efficiency with improved data quality.
In this section, we detail the processes of preparing textual data from general domains and coding
data related to programming languages.
3.2.1 Text Data
We use a mix of data from SlimPajama [50] and DCLM-BASELINE [38] as our text training data.
During the training of LLaMA, it was demonstrated that the performance of a 7B model continues to
improve even after being trained on more than 1T tokens [ 51]. Given the outstanding performance of
LLaMA, its data collection methodology was rapidly replicated, leading to the release of RedPajama,
an open-source dataset containing 1.2 trillion tokens [52].
However, subsequent analyses reveal a significant limitation: some corpora within RedPajama contain
a large percentage of duplicate content. The deduplication guidelines in RedPajama operate only
within individual data sources, leaving inter-source duplicates largely unaddressed. To improve data
quality and training efficiency, SlimPajama¹ was developed as a refined iteration of RedPajama,
offering a cleaned and extensively deduplicated version [50].
SlimPajama implements a rigorous two-stage preprocessing pipeline to enhance data quality. In the
first stage, short and low-quality documents are removed from RedPajama. Specifically, documents
that have fewer than 200 characters after removing punctuation, space symbols, newlines, and tabs
are filtered out, as these documents typically contain only metadata and lack useful information. As a
result of this step, 1.86% of RedPajama documents are eliminated.
The second step involves removing duplicate data, as deduplication enhances training efficiency
and reduces memorization, thereby decreasing the likelihood of generating text solely by recalling
training data [ 25,53,54,36,55]. To perform deduplication, document signatures are created using
pre-processed, lower-cased 13-grams. Subsequently, MinHashLSH [ 56] is employed to identify and
eliminate duplicates based on a Jaccard similarity threshold of 0.8. Deduplication is performed both
within and across data sources. Overall, by pruning 49.6% of the bytes from the RedPajama dataset,
the 627B-token SlimPajama dataset is obtained.
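The two stages above can be sketched in a few lines. This is a minimal illustration, not the SlimPajama implementation: it computes the exact Jaccard similarity between two documents instead of using MinHashLSH, and it assumes character-level 13-grams over lower-cased text with punctuation removed and whitespace collapsed, since the text above does not specify the n-gram unit. All function names are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)
WHITESPACE = set(string.whitespace)


def is_too_short(text: str, min_chars: int = 200) -> bool:
    """Stage 1: flag documents with fewer than `min_chars` characters once
    punctuation, spaces, newlines, and tabs are removed."""
    kept = [ch for ch in text if ch not in PUNCTUATION and ch not in WHITESPACE]
    return len(kept) < min_chars


def ngram_signature(text: str, n: int = 13) -> set:
    """Build the set of lower-cased character n-grams (assumed preprocessing)."""
    cleaned = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Stage 2: treat two documents as duplicates at or above the threshold."""
    return jaccard(ngram_signature(doc_a), ngram_signature(doc_b)) >= threshold
```

Exact pairwise comparison is quadratic in the number of documents, which is why a locality-sensitive scheme such as MinHashLSH is used at corpus scale.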
Additionally, we utilize the DCLM-BASELINE [38] dataset², which is derived from CommonCrawl,
a web-crawled dataset [57]. The construction of DCLM-BASELINE involves several steps. First,
resiliparse is employed to extract text from CommonCrawl. Second, deduplication is performed
using MinHash [58] within a suffix array pipeline [59, 36] and near-duplicate Bloom filtering,
which enhances the exact document and paragraph deduplication scheme [43]. Third, recent studies
[60, 43, 61] demonstrate that utilizing learnable models as quality filters leads to downstream
performance improvements. Consequently, DCLM-BASELINE scores documents with a fastText
classifier (OH-2.5 combined with ELI5) and retains the top 10% of documents.
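The final selection step amounts to ranking documents by classifier score and keeping the highest-scoring tenth. The sketch below illustrates only that step, under the assumption that each document already has a quality score; the scores, document identifiers, and the function name `keep_top_percent` are hypothetical, and the classifier itself is not reproduced here.

```python
def keep_top_percent(scored_docs, percent=10):
    """Keep the highest-scoring `percent` of (doc_id, score) pairs."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point surprises in the cut-off count.
    n_keep = len(scored_docs) * percent // 100
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ranked[:n_keep]


# Hypothetical scores for 20 documents; higher means higher estimated quality.
docs = [(f"doc{i}", i / 20) for i in range(20)]
kept = keep_top_percent(docs, percent=10)
print([doc_id for doc_id, _ in kept])
```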
3.2.2 Coding Data
Programming is crucial for LLMs to support various downstream tasks, such as code completion
from natural language descriptions, documentation generation for individual functions, and auto-
completion of code snippets. Furthermore, as code is generally better structured and organized than
natural language, training on code data may improve an LLM’s reasoning capabilities [62]. Therefore,
we use part of the the-stack-dedup [63] dataset³ during pre-training.
The Stack comprises more than 6TB of permissively-licensed source code files across 358 program-
ming languages [ 63]. This carefully curated resource was designed to enhance the code generation
capabilities of LLMs. It facilitates the synthesis of programs by code-generating AI systems from
both natural language descriptions and existing code snippets.
To construct the Stack dataset, 220.92 million active GitHub repositories were collected from event
archives published between 2015 and 2022 on GHArchive. Of these repositories, only 137.36 million
were publicly accessible on GitHub, resulting in 51.76 billion downloaded files. After initial filtering,
5.28 billion unique files were identified, with an uncompressed size of 92.36 TB.
To ensure data quality, near-deduplication was implemented within the preprocessing pipeline in
addition to exact deduplication. Specifically, MinHash with 256 permutations was computed for
all documents, and Locality Sensitive Hashing was employed to identify clusters of duplicates.
Within these clusters, Jaccard similarities were calculated to detect near-duplicates using a similarity
threshold of 0.85. Approximately 40% of permissively licensed files were identified as
(near-)duplicates and subsequently removed.
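To make the MinHash step concrete, the following pure-Python sketch builds a 256-value signature per document and estimates Jaccard similarity as the fraction of matching signature positions. It is a simplified illustration, not the pipeline used to build The Stack: the affine hash family, the whitespace tokenization in the usage lines, and all function names are assumptions, and the Locality Sensitive Hashing step that groups candidate pairs is omitted.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1


def make_permutations(num_perm: int = 256, seed: int = 0):
    """Draw (a, b) coefficients for `num_perm` affine hash functions."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def base_hash(token: str) -> int:
    """Map a token to a 64-bit integer using SHA-1."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, perms):
    """Return the minimum hashed value per hash function (tokens must be non-empty)."""
    hashes = [base_hash(t) for t in set(tokens)]
    return [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in perms]


def estimated_jaccard(sig_a, sig_b) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


perms = make_permutations(num_perm=256)
file_a = "def add(a, b):\n    return a + b".split()
file_b = "def add(a, b):\n    return a + b  # sum".split()
similarity = estimated_jaccard(
    minhash_signature(file_a, perms), minhash_signature(file_b, perms)
)
print(similarity >= 0.85)  # compare against the near-duplicate threshold
```

More hash functions reduce the variance of the estimate at the cost of longer signatures; 256 is the value reported above.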
3.2.3 Capability Enhancement
LLMs are expected to demonstrate capabilities such as reasoning, mathematical problem-solving,
and knowledge memorization. However, a significant challenge is that, during pre-training,
high-quality capability-related data is sparsely distributed across the corpus, which makes it
difficult for models to become proficient at these capabilities. Previous research, such
as work on Qwen [10], GLM-130B [64], and Nemotron-4 [65], has tried to incorporate
instruction-based or high-quality data during the pre-training stage to enhance these abilities.
In our study, we collect open-source data from HuggingFace, primarily utilizing the training
datasets of various evaluation benchmarks such as MMLU [66] and HellaSwag [67]. These data are
used experimentally to investigate the relationship between high-quality, capability-focused
training data and model performance.
¹ https://huggingface.co/datasets/cerebras/SlimPajama-627B
² https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
³ https://huggingface.co/datasets/bigcode/the-stack-dedup
3.3 Training Configuration
The total number of tokens used for pre-training our Moxin-7B model is over 2T, and the pre-training
process consists of three phases. In the first phase, we use pre-training corpora with a context length
of 2k. In the second phase, we use pre-training corpora with a context length of 4k. In the third
phase, we utilize the capability-specific enhancement data. We report model performance after
only the first two phases and after all three phases to assess the contribution of the third phase.
We use Colossal-AI [68] as our training framework. Colossal-AI is a unified deep learning system
that provides a broad set of acceleration techniques for the AI community. With its modular design,
Colossal-AI allows these techniques to be freely combined to achieve the best training speedup.
Its optimized parallelism and heterogeneous training methods are employed to achieve
superior system performance compared to baseline systems. These methods are provided through
user-friendly APIs, requiring minimal code modifications.
During training, AdamW [69] with β1 = 0.9, β2 = 0.95, ϵ = 1e−8, and weight decay = 0.1 is used
to optimize the model. We use cosine learning rate decay, with the learning rate decaying to 10% of
its maximum. The learning rate is set to 2e−6.
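The schedule described above can be written as a short function. This sketch assumes that 2e−6 is the peak learning rate and that there is no warmup phase, since neither point is stated explicitly; `total_steps` is a hypothetical value, and `ADAMW_HYPERPARAMS` simply collects the reported optimizer settings in one place.

```python
import math

# Reported optimizer settings, gathered here for reference.
ADAMW_HYPERPARAMS = {"betas": (0.9, 0.95), "eps": 1e-8, "weight_decay": 0.1}

MAX_LR = 2e-6        # assumed to be the peak learning rate
MIN_LR_RATIO = 0.10  # decay floor: 10% of the maximum


def cosine_lr(step: int, total_steps: int,
              max_lr: float = MAX_LR, min_ratio: float = MIN_LR_RATIO) -> float:
    """Cosine decay from `max_lr` at step 0 to `max_lr * min_ratio` at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    min_lr = max_lr * min_ratio
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


total_steps = 1000  # hypothetical schedule length
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps))
```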
3.4 Alignment
Following the pre-training phase, we fine-tune the model into a helpful and harmless AI assistant.
In our alignment stage, we mainly use supervised fine-tuning (SFT), during which we fine-tune
the model to follow diverse human instructions using high-quality instruction data. We use the Tulu
v2 dataset [70] for instruction tuning. The dataset consists of a mix of FLAN, Open Assistant 1,
ShareGPT, GPT4-Alpaca, LIMA, and other sources.
3.5 Long-Context
To deal with the long-context problem, our model leverages grouped-query attention (GQA) [48],
sliding window attention (SWA) [49], and Rolling Buffer Cache [6]. GQA reduces the memory
requirement during decoding, allowing for larger batch sizes and hence higher throughput.
Besides, SWA can handle longer sequences more effectively at a reduced computational cost, thereby
alleviating a common limitation in LLMs. SWA exploits the stacked layers of a transformer to
attend to information beyond the window size W. At the last layer, with SWA, using a window size of
W = 4096, we have a theoretical attention span of approximately 14K tokens or above.
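Within a single layer, sliding window attention can be pictured as a banded causal mask. The sketch below builds such a mask for a toy sequence; it assumes the window includes the current token, so each position sees at most W positions, which is a boundary convention chosen for illustration. The function name `sliding_window_mask` is hypothetical.

```python
def sliding_window_mask(seq_len: int, window: int):
    """Return mask[i][j], True when query position i may attend to key position j.

    Position i sees itself and the (window - 1) positions before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each layer only looks back a bounded distance, information from earlier tokens reaches later positions by passing through several stacked layers rather than through one direct attention step.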
Our model adopts a Rolling Buffer Cache, which limits the cache size by using a fixed attention
span. The cache has a fixed size of W, and the keys and values for timestep i are stored in
position i mod W of the cache. As a result, when the position i is larger than W, past
values in the cache are overwritten, and the size of the cache stops increasing. On a sequence length
of 32k tokens, this reduces the cache memory usage by 8×, without impacting the model quality.
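The indexing rule above is small enough to show directly. The following sketch stores placeholder keys and values at position i mod W and lets newer entries overwrite older ones; the class name `RollingBufferCache`, its methods, and the string placeholders are hypothetical and stand in for per-layer key and value tensors.

```python
class RollingBufferCache:
    """Fixed-size key/value cache in which slot (i mod W) holds timestep i."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # each slot holds (position, key, value)

    def store(self, position: int, key, value) -> None:
        """Write timestep `position`, overwriting whatever occupied its slot."""
        self.slots[position % self.window] = (position, key, value)

    def cached_positions(self):
        """Return the timesteps currently held in the cache, oldest first."""
        return sorted(slot[0] for slot in self.slots if slot is not None)


cache = RollingBufferCache(window=4)
for i in range(10):
    cache.store(i, f"k{i}", f"v{i}")

print(cache.cached_positions())  # only the most recent `window` timesteps remain
print(32768 // 4096)             # cache-size ratio for 32k tokens with W = 4096
```

The cache never grows beyond W entries, so for a 32k-token sequence with W = 4096 it holds one eighth of what an unbounded cache would, which matches the 8× reduction stated above.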
With the above techniques, our model can support a 32K context length with fast inference and low
memory cost.
4 Evaluation
We conducted comprehensive performance comparisons against leading language models of compa-
rable scale, including Mistral-7B [ 6], LLaMA 2-7B [ 51], Gemma-7B [ 41], and Qwen v2-7B [ 11].
These models were selected based on their demonstrated excellence within the 7B or 8B category and
represent diverse development approaches from various research organizations worldwide. To ensure
a robust evaluation, we re-run all benchmarks with the same evaluation pipeline for fair comparisons.
Specifically, we use lm-evaluation-harness [71] and opencompass [72] for evaluation.
Lm-evaluation-harness provides a unified framework to test generative language models on a large
number of different evaluation tasks. It supports over 60 standard academic benchmarks for LLMs,
with hundreds of subtasks and variants implemented. This framework is versatile as it extends to
models implemented through various architectures, including transformers (including quantization
via AutoGPTQ [ 73]), GPT-NeoX [ 74], and Megatron-DeepSpeed [ 75], all unified through a flexible,
tokenization-agnostic interface. The framework is reliable, as evidenced by serving as the backend
for HuggingFace’s popular Open LLM Leaderboard and being utilized by dozens of organizations,
including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
To complement this, we also employ OpenCompass. This framework performs an in-depth and holistic
assessment of large language models structured around eight fundamental dimensions of language
model capabilities: language comprehension, knowledge precision, logical deduction, creative
ideation, mathematical problem-solving, programming proficiency, extended text analysis, and
intelligent agent engagement.
4.1 Evaluation Tasks
We evaluate the model performance on various tasks below.
• AI2 Reasoning Challenge (ARC) [76] - a set of genuine grade-school level, multiple-choice
science questions, assembled to encourage research in advanced question-answering. The
dataset is partitioned into a Challenge Set (ARC-C) and an Easy Set (ARC-E), where the
former contains only questions answered incorrectly by both a retrieval-based algorithm and
a word co-occurrence algorithm.
• HellaSwag [67] - a test of commonsense natural language inference, which is easy for
humans (95%) but challenging for SOTA models. It consists of 70,000 multiple-choice
questions. Each question presents a scenario followed by four possible outcomes, asking the
model to select the most reasonable conclusion.
• MMLU [77] - a test to measure a text model’s multitask accuracy. The test covers 57 tasks,
including elementary mathematics, US history, computer science, law, etc.
• Winogrande [78] - an adversarial and difficult Winograd benchmark at scale, for commonsense
reasoning. It contains 44,000 multiple-choice questions with two options each. It
requires the model to choose the appropriate entity word for the pronoun in the descriptive
text based on the scenario.
• PIQA [79] - the task of physical commonsense reasoning and a corresponding benchmark
dataset Physical Interaction: Question Answering (PIQA). Physical commonsense knowl-
edge is a major challenge on the road to true AI-completeness, including robots that interact
with the world and understand natural language. PIQA focuses on everyday situations with
a preference for atypical solutions.
4.2 Evaluation Results
We refer to the initial model as Moxin-7B-original, which is the foundation model before
fine-tuning on the training data of the evaluation datasets. After subsequent partial fine-tuning of
Moxin-7B-original on the training data of the evaluation datasets, we obtain Moxin-7B-finetuned,
enabling direct assessment of how targeted fine-tuning affects model performance.
4.2.1 Zero-Shot Evaluation
We report the results of base models for zero-shot evaluation in Table 2. The tasks are listed below.
After training with the training data of the evaluation tasks, our Moxin-7B-finetuned achieves
superior performance compared with state-of-the-art (SOTA) baselines. This significant increase from
the base model demonstrates the effectiveness of our fine-tuning approach. The improved performance
is particularly notable on complex reasoning tasks like PIQA, where the score increased from 78.07%
to 82.24%, matching or exceeding several leading models. Consequently, our models are a strong
candidate for real-world applications.
• AI2 Reasoning Challenge (0-shot)
• AI2 Reasoning Easy (0-shot)
• HellaSwag (0-shot)
• PIQA (0-shot)
• Winogrande (0-shot)
Table 2: Performance comparison for various models in zero-shot evaluation.
Models               HellaSwag  WinoGrande  PIQA   ARC-E  ARC-C  Avg
Mistral-7B           80.39      73.4        82.15  78.28  52.22  73.29
LLaMA 2-7B           75.99      69.06       79.11  74.54  46.42  69.02
LLaMA 2-13B          79.37      72.22       80.52  77.4   49.06  71.71
LLaMA 3.1-8B         78.92      74.19       81.12  81.06  53.67  73.79
gemma-7b             80.45      73.72       80.9   79.97  54.1   73.83
Qwen v2-7B           78.9       72.38       79.98  74.71  50.09  71.21
internlm2.5-7b       79.14      77.9        80.52  76.16  51.37  73.02
Baichuan2-7B         72.25      67.17       77.26  72.98  42.15  66.36
Yi-1.5-9B            77.86      73.01       80.74  79.04  55.03  73.14
deepseek-7B          76.13      69.77       79.76  71.04  44.8   68.3
Moxin-7B-original    72.06      66.31       78.07  71.47  48.15  67.21
Moxin-7B-finetuned   80.03      75.17       82.24  81.12  58.64  75.44
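The average column in Table 2 is consistent with an unweighted mean over the five zero-shot tasks. As a small check, the sketch below recomputes it for three rows copied from the table; the helper name `task_average` is hypothetical, and treating the column as a simple arithmetic mean is an inference from the numbers rather than a stated definition.

```python
def task_average(scores):
    """Unweighted mean of per-task scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Columns: HellaSwag, WinoGrande, PIQA, ARC-E, ARC-C (values copied from Table 2).
zero_shot = {
    "Mistral-7B": [80.39, 73.4, 82.15, 78.28, 52.22],
    "Moxin-7B-original": [72.06, 66.31, 78.07, 71.47, 48.15],
    "Moxin-7B-finetuned": [80.03, 75.17, 82.24, 81.12, 58.64],
}

averages = {name: task_average(scores) for name, scores in zero_shot.items()}
gain = round(averages["Moxin-7B-finetuned"] - averages["Moxin-7B-original"], 2)
print(averages)
print(gain)  # average improvement from fine-tuning, in percentage points
```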
4.2.2 Few-Shot Evaluation
Table 3 presents our few-shot evaluation results across multiple benchmark tasks. The tasks and
their few-shot settings are listed below. Thanks to its rigorous and high-quality training corpus, our
model demonstrates a competitive edge in tasks that involve language understanding and
knowledge application. Our Moxin-7B-original outperforms LLaMA 2-7B in
this scenario. After training with the training data of the evaluation tasks, our Moxin-7B-finetuned
achieves competitive performance compared with SOTA baselines.
Consequently, our models are a strong choice for real-world applications
that rely on robust language comprehension and extensive knowledge.
• AI2 Reasoning Challenge (25-shot)
• HellaSwag (10-shot)
• MMLU (5-shot)
• Winogrande (5-shot)
Table 3: Performance comparison for various models in few-shot evaluation.
Models               ARC-C  HellaSwag  MMLU   WinoGrande  Avg
Mistral-7B           57.59  83.25      62.42  78.77       70.51
LLaMA 3.1-8B         54.61  81.95      65.16  77.35       69.77
LLaMA 3-8B           55.46  82.09      65.29  77.82       70.17
LLaMA 2-7B           49.74  78.94      45.89  74.27       62.21
Qwen 2-7B            57.68  80.76      70.42  77.43       71.57
gemma-7B             56.48  82.31      63.02  78.3        70.03
internlm2.5-7B       54.78  79.7       68.17  80.9        70.89
Baichuan2-7B         47.87  73.89      54.13  70.8        61.67
Yi-1.5-9B            58.36  80.36      69.54  77.53       71.48
Moxin-7B-original    53.75  75.46      59.43  70.32       64.74
Moxin-7B-finetuned   59.47  83.08      60.97  78.69       70.55
4.3 Alignment Evaluation
We evaluate the alignment performance on MTBench [80]. It is a two-round conversation dataset
with 80 questions. It covers eight dimensions (reasoning, roleplay, math, coding, writing,
humanities, STEM, and information extraction) with 10 questions for each dimension. The model
needs to answer the first question and then refine its previous response following additional
specific instructions. We use GPT-4 as a judge model to provide scores (between 1 and 10) for the
quality of responses. As shown in Table 4, our Moxin Chat 7B scores higher than the 7B baselines
(Llama 2 Chat 7B and Vicuna 7B) and Alpaca 13B, and comes close to the 13B Llama 2 Chat and
Vicuna models.
Table 4: Performance for various chat models.
Model              MTbench
Moxin Chat 7B      6.42
Llama 2 Chat 13B   6.65
Vicuna 13B         6.57
Llama 2 Chat 7B    6.27
Vicuna 7B          6.17
Alpaca 13B         4.53
4.4 Generation Examples
We show some text generation examples from our Moxin Chat 7B and Llama2 Chat 7B in
Tables 5 and 6.
Example 1
Moxin Chat 7B
Can you explain the concept of regularization in machine learning?
Regularization is a technique used in machine learning to prevent overfitting by introducing a penalty
term in the loss function. The idea behind regularization is to encourage simpler models that are easier to
interpret and generalize better to new data.
There are two main types of regularization:
1. L1 regularization: also known as Lasso regularization, this method adds a penalty term to the loss
function that is proportional to the absolute value of the coefficients in the model. This encourages
sparsity, meaning that some coefficients may be exactly zero, which simplifies the model and makes it
easier to interpret.
2. L2 regularization: also known as Ridge regularization, this method adds a penalty term to the loss
function that is proportional to the square of the coefficients in the model. This encourages shrinkage,
meaning that the coefficients are pulled closer to zero, which simplifies the model and reduces the
variance of the model estimates.
Regularization can be used in various machine learning models, including linear regression, logistic
regression, and neural networks. It is often used in conjunction with other techniques such as cross-
validation to find the optimal hyperparameters that balance the bias-variance tradeoff and prevent
overfitting.
Llama2 Chat 7B
Can you explain the concept of regularization in machine learning?
Regularization is a technique used in machine learning to prevent overfitting. It involves adding a
penalty term to the loss function to discourage large values of the model’s weights.
There are several types of regularization, including:
1. L1 regularization (Lasso): This adds a penalty term to the loss function that is proportional to the
absolute value of the model’s weights. This encourages the model to have smaller weights, which can
help prevent overfitting.
2. L2 regularization (Ridge): This adds a penalty term to the loss function that is proportional to the
square of the model’s weights. This also encourages the model to have smaller weights, but it can lead
to some weights being set to zero, which can result in a simpler model.
3. Dropout regularization: This is a type of regularization that is applied during training. It involves
randomly setting a fraction of the model’s neurons to zero during each training iteration. This helps to
prevent overfitting by making the model less reliant on any single neuron.
4. Early stopping: This is a type of regularization that is applied during training. It involves monitoring
the validation loss during training and stopping the training process when the validation loss stops
improving. This helps to prevent overfitting by stopping the training process before the model has a
chance to fit the noise in the training data.
Regularization can be applied to different layers of a neural network, such as the input layer, hidden
layers, or output layer. It can also be applied to different parts of a model, such as the weights, biases, or
activation functions. The main idea behind regularization is to add a penalty term to the loss function that
discourages large values of the model’s weights. This helps to prevent overfitting by encouraging the
model to find a simpler solution that generalizes better to new data.
Regularization can be used in conjunction with other techniques, such as cross-validation and grid search,
to improve the performance of a machine learning model. In summary, regularization is a technique used
in machine learning to prevent overfitting by adding a penalty term to the loss function. There are several
types of regularization, including L1, L2, dropout, and early stopping, and it can be applied to different
layers and parts of a model. Regularization helps to find a simpler solution that generalizes better to new
data and can be used in conjunction with other techniques to improve the performance of a machine
learning model.
Table 5: Generation Examples.
Example 2
Moxin Chat 7B
Can you explain the concept of normalization in machine learning?
Normalization is a technique used to scale the data before using it for training a machine learning model.
It is used to make the data more uniform, and to prevent it from being skewed by large or small values.
There are several ways to normalize data, including min-max scaling, z-score normalization, and
logarithmic scaling. In min-max scaling, the values are scaled to a range between 0 and 1, by dividing
them by the maximum value in the dataset. In z-score normalization, the values are scaled to have a mean
of 0 and a standard deviation of 1, by subtracting the mean from each value and dividing by the standard
deviation. In logarithmic scaling, the values are transformed using the logarithm function, which can be
useful for normalizing data with a wide range of values.
Normalization is an important step in preparing data for machine learning, as it can help to improve the
performance of the model by making the data more uniform and easier to analyze. It can also help to
prevent overfitting and improve generalization.
Llama2 Chat 7B: Can you explain the concept of normalization in machine learning?
Normalization is a technique used in machine learning to transform input data into a range of values that
are suitable for use in a model. The goal of normalization is to ensure that the data is on the same scale, so
that the model can learn meaningful relationships between the features.
There are several types of normalization techniques, including:
1. Min-max normalization: This involves scaling the data so that it falls within a specific range, usually
between 0 and 1.
2. Standardization: This involves subtracting the mean and dividing by the standard deviation of the data,
so that the data has a mean of 0 and a standard deviation of 1.
3. Z-score normalization: This involves subtracting the mean and dividing by the standard deviation of
the data, so that the data has a mean of 0 and a standard deviation of 1.
4. Log transformation: This involves taking the logarithm of the data, which can help to reduce the effect
of extreme values in the data.
5. Square root transformation: This involves taking the square root of the data, which can help to reduce
the effect of outliers in the data.
Normalization is an important preprocessing step in machine learning, as it can help to improve the
performance of the model by reducing the effect of irrelevant features and improving the interpretability
of the results.
Here are some key points to keep in mind when normalizing data for machine learning:
1. Normalization can improve the performance of the model by reducing the effect of irrelevant features
and improving the interpretability of the results.
2. There are several types of normalization techniques, including min-max normalization, standardization,
z-score normalization, log transformation, and square root transformation.
3. The choice of normalization technique will depend on the nature of the data and the specific
requirements of the model.
4. Normalization can be applied to both continuous and categorical data.
5. It is important to normalize the data before training the model, as normalization can improve the
convergence of the model and reduce the risk of overfitting.
6. Normalization can also be used to improve the interpretability of the results by reducing the effect of
irrelevant features and improving the scalability of the model.
7. It is important to carefully evaluate the performance of the normalized data and consider the trade-offs
between different normalization techniques.
8. Normalization can be applied to different features of the data, such as the input features, the target
variable, or the output features.
9. Normalization can be used in combination with other preprocessing techniques, such as feature scaling,
feature selection, and data augmentation.
10. Normalization is a powerful tool for improving the performance of machine learning models, but it is
important to carefully evaluate the performance of the normalized data and consider the trade-offs
between different normalization techniques.
Table 6: Generation Examples.
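For reference, the three scalings named in both generated answers can be sketched as follows. This is a minimal illustration using the conventional textbook definitions, not code from the Moxin release; the function names are hypothetical. Note that conventional min-max scaling subtracts the minimum and divides by the range, and that "standardization" and "z-score normalization" refer to the same transformation. The log variant here assumes non-negative inputs and uses log(1 + x) so that zero maps to zero.

```python
import math
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant input: every z-score is 0
    return [(v - mean) / std for v in values]


def log_scale(values):
    """Apply log(1 + x); assumes every value is non-negative."""
    return [math.log1p(v) for v in values]


if __name__ == "__main__":
    data = [2.0, 4.0, 6.0]
    print(min_max_scale(data))
    print(z_score(data))
    print(log_scale(data))
```

After `z_score`, the values have mean 0 and population standard deviation 1; after `min_max_scale`, the smallest value maps to 0 and the largest to 1.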
5 Conclusion
The field of Large Language Models has witnessed a significant shift toward open-source development,
fostering innovation within the AI community. However, a critical challenge emerges: many
purportedly open-source models withhold essential components necessary for full understanding and
reproducibility, creating barriers that limit both academic advancement and commercial adoption.
This not only hampers scientific progress but also prevents businesses from fully leveraging
these models for innovative applications, ultimately diminishing potential societal benefits and economic
value creation. To address these limitations, we introduce Moxin 7B, a fully open-source
language model developed in accordance with the Model Openness Framework (MOF), providing
comprehensive access to pre-training code, configurations, training and fine-tuning datasets, and all
intermediate checkpoints. Our evaluation shows that Moxin 7B achieves superior
zero-shot results compared to popular 7B models while maintaining competitive few-shot
capabilities. We hope to see more work that establishes new standards for reproducible research in
language model development, fostering a more inclusive and economically vibrant AI ecosystem.
9
Page 10:
References
[1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774 , 2023.
[2]Anthropic. The claude 3 model family: Opus, sonnet, haiku. https:
//www-cdn .anthropic .com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/
Model_Card_Claude_3 .pdf.
[3]Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly
capable multimodal models. arXiv preprint arXiv:2312.11805 , 2023.
[4]Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd
of models. arXiv preprint arXiv:2407.21783 , 2024.
[5]Thomas Prest, Pierre-Alain Fouque, Jeffrey Hoffstein, Paul Kirchner, Vadim Lyubashevsky,
Thomas Pornin, Thomas Ricosset, Gregor Seiler, William Whyte, and Zhenfei Zhang. Falcon.
Post-Quantum Cryptography Project of NIST , 2020.
[6]Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825 , 2023.
[7]Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty
Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index. arXiv
preprint arXiv:2310.12941 , 2023.
[8]Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Peter
Cihon, Aspen Hopkins, Kevin Bankston, Stella Biderman, Miranda Bogen, et al. On the societal
impact of open foundation models. arXiv preprint arXiv:2403.07918 , 2024.
[9]Matt White, Ibrahim Haddad, Cailean Osborne, Ahmed Abdelmonsef, Sachin Varghese, et al.
The model openness framework: Promoting completeness and openness for reproducibility,
transparency and usability in ai. arXiv preprint arXiv:2403.13784 , 2024.
[10] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609 , 2023.
[11] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint
arXiv:2407.10671 , 2024.
[12] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez
Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid
transformer-mamba language model. arXiv preprint arXiv:2403.19887 , 2024.
[13] Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben
Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-
mamba models at scale. arXiv preprint arXiv:2408.12570 , 2024.
[14] Rico Sennrich. Neural machine translation of rare words with subword units. arXiv preprint
arXiv:1508.07909 , 2015.
[15] OpenAI Team. tiktoken, 2022.
[16] T Kudo. Sentencepiece: A simple and language independent subword tokenizer and detokenizer
for neural text processing. arXiv preprint arXiv:1808.06226 , 2018.
[17] Zhilin Yang. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv
preprint arXiv:1906.08237 , 2019.
[18] Hugging Face Team. Summary of the tokenizers. 2024.
[19] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-
baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv
preprint arXiv:2403.05530 , 2024.
10
Page 11:
[20] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt:
Towards detailed video understanding via large vision and language models. arXiv preprint
arXiv:2306.05424 , 2023.
[21] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language
model for video understanding. arXiv preprint arXiv:2306.02858 , 2023.
[22] Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-
llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601 ,
2024.
[23] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris
Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand,
et al. Mixtral of experts. arXiv preprint arXiv:2401.04088 , 2024.
[24] Ben Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sas-
try, A Askell, S Agarwal, et al. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165 , 1, 2020.
[25] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli,
Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb
dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv
preprint arXiv:2306.01116 , 2023.
[26] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language
models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 ,
2021.
[27] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco
Guzmán, Armand Joulin, and Edouard Grave. Ccnet: Extracting high quality monolingual
datasets from web crawl data. arXiv preprint arXiv:1911.00359 , 2019.
[28] L Xue. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint
arXiv:2010.11934 , 2020.
[29] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of machine learning research , 21(140):1–67, 2020.
[30] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. Advances
in neural information processing systems , 32, 2019.
[31] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374 , 2021.
[32] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse
text for language modeling. arXiv preprint arXiv:2101.00027 , 2020.
[33] Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H Chi,
James Caverlee, Julian McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms.
arXiv preprint arXiv:2402.09668 , 2024.
[34] Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph,
Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, et al. A pretrainer’s guide to training
data: Measuring the effects of data age, domain coverage, quality, & toxicity. arXiv preprint
arXiv:2305.13169 , 2023.
[35] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu,
Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of
language models with mixture-of-experts. In International Conference on Machine Learning ,
pages 5547–5569. PMLR, 2022.
[36] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris
Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models
better. arXiv preprint arXiv:2107.06499 , 2021.
11
Page 12:
[37] Amit Agarwal, Hema Swetha Koppula, Krishna P Leela, Krishna Prasad Chitrapura, Sachin
Garg, Pavan Kumar GM, Chittaranjan Haty, Anirban Roy, and Amit Sasturkar. Url normalization
for de-duplication of web pages. In Proceedings of the 18th ACM conference on information
and knowledge management , pages 1987–1990, 2009.
[38] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal,
Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation
of training sets for language models. arXiv preprint arXiv:2406.11794 , 2024.
[39] Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data
mixing for language model pre-training. In R0-FoMo: Robustness of Few-shot and Zero-shot
Learning in Large Foundation Models , 2023.
[40] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang,
Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, et al. Slimpajama-dc: Under-
standing data combinations for llm training. arXiv preprint arXiv:2309.10818 , 2023.
[41] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open
models based on gemini research and technology. arXiv preprint arXiv:2403.08295 , 2024.
[42] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. Journal of Machine Learning Research , 24(240):1–
113, 2023.
[43] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell
Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An
open corpus of three trillion tokens for language model pretraining research. arXiv preprint
arXiv:2402.00159 , 2024.
[44] Guilherme Penedo, Hynek Kydlí ˇcek, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro
V on Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data
at scale. arXiv preprint arXiv:2406.17557 , 2024.
[45] Malte Ostendorff, Pedro Ortiz Suarez, Lucas Fonseca Lage, and Georg Rehm. Llm-datasets:
An open framework for pretraining datasets of large language models.
[46] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Noua-
mane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack
v2: The next generation. arXiv preprint arXiv:2402.19173 , 2024.
[47] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950 , 2023.
[48] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and
Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head
checkpoints. arXiv preprint arXiv:2305.13245 , 2023.
[49] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150 , 2020.
[50] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness,
and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of
RedPajama. https://cerebras .ai/blog/slimpajama-a-627b-token-cleaned-and-
deduplicated-version-of-redpajama , 2023.
[51] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023.
[52] Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov,
Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, et al. Redpajama: an open dataset
for training large language models. arXiv preprint arXiv:2411.12372 , 2024.
[53] Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semd-
edup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint
arXiv:2303.09540 , 2023.
12
Page 13:
[54] Large-scale near-deduplication behind bigcode, 2023.
[55] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural
text degeneration. arXiv preprint arXiv:1904.09751 , 2019.
[56] J Leskovec, A Rajaraman, and JD Ullman. Mining of massive datasets, cambridge university
press, cambridge, 2014.
[57] Jay M Patel and Jay M Patel. Introduction to common crawl datasets. Getting structured
data from the internet: running web crawlers/scrapers on a big data production scale , pages
277–324, 2020.
[58] Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Com-
pression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171) , pages 21–29. IEEE,
1997.
[59] Fineweb, 2024.
[60] David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, Jonathan Richard Schwarz, and Sham
Kakade. Color-filter: Conditional loss reduction filtering for targeted language model pre-
training. arXiv preprint arXiv:2406.10670 , 2024.
[61] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal
Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425 , 2023.
[62] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord,
Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating
the science of language models. arXiv preprint arXiv:2402.00838 , 2024.
[63] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz
Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. The stack: 3 tb
of permissively licensed source code. arXiv preprint arXiv:2211.15533 , 2022.
[64] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan
Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang
Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130b: An open bilingual
pre-trained model. In The Eleventh International Conference on Learning Representations ,
2023.
[65] Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subrama-
nian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, et al.
Nemotron-4 15b technical report. arXiv preprint arXiv:2402.16819 , 2024.
[66] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
[67] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830 , 2019.
[68] Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang
Wang, and Yang You. Colossal-ai: A unified deep learning system for large-scale parallel
training. In Proceedings of the 52nd International Conference on Parallel Processing , pages
766–775, 2023.
[69] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 , 2017.
[70] Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep
Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing
climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702 , 2023.
[71] LM Evaluation Harness Team. Lm evaluation harness, 2024. Accessed: Summer 2024.
[72] Open Compass Team. Open compass, 2024. Accessed: Summer 2024.
[73] AutoGPTQ Team. Autogptq: An user-friendly llms quantization package, 2024. Accessed:
Spring 2024.
[74] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding,
Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source
autoregressive language model. arXiv preprint arXiv:2204.06745 , 2022.
13
Page 14:
[75] Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Cheng-
ming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, et al. Deep-
speed4science initiative: Enabling large-scale scientific discovery through sophisticated ai
system technologies. arXiv preprint arXiv:2310.04610 , 2023.
[76] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning
challenge. arXiv:1803.05457v1 , 2018.
[77] Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F Brown, and Francis Rhys Ward. Ai
sandbagging: Language models can strategically underperform on evaluations. arXiv preprint
arXiv:2406.07358 , 2024.
[78] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM , 64(9):99–106,
2021.
[79] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning
about physical commonsense in natural language, 2019.
[80] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. Advances in Neural Information Processing Systems , 36:46595–46623, 2023.
14