
arxiv

Paper 2408.11796

LLM Pruning and Distillation in Practice: The Minitron Approach

Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev, Daniel Korzekwa, Pavlo Molchanov, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro

Published: 2024-08-21

Abstract:

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.

Paper Content: on Alphaxiv
2024-12-10

LLM Pruning and Distillation in Practice: The Minitron Approach

Sharath Turuvekere Sreenivas*, Saurav Muralidharan*, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev, Daniel Korzekwa, Pavlo Molchanov, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, and Bryan Catanzaro

Abstract: Structured pruning with knowledge distillation is a potent combination for obtaining small language models (SLMs) with significantly fewer training tokens and compute resources compared to training from scratch. In this work, we investigate how this strategy can be effectively applied in instances where access to the original pretraining dataset is restricted. We introduce a new teacher correction phase before distillation which lets the teacher model adjust to our specific data distribution using a lightweight fine-tuning phase. We apply this strategy to compress the Mistral NeMo 12B and Llama 3.1 8B models to 8B and 4B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and further tested for instruction following, role-play, math, coding and function calling capabilities. This approach produces the state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B, and a compelling 4B model from Llama 3.1 8B. We open-source our base model weights on Hugging Face with a permissive license.
Models on Hugging Face: Mistral-NeMo-Minitron-8B-Base | Llama-3.1-Minitron-4B-Width-Base | Llama-3.1-Minitron-4B-Depth-Base

Introduction

LLM providers often train an entire family of models from scratch, each with a different size (number of parameters, e.g. Llama 3.1 with 8B, 70B, and 405B parameters [1]); this is done to aid users targeting different deployment scales, sizes and compute budgets. However, training multiple billion-plus parameter models from scratch is extremely time-, data- and resource-intensive.

Recent work has demonstrated the effectiveness of combining weight pruning with knowledge distillation to significantly reduce the cost of training LLM model families [2]. Here, only the biggest model in the family is trained from scratch; other models are obtained by successively pruning the bigger model(s) and then performing knowledge distillation [3] to recover the accuracy of pruned models. While highly effective, this line of work assumes access to the original pretraining dataset for the distillation phase. With a growing number of frontier LLMs (including open ones) being trained on private, proprietary datasets [1, 4], this assumption often fails to hold.

Figure 1 | High-level overview of our proposed pruning and distillation approach. The total number of tokens used for each step is indicated in parentheses. (Diagram labels: pretrained model (Mistral-NeMo-12B, LLaMa 3.1 8B etc.); teacher correction (127B); corrected teacher; pruning; student; distillation (<400B); Minitron model.)

In this work, we adapt the original Minitron compression recipe [2] along two directions: (1) we introduce a new teacher correction phase for adapting the teacher (unpruned) model to our own data distribution, thus removing any need to access the original pretraining dataset, and (2) we introduce a new and more effective downstream task-based saliency criterion for depth pruning.
We successfully apply our updated compression strategy to two state-of-the-art models: Llama 3.1 8B [1] and Mistral NeMo 12B [5], compressing them down to 4B and 8B parameters, respectively. For Llama 3.1 8B, we produce two distinct compressed models in the 4B parameter range: (1) Llama 3.1-Minitron-4B-Width (pruning only the width axes), and (2) Llama 3.1-Minitron-4B-Depth (pruning depth only). Figure 1 provides a high-level overview of our approach.

Tables 1 and 2 provide a summary of our results: our compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) which outperforms all similarly-sized models across the board on common language modeling benchmarks. Our Llama 3.1-Minitron-4B models (both depth and width-pruned variants) also exhibit strong accuracy compared to the teacher Llama 3.1 8B model and the previous-generation Minitron-4B model [2]; among the two variants, the width-pruned variant achieves better overall accuracy than the depth-pruned one. In terms of runtime inference performance measured using TensorRT-LLM, the Llama 3.1-Minitron-4B models provide an average speedup of 2.7× and 1.8× for the depth and width pruned variants, respectively, compared to the original Llama 3.1 8B model.

* Equal contribution. © 2024 NVIDIA. All rights reserved. arXiv:2408.11796v4 [cs.CL] 9 Dec 2024

| Benchmarks (shots) | Gemma2 2B* | Minitron 4B | Llama-3.1-Minitron 4B-Depth | Llama-3.1-Minitron 4B-Width | Gemma 7B | Mistral 7B | Llama 3.1 8B | MN-Minitron 8B | Mistral NeMo 12B-Base | Mistral NeMo 12B-FT |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Params | 2.6B | 4.2B | 4.5B | 4.5B | 8.5B | 7.3B | 8B | 8.4B | 12.2B | 12.2B |
| Non-Emb. Params | 2B | 2.6B | 3.7B | 3.7B | 7.7B | 7B | 7B | 7.3B | 10.9B | 10.9B |
| Training Tokens | 2T | 94B | 94B | 94B | 6T | 8T | 15T | 380B | - | +0.1T |
| Winogrande (5) | 70.9 | 74.0 | 72.1 | 73.5 | 78 | 78.5 | 77.3 | 80.4 | 82.2 | 82.7 |
| Arc_challenge (25) | 55.4 | 50.9 | 52.6 | 55.6 | 61 | 60.3 | 57.9 | 64.4 | 65.1 | 62.3 |
| MMLU (5) | 51.3 | 58.6 | 58.7 | 60.5 | 64 | 64.1 | 65.3 | 69.5 | 69.0 | 70.1 |
| Hellaswag (10) | 73.0 | 75.0 | 73.2 | 76.1 | 82 | 83.2 | 81.8 | 83.0 | 85.2 | 85.3 |
| GSM8k (5) | 23.9 | 24.1 | 16.8 | 41.2 | 50 | 37.0 | 48.6 | 58.5 | 56.4 | 55.7 |
| Truthfulqa (0) | - | 42.9 | 38.2 | 42.9 | 45 | 42.6 | 45.0 | 47.6 | 49.8 | 48.3 |
| XLSum en (20%) (3) | - | 29.5 | 27.2 | 28.7 | 17 | 4.8 | 30.0 | 32.0 | 33.4 | 31.9 |
| MBPP (0) | 29.0 | 28.2 | 30.7 | 32.4 | 39 | 38.8 | 42.3 | 43.8 | 42.6 | 47.9 |
| HumanEval (n=20) (0) | 20.1 | 23.3 | - | - | 32.0 | 28.7 | 24.8 | 36.2 | 23.8 | 23.8 |

Table 1 | Accuracy numbers for our MN-Minitron-8B and Llama 3.1-Minitron-4B models. We compare our models to similarly-sized SoTA open models on a variety of common language modeling benchmarks. All evaluations are conducted by us, except entries marked with * (taken from corresponding papers).

| Benchmarks (shots) | Phi-2 2.7B | Gemma2 2B | Qwen2 1.5B | Minitron 4B | Llama-3.1-Minitron 4B-Depth | Llama-3.1-Minitron 4B-Width | Llama 3.1 8B | MN-Minitron 8B |
|---|---|---|---|---|---|---|---|---|
| MT-Bench (GPT4-Turbo) | 5.14 | 7.44 | 5.49 | 6.46 | 6.19 | 6.88 | 7.78 | 7.86 |
| MMLU (5) | 56.8 | 56.9 | 55.6 | 59.3 | 61.21 | 59.89 | 69.4* | 70.4 |
| GSM8K (0) | 19.9 | 52.2 | 27.2 | 65.1 | 71.11 | 79.76 | 83.8 | 87.1 |
| GPQA (0) | 28.8 | 25.9 | 28.1 | 29.5 | 32.59 | 30.36 | 30.4* | 31.5 |
| HumanEval (0) | 47.6* | 45.1 | 47.0* | 39.6 | 42.7 | 47.0 | 72.6 | 71.3 |
| MBPP (0) | 55.0* | 50.4 | 51.9* | 57.4 | 60.3 | 65.1 | 72.8* | 72.5 |
| IFEval | 44.0 | 64.5 | 39.8 | 75.3 | 66.77 | 79.54 | 80.4* | 84.4 |
| BFCLv2 (Live) | 38.7 | 40.2 | 39.9 | 53.1 | 55.89 | 55.0 | 44.3 | 67.6 |

Table 2 | Accuracy numbers for instruction tuned models on a variety of benchmarks. All evaluations are conducted by us, except entries marked with * (taken from corresponding papers). Best of each section in bold. For IFEval, we report the average of prompt and instruction across loose and strict evaluations. For BFCLv2, we report live accuracy only.

Methodology

A high-level overview of our approach is illustrated in Figure 1.
Here, the teacher model undergoes a lightweight adjustment phase on the target dataset to be used for distillation; we refer to this step as teacher correction. Next, pruning is applied to compress the model, following which distillation is used to recover model accuracy.

Teacher Correction

Distillation is an effective technique to condense knowledge from a more accurate teacher model to improve a less accurate student model [2, 3]. Typically, knowledge is distilled using the same dataset the teacher model was trained on. In cases where access to the original training data is restricted, we notice from our experiments that the teacher model provides sub-optimal guidance if a different dataset is used to distill the knowledge. We hypothesize this is due to the change in distribution of sub-word tokens across the original dataset the teacher model was trained on vs. the dataset being distilled on. To this end, we propose a novel teacher correction phase (illustrated in Figure 2), where we perform a lightweight (∼100B tokens) fine-tuning of the teacher model to adapt to the new distillation dataset. We demonstrate in our experimental evaluation (Figure 4 in particular) that this procedure significantly improves the guidance, resulting in a more accurate student model. We also explore correcting the teacher in parallel to distillation, and demonstrate that this performs on par with using guidance from a fully corrected teacher.

Figure 2 | Overview of distillation: if/when the original training data is unavailable, a lightweight fine-tuning of the original model on the distillation dataset is recommended, to be used as a teacher. Distillation is then performed by minimizing KL divergence on the logits of the teacher and the pruned student model. (Diagram labels: Step 1, teacher correction; Step 2, retraining via distillation; teacher and student stacks of embedding, transformer layers, LM head, softmax and logits; cross-entropy loss against the next token; KL divergence loss; legend: frozen, trainable.)

Pruning

Weight pruning is a powerful and well-known technique for reducing model size. In this work, we focus on structured pruning, where blocks (or channels) of nonzero elements are removed at once from model weights; examples of structured pruning techniques include neuron, attention head, convolutional filter, and depth pruning [6, 7, 8, 9]. We follow the pruning recipe outlined in Minitron [2]: as shown in Figure 3, we start the pruning process by first computing the importance of each layer, neuron, head, and embedding dimension. We then sort these importance scores to compute a corresponding importance ranking.

Importance Estimation. We use a purely activation-based importance estimation strategy that simultaneously computes sensitivity information for all the axes we consider (depth, neuron, head, and embedding channel) using a small calibration dataset and only forward propagation passes. We consider depth pruning as a special case and do not combine it with compressing other dimensions. We compute the importance of each head, neuron and embedding channel by examining the activations produced by the multi-head attention (MHA), multi-layer perceptron (MLP) and LayerNorm layers, respectively. We use a small calibration dataset (1024 samples) drawn randomly from the full dataset for this purpose.

Layer Importance. For depth pruning, we consider two distinct metrics for evaluating layer importance: (1) LM validation loss/PPL, and (2) accuracy on the downstream task. We do not consider the Block Importance (BI) metric [8] as it was recently shown to under-perform the validation loss/PPL metric [2].
For ranking, we simply remove a single layer or a block of contiguous layers and compute its effect on each metric; this serves as the “importance” or sensitivity of the layer/layer block. Based on our empirical analysis (see Figures 8 and 9), we use the Winogrande metric [10] to prune sets of contiguous layers. This pruning strategy evolved from two important observations: (1) LM validation loss/PPL-based layer importance fails to produce the most accurate pruned model(s) on downstream tasks, and (2) dropping contiguous layers is better than dropping individual layers, as also observed in Gromov et al. [11].

Model Trimming. Following Minitron [2], for a given architecture configuration, we first rank the elements of each axis according to the computed importance and perform trimming of the corresponding weight matrices directly. For neuron and head pruning, we trim MLP and MHA layer weights, respectively. In the case of embedding channels, we trim the embedding dimension of the weight matrices in MLP, MHA, and LayerNorm layers. The original approach uses Neural Architecture Search (NAS) to find the best architecture; in this work, we skip this step and instead utilize the network architecture-related learnings from the original paper.

Retraining with Distillation

We use the term retraining to refer to the accuracy recovery process post pruning. In this work, we explore two retraining strategies: (1) conventional training, leveraging ground truth labels, and (2) knowledge distillation using supervision from the unpruned model (teacher). Knowledge Distillation (KD) [3] involves transfer of knowledge from a larger or more complex model called the teacher to a smaller/simpler model called the student. The knowledge transfer is achieved by having the student model mimic the output and/or the intermediate states of the teacher model. In our case, the uncompressed and pruned models correspond to the teacher and student, respectively. Following the best practices outlined in the Minitron work [2], we use forward KL Divergence loss [12] on the teacher and student logits only; this is illustrated in Figure 2.

Figure 3 | Pruning and distillation process outlined in the original paper [2]. We follow the same approach in this work. (Diagram labels: 1. Trained LLM (embedding, transformer blocks with layer norm, attention and MLP, layers 1 to L); 2. Estimate importance; 3. Rank; 4. Trim; 5. Distillation; iterative. The panels show embedding channels, MLP channels and attention heads being reordered by importance and then trimmed.)

Training Details

Pre-training

Llama 3.1 8B [1] and Mistral NeMo 12B [5] are pretrained on different proprietary datasets, which we do not have access to. According to the Llama 3.1 tech report [1], the 8B model is pretrained on 15T tokens. We start with the corresponding Base models that are openly available on Hugging Face.

Dataset

We use the Nemotron-4 curated continued training (CT) dataset [13, 14] for all our pruning and distillation experiments.

Teacher Correction

Using the original Mistral NeMo 12B or Llama 3.1 8B models directly as a teacher performs sub-optimally on our dataset. To counter this, we apply teacher correction, as described in the previous section, to both models with ∼100B tokens. Since the goal is to adapt the teacher model to the distillation dataset, we use 120 steps of warm-up and low learning rates: one-fifth of the peak learning rate, with the same batch size, minimum learning rate and decay schedule that the original model was trained with.
We notice that the correction process has a minor effect on the teacher model’s accuracy on downstream tasks, with some tasks improving and some degrading as shown in Table 1. We hypothesize this to be an artifact of the dataset used for fine-tuning. Optimizing this process further by using fewer than ∼100B tokens, lighter fine-tuning such as LoRA [15], or tuning layer normalization [16] parameters alone would be an interesting topic for future work.

| | LLaMa-3.1-Minitron 4B-Width | LLaMa-3.1-Minitron 4B-Depth | MN-Minitron 8B |
|---|---|---|---|
| Total params | 4.5B | 4.5B | 8.4B |
| Non-Emb params | 3.7B | 3.5B | 7.3B |
| Hidden size | 3072 | 4096 | 4096 |
| Vocabulary | 128256 | 128256 | 131072 |
| MLP hidden dim | 9216 | 14336 | 11520 |
| Depth | 32 | 16 | 40 |
| Attention groups | 8 | 8 | 8 |
| Query heads | 32 | 32 | 32 |
| Head dimension | 128 | 128 | 128 |

Table 3 | Architecture details of our compressed models.

Pruning

Our pruning recipe is based on the best practices outlined in the Minitron paper [2], as described in the previous section. Specifically, for width pruning, we (1) use l2-norm and mean as the aggregation functions across the batch and sequence dimensions, respectively, and (2) perform single-shot pruning, avoiding iterative approaches. For depth pruning, we follow the observations from Gromov et al. [11] and drop a contiguous subgroup of layers that results in the least accuracy drop on Winogrande [10]. In this work, we skip the lightweight neural architecture search (NAS) phase, and go with a manual architecture configuration for both Llama 3.1-Minitron-4B and MN-Minitron-8B. The architectures we come up with are inspired by the Minitron-4B and Minitron-8B models [2], and are detailed in Table 3.
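As a sanity check on the architectures in Table 3, the parameter counts can be roughly reproduced from the listed dimensions. The sketch below is illustrative only and rests on assumptions that the text does not state: grouped-query attention without bias terms, a gated MLP with three projection matrices, untied input and output embeddings, and normalization parameters small enough to ignore. The function names are hypothetical.

```python
def non_embedding_params(hidden, mlp_hidden, depth,
                         query_heads=32, kv_groups=8, head_dim=128):
    """Approximate non-embedding parameter count of a decoder-only model.

    Assumptions (not taken from the paper): grouped-query attention with no
    biases, a gated MLP with three weight matrices, norm weights ignored.
    """
    q_proj = hidden * query_heads * head_dim        # query projection
    kv_proj = 2 * hidden * kv_groups * head_dim     # key and value projections
    o_proj = query_heads * head_dim * hidden        # attention output projection
    mlp = 3 * hidden * mlp_hidden                   # gate, up and down projections
    return depth * (q_proj + kv_proj + o_proj + mlp)


def total_params(non_emb, hidden, vocab):
    # Assumption: separate (untied) input and output embedding matrices.
    return non_emb + 2 * vocab * hidden


# Dimensions copied from Table 3.
CONFIGS = {
    "Llama-3.1-Minitron-4B-Width": dict(hidden=3072, mlp_hidden=9216, depth=32, vocab=128256),
    "Llama-3.1-Minitron-4B-Depth": dict(hidden=4096, mlp_hidden=14336, depth=16, vocab=128256),
    "MN-Minitron-8B": dict(hidden=4096, mlp_hidden=11520, depth=40, vocab=131072),
}


def summarize(cfg):
    """Return (non-embedding, total) parameter counts in billions, rounded."""
    non_emb = non_embedding_params(cfg["hidden"], cfg["mlp_hidden"], cfg["depth"])
    total = total_params(non_emb, cfg["hidden"], cfg["vocab"])
    return round(non_emb / 1e9, 1), round(total / 1e9, 1)


for name, cfg in CONFIGS.items():
    print(name, summarize(cfg))
# Llama-3.1-Minitron-4B-Width (3.7, 4.5)
# Llama-3.1-Minitron-4B-Depth (3.5, 4.5)
# MN-Minitron-8B (7.3, 8.4)
```

Under these assumptions the estimates agree with the total and non-embedding counts listed in Table 3 for all three models.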
We provide the pruning recipes for each of our target compressed models below:

Llama-3.1-Minitron-4B-Width:
• Starting model: Llama 3.1 8B
• Hidden dimension: 4096 → 3072
• MLP hidden dimension: 14336 → 9216
• Attention heads: unchanged
• Depth: unchanged

Llama-3.1-Minitron-4B-Depth:
• Starting model: Llama 3.1 8B
• Hidden dimension: unchanged
• MLP hidden dimension: unchanged
• Attention heads: unchanged
• Depth: 32 → 16

MN-Minitron-8B:
• Starting model: Mistral NeMo 12B
• Hidden dimension: 5120 → 4096
• MLP hidden dimension: 14336 → 11520
• Attention heads: unchanged
• Depth: unchanged

Distillation

We opt for logit-only distillation, minimizing the forward KL Divergence [12] loss across the teacher and student probabilities, and ignore the LM cross-entropy loss altogether. Here, the unpruned and pruned models correspond to the teacher and student, respectively. We use the hyperparameters listed in Table 4 during distillation. We use 32 NVIDIA DGX H100 nodes for our training jobs.

| | Llama-3.1-Minitron | MN-Minitron |
|---|---|---|
| Peak learning rate | 1e-4 | 1e-4 |
| Min learning rate | 1e-5 | 4.5e-7 |
| Warm-up steps | 40 steps | 60 steps |
| LR decay schedule | Cosine | Cosine |
| Global batch size | 1152 | 768 |
| Context length | 8192 | 8192 |
| Total tokens | 94B | 380B |

Table 4 | Hyperparameters used during distillation-based retraining.

Instruction Tuning

To evaluate the instruction-following capabilities of our distilled models, we perform alignment using NeMo-Aligner [17]. We follow the same recipe for all our models by first applying math and code supervised fine-tuning (SFT) followed by instruction SFT and then two rounds of Reward-aware Preference Optimization (RPO) [18].

Analysis

We perform a series of ablation studies to better understand the effects of distillation, teacher correction, and our new depth-pruning saliency metric. We report our findings in this section.
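To make the logit-only objective from the Distillation section concrete, the following minimal sketch computes a forward KL divergence, KL(teacher || student), from raw logits for a toy vocabulary. It is plain Python for illustration rather than the training code used in this work; the function names are hypothetical, and averaging the per-position losses over the sequence is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    # sum_v p_teacher(v) * (log p_teacher(v) - log p_student(v))
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


def distillation_loss(teacher_seq, student_seq):
    """Mean forward KL over all positions (the averaging is an assumption)."""
    losses = [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)


# Toy example: two positions over a three-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher distribution exactly and positive otherwise. No ground-truth next-token label enters the computation, which mirrors the choice to drop the LM cross-entropy term.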
Teacher Correction. We first compare the effects of teacher correction on the MN-Minitron-8B model in Figure 4; here, we notice the clear benefits of performing teacher correction w.r.t. distilling directly from an uncorrected teacher. Next, we compare two approaches for teacher correction: (1) pruning and distilling the corrected teacher, and (2) pruning the original (uncorrected) teacher and distilling from a continuously corrected teacher. The results in Figure 5 suggest that teacher correction can be performed in parallel with distillation to recover accuracy of the pruned student model.

Pruning and Distillation. Figure 6 demonstrates the orthogonal benefits of pruning and distillation over random initialization and conventional fine-tuning, respectively. We compare (1) random weight initialization and distillation, (2) random pruning and distillation, where weights are pruned randomly ignoring the importance scores, (3) our proposed pruning with typical cross entropy based LM loss training and (4) our proposed pruning with distillation-based retraining. We notice that pruning results in a significantly better starting point compared to random initialization, and distillation-based training outperforms conventional training methods. Overall, our approach requires significantly fewer training tokens (up to 40×; 380B instead of 15T tokens) to produce the state-of-the-art MN-Minitron-8B model.

Width vs. Depth Pruning. Figure 7 shows the training curve of Llama 3.1-Minitron-4B pruned for width vs. depth. We notice that width pruning results in a lower initial loss and consistently outperforms the depth-pruned model, despite both variants having the same number of parameters.

Depth Pruning Metrics. By examining how LM validation loss increases as contiguous blocks of layers are removed (Figure 8), we observe that the layers at the beginning and end are the most important.
The figure indicates that removing non-contiguous layers can result in even better LM validation loss (the dashed line). However, we notice this observation does not necessarily hold when evaluating downstream task performance: specifically, Figure 9 shows that dropping 16 layers selected based on per-layer importance [8, 19] yields a chance-level Winogrande accuracy of 0.5, while removing the contiguous layers 16 to 31 [11] results in an accuracy of 0.595. The gap holds during distillation-based retraining and we opt for the latter approach in this work.

Figure 4 | Training convergence plot for the MN-Minitron-8B student model. We compare supervision from the original teacher and the corrected teacher. (Plot title: LM Validation Loss vs Training Steps; x-axis: Training Tokens (B); y-axis: LM Validation Loss; series: Original 12B Teacher, Fine-tuned 12B Teacher.)

Figure 5 | Training convergence plot for the MN-Minitron-8B student model. We compare (1) pruning and distilling the corrected teacher with (2) pruning the original (uncorrected) teacher and distilling from a continuously corrected teacher. We notice that teacher correction can be performed in parallel with distillation. (Plot title: LM Validation Loss vs Training Steps; x-axis: Training Tokens (B); y-axis: LM Validation Loss; series: Prune corrected teacher + distill corrected teacher, Prune original teacher + distill continuously corrected teacher.)

Evaluation

Benchmarks. Following Llama [20], we evaluate our compressed base and aligned models on a series of downstream tasks, namely MMLU [21], HumanEval [22] for Python code generation, MBPP [23] and GSM8K [24]. We also evaluate the base models on several question-answering datasets for common-sense reasoning: Arc-C [25], HellaSwag [26], TruthfulQA [27], WinoGrande [10], and XL-Sum English [28] for summarization.
The instruction-tuned models are further evaluated for question-answering, function calling, instruction following and multiturn conversations on GPQA [29], BFCL [30], IFEval [31] and MT-Bench (GPT4-Turbo) [32], respectively. Note that this MT-Bench is a corrected version of the original MT-Bench [33].

For base models, accuracy is reported with the following evaluation settings: 5-shot on MMLU, 5-shot on Winogrande, 25-shot on ARC-Challenge, 10-shot on HellaSwag, 0-shot on 20% of XL-Sum and average pass@1 scores for HumanEval and MBPP. For pass@1 scores we use a temperature of 0.2 and nucleus sampling [34] with top-p = 0.95. For aligned models we use 0-shot and greedy sampling if applicable.

Base Models

Base model evaluation results are shown in Table 1. Compared to similarly-sized models, MN-Minitron-8B demonstrates superior accuracy across the board, outperforming the recent Llama 3.1 8B model using 40× fewer training tokens (380B vs. 15T). Similarly, the Llama 3.1-Minitron-4B models perform favorably compared to the teacher Llama 3.1 8B model using 150× fewer training tokens (94B vs. 15T); our pruned Llama models also outperform the original Minitron 4B model [2]. We note from Table 1 that the width-pruned Llama variant outperforms the depth-pruned one. These results clearly demonstrate the advantages of our methodology: state-of-the-art accuracy coupled with an order of magnitude improvement in training efficiency.

Instruct Models

The accuracy of the instruction-tuned model variants is shown in Table 2. Our aligned models outperform similarly sized variants on most evaluated benchmarks with the exception of HumanEval [35] and MBPP [23]. Additionally, Llama 3.1-Minitron-4B lags behind Gemma2 on MT-Bench [33]. Nevertheless, our aligned models are consistently better on MMLU [21], GSM8K [24], GPQA [29], IFEval [31] and BFCLv2 [30]. This demonstrates the strong capabilities of our model.
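The token-efficiency factors quoted in the Base Models discussion follow from dividing the 15T-token pretraining budget reported for Llama 3.1 8B by each distillation budget. The short sketch below spells out that arithmetic; the constant and function names are illustrative, and only the distillation tokens are counted, as in the text.

```python
LLAMA_31_PRETRAIN_TOKENS = 15e12      # 15T, per the Llama 3.1 tech report [1]
MN_MINITRON_8B_TOKENS = 380e9         # 380B distillation tokens (Table 4)
LLAMA_MINITRON_4B_TOKENS = 94e9       # 94B distillation tokens (Table 4)


def token_reduction(baseline_tokens, distill_tokens):
    """How many times fewer tokens distillation uses than the baseline."""
    return baseline_tokens / distill_tokens


print(round(token_reduction(LLAMA_31_PRETRAIN_TOKENS, MN_MINITRON_8B_TOKENS), 1))     # 39.5
print(round(token_reduction(LLAMA_31_PRETRAIN_TOKENS, LLAMA_MINITRON_4B_TOKENS), 1))  # 159.6
```

These ratios correspond to the rounded 40× and 150× figures given in the text.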
Runtime Performance Analysis

To evaluate runtime performance, we optimize the Llama 3.1 8B and Llama 3.1-Minitron-4B variants with NVIDIA TensorRT-LLM, an open-source toolkit for optimized LLM inference.

Figure 10 shows the throughput in requests per second for the various models in FP8 precision obtained on a single H100 80 GB GPU. Different use cases are represented by increasing input sequence length/output sequence length (ISL/OSL) combinations, at a batch size of 32 and 64 for the 8B-12B models and the 4B models respectively. The smaller memory footprint of the 4B model allows for larger batches. We notice that Llama 3.1-Minitron-4B (Depth) is fastest, achieving an average throughput improvement of 2.7× over Llama 3.1 8B; the width-pruned variant achieves an average throughput improvement of 1.8× over Llama 3.1 8B. Compared to BF16, we notice that FP8 delivers a performance boost of 1.4×.

Figure 6 | Training convergence plot for the MN-Minitron-8B model. We compare (a) random initialization with distillation, (b) randomly pruned weights with distillation, (c) pruning with standard LM loss, and (d) our pipeline with pruning and distillation. This plot shows the benefits of pruning and distillation over random initialization and conventional finetuning, respectively. (Plot title: LM Validation Loss vs Training Steps; x-axis: Training Tokens (B); y-axis: LM Validation Loss; series: Random Init + Distillation, Random Pruning + Distillation, Pruning + LM Loss, Pruning + Distillation.)

Figure 7 | Convergence plots for the width-pruned and depth-pruned versions of Llama 3.1 8B to 4B compressed models. Width pruning consistently outperforms depth pruning for a given parameter budget. (Plot title: LM Validation Loss vs Training Steps; x-axis: Training Tokens (B); y-axis: LM Validation Loss; series: Llama-3.1-Minitron-4B-Width, Llama-3.1-Minitron-4B-Depth.)
Insights

In this section, we summarize some interesting and surprising observations based on our evaluation.

General

1. Teacher correction is crucial for distillation to work optimally on a new, unseen dataset. Fine-tuning the teacher with the dataset used for distillation in this manner yields over a 6% reduction in LM validation loss. Teacher correction doesn’t affect the optimality of pruning and can even be performed in parallel with distillation.
2. In line with the Minitron paper’s observations, we require an order of magnitude fewer tokens (380B vs 15T) to achieve state-of-the-art accuracy post pruning with distillation.
3. For width pruning, we achieve stronger accuracy by retaining attention heads and pruning the other dimensions (MLP intermediate dimension, embedding channels).

Mistral NeMo 12B to MN-Minitron-8B

1. Our compressed model outperforms the teacher on two benchmarks, GSM8k and HumanEval, after pruning and distillation: GSM8k increases from 55.7% to 58.5% and HumanEval increases from 23.8% to 36.2%. This improvement is likely influenced by the dataset. However, retraining is performed using the distillation loss alone.

Llama 3.1 8B to Llama 3.1-Minitron-4B

1. Width pruning delivers better accuracy with MMLU at 60.5%, while depth pruning yields 58.7%, for Llama 3.1 compression.
2. Reasoning ability for base variants appears to be impacted significantly for the depth pruned version, with GSM8K accuracy at 16.8% compared to 41.24% for the width pruned version. However, the gap reduces with instruct tuning.
3. Depth pruning boosts throughput, achieving 2.7× speedup over Llama-3.1 8B, while width pruning provides 1.7× speedup.
4. For depth pruning, we observe that dropping contiguous layers from the model is more effective than using non-contiguous, importance-based pruning.
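The contiguous-block selection behind the last observation can be sketched as an exhaustive search over block positions: remove each candidate block, score the remaining model, and keep the block whose removal hurts least. The code below is a hedged illustration, not the implementation used in this work. The callback `score_without` is a hypothetical stand-in for evaluating the pruned model on a downstream task such as Winogrande, and the toy scorer with its per-layer values is synthetic.

```python
from typing import Callable, List, Tuple


def best_contiguous_block(
    num_layers: int,
    block_size: int,
    score_without: Callable[[List[int]], float],
) -> Tuple[int, float]:
    """Return (start, score) for the contiguous block whose removal keeps
    the score highest. Layer indices are 0-based; `score_without` receives
    the indices of the layers that are kept."""
    best_start, best_score = -1, float("-inf")
    for start in range(num_layers - block_size + 1):
        kept = [i for i in range(num_layers)
                if not (start <= i < start + block_size)]
        score = score_without(kept)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Synthetic per-layer importance, for illustration only: early layers and
# the final layer are made to matter most. In practice the scorer would run
# a real evaluation of the pruned model instead of summing made-up numbers.
TOY_IMPORTANCE = [10] * 15 + [1] * 16 + [10]


def toy_score(kept: List[int]) -> float:
    return sum(TOY_IMPORTANCE[i] for i in kept)


start, score = best_contiguous_block(32, 16, toy_score)
print(start, score)  # 15 160
```

With this synthetic scorer the search drops 0-based layers 15 to 30, that is, layers 16 to 31 in 1-based numbering. The search needs only one evaluation per block position, so 17 evaluations for a 16-layer block in a 32-layer model.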
Figure 8 | LM loss value on validation set after removing 1, 2, 8 or 16 contiguous layers from Llama 3.1 8B. The purple line at layer no. 16 indicates the LM loss if we dropped the first 16 layers. Layer no. 17 indicates the LM loss if we leave the first layer intact and drop layers 2 to 17. The dashed line corresponds to the LM loss value when removing the 16 non-contiguous layers that increase the loss the least. (Plot title: LM Validation loss for different set of layers dropped; x-axis: layer no.; y-axis: Validation loss; series: baseline (32 layers), drop 1 layer, drop 2 layers, drop 8 layers, drop 16 layers, drop 16 non-continuous.)

Figure 9 | Accuracy on the Winogrande task when removing 16 contiguous layers from Llama 3.1 8B. Layer no. 17 indicates the accuracy if we leave the first layer intact and drop layers 2 to 17. The dashed line corresponds to the accuracy when removing the 16 non-contiguous layers that increase the loss by the least amount. (Plot title: Accuracy for different set of 16 layers dropped; x-axis: layer no.; y-axis: Winogrande (5-shot); series: drop 16..31, drop 1..16, baseline (32 layers), drop 16 layers, drop 16 layers non-continuous.)

Figure 10 | TensorRT-LLM FP8 throughput comparison for the Llama 3.1-Minitron-4B models with the Llama 3.1 8B model w.r.t. increasing input and output sequence lengths.

Acknowledgments

This work would not have been possible without contributions from many people at NVIDIA.
To mention a few:

Foundational Model: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Pavlo Molchanov, Mostofa Patwary, Daniel Korzekwa, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro and Jan Kautz

Alignment: Ameya Sunil Mahabaleshwarkar, Hayley Ross, Brandon Rowlett, Oluwatobi Olabiyi, Shizhe Diao and Yoshi Suhara

Datasets: Sanjeev Satheesh, Jupinder Parmar, Shengyang Sun, Jiaqi Zeng, Zhilin Wang, Yi Dong, Zi-han Liu, Rajarshi Roy, Wei Ping, Makesh Narsimhan Sreedhar and Oleksii Kuchaiev

TensorRT-LLM: Bobby Chen, James Shen and Chenhan Yu

Hugging Face Support: Ao Tang, Yoshi Suhara and Greg Heinrich

References

[1] Abhimanyu Dubey and Abhinav Jauhri et al. The Llama 3 Herd of Models. arXiv 2407.21783, 2024.

[2] Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. arXiv preprint arXiv:2407.14679, 2024.

[3] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.

[4] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024.

[5] Mistral AI team. Mistral nemo. https://mistral.ai/news/mistral-nemo, 2024. Accessed: 2024.

[6] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, 2023.

[7] Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. In The Twelfth International Conference on Learning Representations, 2023.
[8]Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in Large Language Models are More Redundant Than You Expect, 2024. [9]Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened LLaMA: A simple depth pruning for large language models. In ICLR 2024 Workshop on Mathematical and Empirical Un- derstanding of Foundation Models , 2024. [10]Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhaga- vatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. Commun. ACM , 64(9), 2021. [11]Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. The unreasonable ineffectiveness of the deeper layers. 2024.[12]Solomon Kullback and Richard A. Leibler. On in- formation and sufficiency. Annals of Mathematical Statistics , 22(1):79–86, 1951. [13]Jupinder Parmar, Shrimai Prabhumoye, Joseph Jen- nings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhun- jhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, An- nika Brundyn, James Maki, Miguel Martinez, Jiax- uan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, and Bryan Catanzaro. Nemotron-4 15b technical report, 2024. [14]Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Reuse, don’t retrain: A recipe for continued pretraining of language models, 2024. [15]Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations , 2021. [16]Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. 
[17]Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. Nemo- aligner: Scalable toolkit for efficient model alignment, 2024. [18]Nvidia, :, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brun- dyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Frid- man, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majum- dar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christo- pher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Saba- vat, Sanjeev Satheesh, Jane Polak Scowcroft, Ja- son Sewall, Pavel Shamis, Gerald Shen, Moham- mad Shoeybi, Dave Sizer, Misha Smelyanskiy, Fe- lipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Ji- aqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, 9 Page 10: LLM Pruning and Distillation in Practice: The Minitron Approach Yian Zhang, and Chen Zhu. Nemotron-4 340b tech- nical report, 2024. [19]Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, and Pavlo Molchanov. A deeper look at depth pruning of llms. arXiv preprint arXiv:2407.16286 , 2024. 
[20]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Can- ton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Na- man Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yun- ing Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ran- jan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Au- relien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023. [21]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Stein- hardt. Measuring massive multitask language under- standing. In International Conference on Learning Representations , 2021. [22]Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Ed- wards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Moham- mad Bavarian, Clemens Winter, Philippe Tillet, Fe- lipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. 
Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welin- der, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluat- ing large language models trained on code. ArXiv, abs/2107.03374, 2021. [23]Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le,and Charles Sutton. Program synthesis with large language models, 2021. [24]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021. [25]Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018. [26]Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a ma- chine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceed- ings of the 57th Annual Meeting of the Association for Computational Linguistics , Florence, Italy, July 2019. Association for Computational Linguistics. [27]Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. [28]Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Is- lam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. Xl-sum: Large-scale multilingual abstractive summarization for 44 languages, 2021. [29]David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. 
[30]Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leader- board. 2024. [31]Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Sid- dhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, andLeHou. Instruction-followingevaluationforlarge language models. arXiv preprint arXiv:2311.07911 , 2023. [32]Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer2: Open-source dataset for training top- performing reward models, 2024. [33]Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as- a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Infor- mation Processing Systems , volume 36, pages 46595– 46623. Curran Associates, Inc., 2023. 10 Page 11: LLM Pruning and Distillation in Practice: The Minitron Approach [34]Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2019. [35]Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka- plan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas- try, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cum- mings, Matthias Plappert, Fotios Chantzis, Eliza- beth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. 
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welin- der, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021. 11

---