Paper 2412.11629

QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models

Authors: Changhai Zhou, Yuhua Zhou, Shijie Han, Qian Qiao, Hongguang Li

Published: 2024-12-16

Abstract:

The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme. Quantization precisions are assigned to each layer based on their importance to the target task, and Bayesian optimization is employed to refine precision allocation strategies, ensuring a balance between model accuracy and memory efficiency. Extensive experiments on benchmark datasets demonstrate that QPruner significantly outperforms existing methods in memory savings while maintaining or improving model performance.

Paper Content:
QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models

Changhai Zhou (1,3), Yuhua Zhou (2), Yibin Wang (1), Shijie Han (4), Qian Qiao (5), Hongguang Li (3)
(1) Fudan University, (2) Zhejiang University, (3) JF SmartInvest Holdings, (4) Columbia University, (5) Soochow University
zhouch23@m.fudan.edu.cn, zhouyuhua@zju.edu.cn, yibinwang1121@163.com, sh4460@columbia.edu, qqiao@stu.suda.edu.cn, harvey2@mail.ustc.edu.cn

1 Introduction

The advent of large language models (LLMs) has revolutionized various natural language processing (NLP) tasks, such as machine translation (Zhang et al., 2023a; Sato et al., 2020), sentiment analysis (Zhang et al., 2023b; Deng et al., 2023), and speech recognition (Min and Wang, 2023). Despite their impressive capabilities, the resource consumption required to obtain a fine-tuned model suitable for specific tasks remains substantial due to the large number of parameters and high computational demands of LLMs (Frantar and Alistarh, 2023). To address these issues, various compression techniques, including pruning (Molchanov et al., 2019; Liu et al., 2018), quantization (Shao et al., 2023; Lee et al., 2023), and distillation (Gu et al., 2023; Tan et al., 2023), have been proposed.

Structured pruning (Ma et al., 2023; Xia et al., 2023) is a widely used approach that reduces model size by removing less important parameters in a structured manner, preserving the overall architecture compatibility with hardware requirements. However, the disruption of computational graph uniformity and the removal of parameters can significantly reduce the accuracy of LLMs, which are inherently information-dense networks. To mitigate this degradation, fine-tuning is often used to recover the accuracy of pruned models. This fine-tuning step, while effective, is memory-intensive and presents substantial challenges in terms of resource consumption.

To further reduce memory usage during the fine-tuning and inference phases, we introduce quantization into the structured pruning framework. Specifically, after performing structured pruning, we quantize the pruned model and then apply different fine-tuning strategies. Quantization effectively reduces the bit-width of model parameters, thereby lowering the resource consumption during both fine-tuning and inference. However, integrating quantization with structured pruning introduces additional complexities.
Structured pruning applies different pruning intensities across model layers, which exacerbates the uneven distribution of layer importance, making some layers more critical for maintaining model performance. Moreover, the cumulative quantization error varies across different layers, potentially amplifying the performance degradation caused by pruning. Therefore, a simple, uniform quantization scheme is suboptimal. Instead, a more nuanced, layer-wise mixed-precision quantization approach is needed. By allowing more critical layers to maintain higher precision, we can better control the overall performance of the model.

arXiv:2412.11629v1 [cs.LG] 16 Dec 2024

Building upon these observations, we propose a new framework called QPruner. In QPruner, we first apply structured pruning to reduce the model size, followed by a quantization phase where different quantization precisions are assigned to each layer based on their contribution to the target task. To further improve the allocation strategy, Bayesian optimization (Frazier, 2018) is employed to explore better precision configurations. Finally, we apply a parameter-efficient fine-tuning (PEFT) strategy to recover model performance. This integrated approach aims to strike an optimal balance between model accuracy and memory efficiency, making it well-suited for resource-constrained scenarios. The main contributions of this work are summarized as follows:

• We propose QPruner, a novel framework that integrates structured pruning and quantization, aiming to significantly reduce the memory consumption of LLMs during both fine-tuning and inference.
• We introduce a mixed-precision quantization scheme where quantization precisions are assigned to each layer based on their importance to the target task, with Bayesian optimization used to further refine precision allocation strategies.
• We demonstrate QPruner's ability to save memory while maintaining performance.
It can surpass baseline methods in terms of accuracy by up to 6% while saving at least 30% of memory.

2 Background and Motivation

2.1 Quantization

Quantization. Quantization is an essential technique used to reduce the computational and memory overhead of large-scale models by converting high-precision numerical values, such as a 32-bit floating-point number $X^{\mathrm{HP}} \in \mathbb{R}$, into a lower-bit integer representation $X^{\mathrm{INT}} \in \{0, 1, \dots, 2^N - 1\}$. This process is mathematically expressed as:

$$X^{\mathrm{INT}} = \mathrm{round}\big((2^N - 1)\, F(X^{\mathrm{HP}})\big), \tag{1}$$

where $F(\cdot): \mathbb{R} \to [0, 1]$ is a normalization function. A typical method is uniform quantization, where $F(X)$ is defined as $F(X) = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$. An alternative approach introduced by QLoRA (Dettmers et al., 2024) is 4-bit NormalFloat quantization (NF4), which assumes that the data follows a normal distribution $X \sim \mathcal{N}(0, \sigma^2)$ and applies $F(X) = \Phi(X / \sigma)$, with $\Phi(\cdot)$ representing the cumulative distribution function of a standard normal distribution.

Dequantization. To recover the high-precision values from their quantized forms, a lookup table $T$ is used, which is defined as:

$$T[i] = F^{-1}\!\left(\frac{i}{2^N - 1}\right), \quad i = 0, 1, \dots, 2^N - 1, \tag{2}$$

allowing the integer $X^{\mathrm{INT}}$ to be mapped back to its simulated high-precision counterpart $X^{\mathrm{D}} \in \mathbb{R}$. The dequantization process can be represented as:

$$X^{\mathrm{D}} = T[X^{\mathrm{INT}}]. \tag{3}$$

Simulated Quantization for Matrices. In practice, it is often more efficient to use simulated quantization for matrices rather than directly operating on quantized values (Bai et al., 2020; Shen et al., 2020). In this method, quantized weight matrices are stored as encoded integers and are temporarily dequantized into simulated high-precision matrices during multiplication operations. This process is denoted by $q_N(\cdot): \mathbb{R}^{m \times n} \to \mathbb{R}_N^{m \times n}$, where $\mathbb{R}_N = \{T[i] \in \mathbb{R} \mid 0 \le i < 2^N\}$.

2.2 The Motivating Example

Efficient fine-tuning of LLMs on resource-constrained devices requires effective model compression and fine-tuning techniques.
After applying structured pruning and quantization, more efficient fine-tuning methods are needed to recover accuracy. One approach is to use LoRA-based methods, as done in LLM-Pruner (Ma et al., 2023), which employs LoRA for quick recovery after structured pruning. Among the LoRA series methods, LoftQ (Li et al., 2023) is a method for fine-tuning quantized models. Before fine-tuning, LoftQ iteratively updates the low-rank matrices such that $Q + AB$, the quantized matrix plus the low-rank term, approximates the full-precision matrix $W$, thereby improving the fine-tuning performance, particularly in low-bit settings.

Simply combining pruning, quantization, and LoRA can lead to suboptimal results. Structured pruning reduces model size by removing less important parameters, but due to the varying importance of different layers, it often results in uneven pruning across layers. This uneven pruning leads to a complex and unbalanced network structure, whereas standard quantization typically applies a uniform configuration across all layers. To explore a better trade-off between performance and memory, we adopted mixed-precision quantization, assigning different computational resources and complexities to different layers, with the goal of allowing more important layers to learn with finer granularity.

Figure 1: Comparison of accuracy and memory usage across different fine-tuning configurations for multiple tasks. The bars represent the accuracy of three different methods (LoRA, LoftQ, LoftQ*) on each task, while the markers indicate the memory usage for each corresponding method.

We conducted experiments using the LLaMA-7B model with a pruning rate of 20%. The pruning was performed using the optimal strategy determined by LLM-Pruner. The methods compared were as follows: LoRA with a uniform 16-bit configuration, LoftQ with a uniform 4-bit quantization, and LoftQ* with a mixed-precision setting of 4 or 8 bits per layer.
As shown in Figure 1, the quantized models (LoftQ) achieved performance comparable to the original precision models (LoRA), with significantly lower memory usage (21.33 GB versus 35.06 GB). On some tasks, there was a slight drop in performance, but the mixed-precision model (LoftQ*) demonstrated the potential to further enhance performance while maintaining efficient memory usage.

3 QPruner

Structured pruning, while effective in reducing model size, can disrupt the balance of layer importance, leading to performance degradation. Therefore, parameter adjustments are often necessary to mitigate this imbalance and restore model performance. However, parameter updates require significant memory, which is why we employ quantization techniques to reduce memory consumption. As demonstrated in the motivating example, simply combining pruning and quantization is not always the best choice, as the importance of different layers in a pruned model can vary greatly. We need finer-grained layer-wise quantization bit-width control, which introduces a challenging bit-width allocation problem. To address this, we designed a two-stage allocation strategy to effectively balance these trade-offs.

Based on these insights, we propose QPruner, an integrated framework tailored for efficient or low-resource NLP tasks. It employs structured pruning, mixed-precision quantization, and efficient fine-tuning to solve the challenges of balancing memory efficiency and model performance.

3.1 Structured Pruning

Our framework does not impose specific requirements on the pruning method; as new technologies evolve, the pruning method can be replaced. The only requirement for this step is to produce a smaller model. Although some methods can achieve good performance without fine-tuning (An et al., 2024), most real-time systems require dynamic adaptation, which means that the pruned model must be fine-tuned to improve performance.

Figure 2: Overview of the QPruner framework. The diagram shows three stages: structured pruning (discovery and estimation), mixed-precision quantization (importance-aware initialization refined by Bayesian optimization from an initial state to an optimal trial, giving a per-layer configuration such as [4bit, 8bit, ..., 16bit]), and performance recovery (FFT or PEFT).

A popular structured pruning method is LLM-Pruner (Ma et al., 2023), which first identifies dependencies between neurons and groups them, then removes weights based on their importance. Let $N_i$ and $N_j$ be two neurons in the model. If $N_j \in \mathrm{Out}(N_i)$ and $\mathrm{Deg}^-(N_j) = 1$, then $N_j$ is dependent on $N_i$. Similarly, if $N_i \in \mathrm{In}(N_j)$ and $\mathrm{Deg}^+(N_i) = 1$, then $N_i$ is dependent on $N_j$. Based on this principle, a dependency graph can be constructed to iteratively identify all coupled structures.

Next, these coupled structures are grouped, and their importance is estimated to effectively perform pruning. For a group of coupled structures $G = \{W_i\}_{i=1}^{M}$, its importance can be expressed as:

$$I_{W_i} = \left| \mathcal{L}_{W_i}(\mathcal{D}) - \mathcal{L}_{W_i = 0}(\mathcal{D}) \right|, \tag{4}$$

where $\mathcal{L}$ represents the prediction loss. Using a second-order Taylor expansion, the importance can be approximated as:

$$I_{W_i} \approx \left| \frac{\partial \mathcal{L}(\mathcal{D})}{\partial W_i} W_i - \frac{1}{2} W_i^\top H W_i \right|, \tag{5}$$

where $H$ is the Hessian matrix of the loss function. For each parameter $W_i^k$, its importance is defined as:

$$I_{W_i^k} \approx \left| \frac{\partial \mathcal{L}(\mathcal{D})}{\partial W_i^k} W_i^k - \frac{1}{2} (W_i^k)^2 H_{kk} \right|, \tag{6}$$

where $H_{kk}$ is the $k$-th diagonal element of the Hessian matrix.

Finally, we aggregate the importance of each structure into group-level importance using methods such as summation, multiplication, taking the maximum, or using only the last item. Groups with the lowest importance are selected for pruning, thereby reducing the model size while maintaining performance as much as possible.

3.2 Mixed-Precision Quantization

After pruning, we apply mixed-precision quantization to further reduce memory usage while maintaining model performance.
Instead of assigning a uniform bit-width across all layers, different bit-widths are allocated based on each layer's contribution to the final model output. The contribution of each layer is quantified using mutual information between the layer's output and the model's prediction.

To compute mutual information, we first run representative data samples through the pruned model. For each layer, we record its output $X$ and the final prediction $Y$. The mutual information $I(X; Y)$ between the layer output $X$ and the prediction $Y$ is computed as:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, \tag{7}$$

where $p(x, y)$ is the joint probability distribution of $X$ and $Y$, while $p(x)$ and $p(y)$ are the marginal distributions. A higher mutual information value indicates that the layer is more important for the final output and should therefore be assigned a higher bit-width. Once the mutual information is computed, an average bit-width $B_{\mathrm{avg}}$ is determined based on the available memory budget. Layers with higher importance receive more bits, and the allocation is performed in discrete bit-widths (e.g., 4-bit, 8-bit), constrained by the total memory limit.

Algorithm 1 Mixed-Precision Quantization
    Compute mutual information I(X_i; Y)
    Initialize bit-width configuration b_0 based on I(X_i; Y) and memory constraint
    D ← {(b_0, P(b_0), M(b_0))}
    while not converged do
        Train GP model on D
        b_{t+1} ← arg max_b α(b)
        Apply b_{t+1} to pruned model and fine-tune
        Measure P(b_{t+1}), M(b_{t+1})
        D ← D ∪ {(b_{t+1}, P(b_{t+1}), M(b_{t+1}))}
    end while

Although the initial bit-width configuration derived from mutual information offers a reasonable starting point for fine-tuning, the complex interactions between layers, particularly in LLMs, mean that the importance of individual layers may shift after fine-tuning. As a result, the initial bit-width assignment might not represent the optimal configuration. To further refine the precision configuration, we employ Bayesian optimization.
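
The initialization step can be illustrated with a small sketch that estimates Eq. (7) from paired discrete samples and then gives 8-bit precision to the highest-scoring layers. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `mutual_information` and `allocate_bits`, the toy data, the assumption that layer outputs have already been discretized into bins, and the `max_high_fraction` cap that stands in for the memory budget are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts for p(x, y)
    px = Counter(xs)               # counts for p(x)
    py = Counter(ys)               # counts for p(y)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def allocate_bits(scores, max_high_fraction=0.25, low=4, high=8):
    """Give `high` bits to the top-scoring layers and `low` bits to the rest.

    `max_high_fraction` is an assumed stand-in for the memory budget.
    """
    k = int(len(scores) * max_high_fraction)  # number of high-precision layers
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bits = [low] * len(scores)
    for i in ranked[:k]:
        bits[i] = high
    return bits


# Toy data: binned outputs of four hypothetical layers and the predictions.
preds = [0, 1, 0, 1, 0, 1, 0, 1]
layers = [
    [0, 0, 1, 1, 0, 0, 1, 1],  # statistically independent of preds
    [0, 1, 0, 1, 0, 1, 0, 1],  # fully determines preds
    [0, 1, 0, 1, 0, 1, 1, 0],  # partially informative
    [1, 1, 1, 1, 1, 1, 1, 1],  # constant, carries no information
]
scores = [mutual_information(x, preds) for x in layers]
config = allocate_bits(scores)
print(scores, config)
```

In this toy setting only the second layer receives 8 bits, because it has the highest estimated mutual information and the cap allows one high-precision layer out of four. Real layer outputs are continuous and high-dimensional, so a practical estimator would need some binning or a dedicated mutual information estimator; the source does not specify which one is used.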

The objective of Bayesian optimization is to maximize model performance while minimizing memory usage. Let $b = [B_1, B_2, \dots, B_L]$ represent the bit-width configuration across $L$ layers. The optimization problem is formulated as:

$$b_{\mathrm{opt}} = \arg\max_{b} \alpha(b), \tag{8}$$

where $\alpha(b)$ is an acquisition function that balances exploration (of less well-understood configurations) and exploitation (of known promising configurations). The memory usage $M(b)$ is constrained by $M_{\max}$, the total available memory.

The process starts by initializing a dataset $\mathcal{D}$ with the initial bit-width configuration $b_0$, along with its corresponding performance $P(b_0)$ and memory usage $M(b_0)$. A Gaussian Process (GP) model is then trained on the data to predict model performance and the uncertainty for new configurations. Based on this model, the acquisition function $\alpha(b)$ is used to select the next bit-width configuration to evaluate.

Once a new configuration $b_{t+1}$ is selected, it is applied to the pruned model, fine-tuned, and its performance $P(b_{t+1})$ and memory usage $M(b_{t+1})$ are measured. These results are then added to the dataset $\mathcal{D}$, and the GP model is updated with the new data. This iterative process continues until a stopping criterion is met, such as convergence or a maximum number of iterations. Over time, this method refines the bit-width configuration to achieve an optimal balance between model performance and memory efficiency.

3.3 Performance Recovery

After the steps of structured pruning and mixed-precision quantization, significant memory savings are achieved. However, model performance typically needs to be restored through fine-tuning. Full-parameter fine-tuning is often impractical due to the large memory footprint it requires, but our compression technique makes full model fine-tuning feasible by reducing both memory and computational costs.
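
Returning briefly to the bit-width search of Section 3.2, the loop of Algorithm 1 can be sketched in plain Python. This is a toy illustration under several assumptions that are not from the paper: `toy_performance` replaces the fine-tune-and-evaluate step, `memory` proxies memory as the sum of bit-widths and is used only as a hard constraint, the acquisition function is an upper confidence bound (the source does not name one), the surrogate is a tiny Gaussian process with an RBF kernel, and candidates are enumerated exhaustively, which is only feasible for a handful of layers. The experiments in the paper list Optuna, which this sketch does not use.

```python
import itertools
import math


def rbf(a, b, length=4.0):
    """Squared-exponential kernel between two bit-width vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * length ** 2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_posterior(X, y, x_new, noise=1e-6):
    """GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    y_mean = sum(y) / len(y)
    alpha = solve(K, [v - y_mean for v in y])
    mu = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = 1.0 - sum(k * w for k, w in zip(k_star, v))
    return mu, math.sqrt(max(var, 0.0))


def memory(b):
    """Hypothetical memory proxy: total bits summed over layers."""
    return sum(b)


def search(b0, performance, max_memory, n_iter=10, kappa=2.0):
    """Bayesian-optimization loop in the shape of Algorithm 1."""
    data = [(tuple(b0), performance(b0))]           # D <- {(b0, P(b0))}
    for _ in range(n_iter):
        X = [b for b, _ in data]
        y = [p for _, p in data]
        seen = set(X)
        feasible = [b for b in itertools.product((4, 8), repeat=len(b0))
                    if memory(b) <= max_memory and b not in seen]
        if not feasible:
            break

        def ucb(b):                                 # acquisition alpha(b)
            mu, sigma = gp_posterior(X, y, b)
            return mu + kappa * sigma

        b_next = max(feasible, key=ucb)
        data.append((b_next, performance(b_next)))  # stand-in for fine-tuning
    return max(data, key=lambda item: item[1]), data


# Hypothetical per-layer sensitivities; higher means quantization hurts more.
SENS = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3]


def toy_performance(b):
    """Stand-in for fine-tuning plus evaluation (NOT a real benchmark)."""
    return -sum(s * (1.0 if bits == 4 else 0.1) for s, bits in zip(SENS, b))


b0 = [4, 8, 4, 4, 4, 4]
(best_b, best_p), history = search(b0, toy_performance, max_memory=32)
print(best_b, round(best_p, 3), len(history))
```

The returned configuration is the best one observed so far, so it can never be worse than the mutual-information initialization `b0`. In the real pipeline each call to the performance function is a full fine-tuning run, which is why the number of iterations dominates the cost of the search.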

In addition to traditional full-parameter fine-tuning, efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation) (Hu et al., 2021) have proven especially effective, particularly in scenarios with limited data. LoRA significantly reduces the number of trainable parameters by freezing the original weight matrix $W_0$ and only updating a low-rank approximation of the weight update, represented as $\Delta W = AB$, where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$. Here, $r$ (the rank) is much smaller than the original dimension $d$, leading to a substantial reduction in the number of trainable parameters. The forward computation in this approach can be written as:

$$Y = W_0 X + \Delta W X = W_0 X + ABX. \tag{9}$$

There are also LoRA-like methods specifically designed for quantized models, such as QLoRA (Dettmers et al., 2023) and LoftQ (Li et al., 2023). LoftQ iteratively updates the low-rank matrices $A$ and $B$ such that $Q + AB$ approximates the original full-precision matrix $W$, which gives a better starting point for fine-tuning. The objective is defined as:

$$\min_{A, B} \left\| W - (Q + AB) \right\|^2, \tag{10}$$

where $Q$ is the quantized matrix.

By combining structured pruning, mixed-precision quantization, and performance recovery techniques, QPruner is able to achieve robust adaptability with minimal computational overhead.

Table 1: Zero-shot performance and peak memory usage on LLaMA-7B and Vicuna-7B with varying pruning rates. LLM-Pruner represents the currently widely used half-precision model. Performance is reported in percent (%) and memory usage in gigabytes (GB).
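| Model | Pruning rate | Method | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-7B | 0% | w/o tuning | 73.09 | 78.35 | 72.98 | 67.09 | 67.42 | 41.38 | 42.40 | - |
| LLaMA-7B | 20% | LLM-Pruner | 63.30 | 76.82 | 68.68 | 63.38 | 63.76 | 37.11 | 40.60 | 35.06 |
| LLaMA-7B | 20% | QPruner1 | 67.77 | 76.55 | 68.03 | 61.80 | 64.06 | 38.65 | 40.00 | 21.78 |
| LLaMA-7B | 20% | QPruner2 | 68.60 | 76.79 | 68.43 | 62.78 | 65.50 | 38.74 | 40.40 | 23.05 |
| LLaMA-7B | 20% | QPruner3 | 69.11 | 77.23 | 68.80 | 63.17 | 66.16 | 39.20 | 41.00 | 23.32 |
| LLaMA-7B | 30% | LLM-Pruner | 62.45 | 74.37 | 63.14 | 61.96 | 59.22 | 33.70 | 39.60 | 31.38 |
| LLaMA-7B | 30% | QPruner1 | 58.96 | 71.22 | 58.10 | 58.88 | 52.19 | 32.34 | 38.40 | 20.12 |
| LLaMA-7B | 30% | QPruner2 | 62.20 | 72.88 | 60.64 | 60.50 | 55.61 | 33.56 | 38.40 | 22.87 |
| LLaMA-7B | 30% | QPruner3 | 66.50 | 74.43 | 61.14 | 61.40 | 58.12 | 34.47 | 39.20 | 22.15 |
| LLaMA-7B | 50% | LLM-Pruner | 43.76 | 68.88 | 44.85 | 50.99 | 45.20 | 28.75 | 34.60 | 23.89 |
| LLaMA-7B | 50% | QPruner1 | 45.14 | 68.34 | 44.39 | 52.96 | 43.86 | 29.01 | 35.80 | 15.47 |
| LLaMA-7B | 50% | QPruner2 | 47.08 | 68.85 | 45.53 | 53.65 | 44.31 | 29.36 | 36.20 | 16.85 |
| LLaMA-7B | 50% | QPruner3 | 48.37 | 69.20 | 45.19 | 54.45 | 45.28 | 29.70 | 36.40 | 16.65 |
| Vicuna-7B | 0% | w/o tuning | 75.69 | 77.75 | 71.06 | 67.80 | 69.07 | 40.78 | 42.20 | - |
| Vicuna-7B | 20% | LLM-Pruner | 57.77 | 77.56 | 67.16 | 63.14 | 67.30 | 37.71 | 40.40 | 35.25 |
| Vicuna-7B | 20% | QPruner1 | 57.95 | 76.82 | 66.42 | 62.51 | 66.62 | 37.37 | 40.60 | 21.65 |
| Vicuna-7B | 20% | QPruner2 | 59.70 | 77.20 | 66.31 | 62.66 | 67.12 | 37.48 | 40.80 | 22.95 |
| Vicuna-7B | 20% | QPruner3 | 59.85 | 77.59 | 67.31 | 63.20 | 67.84 | 37.85 | 41.20 | 23.10 |
| Vicuna-7B | 30% | LLM-Pruner | 58.81 | 74.37 | 60.70 | 60.62 | 59.01 | 33.79 | 38.80 | 31.83 |
| Vicuna-7B | 30% | QPruner1 | 53.85 | 74.76 | 60.65 | 60.06 | 59.72 | 34.30 | 38.20 | 19.95 |
| Vicuna-7B | 30% | QPruner2 | 55.64 | 75.07 | 61.65 | 60.31 | 59.54 | 34.47 | 38.60 | 21.65 |
| Vicuna-7B | 30% | QPruner3 | 57.23 | 75.90 | 62.00 | 60.37 | 60.81 | 34.79 | 39.40 | 21.80 |
| Vicuna-7B | 50% | LLM-Pruner | 59.51 | 66.87 | 43.18 | 52.01 | 48.40 | 26.45 | 34.00 | 24.55 |
| Vicuna-7B | 50% | QPruner1 | 59.51 | 67.90 | 43.30 | 50.83 | 48.82 | 27.49 | 34.60 | 14.50 |
| Vicuna-7B | 50% | QPruner2 | 61.31 | 68.56 | 44.54 | 53.02 | 49.50 | 28.13 | 35.40 | 15.90 |
| Vicuna-7B | 50% | QPruner3 | 61.56 | 68.80 | 43.72 | 53.39 | 49.66 | 27.98 | 35.80 | 15.35 |

4 Experiments

LLMs and Benchmarks. To demonstrate how QPruner performs on different models, we test it on three open-source large language models: LLaMA-7B (Touvron et al., 2023), LLaMA-13B (Touvron et al., 2023), and Vicuna-7B (Zheng et al., 2024); the specific versions are stated in Appendix A.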
We evaluate these LLMs on zero-shot classification tasks for commonsense reasoning datasets, including BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018).

Software and hardware configuration. We use PyTorch 2.1.2, BitsandBytes 0.43.1, Transformers 4.41.0, PEFT (Parameter-Efficient Fine-Tuning) 0.11.1, Optuna 3.6.1, and CUDA 12.4, on an NVIDIA L20 GPU with 48 GB of memory.

Implementation Details. The pruning method follows LLM-Pruner (Ma et al., 2023), and the dataset uses 50k publicly available samples from Alpaca (Taori et al., 2023). All experiments were conducted with a LoRA matrix rank of 8 and LoftQ initialization with one iteration. We used BitsandBytes for the quantization configuration; for memory considerations, we keep the number of 8-bit layers below 25%. For 4-bit quantization, we employed NF4 (Dettmers et al., 2024), and since 2-bit quantization does not reduce memory usage, each layer's quantization configuration only considered 4-bit and 8-bit options. More detailed hyperparameter settings can be found in Appendix B.

4.1 Main Results

In this section, we present experimental results to demonstrate the capability of our proposed QPruner framework in balancing performance while reducing memory usage through integrating quantization and structured pruning. Through further iterative optimization, it can even achieve better performance than high-precision models. Although pruning methods are very important, the pruning method itself is not our focus; therefore, we adopt the popular LLM-Pruner (Ma et al., 2023) as our baseline, which is a widely used structured pruning method that directly removes weights.

Table 2: Performance comparison (%) of ablation studies on seven tasks at 20% pruning rate on LLaMA-7B. It appears that QPruner captures potential resource allocations without relying on other settings. Column groups: dtype of 4-bit (NF4, FP4), adapter initialization method (LoftQ, Gaussian, PiSSA), adapter iteration count (iter=1, iter=2, iter=4), and importance estimation (Element1, Element2).

| Benchmark | NF4 | FP4 | LoftQ | Gaussian | PiSSA | iter=1 | iter=2 | iter=4 | Element1 | Element2 |
|---|---|---|---|---|---|---|---|---|---|---|
| ARC-e | 65.49 | 62.84 | 65.49 | 64.77 | 64.44 | 65.49 | 64.31 | 64.18 | 65.49 | 62.50 |
| ARC-c | 38.99 | 36.77 | 38.99 | 38.99 | 38.40 | 38.99 | 38.05 | 38.14 | 38.99 | 37.80 |
| WinoGrande | 61.40 | 63.22 | 61.40 | 61.96 | 61.48 | 61.40 | 60.46 | 60.69 | 61.40 | 59.43 |
| OBQA | 40.20 | 39.80 | 40.20 | 39.00 | 40.40 | 40.20 | 39.40 | 39.60 | 40.20 | 38.60 |
| BoolQ | 67.22 | 66.48 | 67.22 | 64.43 | 68.20 | 67.22 | 67.55 | 66.85 | 67.22 | 65.44 |
| PIQA | 76.82 | 76.82 | 76.82 | 76.44 | 76.39 | 76.82 | 76.44 | 76.55 | 76.82 | 76.39 |
| HellaSwag | 67.97 | 67.88 | 67.97 | 67.80 | 68.01 | 67.97 | 67.97 | 67.93 | 67.97 | 66.93 |

We evaluate the model performance and peak memory usage of LLM-Pruner and QPruner under different pruning rates. Due to the lack of specific test prompts in the LLaMA paper, we utilize open-source prompts provided by Gao et al. (2023) for benchmarking. Results for the LLaMA-7B and Vicuna-7B models are shown in Table 1, and results for LLaMA-13B are provided in Appendix E. Although our method is expected to have greater advantages on larger models (e.g., 70B parameters or more), due to hardware limitations, we focus only on models of up to 13B parameters.

In our experiments, QPruner1 denotes the use of uniform quantization across all layers, QPruner2 represents the mixed-precision configuration based on mutual information, and QPruner3 refers to the mixed-precision quantization after further optimization using Bayesian methods based on QPruner2.

Theoretically, full-parameter fine-tuning would perform better than PEFT methods; however, it performs poorly on the Alpaca dataset commonly used in model compression.
If we instead trained separately on each benchmark, only the pruned models after quantization could be fully fine-tuned, which is an advantage of our framework, but this would lead to unfair comparisons. Therefore, for unquantized models, we use LoRA (Hu et al., 2021) fine-tuning, and for quantized models, we use LoftQ (Li et al., 2023) fine-tuning.

From Table 1, we observe that our method demonstrates more significant advantages at higher pruning rates. For instance, at a pruning rate of 50% on the LLaMA-7B model, QPruner3 outperforms LLM-Pruner by achieving a higher accuracy on the BoolQ dataset (48.37% vs. 43.76%) while reducing memory usage from 23.89 GB to 16.65 GB, a reduction of approximately 30%. This highlights the effectiveness of our framework in maintaining or even improving performance under aggressive compression.

These results demonstrate that our QPruner framework effectively balances memory efficiency and model accuracy by integrating quantization with structured pruning. By employing finer-grained quantization strategies and a combined performance recovery phase, we mitigate the detrimental effects that pruning and quantization individually impose on LLMs. This integration not only reduces memory consumption but can also enhance model performance, especially at higher pruning rates.

4.2 Ablation Study

We conducted ablation experiments using LLaMA-7B with a 20% pruning rate, based on results obtained by QPruner3. All results are presented in Table 2. We tested different quantization data types (NF4, FP4), LoRA matrix initialization methods (Gaussian, PiSSA (Meng, 2024), LoftQ), varying iteration counts in LoftQ (more iterations represent better error fitting), and different importance estimation methods.

Our experiments show that the choice of quantization data type slightly affects performance, but our method is effective across different types.
Similarly, different LoRA initialization methods yield comparable results, indicating robustness to initialization strategies. Interestingly, increasing the number of iterations in LoftQ does not necessarily improve performance, suggesting that fitting residuals with low-rank matrices may not always be beneficial. Finally, using first-order Taylor approximations for importance estimation outperforms second-order ones, highlighting the complexity of LLMs and the limitations of higher-order approximations.

Additional experiments on different Bayesian optimization iteration counts and resource consumption are provided in Appendix C. The Pareto frontier demonstrates that more iterations can lead to better configurations, albeit at increased computational cost.

5 Related Work

5.1 Efficient Compression of LLMs

LLM-Pruner (Ma et al., 2023) uses structured pruning to eliminate non-essential interconnected structures by leveraging gradient information. This technique enables compressed models to maintain good performance across multiple tasks with basic fine-tuning. Santacroce et al. (2023) proposes Globally Unique Movement (GUM), a novel pruning technique focusing on the sensitivity and uniqueness of LLMs' network components. GUM prunes neurons that uniquely contribute to the model output and are sensitive to loss changes, thus preserving high accuracy. This method optimizes the trade-off between information retention and computational efficiency. Quantization-Aware Training (QAT) combines quantization with full model fine-tuning to adapt models for downstream tasks (Peri et al., 2020; Liu et al., 2023). Although QAT is effective, it requires substantial computational resources, such as gradient calculations and optimization states, and it complicates the gradient computation for quantized weights. However, by leveraging LoRA, these challenges can be bypassed during task adaptation.
Post-Training Quantization (PTQ) frameworks, such as GPTQ and SmoothQuant (Frantar et al., 2022; Xiao et al., 2023), use a small subset of training data to calibrate high-precision models, enabling the generation of task-specific quantized models without the need for gradient backpropagation. This makes PTQ more cost-efficient than QAT, although it generally results in lower accuracy. Xiao et al. (2023) proposed SmoothQuant, a post-training quantization framework that employs a mixed-precision strategy to calibrate large language models, enabling accurate and efficient deployment without the need for retraining.

5.2 Parameter Efficient Fine-Tuning

LLM-Adapters (Hu et al., 2023) integrate small adapters with few extra parameters into LLMs for efficient fine-tuning, allowing smaller models to perform as well as larger ones on specific tasks. Unlike the serial approach of adapters, low-rank adaptation (LoRA) (Hu et al., 2021) uses a parallel method to insert trainable rank decomposition matrices into each layer of the model's architecture. LoRA adds trainable matrices to each layer while keeping the pre-trained weights unchanged, reducing the number of trainable parameters and making model adaptation faster and less resource-intensive. QLoRA (Dettmers et al., 2024) combines low-rank adapters and quantized 4-bit weights for efficient LLM fine-tuning, significantly reducing GPU memory requirements while achieving performance comparable to full 16-bit fine-tuning. LoftQ (Li et al., 2023) applies quantization and low-rank approximation alternately to achieve a good initialization for LoRA fine-tuning, mitigating the discrepancy between quantized and pre-trained weights, and enabling efficient fine-tuning of quantized models, particularly in challenging low-bit regimes.

6 Conclusion

We propose QPruner, an innovative framework that combines structured pruning and quantization for efficient model compression.
Given that structured pruning and quantization typically require performance recovery steps, integrating them provides a more holistic approach to mitigating the errors introduced by both techniques while further compressing the model. To address the uneven importance distribution across layers and precision loss caused by pruning and quantization, we adopt a fine-grained method to preserve the capacity of critical layers, enhancing their performance further during the fine-tuning process. After pruning, we first allocate mixed-precision quantization based on task relevance, followed by Bayesian optimization to iteratively refine decisions and probabilistically select the optimal quantization configuration. Experimental results demonstrate that QPruner significantly outperforms baseline models in terms of memory efficiency while achieving superior accuracy across multiple NLP benchmarks. By striking a balance between efficiency and performance, QPruner is a powerful solution for deploying LLMs in resource-limited environments.

Limitation

One of the current limitations of QPruner is the significant precision loss caused by structured pruning, which still impacts the overall model performance. In future work, we aim to further optimize the pruning process to minimize this precision degradation. Additionally, the use of Bayesian optimization requires real data to guide the process, which can be time-consuming. While this method improves quantization configurations, the iterative nature of Bayesian optimization introduces additional computational overhead that may not be ideal for all deployment scenarios.

References

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10865–10873.
Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, and Irwin King. 2020. Binarybert: Pushing the limit of bert quantization. arXiv preprint arXiv:2012.15701.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Xiang Deng, Vasilisa Bashlovkina, Feng Han, Simon Baumgartner, and Michael Bendersky. 2023. What do llms know about financial markets? a case study on reddit market sentiment analysis. In Companion Proceedings of the ACM Web Conference 2023, pages 107–110.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
Peter I Frazier. 2018. Bayesian optimization. In Recent advances in optimization and modeling of contemporary problems, pages 255–278. Informs.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276.
Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Hwang, Wonyong Sung, and Jungwook Choi. 2023. Enhancing computation efficiency in large language models through weight and activation quantization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14726–14739.
Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Loftq: Lora-fine-tuning-aware quantization for large language models. In The Twelfth International Conference on Learning Representations.
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
Meng. 2024. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.
Zeping Min and Jinbo Wang. 2023. Exploring the integration of large language models into automatic speech recognition systems: An empirical study. In International Conference on Neural Information Processing, pages 69–84. Springer.
Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11264–11272.
Dheeraj Peri, Jhalak Patel, and Josh Park. 2020. Deploying quantization-aware trained networks using tensorrt. In GPU Technology Conference.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Michael Santacroce, Zixin Wen, Yelong Shen, and Yuanzhi Li. 2023. What matters in the structured pruning of generative language models? arXiv preprint arXiv:2302.03773.
Shoetsu Sato, Jin Sakuma, Naoki Yoshinaga, Masashi Toyoda, and Masaru Kitsuregawa. 2020. Vocabulary adaptation for domain adaptation in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4269–4279.
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2020. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821.
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, and Jie Tang. 2023. Gkd: A general knowledge distillation framework for large-scale pre-trained language model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 134–148.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2023. Sheared llama: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
Biao Zhang, Barry Haddow, and Alexandra Birch. 2023a. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pages 41092–41110. PMLR.
Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muhammad Ali Babar, and Xiao-Yang Liu. 2023b. Enhancing financial sentiment analysis via retrieval augmented large language models. In Proceedings of the Fourth ACM International Conference on AI in Finance, pages 349–356.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.

A Version of LLMs

We provide the Hugging Face link of LLMs used in the experiment: LLaMA-7B: https://huggingface.co/baffo32/decapoda-research-llama-7B-hf; Vicuna-7B: https://huggingface.co/lmsys/vicuna-7b-v1.5; LLaMA-13B: https://huggingface.co/yahma/llama-13b-hf

B Hyperparameters

In the optimization of the pruned LLaMA-7B model, a comprehensive hyperparameter configuration was employed to ensure an optimal balance between model performance and computational efficiency. The model was fine-tuned with a learning rate of 3×10⁻⁴, utilizing a batch size of 8, further divided into micro batches of 4 to manage memory constraints effectively. Sequences were standardized to a maximum length of 256 tokens, and a dropout of 0.05 was applied specifically to the LoRA layers targeting projections such as query, key, value, and output, alongside gate, down, and up projections.
Quantization was dynamically applied at 4-bit and 8-bit levels according to layer requirements to optimize memory use without compromising computational accuracy. The training employed the paged AdamW optimizer with 32-bit precision, enhancing stability and efficiency. These settings were methodically tested and optimized through the Optuna framework to ensure robust model performance and resource utilization.

C Results of Optimization Workflow

In this section, we will use the LLaMA-7B model with 50% pruning as our example to illustrate the Pareto optimization workflow, as shown in Figure 4.

D Details of The Optimization Workflow

We illustrated the optimization process, memory, and time footprint of QPruner using a 50% parameter pruning rate on LLaMA-7B as an example. We fine-tuned 10 sets of configurations as the initialization for the Gaussian Process (GP) (this is not mandatory; in other experiments, we found that starting from scratch, a good configuration could be found in about 10 iterations). In each configuration, the quantization precision for all model layers was randomly selected between 4-bit and 8-bit. On average, obtaining data for each initialization took approximately 25 minutes.

Figure 3: Pareto-front scatter plots for (a) BoolQ and (b) WinoGrande with 50 data points. The red points indicate the non-dominated configurations within the Pareto frontier.

We set the total number of iterations for QPruner to 40 (resulting in 50 data points for constructing the Pareto front) to ensure the best configuration was found. The entire process took approximately 16.5 hours. During QPruner iterations, GP required around 7 s to suggest the next configuration, while the prediction process and Pareto frontier construction consumed approximately 187 MB of memory. In Figure 3, we present the optimization results for BoolQ and WinoGrande.
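The random initialization described in this appendix can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released code: the function names are hypothetical, the layer count of 32 corresponds to the decoder layers of LLaMA-7B, and `params_per_layer` is a made-up figure used only to show how a per-layer bit assignment translates into a weight-memory estimate.

```python
import random


def sample_bit_config(num_layers, choices=(4, 8), rng=None):
    """Pick one quantization bit-width per layer uniformly at random."""
    rng = rng or random.Random()
    return [rng.choice(choices) for _ in range(num_layers)]


def weight_memory_gib(bit_config, params_per_layer):
    """Estimate weight storage in GiB.

    Ignores quantization constants, activations, and optimizer state, so it is
    only a rough proxy for the memory objective.
    """
    total_bits = sum(bits * params_per_layer for bits in bit_config)
    return total_bits / 8 / 2**30


rng = random.Random(0)                      # fixed seed for reproducibility
num_layers = 32                             # decoder layers in LLaMA-7B
params_per_layer = 100_000_000              # hypothetical size after pruning

# 10 random configurations seed the Gaussian Process.
initial_configs = [sample_bit_config(num_layers, rng=rng) for _ in range(10)]
initial_memory = [weight_memory_gib(c, params_per_layer) for c in initial_configs]

# 40 further optimization iterations give 50 evaluated points in total.
num_bo_iterations = 40
total_points = len(initial_configs) + num_bo_iterations
```

Each sampled configuration would then be fine-tuned and evaluated to obtain its accuracy, which is the expensive step (roughly 25 minutes per configuration according to the text above).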
More detailed processes and results for other benchmarks are provided in Appendix C.

E Performance in LLaMA-13B

We list the performance of the configuration described in Section 4 for LLaMA-13B in Table 3.

F Code Usage Instructions

In our workflow, we begin by defining a custom pruning rate to prune the original LLM model. This step generates a new, streamlined model version that we save under the “tuning” directory. Subsequently, we apply either random or specified precision quantization adjustments to this pruned model. These quantization strategies are carefully chosen to further reduce the model’s memory footprint while striving to maintain its operational effectiveness. Once quantized, the model undergoes a thorough evaluation using specific tools that assess its performance on downstream tasks to ensure it retains accuracy and effectiveness after modifications.

To integrate Optuna into our optimization process, we begin by recording the pruning and quantization parameters in a JSON format. These parameters serve as historical data inputs for Optuna, facilitating a structured approach to hyperparameter tuning. We employ Optuna to conduct multi-objective optimization, setting multiple goals such as balancing model performance with memory efficiency. Optuna’s iterative search process explores various parameter combinations across numerous iterations, each informed by the outcomes of previous evaluations. The results from these searches are compiled into a dataframe, which not only aids in subsequent analysis but is also crucial for identifying the Pareto frontier, optimizing the trade-offs between performance and resource usage.

G Limitations

This study focuses solely on the fine-tuning of pruned models, leveraging the self-adjusting nature of QPruner. However, we believe that this adaptive approach could be broadly applicable to a wide range of models, not limited to pruned ones.
Future research will explore the optimization of fine-tuning across different model architectures. Additionally, we did not conduct experiments on larger models, such as those with 70B parameters or more. The scalability and effectiveness of QPruner on such massive models remain to be investigated, and this will be a key focus of our future work.

Pruning Rate  Method      BoolQ          PIQA           HellaSwag      WinoGrande     ARC-e          ARC-c          OBQA
Rate = 0%     w/o tuning  68.50          79.11          76.21          70.09          74.58          44.54          42.20
Rate = 50%    LLM-Pruner  61.93 (41.32)  71.38 (41.32)  53.36 (41.32)  53.59 (41.32)  29.95 (41.32)  53.11 (41.32)  38.00 (41.32)
Rate = 50%    QPruner^1   61.71 (36.68)  72.63 (36.68)  56.10 (36.68)  55.17 (36.68)  31.57 (36.68)  55.47 (36.68)  38.60 (36.68)
Rate = 50%    QPruner^3   61.80 (30.53)  73.23 (30.53)  56.37 (30.53)  55.09 (31.45)  31.48 (30.53)  55.80 (31.45)  39.00 (30.58)

Table 3: Zero-shot performance and memory. ‘Bold’ indicates the best performance at each pruning rate. Reported in percentage (%).

Figure 4: Pareto-front scatter plots for different downstream tasks: (a) ARC-c, (b) ARC-e, (c) HellaSwag, (d) OBQA, (e) PIQA.
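The non-dominated points highlighted in the Pareto-front plots can be extracted with a simple filter. The sketch below is an illustration rather than the authors' code: `pareto_front` is a hypothetical helper, and the example data are the three BoolQ entries at the 50% pruning rate from Table 3, where the value in parentheses is treated as the memory objective to minimize and accuracy as the objective to maximize.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    A point p dominates q if p is at least as good on both objectives
    (higher accuracy, lower memory) and strictly better on at least one.
    """
    def dominates(p, q):
        return (p["acc"] >= q["acc"] and p["mem"] <= q["mem"]
                and (p["acc"] > q["acc"] or p["mem"] < q["mem"]))

    return [q for q in points if not any(dominates(p, q) for p in points)]


# BoolQ column of Table 3 at the 50% pruning rate: accuracy (%) and memory figure.
boolq = [
    {"name": "LLM-Pruner", "acc": 61.93, "mem": 41.32},
    {"name": "QPruner^1", "acc": 61.71, "mem": 36.68},
    {"name": "QPruner^3", "acc": 61.80, "mem": 30.53},
]

front = [p["name"] for p in pareto_front(boolq)]
print(front)  # ['LLM-Pruner', 'QPruner^3']
```

In this small example QPruner^1 drops out because QPruner^3 reaches higher accuracy with less memory, while LLM-Pruner stays on the front only because it has the highest accuracy. The quadratic scan is adequate for the 50 data points used in the plots; larger searches would call for a sort-based method.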

---