QPruner: Probabilistic Decision Quantization for Structured Pruning in
Large Language Models
Changhai Zhou1,3, Yuhua Zhou2, Yibin Wang1,
Shijie Han4, Qian Qiao5, Hongguang Li3
1Fudan University, 2Zhejiang University, 3JF SmartInvest Holdings, 4Columbia University, 5Soochow University
zhouch23@m.fudan.edu.cn zhouyuhua@zju.edu.cn yibinwang1121@163.com
sh4460@columbia.edu qqiao@stu.suda.edu.cn harvey2@mail.ustc.edu.cn
Abstract
The rise of large language models (LLMs) has
significantly advanced various natural language
processing (NLP) tasks. However, the resource
demands of these models pose substantial chal-
lenges. Structured pruning is an effective ap-
proach to reducing model size, but it often re-
sults in significant accuracy degradation, ne-
cessitating parameter updates to adapt. Unfor-
tunately, such fine-tuning requires substantial
memory, which limits its applicability. To ad-
dress these challenges, we introduce quantiza-
tion into the structured pruning framework to
reduce memory consumption during both fine-
tuning and inference. However, the combined
errors from pruning and quantization increase
the difficulty of fine-tuning, requiring a more
refined quantization scheme. To this end, we
propose QPruner, a novel framework that em-
ploys structured pruning to reduce model size,
followed by a layer-wise mixed-precision quan-
tization scheme. Quantization precisions are
assigned to each layer based on their impor-
tance to the target task, and Bayesian optimiza-
tion is employed to refine precision allocation
strategies, ensuring a balance between model
accuracy and memory efficiency. Extensive ex-
periments on benchmark datasets demonstrate
that QPruner significantly outperforms existing
methods in memory savings while maintaining
or improving model performance.
1 Introduction
The advent of large language models (LLMs) has
revolutionized various natural language processing
(NLP) tasks, such as machine translation (Zhang
et al., 2023a; Sato et al., 2020), sentiment analy-
sis (Zhang et al., 2023b; Deng et al., 2023), and
speech recognition (Min and Wang, 2023). Despite
their impressive capabilities, the resource consump-
tion required to obtain a fine-tuned model suitable
for specific tasks remains substantial due to the
large number of parameters and high computational demands of LLMs (Frantar and Alistarh, 2023).
To address these issues, various compression tech-
niques, including pruning (Molchanov et al., 2019;
Liu et al., 2018), quantization (Shao et al., 2023;
Lee et al., 2023), and distillation (Gu et al., 2023;
Tan et al., 2023), have been proposed.
Structured pruning (Ma et al., 2023; Xia et al.,
2023) is a widely used approach that reduces model
size by removing less important parameters in a
structured manner, preserving the architecture's
compatibility with hardware requirements.
However, the disruption of computational graph
uniformity and the removal of parameters can sig-
nificantly reduce the accuracy of LLMs, which are
inherently information-dense networks. To miti-
gate this degradation, fine-tuning is often used to
recover the accuracy of pruned models. This fine-
tuning step, while effective, is memory-intensive
and presents substantial challenges in terms of re-
source consumption.
To further reduce memory usage during the fine-
tuning and inference phases, we introduce quantiza-
tion into the structured pruning framework. Specifi-
cally, after performing structured pruning, we quan-
tize the pruned model and then apply different
fine-tuning strategies. Quantization effectively re-
duces the bit-width of model parameters, thereby
lowering the resource consumption during both
fine-tuning and inference. However, integrating
quantization with structured pruning introduces ad-
ditional complexities. Structured pruning applies
different pruning intensities across model layers,
which exacerbates the uneven distribution of layer
importance, making some layers more critical for
maintaining model performance. Moreover, the
cumulative quantization error varies across differ-
ent layers, potentially amplifying the performance
degradation caused by pruning. Therefore, a sim-
ple, uniform quantization scheme is suboptimal. In-
stead, a more nuanced, layer-wise mixed-precision
quantization approach is needed. By allowing more
critical layers to maintain higher precision, we can
better control the overall performance of the model.
arXiv:2412.11629v1 [cs.LG] 16 Dec 2024
Building upon these observations, we propose
a new framework called QPruner. In QPruner, we
first apply structured pruning to reduce the model
size, followed by a quantization phase where dif-
ferent quantization precisions are assigned to each
layer based on their contribution to the target task.
To further improve the allocation strategy, Bayesian
optimization (Frazier, 2018) is employed to explore
better precision configurations. Finally, we apply
a parameter-efficient fine-tuning (PEFT) strategy
to recover model performance. This integrated
approach aims to strike an optimal balance
between model accuracy and memory efficiency,
making it well-suited for resource-constrained sce-
narios. The main contributions of this work are
summarized as follows:
• We propose QPruner, a novel framework that
integrates structured pruning and quantization,
aiming to significantly reduce the memory
consumption of LLMs during both fine-tuning
and inference.
• We introduce a mixed-precision quantization
scheme where quantization precisions are as-
signed to each layer based on their importance
to the target task, with Bayesian optimiza-
tion used to further refine precision allocation
strategies.
• We demonstrate QPruner’s powerful ability
to save memory and maintain performance.
It can surpass baseline methods in terms of
accuracy by up to 6% while saving at least
30% of memory.
2 Background and Motivation
2.1 Quantization
Quantization. Quantization is an essential tech-
nique used to reduce the computational and mem-
ory overhead of large-scale models by converting
high-precision numerical values, such as a 32-bit
floating-point number $X^{\mathrm{HP}} \in \mathbb{R}$, into a lower-bit
integer representation $X^{\mathrm{INT}} \in \{0, 1, \ldots, 2^N - 1\}$.
This process is mathematically expressed as:
$$X^{\mathrm{INT}} = \mathrm{round}\left((2^N - 1)\, F\left(X^{\mathrm{HP}}\right)\right), \tag{1}$$
where $F(\cdot): \mathbb{R} \to [0, 1]$ is a normalization function.
A typical method is uniform quantization, where
$F(X) = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$.
An alternative approach introduced by QLoRA
(Dettmers et al., 2024) is 4-bit NormalFloat Quantization
(NF4), which assumes that the data follows
a normal distribution $X \sim \mathcal{N}(0, \sigma^2)$ and applies
$F(X) = \Phi(X/\sigma)$, with $\Phi(\cdot)$ representing the
cumulative distribution function of a standard normal
distribution.
Dequantization. To recover the high-precision
values from their quantized forms, a lookup table
$T$ is used, which is defined as:
$$T[i] = F^{-1}\left(\frac{i}{2^N - 1}\right), \quad i = 0, 1, \ldots, 2^N - 1, \tag{2}$$
allowing the integer $X^{\mathrm{INT}}$ to be mapped back to
its simulated high-precision counterpart $X^{\mathrm{D}} \in \mathbb{R}$.
The dequantization process can be represented as:
$$X^{\mathrm{D}} = T[X^{\mathrm{INT}}]. \tag{3}$$
Simulated Quantization for Matrices. In prac-
tice, it is often more efficient to use simulated quan-
tization for matrices rather than directly operating
on quantized values (Bai et al., 2020; Shen et al.,
2020). In this method, quantized weight matrices
are stored as encoded integers and are temporarily
dequantized into simulated high-precision matrices
during multiplication operations. This process
is denoted by $q_N(\cdot): \mathbb{R}^{m \times n} \to \mathbb{R}_N^{m \times n}$, where
$\mathbb{R}_N = \{T[i] \in \mathbb{R} \mid 0 \le i < 2^N\}$.
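The quantize, lookup-table, and dequantize steps of Eqs. (1) to (3) can be illustrated with a minimal sketch. The weight values, `SIGMA`, and `EPS` below are hypothetical, and the normal-CDF variant clips probabilities away from 0 and 1 (an assumption made here because $\Phi^{-1}$ is infinite at the endpoints); it is not the actual NF4 codebook.

```python
from statistics import NormalDist


def make_table(n_bits, f_inverse):
    """Lookup table T[i] = F^{-1}(i / (2^N - 1)), Eq. (2)."""
    top = 2 ** n_bits - 1
    return [f_inverse(i / top) for i in range(top + 1)]


def quantize(x, n_bits, f):
    """X_INT = round((2^N - 1) * F(X_HP)), Eq. (1)."""
    return round((2 ** n_bits - 1) * f(x))


def dequantize(x_int, table):
    """X_D = T[X_INT], Eq. (3)."""
    return table[x_int]


N_BITS = 4
weights = [-0.9, -0.31, 0.0, 0.27, 0.8]  # hypothetical high-precision values

# Uniform quantization: F(X) = (X - X_min) / (X_max - X_min).
x_min, x_max = min(weights), max(weights)


def uniform_f(x):
    return (x - x_min) / (x_max - x_min)


def uniform_f_inv(u):
    return x_min + u * (x_max - x_min)


uniform_table = make_table(N_BITS, uniform_f_inv)
uniform_codes = [quantize(w, N_BITS, uniform_f) for w in weights]
uniform_recovered = [dequantize(c, uniform_table) for c in uniform_codes]

# Normal-CDF variant: F(X) = Phi(X / sigma).
# Assumption: probabilities are clipped to [EPS, 1 - EPS] so the table stays finite.
SIGMA, EPS = 0.5, 1e-3
std_normal = NormalDist()


def normal_f(x):
    return std_normal.cdf(x / SIGMA)


def normal_f_inv(u):
    return SIGMA * std_normal.inv_cdf(min(max(u, EPS), 1 - EPS))


normal_table = make_table(N_BITS, normal_f_inv)
normal_codes = [quantize(w, N_BITS, normal_f) for w in weights]
normal_recovered = [dequantize(c, normal_table) for c in normal_codes]
```

With uniform quantization the reconstruction error of each value is bounded by half a quantization step, whereas the normal-CDF table places more levels near zero, where normally distributed weights are dense.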
2.2 The Motivating Example
Efficient fine-tuning of LLMs on resource-
constrained devices requires effective model com-
pression and fine-tuning techniques. After applying
structured pruning and quantization, more efficient
fine-tuning methods are needed to recover accu-
racy. One approach is to use LoRA-based methods,
as done in LLM-Pruner (Ma et al., 2023), which
employs LoRA for quick recovery after structured
pruning. Among the LoRA series methods, LoftQ
(Li et al., 2023) is a method for fine-tuning quantized
models. Before fine-tuning, LoftQ iteratively
updates the low-rank matrices such that the
quantized matrix Q+AB approximates the full-
precision matrix W, thereby improving the fine-
tuning performance, particularly in low-bit settings.
Simply combining pruning, quantization, and
LoRA can lead to suboptimal results. Structured
pruning reduces model size by removing less important
parameters, but due to the varying importance
of different layers, it often results in uneven
pruning across layers. This uneven pruning leads to
a complex and unbalanced network structure, yet
standard quantization typically applies a uniform
configuration across all layers. To explore a better
trade-off between performance and memory, we
adopted mixed-precision quantization, assigning
different computational resources and complexities
to different layers, with the goal of allowing more
important layers to learn with finer granularity.
Figure 1: Comparison of accuracy and memory usage across different fine-tuning configurations for multiple tasks.
The bars represent the accuracy of three different methods (LoRA, LoftQ, LoftQ*) on each task, while the markers
indicate the memory usage for each corresponding method.
We conducted experiments using the LLaMA-7B
model with a pruning rate of 20%. The pruning was
performed using the optimal strategy determined
by LLM-Pruner. The methods compared were as
follows: LoRA with a uniform 16-bit configura-
tion, LoftQ with a uniform 4-bit quantization, and
LoftQ* with a mixed-precision setting of 4 or 8
bits per layer. As shown in Figure 1, the quan-
tized models (LoftQ) achieved performance com-
parable to the original precision models (LoRA),
with significantly lower memory usage (21.33 GB
versus 35.06 GB). On some tasks, there was a
slight drop in performance, but the mixed-precision
model (LoftQ*) demonstrated the potential to fur-
ther enhance performance while maintaining effi-
cient memory usage.
3 QPruner
Structured pruning, while effective in reducing
model size, can disrupt the balance of layer importance,
leading to performance degradation. Therefore,
parameter adjustments are often necessary
to mitigate this imbalance and restore model per-
formance. However, parameter updates require
significant memory, which is why we employ quan-
tization techniques to reduce memory consumption.
As demonstrated in the motivating example, simply
combining pruning and quantization is not always
the best choice, as the importance of different lay-
ers in a pruned model can vary greatly. We need
finer-grained layer-wise quantization bit-width con-
trol, which introduces a challenging bit-width al-
location problem. To address this, we designed a
two-stage allocation strategy to effectively balance
these trade-offs.
Based on these insights, we propose QPruner, an
integrated framework tailored for efficient or low-
resource NLP tasks. It employs structured pruning,
mixed-precision quantization, and efficient fine-
tuning to solve the challenges of balancing memory
efficiency and model performance.
3.1 Structured Pruning
Our framework does not impose specific require-
ments on the pruning method; as new technolo-
gies evolve, the pruning method can be replaced.
The only requirement for this step is to produce
a smaller model. Although some methods can
achieve good performance without fine-tuning (An
et al., 2024), most real-time systems require dy-
namic adaptation, which means that the pruned
model must be fine-tuned to improve performance.
Figure 2: Overview of the QPruner framework. The diagram shows three stages: Structured Pruning (Discovery, Estimate), which produces the pruned LLM; Mixed-Precision Quantization, where an importance-aware quantization initialization assigns per-layer bit-widths (e.g., 4-bit, 8-bit, 16-bit) and Bayesian optimization moves from the initial state to the optimal trial; and Performance Recovery on datasets with FFT or PEFT.
A popular structured pruning method is LLM-
Pruner (Ma et al., 2023), which first identifies
dependencies between neurons and groups them,
then removes weights based on their importance.
Let $N_i$ and $N_j$ be two neurons in the model. If
$N_j \in \mathrm{Out}(N_i)$ and $\mathrm{Deg}^{-}(N_j) = 1$, then $N_j$
is dependent on $N_i$. Similarly, if $N_i \in \mathrm{In}(N_j)$
and $\mathrm{Deg}^{+}(N_i) = 1$, then $N_i$ is dependent on $N_j$.
Based on this principle, a dependency graph can
be constructed to iteratively identify all coupled
structures.
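The dependency rule above can be sketched on a toy directed graph. This is an illustrative simplification, not the LLM-Pruner implementation: the node names and the `coupled_groups` helper are hypothetical, and real models define dependencies over neurons and modules rather than string labels.

```python
from collections import defaultdict


def coupled_groups(out_edges):
    """Group nodes that are coupled by the in-degree / out-degree dependency rule.

    out_edges maps each node to the list of nodes it feeds into (no duplicates).
    """
    nodes = set(out_edges)
    in_degree = defaultdict(int)
    for targets in out_edges.values():
        nodes.update(targets)
        for dst in targets:
            in_degree[dst] += 1

    # An edge src -> dst couples the two nodes if dst has a single input
    # (dst depends on src) or src has a single output (src depends on dst).
    coupled = defaultdict(set)
    for src, targets in out_edges.items():
        for dst in targets:
            if in_degree[dst] == 1 or len(targets) == 1:
                coupled[src].add(dst)
                coupled[dst].add(src)

    # Coupled structures are the connected components of the dependency graph.
    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(coupled[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical graph: a and b both feed c and d; c feeds only e; d feeds only f.
toy_graph = {"a": ["c", "d"], "b": ["c", "d"], "c": ["e"], "d": ["f"]}
groups = coupled_groups(toy_graph)
```

In this toy graph, `c` and `e` form one coupled structure and `d` and `f` another, so each pair would have to be pruned together, while `a` and `b` remain independent.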
Next, these coupled structures are grouped, and
their importance is estimated to effectively perform
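pruning. For a group of coupled structures $G = \{W_i\}_{i=1}^{M}$,
its importance can be expressed as:
$$I_{W_i} = \left| \mathcal{L}_{W_i}(D) - \mathcal{L}_{W_i = 0}(D) \right|, \tag{4}$$
where $\mathcal{L}$ represents the prediction loss.
Using a second-order Taylor expansion, the importance
can be approximated as:
$$I_{W_i} \approx \left| \frac{\partial \mathcal{L}(D)}{\partial W_i} W_i - \frac{1}{2} W_i^{\top} H W_i \right|, \tag{5}$$
where $H$ is the Hessian matrix of the loss function.
For each parameter $W_i^k$, its importance is defined
as:
$$I_{W_i^k} = \left| \frac{\partial \mathcal{L}(D)}{\partial W_i^k} W_i^k - \frac{1}{2} \left(W_i^k\right)^2 H_{kk} \right|, \tag{6}$$
where $H_{kk}$ is the k-th diagonal element of the
Hessian matrix.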
Finally, we aggregate the importance of each
structure into group-level importance using methods
such as summation, multiplication, taking the
maximum, or using only the last item. Groups
with the lowest importance are selected for pruning,
thereby reducing the model size while maintaining
performance as much as possible.
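A minimal sketch of the element-wise importance in Eq. (6) and the group-level aggregation described above is given below. The weights, gradients, and group names are hypothetical, and the helper names are introduced only for illustration.

```python
import math


def element_importance(weight, grad, hess_diag=None):
    """|dL/dw * w - 0.5 * w^2 * H_kk|; first-order only when hess_diag is None."""
    first_order = grad * weight
    if hess_diag is None:
        return abs(first_order)
    return abs(first_order - 0.5 * weight * weight * hess_diag)


def group_importance(scores, how="sum"):
    """Aggregate element scores into one group score."""
    if how == "sum":
        return sum(scores)
    if how == "prod":
        return math.prod(scores)
    if how == "max":
        return max(scores)
    if how == "last":
        return scores[-1]
    raise ValueError(f"unknown aggregation: {how}")


def select_groups_to_prune(groups, ratio, how="sum"):
    """Return the names of the lowest-importance groups for a given pruning ratio."""
    ranked = sorted(groups, key=lambda name: group_importance(groups[name], how))
    return ranked[: int(len(ranked) * ratio)]


# Hypothetical (weight, gradient) pairs for four coupled groups.
raw = {
    "head_0": [(0.5, 0.2), (-0.4, 0.1)],
    "head_1": [(0.1, 0.01), (0.05, -0.02)],
    "head_2": [(1.0, 0.3), (-1.0, 0.3)],
    "head_3": [(0.2, 0.1), (0.2, 0.1)],
}
scores = {name: [element_importance(w, g) for w, g in pairs] for name, pairs in raw.items()}
to_prune = select_groups_to_prune(scores, ratio=0.5)
```

Passing a diagonal Hessian estimate as `hess_diag` switches the score from the first-order to the second-order form; the ablation in Section 4.2 compares these two element-wise variants.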
3.2 Mixed-Precision Quantization
After pruning, we apply mixed-precision quantiza-
tion to further reduce memory usage while main-
taining model performance. Instead of assigning
a uniform bit-width across all layers, different bit-
widths are allocated based on each layer’s contri-
bution to the final model output. The contribution
of each layer is quantified using mutual informa-
tion between the layer’s output and the model’s
prediction.
To compute mutual information, we first run rep-
resentative data samples through the pruned model.
For each layer, we record its output $X$ and the final
prediction $Y$. The mutual information $I(X; Y)$
between the layer output $X$ and the prediction $Y$ is
computed as:
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, \tag{7}$$
where $p(x, y)$ is the joint probability distribution
of $X$ and $Y$, while $p(x)$ and $p(y)$ are the marginal
distributions. A higher mutual information value
indicates that the layer is more important for the
final output and should therefore be assigned a
higher bit-width. Once the mutual information is
computed, an average bit-width $B_{\mathrm{avg}}$ is determined
based on the available memory budget. Layers
with higher importance receive more bits, and the
allocation is performed in discrete bit-widths (e.g.,
4-bit, 8-bit), constrained by the total memory limit.
Algorithm 1 Mixed-Precision Quantization
  Compute mutual information $I(X_i; Y)$
  Initialize bit-width configuration $b_0$ based on $I(X_i; Y)$ and memory constraint
  $\mathcal{D} \leftarrow \{(b_0, P(b_0), M(b_0))\}$
  while not converged do
    Train GP model on $\mathcal{D}$
    $b_{t+1} \leftarrow \arg\max_{b} \alpha(b)$
    Apply $b_{t+1}$ to pruned model and fine-tune
    Measure $P(b_{t+1})$, $M(b_{t+1})$
    $\mathcal{D} \leftarrow \mathcal{D} \cup \{(b_{t+1}, P(b_{t+1}), M(b_{t+1}))\}$
  end while
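The mutual-information initialization can be sketched as follows. The sketch assumes layer outputs have already been discretized into a small number of symbols (the paper does not specify how), and the sample data, `allocate_bits`, and the conversion from $B_{\mathrm{avg}}$ to a count of 8-bit layers are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) from paired discrete samples, Eq. (7)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


def allocate_bits(mi_scores, avg_bits, low=4, high=8):
    """Give `high` bits to the top-MI layers while keeping the mean at most avg_bits."""
    n_layers = len(mi_scores)
    # Assumption: (high * k + low * (n - k)) / n <= avg_bits bounds the number k of high-bit layers.
    n_high = int(n_layers * (avg_bits - low) / (high - low))
    ranked = sorted(range(n_layers), key=lambda i: mi_scores[i], reverse=True)
    bits = [low] * n_layers
    for i in ranked[:n_high]:
        bits[i] = high
    return bits


# Hypothetical discretized outputs of four layers and the model predictions.
predictions = [0, 0, 1, 1]
layer_outputs = [
    [0, 0, 1, 1],  # fully informative about the prediction
    [0, 1, 0, 1],  # independent of the prediction
    [0, 0, 1, 0],  # partially informative
    [0, 1, 1, 0],  # independent of the prediction
]
mi = [mutual_information(out, predictions) for out in layer_outputs]
b0 = allocate_bits(mi, avg_bits=5.0)
```

Here the first layer carries the most information about the prediction, so it is the only one promoted to 8 bits under an average budget of 5 bits per layer.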
Although the initial bit-width configuration de-
rived from mutual information offers a reasonable
starting point for fine-tuning, the complex interac-
tions between layers, particularly in LLMs, mean
that the importance of individual layers may shift
after fine-tuning. As a result, the initial bit-width
assignment might not represent the optimal configu-
ration. To further refine the precision configuration,
we employ Bayesian optimization.
The objective of Bayesian optimization is to
maximize model performance while minimizing
memory usage. Let $b = [B_1, B_2, \ldots, B_L]$ represent
the bit-width configuration across $L$ layers.
The optimization problem is formulated as:
$$b_{\mathrm{opt}} = \arg\max_{b} \alpha(b), \tag{8}$$
where $\alpha(b)$ is an acquisition function that balances
exploration (of less well-understood configurations)
and exploitation (of known promising
configurations). The memory usage $M(b)$ is constrained
by $M_{\max}$, the total available memory.
The process starts by initializing a dataset $\mathcal{D}$ with
the initial bit-width configuration $b_0$, along with
its corresponding performance $P(b_0)$ and memory
usage $M(b_0)$. A Gaussian Process (GP) model is
then trained on the data to predict model performance
and the uncertainty for new configurations.
Based on this model, the acquisition function $\alpha(b)$
is used to select the next bit-width configuration to
evaluate.
Once a new configuration $b_{t+1}$ is selected, it is
applied to the pruned model, the model is fine-tuned, and its
performance $P(b_{t+1})$ and memory usage $M(b_{t+1})$
are measured. These results are then added to the
dataset $\mathcal{D}$, and the GP model is updated with the
new data. This iterative process continues until
a stopping criterion is met, such as convergence
or a maximum number of iterations. Over time,
this method refines the bit-width configuration to
achieve an optimal balance between model perfor-
mance and memory efficiency.
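The loop described above can be sketched with a small hand-written Gaussian Process and an upper-confidence-bound acquisition function. Everything concrete here is an assumption for illustration: the paper does not name the acquisition function, `performance` is a toy stand-in for "fine-tune and evaluate", `memory` is a toy proxy, and the experiments themselves list Optuna rather than this hand-rolled loop.

```python
import itertools
import math

LOW, HIGH, N_LAYERS = 4, 8, 6
M_MAX = LOW * N_LAYERS + 2 * (HIGH - LOW)  # hypothetical budget: at most two 8-bit layers
SENSITIVITY = [0.9, 0.1, 0.5, 0.2, 0.7, 0.3]  # hypothetical per-layer gains from 8 bits


def memory(bits):
    """M(b): toy proxy, the sum of per-layer bit-widths."""
    return sum(bits)


def performance(bits):
    """P(b): hypothetical stand-in for fine-tuning and evaluating the model."""
    return sum(s for s, b in zip(SENSITIVITY, bits) if b == HIGH)


def kernel(a, b, length=2.0):
    """RBF kernel on configurations, using the Hamming distance between them."""
    return math.exp(-sum(x != y for x, y in zip(a, b)) / length)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def posterior(data, cand, noise=1e-6):
    """Zero-mean GP posterior mean and standard deviation of P at `cand`."""
    xs = [d[0] for d in data]
    ys = [d[1] for d in data]
    k_mat = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
             for i, a in enumerate(xs)]
    k_star = [kernel(cand, b) for b in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(k_mat, ys)))
    var = kernel(cand, cand) - sum(k * v for k, v in zip(k_star, solve(k_mat, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def bayes_opt(b0, n_iters=8, kappa=1.0):
    """Iteratively pick the feasible configuration that maximizes the UCB acquisition."""
    feasible = [b for b in itertools.product((LOW, HIGH), repeat=N_LAYERS)
                if memory(b) <= M_MAX]
    data = [(tuple(b0), performance(b0), memory(b0))]  # dataset D
    for _ in range(n_iters):
        seen = {d[0] for d in data}
        pool = [b for b in feasible if b not in seen]
        if not pool:
            break

        def ucb(b):
            mean, std = posterior(data, b)
            return mean + kappa * std

        nxt = max(pool, key=ucb)
        data.append((nxt, performance(nxt), memory(nxt)))
    return max(data, key=lambda d: d[1])


initial = (8, 4, 4, 4, 4, 4)  # e.g. a mutual-information-based starting point
best_bits, best_perf, best_mem = bayes_opt(initial)
```

The memory constraint is enforced by restricting the candidate pool to feasible configurations, which is one simple design choice among several; in a real run each call to `performance` would be a full fine-tuning and evaluation pass, which is why the number of iterations dominates the cost.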
3.3 Performance Recovery
After the steps of structured pruning and mixed-
precision quantization, significant memory savings
are achieved. However, model performance typi-
cally needs to be restored through fine-tuning. Full-
parameter fine-tuning is often impractical due to
the large memory footprint it requires, but our com-
pression technique makes full model fine-tuning
feasible by reducing both memory and computa-
tional costs.
In addition to traditional full-parameter fine-
tuning, efficient fine-tuning techniques such as
LoRA (Low-Rank Adaptation) (Hu et al., 2021)
have proven especially effective, particularly in
scenarios with limited data. LoRA significantly reduces
the number of trainable parameters by freezing
the original weight matrix $W_0$ and only updating
a low-rank increment to the weight matrix,
represented as $\Delta W = AB$, where $A \in \mathbb{R}^{d \times r}$
and $B \in \mathbb{R}^{r \times d}$. Here, $r$ (the rank) is much smaller
than the original dimension $d$, leading to a substantial
reduction in the number of trainable parameters.
The forward computation in this approach can
be written as:
$$Y = W_0 X + \Delta W X = W_0 X + ABX. \tag{9}$$
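The forward pass of Eq. (9) and the resulting parameter saving can be illustrated with plain nested lists. The matrices below are hypothetical, and `d = 4096` in the final line is only an example dimension; the rank of 8 matches the setting reported in the experiments.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(w0, a, b, x):
    """Y = W0 X + A (B X); the d x d product A B is never materialized."""
    return add(matmul(w0, x), matmul(a, matmul(b, x)))


def trainable_fraction(d, r):
    """Trainable LoRA parameters (2 * d * r) relative to the full d x d matrix."""
    return (2 * d * r) / (d * d)


# Hypothetical example with d = 4 and r = 1.
w0 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # frozen weights
a = [[1], [0], [0], [0]]  # d x r, trainable
b = [[0, 2, 0, 0]]        # r x d, trainable
x = [[1], [2], [3], [4]]  # one input column

y = lora_forward(w0, a, b, x)
fraction = trainable_fraction(4096, 8)
```

Computing `B X` first keeps the extra cost proportional to `d * r` rather than `d * d`, which is the reason the low-rank path is cheap in both memory and compute.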
There are also LoRA-like methods specifically
designed for quantized models, such as QLoRA
(Dettmers et al., 2023) and LoftQ (Li et al., 2023).
LoftQ iteratively updates the low-rank matrices $A$
and $B$ such that the sum $Q + AB$ of the quantized
matrix and the low-rank term approximates the original
full-precision matrix $W$ before fine-tuning begins.
The objective is defined as:
$$\min_{A, B} \left\| W - (Q + AB) \right\|_2, \tag{10}$$
where $Q$ is the quantized matrix.
By combining structured pruning, mixed-
precision quantization, and performance recovery
techniques, QPruner is able to achieve robust adapt-
ability with minimal computational overhead.
Table 1: Zero-shot performance and peak memory usage on LLaMA-7B and Vicuna-7B with varying pruning rates.
LLM-Pruner represents the currently widely used half-precision model. The performance is reported in percentage
(%), and the memory usage is in gigabytes (GB).
Model      Rate  Method       BoolQ  PIQA   HellS  WinoG  ARC-e  ARC-c  OBQA   Memory (GB)
LLaMA-7B   0%    w/o tuning   73.09  78.35  72.98  67.09  67.42  41.38  42.40  -
LLaMA-7B   20%   LLM-Pruner   63.30  76.82  68.68  63.38  63.76  37.11  40.60  35.06
LLaMA-7B   20%   QPruner^1    67.77  76.55  68.03  61.80  64.06  38.65  40.00  21.78
LLaMA-7B   20%   QPruner^2    68.60  76.79  68.43  62.78  65.50  38.74  40.40  23.05
LLaMA-7B   20%   QPruner^3    69.11  77.23  68.80  63.17  66.16  39.20  41.00  23.32
LLaMA-7B   30%   LLM-Pruner   62.45  74.37  63.14  61.96  59.22  33.70  39.60  31.38
LLaMA-7B   30%   QPruner^1    58.96  71.22  58.10  58.88  52.19  32.34  38.40  20.12
LLaMA-7B   30%   QPruner^2    62.20  72.88  60.64  60.50  55.61  33.56  38.40  22.87
LLaMA-7B   30%   QPruner^3    66.50  74.43  61.14  61.40  58.12  34.47  39.20  22.15
LLaMA-7B   50%   LLM-Pruner   43.76  68.88  44.85  50.99  45.20  28.75  34.60  23.89
LLaMA-7B   50%   QPruner^1    45.14  68.34  44.39  52.96  43.86  29.01  35.80  15.47
LLaMA-7B   50%   QPruner^2    47.08  68.85  45.53  53.65  44.31  29.36  36.20  16.85
LLaMA-7B   50%   QPruner^3    48.37  69.20  45.19  54.45  45.28  29.70  36.40  16.65
Vicuna-7B  0%    w/o tuning   75.69  77.75  71.06  67.80  69.07  40.78  42.20  -
Vicuna-7B  20%   LLM-Pruner   57.77  77.56  67.16  63.14  67.30  37.71  40.40  35.25
Vicuna-7B  20%   QPruner^1    57.95  76.82  66.42  62.51  66.62  37.37  40.60  21.65
Vicuna-7B  20%   QPruner^2    59.70  77.20  66.31  62.66  67.12  37.48  40.80  22.95
Vicuna-7B  20%   QPruner^3    59.85  77.59  67.31  63.20  67.84  37.85  41.20  23.10
Vicuna-7B  30%   LLM-Pruner   58.81  74.37  60.70  60.62  59.01  33.79  38.80  31.83
Vicuna-7B  30%   QPruner^1    53.85  74.76  60.65  60.06  59.72  34.30  38.20  19.95
Vicuna-7B  30%   QPruner^2    55.64  75.07  61.65  60.31  59.54  34.47  38.60  21.65
Vicuna-7B  30%   QPruner^3    57.23  75.90  62.00  60.37  60.81  34.79  39.40  21.80
Vicuna-7B  50%   LLM-Pruner   59.51  66.87  43.18  52.01  48.40  26.45  34.00  24.55
Vicuna-7B  50%   QPruner^1    59.51  67.90  43.30  50.83  48.82  27.49  34.60  14.50
Vicuna-7B  50%   QPruner^2    61.31  68.56  44.54  53.02  49.50  28.13  35.40  15.90
Vicuna-7B  50%   QPruner^3    61.56  68.80  43.72  53.39  49.66  27.98  35.80  15.35
4 Experiments
LLMs and Benchmarks. To demonstrate how
QPruner performs on different models, we test it on
three open-source large language models: LLaMA-7B
(Touvron et al., 2023), LLaMA-13B (Touvron
et al., 2023), and Vicuna-7B (Zheng et al., 2024);
the specific versions are stated in Appendix A.
We evaluate these LLMs on zero-shot classification
tasks over commonsense reasoning datasets, including
BoolQ (Clark et al., 2019), PIQA (Bisk et al.,
2020), HellaSwag (Zellers et al., 2019), Wino-
Grande (Sakaguchi et al., 2021), ARC-easy (Clark
et al., 2018), ARC-challenge (Clark et al., 2018),
and OpenbookQA (Mihaylov et al., 2018).
Software and hardware configuration. We uti-
lize the following configurations: PyTorch version
2.1.2, BitsandBytes library version 0.43.1, Trans-
formers library version 4.41.0, PEFT (Parameter-
Efficient Fine-Tuning) library version 0.11.1, Op-
tuna library version 3.6.1, CUDA version 12.4,
GPU: NVIDIA L20 GPU with 48GB of memory.
Implementation Details. The pruning method follows
LLM-Pruner (Ma et al., 2023), and the dataset
uses 50k publicly available samples from Alpaca
(Taori et al., 2023). All experiments were conducted
with a LoRA matrix rank of 8 and LoftQ
initialization with one iteration. We used BitsandBytes
for quantization configuration; for memory
considerations, we keep the proportion of 8-bit
layers below 25%. For 4-bit quantization, we employed
NF4 (Dettmers et al., 2024), and since 2-bit
quantization does not reduce memory usage, each
layer’s quantization configuration only considered
4-bit and 8-bit options. More detailed hyperparameter
settings can be found in Appendix B.
4.1 Main Results
In this section, we present experimental results
to demonstrate that the proposed QPruner framework
preserves performance while reducing memory usage
by integrating quantization and structured pruning.
Through further iterative optimization, it can even achieve
better performance than high-precision models. Although
pruning methods are very important, the
pruning method itself is not our focus; therefore,
we adopt the popular LLM-Pruner (Ma et al., 2023)
as our baseline, which is a widely used structured
pruning method that directly removes weights.
Table 2: Performance comparison (%) of ablation studies on seven tasks at 20% pruning rate on LLaMA-7B. It
appears that QPruner captures potential resource allocations without relying on other settings.
            Dtype of 4-bit      Adapter Initialization Method Adapter Iteration Count       Importance Estimation
Benchmark   NF4       FP4       LoftQ     Gaussian  PiSSA     iter=1    iter=2    iter=4    Element^1 Element^2
ARC-e       65.49     62.84     65.49     64.77     64.44     65.49     64.31     64.18     65.49     62.50
ARC-c       38.99     36.77     38.99     38.99     38.40     38.99     38.05     38.14     38.99     37.80
WinoGrande  61.40     63.22     61.40     61.96     61.48     61.40     60.46     60.69     61.40     59.43
OBQA        40.20     39.80     40.20     39.00     40.40     40.20     39.40     39.60     40.20     38.60
BoolQ       67.22     66.48     67.22     64.43     68.20     67.22     67.55     66.85     67.22     65.44
PIQA        76.82     76.82     76.82     76.44     76.39     76.82     76.44     76.55     76.82     76.39
HellaSwag   67.97     67.88     67.97     67.80     68.01     67.97     67.97     67.93     67.97     66.93
We evaluate the model performance and peak
memory usage of LLM-Pruner and QPruner under
different pruning rates. Due to the lack of specific
test prompts in the LLaMA paper, we utilize open-
source prompts provided by Gao et al. (2023) for
benchmarking. Results for the LLaMA-7B and
Vicuna-7B models are shown in Table 1, and re-
sults for LLaMA-13B are provided in Appendix E.
Although our method is expected to have greater
advantages on larger models (e.g., 70B parameters
or more), due to hardware limitations, we focus
only on models with up to 13B parameters.
In our experiments, QPruner^1 denotes the use of
uniform quantization across all layers, QPruner^2
represents the mixed-precision configuration based
on mutual information, and QPruner^3 refers
to the mixed-precision quantization after further
optimization using Bayesian methods based on
QPruner^2. Theoretically, full-parameter fine-tuning
would perform better than PEFT methods; however,
it performs poorly on the Alpaca dataset commonly
used in model compression. If we perform individ-
ual training according to each benchmark, only the
pruned models after quantization can be fully fine-
tuned, which is an advantage of our framework, but
this would lead to unfair comparisons. Therefore,
for unquantized models, we use LoRA (Hu et al.,
2021) fine-tuning, and for quantized models, we
use LoftQ (Li et al., 2023) fine-tuning.
From Table 1, we observe that our method
demonstrates more significant advantages at higher
pruning rates. For instance, at a pruning rate of 50%
on the LLaMA-7B model, QPruner^3 outperforms
LLM-Pruner by achieving a higher accuracy on the
BoolQ dataset (48.37% vs. 43.76%) while reducing
memory usage from 23.89 GB to 16.65 GB, a
reduction of approximately 30%. This highlights
the effectiveness of our framework in maintaining
or even improving performance under aggressive
compression.
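The figures quoted above can be checked directly against Table 1. The short calculation below uses only BoolQ accuracy and peak memory for LLM-Pruner and QPruner^3 on LLaMA-7B at the 50% and 20% pruning rates; the `compare` helper is introduced here purely for illustration.

```python
def compare(baseline, ours):
    """Return (accuracy gain in points, relative memory saving) versus the baseline."""
    acc_gain = ours["boolq"] - baseline["boolq"]
    mem_saving = 1.0 - ours["memory_gb"] / baseline["memory_gb"]
    return acc_gain, mem_saving


# LLaMA-7B rows of Table 1: LLM-Pruner versus QPruner^3.
rate50 = compare({"boolq": 43.76, "memory_gb": 23.89},
                 {"boolq": 48.37, "memory_gb": 16.65})
rate20 = compare({"boolq": 63.30, "memory_gb": 35.06},
                 {"boolq": 69.11, "memory_gb": 23.32})

for label, (gain, saving) in (("50%", rate50), ("20%", rate20)):
    print(f"rate {label}: +{gain:.2f} points on BoolQ, {saving:.1%} less memory")
```

At the 50% rate this gives a gain of 4.61 points with about 30.3% less memory, and at the 20% rate a gain of 5.81 points with about 33.5% less memory, consistent with the "up to 6%" accuracy and "at least 30%" memory figures stated in the contributions.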
These results demonstrate that our QPruner
framework effectively balances memory efficiency
and model accuracy by integrating quantization
with structured pruning. By employing finer-
grained quantization strategies and a combined per-
formance recovery phase, we mitigate the detrimen-
tal effects that pruning and quantization individu-
ally impose on LLMs. This integration not only
reduces memory consumption but can also enhance
model performance, especially at higher pruning
rates.
4.2 Ablation Study
We conducted ablation experiments using LLaMA-7B
with a 20% pruning rate, based on results obtained
by QPruner^3. All results are presented in
Table 2. We tested different quantization data types
(NF4, FP4), LoRA matrix initialization methods
(Gaussian, PiSSA (Meng, 2024), LoftQ), varying
iteration counts in LoftQ (more iterations repre-
sent better error fitting), and different importance
estimation methods.
Our experiments show that the choice of quanti-
zation data type slightly affects performance, but
our method is effective across different types. Sim-
ilarly, different LoRA initialization methods yield
comparable results, indicating robustness to ini-
tialization strategies. Interestingly, increasing the
number of iterations in LoftQ does not necessarily
improve performance, suggesting that fitting resid-
uals with low-rank matrices may not always be
beneficial. Finally, using first-order Taylor approx-
imations for importance estimation outperforms
second-order ones, highlighting the complexity of
LLMs and the limitations of higher-order approxi-
mations.
Additional experiments on different Bayesian op-
timization iteration counts and resource consump-
tion are provided in Appendix C. The Pareto fron-
tier demonstrates that more iterations can lead to
better configurations, albeit at increased computa-
tional cost.
5 Related Work
5.1 Efficient Compression of LLMs
LLM-Pruner (Ma et al., 2023) uses structured prun-
ing to eliminate non-essential interconnected struc-
tures by leveraging gradient information. This tech-
nique enables compressed models to maintain good
performance across multiple tasks with basic fine-
tuning. Santacroce et al. (2023) proposes Glob-
ally Unique Movement (GUM), a novel pruning
technique focusing on the sensitivity and unique-
ness of LLMs’ network components. GUM prunes
neurons that uniquely contribute to the model out-
put and are sensitive to loss changes, thus pre-
serving high accuracy. This method optimizes the
trade-off between information retention and com-
putational efficiency. Quantization-Aware Train-
ing (QAT) combines quantization with full model
fine-tuning to adapt models for downstream tasks
(Peri et al., 2020; Liu et al., 2023). Although
QAT is effective, it requires substantial compu-
tational resources, such as gradient calculations
and optimization states, and it complicates the gra-
dient computation for quantized weights. How-
ever, by leveraging LoRA, these challenges can
be bypassed during task adaptation. Post-Training
Quantization (PTQ) frameworks, such as GPTQ
and SmoothQuant (Frantar et al., 2022; Xiao et al.,
2023), use a small subset of training data to cali-
brate high-precision models, enabling the genera-
tion of task-specific quantized models without the
need for gradient backpropagation. This makes
PTQ more cost-efficient than QAT, although it generally
results in lower accuracy. Among these, SmoothQuant
(Xiao et al., 2023) smooths activation outliers by
migrating quantization difficulty from activations to
weights, enabling accurate and efficient deployment
of large language models without the need for retraining.
5.2 Parameter Efficient Fine-Tuning
LLM-Adapters (Hu et al., 2023) integrate small
adapters with few extra parameters into LLMs for
efficient fine-tuning, allowing smaller models to
perform as well as larger ones on specific tasks.
Unlike the serial approach of adapters, low-rank
adaptation (LoRA) (Hu et al., 2021) uses a par-
allel method to insert trainable rank decomposi-
tion matrices into each layer of the model’s ar-
chitecture. LoRA adds trainable matrices to each
layer while keeping the pre-trained weights un-
changed, reducing the number of trainable param-
eters and making model adaptation faster and less
resource-intensive. QLoRA (Dettmers et al., 2024)
combines low-rank adapters and quantized 4-bit
weights for efficient LLM fine-tuning, significantly
reducing GPU memory requirements while achiev-
ing performance comparable to full 16-bit fine-
tuning. LoftQ (Li et al., 2023) applies quantization
and low-rank approximation alternately to achieve
a good initialization for LoRA fine-tuning, miti-
gating the discrepancy between quantized and pre-
trained weights, and enabling efficient fine-tuning
of quantized models, particularly in challenging
low-bit regimes.
6 Conclusion
We propose QPruner, an innovative framework that
combines structured pruning and quantization for
efficient model compression. Given that structured
pruning and quantization typically require perfor-
mance recovery steps, integrating them provides
a more holistic approach to mitigating the errors
introduced by both techniques while further com-
pressing the model. To address the uneven impor-
tance distribution across layers and precision loss
caused by pruning and quantization, we adopt a
fine-grained method to preserve the capacity of
critical layers, enhancing their performance further
during the fine-tuning process. After pruning, we
first allocate mixed-precision quantization based
on task relevance, followed by Bayesian optimiza-
tion to iteratively refine decisions and probabilisti-
cally select the optimal quantization configuration.
Experimental results demonstrate that QPruner significantly
outperforms baseline models in terms of
memory efficiency while maintaining or improving
accuracy on multiple NLP benchmarks. By striking
a balance between efficiency and performance,
QPruner offers a practical solution for deploying
LLMs in resource-limited environments.
Limitation
One of the current limitations of QPruner is the sig-
nificant precision loss caused by structured pruning,
which still impacts the overall model performance.
In future work, we aim to further optimize the prun-
ing process to minimize this precision degradation.
Additionally, the use of Bayesian optimization re-
quires real data to guide the process, which can
be time-consuming. While this method improves
quantization configurations, the iterative nature of
Bayesian optimization introduces additional com-
putational overhead that may not be ideal for all
deployment scenarios.
References
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao
Wang. 2024. Fluctuation-based adaptive structured
pruning for large language models. In Proceedings
of the AAAI Conference on Artificial Intelligence ,
volume 38, pages 10865–10873.
Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing
Jin, Xin Jiang, Qun Liu, Michael Lyu, and Irwin
King. 2020. Binarybert: Pushing the limit of bert
quantization. arXiv preprint arXiv:2012.15701 .
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi,
et al. 2020. Piqa: Reasoning about physical com-
monsense in natural language. In Proceedings of the
AAAI conference on artificial intelligence , volume 34,
pages 7432–7439.
Christopher Clark, Kenton Lee, Ming-Wei Chang,
Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. 2019. Boolq: Exploring the surprising
difficulty of natural yes/no questions. In Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers) , pages 2924–2936.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question an-
swering? try arc, the ai2 reasoning challenge. arXiv
preprint arXiv:1803.05457 .
Xiang Deng, Vasilisa Bashlovkina, Feng Han, Simon
Baumgartner, and Michael Bendersky. 2023. What
do llms know about financial markets? a case study
on reddit market sentiment analysis. In Companion
Proceedings of the ACM Web Conference 2023 , pages
107–110.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and
Luke Zettlemoyer. 2023. Qlora: Efficient finetuning
of quantized llms. arXiv preprint arXiv:2305.14314 .
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and
Luke Zettlemoyer. 2024. Qlora: Efficient finetuning
of quantized llms. Advances in Neural Information
Processing Systems, 36.
Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Mas-
sive language models can be accurately pruned in
one-shot. In International Conference on Machine
Learning , pages 10323–10337. PMLR.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and
Dan Alistarh. 2022. Gptq: Accurate post-training
quantization for generative pre-trained transformers.
arXiv preprint arXiv:2210.17323 .
Peter I Frazier. 2018. Bayesian optimization. In Recent
advances in optimization and modeling of contempo-
rary problems , pages 255–278. Informs.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman,
Sid Black, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li,
Kyle McDonell, Niklas Muennighoff, Chris Ociepa,
Jason Phang, Laria Reynolds, Hailey Schoelkopf,
Aviya Skowron, Lintang Sutawika, Eric Tang, An-
ish Thite, Ben Wang, Kevin Wang, and Andy Zou.
2023. A framework for few-shot language model
evaluation.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023.
Minillm: Knowledge distillation of large language
models. In The Twelfth International Conference on
Learning Representations .
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan
Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen. 2021. Lora: Low-rank adap-
tation of large language models. arXiv preprint
arXiv:2106.09685 .
Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-
Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria,
and Roy Lee. 2023. Llm-adapters: An adapter family
for parameter-efficient fine-tuning of large language
models. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing ,
pages 5254–5276.
Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok
Hwang, Wonyong Sung, and Jungwook Choi. 2023.
Enhancing computation efficiency in large language
models through weight and activation quantization.
In Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, pages
14726–14739.
Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatzi-
akis, Pengcheng He, Weizhu Chen, and Tuo Zhao.
2023. Loftq: Lora-fine-tuning-aware quantization for
large language models. In The Twelfth International
Conference on Learning Representations .
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie
Chang, Pierre Stock, Yashar Mehdad, Yangyang
Shi, Raghuraman Krishnamoorthi, and Vikas Chan-
dra. 2023. Llm-qat: Data-free quantization aware
training for large language models. arXiv preprint
arXiv:2305.17888 .
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang,
and Trevor Darrell. 2018. Rethinking the value of
network pruning. arXiv preprint arXiv:1810.05270 .
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023.
Llm-pruner: On the structural pruning of large lan-
guage models. Advances in neural information pro-
cessing systems , 36:21702–21720.
Meng. 2024. Pissa: Principal singular values and sin-
gular vectors adaptation of large language models.
arXiv preprint arXiv:2404.02948 .
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish
Sabharwal. 2018. Can a suit of armor conduct elec-
tricity? a new dataset for open book question an-
swering. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing ,
pages 2381–2391.
Zeping Min and Jinbo Wang. 2023. Exploring the in-
tegration of large language models into automatic
speech recognition systems: An empirical study. In
International Conference on Neural Information Pro-
cessing , pages 69–84. Springer.
Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri
Frosio, and Jan Kautz. 2019. Importance estima-
tion for neural network pruning. In Proceedings of
the IEEE/CVF conference on computer vision and
pattern recognition , pages 11264–11272.
Dheeraj Peri, Jhalak Patel, and Josh Park. 2020. De-
ploying quantization-aware trained networks using
tensorrt. In GPU Technology Conference .
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavat-
ula, and Yejin Choi. 2021. Winogrande: An adver-
sarial winograd schema challenge at scale. Commu-
nications of the ACM , 64(9):99–106.
Michael Santacroce, Zixin Wen, Yelong Shen, and
Yuanzhi Li. 2023. What matters in the structured
pruning of generative language models? arXiv
preprint arXiv:2302.03773 .
Shoetsu Sato, Jin Sakuma, Naoki Yoshinaga, Masashi
Toyoda, and Masaru Kitsuregawa. 2020. Vocabulary
adaptation for domain adaptation in neural machine
translation. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020 , pages 4269–
4279.
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng
Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng
Gao, Yu Qiao, and Ping Luo. 2023. Omniquant:
Omnidirectionally calibrated quantization for large
language models. In The Twelfth International Con-
ference on Learning Representations .
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei
Yao, Amir Gholami, Michael W Mahoney, and Kurt
Keutzer. 2020. Q-bert: Hessian based ultra low
precision quantization of bert. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 34,
pages 8815–8821.
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen
Gong, Shu Zhao, Peng Zhang, and Jie Tang.
2023. Gkd: A general knowledge distillation frame-
work for large-scale pre-trained language model. In
Proceedings of the 61st Annual Meeting of the As-
sociation for Computational Linguistics (Volume 5:
Industry Track) , pages 134–148.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
and Tatsunori B. Hashimoto. 2023. Stanford alpaca:
An instruction-following llama model. https://
github.com/tatsu-lab/stanford_alpaca .
Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. 2023. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288 .
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi
Chen. 2023. Sheared llama: Accelerating language
model pre-training via structured pruning. In The
Twelfth International Conference on Learning Repre-
sentations .
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu,
Julien Demouth, and Song Han. 2023. Smoothquant:
Accurate and efficient post-training quantization for
large language models. In International Conference
on Machine Learning , pages 38087–38099. PMLR.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. Hellaswag: Can a
machine really finish your sentence? In Proceedings
of the 57th Annual Meeting of the Association for
Computational Linguistics , pages 4791–4800.
Biao Zhang, Barry Haddow, and Alexandra Birch.
2023a. Prompting large language model for ma-
chine translation: A case study. In International Con-
ference on Machine Learning , pages 41092–41110.
PMLR.
Boyu Zhang, Hongyang Yang, Tianyu Zhou, Muham-
mad Ali Babar, and Xiao-Yang Liu. 2023b. En-
hancing financial sentiment analysis via retrieval aug-
mented large language models. In Proceedings of
the Fourth ACM International Conference on AI in
Finance , pages 349–356.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024.
Judging llm-as-a-judge with mt-bench and chatbot
arena. Advances in Neural Information Processing
Systems , 36.
A Version of LLMs
We provide the Hugging Face links of the LLMs used in
the experiments:
LLaMA-7B: https://huggingface.co/baffo32/decapoda-research-llama-7B-hf
Vicuna-7B: https://huggingface.co/lmsys/vicuna-7b-v1.5
LLaMA-13B: https://huggingface.co/yahma/llama-13b-hf
B Hyperparameters
In the optimization of the pruned LLaMA-7B
model, a comprehensive hyperparameter config-
uration was employed to ensure an optimal balance
between model performance and computational ef-
ficiency. The model was fine-tuned with a learning
rate of $3 \times 10^{-4}$, utilizing a batch size of 8, further
divided into micro batches of 4 to manage mem-
ory constraints effectively. Sequences were stan-
dardized to a maximum length of 256 tokens, and
a dropout of 0.05 was applied specifically to the
LoRA layers targeting projections such as query,
key, value, and output, alongside gate, down, and
up projections. Quantization was dynamically ap-
plied at 4-bit and 8-bit levels according to layer
requirements to optimize memory use without com-
promising computational accuracy. The training
employed the paged AdamW optimizer with 32-bit
precision, enhancing stability and efficiency. These
settings were methodically tested and optimized
through the Optuna framework to ensure robust
model performance and resource utilization.
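The settings above can be collected into a single configuration object. The following is a minimal sketch, not the released code: the class and field names, the LLaMA-style module names (`q_proj`, etc.), and the optimizer string are assumptions chosen for illustration, and the micro-batch split is assumed to be realized as gradient accumulation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters from Appendix B; all field names are illustrative."""

    learning_rate: float = 3e-4
    batch_size: int = 8
    micro_batch_size: int = 4
    max_seq_len: int = 256
    lora_dropout: float = 0.05
    # Assumed module names for the query, key, value, output, gate,
    # down, and up projections of a LLaMA-style model.
    lora_target_modules: Tuple[str, ...] = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    )
    # Per-layer quantization precision is chosen from these bit widths.
    allowed_bits: Tuple[int, ...] = (4, 8)
    # Assumed identifier for the paged AdamW optimizer with 32-bit states.
    optimizer: str = "paged_adamw_32bit"

    @property
    def gradient_accumulation_steps(self) -> int:
        """Number of micro batches accumulated per optimizer step."""
        if self.batch_size % self.micro_batch_size != 0:
            raise ValueError("batch_size must be a multiple of micro_batch_size")
        return self.batch_size // self.micro_batch_size


config = FineTuneConfig()
print(config.gradient_accumulation_steps)
```

With a batch size of 8 and micro batches of 4, each optimizer step accumulates gradients over two micro batches under this assumption.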
C Results of Optimization Workflow
In this section, we will use the LLaMA-7B model
with 50% pruning as our example to illustrate the
Pareto optimization workflow, as shown in Figure 4.
D Details of the Optimization Workflow
We illustrate the optimization process, memory, and time
footprint of QPruner using a 50% parameter pruning rate on
LLaMA-7B as an example. We fine-tuned 10 sets of
configurations as the initialization for the Gaussian Process
(GP) (this is not mandatory; in other experiments, we found
that starting from scratch, a good configuration could be
found in about 10 iterations). In each configuration, the
quantization precision for each model layer was randomly
selected between 4-bit and 8-bit. On average, obtaining data
for each initialization took approximately 25 minutes. We set
the total number of iterations for QPruner to 40 (resulting in
50 data points for constructing the Pareto front) to ensure
the best configuration was found. The entire process took
approximately 16.5 hours. During QPruner iterations, GP
required around 7s to suggest the next configuration, while
the prediction process and Pareto frontier construction
consumed approximately 187MB memory. In Figure 3, we present
the optimization results for BoolQ and WinoGrande. More
detailed processes and results for other benchmarks are
provided in Appendix C.
Figure 3: Pareto-front scatter plots for (a) BoolQ and (b)
WinoGrande with 50 data points. The red points indicate the
non-dominated configurations within the Pareto frontier.
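Two steps of the workflow above lend themselves to a short illustration: the random 4-bit/8-bit initialization and the extraction of non-dominated points. The sketch below uses only the Python standard library; the function names, the layer count of 32, and the toy (accuracy, memory) values are assumptions made for illustration, and the GP suggestion step is not implemented here.

```python
import random
from typing import List, Tuple

# (accuracy, memory): higher accuracy is better, lower memory is better.
Point = Tuple[float, float]


def sample_bit_config(num_layers: int, rng: random.Random) -> List[int]:
    """Pick 4-bit or 8-bit independently for every layer."""
    return [rng.choice((4, 8)) for _ in range(num_layers)]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates (input order kept)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Ten random configurations, mirroring the 10 initialization runs.
# num_layers=32 is a hypothetical value for illustration.
rng = random.Random(0)
initial_configs = [sample_bit_config(num_layers=32, rng=rng) for _ in range(10)]

# Toy measurements, made up for illustration only.
toy_points = [(61.0, 36.0), (62.0, 38.0), (60.5, 31.0), (60.0, 33.0), (62.0, 40.0)]
front = pareto_front(toy_points)
print(front)
```

In the toy data, (60.0, 33.0) is dominated by (60.5, 31.0) and (62.0, 40.0) is dominated by (62.0, 38.0), so three points remain on the front. The quadratic scan is adequate for the 50 data points used here.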
E Performance in LLaMA-13B
We list the performance of the configuration de-
scribed in Section 4 for LLaMA-13B in Table 3.
F Code Usage Instructions
In our workflow, we begin by defining a custom
pruning rate to prune the original LLM model. This
step generates a new, streamlined model version
that we save under the “tuning” directory. Sub-
sequently, we apply either random or specified
precision quantization adjustments to this pruned
model. These quantization strategies are carefully
chosen to further reduce the model’s memory foot-
print while striving to maintain its operational ef-
fectiveness. Once quantized, the model undergoes
a thorough evaluation using specific tools that as-
sess its performance on downstream tasks to ensure
it retains accuracy and effectiveness after modifica-
tions.
To integrate Optuna into our optimization pro-
cess, we begin by recording the pruning and quan-
tization parameters in a JSON format. These pa-
rameters serve as historical data inputs for Optuna,
facilitating a structured approach to hyperparam-
eter tuning. We employ Optuna to conduct multi-
objective optimization, setting multiple goals such
as balancing model performance with memory effi-
ciency. Optuna’s iterative search process explores
various parameter combinations across numerous
iterations, each informed by the outcomes of pre-
vious evaluations. The results from these searches
are compiled into a dataframe, which not only aids
in subsequent analysis but is also crucial for identi-
fying the Pareto frontier, optimizing the trade-offs
between performance and resource usage.
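The record-keeping part of this process can be sketched as follows. This is a hypothetical illustration using only the Python standard library: the JSON schema (`pruning_rate`, `layer_bits`, `accuracy`, `memory`) and the helper names are assumptions, not the format of the released code, and the Optuna calls themselves are omitted.

```python
import json
from typing import Dict, List


def make_record(pruning_rate: float, layer_bits: List[int],
                accuracy: float, memory: float) -> Dict:
    """Bundle one evaluated configuration into a JSON-serializable dict."""
    return {
        "pruning_rate": pruning_rate,
        "layer_bits": list(layer_bits),
        "accuracy": accuracy,
        "memory": memory,
    }


def dump_history(records: List[Dict]) -> str:
    """Serialize the trial history to a JSON string."""
    return json.dumps(records, indent=2)


def load_history(text: str) -> List[Dict]:
    """Parse a JSON history and check that every layer uses 4 or 8 bits."""
    records = json.loads(text)
    for record in records:
        if any(bits not in (4, 8) for bits in record["layer_bits"]):
            raise ValueError("layer_bits must contain only 4 or 8")
    return records


def to_rows(records: List[Dict]) -> List[Dict]:
    """Flatten each record into one table row with a column per layer."""
    rows = []
    for record in records:
        row = {k: v for k, v in record.items() if k != "layer_bits"}
        for index, bits in enumerate(record["layer_bits"]):
            row[f"layer_{index}_bits"] = bits
        rows.append(row)
    return rows


# Toy values, made up for illustration only.
history = [
    make_record(0.5, [4, 8, 8, 4], accuracy=61.0, memory=36.0),
    make_record(0.5, [4, 4, 8, 4], accuracy=60.5, memory=31.0),
]
restored = load_history(dump_history(history))
rows = to_rows(restored)
```

The flattened rows form a list of dictionaries, which is a convenient shape for building a dataframe and for the Pareto-frontier analysis described above.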
G Limitations
This study focuses solely on the fine-tuning of
pruned models, leveraging the self-adjusting na-
ture of QPruner. However, we believe that this
adaptive approach could be broadly applicable to a
wide range of models, not limited to pruned ones.
Future research will explore the optimization of
fine-tuning across different model architectures.
Additionally, we did not conduct experiments
on larger models, such as those with 70B param-
eters or more. The scalability and effectiveness
of QPruner on such massive models remain to be
investigated, and this will be a key focus of our
future work.
Pruning Rate  Method      BoolQ          PIQA           HellaSwag      WinoGrande     ARC-e          ARC-c          OBQA
Rate = 0%     w/o tuning  68.50          79.11          76.21          70.09          74.58          44.54          42.20
Rate = 50%    LLM-Pruner  61.93 (41.32)  71.38 (41.32)  53.36 (41.32)  53.59 (41.32)  29.95 (41.32)  53.11 (41.32)  38.00 (41.32)
Rate = 50%    QPruner1    61.71 (36.68)  72.63 (36.68)  56.10 (36.68)  55.17 (36.68)  31.57 (36.68)  55.47 (36.68)  38.60 (36.68)
Rate = 50%    QPruner3    61.80 (30.53)  73.23 (30.53)  56.37 (30.53)  55.09 (31.45)  31.48 (30.53)  55.80 (31.45)  39.00 (30.58)
Table 3: Zero-shot performance and memory (in parentheses). ‘Bold’ indicates the best performance at each pruning rate. Reported
in percentage (%).
Figure 4: Pareto-front scatter plots for different downstream tasks: (a) ARC-c, (b) ARC-e, (c) HellaSwag, (d) OBQA, (e) PIQA.