arXiv: 2412.14590

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Authors: Zhen Zheng, Xiaonan Song, Chuanjie Liu

Published: 2024-12-19

Abstract:

Quantization has become one of the most effective methodologies to compress LLMs into a smaller size. However, existing quantization solutions still show limitations of either non-negligible accuracy drop or system inefficiency. In this paper, we make a comprehensive analysis of the general quantization principles and their effect on the triangle of accuracy, memory consumption and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience in the global view rather than within each single layer, effectively assigning the larger bit-width to the output features that need it most to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration of algorithm-system co-design that leads to high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization to make use of the int8 Tensor Core easily and a fast data type conversion to reduce dequantization overhead significantly, and present a software pipeline that overlaps the memory access, dequantization and the MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while on average MMLU-Pro improves by 0.93 over the SOTA of three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.

Paper Content:

Zhen Zheng*, Xiaonan Song, Chuanjie Liu (Microsoft)

1 Introduction

Large language models (LLMs) [7, 32] have shown remarkable performance on various tasks.
But their large memory consumption and massive computation cost have become an obstacle to efficient deployment [43, 44]. Quantization has become one of the most effective solutions to compress LLMs into a smaller size [14, 26, 45, 46], by representing the weight or activation with a smaller bit-width. However, existing quantization solutions still show limitations of either non-negligible accuracy drop or system inefficiency.

There is a triangle of characteristics for efficient LLM quantization: accuracy, memory consumption of parameters, and system efficiency of execution, which we call the effectiveness triangle of quantization. Existing quantization solutions have different focuses and trade-offs in the triangle:

• The weight-only methodologies aim to solve the memory consumption problem, and can speed up small-batch decoding, which faces the memory-wall problem [43, 21]. But their accuracy drop at 4-bit quantization can be a challenge for production workloads sensitive to accuracy, and it becomes more serious in newer models with higher information density like Llama 3 [32], as illustrated in recent studies [42, 22]. Besides, the weight-only method can lead to a system performance drop for large-batch workloads (e.g., the SOTA W4A16 kernel only achieves 83% of the performance of its float16 counterpart at batch size 512 with hidden size 4096, shown in Fig.5).

• The weight-activation quantization represents the activation with low-bit values along with the weights, potentially leading to higher system efficiency. But it can lead to a larger accuracy drop than the weight-only method, as the activation is usually harder to quantize [50, 3, 27]. Besides, it introduces more dequantization overhead for the activation, which can hurt the system efficiency. The transformation optimizations in some works can make the system efficiency even worse.

• Outlier separation and mixed-precision technologies emerge to improve the accuracy of low-bit quantization by either excluding the unstructured high-salience weights from quantization [11, 21] or assigning a larger bit-width to the quantization of structured high-salience weights [50]. The former has a system efficiency problem due to the low efficiency of half-precision (i.e., float16/bfloat16) sparse tensor processing. The state-of-the-art mixed-precision solution [50] aims for low-bit quantization but shows a non-negligible accuracy drop, even inferior to the 4-bit weight-only quantization.

* Correspondence to Zhen Zheng <zhengzhen@microsoft.com>. arXiv:2412.14590v1 [cs.LG], 19 Dec 2024.

Figure 1: Illustration of the quantization with mixed-precision between output features and kernel execution. Panels: ahead-of-time quantization (searched global precision config; 8-bit for high-salience and 4-bit for low-salience output features of the linear weight; pre-pack with memory interleave) and runtime kernel execution (8-bit activation; two MatMuls in parallel; fused scatter in the kernel epilogue).

Contributions. In this paper, we provide an extensive analysis of the general quantization principles. To address the limitations of the previous works and cover the three characteristics in the effectiveness triangle, we propose MixLLM, which makes the following contributions:

▶ High accuracy with low memory consumption: mixed-precision between output features on the weight, with global salience identification. Given that different neurons matter differently to the model's output, we use different bit-widths for different output features (i.e., output channels) in the weight quantization: 8-bit for output features with high salience and 4-bit for the others (Fig.1). Rather than using a uniform number of outliers within each layer according to the salience estimated w.r.t. each single layer [50], MixLLM identifies the salience of different output features globally according to the estimated loss to the model's output.
This is because different layers can have different importance to the model. Besides, the mixed-precision between output features makes the system design easier than between input features, because the calculations of different output features are disjoint sub-problems.

▶ High accuracy with good system efficiency: the co-designed quantization configuration and GPU kernel optimization. We observe the sweet spot of several quantization decisions to achieve both good accuracy and system efficiency. MixLLM uses 8-bit for activation quantization as it can retain good accuracy. Besides, MatMul execution tends to be bound more by the larger weight tensor than by the smaller activation tensor, which weakens the need to push the activation smaller (refer to Sec.3.1). MixLLM uses symmetric quantization for 8-bit and asymmetric quantization for 4-bit for good accuracy, both in a group-wise manner. Such a configuration makes it challenging to achieve good system efficiency. We design the two-step dequantization to enable the fast int8 Tensor Core for such a configuration, along with the fast integer-float conversion to decrease the dequantization overhead. We also present the software pipeline design of the quantized linear kernel on the modern GPU (footnote 1).

Extensive evaluation shows that, with only 10% of 8-bit (i.e., W4.4A8), MixLLM outperforms all the existing 4-bit quantization algorithms while achieving the state-of-the-art system efficiency.

Footnote 1: We mainly discuss the model execution on the GPU in this paper. But the basic principle is general.

2 Background, Related Work, and Discussion

2.1 Background of Quantization

Quantization maps the tensor $X$ into the target range with a smaller bit-width representation through an affine transformation: $X_q = \mathrm{clamp}(\lfloor X / s \rceil + z, \mathrm{range})$, where $s$ is the scale and $z$ is the zero point. The value can be recovered (i.e., dequantization) through $X' = (X_q - z) \times s$. $X'$ is pushed to the discrete chunks rather than recovered to the original value, and thus incurs accuracy loss.
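The affine mapping and its inverse can be illustrated with a minimal round-to-nearest sketch. The function names (`quantize`, `dequantize`, `max_error`) and the min/max choice of scale and zero point are assumptions made for illustration; they are not taken from the MixLLM implementation.

```python
def quantize(values, bits, symmetric=False):
    """Round-to-nearest quantization of one group of floats.

    Returns (codes, scale, zero_point), following
    X_q = clamp(round(X / s) + z, range).
    The min/max-based scale is an illustrative assumption.
    """
    if symmetric:
        # Signed range centred on zero; the zero point is fixed to 0.
        qmin, qmax = -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1
        max_abs = max(abs(v) for v in values)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        zero_point = 0
    else:
        # Unsigned range; the zero point shifts the minimum value to code 0.
        qmin, qmax = 0, 2 ** bits - 1
        lo, hi = min(values), max(values)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = round(-lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate values: X' = (X_q - z) * s."""
    return [(q - zero_point) * scale for q in codes]


def max_error(values, bits, symmetric=False):
    """Largest absolute difference between the input and its dequantized form."""
    codes, scale, zero_point = quantize(values, bits, symmetric)
    restored = dequantize(codes, scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))
```

For a group of values, `max_error(values, 8)` is typically far smaller than `max_error(values, 4)`, which reflects the larger number of chunks available at the larger bit-width.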
The bit-width is essential for the accuracy of quantization as it determines the number of chunks for the quantized values ($2^{\mathrm{bitwidth}}$). For example, enlarging the bit-width from 4 to 5 doubles the number of chunks, so that the 5-bit RTN quantization can easily beat the 4-bit quantizations of advanced techniques (Tab.1).

The scale and zero point can be calculated from the whole channel/token vector or from a small group within the channel/token; the former is called per-channel/token quantization and the latter group-wise quantization. The group-wise scheme results in smaller accuracy loss due to the smaller chunk scale, but requires a more complex GPU kernel design. The symmetric quantization uses 0 as the zero point, which simplifies the computations ($X_q = \mathrm{clamp}(\lfloor X / s \rceil, \mathrm{range})$, $X' = X_q \times s$) and enables many works to design per-channel/per-token quantized kernels that multiply the scales at the epilogue of the whole MatMul (matrix multiplication) for dequantization [45, 40]. However, it leads to larger loss than the asymmetric one, as the data distribution is often asymmetric, especially for smaller bit-widths like 4-bit.

2.2 Related Works and Discussion of General Quantization Principles

This paper mainly focuses on post-training quantization (PTQ).

Systems that affect the quantization requirement. The continuous batching technology [47] batches the decoding tasks from different requests together to enlarge the batch dimension of MatMul during LLM inference. The chunked-prefill method [19, 2, 51] advances continuous batching by merging the prefill and decoding tasks into the same batch, further enlarging the MatMul shapes. These technologies push many LLM jobs to become compute-bound and motivate the demand to reduce computation.

Weight-only quantization and its limitation. A wide range of technologies has emerged to improve the accuracy of weight-only quantization.
GPTQ [14] advances OBC [13] on OBS-based [18] weight compensation with blocked updating and reordering. AWQ [26] proposes to scale the weight according to the characteristic of activation. OmniQuant [37] proposes learnable scaling and weight clipping factors. SpQR [11], SqueezeLLM [21] and OWQ [25] separate the outliers from the quantization and keep them in half precision. QuIP [8] aims to achieve extreme low-bit quantization with incoherence processing. ZeroQuant(4+2) [42] aims to improve accuracy with medium-sized FP6 quantization.

The weight-only quantization does not reduce the computation but introduces extra dequantization operations. The low-bit weight is dequantized to float16 to execute the MatMul in the float16 datatype. The current weight-only quantization faces two challenges: 1) From the accuracy aspect, there is still an accuracy gap between the 4-bit quantization and the float16 model, especially for many real business scenarios sensitive to a small accuracy drop, as discussed in recent works [42, 44]. 2) It can lead to a system efficiency drop on busy servers, as recent LLM inference serving systems usually batch the processing of different requests together on the server and form large MatMuls. The large MatMuls are compute-bound and suffer from the dequantization overhead [27].

Weight-activation quantization and the challenges. The weight-activation quantization helps to make use of the low-bit computing unit. LLM.int8() [10] observes the activation outlier problem and separates outliers from quantization with half precision. ZeroQuant [46] proposes per-token activation quantization and group-wise weight quantization. SmoothQuant [45] addresses the activation outlier problem through smoothing, and AffineQuant [29] proposes the general affine transformation for quantization. RPTQ [48] reorders the channels to cluster similarly scaled values together.
SpinQuant [28] and QuaRot [3] leverage matrix rotation properties to alleviate the outlier phenomenon. Atom [50] uses the mixed-precision between input features to improve the accuracy of 4-bit activation quantization. QoQ [27] is a holistic quantization solution with progressive group quantization, attention smoothing, rotation, and channel reordering.

Even though the weight-activation quantization has the advantage of reduced MatMul computation (i.e., MatMul in a smaller bit-width to make use of the smaller bit-width computing unit with higher computation throughput; see footnote 2), it faces the challenge of accuracy drop caused by the activation quantization, especially since the activation is usually harder to quantize than the weight. The SOTA low-bit weight-activation solutions [3, 28, 27] still have a gap to the 4-bit weight-only quantization.

Besides the accuracy drop, the activation quantization introduces more dequantization overhead than the weight-only one, which makes it challenging to design efficient GPU kernels. When enabling asymmetric quantization, the result of $(X_q - z)$ may exceed the range of the bit-width of $X_q$, making it hard to use the corresponding Tensor Core computation. Systems like Atom [50] thus avoid asymmetric quantization, at the cost of a larger accuracy drop. The group-wise quantization requires fine-grained integer-to-float (I2F) conversion to apply per-group scales. However, the I2F instruction is more expensive than the common computation instructions on the GPU [1] and can lead to a large system performance drop (>10% drop in our practice). Besides, the throughput of the Tensor Core is much higher than that of SIMT Cores: 624 TOPS for the int8 Tensor Core vs. 19.5 TFLOPS/TOPS for FP32/INT32 SIMT Cores. Existing works still lack a well-designed software pipeline that overlaps the Tensor Core computation with the SIMT Core based dequantization while achieving high accuracy.
In general, the existing solutions focus on part of the effectiveness triangle, but cannot cover all of it well. MixLLM is orthogonal to the above works by exploring the mixed-precision between output features with global salience identification, and the co-designed quantization decision and GPU kernels.

3 Methodology

3.1 Quantization Design and Decision in MixLLM

To cover the three aspects of the effectiveness triangle simultaneously, we make the following design and decision of weight and activation quantization according to the analysis in Sec.2.2.

3.1.1 Mixed-precision between output features of weight, with global salience identification.

It is known that different elements of the weight show different salience to the network's loss when being quantized [21, 11]. The outlier separation method can improve the accuracy by using float16 to store the high-salience elements, but can suffer from the inefficient sparse MatMul. We observe that the elements with high salience tend to be distributed along the output channels for most of the linear layers in many LLMs. Based on this observation, we can assign a larger bit-width to the output channels of high salience, and a smaller bit-width to the others, forming structured mixed-precision quantization. Through the experiments, we reach the same conclusion as the existing works [21, 11] that there is only a small set of elements with high salience contributing significantly to the model's accuracy drop. Thus we only need to assign the large bit-width to a small portion of the output channels to achieve good accuracy and retain a small memory consumption at the same time.

The structured mixed-precision between different output channels is friendly to the system efficiency and kernel development, because different output features are disjoint in the MatMul and their computations are different sub-problems.
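The disjointness of output features can be made concrete with a small sketch that splits a linear layer by output channel, solves the two parts independently, and scatters the results back. Quantization is deliberately left out so that only the split and scatter structure is visible; the function names and the list-of-rows weight layout are assumptions for illustration, not the paper's kernel interface.

```python
def matvec(weight_rows, x):
    """Dense reference: one dot product per output feature (one row each)."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight_rows]


def mixed_precision_linear(weight_rows, x, high_salience_rows):
    """Split the output features into two sub-problems and scatter the results.

    The two groups stand in for the 8-bit and 4-bit parts of the weight.
    """
    high = sorted(high_salience_rows)
    high_set = set(high)
    low = [i for i in range(len(weight_rows)) if i not in high_set]

    # Two independent sub-problems; on a GPU they could run in parallel.
    out_high = matvec([weight_rows[i] for i in high], x)
    out_low = matvec([weight_rows[i] for i in low], x)

    # Scatter each partial result to its original output-channel index.
    output = [0.0] * len(weight_rows)
    for idx, val in zip(high, out_high):
        output[idx] = val
    for idx, val in zip(low, out_low):
        output[idx] = val
    return output
```

Because no output element depends on rows from both groups, the scattered result matches the dense product exactly. A split along input features would instead require summing partial results from the two groups.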
Fig.1 shows how the linear layer computes with the mixed-precision between output features. It divides the linear layer into independent sub-problems, and finally gathers the outputs of the sub-problems together to form the final result. This optimization space is orthogonal to the existing quantization optimizations, e.g., GPTQ [14], and can be applied together with them.

One critical problem is how to identify the high-salience output channels in the model. The fixed threshold [11] or the fixed number/ratio [50, 25] of high-salience elements computed by the local loss of layers can be sub-optimal for the end-to-end model, as different layers can show different importance to the model's final output [16, 30, 12]. A high-salience channel w.r.t. a layer may not be a high-salience channel of the end-to-end model. In MixLLM, we compute the high-salience channels globally according to their impact on the model's final loss (Sec.3.2). As a result, different layers will have different numbers of high-salience channels. Fig.2 shows the distribution of the top 10% high-salience output features in Llama 3.1 8B, showing a huge difference between different linear layers.

Footnote 2: The extra dynamic activation quantization kernel can be fused into other operators with very little system cost [50], thus we only discuss the MatMul itself.

Figure 2: The percentage of high-salience output features within each linear layer of the Llama 3.1 8B model, according to each feature's contribution to the final loss after quantizing to 4-bit, with 10% high-salience features globally. Each decoder layer contains q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj in order. (x-axis: decoder layer index, 0 to 31; y-axis: percentage, 0 to 100.)

Note that this design differs from the mixed-precision in Atom [50] in two aspects. 1) MixLLM first addresses the problem of identifying the high-salience channels globally rather than locally.
2) MixLLM applies the mixed-precision between output features rather than input features, which is more system performant and algorithm flexible (footnote 3), as the output features are disjoint naturally.

3.1.2 Quantization decision with algorithm-system consideration: 8-bit symmetric activation and 4-bit asymmetric weight quantization in group-wise manner.

MixLLM makes the same decision as QoQ [27] to use 8-bit for activation quantization, as the 4-bit activation can lead to a large accuracy drop but does not lead to significant system efficiency improvement, because MatMul execution tends to be bound more by the larger weight tensor than by the smaller activation tensor. This is partially indicated by the compute intensity of the linear layer. Given the token number $M$ and in/out features $K$/$N$, the compute intensity is $I = \frac{2MNK}{MK B_{act} + KN B_{weight}}$, where $B_{act}$ and $B_{weight}$ are the bytes per element of activation and weight. Given $M = 512$ and $N = K = 4096$, reducing the weight from 8-bit to 4-bit (halving $B_{weight}$) results in an 80% increase of $I$, while reducing the activation from 8-bit to 4-bit (halving $B_{act}$) only achieves a 5.88% increase.

Instead of using per-token activation quantization, MixLLM uses the group-wise RTN method. On the one hand, Tab.1 shows that simple group-wise RTN quantization performs better than the token-wise smoothing method. On the other hand, the weight is already group-wise in MixLLM, and the group-wise activation does not lead to significantly more dequantization overhead in the system.

We observe that symmetric quantization is enough for the 8-bit activation (refer to MixLLM W8A8 in Tab.1), while asymmetric quantization can be essential for the 4-bit weight. The group-wise method with asymmetric quantization makes it difficult for the kernel to make use of the int8 Tensor Core, for which QoQ [27] introduces the two-step quantization method. Instead, we design a two-step dequantization method based on the property of the mix of symmetric and asymmetric quantization (Sec.3.3).
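The compute-intensity comparison in Sec.3.1.2 can be checked numerically. The sketch below evaluates the formula for M = 512 and N = K = 4096; the helper names are illustrative, and element sizes are passed in bits, which leaves the relative change unaffected because only ratios of intensities are compared.

```python
def compute_intensity(m, n, k, act_size, weight_size):
    """I = 2MNK / (MK * B_act + KN * B_weight) for an M x K by K x N MatMul."""
    return 2 * m * n * k / (m * k * act_size + k * n * weight_size)


def relative_gain(base, new):
    """Fractional increase of `new` over `base`."""
    return new / base - 1.0


M, N, K = 512, 4096, 4096
base = compute_intensity(M, N, K, act_size=8, weight_size=8)
weight_4bit = compute_intensity(M, N, K, act_size=8, weight_size=4)
act_4bit = compute_intensity(M, N, K, act_size=4, weight_size=8)

weight_gain = relative_gain(base, weight_4bit)  # halving the weight size
act_gain = relative_gain(base, act_4bit)        # halving the activation size
```

With these shapes the weight term dominates the denominator (N is eight times M), so shrinking the weight changes the intensity far more than shrinking the activation.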
Footnote 3: One example is that, in Atom, the outlier number in each MatMul should be a multiple of the corresponding tiling size of the kernel design to achieve good system efficiency, which limits the flexibility of the algorithm.

3.2 Global Precision Search Algorithm

As discussed in Sec.3.1, MixLLM determines the precision of all output features in all layers globally rather than locally. It identifies the salience of these features with respect to the final loss of the model, and assigns a larger bit-width to the features leading to larger loss. Specifically, it calculates the salience $S_c$ of a channel $c$ as:

$$S_c = |l(c_q) - l(c_0)| \quad (1)$$

which is the distance of the model's loss between quantizing and not quantizing this single channel. In Eq.1, $l(\cdot)$ is the loss function of the model w.r.t. a single channel, $c_q$ is the quantized weight of the channel and $c_0$ is the original weight. Note that it regards the neurons other than $c$ as constant in $l(\cdot)$.

We use the Taylor expansion to estimate the loss function $l(c)$ (similar to existing quantization works, ignoring the high-order terms):

$$l(c) \approx l(c_0) + g^T (c - c_0) + \frac{1}{2} (c - c_0)^T H (c - c_0) \quad (2)$$

where $g = E[\frac{\partial}{\partial c} l(c)]$ is the gradient of the loss w.r.t. the channel, and $H = E[\frac{\partial^2}{\partial c^2} l(c)]$ is the second-order gradient (i.e., Hessian matrix) w.r.t. the channel.

Calculating the exact Hessian matrix is infeasible as it is too costly. We approximate the Hessian $H$ with the (empirical) Fisher information matrix $F$ on the calibration dataset $D$:

$$H \approx F = \frac{1}{|D|} \sum_{d \in D} g_d g_d^T \quad (3)$$

Note that $F$ is w.r.t. a channel, differing from the diagonal Fisher information matrix in recent works that ignores any cross-neuron interactions [23, 21]. Based on this approximation, the second-order loss factor $\frac{1}{2} (c - c_0)^T (g_d g_d^T) (c - c_0)$ can be further simplified to $\frac{1}{2} (g_d^T (c - c_0))^2$, simplifying the expensive chained matrix multiplication into a single vector product.
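The simplification of the second-order factor follows from the rank-one structure of each term $g_d g_d^T$. A short derivation, using the shorthand $\delta = c - c_0$ (introduced here for brevity, not notation from the paper):

```latex
\begin{aligned}
\tfrac{1}{2}\,\delta^T \left( g_d g_d^T \right) \delta
  &= \tfrac{1}{2}\,\left( \delta^T g_d \right)\left( g_d^T \delta \right) \\
  &= \tfrac{1}{2}\,\left( g_d^T \delta \right)^2,
  \qquad \delta = c - c_0 .
\end{aligned}
```

Since $g_d^T \delta$ is a scalar, each sample needs only one dot product over the input dimension, and the matrix $g_d g_d^T$ never has to be formed.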
Finally, the salience can be calculated by:

$$S_c = \frac{1}{|D|} \sum_{d \in D} \left| g_d^T (c_q - c_0) + \frac{1}{2} \left( g_d^T (c_q - c_0) \right)^2 \right| \quad (4)$$

We do not ignore the first-order information during the calculation, differing from OBD [24] and many recent quantization works [14, 11, 21]. This is because the first-order factor can be more significant in the estimation in Eq.4, as the estimated second-order factor is the square of the first-order factor divided by two for each sample. Considering that $g$ can be very small for well-pretrained models and the delta of the quantized weight is usually not large, the first-order factor can be larger than the second-order one. Besides, what we require is the loss itself rather than the arguments of the loss function, and thus we do not need to ignore the first-order factor to simplify the argument calculation.

Algorithm 1: Global precision search procedure.
Input: Linear layer number L, weight and gradient of all linear layers (W_i ∈ R^{O,I}, G_i ∈ R^{O,I} for layer i).
Output: Global channel index with large and small bit-width (largebit_channels, smallbit_channels).
    S_global ← []
    for i = 1, 2, ..., L do
        W_delta_i ← quantize(W_i) - W_i
        S_1st ← sum(G_i ⊙ W_delta_i, dim=1)    ▷ Per-channel dot product between G_i and W_delta_i
        S_2nd ← 0.5 * (S_1st)^2
        S ← |S_1st + S_2nd|    ▷ S ∈ R^O, the salience of the O channels
        for channel_id = 1, 2, ..., O do    ▷ Log the salience of each output channel of this layer
            S_global.append(tuple(i, channel_id, S[channel_id]))
    sort(S_global)    ▷ Sort according to the salience, descending
    largebit_channels, smallbit_channels ← S_global[:N_largebit], S_global[N_largebit:]

Algo.1 illustrates the procedure of the global precision search. It calculates the salience of all the output channels of all linear layers and sorts them in descending order globally.
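Algo.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `rtn_quantize_row` is a hypothetical per-channel 4-bit round-to-nearest stand-in for `quantize()` (the paper uses group-wise quantization), and the gradients are assumed to be given as nested lists rather than collected from a calibration run.

```python
def rtn_quantize_row(row, bits=4):
    """Asymmetric round-to-nearest on one output channel, then dequantize.

    Hypothetical stand-in for quantize() in Algo.1 (per-channel, no grouping).
    """
    qmax = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)
    return [(min(qmax, max(0, round(w / scale) + zero)) - zero) * scale for w in row]


def global_precision_search(weights, grads, n_largebit, bits=4):
    """Rank all output channels of all layers by estimated salience.

    weights[i][o][k] and grads[i][o][k] index layer i, output channel o,
    input feature k. Returns (largebit_channels, smallbit_channels) as
    lists of (layer, channel) pairs.
    """
    s_global = []
    for layer, (w_layer, g_layer) in enumerate(zip(weights, grads)):
        for channel, (w_row, g_row) in enumerate(zip(w_layer, g_layer)):
            w_q = rtn_quantize_row(w_row, bits)
            # First-order term: gradient dotted with the quantization delta.
            s_1st = sum(g * (q - w) for g, q, w in zip(g_row, w_q, w_row))
            # Second-order term under the rank-one Fisher approximation.
            s_2nd = 0.5 * s_1st ** 2
            s_global.append((abs(s_1st + s_2nd), layer, channel))
    # Sort globally across layers, most salient first.
    s_global.sort(key=lambda item: item[0], reverse=True)
    ranked = [(layer, channel) for _, layer, channel in s_global]
    return ranked[:n_largebit], ranked[n_largebit:]
```

Because the ranking is global, one layer may contribute many large-bit channels while another contributes none, which is the behaviour reported in Fig.2.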
Given the global threshold $N_{largebit}$ as the number of large-bit precision channels, the first $N_{largebit}$ channels are quantized with 8-bit, and the other channels are quantized with a smaller bit-width (i.e., 4-bit in this paper). Any quantization methodologies (e.g., GPTQ, clip search) can be applied independently to these two disjoint parts of channels. Note that we calculate the salience of the channels in one pass rather than iteratively identifying the high-salience parts in smaller steps, as we observe that the single-pass method shows similar results to the iterative method and saves a lot of computation overhead.

Figure 3: The float and integer values of the binary pattern (010010110xx...x), each within a consecutive range. With the first 9 bits fixed to 010010110 and the remaining 23 bits variable as $B_{22}B_{21} \ldots B_0$, the FP32 value is $(1.B_{22}B_{21} \ldots B_0)_{binary} \times 2^{23}$ and the int32 value is $150 \times 2^{23} + (B_{22}B_{21} \ldots B_0)_{binary}$.

3.3 Efficient Quantization Computation System

Two-step dequantization to make use of int8 Tensor Core. As for the W4A8 computation, the dequantized weight and activations are $(W_q - z) s_w$ and $A_q s_a$, where $W_q$ and $z$ are of uint4 datatype, $A_q$ is of int8 datatype, and $s_w$ and $s_a$ are of float16 datatype. Directly dequantizing the tensors into float16 before the MatMul computation would prevent us from using the fast 8-bit Tensor Core on the GPU. Instead, MixLLM uses a two-step dequantization within each group. Specifically, MixLLM first partially dequantizes the weight into $(W_q - z)$, and then multiplies it by $A_q$ with the 8-bit Tensor Core. Finally, it multiplies this MatMul result by the two scales within each group. Note that we use the int8 datatype for $(W_q - z)$ so that there is no overflow problem.

Fast Int to Float conversion with partially fusing into Tensor Core instruction. In the above two-step dequantization computation, step 2 multiplies the integer tensor $A_q (W_q - z)$ by the float tensor $s_a s_w$. It requires the integer-to-float conversion (I2F) before the multiply operation.
The I2F instruction is expensive on modern GPUs. Instead, we make use of the range-dependent fast I2F transformation to convert the I2F instruction into two add/sub instructions (footnote 4). Specifically, it is based on the fact that there exists a certain range where an integer value's binary representation is the same as that of a corresponding float. As shown in Fig.3, the binary patterns whose first 9 bits are 010010110 represent a series of consecutive int32 values and a series of consecutive float32 values, respectively. We can add a bias to an integer value to bring it into this consecutive range, and then subtract a corresponding bias in float (same underlying binary) to restore its value in float type. We take the middle value in this range as the bias to maximize the data range that can be safely converted, whose hexadecimal number is 0x4b400000 (i.e., in the remaining 23 bits, the first bit is 1 and the other bits are 0). This allows converting a consecutive range of $2^{23}$ int32 numbers to float32. The range of the dot product of $k$ int8 elements is $2^{16} k$, thus the above fast I2F conversion allows $k$ up to 128. We use a quantization group size of 128 and can use the fast I2F safely:

    // bias_int = as_int(0x4b400000), bias_fp = as_float(0x4b400000);
    int tmp = src_int + bias_int;
    float dst_float = *((float*)&tmp) - bias_fp;

We further fuse the integer addition of the bias into the Tensor Core mma (Matrix Multiply-Accumulate) instruction. The mma instruction computes $D = AB + D$ during the MatMul computation. We initialize the accumulator $D$ as bias_int before the MatMul computation of each quantization group, and only need to subtract bias_fp after the MatMul. In other words, the expensive I2F is converted into a single float subtraction. The above I2F simplification brings more than 20 TOPS performance improvement for the 512/4096/4096 (M/N/K) quantized MatMul computation on an A100 GPU.

Footnote 4: CUTLASS [9] also has an implementation of fast I2F for general purpose.
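The two-step dequantization and the fast I2F conversion can be emulated on the CPU to check the arithmetic. The sketch below is only an emulation under stated assumptions: it uses Python's `struct` module to reinterpret bit patterns, the function names are illustrative, and it does not model Tensor Core execution or float16 scales.

```python
import struct

BIAS_BITS = 0x4B400000  # bit pattern shared by the integer bias and the float bias


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as float32, like *((float*)&tmp) in C."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


BIAS_INT = BIAS_BITS                # 1262485504
BIAS_FP = bits_to_float(BIAS_BITS)  # 12582912.0, i.e. 1.5 * 2**23


def fast_i2f(value):
    """Convert a small signed integer to float with one int add and one float sub.

    Valid for -2**22 <= value < 2**22, the range centred on the bias.
    """
    return bits_to_float(value + BIAS_INT) - BIAS_FP


def group_dot_two_step(w_q, zero, s_w, a_q, s_a):
    """W4A8 dot product for one quantization group.

    Step 1: subtract the zero point in the integer domain and accumulate
            integer products, starting from BIAS_INT (the fused bias add).
    Step 2: reinterpret the accumulator as float, subtract BIAS_FP, and
            apply the two scales once per group.
    """
    acc = BIAS_INT
    for wq, aq in zip(w_q, a_q):
        acc += (wq - zero) * aq
    return (bits_to_float(acc) - BIAS_FP) * (s_w * s_a)


def group_dot_reference(w_q, zero, s_w, a_q, s_a):
    """Slow path: dequantize both operands to float first, then multiply."""
    return sum(((wq - zero) * s_w) * (aq * s_a) for wq, aq in zip(w_q, a_q))
```

For a group of 128 elements with `|w_q - zero| <= 15` and `|a_q| <= 128`, the integer accumulator stays within 245760 in magnitude, well inside the safe range of the conversion.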
Figure 4: The GPU kernel software pipeline of group-wise W4A8/W8A8 quantized MatMul. It assumes perfect overlapping. G2S: load global to shared memory; S2R: load shared memory to register; MMA: matrix multiply-accumulation; I2F: integer to float conversion; deq: dequantize; acc: accumulate. (Diagram labels: pipeline initialization and tail overhead; block tiles, warp tiles and quantization group tiles; global memory (input), shared memory (block reuse), register (warp reuse), SIMT Core (dequant), Tensor Core (int MatMul), register (group buffer), register (final buffer).)

End-to-end software pipeline of the quantized linear kernel on the GPU. Fig.4 shows the software pipeline of the quantized kernel. Besides the basic warp tile and block tile, we introduce the quantization group tile for the fast I2F and per-group scale multiplication. It uses two output buffers for the output accumulation at the register level, one for the per-group accumulation and the other for the global accumulation. This allows applying the per-group scales on the group-level buffer. We initialize the group buffer with bias_int at the beginning of the group tile, and subtract bias_fp at the end of the group tile. As for the two-step dequantization, the first step is within the warp tile, where each input element subtracts the zero point before being fed into the MMA; the second step is at the end of the group tile, multiplying by the scale. We use the vectorized intrinsic to perform four int8 subtractions in a single instruction (vsub4) [27].
Besides, to improve the global memory loading efficiency, we prepack the memory layout of the weight tensor ahead-of-time to avoid the runtime permutation of the input elements. In general, this software pipeline can overlap the memory loading and computation, and the dequantization computation with SIMT Core and the MatMul computation with Tensor Core to the best, and minimizes the overhead of group-wise dequantization. Parallel execution of sub-problems of different bit-width. As for the execution shown in Fig.1, MixLLM executes different sub-problems in parallel on the GPU with CUDA Graph. Finally, the MatMul execution of the two parts write to the same target tensor with different channel indices to generate the final output. We implement this function with the fused epilogue of MatMul to scatter the output to the corresponding indices, which is basically costless. 4 Evaluation 4.1 Setup As for MixLLM evaluation in this paper, we use 0%, 10%, 20%, 50% and 100% percent of 8-bit based on the 4-bit quantization, respectively. Meanwhile, we use 8-bit for activation quantization. Both the weight and 8 Page 9: activation are group-wise quantized with group size 128. The 4-bit part is asymmetric quantized and the 8-bit part (including that in weight) is symmetric, which is a good trade-off between accuracy and system efficiency. Note that any other bit-width percentage configuration can be used for real scenarios to trade-off memory usage, system efficiency and accuracy in practice. We enable GPTQ in MixLLM for the all models. We also apply clip search for all the models except for the 32B, 70B and 72B models as the clip search is time consuming for large models. Baselines and configurations. We compare MixLLM with the state-of-the-art (SOTA) quantization solutions of both weight-only and weight-activation methods. 
As for the weight-only quantization, we compare MixLLM with the basic round-to-nearest (RTN) 4-bit and 5-bit quantization, and the production-level SOTA GPTQ [14] and AWQ [26]. As for the weight-activation quantization, we compare MixLLM with the most widely used SmoothQuant [45] and the recent SOTA QoQ [27] and QuaRot [3] (of both W4A4 and W4A8). The 8-bit tensors are all symmetrically quantized in all baselines and MixLLM. We also compare the perplexity with SqueezeLLM [21], OmniQuant [37], AffineQuant [29], Atom [50] and SpinQuant [28] according to their reported numbers.

We make use of the AutoGPTQ lib [5] (v0.7.1) to evaluate GPTQ, the AutoAWQ lib [4] (v0.2.7) to evaluate AWQ, the lmquant lib [33] (commit 9d62c5c) to evaluate SmoothQuant and QoQ, and the official repo to evaluate QuaRot. We enable the reorder trick for GPTQ evaluation, and use asymmetric quantization and group size 128 for both GPTQ and AWQ. We follow the official configurations in lmquant to use 0.85/0.15 as the alpha/beta parameters for SmoothQuant, and 0.3/0.7 for QoQ. We use symmetric and per-channel/token quantization in QuaRot, following the configuration in its paper. We disable the KV quantization of QoQ and QuaRot in our experiments to make the comparison fair.

Models and Datasets. We evaluate MixLLM and the baselines on a variety of widely used LLMs of different sizes, ranging from 0.5B to 72B. The models include Llama 3.1 8B and 70B [32], Llama 3.2 1B, Qwen2.5 0.5B, 1.5B, 7B and 32B [17], and Mistral 7B v0.3 [20]. We use the wikitext2 dataset [31] as the calibration set for GPTQ and MixLLM. We use the default pile dataset [34] as the calibration dataset for AWQ, SmoothQuant and QoQ, to enable their better performance. GPTQ, AWQ and MixLLM use 128 samples with a sequence length of 2048 for calibration. SmoothQuant and QoQ use 64 samples with a sequence length of 1024 for calibration (a larger dataset results in OOM in our experiment).

Metrics.
As for the algorithm accuracy, we compare the perplexity (ppl) of all the baselines on the wikitext2 and C4 [35] datasets. Meanwhile, we compare a set of popular downstream tasks on Llama 3.1 8B, Qwen2.5 7B, and Mistral 7B v0.3, including BBH [39], GPQA [36], MMLU-Pro [41], MuSR [38], ARC challenge [6], and HellaSwag [49]. We use lm-eval [15] for the downstream tasks evaluation, for which the task names are leaderboard_bbh, leaderboard_gpqa, leaderboard_mmlu_pro, leaderboard_musr, arc_challenge, and hellaswag. We use the average number for each of the tasks. We conduct the system experiments on NVIDIA A100 (80G) GPUs with CUDA 12.1. We use PyTorch 2.4.1 and transformers 4.45.2.

4.2 Perplexity Evaluation

Tab.1 shows the perplexity of the different baselines on the Wikitext2 and C4 datasets for the commonly used open source LLMs. It shows that:

• Using 4.4 bits of weights with MixLLM achieves accuracy similar to 5-bit RTN weight-only quantization, even with 8-bit activation quantization enabled in MixLLM. This is mainly because MixLLM assigns the high-salience output channels larger bit-widths than the uniform 5-bit solution does.

• As for the weight-only quantization baselines, MixLLM W4.4A8 outperforms the production SOTA solutions GPTQ and AWQ with only 10% more bit-width, even with 8-bit activation quantization enabled in MixLLM. Meanwhile, the RTN W5A16 method also outperforms GPTQ and AWQ, which means a slightly larger bit-width can easily beat a well-tuned smaller bit-width. MixLLM W4.4A8 benefits from the larger bits on the top 10% output features with high salience.

Table 1: Perplexity evaluation (↓) on wikitext2/c4 (each cell lists wikitext2 first, then c4), sequence length 2048. NA means no support. Abn means the value is too large (>10^5). For MixLLM, pn means n% 8-bit.
Method      Config         Llama 3.2 1B   Llama 3.1 8B   Llama 3.1 70B  Qwen2.5 0.5B   Qwen2.5 1.5B   Qwen2.5 7B     Qwen2.5 32B    Mistral 7B v0.3
float16     -              9.75/12.72     6.24/8.95      2.81/6.68      13.07/17.55    9.26/13.11     6.85/10.44     5.02/8.95      5.32/7.84
RTN         W4A16          11.72/15.56    6.82/9.72      3.55/7.43      15.54/20.55    10.35/14.35    7.23/10.88     5.27/9.14      5.51/8.04
RTN         W5A16          10.15/13.25    6.40/9.15      3.16/9.52      13.61/18.17    9.52/13.38     6.95/10.53     5.09/8.99      5.38/7.91
GPTQ        W4A16          10.38/14.15    6.52/9.55      Abn/Abn        14.01/19.04    9.64/13.75     7.09/10.75     5.20/9.08      5.49/8.19
AWQ         W4A16          10.81/14.12    6.65/9.48      3.28/6.96      15.04/19.75    9.95/13.85     7.10/10.71     5.23/9.08      5.44/7.98
SmoothQuant W8A8           9.89/12.91     6.34/9.08      2.92/6.77      13.84/18.40    9.63/13.49     7.17/10.85     5.12/9.04      5.35/7.88
QoQ         W4A8           Abn/Abn        6.64/9.49      3.49/7.07      Abn/Abn        Abn/Abn        7.39/11.06     5.55/9.31      5.44/7.98
QuaRot      W4A4           Abn/Abn        8.34/11.95     6.16/9.91      NA/NA          Abn/Abn        8.15/12.05     6.26/9.98      5.83/8.50
QuaRot      W4A8           Abn/Abn        6.60/9.67      3.43/7.10      NA/NA          Abn/Abn        7.03/10.68     5.23/9.10      5.40/7.99
MixLLM      W4A8 (p0)      10.36/14.09    6.54/9.62      3.30/7.24      14.43/19.61    9.66/13.79     7.03/10.75     5.21/9.08      5.42/8.02
MixLLM      W4.4A8 (p10)   10.05/13.51    6.42/9.33      3.02/6.83      13.42/18.13    9.44/13.43     6.92/10.57     5.12/9.01      5.36/7.93
MixLLM      W4.8A8 (p20)   9.95/13.25     6.37/9.22      2.97/6.79      13.32/17.99    9.40/13.35     6.90/10.53     5.09/9.00      5.35/7.90
MixLLM      W6A8 (p50)     9.85/12.98     6.30/9.09      2.86/6.73      13.21/17.78    9.33/13.25     6.88/10.49     5.05/8.98      5.33/7.87
MixLLM      W8A8 (p100)    9.76/12.75     6.25/8.97      2.81/6.68      13.12/17.60    9.28/13.14     6.86/10.45     5.02/8.96      5.32/7.84

Table 2: PPL (wikitext2) comparison with the reported numbers in the related works.

LLaMA 2   FP16   SqueezeLLM     OmniQuant     AffineQuant   Atom W4A4      SpinQuant   MixLLM
                 W4A16 0.45%    W4A16/W4A4    W4A16/W4A4    128 outliers   W4A8        W4.4A8
7B        5.47   5.57           5.58/14.26    5.58/12.69    6.03           5.7         5.55
13B       4.88   4.96           4.95/12.30    4.95/11.45    5.27           5.0         4.93

• As for the weight-activation quantization baselines, MixLLM W4.4A8 shows accuracy comparable to SmoothQuant with a much smaller bit-width (60% of that in SmoothQuant). MixLLM W4.4A8 shows better accuracy than QoQ and QuaRot with only 10% larger bit-width.
It shows MixLLM achieves a good balance of memory consumption and accuracy.

• Note that MixLLM W8A8 quantization shows nearly lossless performance compared to the float16 baseline. This is part of the motivation for using group-wise quantization for the activation in MixLLM.

Comparison with More Related Works. We compare MixLLM with more recent quantization works according to the reported numbers in their papers (Tab.2), showing that MixLLM achieves superior accuracy to a broad range of related works with similar memory consumption.

4.3 System Performance

We have evaluated MixLLM on single linear layers with token numbers ranging from 1 to 1024, in_features 4096 and out_features 4096/14336, and compared it with the SOTA W4A16 (TRT-LLM) and QoQ [27], as shown in Fig.5. The figure also shows the MixLLM kernel performance with different percentages of 8-bit (W4A8 with 0% 8-bit, W4.4A8 with 10% 8-bit, and W8A8 with 100% 8-bit). It shows that:

• MixLLM outperforms the float16 counterpart for all token numbers, achieving 1.90×, 2.75×, and 1.88× average speedup with MixLLM W4A8, W8A8, and W4.4A8 respectively.

• MixLLM outperforms the SOTA W4A16 solution, achieving 1.26×, 1.78×, and 1.25× average speedup with MixLLM W4A8, W8A8, and W4.4A8 respectively.

• MixLLM achieves performance similar to QoQ at a similar bit-width, achieving 0.99×, 1.39×, and 0.99× average speedup with MixLLM W4A8, W8A8, and W4.4A8 respectively. Note that MixLLM has better accuracy than QoQ (Tab.1, Tab.3).

[Figure 5: two panels (in_features 4096, out_features 4096; in_features 4096, out_features 14336). x-axis: token number (1 to 1024, powers of two); y-axis: speedup. Series: Torch FP16, TRT-LLM W4A16, QoQ W4A8, MixLLM W4A8, MixLLM W8A8, MixLLM W4.4A8.]

Figure 5: The speedup of two types of single linear layers over torch float16 baseline on the A100 GPU.

Table 3: Downstream tasks evaluation (↑) on Llama-3.1-8B/Qwen2.5-7B/Mistral-7B-v0.3. The first (upper) value in each cell is the average of the three models. BBH is 3-shot, MMLU-Pro is 5-shot, and the others are zero-shot.
Method            BBH                GPQA               MMLU-Pro           MuSR               ARCc               HellaSwag
float16           48.62              30.86              35.52              41.07              52.24              79.43
                  46.52/54.09/45.25  31.08/33.11/28.39  32.91/43.86/29.80  37.99/44.51/40.72  53.41/51.02/52.30  78.92/78.94/80.43
GPTQ W4A16        47.08              30.98              33.66              41.83              51.37              78.46
                  46.28/51.97/43.00  30.81/34.59/27.53  31.20/41.02/28.75  39.06/44.64/41.78  53.24/48.98/51.88  77.85/78.10/79.44
AWQ W4A16         47.62              30.86              34.31              40.36              51.17              78.77
                  45.59/52.71/44.57  29.76/33.17/29.65  31.57/42.78/28.59  38.27/42.37/40.44  52.39/50.60/50.51  78.29/78.35/79.68
SmoothQuant W8A8  47.82              30.90              35.04              42.06              51.74              79.20
                  46.37/52.57/44.52  31.40/33.94/27.36  32.61/42.98/29.52  39.05/46.39/40.73  53.33/50.00/51.88  78.88/78.48/80.24
QoQ W4A8          45.78              30.02              32.84              39.92              50.54              78.10
                  40.98/51.23/45.14  28.99/32.50/28.56  28.16/41.72/28.63  37.60/41.59/40.57  51.28/49.15/51.19  76.90/77.52/79.89
QuaRot W4A4       41.10              27.53              27.60              39.46              45.99              74.85
                  36.96/45.42/40.92  25.41/28.94/28.23  22.99/34.40/25.42  37.92/40.68/39.77  43.00/46.33/48.63  72.87/73.54/78.14
QuaRot W4A8       46.95              30.28              33.60              41.65              51.39              78.55
                  44.95/52.98/42.92  30.96/30.71/29.18  29.95/42.45/28.41  39.05/45.58/40.32  50.00/52.30/51.88  77.83/77.84/79.98
MixLLM W4A8       46.92              29.90              33.75              41.70              51.82              78.61
                  43.44/44.75/52.59  29.58/28.26/31.87  30.18/29.59/41.49  38.81/43.11/43.19  51.71/51.88/51.88  77.94/79.71/78.17
MixLLM W4.4A8     48.17              30.09              34.53              41.74              52.70              79.00
                  46.27/52.58/45.66  29.17/31.75/29.36  31.08/43.26/29.26  39.32/44.79/41.11  53.67/51.96/52.47  78.20/78.58/80.21
MixLLM W8A8       48.84              30.93              35.54              40.94              52.10              79.42
                  46.84/54.35/45.34  30.51/33.21/29.07  33.00/43.80/29.83  37.32/44.91/40.59  53.24/50.94/52.13  78.98/78.88/80.40

4.4 Downstream Tasks Evaluation

Tab.3 shows the accuracy of the downstream tasks on three popular LLMs. The results show that:

• MixLLM W4.4A8 outperforms all the 4-bit weight quantizations, with only 10% more bits for the weights. For example, for the MMLU-Pro task, the average metric of MixLLM W4.4A8 is improved by 1.69, 6.93, and 0.93 over QoQ, QuaRot W4A4, and QuaRot W4A8, respectively.
• MixLLM W8A8 is nearly lossless, showing higher accuracy than SmoothQuant. This comes from the group-wise quantized activation of MixLLM.

4.5 Detailed Analysis

4.5.1 Ablation Study

Fig.6 shows the perplexity of the Llama 3.1 8B model when different optimizations are added gradually. On top of the basic RTN quantization, using 8-bit for activation and using asymmetric, group-wise weight quantization contribute significantly to the accuracy improvement. This demonstrates the effectiveness of the decisions made in Sec.3.1.2. Based on these decisions, the 10% of 8-bit output features improves the accuracy to a high level, for which using blocked Fisher and including the first-order Taylor factor also contribute to the accuracy. Finally, applying GPTQ and clipping can further improve the accuracy.

[Figure 6: bar chart of perplexity after each cumulative step; the bar groups in the figure are labeled Basic decision, Global mixed and Addons.]
  W4A4 RTN per-channel/token sym                  744.414
  + activation 8-bit                                9.867
  + weight asym                                     8.551
  + weight g128                                     6.937
  + activation g128                                 6.830
  + 10% 8-bit, diagonal Fisher, no 1st order        6.496
  + blocked Fisher and 1st order                    6.489
  + GPTQ w/o reorder                                6.424
  + group-reorder of GPTQ                           6.422
  + clip                                            6.420

Figure 6: The perplexity (wikitext2) of Llama 3.1 8B model with different configurations.

Table 4: The average percentage of 8-bit out features in the seven classes of linear layers in Llama 3.1 8B, with 10% global 8-bit out features in MixLLM.

layer                    q_proj   k_proj   v_proj   o_proj   gate_proj   up_proj   down_proj
avg 8-bit percent (%)    3.93     12.36    71.22    18.70    0.73        1.46      53.82

4.5.2 High Precision Distribution

Fig.2 shows the percentage of 8-bit out features in each of the linear layers of Llama 3.1 8B, with 10% global 8-bit out features searched by MixLLM (i.e., W4.4A8). It shows that high-salience (i.e., 8-bit) features are distributed very differently across the linear layers. Specifically, the v_proj and down_proj layers show a much higher percentage of high-salience features than other layers, for which Tab.4 shows the average percentage of the different classes of linear layers.
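To make the contrast between a global and a per-layer budget concrete, here is a small sketch that ranks output features across all layers at once and reports the share selected in each layer. The salience values are random placeholders (the paper derives them from its loss-based estimate in Sec.3.2, which is not reproduced here), the layer sizes are made up, and global_top_features and nominal_weight_bits are hypothetical helper names.

```python
import numpy as np


def global_top_features(salience, fraction=0.10):
    """Mark the top `fraction` of output features, ranked across ALL layers."""
    names = list(salience)
    flat = np.concatenate([salience[name] for name in names])
    k = int(round(fraction * flat.size))
    top = np.argsort(flat)[flat.size - k:]  # indices of the k largest scores
    mask = np.zeros(flat.size, dtype=bool)
    mask[top] = True
    # Split the flat mask back into one boolean mask per layer.
    masks, start = {}, 0
    for name in names:
        stop = start + salience[name].size
        masks[name] = mask[start:stop]
        start = stop
    return masks


def nominal_weight_bits(fraction_8bit, low_bits=4, high_bits=8):
    """Average bits per weight, assuming every output feature holds equally many weights."""
    return low_bits * (1.0 - fraction_8bit) + high_bits * fraction_8bit


# Placeholder salience: three layers with different typical magnitudes (made up).
rng = np.random.default_rng(0)
salience = {
    "q_proj": rng.lognormal(size=1000),
    "v_proj": 5.0 * rng.lognormal(size=1000),
    "down_proj": 2.0 * rng.lognormal(size=1000),
}
masks = global_top_features(salience, fraction=0.10)
share = {name: 100.0 * m.mean() for name, m in masks.items()}
for name, pct in share.items():
    print(f"{name}: {pct:.1f}% of output features kept in 8-bit")
print(f"nominal weight bit-width: {nominal_weight_bits(0.10):.1f}")
```

With a per-layer budget every layer would keep exactly 10%; the global ranking lets the share drift toward layers whose features score higher, which is the kind of uneven split Tab.4 reports. The same arithmetic explains the configuration names: 10%, 20% and 50% of 8-bit on a 4-bit base correspond to W4.4, W4.8 and W6.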
4.5.3 One-pass vs. Progressive Search

As described in Sec.3.2, MixLLM searches the high-salience features within a single pass. We have tried the progressive procedure on the Llama 3.1 8B and Mistral 7B models, which identifies smaller portions of the high-salience features iteratively. Results show that the accuracy is the same as that of the one-pass method to two decimal places. However, the progressive method has a much longer search time due to the repeated procedure. The one-pass method takes 7 minutes for each of the two models to search 10% high-salience features, while the progressive method, which searches 2% high-salience features per iteration, takes 30 minutes to find the top 10% features.

4.5.4 Overhead of Global Precision Search

Tab.5 shows the global precision search overhead described in Sec.3.2. As noted in Sec.4.1, the calibration dataset has 128 samples with a sequence length of 2048. We use a single A100 GPU for the 1.5B, 7B and 8B models, and 4 A100 GPUs for the 70B models to perform the search. We make use of device_map in Hugging Face for multi-GPU execution, which is sequential execution of layers on different devices. The 7B/8B models require about 7 minutes and the 70B models require less than 60 minutes to complete the search. Considering that the quantization only needs to be performed once, the searching algorithm is practical for real workloads.

Table 5: The overhead of global precision search in MixLLM.

Model         Llama 3.1 8B   Llama 3.1 70B   Mistral 7B v0.3   Qwen2.5 1.5B   Qwen2.5 7B
Time (min)    7              55              7                 2              7

5 Summary

We have presented MixLLM, achieving high accuracy with low memory consumption and high system efficiency with the rarely explored optimization space of mixed-precision quantization between output features. MixLLM identifies the salience of each output feature according to the loss distance estimation w.r.t. the global model loss rather than the local layer loss.
By assigning larger bit-width to the features that need it most, MixLLM achieves superior accuracy to SOTA with low memory consumption. The sub-problems of different bit-widths are disjoint and can run in parallel efficiently on the GPU. We have identified the sweet spot of the quantization configuration that is friendly to both accuracy and system efficiency. To address the challenge of system efficiency, we design the two-step dequantization to enable int8 Tensor Core computation, and the fast integer-float conversion to reduce the dequantization overhead. We have designed the end-to-end software pipeline to overlap the memory access, the dequantization computation on the SIMT Core and the MatMul on the Tensor Core. Experiment results show that MixLLM achieves superior accuracy to existing works and state-of-the-art system efficiency with low memory cost.

References

[1] Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A. Badawy. Demystifying the nvidia ampere architecture through microbenchmarking and instruction-level analysis. In IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, September 19-23, 2022, pages 1–8. IEEE, 2022.
[2] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with sarathi-serve. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 117–134. USENIX Association, 2024.
[3] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. CoRR, abs/2404.00456, 2024.
[4] AutoAWQ. Autoawq. https://github.com/casper-hansen/AutoAWQ , Cited Sep. 2024.
[5] AutoGPTQ. Autogptq. https://github.com/AutoGPTQ/AutoGPTQ , Cited Sep. 2024.
[6] Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. A systematic classification of knowledge, reasoning, and context within the ARC dataset. In Eunsol Choi, Minjoon Seo, Danqi Chen, Robin Jia, and Jonathan Berant, editors, Proceedings of the Workshop on Machine Reading for Question Answering@ACL 2018, Melbourne, Australia, July 19, 2018, pages 60–70. Association for Computational Linguistics, 2018.
[7] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. CoRR, abs/2303.12712, 2023.
[8] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[9] CUTLASS. Cutlass. https://github.com/NVIDIA/cutlass , Cited Sep. 2024.
[10] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. CoRR, abs/2208.07339, 2022.
[11] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless LLM weight compression. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[12] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: hessian aware quantization of neural networks with mixed-precision. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 293–302. IEEE, 2019.
[13] Elias Frantar and Dan Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[14] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: accurate post-training quantization for generative pre-trained transformers. CoRR, abs/2210.17323, 2022.
[15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024.
[16] Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. The unreasonable ineffectiveness of the deeper layers. CoRR, abs/2403.17887, 2024.
[17] Alibaba Group. Qwen2.5: A party of foundation models. https://qwenlm.github.io/blog/qwen2.5/ , Cited Nov. 2024.
[18] Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon and general network pruning. In Proceedings of International Conference on Neural Networks (ICNN’88), San Francisco, CA, USA, March 28 - April 1, 1993, pages 293–299. IEEE, 1993.
[19] Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. Deepspeed-fastgen: High-throughput text generation for llms via MII and deepspeed-inference. CoRR, abs/2401.08671, 2024.
[20] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. CoRR, abs/2310.06825, 2023.
[21] Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
[22] Tanishq Kumar, Zachary Ankner, Benjamin F Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, and Aditi Raghunathan. Scaling laws for precision. arXiv preprint arXiv:2411.04330, 2024.
[23] Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[24] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pages 598–605. Morgan Kaufmann, 1989.
[25] Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. OWQ: outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, pages 13355–13364. AAAI Press, 2024.
[26] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: activation-aware weight quantization for on-device LLM compression and acceleration. In Phillip B. Gibbons, Gennady Pekhimenko, and Christopher De Sa, editors, Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsys.org, 2024.
[27] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4A8KV4 quantization and system co-design for efficient LLM serving. CoRR, abs/2405.04532, 2024.
[28] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. CoRR, abs/2405.16406, 2024.
[29] Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[30] Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. CoRR, abs/2403.03853, 2024.
[31] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
[32] Meta. Llama 3. https://ai.meta.com/blog/meta-llama-3 , Cited Sep. 2024.
[33] MIT-Han-Lab. lmquant. https://github.com/mit-han-lab/lmquant , Cited Sep. 2024.
[34] MIT-Han-Lab. Pileval. https://huggingface.co/datasets/mit-han-lab/pile-val-backup , Cited Sep. 2024.
[35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
[36] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023.
[37] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[38] Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[39] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13003–13051. Association for Computational Linguistics, 2023.
[40] TensorRT-LLM. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM , Cited Sep. 2024.
[41] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024.
[42] Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, and Zhewei Yao. Zeroquant(4+2): Redefining llms quantization with a new fp6-centric strategy for diverse generative tasks. CoRR, abs/2312.08583, 2023.
[43] Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. Flash-llm: Enabling low-cost and highly-efficient large generative model inference with unstructured sparsity. Proc. VLDB Endow., 17(2):211–224, 2023.
[44] Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song. Quant-llm: Accelerating the serving of large language models via fp6-centric algorithm-system co-design on modern gpus. In Saurabh Bagchi and Yiying Zhang, editors, Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024, pages 699–713. USENIX Association, 2024.
[45] Guangxuan Xiao, Ji Lin, Mickaël Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 2023.
[46] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
[47] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In Marcos K. Aguilera and Hakim Weatherspoon, editors, 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 521–538. USENIX Association, 2022.
[48] Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. RPTQ: reorder-based post-training quantization for large language models. CoRR, abs/2304.01089, 2023.
[49] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics, 2019.
[50] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate LLM serving. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024, 2024.
[51] Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594, 2024.