Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung
Paper Content:
Page 1:
Published as a conference paper at ICLR 2025
RADAR : FAST LONG -CONTEXT DECODING FOR ANY
TRANSFORMER
Yongchang Hao∗
University of Alberta & RBC Borealis
yongcha1@ualberta.caMengyao Zhai
RBC Borealis
mengyao.zhai@borealisai.com
Hossein Hajimirsadeghi
RBC Borealis
hossein.hajimirsadeghi@borealisai.comSepidehsadat Hosseini
RBC Borealis
sepid.hosseini@borealisai.com
Frederick Tung
RBC Borealis
frederick.tung@borealisai.com
ABSTRACT
Transformer models have demonstrated exceptional performance across a wide
range of applications. Though forming the foundation of Transformer models,
the dot-product attention does not scale well to long-context data since its time
requirement grows quadratically with context length. In this work, we propose
Radar, a training-free approach that accelerates inference by dynamically search-
ing for the most important context tokens. For any pre-trained Transformer, Radar
can reduce the decoding time complexity without training or heuristically evicting
tokens. Moreover, we provide theoretical justification for our approach, demon-
strating that Radar can reliably identify the most important tokens with high
probability. We conduct extensive comparisons with the previous methods on a
wide range of tasks. The results demonstrate that Radar achieves the state-of-
the-art performance across different architectures with reduced time complexity,
offering a practical solution for efficient long-context processing of Transform-
ers. The code is publicly available at https://github.com/BorealisAI/
radar-decoding .
1 I NTRODUCTION
Transformer models demonstrate an extraordinary ability on different sequential processing tasks,
including language modeling (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020; Touvron
et al., 2023b), image classification (Dosovitskiy et al., 2021; Liu et al., 2021; Hassani et al., 2022;
Zhu et al., 2023), translation (Tang et al., 2020; Yao & Wan, 2020), and many more (Carion et al.,
2020; Ding et al., 2022; Chang et al., 2023; Xu et al., 2022; Fang et al., 2023). In particular, Trans-
former models take each input as a sequence of tokens and compute the embedding of each token for
downstream tasks. Among all components, the dot-product attention has been shown to be critical
to the success of Transformer models (Choromanski et al., 2021). It not only enables parallel com-
putation of sequences during training (Vyas et al., 2020), but also provides a high-quality method
for sequence modeling (Sanford et al., 2023).
Despite being at the core of Transformer models, the dot-product attention is not ideal for long-
context data: the time to process each token increases with context lengths, significantly slowing
down the throughput on long-context data. Moreover, the maximum context length is limited during
training, resulting in an inability to perform inference on long-context tasks. Yet, many real-world
applications are naturally long-context (Tay et al., 2021; Beltagy et al., 2020; Wu et al., 2024). For
example, a code file could have more than 10K tokens (Lozhkov et al., 2024; Kocetkov et al., 2022).
∗Work done during an internship at RBC Borealis.
1arXiv:2503.10571v1 [cs.LG] 13 Mar 2025
Page 2:
Published as a conference paper at ICLR 2025
However, Llama-2 (Touvron et al., 2023b), a widely used Transformer-based model, is incapable of
processing the full source code because it has a maximum context length of 4K tokens.
To improve Transformer’s long-context performance, a line of research proposes to replace the orig-
inal attention with some faster variants (Xiao et al., 2024; Peng et al., 2021; Bertsch et al., 2023;
Beltagy et al., 2020; Choromanski et al., 2021; Mohtashami & Jaggi, 2023; Peng et al., 2023). One
notable example is attention kernelization, which reformulate the dot-product attention as a kernel
and approximate the kernel with linear operations (Peng et al., 2021). By reordering the matrix chain
multiplication, kernelization methods reduce the time complexity to O(t)for the context of length
t. However, such methods require a training process to fit the approximations, which becomes more
and more infeasible given the fast parameter scaling of the current Transformer models.
Without the need for training, another line of work accelerates inference by evicting previous tokens
used in each decoding step (Xiao et al., 2024; Zhang et al., 2023; Li et al., 2024). For instance,
StreamingLLM (Xiao et al., 2024) proposes to only use the constant-length recent tokens and discard
all other but the first tokens. Although they are faster in inference on long-context data, these
methods lose previous information in the context that is potentially critical for future use. Using the
code file as an example, StreamingLLM could fail to invoke a function if the function declaration is
not in the recent tokens.
In this paper, we propose Radar ( range se arch accelerate dby random featu res), a training-free ap-
proach that dynamically selects the important context ranges to accelerate inference. Specifically,
Radar groups the context into segments and maintains a “summarization representation” for each
group by random projections. To generate a new token, Radar calculates the importance of each
group and selects only the tokens in the most important segments. We theoretically justify our
approach by showing that Radar is guaranteed to pick the most dominant tokens with high proba-
bilities. For any pre-trained Transformers, our approach can instantly extend the maximum context
length while reducing the overall complexity to O(t1.5).
We conduct extensive experiments to verify the effectiveness of our approach. Specifically, we
compared our method with StreamingLLM (Xiao et al., 2024), Landmark attention (Mohtashami
& Jaggi, 2023), H 2O (Zhang et al., 2023), and SnapKV (Li et al., 2024). The results across dif-
ferent models and tasks show that Radar generally achieves state-of-the-art performance with a low
time complexity. Additionally, our in-depth analyses, while providing detailed information, highly
corroborate our main results.
2 M ETHODOLOGY
In this section, we present Radar, a principled approach that is guaranteed to select the most impor-
tant tokens with a high probability. We first introduce the notations and the problem formulation
in Section 2.1. The algorithm of Radar is formally presented in Section 2.2 with a major focus on
showing the time complexity. The correctness of our algorithm is rigorously shown in Section 2.3.
2.1 B ACKGROUND
Dot-product attention. Without loss of generality, we focus on the single-head self-attention for
simplicity. Specifically, it operates on the level of sequence. Each sequence contains several token
embeddings (x1, . . . ,xt), where every embedding is a vector in the high dimensional space Rd.
The attention first maps each embedding to query, key, and values by linear projections ( qi=
Wqxi,ki=Wkxi, andvi=Wvxi). The attention score is defined as
ai,j:=1
ziI(j≤i) exp
q⊤
ikj/√
d
(1)
for1≤i, j≤t. Here, ziis a normalization factor such thatPi
j=1ai,j= 1. The dot-product
attention then produces a sequence by
oi:=iX
j=1ai,jvj, (2)
indicating that oiis a fusion of (v1, . . . ,vi)weighted by the attention scores (ai,1, . . . , a i,i).
2
Page 3:
Published as a conference paper at ICLR 2025
k4k5k6
ϕΩ(k4), . . .
¯ϕΩ(k4:6)
(a) Construction process for a single
segment consists of three tokens k4:6.
Radar first maps each token embed-
ding to the random feature ϕΩ(ki).
The segment embedding ¯ϕΩ(k4:6)is
then the average of random features.k1k2k3k4k5k6k7k8k9
q⊤
¯ϕΩ(k1:3)
¯ϕΩ(k4:6)
¯ϕΩ(k7:9)
Rn ϕ⊤Ω(q)0.1a1:32.2a4:60.7a7:9
(b) Query process for a single query q. The query token
is first mapped to the random feature ϕΩ(q). The segment
importance is obtained by inner products between random
features ϕ⊤
Ω(q)¯ϕΩ(ki:i+2). In this example, the second
segment has the highest (unnormalized) segment attention
of2.2.
Figure 1: Overview of the approach.
The two equations above show that the dot-product attention requires O(t)to generate a new token,
resulting in an overall O(t2)time for the whole sequence. To reduce the complexity of the dot-
product attention, both operations should not be quadratic in t.
Context reduction. Without the need for training, one popular strategy for accelerating inference
is to reduce the number of tokens used in attention calculation. LongFormer (Beltagy et al., 2020)
proposes to use local sliding window attention, which only needs to look into the tailing tokens, to
accelerate decoding. More recently, StreamingLLM (Xiao et al., 2024) improves this by prepending
sink tokens to stabilize the decoding for contexts beyond the pre-training length. However, these
methods suffer from the issue of information loss , where the vast amount of middle tokens poten-
tially containing important information are permanently lost. Other remedies such as the heavy-
hitter oracle (H 2O) use heuristics to partially retain some middle tokens. Despite this, they depend
on heuristic rules and sometimes fail to completely prevent information loss.
Problem formulation. Given Equation (2), the attention score ai,jrepresents the importance of
thejth token to the current step i. To ensure the selected tokens are important, we would like to
select an index set Sthat
j∈arg topk
1≤l≤iai,l (3)
for all j∈S. This gives us an approximation of the original attention and reduces the compute to
O(|S|). However, generating such an index set takes O(t)time, failing to improve the overall time
complexity. We aim to approximately solve this problem in sublinear time complexity.
2.2 T HE PROPOSED ALGORITHM : RADAR
This section shows the algorithm of Radar with a focus on analyzing the time complexity. The
correctness of the algorithm is deferred to Section 2.3.
Hierarchical structure. We manage the key embeddings (k1, . . . ,kt)with a hierarchical data
structure consisting of two layers of nodes. The bottom-layer nodes are the original key embeddings,
whereas each top-layer node is a segment consisting of the bottom keys. The hierarchical structure
enables a two-step attention calculation procedure: for the first step, we search for the important
segments with the highest summed attention score; after that, we use the tokens in the selected
segments to approximate the attention for attention calculation.
The benefit of applying the hierarchical paradigm is reducing the time complexity. Assume there
arettokens and each segment contains ctokens on average, then the overall query complexity
isO(τ(c)·t/c+c), where O(τ(c))is the time complexity to obtain the importance score for a
segment. The term O(t/c)comes from calculating the importance of ⌈t/c⌉segments and selecting
the top ones; and the second term O(c)is for the actual attention by Equations (1) and (2).
3
Page 4:
Published as a conference paper at ICLR 2025
Accelerated segment search. To accelerate the segment search, we propose to assign each seg-
ment node a representation that “summarizes” the bottom-layer key embeddings. We resort to
the attention kernelization technique to accomplish this. Specifically, we first apply a mapping
ϕΩ:Rd→Rnin (Choromanski et al., 2021) for all keys
ϕΩ(k) :=1√n
exp
ω⊤
1k′− ∥k′∥2
2/2
, . . . , exp
ω⊤
nk′− ∥k′∥2
2/2
. (4)
Here, every element in Ω∈Rn×dis sampled from N0,1and the vector ωi∈Rdis the ith column
ofΩ. We let k′:=k/4√
dfor the scaling in attention. After projection, we average the projected
features as the segment embedding. The embedding for a segment ranging from itoi+cis
¯ϕΩ(ki:i+c) :=1
cc−1X
l=0ϕΩ(ki+l). (5)
The segment embeddings can be pre-computed and stored. This construction process is illustrated
in Figure 1a.
To search for the most important segments, we can search for
arg topk
l
al:l+c:=ϕ⊤
Ω(q)¯ϕΩ(kl:l+c)
(6)
forl∈ {1,1 +c, . . . , 1 + (⌈t/c]−1)c}. By using this acceleration, the time for calculating the
importance score of a single segment (i.e., τ(c)) becomes constant, and consequently O(τ(c)) =
O(1). Therefore, the total time is O(t/c)given there are ⌈t/c⌉segments. Based on our theoretical
guarantees in Section 2.3, the obtained segments will likely contain the most important tokens that
summed to the highest attention scores. This process is demonstrated in Figure 1b.
Dynamic restructuring. With the accelerated searching, we have O(t/c+c)time for each query.
The optimal asymptotic time complexity O(√
t)is obtained when c=O(√
t), indicating that the
segment range should change with the length of the context. However, changing the segment range
requires a reconstruction of the segment embeddings that takes O(t)time. To address this, we
propose a dynamic restructuring schedule, which only performs reconstruction when√
t∈N. This
schedule indicates that there will be maximum√
ttimes of restructuring happening for ttokens,
amortized to an O(√
t)time for each query step.
For all other steps that√
t̸∈N, we store the token in a buffer W, which serves as a sliding window
with a maximum size of 2√
t- 1. The sliding window is always used in querying steps. Although
this adds more tokens in the attention calculation, the asymptotic complexity remains O(√
t)for
each step.
The overall algorithm. Putting all together, we see that Radar demonstrates an overall time com-
plexity of O(t1.5)to generate ttokens. We provide the procedure of Radar with detailed time
complexity analysis in Appendix A.
2.3 T HEORETICAL JUSTIFICATIONS
In this section, we rigorously show that our algorithm described in Section 2.2 is guaranteed to solve
the problem in Equation (3) with high probability.
We first show that the projection 4 is sound with the following lemma.
Lemma 1. LetϕΩfollow the definition (4)whereΩ= (ω1, . . . ,ωn)is sampled from N0,1. Given
anyu,v∈Rd, we have EΩ[ϕ⊤
Ω(u)ϕΩ(v)] = exp
u⊤v/√
d
.
Proof. This is first shown in (Choromanski et al., 2021). We include as a part of our Lemma 6 in
Appendix B for completeness.
Based on this, we build the following theorem to show the correctness of our algorithm.
4
Page 5:
Published as a conference paper at ICLR 2025
Theorem 2. Let Radar be at a state of csegments where each segment contains ctokens. Without
loss of generality, assume the first segment k1:1+chas the highest attention score a1:1+cw.r.t. the
current query q. With confidence at least 1−δ, we have
ϕ⊤
Ω(q)¯ϕΩ(k1:c+1) = max
lϕ⊤
Ω(q)¯ϕΩ(kl:c+l) (7)
if the minimum gap a1:1+c−max lal:l+c≥1
cexp
ζ2/√
dq
8 log(2( c−1)/δ)
n. Here, ζis the maxi-
mum norm among kiandq, whereas l∈[1,1 +c,1 + 2 c, . . . , 1 + (c−1)c]is the starting index of
each segment.
Proof. We first build a series of essential lemmas in Appendix B. The proof of this theorem is then
shown in Appendix C.
Our theorem asserts that as long as the projection dimension nis large enough, we can correctly
retrieve the segment with the highest attention up to any precision. In fact, with the growth of
the sequence length, the difficulty of meeting this condition is not growing linearly because the
attention gap needed is O(log(c)/c)for an average segment attention score of O(1/c), suggesting
our approach is particularly favorable for long-context decoding.
3 E XPERIMENTS
In this section, we conduct extensive experiments to verify the effectiveness of Radar in comparison
with other methods on diverse tasks and models. We additionally carry out in-depth analyses and
ablation studies to demonstrate the effects of the length of prompts, the projection dimension, and
the number of selected segments.
3.1 E VALUATING LONG -CONTEXT PERPLEXITY
Datasets. Following previous work (Xiao et al., 2024; Zhang et al., 2023), we use the perplexity
on the first sample in the PG-19 dataset as the main test bed. Specifically, we evaluate the over-
all perplexity by feeding the ground-truth tokens one by one. Since PG-19 only contains natural
language, we additionally use a code sample from The Stack (Lozhkov et al., 2024) for a broader
comparison. The selected sample is an implementation of NASA’s CDF Project1in a single Python
file. To simulate the real-world use cases, we prefill the first 16,384 tokens into the model as a
prompt. The experiments without prompts are conducted in Section 3.3.
Competing methods. We consider the vanilla dot-product attention (Vaswani et al., 2017) and
StreamingLLM (Xiao et al., 2024) as the baselines in this experiment. Following the default setting
in previous work (Xiao et al., 2024; Zhang et al., 2023) the sliding window length is set to 1024 for
all runs. For our Radar, we choose the top 64 segments for each query with the random projection
dimension set to 2048. The effect of these two parameters is studied in Section 3.3. Each experiment
is conducted on a single A100 GPU with 40GB of memory.
Models. Following previous work (Xiao et al., 2024), we choose two popular architectures in
this experiment: Llama (Dubey et al., 2024) and Mistral (Jiang et al., 2023). Specifically, we use
Llama-Meta-3.1-8B andMistral-7B-v0.3 for each architecture, respectively.
Results. We show the results in Figure 2. The naive attention method achieves the best perplexity
in all runs. However, its elapsed time grows quadratically regardless of the architecture. In prac-
tice, the generation throughput drops to 10 tokens/s or even lower at the end. On the other hand,
StreamingLLM achieves a constantly high throughput at a cost of poor generation quality (measured
in perplexity), as it evicts most of the context during generation. Our method Radar, in contrast, is
able to effectively control the elapsed time while maintaining a similar perplexity by a maximum
1https://cdf.gsfc.nasa.gov/
5
Page 6:
Published as a conference paper at ICLR 2025
Figure 2: The performance comparison in perplexity (first row) and elapsed time (second row). The
lower the better for both metrics. For the Llama model, we annotate the perplexity value at the
last token; for the Mistral model, we annotate the perplexity at the maximum pre-training context
length (shown by the vertical dashed lines) because the full context exceeds its modeling ability. We
additionally show the generation throughput for all runs.
difference of around 0.2compared with the vanilla attention. Notably, the performance degener-
ation of Radar is at most 10% while the speed up is more than 2 times. The results confirm the
effectiveness of our method in effectively utilizing the context while accelerating inference.
We present additional results with H 2O (Zhang et al., 2023) and SnapKV (Li et al., 2024), two state-
of-the-art methods that employ different heuristics to reduce context length during inference, in a
separate figure in Appendix D as both methods produce out-of-scale perplexities in this setting.
3.2 L ONG -CONTEXT BENCHMARKS
In Section 3.1 above, we only evaluated the perplexity without inspecting the down-stream perfor-
mance. For a more comprehensive comparison, we additionally conduct the end-to-end evaluation
on numerous long-context tasks.
Datasets. Following previous work (Li et al., 2024), we use the LongBench dataset (Bai et al.,
2024) as the main benchmark. Specifically, we use all 16 subtasks in English and code languages.
These tasks belong to 6 categories in total: single-doc QA, multi-doc QA, summarization, few-shot
learning, synthetic, and code. Following the standard LongBench pipeline (Bai et al., 2024), we
truncate the context from the middle of the prompt length exceeds the pre-training length. Each
subtask is evaluated with a specific metric (e.g., ROUGE scores (Lin, 2004) for summarization tasks
and accuracies for synthetic tasks). After evaluating all subtasks, we report the average scores as
a summary on LongBench. It is worth noting that the metrics used by different subtasks can have
very different scales, indicating the arithmetic average over all subtasks may not reflect the overall
performance (e.g., the final average may be inaccurately dominated by the accuracy of the synthetic
tasks as pointed out by the original paper (Bai et al., 2024)). To this end, we additionally include the
average percentile over all 16 tasks ranked within the same base model. For instance, a percentile of
80% means that the method is expected to be strictly better than 80% of other methods.
Competing methods. We use all methods mentioned in Section 3.1, including H 2O (Zhang et al.,
2023) and SnapKV (Li et al., 2024). Following the previous work (Li et al., 2024), all competing
baselines can use a maximum of 32 + nctokens, where 32 is the length of the sliding window
andncis the number of middle tokens varied from 1024 to 4096. For StreamingLLM, we simply
extend the sliding window by nc. In addition to baselines used in previous work, we also evaluate
6
Page 7:
Published as a conference paper at ICLR 2025
Table 1: Performance comparisons of different methods on LongBench. On each model, the best-
performing method is highlighted in bold and the second-best method is underlined (excluding the
vanilla method). Each table contains the results for one particular model.
(a) The LongBench benchmark results of different methods applied on Llama-7b . Each prompt contains
maximum 1.5K tokens as the model can handle maximum 2K tokens.
Context MethodSingle QA Multi QA Summarization Few-show Synthetic Code
Avg. Score
Avg. Perc.NrtvQA
Qasper
MFQA
HtptQA
2WkQA
Musique
GovRep
QMSum
MulNews
TREC
TrivQA
SamSum
PsgCnt
PsgRet
LCC
RB-P
Full Vanilla 3.12 8.01 17.52 8.16 10.02 4.75 26.28 14.33 27.18 52.50 80.88 35.16 2.00 6.17 61.88 54.81 25.80 56.25
- Landmark 8.53 10.51 16.05 10.97 12.61 4.54 11.93 11.88 7.16 34.00 58.83 28.93 2.50 5.83 45.24 39.39 19.31 30.21
1024StreamingLLM 2.76 7.86 16.37 8.57 11.14 4.54 15.86 14.17 19.96 53.00 80.79 35.44 1.83 6.50 61.81 53.97 24.66 37.50
H2O 2.87 8.11 16.59 8.05 9.85 4.20 26.25 14.85 25.92 51.50 80.67 34.41 1.50 5.75 60.06 52.96 25.22 25.00
SnapKV 2.97 7.88 17.38 8.39 10.07 4.65 25.74 14.51 26.55 52.50 80.88 35.24 2.00 6.17 62.71 54.47 25.76 52.08
Radar 2.94 8.01 17.61 8.49 10.51 4.15 26.11 14.61 27.52 52.00 80.92 35.48 1.75 6.50 62.36 54.41 25.84 54.17
(b) The LongBench benchmark results of different methods applied on Llama-2-7b-chat-hf . Each
prompt contains maximum 3.5K tokens as the model can handle maximum 4K tokens.
Context MethodSingle QA Multi QA Summarization Few-shot Synthetic Code
Avg. Score
Avg. Perc.NrtvQA
Qasper
MFQA
HtptQA
2WkQA
Musique
GovRep
QMSum
MulNews
TREC
TrivQA
SamSum
PsgCnt
PsgRet
LCC
RB-P
Full Vanilla 16.87 17.78 36.13 34.06 27.53 9.10 26.09 21.01 25.99 64.00 83.84 41.24 4.50 12.00 58.36 52.31 33.18 74.38
1024SubGen 15.98 19.54 26.27 17.00 21.49 7.38 22.80 20.68 23.90 39.00 65.31 25.95 1.84 4.92 44.23 43.32 24.98 13.75
StreamingLLM 15.38 15.13 21.95 28.67 24.56 5.66 21.15 19.70 24.57 61.00 80.69 40.62 4.58 4.50 56.59 49.28 29.63 14.37
H2O 16.54 16.04 35.14 32.53 26.60 8.11 27.91 20.42 26.84 63.00 83.15 38.38 4.00 10.50 52.11 44.53 31.61 36.25
SnapKV 15.69 18.30 35.59 33.23 26.27 8.62 21.81 20.47 25.18 64.50 82.57 40.43 4.50 11.00 58.27 52.06 32.41 45.00
Radar 16.95 19.32 37.20 33.70 27.60 8.83 25.30 21.21 25.66 63.50 83.44 40.78 4.50 10.00 57.70 51.17 32.93 64.38
2048StreamingLLM 16.77 16.03 25.45 30.14 25.36 7.42 23.59 19.71 25.58 64.00 82.57 41.65 4.50 6.50 56.94 51.15 31.08 35.00
H2O 16.39 17.93 36.09 32.88 26.47 7.61 27.73 20.58 26.45 63.50 83.78 39.26 4.50 10.00 57.32 51.63 32.63 50.00
SnapKV 17.05 18.43 36.03 33.78 27.06 7.90 24.56 21.01 25.76 64.00 82.88 40.60 4.50 11.50 58.51 51.74 32.83 64.38
Radar 16.90 17.76 36.02 33.90 26.81 8.79 26.32 21.11 26.15 64.00 83.92 41.01 4.50 11.50 58.37 52.06 33.07 71.25
(c) The LongBench benchmark results of different methods applied on Mistral-7B-Instruct-v0.2 .
Each prompt contains maximum 31.5K tokens as the model can handle maximum 32K tokens.
Context MethodSingle QA Multi QA Summarization Few-shot Synthetic Code
Avg. Score
Avg. Perc.NrtvQA
Qasper
MFQA
HtptQA
2WkQA
Musique
GovRep
QMSum
MulNews
TREC
TrivQA
SamSum
PsgCnt
PsgRet
LCC
RB-P
Full Vanilla 27.06 32.22 49.61 43.49 27.96 18.85 33.35 24.30 27.13 71.00 86.07 43.50 2.80 86.98 56.17 53.63 42.76 75.63
1024StreamingLLM 21.70 18.95 32.03 32.72 21.69 11.89 23.55 20.28 25.41 64.00 84.95 42.13 3.13 22.33 54.79 49.94 33.09 8.75
SnapKV 24.97 30.03 49.51 41.35 25.82 18.92 25.85 24.09 26.24 70.00 86.53 42.04 2.83 89.06 54.33 50.96 41.41 43.75
Radar 23.81 28.26 47.91 38.96 26.17 16.08 27.99 22.52 26.95 70.50 85.94 41.70 3.13 53.33 54.17 50.03 38.59 28.75
2048StreamingLLM 22.54 22.46 35.35 33.36 23.11 13.30 26.60 20.67 26.47 66.00 85.92 42.00 2.81 26.58 55.92 51.58 34.67 18.75
SnapKv 26.10 32.14 49.31 41.71 27.60 18.83 28.90 24.53 26.65 70.50 86.28 42.97 2.78 86.27 54.50 50.87 41.87 51.87
Radar 26.31 31.89 49.55 43.18 26.83 17.47 30.38 23.30 27.18 71.50 86.15 42.49 3.39 72.25 55.76 51.87 41.22 60.00
4096StreamingLLM 24.48 29.86 41.04 37.19 24.47 14.94 30.38 21.62 26.96 69.50 86.15 42.70 2.58 39.53 56.49 52.64 37.53 34.38
SnapKV 26.31 33.61 50.35 42.67 27.89 18.74 30.73 24.24 27.00 71.00 86.25 43.14 2.60 86.10 54.19 51.22 42.25 61.25
Radar 26.31 33.77 49.17 42.85 28.68 17.91 32.15 24.17 27.11 71.00 86.23 43.41 2.87 81.75 56.42 52.92 42.30 69.38
the Landmark attention (Mohtashami & Jaggi, 2023) and SubGen (Zandieh et al., 2024). Each
experiment is conducted on one A100 GPU.
Models. We test a wide range of models in this experiment. We first choose the Llama-7b
model (Touvron et al., 2023a) because Landmark attention is fine-tuned based on it. We also have
Llama-2-7b-chat-hf , which is included in the original LongBench paper (Bai et al., 2024).
Finally, we also test with Mistral-7B-Instruct-v0.2 following the previous work (Li et al.,
2024).
7
Page 8:
Published as a conference paper at ICLR 2025
Results. The results are shown in Table 1. In all three subtables, we see that the vanilla attention
mechanism generally achieves the highest scores on all tasks, given that it has all the information
presented in the context. StreamingLLM is oftentimes the worst method because it only uses the
recent 32+ nctokens.
For the Llama-7b model (in Table 1a), we see that the Landmark attention method performs
better than the vanilla method on QA tasks. This is likely due to the additional data in fine-tuning.
However, the performance on other tasks is significantly deteriorated. For H 2O and SnapKV , we
observe similar results as in Li et al. (2024), where SnapKV is better than H 2O in both average score
and percentile. Among all the baselines, Radar achieves the highest average score and percentile
ranking.
The results on Llama-2-7b-chat-hf are shown in Table 1b. Aligned with the observation in
previous work (Li et al., 2024), SnapKV generally outperforms H 2O in this setting. SubGen, albeit
additionally saving memory, underperforms other methods on nearly all tasks. This indicates that
its emphasis on memory saving and acceleration might be at the expense of generation quality. Our
Radar achieves the highest average scores and average percentiles across all ncsettings. Notably,
Radar with 1024 middle tokens outperforms SnapKV with 2048 middle tokens in terms of average
score. The results strongly suggest the effectiveness of our method in utilizing the context.
Table 1c shows the results for Mistral-7B-Instruct-v0.2 .2Our method is consistently
better than SnapKV when the ground-truth answers require long generation (e.g., the government
report task). This not only suggests the effectiveness of Radar but also shows that Radar is utilizing
different tokens at different steps of token generation. Remarkably, Radar achieves an average score
close to the vanilla method with only 13% of the tokens used ( nc= 4096 ).
3.3 I N-DEPTH ANALYSES
Generating without prompts. In Section 3.1, we prefill 16,384 tokens for all runs to simulate
the usage of prompts in real-world applications. However, some previous work are focused on non-
conditional generation, where no prompt is provided. To compare the performance in this setting, we
conduct experiments similar to Section 3.1 without the prompts. The results are shown in Figure 3.
Note that we cannot compare with SnapKV because it is only applied to prompts.
Figure 3: Generation without prompts.
As observed, H 2O demonstrates a promising result on the Llama model—the overall perplexity is
only worse than Radar by 0.05. However, the perplexity of the Mistral model starts to drastically
deteriorate after hundreds of tokens. On the contrary, Radar continues to perform steadily. This
experiment suggests the versatility of Radar, offering acceleration with or without prompts on a
wide range of model architectures.
The effect of nandk.Our method introduces hyper-parameters n(the projection dimension for
the random matrix) and k(the number of top segments selected). We show the effect of these two
hyper-parameters with the Figure 4a and Figure 4b, respectively.
Aligned with our theoretical prediction with Theorem 2, increasing nimproves the performance
by providing better attention approximation. Similarly, increasing kalso improves the generation
2We skip the H 2O method in this setting because it runs out of 40GB memory with prompts of 31.5K tokens.
8
Page 9:
Published as a conference paper at ICLR 2025
(a) The effect of the hyper-parameter n.
(b) The effect of the hyper-parameter k.
Figure 4: The effect of the hyper-parameters n(the projection dimension for the random matrix)
andk(the number of top segments selected) introduced by Radar.
quality by using more tokens. However, a high norkimposes the need for more memory and
increases the computational overhead. Given these considerations, we use n= 2048 andk= 64 by
default to balance the generation quality and hardware efficiency.
Ablation study. Our algorithm involves an approximated top- kselection. To verify that such an
approximation is functioning as intended, we conduct three ablation studies in Figure 5 by replacing
it with three other strategies. The experiments are conduced with the Llama model on the PG-19
sample.
Figure 5: Ablation studies. Here, we compare with three different segment selection strategies.
As shown, when we select segments with the lowest approximated segment attention scores (left),
the performance becomes similar to StreamingLLM, suggesting that the selected tokens with low
segment attention scores are indeed not informative. In addition, our approximation is better than
the random selection strategy (middle), showing that such an approximation is choosing more infor-
mative tokens than the “uneducated guess”. Lastly, we see that our approximation has the closest
performance compared with the exact segment search (right), indicating our method, albeit poten-
tially missing some tokens, is a reasonable approximation given its low time complexity.
4 R ELATED WORK
Context reduction. Previous literature shows that Transformers with sparse attention patterns can
generate with fewer context tokens while not sacrificing the generation quality (Child et al., 2019;
Beltagy et al., 2020; Zaheer et al., 2020; Roy et al., 2020; Chen et al., 2021). In addition to the
success of sparse attention architectures, researchers also empirically demonstrate that the vanilla
Transformers do not need all tokens in the context to correctly generate the current token (O’Connor
& Andreas, 2021; Sun et al., 2021; Liu et al., 2024). Our work entails this discovery.
Based on the discovery, there are some closely related methods proposed Han et al. (2023). For
instance, StreamingLLM (Xiao et al., 2024) uses only the local sliding window along with a few
9
Page 10:
Published as a conference paper at ICLR 2025
beginning tokens to generate the current token. H 2O (Zhang et al., 2023) proposes to additionally
retain middle tokens based on a scoring function to maximize hit rate. SnapKV (Li et al., 2024)
proposes to retain middle tokens based on the attention pooling in a parallel manner. All these meth-
ods, albeit reducing the context length, cannot recover the lost information once the context token
is evicted. Moreover, they still suffer from the quadratic time complexity. By contrast, our method
is able to dynamically retrieve any tokens in the context with a low time complexity. Recently,
Zandieh et al. (2024) propose to compress the KV cache for memory and time efficiency, posing an
opportunity for better algorithm designs.
Efficient attention. The quadratic computing time of the dot-product attention becomes a bottle-
neck in long-context generation. Searching for an efficient design of the attention mechanisms is a
long-standing topic for Transformers (Wang et al., 2020; Kitaev et al., 2020). Recently, Landmark
Attention (Mohtashami & Jaggi, 2023) proposes to use grouped softmax with a special token to
reduce computation. Another line of work reduces the time complexity is through attention ker-
nelization, which reformulates the dot-product attention mechanism as a kernel (Tsai et al., 2019;
Katharopoulos et al., 2020; Peng et al., 2021; Zandieh et al., 2023). These methods can achieve lin-
ear complexity in both training and inference by recurrence. However, the parameters are required
to be trained/fine-tuned to fit the new architectures. On the contrary, our method is training-free and
can be applied to any pre-trained Transformers.
Instead of modifying the vanilla attention, recent literature also focuses on improving the efficiency
in implementation. For example, Flash Attention series (Dao et al., 2022; Dao, 2023) leverage the
tiling technique to avoid memory-bound attention operations. Memory-efficient attention (Rabe &
Staats, 2021) reorders the attention calculation to allocate only the constant space for any context
length. Our work is orthogonal to these methods and can be combined together.
Long-context ability via retrieval. In certain cases, Transformers can gain the long-context abil-
ity by retrieving related documents like retrieval-augmented generation (Lewis et al., 2020) based
onkNN search. For instance, Memorizing Transformer (Wu et al., 2022) recurrently caches previ-
ous tokens in a kNN index. Unlimiformer (Bertsch et al., 2023) uses a kNN index to retrieve the
encoded chunks in the encoder-decoder attention. However, these methods can only be used with
the specialized architectures after training. They may also suffer from the exponential search quality
degradation when the dimension increases (Beyer et al., 1999). Unlike these methods, Radar is a
training-free method with a rigorous performance guarantee independent of the dimension.
5 C ONCLUSION
In this work, we present Radar, a general and effective approach to accelerate inference for Trans-
formers. Unlike previous methods that train other variants of the attention mechanism for replace-
ment, Radar can accelerate any pre-trained Transformers without the need for training. In addition,
we provide theoretical justifications that show Radar can reliably identify the most important tokens
with high probability. More importantly, our empirical results across a wide range of tasks and
models confirm the effectiveness of our method. Compared with prior training-free methods that
evict tokens based on heuristics, Radar utilizes context tokens dynamically in a principled manner,
resulting in significantly higher generation quality while achieving a lower time complexity.
We hope that Radar opens up opportunities for pre-trained Transformers to be applied out-of-the-
box, without requiring expensive re-training, to long context settings (e.g. legal documents or finan-
cial transactions). We also hope our method further inspires advancements in high-quality generation
with low time complexities for LLMs.
ETHICS STATEMENT
Our research focuses on accelerating pre-trained Transformers to reduce power and time consump-
tion in inference. As far as we are aware, it does not pose any new ethical or societal risks that
require specific attention.
10
Page 11:
Published as a conference paper at ICLR 2025
REPRODUCIBILITY STATEMENT
In our paper, all of our models and datasets are publicly accessible. For the models, we download
them from HuggingFace’s hub using the following identifiers:
•meta-llama/Meta-Llama-3.1-8B
•meta-llama/Llama-2-7b-chat-hf
•huggyllama/Llama-7b
•mistralai/Mistral-7B-v0.3
•mistralai/Mistral-7B-Instruct-v0.2
For the datasets, we follow the previous work and similarly obtain them from HuggingFace:
•THUDM/LongBench
•emozilla/pg19-test
•bigcode/the-stack-smol
We run all of our experiments on A100 GPUs.
The procedure of Radar is explained in Algorithm 1 with the theoretical justification in Theorem 2.
REFERENCES
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit
Sanghai. GQA: Training generalized multi-query transformer models from multi-head check-
points. In EMNLP , 2023. URL https://openreview.net/forum?id=hmOwOZWzYE .
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du,
Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilin-
gual, multitask benchmark for long context understanding. In ACL, pp. 3119–3137, 2024. URL
https://aclanthology.org/2024.acl-long.172 .
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
arXiv:2004.05150 , 2020. URL https://arxiv.org/abs/2004.05150 .
Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range
transformers with unlimited length input. In Thirty-seventh Conference on Neural Information
Processing Systems , 2023. URL https://openreview.net/forum?id=lJWUJWLCJo .
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When Is “Nearest
Neighbor” Meaningful? , pp. 217–235. Springer Berlin Heidelberg, 1999. doi: 10.1007/
3-540-49257-7 15. URL http://link.springer.com/content/pdf/10.1007/
3-540-49257-7_15 .
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV , pp. 213–229, 2020.
URLhttps://arxiv.org/abs/2005.12872 .
Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan
Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip
Krishnan. Muse: Text-to-image generation via masked generative transformers. In ICML , 2023.
URLhttps://arxiv.org/abs/2301.00704 .
Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, A. Rudra, and C. R ´e. Scatterbrain: Unifying sparse
and low-rank attention approximation. 2021. URL https://arxiv.org/abs/2110.
15343 .
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509 , 2019. URL https://arxiv.org/abs/
1904.10509 .
11
Page 12:
Published as a conference paper at ICLR 2025
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea
Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser,
David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with per-
formers. In ICLR , 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH .
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv
preprint arXiv:2307.08691 , 2023. URL https://arxiv.org/abs/2307.08691 .
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory-
efficient exact attention with IO-awareness. NeurIPS , 35:16344–16359, 2022. URL https:
//arxiv.org/abs/2205.14135 .
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In NAACL , 2019. URL https:
//arxiv.org/abs/1810.04805 .
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image
generation via hierarchical transformers. In NeurIPS , 2022. URL https://openreview.
net/forum?id=GkDbQb6qu_r .
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszko-
reit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at
scale. In ICLR , 2021. URL https://openreview.net/forum?id=YicbFdNTTy .
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and et al. The llama 3 herd of
models. arXiv preprint arXiv:2407.21783 , 2024. URL https://arxiv.org/abs/2407.
21783 .
Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with appli-
cations. Electronic Journal of Probability , 20:1 – 22, 2015. URL https://doi.org/10.
1214/EJP.v20-3496 .
Qingkai Fang, Yan Zhou, and Yang Feng. DASpeech: Directed acyclic transformer for fast and
high-quality speech-to-speech translation. In NeurIPS , 2023. URL https://openreview.
net/forum?id=JvYSSPtQyk .
Insu Han, Rajesh Jayaram, Amin Karbasi, V . Mirrokni, David P. Woodruff, and A. Zandieh. Hy-
perattention: Long-context attention in near-linear time. ICML , 2023. URL https://arxiv.
org/abs/2310.05869 .
Ali Hassani, Steven Walton, Jiachen Li, Abulikemu Abuduweili, Humphrey Shi, and Dacheng Tao.
Regionvit: Regional-to-local attention for vision transformers. In ICLR , 2022. URL https:
//arxiv.org/abs/2106.02689 .
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap-
lot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier,
et al. Mistral 7b. arXiv preprint arXiv:2310.06825 , 2023. URL https://arxiv.org/abs/
2310.06825 .
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ¸ois Fleuret. Transformers are
rnns: Fast autoregressive transformers with linear attention. In ICML , pp. 5156–5165, 2020.
URL https://arxiv.org/abs/2006.16236 .
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR ,
2020. URL https://openreview.net/forum?id=rkgNKkHtvB .
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Mu ˜noz Ferrandis,
Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von
Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code. Preprint , 2022.
URL https://arxiv.org/abs/2211.15533 .
12
Page 13:
Published as a conference paper at ICLR 2025
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich K ¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented gen-
eration for knowledge-intensive nlp tasks. NeurIPS , 33:9459–9474, 2020. URL https:
//arxiv.org/abs/2005.11401 .
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle
Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before
generation. NeruIPS , 2024. URL https://arxiv.org/abs/2404.14469 .
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization
Branches Out , pp. 74–81. Association for Computational Linguistics, July 2004. URL https:
//aclanthology.org/W04-1013 .
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and
Percy Liang. Lost in the middle: How language models use long contexts. TACL , 12:157–173,
2024. URL https://arxiv.org/abs/2307.03172 .
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV , pp. 10012–
10022, 2021. URL https://arxiv.org/abs/2103.14030 .
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Noua-
mane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, De-
nis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov,
Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo,
Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yix-
uan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xian-
gru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank
Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Can-
wen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Car-
olyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Car-
los Mu ˜noz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von
Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024. URL
https://arxiv.org/abs/2402.19173 .
Amirkeivan Mohtashami and Martin Jaggi. Random-access infinite context length for transformers.
InNeurIPS , 2023. URL https://openreview.net/forum?id=7eHn64wOVy .
Joe O’Connor and Jacob Andreas. What context features can transformer language models
use? In ACL, pp. 851–864, August 2021. URL https://aclanthology.org/2021.
acl-long.70 .
Bo Peng, Eric Alcaide, Quentin Gregory Anthony, Alon Albalak, Samuel Arcadinho, Stella Bider-
man, Huanqi Cao, Xin Cheng, Michael Nguyen Chung, Leon Derczynski, Xingjian Du, Matteo
Grella, Kranthi Kiran GV , Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming
Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, At-
sushi Saito, Guangyu Song, Xiangru Tang, Johan S. Wind, Stanisław Wo ´zniak, Zhenyuan Zhang,
Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the transformer era.
InEMNLP , 2023. URL https://openreview.net/forum?id=7SaXczaBpG .
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong.
Random feature attention. In International Conference on Learning Representations , 2021. URL
https://openreview.net/forum?id=QtTKTdVrFBB .
Markus N. Rabe and Charles Staats. Self-attention does not need o(n2)memory. arXiv preprint
arXiv: 2112.05682 , 2021. URL https://arxiv.org/abs/2112.05682 .
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. JMLR , 21(140):1–67, 2020. URL https://arxiv.org/abs/1910.10683 .
Aurko Roy, M. Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention
with routing transformers. TACL , 2020. doi: 10.1162/tacl a00353.
13
Page 14:
Published as a conference paper at ICLR 2025
Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Representational strengths and limitations of
transformers. In Thirty-seventh Conference on Neural Information Processing Systems , 2023.
URLhttps://openreview.net/forum?id=36DxONZ9bA .
Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, and Mohit Iyyer. Do long-range language
models actually use long-range context? In EMNLP , pp. 807–822, 2021. URL https://
aclanthology.org/2021.emnlp-main.62 .
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu,
and Angela Fan. Multilingual translation with extensible multilingual pretraining and finetuning.
arXiv preprint arXiv:2008.00401 , 2020. URL https://arxiv.org/abs/2008.00401 .
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu
Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient trans-
formers. In ICLR , 2021. URL https://openreview.net/forum?id=qVyeW-grC2k .
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee
Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971 , 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foun-
dation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023b. URL https:
//arxiv.org/abs/2307.09288 .
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan
Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via
the lens of kernel. In EMNLP-IJCNLP , pp. 4344–4353. Association for Computational Linguis-
tics, 2019. URL https://aclanthology.org/D19-1443 .
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS , volume 30,
2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf .
Apoorv Vyas, Angelos Katharopoulos, and Franc ¸ois Fleuret. Fast transformers with clustered atten-
tion. NeurISP , 33:21665–21674, 2020. URL https://arxiv.org/abs/2007.04825 .
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention
with linear complexity. arXiv preprint arXiv:2006.04768 , 2020. URL https://arxiv.org/
abs/2006.04768 .
Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing trans-
formers. In ICLR , 2022. URL https://openreview.net/forum?id=TrjbxzRcnf- .
Zijun Wu, Bingyuan Liu, Ran Yan, Lei Chen, and Thomas Delteil. Reducing distraction in long-
context language models by focused learning. arXiv preprint arXiv: 2411.05928 , 2024. URL
https://arxiv.org/abs/2411.05928 .
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming
language models with attention sinks. In ICLR , 2024. URL https://openreview.net/
forum?id=NG7sS51zVF .
Tongkun Xu, Weihua Chen, Pichao WANG, Fan Wang, Hao Li, and Rong Jin. CDTrans: Cross-
domain transformer for unsupervised domain adaptation. In ICLR , 2022. URL https:
//openreview.net/forum?id=XGzk5OKWFFc .
Shaowei Yao and Xiaojun Wan. Multimodal transformer for multimodal machine translation.
InACL, pp. 4346–4350, 2020. doi: 10.18653/v1/2020.acl-main.400. URL https://
aclanthology.org/2020.acl-main.400 .
14
Page 15:
Published as a conference paper at ICLR 2025
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers
for longer sequences. NeurIPS , 33:17283–17297, 2020. URL https://arxiv.org/abs/
2007.14062 .
Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. KDEformer: Accelerating Trans-
formers via kernel density estimation. In ICML , pp. 40605–40623, 2023. URL https:
//proceedings.mlr.press/v202/zandieh23a.html .
Amir Zandieh, Insu Han, Vahab Mirrokni, and Amin Karbasi. Subgen: Token generation in sublin-
ear time and memory. arXiv preprint arXiv: 2402.06082 , 2024. URL https://arxiv.org/
abs/2402.06082 .
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song,
Yuandong Tian, Christopher R ´e, Clark Barrett, et al. H 2o: Heavy-hitter oracle for efficient gen-
erative inference of large language models. NeruIPS , 36, 2023. URL https://arxiv.org/
abs/2306.14048 .
Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. Biformer: Vision
transformer with bi-level routing attention. In CVPR , pp. 10323–10333, 2023. URL https:
//arxiv.org/abs/2303.08810 .
15
Page 16:
Published as a conference paper at ICLR 2025
A C OMPLETE ALGORITHM
We provide the complete algorithm as follows.
Algorithm 1 The overall algorithm of Radar
Require: head dimension d∈N, projection dimension n∈N, number of segment selection k∈N
▷Initialize the Radar state
1:Current context length t←0
2:Current segment range c←0
3:Unregistered buffer W← ∅
4:Sample Ω= (ω1, . . . ,ωn)fromN0,1
5:Obtain the projection function ϕΩaccording to Equation (4)
▷Decoding procedure
6:while taking a new token with q,kandvdo
7: t←t+ 1
8: if√
t∈Nthen
9: ▷Restructuring process descried in Section 2.2.
10: c←√
t
11: Pre-calculate ¯ϕ(kl:l+c)forl∈[1,1 +c,1 + 2 c, . . . , 1 + (c−1)c] ▷O(t)
12: W← ∅
13: else
14: W←WS{t} ▷Append to the buffer otherwise
15: end if
▷Querying procedure in O(√
t)
16: forl∈[1,1 +c,1 + 2 c, . . . , 1 + (c−1)c]do
17: al:l+c←ϕ⊤
Ω(q)¯ϕΩ(kl:l+c)▷Calculate segment attentions according to Equation (6)
18: end for
19: Pick indices Sin the ksegments with the highest al:l+c ▷O(√
t)
20: S←SSW ▷Use the buffer as the sliding window
21: yield softmax( q⊤kS/√
d)vS ▷O(k√
t)
22:end while
In the reconstruction step (lines 9-12), the time complexity is O(t)because we have csegments with
each containing ctokens. Since each segment representation ¯ϕtakes O(c)to construct, line 11 takes
at mostPc
i=1O(c) =O(c2)time. This is equivalent to O(t)because c=√
tin line 10. As we will
restructure the hierarchical approximation by at most O(√
t)times (given by the condition in line
8), the overall time complexity of all restructuring steps (considering the outer loop containing lines
8-11) is O(t3/2). When amortized to all tsteps, each step will take O(√
t)on average.
For the query step (lines 16-18), it is straightforward to see the per-step time complexity is O(√
t)
because each time step computes O(√
t)segment attention scores.
The final attention step (line 21) takes O(k√
t)time because there are ksegments selected, each
withO(√
t)tokens.
Overall, we can see the per-step time complexity of Radar grows in O(√
t).
B E SSENTIAL LEMMAS
Lemma 3 (adapted from (Fan et al., 2015)) .LetX1, . . . , X nbe i.i.d. random variables such that
E[Xi] = 0 ,Xi≤b. For every ε≥0, we have
Pr1
nnX
i=1Xi≥ε
≤exp
−nε2
2c2
(8)
andPr
−1
nnX
i=1Xi≥ε
≤exp
−nε2
2c2
. (9)
16
Page 17:
Published as a conference paper at ICLR 2025
Here,
c2:=(
E[X2
i], ifE[X2
i]≥b2,
1
4
b+E[X2
i]
b2
,otherwise.(10)
Proof. Please refer to Corollary 2.7 in (Fan et al., 2015).
Lemma 4. LetY1, . . . , Y nbe i.i.d. random variables such that Yi≥0,E[Yi] =µ > 0, and
Var[Yi] =σ2. We have
Pr
µ−1
nnX
i=1Yi
≥ε
≤exp
−nε2
2 max( µ2, σ2)
(11)
andPr1
nnX
i=1Yi−µ
≥ε
≤exp
−nε2
2 max( µ2, σ2)
(12)
for every ε≥0.
Proof. LetXi:=µ−Yi. It is easy to verify that
1.E[Xi] =E[µ−Yi] =µ−E[Yi] = 0 , and
2.Xi=µ−Yi≤µ, and
3.E[X2
i] =E[Y2
i]−µ2= Var[ Yi] =σ2.
We can then apply Lemma 3 on X1, . . . , X nwithb=µ >0, which gives us
c2:=(
E[X2
i] =σ2, ifσ2≥µ2>0
1
4
µ+E[X2
i]
µ2
≤µ2,otherwise ,(13)
summarized as 0< c2≤max( σ2, µ2). We can then obtain the proof by noticing that
exp
−(nε2)/c2
is monotonically increasing with c2>0.
Definition 5. For any u∈Rd, we define a transformation function fu:Rd→Rasfu(ω) :=
exp
ω⊤u− ∥u∥2/2
.
Lemma 6. Given any u,v,z∈Rdandωas a random variable following ω∼ N 0,I∈Rd, we
have the following properties for f:
1.fu(ω)≥0, and
2.E[fu(ω)] = 1 , and
3.E[fu(ω)fv(ω)] = exp
u⊤v
, and
4.Var[fu(ω)fv(ω)] = exp
2u⊤v
exp
∥u+v∥2
−1
.
5.Cov ( fu(ω)fz(ω), fv(ω)fz(ω)) = exp
(u+v)⊤z
exp
(u+z)⊤(v+z)
−1
.
Proof. We prove each property as follows:
1. It is obvious that exp(x)≥0for all x∈R.
2.
E[fu(ω)] :=E
exp
ω⊤u− ∥u∥2/2
(14)
=Z
Pr(ω) exp
ω⊤u− ∥u∥2/2
dω (15)
17
Page 18:
Published as a conference paper at ICLR 2025
=(2π)−d/2Z
exp
−∥ω∥2/2
exp
ω⊤u− ∥u∥2/2
dω (16)
=(2π)−d/2Z
exp
ω⊤u− ∥u∥2/2− ∥ω∥2/2
dω (17)
=(2π)−d/2Z
exp
−∥ω−u∥2/2
dω (18)
= Pr
Rd
(19)
=1. (20)
3. This is first shown in Choromanski et al. (2021). We present it here for completeness.
E[fu(ω)fv(ω)] =E
exp
ω⊤u− ∥u∥2/2
exp
ω⊤v− ∥v∥2/2
(21)
=E
exp
ω⊤(u+v)− ∥u+v∥2/2
exp
u⊤v
(22)
=E[fu+v(ω)] exp
u⊤v
(23)
= exp
u⊤v
, (24)
where the last line is from the second property. Notice that ϕΩin the main text scales both
uandvby1/4√
dbefore projection, resulting in EΩ[ϕΩ(u)⊤ϕΩ(v)] = exp
u⊤v/√
d
.
4. We first show that
E[(fu(ω)fv(ω))2] (25)
=E
exp
2ω⊤(u+v)− ∥u+v∥2
exp
2u⊤v
(26)
=E
exp
2ω⊤(u+v)− ∥2u+ 2v∥2/2
exp
2u⊤v+∥u+v∥2
(27)
=E[f2u+2u(ω)] exp
2u⊤v+∥u+v∥2
(28)
= exp
2u⊤v+∥u+v∥2
, (29)
where the last line is from the second property. It is then clear that
Var[fu(ω)fv(ω)] =E[(fu(ω)fv(ω))2]−(E[fu(ω)fv(ω)])2(30)
= exp
2u⊤v+∥u+v∥2
−exp
2u⊤v
(31)
= exp
2u⊤v
exp
∥u+v∥2
−1
. (32)
5. More generally,
Cov ( fu(ω)fz(ω), fv(ω)fz(ω)) (33)
=E[fu(ω)fz(ω)fv(ω)fz(ω)]−E[fu(ω)fz(ω)]E[fv(ω)fz(ω)]. (34)
The first term is
E[fu(ω)fz(ω)fv(ω)fz(ω)] (35)
=E
exp
ω⊤(u+z)− ∥u+z∥2/2
exp
u⊤z
(36)
exp
ω⊤(v+z)− ∥v+z∥2/2
exp(v⊤z)
(37)
=E[fu+z(ω)fv+z(ω)] exp
u⊤z+v⊤z
(38)
= exp
(u+z)⊤(v+z)
exp
u⊤z+v⊤z
. (39)
The second term is
E[fu(ω)fz(ω)]E[fv(ω)fz(ω)] = exp
u⊤z+v⊤z
. (40)
Overall,
Cov ( fu(ω)fz(ω), fv(ω)fz(ω)) =
exp
(u+z)⊤(v+z)
−1
exp
(u+v)⊤z
.
(41)
18
Page 19:
Published as a conference paper at ICLR 2025
We can thus conclude our proofs.
Definition 7 (Random feature approximation) .Since Lemma 6 shows that Eω∼N0,I[fu(ω)fv(ω)]
= exp
u⊤v
, we can approximate exp
u⊤v
by sampling a random variable S(n)
u,v, defined as
S(n)
u,v:=1
nnX
i=1fu(ωi)fv(ωi) (42)
forωi∼ N 0,I. This random variable corresponds to the dot-product between two features using
Equation (4).
Lemma 8. Given any ui,vi,z∈Rdfori∈[1, c]. Without loss of generality, we assumePc
i=1exp
u⊤
iz
≤Pc
i=1exp
v⊤
iz
. Let ε:=1
cPc
i=1
exp
v⊤
iz
−exp
u⊤
iz
andζ:=
max{∥ui∥,∥vi∥,∥z∥}. We havePc
i=1S(n)
ui,z≤Pc
i=1S(n)
vi,zwith a high confidence.
Proof. If there is a constant α∈[0,1]such that we have 0≤1
cPc
i=1exp
v⊤
iz
−1
cPc
i=1S(n)
vi,z≤
αεand0≤1
cPc
i=1S(n)
ui,z−1
cPc
i=1exp
u⊤
iz
≤(1−α)εat the same time, we can show
cX
i=1S(n)
u,z≤cX
i=1S(n)
v,z⇐⇒cX
i=1
exp
v⊤z
−S(n)
v,z+S(n)
u,z−exp
u⊤z
≤cε. (43)
Conversely, if1
cPc
i=1S(n)
ui,z>1
cPc
i=1S(n)
vi,z, we have eitherPc
i=1exp
v⊤
iz
−Pc
i=1S(n)
vi,z> αcε
orPc
i=1S(n)
ui,z−Pc
i=1exp
u⊤
iz
>(1−α)cεfor the α. For the first case, we have
Pr cX
i=1exp
v⊤
iz
−cX
i=1S(n)
vi,z≥αcε!
(44)
≤exp
−nα2ε2c2
2 max
(Pc
i=1µi)2,Pi=c,j=c
i=1,j=1Cov(S(n)
vi,z, S(n)
vj,z)
(45)
≤exp
−nα2ε2c2
2Pi=c,j=c
i=1,j=1Cov(S(n)
vi,z, S(n)
vj,z) +µiµj!
(46)
= exp
−nα2ε2c2
2Pi=c,j=c
i=1,j=1exp((vi+z)⊤(vj+z)) exp
v⊤
iz+v⊤
jz!
(47)
≤exp
−nα2c2ε2
2(Pi=c
i=1exp(viz))2exp(2 ζ2)!
(48)
=: exp
−nα2ε2c2
2β2vexp(2 ζ2)
. (49)
Here, we define βv:=Pc
i=1exp(viz)for simplicity. Similarly, we have for the second case:
Pr cX
i=1S(n)
ui,z−cX
i=1exp
u⊤
iz
≥(1−α)cε!
(50)
≤exp
−n(1−α)2c2ε2
2 max
(Pc
i=1µi)2,Pi=c,j=c
i=1,j=1Cov(S(n)
ui,z, S(n)
uj,z)
(51)
≤exp
−n(1−α)2c2ε2
2Pi=c,j=c
i=1,j=1exp((ui+z)⊤(uj+z)) exp
u⊤
iz+u⊤
jz!
(52)
≤exp
−n(1−α)2c2ε2
2(Pi=c
i=1exp(uiz))2exp(2 ζ2)!
(53)
19
Page 20:
Published as a conference paper at ICLR 2025
=: exp
−n(1−α)2c2ε2
2β2uexp(2 ζ2)
. (54)
We choose α:=βv/(βv+βu), which means
exp
−nα2ε2c2
2 exp(2 ζ2)β2v
= exp
−nc2ε2
2 exp(2 ζ2)(βv+βu)2
(55)
and
exp
−n(1−α)2c2ε2
2 exp(2 ζ2)β2u
= exp
−nc2ε2
2 exp(2 ζ2)(βv+βu)2
. (56)
Putting all together, we have
PrcX
i=1S(n)
ui,z≥cX
i=1S(n)
vi,z
(57)
≤Pr cX
i=1
exp
v⊤
iz
−S(n)
vi,z
≥αcε!
+ Pr cX
i=1
S(n)
ui,z−exp
u⊤
iz
≥(1−α)cε!
(58)
≤2 exp
−nc2ε2
2 exp(2 ζ2)(βv+βu)2
(59)
=:2 exp
−√n(a′
v−a′
u)√
2 exp( ζ2)2!
, (60)
where we use a′
v:=βv/(βv+βu)anda′
u:=βu/(βv+βu)for simplicity.
Lemma 9. Given any ui,vi,z∈Rdfori∈[1, c]. Without loss of gen-
erality, we assumePc
i=1exp
u⊤
iz/√
d
≤Pc
i=1exp
v⊤
iz/√
d
. Let ε:=
1
cPc
i=1
exp
v⊤
iz/√
d
−exp
u⊤
iz/√
d
andζ:= max {∥ui∥,∥vi∥,∥z∥}. We have
Pc
i=1S(n)
ui,z≤Pc
i=1S(n)
vi,zwith confidence at least
1−2 exp
−
√n(a′
v−a′
u)
√
2 exp
ζ2/√
d
2
. (61)
Proof. This is a direct application of Lemma 8 by scaling vi,ui, andzin Lemma 8 by 1/4√
d.
C T HE PROOF OF THEOREM 2
Now we provide the proof for Theorem 2 in the main text.
Theorem 2. Let Radar be at a state of csegments where each segment contains ctokens. Without
loss of generality, assume the first segment k1:1+chas the highest attention score a1:1+cw.r.t. the
current query q. With confidence at least 1−δ, we have
ϕ⊤
Ω(q)¯ϕΩ(k1:c+1) = max
lϕ⊤
Ω(q)¯ϕΩ(kl:c+l) (7)
if the minimum gap a1:1+c−max lal:l+c≥1
cexp
ζ2/√
dq
8 log(2( c−1)/δ)
n. Here, ζis the maxi-
mum norm among kiandq, whereas l∈[1,1 +c,1 + 2 c, . . . , 1 + (c−1)c]is the starting index of
each segment.
20
Page 21:
Published as a conference paper at ICLR 2025
Proof. Letβidenote the summation of the unormalized attention weights in ith segmentPc
j=1exp
q⊤kci+j/√
d
. We know that β1≥βifor all i > 1by our assumption. Let
εi:= (β1−βi)/cbe the difference between the first and the ith segment. According to Lemma 9,
the probability of confusion between segment 1andihas
δi≤2 exp
−
√n(a′
1−a′
i)
√
2 exp
ζ2/√
d
2
, (62)
where a′
i:=βi/(β1+βi). The real attention score for ith segment is ai=βi/Pc
j=1βj, meaning
δi≤2 exp
−
√nεi√
2 exp
ζ2/√
d
(a1+ai)
2
≤2 exp
−
√nε
√
2 exp
ζ2/√
d
(2a1)
2
,
(63)
where ε:= min iεiis the minimum gap.
The probability of making at least one confusion is
δ≤cX
i=2δi≤2(c−1) exp
−
√nε
√
2 exp
ζ2/√
d
(2a1)
2
. (64)
This is equivalent to
ε≤a1exp
ζ2/√
dr
8 log(2( c−1)/δ)
n, (65)
which concludes the proof by taking the contraposition and noticing that a1≥1/c.
D A DDITIONAL EXPERIMENTS
In Section 3.1, we mainly test our methods against StreamingLLM (Xiao et al., 2024) and the vanilla
attention (Vaswani et al., 2017). In fact, we also consider H 2O (Zhang et al., 2023) and SnapKV (Li
et al., 2024) as our baselines. Following the same settings as in Section 3.1, we show their perfor-
mance in Figure 6.
Figure 6: Failures of H 2O and SnapKV on Mistral.
As observed, both methods suffer from the considerable degeneration under this setting.3The per-
formance of H 2O, similar to its non-conditional generation with Mistral models (in Section 3.3), has
3It should be pointed out that we do not consider their failures are due to our replication, because we
successfully reproduce their results under the settings in their papers (e.g., Section 3.2 and Section 3.3).
21
Page 22:
Published as a conference paper at ICLR 2025
a extremely high perplexity, indicating its unsuitability beyond a limited model architectures like
Llama. SnapKV , albeit having a low complexity at the beginning, shows a considerable loss degra-
dation during generation. This aligns with our observations in Section 3.2 where SnapKV generally
underperforms Radar when the ground-truth generation is long. The results also suggest that Radar
generalizes better than other baselines. We hypothesize that this is because methods like H2O and
SnapKV heuristically use the accumulated attention scores as the indicator of the importance. The
heuristics may lead to evicting the current unimportant tokens which will be very useful later for new
tokens, especially when each head is responsible for multiple query spaces (e.g., grouped-query at-
tention (Ainslie et al., 2023) used in Llama 3 and Mistral models). Evidently, we observed that both
methods tend to perform worse on these models. By contrast, Radar is based on a proven approxi-
mation of the vanilla attention. The error bound is well-established in our Theorem 2 regardless of
the model architectures and weights, making it more versatile in applications.
E V ISUALIZING THE APPROXIMATION
In Theorem 2, we theoretically show the correctness of our method. To provide an intuition for the
approximation, we visualized the Radar approximation using Llama 3 on the 100 tokens (after 1
sink token) of PG-19 partitioned in 10 segments. The results are displayed in Figure 7.
(a) Head 1.
(b) Head 20.
(c) Head 30.
Figure 7: Visualization of the vanilla token-level attention (top), the corresponding segment attention
(middle), and the approximated segment attention by Radar (bottom). Here we use three heads in the
first layer as an example. The dark blue and red represent low and high attention scores, respectively.
It is straightforward to see that Radar (the bottom row) provides a reasonable approximation of the
real segment attention (the middle row). Notably, Radar not only captures the top segments but
also correctly flags the runner-up segments, suggesting its ability in identifying multiple important
segments simultaneously. Empirically, we found that 34.38% of Radar’s approximation success-
fully flags the top segment among the 10 segments with k= 1. When we increase k= 3, the
successful rate is 62.5%. These are considerable higher than other segmental selection strategies
based on heuristics. For example, by selecting the most recent segments (i.e., extending the sliding
window for StreamingLLM (Xiao et al., 2024)), the corresponding rates are 18.75% and 46.88%.
The random selection strategy is expected to achieve 10% and 30%. Both methods underperform
Radar significantly.
F D ISCUSSION ON FUTURE WORK
Non-contiguous segmenting strategies. In this work, we propose to use contiguous tokens to
construct a segment. This is generally a safe practice because the important tokens are sparse in
a sequence (as visualized in Figure 7 and previous work like StreamingLLM Xiao et al. (2024)).
In addition, by selecting the surrounding tokens around the important ones, we could provide the
context for the model to correctly identify the semantics for these tokens. Our design could natu-
rally achieve this goal, while other methods like SnapKV (Li et al., 2024) need to explicitly apply
22
Page 23:
Published as a conference paper at ICLR 2025
smoothing techniques to encourage the surrounding selection. Nevertheless, we consider that there
could exist better segmenting mechanisms for Radar by developing an adaptive method. We hope
this paper serves as the foundation for the future exploration of this direction.
Memory-efficient Radar. Our method mainly focuses on accelerating the decoding efficiency
of modern Transformer models. However, Radar does not save runtime memory. In fact, it uses
marginally more memory (scaled at a O(√
t)rate) to maintain the segment representations. There-
fore, we consider the memory-efficient version of Radar as a promising future direction. For ex-
ample, one could exploit the hierarchical structure of Radar to align with the memory hierarchy by
off-loading. By doing this, we hope Radar could match the effective memory saving with previous
memory-efficient decoding algorithms (Zhang et al., 2023; Li et al., 2024; Zandieh et al., 2024).
Training with Radar. Our method is designed as a training-free acceleration algorithm to re-
duce the training cost for broader users. However, we expect training with this kind of hierarchical
approximation could yield better results. We leave this to future exploration.
23