Authors: Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang
arXiv:2503.10432v1 [cs.LG] 13 Mar 2025
BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models
Can Zheng1, Jiguang He2, Guofa Cai3, Zitong Yu2, Chung G. Kang1
1School of Electrical Engineering, Korea University, Seoul, Republic of Korea
2School of Computing and Information Technology, Great Bay University, Dongguan 523000, China
3School of Information Engineering, Guangdong University of Technology, Guangzhou, China
Abstract: In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam
prediction framework leveraging large language models (LLMs) to address the challenges of
high training overhead and latency in mmWave communication systems. By combining computer
vision (CV) with LLMs' cross-modal reasoning capabilities, the framework extracts user
equipment (UE) positional features from RGB images and aligns visual-temporal features with
LLMs' semantic space through reprogramming techniques. Evaluated on a realistic
vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy
and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming
traditional deep learning models. In few-shot prediction scenarios, the performance
degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10,
demonstrating superior prediction capability.
Index Terms: Beam prediction, massive multiple-input multiple-output (mMIMO), large language
models (LLMs), computer vision (CV).
I. INTRODUCTION
Millimeter-wave (mmWave) communication has garnered significant attention due to its
abundant spectrum resources above 26 GHz, enabling high-speed data transmission. However,
the high operating frequency results in substantial path loss. To address this challenge,
massive multiple-input multiple-output (mMIMO) antenna arrays are extensively employed,
which utilize highly directional beamforming techniques to mitigate propagation losses.
Furthermore, the short wavelength of mmWave signals facilitates compact antenna spacing,
which enables the integration of large-scale antenna arrays within constrained physical
dimensions. The effectiveness of directional beamforming depends on precise alignment
between transmit and receive beams. Beam training addresses this challenge by scanning
predefined codebooks at both the transmitter and receiver to identify the optimal beam pair,
thereby maximizing received signal power without requiring the exhaustive acquisition of
full channel state information (CSI).
Compared to legacy sub-6 GHz MIMO systems, beam training in mmWave systems faces heightened
challenges: 1) Large antenna arrays lead to high-dimensional channel matrices, increasing
training overhead; 2) Frequent beam tracking, especially in high-mobility scenarios (e.g.,
vehicle-to-everything (V2X) and unmanned aerial vehicles (UAVs)), introduces prohibitive
latency. Recent studies [1]–[4] have explored sensing-aided beam prediction, leveraging
multimodal sensor data such as RGB images, radar, LiDAR, and GPS to improve efficiency and
reduce training overhead. As a key enabler of integrated sensing and communication (ISAC) in
6G, this approach holds significant potential to enhance the performance of mmWave MIMO
systems.
To maintain beam prediction performance, deep learning (DL) is commonly used to extract user
equipment (UE) movement features from received sensing data, enabling more accurate future
beam selection. Due to its powerful non-linear feature extraction capability, DL has been
widely explored in wireless communication tasks, including channel estimation [5] and beam
prediction. Recent breakthroughs in large language models (LLMs), such as GPT-4 [6] and
DeepSeek [7], have demonstrated remarkable contextual reasoning and few-shot generalization
abilities. Although originally designed for natural language processing (NLP), LLMs have
shown strong cross-modal learning capabilities, which extends their applications to time
series forecasting and computer vision (CV) tasks.
Inspired by these advantages of LLMs, several methods applying LLMs have been proposed for
channel prediction [8], beam prediction [9], and port prediction for fluid antennas [10].
Building on these developments, in this paper, we propose a vision-aided beam prediction
framework, named BeamLLM, which utilizes LLMs to process RGB images, thereby enabling more
efficient and adaptive beam selection. Unlike [9], our method does not rely on historical
beam indices or angle of departure (AoD) information. Instead, BeamLLM relies solely on
visual features for beam prediction. Furthermore, to ensure robust performance and practical
applicability, we validate our framework using real-world measurement datasets to
demonstrate its potential for deployment in real-world scenarios.
The rest of this paper is organized as follows: Section II provides the system model and
problem formulation of the beam prediction task. The proposed BeamLLM framework is presented
in Section III. Section IV presents extensive simulation results, including performance
comparisons with benchmark methods, along with detailed discussions. Finally, we conclude
our work in Section V.
II. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model
Fig. 1 illustrates the system model considered for vehicle-to-infrastructure (V2I) mmWave
communication. In this model, the base station (BS) deploys a mmWave phased-array receiver
with $M$ elements at half-wavelength spacing and an RGB camera. The antenna array enables
the BS to perform beamforming, while the camera captures images within its field of view at
a certain frame rate for sensing and downstream applications.
Fig. 1. Illustration of the system model considered. [Diagram: base station with RGB camera.]
We assume that the BS has a predefined beamforming codebook
$\mathcal{F} = \{\mathbf{f}_1, \cdots, \mathbf{f}_{|\mathcal{F}|}\}$ containing
$|\mathcal{F}|$ beams, where $\mathbf{f}_m \in \mathbb{C}^{M \times 1}$,
$m = 1, \cdots, |\mathcal{F}|$, represents the $m$-th beamforming vector. The UE is assumed
to have a single antenna. At time step $t$, the user transmits a single symbol
$s[t] \in \mathbb{C}$ that satisfies the power constraint $\mathbb{E}[|s[t]|^2] = P$, where
$P$ represents the transmit power. At the BS, the received signal $y[t]$ can be expressed as:
$$y[t] = \mathbf{h}^H[t]\,\mathbf{f}_m[t]\,s[t] + n[t], \tag{1}$$
where $\mathbf{h}[t]$ is the channel vector, $\mathbf{f}_m[t]$ is the $m$-th beamforming
vector from the codebook at time step $t$, and $n[t] \sim \mathcal{CN}(0, \sigma^2)$ is the
additive white Gaussian noise (AWGN) with variance $\sigma^2$.
B. Problem Formulation
This paper mainly focuses on the beam prediction problem at the BS. Given the available
sensing information up to time $t-1$, the BS attempts to determine the optimal beams for
$H \in \mathbb{Z}^+$ future time steps, specifically for $t, \cdots, (t+H-1)$. We define the
optimal beam at time step $t$ as the beam that provides the highest beamforming gain, given
by:
$$\mathbf{f}_m^*[t] = \arg\max_{\mathbf{f}_m[t] \in \mathcal{F}} \left|\mathbf{h}^H[t]\,\mathbf{f}_m[t]\right|^2. \tag{2}$$
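As an illustration of the exhaustive search in Eq. (2), the following minimal sketch evaluates the beamforming gain for every codebook entry and returns the best index. The DFT-style codebook, the toy line-of-sight channel, and all function names are assumptions made for this example; the paper does not specify the codebook construction.

```python
import cmath
import math


def dft_codebook(num_antennas, num_beams):
    """Hypothetical DFT-style codebook: one unit-norm steering vector per beam."""
    codebook = []
    for m in range(num_beams):
        # Spatial frequency spread uniformly over [-1, 1).
        u = -1.0 + 2.0 * m / num_beams
        beam = [cmath.exp(1j * math.pi * n * u) / math.sqrt(num_antennas)
                for n in range(num_antennas)]
        codebook.append(beam)
    return codebook


def beamforming_gain(h, f):
    """|h^H f|^2 for channel vector h and beamforming vector f."""
    inner = sum(hn.conjugate() * fn for hn, fn in zip(h, f))
    return abs(inner) ** 2


def best_beam(h, codebook):
    """Exhaustive search over the codebook as in Eq. (2); returns a 0-based index."""
    gains = [beamforming_gain(h, f) for f in codebook]
    return max(range(len(codebook)), key=gains.__getitem__)


# Toy channel aligned with beam 20 of a 16-antenna, 32-beam codebook (assumed values).
codebook = dft_codebook(num_antennas=16, num_beams=32)
h = [math.sqrt(16) * x for x in codebook[20]]
print(best_beam(h, codebook))
```

The search needs one gain measurement per beam, which is the training overhead that sensing-aided prediction tries to avoid.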
When perfect CSI knowledge is unavailable, beam training serves as an alternative method for
determining the optimal beam. However, with a narrow beam codebook, the training overhead
can be significant, and the likelihood of identifying the optimal beam is often low when the
pre-beamforming SNR is poor. Because the optimal beam selection at the transmitter and
receiver depends on the surrounding environment of the transceiver, our work aims to
leverage visual information at the BS to assist beam selection and develop a beam prediction
framework.
III. LARGE LANGUAGE MODEL-BASED BEAM PREDICTION
In this section, we introduce BeamLLM to tackle the vision-assisted beam prediction task
outlined in Section II. Fig. 2 illustrates the proposed BeamLLM. The architecture mainly
comprises two components, i.e., the visual data feature extraction module and the backbone
module.
A. Visual Data Feature Extraction Module
To process raw RGB data for the vision-aided beam prediction task, we employ the YOLOv4
object detector [11]. This detector identifies potential UEs within RGB images and extracts
bounding box vectors $\mathbf{b}$. For a single image
$\mathbf{X}_I \in \mathbb{R}^{W \times H \times C}$, the bounding box vector is given as:
$$\mathbf{b} = \mathrm{YOLO}(\mathbf{X}_I) = [x_c, y_c, w, h]^T, \tag{3}$$
which consists of the detected object's center coordinates ($x$-axis, $y$-axis), width, and
height within the RGB image. Since the optimal beam selection is highly dependent on the
direction and position of the transmission target, we use a sequence of bounding box vectors
as the extracted visual feature. The objective is to predict the optimal beam index for the
next $H$ steps based on the bounding box vectors of the $T$ historical steps, denoted as
$\mathbf{B} = [\mathbf{b}[t-T], \cdots, \mathbf{b}[t-1]] \in \mathbb{R}^{4 \times T}$.
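The layout of the feature matrix can be made concrete with a small sketch: each detection contributes one column, and each of the four rows is later treated as a separate univariate series. The helper name and the normalized coordinate values below are hypothetical and only illustrate the shape.

```python
def stack_boxes(boxes):
    """Arrange T boxes given as (x_c, y_c, w, h) into a 4 x T matrix (list of rows)."""
    if any(len(box) != 4 for box in boxes):
        raise ValueError("each bounding box needs exactly four entries")
    # zip(*boxes) transposes the T x 4 list into 4 rows of length T.
    return [list(row) for row in zip(*boxes)]


# Three consecutive detections of one vehicle (made-up, normalized image coordinates).
detections = [
    (0.21, 0.55, 0.08, 0.06),
    (0.27, 0.55, 0.09, 0.07),
    (0.34, 0.56, 0.10, 0.07),
]
B = stack_boxes(detections)
x_centers, y_centers, widths, heights = B
print(x_centers)
```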
B. The Backbone Module
The inherent potential of LLMs can be utilized to address the beam prediction task. However,
a key challenge lies in aligning the visual feature modality with the textual modality to
enable LLMs to effectively comprehend the task. Furthermore, fine-tuning LLMs requires
extensive datasets, which is often unrealistic in practical scenarios.
Time-LLM simultaneously addresses both challenges through reprogramming [12], which consists
of two key steps: adaptation and alignment. Specifically, adaptation is achieved via the
patch reprogramming module, which enables LLMs to process input data effectively, thereby
breaking domain isolation and facilitating knowledge sharing. Alignment, on the other hand,
is accomplished through the prompt-as-prefix (PaP) module, which further eliminates domain
boundaries to enhance knowledge acquisition.
Input Embedding: Each row of $\mathbf{B}$, denoted as
$\mathbf{B}^{(i)} \in \mathbb{R}^{1 \times T}$ for $i = 1, 2, 3, 4$, is normalized
individually with reversible instance normalization (RevIN) [13], ensuring a mean of 0 and a
variance of 1. RevIN dynamically adjusts the normalization parameters to accommodate
variations in the data distribution. Subsequently, $\mathbf{B}^{(i)}$ is segmented into
several contiguous overlapping or non-overlapping patches, each of length $L_p$. The total
number of input patches is given by $P = \lfloor (T - L_p)/S \rfloor + 2$, where $S$
represents the horizontal sliding stride. This operation is inspired by techniques in CV,
wherein local temporal information is aggregated within each patch to better preserve local
semantic features. Finally, a simple linear layer is employed to embed
$\mathbf{B}_P^{(i)} \in \mathbb{R}^{P \times L_p}$ into
$\hat{\mathbf{B}}_P^{(i)} \in \mathbb{R}^{P \times d_m}$.
Patch Reprogramming: Since natural language and input features belong to different
modalities, with different ways of representing semantics, LLMs cannot directly process
$\hat{\mathbf{B}}_P^{(i)}$.
Fig. 2. The model framework of BeamLLM. [Diagram: feature extraction, instance norm, and
patch embedder produce time series patch embeddings; patch reprogramming applies multi-head
attention between the patch embeddings and text prototypes obtained by a linear layer from
the pre-trained word embeddings; prompt embeddings and reprogrammed patch embeddings pass
through the pre-trained LLM (embedder and body); the output projection (flatten, linear,
ReLU) forecasts the future optimal beam indexes.]
Prompt shown in Fig. 2:
[Dataset Description] <|start_prompt|> Beam prediction dataset is a real-world dataset that
comprises coexisting multi-modal sensing and communication data, describing the movement of
beam direction over time, which typical remains constant within several time steps.
Now the scenario emulates a Vehicle-to-Infrastructure mmWave communication setup. The
testbed is deployed during the daytime. It is a two-way street with 2 lanes, a width of 10.6
meters, and a vehicle speed limit of 25mph (40.6 km per hour. The stationary unit is placed
close to the entrance of a vehicle parking lot. Vehicles can be seen driving through the
street or driving into or out of the parking structure.
[Instruction] Predict the next <H> steps given the previous <T> steps bounding box
coordinates (x, y, w, h) information attached.
[Input Trends]: The trend of input is <upward>/<downward>. <||>
To address this, the reprogramming layer maps the input time sequence of visual features
into an NLP task, enabling the utilization of LLMs' reasoning and inference capabilities.
A common technique for aligning different modalities is cross-attention [14], which enables
interactions between word embeddings and input features by dynamically attending to relevant
information across modalities. In this framework, the temporal input features serve as the
query, while the word embeddings act as the key and value. However, given that the backbone
model is a general-purpose LLM, the original vocabulary of size $V$ is not entirely relevant
to our task. Directly aligning input features with all words is impractical, as many words
do not carry semantic relevance to the task. Therefore, a simple linear layer is employed to
extract text prototypes (semantic prototypes) by projecting pre-trained word embeddings
$\mathbf{E} \in \mathbb{R}^{V \times D}$ onto a small collection of text prototypes
$\mathbf{E}' \in \mathbb{R}^{V' \times D}$, where $V' \ll V$. Here, $D$ is the hidden
dimension of the backbone model. This projection effectively reduces the number of words
from $V$ to $V'$, allowing the temporal input features to align only with these prototypes.
For each head $k = 1, 2, \cdots, K$, we define:
$$\mathbf{Q}_k^{(i)} = \hat{\mathbf{B}}_P^{(i)} \mathbf{W}_k^Q, \tag{4}$$
$$\mathbf{K}_k^{(i)} = \mathbf{E}' \mathbf{W}_k^K, \tag{5}$$
$$\mathbf{V}_k^{(i)} = \mathbf{E}' \mathbf{W}_k^V, \tag{6}$$
where $\mathbf{W}_k^Q \in \mathbb{R}^{d_m \times \lfloor d_m/K \rfloor}$ and
$\mathbf{W}_k^K, \mathbf{W}_k^V \in \mathbb{R}^{D \times \lfloor d_m/K \rfloor}$.
The following process adaptively obtains the text descriptions corresponding to patches
through a multi-head cross-attention mechanism:
$$\mathbf{Z}_k^{(i)} = \mathrm{ATTENTION}\left(\mathbf{Q}_k^{(i)}, \mathbf{K}_k^{(i)}, \mathbf{V}_k^{(i)}\right) = \mathrm{SOFTMAX}\left(\frac{\mathbf{Q}_k^{(i)} \mathbf{K}_k^{(i)\top}}{\sqrt{d_k}}\right) \mathbf{V}_k^{(i)}, \tag{7}$$
where $d_k = \lfloor d_m/K \rfloor$ is the per-head dimension.
By aggregating each $\mathbf{Z}_k^{(i)} \in \mathbb{R}^{P \times d_k}$ across all heads, we
obtain $\mathbf{Z}^{(i)} \in \mathbb{R}^{P \times d_m}$. This is then linearly projected to
align the hidden dimension with the backbone model, resulting in
$\mathbf{O}^{(i)} \in \mathbb{R}^{P \times D}$.
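To make the shapes in Eqs. (4)-(7) concrete, the sketch below runs the reprogramming step on small random matrices: patch embeddings act as queries, text prototypes as keys and values, the heads are concatenated, and a final linear map lifts the result to the backbone dimension $D$. The weights are random placeholders and the dimensions are illustrative; none of the helper names come from the paper or from Time-LLM.

```python
import math
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def reprogram(patch_emb, prototypes, num_heads, rng):
    """Cross-attention: patches are queries, text prototypes are keys and values."""
    d_m, D = len(patch_emb[0]), len(prototypes[0])
    d_k = d_m // num_heads
    heads = []
    for _ in range(num_heads):
        W_q = rand_matrix(d_m, d_k, rng)
        W_k = rand_matrix(D, d_k, rng)
        W_v = rand_matrix(D, d_k, rng)
        Q = matmul(patch_emb, W_q)       # P  x d_k
        K = matmul(prototypes, W_k)      # V' x d_k
        V = matmul(prototypes, W_v)      # V' x d_k
        scores = matmul(Q, [list(col) for col in zip(*K)])  # P x V'
        weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        heads.append((matmul(weights, V), weights))
    # Concatenate the heads along the feature axis: P x (num_heads * d_k).
    Z = [sum((head[0][p] for head in heads), []) for p in range(len(patch_emb))]
    return Z, [head[1] for head in heads]


rng = random.Random(0)
P, d_m, D, V_prime, K_heads = 4, 8, 12, 5, 2   # toy sizes
patch_emb = rand_matrix(P, d_m, rng)
prototypes = rand_matrix(V_prime, D, rng)
Z, attn = reprogram(patch_emb, prototypes, K_heads, rng)
W_o = rand_matrix(d_m, D, rng)
O = matmul(Z, W_o)                              # P x D, ready for the backbone
print(len(Z), len(Z[0]), len(O[0]))
```

The attention weights returned per head are the quantities one would visualize to see which prototypes each patch attends to.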
PaP: Natural language-based prompts serve as prefixes to enrich the input context and guide
the transformation of reprogrammed patches. We have identified three essential components
for constructing an effective prompt: (1) dataset description, (2) task description, and
(3) input statistics. The dataset description provides the LLM with fundamental background
information about the input features, which often exhibit distinct characteristics across
different domains. The task description offers crucial guidance to the LLM for transforming
patch embeddings in the context of the specific task. Additionally, we incorporate
supplementary key statistics, such as trends, to further enrich the input features,
facilitating pattern recognition and reasoning.
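A prompt with these three components could be assembled along the following lines. This is a hedged sketch: the section labels follow the prompt shown in Fig. 2, but the trend rule (sign of the summed first differences), the omission of special delimiter tokens, and the function names are assumptions for illustration.

```python
def describe_trend(series):
    """Label a series from the sum of its first differences.

    A zero net change is reported as 'downward' in this simplified rule.
    """
    total_change = sum(b - a for a, b in zip(series, series[1:]))
    return "upward" if total_change > 0 else "downward"


def build_prompt(dataset_description, history_len, horizon, series):
    """Join dataset description, task instruction, and input statistics."""
    return (
        f"[Dataset Description] {dataset_description}\n"
        f"[Instruction] Predict the next {horizon} steps given the previous "
        f"{history_len} steps bounding box coordinates (x, y, w, h) "
        f"information attached.\n"
        f"[Input Trends] The trend of input is {describe_trend(series)}."
    )


x_center = [0.10, 0.14, 0.19, 0.25, 0.32, 0.40, 0.49, 0.59]
prompt = build_prompt(
    "Beam prediction dataset is a real-world multi-modal sensing and "
    "communication dataset.",
    history_len=8,
    horizon=5,
    series=x_center,
)
print(prompt)
```

The resulting text would be tokenized and embedded by the frozen LLM and prepended to the reprogrammed patch embeddings.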
Output Projection: After packaging and forwarding the prompts along with the patch
embeddings $\mathbf{O}^{(i)}$ through the frozen LLM, we discard the prefix portion and
obtain the output representations. These representations are then flattened and linearly
projected to produce the final outputs,
$\hat{\mathbf{P}} = [\hat{\mathbf{p}}[t], \cdots, \hat{\mathbf{p}}[t+H-1]] \in \mathbb{R}^{|\mathcal{F}| \times H}$.
The index of the dimension corresponding to the maximum value of each $\hat{\mathbf{p}}[t]$
is predicted as the optimal future beam index, given by:
$$\hat{m}^*[t] = \arg\max_{m \in [1, |\mathcal{F}|]} p_m[t], \tag{8}$$
where $p_m[t]$ denotes the $m$-th element of $\hat{\mathbf{p}}[t]$.
C. Learning Phase
The beam prediction task is essentially a classification problem; therefore, the model
parameters are optimized by minimizing the cross-entropy loss, which is expressed as:
$$\mathcal{L} = -\sum_{j=t}^{t+H-1} \sum_{m=1}^{|\mathcal{F}|} f_m^*[j] \log_2\left(p_m[j]\right), \tag{9}$$
where $f_m^*[j] \in \{0, 1\}$ is the $m$-th element of the one-hot encoded vector of the
optimal beam $\mathbf{f}_m^*[j]$ and $p_m[j]$ is the $m$-th element of the output vector
$\hat{\mathbf{p}}[j]$ at time step $j$.
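As a small worked example of the loss and of the index selection in Eq. (8), the sketch below uses the usual sign convention in which the cross-entropy is the negative log-likelihood of the true beams. Because the target is one-hot, only the term of the optimal beam survives the inner sum. The three-beam probability vectors are made up for illustration.

```python
import math


def predict_beam(probabilities):
    """1-based index of the largest entry of one output vector, as in Eq. (8)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__) + 1


def cross_entropy(prob_seq, optimal_beams):
    """Negative base-2 log-likelihood of the optimal beams, summed over the horizon."""
    loss = 0.0
    for probs, beam in zip(prob_seq, optimal_beams):
        # One-hot target: only the probability of the true beam contributes.
        loss -= math.log2(probs[beam - 1])
    return loss


# Two future steps over a toy 3-beam codebook (values are illustrative).
prob_seq = [[0.7, 0.2, 0.1], [0.25, 0.5, 0.25]]
optimal = [1, 2]
loss = cross_entropy(prob_seq, optimal)
predictions = [predict_beam(p) for p in prob_seq]
print(loss, predictions)
```

Here the loss equals $-\log_2 0.7 - \log_2 0.5$, and a perfect prediction with probability 1 on each true beam would give a loss of 0.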
IV. PERFORMANCE EVALUATION
We utilize the DeepSense 6G dataset [15] for simulation and performance evaluation.
DeepSense 6G is a multimodal dataset from real-world measurements, including wireless beam
data, RGB images, GPS locations, radar, and LiDAR.
A. Experimental Settings
Dataset Processing: We adopt Scenario 8 of the DeepSense 6G dataset for our simulation,
which emulates a V2I mmWave communication setup. The BS is equipped with an RGB camera and a
16-element 60 GHz mmWave phased array, while the mobile UE serves as a mmWave transmitter.
During data collection, the UE passes by the BS multiple times. At each time step, the BS
captures an RGB image of the UE while scanning all predefined beams and measuring the
received power for all $|\mathcal{F}| = 32$ beams in a codebook. The multimodal data streams
are synchronized to ensure temporal consistency.
The dataset is split into 70% training, 10% validation, and 20% test sets. The dataset
consists of multiple data sequences. In each data sequence, the vehicle passes by the BS
once. Each data sequence is a pair comprising an RGB image sequence and a beam index
sequence. We decompose each data sequence into data samples using a sliding window of size
13. As previously mentioned, during training, we use an observation window of size $T$, and
we train the model to predict future beams over a horizon $H$. Therefore, the input to the
model is $\mathbf{X}_I[1], \ldots, \mathbf{X}_I[T]$, and the expected output is
$\hat{\mathbf{p}}[T+1], \ldots, \hat{\mathbf{p}}[T+H]$. Since we maintain a fixed sequence
length of 13, we set $T = 8$, $H = 5$ for standard prediction and $T = 3$, $H = 10$ for
few-shot prediction.
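The sliding-window decomposition can be sketched as follows. A stride of one time step between consecutive windows is an assumption, since the text only fixes the window size of 13; the function name and the toy sequence are likewise hypothetical.

```python
def make_samples(boxes, beams, history_len, horizon):
    """Cut one sequence into (input boxes, target beams) pairs with stride 1."""
    window = history_len + horizon
    samples = []
    for start in range(len(boxes) - window + 1):
        inputs = boxes[start:start + history_len]
        targets = beams[start + history_len:start + window]
        samples.append((inputs, targets))
    return samples


# Toy pass of 20 time steps: placeholder boxes and beam indices.
boxes = [(0.05 * t, 0.5, 0.1, 0.1) for t in range(20)]
beams = [1 + t for t in range(20)]

standard = make_samples(boxes, beams, history_len=8, horizon=5)
few_shot = make_samples(boxes, beams, history_len=3, horizon=10)
print(len(standard), len(few_shot))
```

Both settings share the window length $T + H = 13$, so they yield the same number of samples per sequence and differ only in how each window is divided between observation and prediction.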
Baselines: We compare our approach with several classical time-series models, including RNN
[1], GRU, and LSTM. Additionally, to validate the effectiveness of the PaP module, we
conduct an ablation study by comparing our model with and without PaP in the standard
prediction setup.
Parameter Settings: BeamLLM is configured as follows: 1) A widely-used language model, i.e.,
GPT-2 [16], is employed as the LLM backbone; 2) It is trained with the Adam optimizer, where
the batch size and initial learning rate (LR) are 16 and 0.001, respectively. Additionally,
a multi-step LR scheduler with milestones at epochs 1, 5, 10, 15, 20, 25, 30, and 40 and a
decay factor of $\gamma = 0.9$ is employed; 3) The training process is set to 200 epochs.
The detailed model parameters are shown in Table I.
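The multi-step schedule can be written down explicitly. The sketch assumes the common convention (as in typical deep learning frameworks) that the learning rate is multiplied by the decay factor once the epoch counter reaches each milestone; the paper does not state which convention or framework is used.

```python
from bisect import bisect_right

MILESTONES = [1, 5, 10, 15, 20, 25, 30, 40]


def learning_rate(epoch, base_lr=0.001, gamma=0.9, milestones=MILESTONES):
    """LR at a given epoch: base_lr times gamma for every milestone already reached."""
    passed = bisect_right(milestones, epoch)
    return base_lr * gamma ** passed


for epoch in (0, 1, 5, 40, 199):
    print(epoch, learning_rate(epoch))
```

Under this convention the rate stays constant after epoch 40 at $0.001 \times 0.9^8$, so most of the 200 training epochs run at a little under half the initial rate.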
TABLE I
PARAMETER SETTINGS OF DIFFERENT MODELS
Model          | Component           | Setting
LLM            | Patch Reprogramming | Same as [12] except $V' = 64$
LLM            | Output Projection   | Linear 1: 4×8, ReLU; Linear 2: 8×16, ReLU; Linear 3: 16×32, Softmax
RNN, GRU, LSTM | Embedding Layer     | Linear: 4×32
RNN, GRU, LSTM | Sequence Model      | Layers 1–3: 32×32; Linear: 32×32
Performance Metrics: Top-$K$ accuracy is a metric that quantifies the percentage of test
samples for which the best ground truth beam is among the top $K$ model predictions with the
highest probability. Mathematically, it is represented as:
$$\text{Top-}K\ \text{accuracy} = \frac{1}{N_S} \sum_{i=1}^{N_S} \mathbb{1}\{m_i \in Q_K\}, \tag{10}$$
where $N_S$ represents the total number of samples in the test set, $m_i$ denotes the index
of the ground truth optimal beam for the $i$-th sample, and $Q_K$ is the set of indices for
the top-$K$ predicted beams, sorted by the element values in $\hat{\mathbf{P}}$ for each
time sample.
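A direct implementation of Eq. (10) is short enough to state in full. The sketch below ranks the entries of each output vector, keeps the $K$ largest, and counts how often the ground-truth beam is among them; the three-sample example is made up for illustration.

```python
def top_k_accuracy(prob_vectors, true_beams, k):
    """Fraction of samples whose true beam (1-based) is among the k most likely beams."""
    hits = 0
    for probs, truth in zip(prob_vectors, true_beams):
        ranked = sorted(range(len(probs)), key=lambda m: probs[m], reverse=True)
        top_k = {m + 1 for m in ranked[:k]}  # convert to 1-based beam indices
        hits += truth in top_k
    return hits / len(true_beams)


# Three test samples over a toy 3-beam codebook.
probs = [[0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
truth = [2, 2, 1]
for k in (1, 2, 3):
    print(k, top_k_accuracy(probs, truth, k))
```

By construction the metric is non-decreasing in $K$, which matches the observation that top-3 accuracy is always at least as high as top-1 accuracy.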
B. Standard Prediction
In Figs. 3 and 4, we present a comparative analysis of the top-1 and top-3 accuracy in the
standard predictions across all models. Increasing $K$ improves top-$K$ accuracy, while the
accuracy gradually decreases as the prediction horizon extends further into the future.
Among the models, BeamLLM achieves the highest top-1 and top-3 accuracy scores, reaching
61.01% and 97.39%, respectively. Additionally, as the number of time samples increases, the
decay in the top-$K$ accuracy for the LSTM model is minimal. Specifically, the top-1 and
top-3 accuracy only decrease by 6.03% and 1.65%, respectively, across time samples ranging
from 1 to 5. This smaller reduction highlights the adaptability of LSTMs, as their gating
mechanism adjusts information retention and updating based on task demands.
The results of the ablation study highlight the performance differences of BeamLLM with and
without the use of PaP. The average performance gap in top-1 accuracy between the two models
is 5.81%, while the gap in top-3 accuracy is 3.62%. When comparing these scenarios, we
observe that the integration of PaP significantly improves both performance and stability,
compared to simply inputting the reprogrammed
Fig. 3. Top-1 accuracy performance of the proposed method compared to several baselines in
the standard prediction task. [Plot: top-1 accuracy (0.0 to 1.0) versus time sample (1 to 5)
for RNN, GRU, LSTM, BeamLLM w/o PaP, and BeamLLM.]
Fig. 4. Top-3 accuracy performance of the proposed method compared to several baselines in
the standard prediction task. [Plot: top-3 accuracy (0.0 to 1.0) versus time sample (1 to 5)
for RNN, GRU, LSTM, BeamLLM w/o PaP, and BeamLLM.]
patch into the frozen LLM. This underscores the effectiveness of PaP in the context of this
task.
C. Few-Shot Prediction
In Figs. 5 and 6, we present the top-1 and top-3 accuracy performance for the few-shot
forecasting task. Existing DL prediction methods perform poorly in this scenario,
particularly as the prediction horizon extends, resulting in severe performance degradation.
Even for the previously most stable LSTM model, from time sample 1 to 10, the top-1 accuracy
decreases by 16.48% and the top-3 accuracy decreases by 11.58%. In contrast, BeamLLM
significantly outperforms all baseline methods, with only 12.56% and 5.55% performance
degradation, respectively. We attribute this superior performance to the successful
activation of knowledge through the reprogrammed LLM.
Fig. 5. Top-1 accuracy performance of the proposed method compared to several baselines in
the few-shot prediction task. [Plot: top-1 accuracy (0.0 to 1.0) versus time sample (1 to
10) for RNN, GRU, LSTM, and BeamLLM.]
Fig. 6. Top-3 accuracy performance of the proposed method compared to several baselines in
the few-shot prediction task. [Plot: top-3 accuracy (0.0 to 1.0) versus time sample (1 to
10) for RNN, GRU, LSTM, and BeamLLM.]
D. Analysis of Reprogramming
We provide a case study of reprogramming 64 time series patches with 64 text prototypes, as
shown in Fig. 7. The figure consists of three subplots, each visualizing the similarity
between patches and text prototypes, computed as the scaled dot product
$\mathbf{Q}_k^{(i)} \mathbf{K}_k^{(i)\top} / \sqrt{d_k}$, at a distinct training epoch. A
color bar accompanies the three subplots, with values ranging from 0 (dark purple, denoting
low similarity) to 1 (bright yellow, denoting high similarity). The observed transition from
a noisy, scattered pattern at epoch 1 to a sparse and concentrated representation by epoch 5
illustrates that the learned prototypes effectively capture the local semantic information
of the input features. Moreover, the heatmaps indicate that only a few text prototypes
exhibit significant correlations with the patches, suggesting BeamLLM's ability to
adaptively prioritize prototypes most relevant to the local semantic context.
Fig. 7. A showcase of text prototype evolution across training epochs. [Heatmaps with axes
Patch and Text Prototype: (a) Epoch 1, (b) Epoch 3, (c) Epoch 5.]
E. Analysis of Complexity
All experiments are conducted in the same environment, specifically on Google Colab with an
NVIDIA A100 GPU and 40 GB of RAM. We investigate the training complexity and inference
complexity of different models in terms of the number of trainable and non-trainable
parameters, as well as the average inference time per epoch. From Table II, we observe that
although the backbone model is frozen, the number of trainable parameters of BeamLLM remains
large. Meanwhile, the average inference time is significantly longer than that of
traditional models. While its high deployment cost poses a challenge, this also indicates
that the full potential of BeamLLM has yet to be explored.
TABLE II
THE NUMBER OF MODEL PARAMETERS AND AVERAGE INFERENCE TIME
Models  | # of trainable parameters | # of non-trainable parameters | Average inf. time (sec)
RNN     | 8,641                     | 0                             | 0.17
GRU     | 18,641                    | 0                             | 0.15
LSTM    | 26,593                    | 0                             | 0.25
BeamLLM | 130,056,118               | 124,439,808                   | 10.85
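The figures in Table II can be put in proportion with a few lines of arithmetic. The numbers below are copied from the table; the derived ratios are simple illustrations and not results reported by the authors.

```python
# Parameter counts and inference times taken from Table II.
trainable = 130_056_118
frozen = 124_439_808
beamllm_time, lstm_time = 10.85, 0.25

total = trainable + frozen
trainable_share = trainable / total          # fraction of BeamLLM that is trained
slowdown_vs_lstm = beamllm_time / lstm_time  # inference time relative to LSTM

print(total, round(trainable_share, 3), round(slowdown_vs_lstm, 1))
```

Roughly half of the BeamLLM parameters are trainable even though the backbone is frozen, and its average inference time is about 43 times that of the slowest baseline.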
V. CONCLUSIONS
This work has presented BeamLLM, an LLM-based framework for vision-empowered beam
prediction that significantly improves accuracy and robustness in mmWave systems through
reprogramming. Experimental results have highlighted LLMs' superior contextual inference
capabilities compared to conventional DL models in standard and few-shot prediction.
However, the performance gains come with increased inference complexity. The massive
parameter scale of LLMs may introduce higher resource consumption and latency. Nevertheless,
BeamLLM remains practical, particularly due to its exceptional few-shot prediction
capability, which enables predictions over a longer horizon. Practical deployments require a
trade-off between model complexity and real-time constraints, necessitating optimizations
such as model compression or lightweight architecture design. By advancing these aspects,
the proposed framework could serve as a scalable and efficient beam management solution for
6G ISAC systems.
REFERENCES
[1] S. Jiang and A. Alkhateeb, “Computer Vision Aided Beam Tracking in A Real-World Millimeter Wave Deployment,” in IEEE Globecom Workshops (GC Wkshps). IEEE, 2022, pp. 142–147.
[2] U. Demirhan and A. Alkhateeb, “Radar Aided 6G Beam Prediction: Deep Learning Algorithms and Real-World Demonstration,” in IEEE Wireless Communications and Networking Conference (WCNC), 2022, pp. 2655–2660.
[3] S. Jiang, G. Charan, and A. Alkhateeb, “LiDAR Aided Future Beam Prediction in Real-World Millimeter Wave V2I Communications,” IEEE Wireless Commun. Lett., vol. 12, no. 2, pp. 212–216, 2023.
[4] J. Morais, A. Behboodi, H. Pezeshki, and A. Alkhateeb, “Position-Aided Beam Prediction in the Real World: How Useful GPS Locations Actually are?” in IEEE International Conference on Communications, 2023, pp. 1824–1829.
[5] J. He, H. Wymeersch, M. Di Renzo, and M. Juntti, “Learning to Estimate RIS-Aided mmWave Channels,” IEEE Wireless Commun. Lett., vol. 11, no. 4, pp. 841–845, Apr. 2022.
[6] OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2024.
[7] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., “DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence,” arXiv preprint arXiv:2401.14196, 2024.
[8] B. Liu, X. Liu, S. Gao, X. Cheng, and L. Yang, “LLM4CP: Adapting Large Language Models for Channel Prediction,” Journal of Communications and Information Networks, vol. 9, no. 2, pp. 113–125, Jun. 2024.
[9] Y. Sheng, K. Huang, L. Liang, P. Liu, S. Jin, and G. Y. Li, “Beam Prediction Based on Large Language Models,” IEEE Wireless Commun. Lett., pp. 1–1, 2025.
[10] Y. Zhang, H. Yin, W. Li, E. Bjornson, and M. Debbah, “Port-LLM: A Port Prediction Method for Fluid Antenna based on Large Language Models,” arXiv preprint arXiv:2502.09857, 2025.
[11] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv preprint arXiv:2004.10934, 2020.
[12] M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan, and Q. Wen, “Time-LLM: Time series forecasting by reprogramming large language models,” in International Conference on Learning Representations (ICLR), 2024.
[13] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, “Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift,” in International Conference on Learning Representations (ICLR), 2021.
[14] H. Lin, X. Cheng, X. Wu, and D. Shen, “CAT: Cross Attention in Vision Transformer,” in IEEE International Conference on Multimedia and Expo (ICME), Los Alamitos, CA, USA, Jul. 2022, pp. 1–6.
[15] A. Alkhateeb, G. Charan, T. Osman, A. Hredzak, J. Morais, U. Demirhan, and N. Srinivas, “DeepSense 6G: A Large-Scale Real-World Multi-Modal Sensing and Communication Dataset,” IEEE Commun. Mag., vol. 61, no. 9, pp. 122–128, Sept. 2023.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language Models are Unsupervised Multitask Learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.