arxiv

Paper 2503.10432

BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models

Authors: Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang

Published: 2025-03-13

Abstract:

In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave) beam prediction framework leveraging large language models (LLMs) to address the challenges of high training overhead and latency in mmWave communication systems. By combining computer vision (CV) with LLMs' cross-modal reasoning capabilities, the framework extracts user equipment (UE) positional features from RGB images and aligns visual-temporal features with LLMs' semantic space through reprogramming techniques. Evaluated on a realistic vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01% top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks, significantly outperforming traditional deep learning models. In few-shot prediction scenarios, the performance degradation is limited to 12.56% (top-1) and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction capability.

Paper Content:
arXiv:2503.10432v1 [cs.LG] 13 Mar 2025

Can Zheng (1), Jiguang He (2), Guofa Cai (3), Zitong Yu (2), Chung G. Kang (1)
(1) School of Electrical Engineering, Korea University, Seoul, Republic of Korea
(2) School of Computing and Information Technology, Great Bay University, Dongguan 523000, China
(3) School of Information Engineering, Guangdong University of Technology, Guangzhou, China

Index Terms: Beam prediction, massive multi-input multi-output (mMIMO), large language models (LLMs), computer vision (CV).

I. INTRODUCTION

Millimeter-wave (mmWave) communication has garnered significant attention due to its abundant spectrum resources above 26 GHz, enabling high-speed data transmission. However, the high operating frequency results in substantial path loss. To address this challenge, massive multiple-input multiple-output (mMIMO) antenna arrays are extensively employed, which utilize highly directional beamforming techniques to mitigate propagation losses.
Furthermore, the short wavelength of mmWave signals facilitates compact antenna spacing, which enables the integration of large-scale antenna arrays within constrained physical dimensions.

The effectiveness of directional beamforming depends on precise alignment between transmit and receive beams. Beam training addresses this challenge by scanning predefined codebooks at both the transmitter and receiver to identify the optimal beam pair, thereby maximizing received signal power without requiring the exhaustive acquisition of full channel state information (CSI). Compared to legacy sub-6 GHz MIMO systems, beam training in mmWave systems faces heightened challenges: 1) large antenna arrays lead to high-dimensional channel matrices, increasing training overhead; 2) frequent beam tracking, especially in high-mobility scenarios (e.g., vehicle-to-everything (V2X) and unmanned aerial vehicles (UAVs)), introduces prohibitive latency. Recent studies [1]–[4] have explored sensing-aided beam prediction, leveraging multimodal sensor data such as RGB images, radar, LiDAR, and GPS to improve efficiency and reduce training overhead. As a key enabler of integrated sensing and communication (ISAC) in 6G, this approach holds significant potential to enhance the performance of mmWave MIMO systems.

To maintain beam prediction performance, deep learning (DL) is commonly used to extract user equipment (UE) movement features from received sensing data, enabling more accurate future beam selection. Due to its powerful non-linear feature extraction capability, DL has been widely explored in wireless communication tasks, including channel estimation [5] and beam prediction. Recent breakthroughs in large language models (LLMs), such as GPT-4 [6] and DeepSeek [7], have demonstrated remarkable contextual reasoning and few-shot generalization abilities.
Although LLMs were originally designed for natural language processing (NLP), they have shown strong cross-modal learning capabilities, extending their applications to time series forecasting and computer vision (CV) tasks.

Inspired by these advantages of LLMs, several LLM-based methods have been proposed for channel prediction [8], beam prediction [9], and port prediction for fluid antennas [10]. Building on these developments, in this paper we propose a vision-aided beam prediction framework, named BeamLLM, which utilizes LLMs to process RGB images, thereby enabling more efficient and adaptive beam selection. Unlike [9], our method does not rely on historical beam indices or angle of departure (AoD) information. Instead, BeamLLM relies solely on visual features for beam prediction. Furthermore, to ensure robust performance and practical applicability, we validate our framework using real-world measurement datasets to demonstrate its potential for deployment in real-world scenarios.

The rest of this paper is organized as follows: Section II provides the system model and problem formulation of the beam prediction task. The proposed BeamLLM framework is presented in Section III. Section IV presents extensive simulation results, including performance comparisons with benchmark methods, along with detailed discussions. Finally, we conclude our work in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

Fig. 1. Illustration of the system model considered. (Labels: Base Station, RGB Camera.)

Fig. 1 illustrates the system model considered for vehicle-to-infrastructure (V2I) mmWave communication. In this model, the base station (BS) deploys a mmWave phased-array receiver with $M$ elements of half-wavelength spacing and an RGB camera. The antenna array enables the BS to perform beamforming, while the camera captures images within its field of view at a certain frame rate for sensing and downstream applications.
We assume that the BS has a predefined beamforming codebook $\mathcal{F} = \{\mathbf{f}_1, \cdots, \mathbf{f}_{|\mathcal{F}|}\}$, containing $|\mathcal{F}|$ beams, where $\mathbf{f}_m \in \mathbb{C}^{N \times 1}$, $m = 1, \cdots, M$, represents the $m$-th beamforming vector. The UE is assumed to have a single antenna. At time step $t$, the user transmits a single symbol $s[t] \in \mathbb{C}$ that satisfies the power constraint $\mathbb{E}[|s[t]|^2] = P$, where $P$ represents the transmit power. At the BS, the received signal $y[t]$ can be expressed as:

$$y[t] = \mathbf{h}^H[t]\,\mathbf{f}_m[t]\,s[t] + n[t], \tag{1}$$

where $\mathbf{h}[t]$ is the channel vector, $\mathbf{f}_m[t]$ is the $m$-th beamforming vector from the codebook in time step $t$, and $n[t] \sim \mathcal{CN}(0, \sigma^2)$ is the additive white Gaussian noise (AWGN) with variance $\sigma^2$.

B. Problem Formulation

This paper mainly focuses on the beam prediction problem at the BS. Given the available sensing information up to time $t-1$, the BS attempts to determine the optimal beams for $H \in \mathbb{Z}^+$ future time steps, specifically for $t, \cdots, (t+H-1)$. We define the optimal beam at time step $t$ as the beam that provides the highest beamforming gain, given by:

$$\mathbf{f}_m^*[t] = \arg\max_{\mathbf{f}_m[t] \in \mathcal{F}} \left|\mathbf{h}^H[t]\,\mathbf{f}_m[t]\right|^2. \tag{2}$$

When perfect CSI knowledge is unavailable, beam training serves as an alternative method for determining the optimal beam. However, with a narrow beam codebook, the training overhead can be significant, and the likelihood of identifying the optimal beam is often low when the pre-beamforming SNR is poor. Because the optimal beam selection at the transmitter and receiver depends on the surrounding environment of the transceiver, our work aims to leverage visual information at the BS to assist beam selection and develop a beam prediction framework.

III. LARGE LANGUAGE MODEL-BASED BEAM PREDICTION

In this section, we introduce BeamLLM to tackle the vision-assisted beam prediction task outlined in Section II. Fig. 2 illustrates the proposed BeamLLM. The architecture mainly comprises two components, i.e., the visual data feature extraction module and the backbone module.

A. Visual Data Feature Extraction Module

To process raw RGB data for the vision-aided beam prediction task, we employ the YOLOv4 object detector [11]. This detector identifies potential UEs within RGB images and extracts bounding box vectors $\mathbf{b}$. For a single image $\mathbf{X}_I \in \mathbb{R}^{W \times H \times C}$, the bounding box vector is given as:

$$\mathbf{b} = \mathrm{YOLO}(\mathbf{X}_I) = [x_c, y_c, w, h]^T, \tag{3}$$

which consists of the detected object's center coordinates ($x$-axis, $y$-axis), width, and height within the RGB image. Since the optimal beam selection is highly dependent on the direction and position of the transmission target, we use a sequence of bounding box vectors as the extracted visual feature. The objective is to predict the optimal beam index for the next $H$ steps based on the bounding box vectors of the previous $T$ steps, denoted as $\mathbf{B} = [\mathbf{b}[t-T+1], \cdots, \mathbf{b}[t-1]] \in \mathbb{R}^{4 \times T}$.

B. The Backbone Module

The inherent potential of LLMs can be utilized to address the beam prediction task. However, a key challenge lies in aligning the visual feature modality with the textual modality to enable LLMs to effectively comprehend the task. Furthermore, fine-tuning LLMs requires extensive datasets, which is often unrealistic in practical scenarios. Time-LLM simultaneously addresses both challenges through reprogramming [12], which consists of two key steps: adaptation and alignment. Specifically, adaptation is achieved via the patch reprogramming module, which enables LLMs to process input data effectively, thereby breaking domain isolation and facilitating knowledge sharing.
Alignment, on the other hand, is accomplished through the prompt-as-prefix (PaP) module, which further eliminates domain boundaries to enhance knowledge acquisition.

Input Embedding: For each row of $\mathbf{B}$, denoted as $\mathbf{B}^{(i)} \in \mathbb{R}^{1 \times T}$ for $i = 1, 2, 3, 4$, reversible instance normalization (RevIN) [13] is applied individually to normalize the data, ensuring a mean of 0 and a variance of 1. RevIN dynamically adjusts the normalization parameters to accommodate variations in the data distribution. Subsequently, $\mathbf{B}^{(i)}$ is segmented into several contiguous overlapping or non-overlapping patches, each of length $L_p$. The total number of input patches is given by $\lfloor (T - L_p)/S \rfloor - 2$, where $S$ represents the horizontal sliding size. This operation is inspired by techniques in CV, wherein local temporal information is aggregated within each patch to better preserve local semantic features. Finally, a simple linear layer is employed to embed $\mathbf{B}_P^{(i)} \in \mathbb{R}^{P \times L_p}$ into $\hat{\mathbf{B}}_P^{(i)} \in \mathbb{R}^{P \times d_m}$.

Patch Reprogramming: Since natural language and input features belong to different modalities, with different ways of representing semantics, LLMs cannot directly process $\hat{\mathbf{B}}_P^{(i)}$. To address this, the reprogramming layer maps the input time sequence of visual features into an NLP task, enabling the utilization of LLMs' reasoning and inference capabilities.

Fig. 2. The model framework of BeamLLM. (Block labels: Feature Extraction; Instance Norm; Time Series Patches; Patch Embedder (Linear); Patch Embeddings; Patch Reprogram (Multi-Head Attention, Linear) using Pre-trained Word Embeddings and Text Prototypes; Reprogrammed Patch Embeddings; Prompts; Pre-trained LLM (Embedder); Prompt Embeddings; Pre-trained LLM (Body); Output Patch Embeddings; Output Projection (Flatten & Linear & ReLU); Forecasts; Future Optimal Beam Indexes. Legend: Frozen, Training, Forward, Backward.)

The example prompt shown in Fig. 2 reads:

[Dataset Description] <|start_prompt|> Beam prediction dataset is a real-world dataset that comprises coexisting multi-modal sensing and communication data, describing the movement of beam direction over time, which typical remains constant within several time steps. Now the scenario emulates a Vehicle-to-Infrastructure mmWave communication setup. The testbed is deployed during the daytime. It is a two-way street with 2 lanes, a width of 10.6 meters, and a vehicle speed limit of 25mph (40.6 km per hour). The stationary unit is placed close to the entrance of a vehicle parking lot. Vehicles can be seen driving through the street or driving into or out of the parking structure.
[Instruction] Predict the next <H> steps given the previous <T> steps bounding box coordinates (x, y, w, h) information attached.
[Input Trends]: The trend of input is <upward>/<downward>. <||>

A common technique for aligning different modalities is cross-attention [14], which enables interactions between word embeddings and input features by dynamically attending to relevant information across modalities. In this framework, the temporal input features serve as the query, while the word embeddings act as the key and value. However, given that the backbone model is a general-purpose LLM, the original vocabulary of size $V$ is not entirely relevant to our task. Directly aligning input features with all words is impractical, as many words do not carry semantic relevance to the task. Therefore, a simple linear layer is employed to extract text prototypes (semantic prototypes) by projecting pre-trained word embeddings $\mathbf{E} \in \mathbb{R}^{V \times D}$ onto a small collection of text prototypes $\mathbf{E}' \in \mathbb{R}^{V' \times D}$, where $V' \ll V$. Here, $D$ is the hidden dimension of the backbone model. This projection effectively reduces the number of words from $V$ to $V'$, allowing the temporal input features to align only with these prototypes.
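As a rough illustration of the prototype projection and cross-attention just described, the sketch below uses a single attention head in plain Python. It is not the authors' implementation: the helper names (`matmul`, `reprogram`, `random_matrix`), the bias-free linear maps, the random weights, and the toy sizes are all assumptions chosen only to make the shapes visible.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for learned or pre-trained weights (assumption)."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def reprogram(patches, word_emb, w_proto, w_q, w_k, w_v):
    """One attention head: patch embeddings query the text prototypes."""
    prototypes = matmul(w_proto, word_emb)   # (V', V) @ (V, D) -> (V', D)
    q = matmul(patches, w_q)                 # (P, d_m) @ (d_m, d) -> (P, d)
    k = matmul(prototypes, w_k)              # (V', D) @ (D, d) -> (V', d)
    v = matmul(prototypes, w_v)              # (V', D) @ (D, d) -> (V', d)
    scale = math.sqrt(len(w_q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)           # (P, V'), each row sums to 1
    return weights, matmul(weights, v)       # output is (P, d)


rng = random.Random(0)
V, V_PROTO, D, P, D_M, D_HEAD = 20, 4, 6, 3, 8, 4   # toy sizes, not the paper's
weights, z = reprogram(
    patches=random_matrix(P, D_M, rng),
    word_emb=random_matrix(V, D, rng),
    w_proto=random_matrix(V_PROTO, V, rng),
    w_q=random_matrix(D_M, D_HEAD, rng),
    w_k=random_matrix(D, D_HEAD, rng),
    w_v=random_matrix(D, D_HEAD, rng),
)
print(len(weights), len(weights[0]), len(z), len(z[0]))  # prints: 3 4 3 4
```

Each row of `weights` describes how strongly one patch attends to each of the $V'$ prototypes, which is the kind of patch-to-prototype map one would inspect to see which prototypes a patch relies on.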
For each head $k = 1, 2, \cdots, K$, we define:

$$\mathbf{Q}_k^{(i)} = \hat{\mathbf{B}}_P^{(i)} \mathbf{W}_k^Q, \tag{4}$$

$$\mathbf{K}_k^{(i)} = \mathbf{E}' \mathbf{W}_k^K, \tag{5}$$

$$\mathbf{V}_k^{(i)} = \mathbf{E}' \mathbf{W}_k^V, \tag{6}$$

where $\mathbf{W}_k^Q \in \mathbb{R}^{d_m \times \lfloor d_m/K \rfloor}$ and $\mathbf{W}_k^K, \mathbf{W}_k^V \in \mathbb{R}^{D \times \lfloor d_m/K \rfloor}$.

The following process adaptively obtains the text descriptions corresponding to patches through a multi-head cross-attention mechanism:

$$\mathbf{Z}_k^{(i)} = \mathrm{ATTENTION}\left(\mathbf{Q}_k^{(i)}, \mathbf{K}_k^{(i)}, \mathbf{V}_k^{(i)}\right) = \mathrm{SOFTMAX}\left(\frac{\mathbf{Q}_k^{(i)} \mathbf{K}_k^{(i)\top}}{\sqrt{d_k}}\right) \mathbf{V}_k^{(i)}. \tag{7}$$

By aggregating each $\mathbf{Z}_k^{(i)} \in \mathbb{R}^{P \times d}$ across all heads, we obtain $\mathbf{Z}^{(i)} \in \mathbb{R}^{P \times d_m}$. This is then linearly projected to align the hidden dimension with the backbone model, resulting in $\mathbf{O}^{(i)} \in \mathbb{R}^{P \times D}$.

PaP: Natural language-based prompts serve as prefixes to enrich the input context and guide the transformation of reprogrammed patches. We have identified three essential components for constructing an effective prompt: (1) dataset description, (2) task description, and (3) input statistics. The dataset description provides the LLM with fundamental background information about the input features, which often exhibit distinct characteristics across different domains. The task description offers crucial guidance to the LLM for transforming patch embeddings in the context of the specific task. Additionally, we incorporate supplementary key statistics, such as trends, to further enrich the input features, facilitating pattern recognition and reasoning.

Output Projection: By packaging and forwarding the prompts along with the patch embeddings $\mathbf{O}^{(i)}$ through the frozen LLM, we discard the prefix portion and obtain the output representations.
These representations are then flattened and linearly projected to produce the final outputs, $\hat{\mathbf{P}} = [\hat{\mathbf{p}}[t], \cdots, \hat{\mathbf{p}}[t+H-1]] \in \mathbb{R}^{M \times H}$. For each $\hat{\mathbf{p}}[t]$, the index of the largest entry is taken as the predicted optimal future beam index:

$$\hat{m}^*[t] = \arg\max_{m \in [1, |\mathcal{F}|]} \hat{p}_m[t]. \tag{8}$$

C. Learning Phase

The beam prediction task is essentially a classification problem; therefore, the model parameters are optimized by minimizing the cross-entropy, which is expressed as:

$$\mathcal{L} = -\sum_{j=t}^{t+H-1} \sum_{m=1}^{|\mathcal{F}|} f_m^*[j] \log_2\left(\hat{p}_m[j]\right), \tag{9}$$

where $f_m^*[j] \in \{0, 1\}$ is the $m$-th element of the one-hot encoded vector of $\mathbf{f}_m^*[j]$ and $\hat{p}_m[j]$ is the $m$-th element of the output vector $\hat{\mathbf{p}}[j]$ at time step $j$.

IV. PERFORMANCE EVALUATION

We utilize the DeepSense 6G dataset [15] for simulation and performance evaluation. DeepSense 6G is a multimodal dataset from real-world measurements, including wireless beam data, RGB images, GPS locations, radar, and LiDAR.

A. Experimental Settings

Dataset Processing: We adopt Scenario 8 of the DeepSense 6G dataset for our simulation, which emulates a V2I mmWave communication setup. The BS is equipped with an RGB camera and a 16-element 60 GHz mmWave phased array, while the mobile UE serves as a mmWave transmitter. During data collection, the UE passes by the BS multiple times. At each time step, the BS captures an RGB image of the UE while scanning all predefined beams and measuring the received power for all $|\mathcal{F}| = 32$ beams in a codebook. The multimodal data streams are synchronized to ensure temporal consistency. The dataset is split into 70% training, 10% validation, and 20% test sets. The dataset consists of multiple data sequences. In each data sequence, the vehicle passes by the BS once.
Each data sequence is a pair comprising an RGB image sequence and a beam index sequence. We decompose each data sequence into data samples using a sliding window of size 13. As previously mentioned, during training, we use an observation window of size $T$, and we train the model to predict future beams over a horizon $H$. Therefore, the input to the model is $\mathbf{X}_I[1], \ldots, \mathbf{X}_I[T]$, and the expected output is $\hat{\mathbf{p}}[T+1], \ldots, \hat{\mathbf{p}}[T+H]$. Since we maintain a fixed sequence length of 13, we set $T = 8$, $H = 5$ for standard prediction and $T = 3$, $H = 10$ for few-shot prediction.

Baselines: We compare our approach with several classical time-series models, including RNN [1], GRU, and LSTM. Additionally, to validate the effectiveness of the PaP module, we conduct an ablation study by comparing our model with and without PaP in the standard prediction setup.

Parameter Settings: BeamLLM is configured as follows: 1) a widely used language model, GPT-2 [16], is employed as the LLM backbone; 2) it is trained with the Adam optimizer, where the batch size and initial learning rate (LR) are 16 and 0.001, respectively, together with a multi-step LR scheduler with milestones at epochs 1, 5, 10, 15, 20, 25, 30, and 40 and a decay factor of $\gamma = 0.9$; 3) the training process is set to 200 epochs. The detailed model parameters are shown in Table I.

TABLE I
PARAMETER SETTINGS OF DIFFERENT MODELS

| Model | Component | Setting |
|---|---|---|
| LLM | Patch Reprogramming | Same as [12] except $V' = 64$ |
| LLM | Output Projection | Linear 1: 4×8, ReLU; Linear 2: 8×16, ReLU; Linear 3: 16×32, Softmax |
| RNN, GRU, LSTM | Embedding Layer | Linear: 4×32 |
| RNN, GRU, LSTM | Sequence Model | Layer 1–3: 32×32; Linear: 32×32 |

Performance Metrics: Top-$K$ accuracy is a metric that quantifies the percentage of samples for which the best ground truth beam is among the top $K$ model predictions with the highest probability.
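The windowing and the metric described in this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `make_samples` and `top_k_accuracy` are hypothetical, the window is assumed to slide with stride 1 (the stride is not stated above), and beam indices are 0-based in the code.

```python
def make_samples(sequence, t_obs, horizon):
    """Slide a window of length t_obs + horizon over one sequence (stride 1 assumed)."""
    window = t_obs + horizon
    samples = []
    for start in range(len(sequence) - window + 1):
        observed = sequence[start:start + t_obs]
        future = sequence[start + t_obs:start + window]
        samples.append((observed, future))
    return samples


def top_k_accuracy(scores, truth, k):
    """Fraction of samples whose true beam index is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, truth):
        ranked = sorted(range(len(row)), key=lambda m: row[m], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(truth)


# Toy sequence of 15 time steps; with T=8 and H=5 the window length is 13.
standard = make_samples(list(range(15)), t_obs=8, horizon=5)
few_shot = make_samples(list(range(15)), t_obs=3, horizon=10)

# Toy scores over 4 beams for 3 samples (made-up numbers, not results).
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.15, 0.30, 0.50],
]
truth = [1, 1, 0]
top1 = top_k_accuracy(scores, truth, k=1)   # only the first sample is a hit
top3 = top_k_accuracy(scores, truth, k=3)   # the first two samples are hits
```

Both settings use the same window length of 13, so a sequence yields the same number of samples in the standard and few-shot cases; only the split between observed and predicted steps changes.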
Mathematically, it is represented as:

$$\text{Top-}K\ \text{accuracy} = \frac{1}{N_S} \sum_{i=1}^{N_S} \mathbb{1}\{m_i \in Q_K\}, \tag{10}$$

where $N_S$ represents the total number of samples in the test set, $m_i$ denotes the index of the ground truth optimal beam for the $i$-th sample, and $Q_K$ is the set of indices for the top-$K$ predicted beams, sorted by the element values in $\hat{\mathbf{P}}$ for each time sample.

B. Standard Prediction

In Figs. 3 and 4, we present a comparative analysis of the top-1 and top-3 accuracy in the standard prediction task across all models. Increasing $K$ improves top-$K$ accuracy, whereas accuracy gradually decreases as the prediction horizon extends further into the future. Among the models, BeamLLM achieves the highest top-1 and top-3 accuracy scores, reaching 61.01% and 97.39%, respectively. Additionally, as the number of time samples increases, the LSTM model exhibits the smallest decay in top-$K$ accuracy. Specifically, its top-1 and top-3 accuracy decrease by only 6.03% and 1.65%, respectively, across time samples 1 to 5. This smaller reduction highlights the adaptability of LSTMs, as their gating mechanism adjusts information retention and updating based on task demands.

The results of the ablation study highlight the performance differences of BeamLLM with and without PaP. The average performance gap in top-1 accuracy between the two models is 5.81%, while the gap in top-3 accuracy is 3.62%.
When comparing these scenarios, we observe that the integration of PaP significantly improves both performance and stability, compared to simply inputting the reprogrammed patch into the frozen LLM. This underscores the effectiveness of PaP in the context of this task.

Fig. 3. Top-1 accuracy performance of the proposed method compared with several baselines in the standard prediction task. (Axes: time sample 1–5 versus top-1 accuracy 0.0–1.0; curves: RNN, GRU, LSTM, BeamLLM w/o PaP, BeamLLM.)

Fig. 4. Top-3 accuracy performance of the proposed method compared with several baselines in the standard prediction task. (Axes: time sample 1–5 versus top-3 accuracy 0.0–1.0; curves: RNN, GRU, LSTM, BeamLLM w/o PaP, BeamLLM.)

C. Few-Shot Prediction

In Figs. 5 and 6, we present the top-1 and top-3 accuracy performance for the few-shot forecasting task. Existing DL prediction methods perform poorly in this scenario, with severe performance degradation as the prediction horizon extends. Even for the LSTM model, previously the most stable, the top-1 accuracy decreases by 16.48% and the top-3 accuracy decreases by 11.58% from time sample 1 to 10.
In contrast, BeamLLM significantly outperforms all baseline methods, with only 12.56% (top-1) and 5.55% (top-3) performance degradation. We attribute this superior performance to the successful activation of knowledge through the reprogrammed LLM.

Fig. 5. Top-1 accuracy performance of the proposed method compared with several baselines in the few-shot prediction task. (Axes: time sample 1–10 versus top-1 accuracy 0.0–1.0; curves: RNN, GRU, LSTM, BeamLLM.)

Fig. 6. Top-3 accuracy performance of the proposed method compared with several baselines in the few-shot prediction task. (Axes: time sample 1–10 versus top-3 accuracy 0.0–1.0; curves: RNN, GRU, LSTM, BeamLLM.)

D. Analysis of Reprogramming

We provide a case study of reprogramming 64 time series patches with 64 text prototypes, as shown in Fig. 7. The figure consists of three subplots, each visualizing the similarity between patches and text prototypes, computed as the scaled dot product $\mathbf{Q}_k^{(i)} \mathbf{K}_k^{(i)\top} / \sqrt{d_k}$, at distinct training epochs. A color bar accompanies the three subplots, with values ranging from 0 (dark purple, denoting low similarity) to 1 (bright yellow, denoting high similarity). The observed transition from a noisy, scattered pattern at epoch 1 to a sparse and concentrated representation by epoch 5 illustrates that the learned prototypes effectively capture the local semantic information of the input features. Moreover, the heatmaps indicate that only a few text prototypes exhibit significant correlations with the patches, suggesting BeamLLM's ability to adaptively prioritize prototypes most relevant to the local semantic context.

Fig. 7. A showcase of text prototype evolution across training epochs. (Panels: (a) Epoch 1, (b) Epoch 3, (c) Epoch 5; axes: Text Prototype versus Patch.)

E. Analysis of Complexity

All experiments are conducted in the same environment, specifically on Google Colab with an NVIDIA A100 GPU and 40 GB of RAM. We investigate the training complexity and inference complexity of different models in terms of the number of trainable and non-trainable parameters, as well as the average inference time per epoch. From Table II, we observe that although the backbone model is frozen, the number of trainable parameters of BeamLLM remains large. Meanwhile, the average inference time is significantly longer than that of traditional models. While its high deployment cost poses a challenge, this also indicates that the full potential of BeamLLM has yet to be explored.

TABLE II
THE NUMBER OF MODEL PARAMETERS AND AVERAGE INFERENCE TIME

| Models | # of trainable parameters | # of non-trainable parameters | Average inf. time (sec) |
|---|---|---|---|
| RNN | 8,641 | 0 | 0.17 |
| GRU | 18,641 | 0 | 0.15 |
| LSTM | 26,593 | 0 | 0.25 |
| BeamLLM | 130,056,118 | 124,439,808 | 10.85 |

V. CONCLUSIONS

This work has presented BeamLLM, a framework for vision-empowered beam prediction that significantly improves accuracy and robustness in mmWave systems through reprogramming. Experimental results have highlighted LLMs' superior contextual inference capabilities compared to conventional DL models in standard and few-shot prediction.

However, the performance gains come with increased inference complexity. The massive parameter scale of LLMs may introduce higher resource consumption and latency.
Nevertheless, BeamLLM remains practical, particularly due to its exceptional few-shot prediction capability, which enables predictions over a longer horizon. Practical deployments require a trade-off between model complexity and real-time constraints, necessitating optimizations such as model compression or lightweight architecture design. By advancing these aspects, the proposed framework could serve as a scalable and efficient beam management solution for 6G ISAC systems.

REFERENCES

[1] S. Jiang and A. Alkhateeb, "Computer Vision Aided Beam Tracking in A Real-World Millimeter Wave Deployment," in IEEE Globecom Workshops (GC Wkshps). IEEE, 2022, pp. 142–147.
[2] U. Demirhan and A. Alkhateeb, "Radar Aided 6G Beam Prediction: Deep Learning Algorithms and Real-World Demonstration," in IEEE Wireless Communications and Networking Conference (WCNC), 2022, pp. 2655–2660.
[3] S. Jiang, G. Charan, and A. Alkhateeb, "LiDAR Aided Future Beam Prediction in Real-World Millimeter Wave V2I Communications," IEEE Wireless Commun. Lett., vol. 12, no. 2, pp. 212–216, 2023.
[4] J. Morais, A. Behboodi, H. Pezeshki, and A. Alkhateeb, "Position-Aided Beam Prediction in the Real World: How Useful GPS Locations Actually are?" in IEEE International Conference on Communications, 2023, pp. 1824–1829.
[5] J. He, H. Wymeersch, M. Di Renzo, and M. Juntti, "Learning to Estimate RIS-Aided mmWave Channels," IEEE Wireless Commun. Lett., vol. 11, no. 4, pp. 841–845, Apr. 2022.
[6] OpenAI, "GPT-4 Technical Report," arXiv preprint arXiv:2303.08774, 2024.
[7] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., "DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence," arXiv preprint arXiv:2401.14196, 2024.
[8] B. Liu, X. Liu, S. Gao, X. Cheng, and L. Yang, "LLM4CP: Adapting Large Language Models for Channel Prediction," Journal of Communications and Information Networks, vol. 9, no. 2, pp. 113–125, Jun. 2024.
[9] Y. Sheng, K. Huang, L. Liang, P. Liu, S. Jin, and G. Y. Li, "Beam Prediction Based on Large Language Models," IEEE Wireless Commun. Lett., pp. 1–1, 2025.
[10] Y. Zhang, H. Yin, W. Li, E. Bjornson, and M. Debbah, "Port-LLM: A Port Prediction Method for Fluid Antenna based on Large Language Models," arXiv preprint arXiv:2502.09857, 2025.
[11] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv preprint arXiv:2004.10934, 2020.
[12] M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan, and Q. Wen, "Time-LLM: Time series forecasting by reprogramming large language models," in International Conference on Learning Representations (ICLR), 2024.
[13] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, "Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift," in International Conference on Learning Representations (ICLR), 2021.
[14] H. Lin, X. Cheng, X. Wu, and D. Shen, "CAT: Cross Attention in Vision Transformer," in IEEE International Conference on Multimedia and Expo (ICME), Los Alamitos, CA, USA, Jul. 2022, pp. 1–6.
[15] A. Alkhateeb, G. Charan, T. Osman, A. Hredzak, J. Morais, U. Demirhan, and N. Srinivas, "DeepSense 6G: A Large-Scale Real-World Multi-Modal Sensing and Communication Dataset," IEEE Commun. Mag., vol. 61, no. 9, pp. 122–128, Sept. 2023.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language Models are Unsupervised Multitask Learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.

---