arxiv

Paper 2503.05493

Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders

Authors: Qijiong Liu, Jieming Zhu, Lu Fan, Kun Wang, Hengchang Hu, Wei Guo, Yong Liu, Xiao-Ming Wu

Published: 2025-03-07

Abstract:

In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods. We will release our code, data, configurations, and platform to enable other researchers to reproduce and build upon our experimental results.

Paper Content:

Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders

Qijiong Liu, The HK PolyU, Hong Kong SAR, liu@qijiong.work
Jieming Zhu, Huawei Noah’s Ark Lab, Hong Kong SAR, jiemingzhu@ieee.org
Lu Fan, The HK PolyU, Hong Kong SAR, cslfan@comp.polyu.edu.hk
Kun Wang, Nanyang Technology University, Singapore, wk520529@mail.ustc.edu.cn
Hengchang Hu, National University of Singapore, Singapore, hengchang.hu@u.nus.edu
Wei Guo, Huawei Noah’s Ark Lab, Singapore, guowei67@huawei.com
Yong Liu, Huawei Noah’s Ark Lab, Singapore, liu.yong6@huawei.com
Xiao-Ming Wu∗, The HK PolyU, Hong Kong SAR, xiao-ming.wu@polyu.edu.hk

ABSTRACT

In recent years, integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs with traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (including unique identifier, text, semantic embedding, and semantic identifier) and evaluates two primary recommendation tasks, i.e., click-through rate prediction (CTR) and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and are conducted across five diverse datasets from fashion, news, video, books, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement in the CTR scenario and up to a 170% NDCG@10 improvement in the SeqRec scenario. However, these substantial performance gains come at the expense of significantly reduced inference efficiency, rendering the LLM-as-RS paradigm impractical for real-time recommendation environments. We aim for our findings to inspire future research, including recommendation-specific model acceleration methods.
We will release our code, data, configurations, and platform¹ to enable other researchers to reproduce and build upon our experimental results.

∗ Xiao-Ming Wu is the corresponding author.
¹ https://recbench.github.io

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference’17, July 2017, Washington, DC, USA
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM
https://doi.org/10.1145/nnnnnnn.nnnnnnn

CCS CONCEPTS
• Information systems → Recommender systems; Language models; • General and reference → Evaluation.

KEYWORDS
Recommender systems, Large language models, Benchmark

ACM Reference Format:
Qijiong Liu, Jieming Zhu, Lu Fan, Kun Wang, Hengchang Hu, Wei Guo, Yong Liu, and Xiao-Ming Wu. 2025. Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders. In Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Recommender systems are essential for providing personalized information to internet users. The design of these systems typically involves balancing multiple objectives, including fairness, diversity, and interpretability. However, in industrial applications, accuracy and efficiency are the two most crucial metrics. Accuracy forms the foundation of user experience, greatly influencing user satisfaction and engagement.
Meanwhile, efficiency is crucial for system deployment, ensuring that recommendations are generated and delivered promptly.

arXiv:2503.05493v1 [cs.IR] 7 Mar 2025

[Figure 1: Illustration of DLRM and LLM recommender in two scenarios: pair-wise recommendation (click-through rate prediction) and list-wise recommendation (sequential recommendation / generative retrieval). In the pair-wise case, the LLM prompt reads "A user has browsed the following items: ITEM, ITEM, ... Will the user be interested in ITEM?" and the LLM answers, e.g., YES. In the list-wise case, the prompt reads "A user has browsed the following items: ITEM, ITEM, ... Next, the user will interact with:" and the LLM generates the next ITEM. Each ITEM represents a placeholder that can be filled with various item representations, including unique identifier, text, semantic embedding, or semantic identifier.]

In recent years, the integration of large language models (LLMs) into recommender systems (denoted as LLM+RS) has garnered significant attention from both academia and industry. These integrations can be broadly categorized into two paradigms [2, 4, 70, 77]: LLM-for-RS and LLM-as-RS. LLM-for-RS retains traditional deep learning-based recommender models (DLRMs) and enhances them through advanced feature engineering or feature encoding techniques using LLMs [34, 67]. This paradigm functions as a plug-in module, seamlessly integrating with existing recommender systems. It is easy to deploy, maintains high efficiency, and often improves recommendation accuracy without significant overhead, making it well-suited for industrial scenarios. LLM-as-RS, in contrast, directly uses LLMs as recommenders to generate recommendations. Studies have shown the superiority of this paradigm in recommendation accuracy in specific contexts, such as cold-start scenarios [1], and in tasks requiring natural language understanding and generation, like interpretable and interactive recommendations [16, 41, 63].
Despite its potential, the extremely low inference efficiency of large models poses challenges for high-throughput recommendation tasks. Nevertheless, the LLM-as-RS paradigm is transforming the traditional recommendation pipeline designs.

Several benchmarks have been proposed for the LLM-as-RS paradigm, including LLMRec [32], PromptRec [71], and others [21, 33, 76]. However, as illustrated in Table 1, these benchmarks i) provide only a limited evaluation of recommendation scenarios, often focusing on a single scenario. Furthermore, ii) their coverage of item representation forms for alignment within LLMs is narrow, typically restricted to conventional unique identifier or text formats. In addition, iii) the number of traditional models, large-scale models, and datasets evaluated remains relatively small, resulting in an incomplete and fragmented performance landscape in this domain.

To address this gap, we propose the RecBench platform, which offers a comprehensive evaluation of the LLM-as-RS paradigm. Firstly, we investigate various item representation and alignment methods between recommendation scenarios and LLMs, including unique identifier, text, semantic embedding, and semantic identifier, to understand their impact on recommendation performance. Secondly, the benchmark covers two main recommendation tasks: click-through rate (CTR) prediction and sequential recommendation (SeqRec), corresponding to pair-wise and list-wise recommendation scenarios, respectively. Thirdly, our study evaluates up to 17 LLMs, encompassing general-purpose models (e.g., Llama [7]) and recommendation-specific models (e.g., RecGPT [44]).
This extensive evaluation supports multidimensional comparisons across models of different sizes (e.g., OPT base and OPT large), from various institutions (e.g., Llama and Qwen), and across different versions from the same institution (e.g., Llama-1 7B and Llama-2 7B). Fourthly, the experiments are conducted across five recommendation datasets from different domains, including fashion (HM [30]), news (MIND [69]), video (MicroLens [45]), books (Goodreads [58]), and music (Amazon CDs [15]), to avoid reliance on a single platform and ensure balanced comparisons. Fifthly, we assess both recommendation accuracy and efficiency, providing a holistic comparison between conventional DLRMs and the LLM-as-RS paradigm. Our evaluation includes both zero-shot and fine-tuning schemes. The zero-shot evaluation explores the inherent recommendation knowledge and reasoning capabilities of LLMs, while the fine-tuning evaluation assesses their adaptability and learning ability in new scenarios.

To summarize, our RecBench benchmark offers an in-depth assessment of the LLM-as-RS paradigm and yields several key insights. Firstly, although LLM-based recommenders demonstrate substantial performance improvements in various scenarios, their efficiency limitations impede practical deployment. Future research should focus on developing inference acceleration techniques for LLMs in recommendations. Secondly, conventional DLRMs enhanced with LLM support (i.e., the LLM-for-RS paradigm, Group C in Figure 2) can achieve up to 95% of the performance of standalone LLM recommenders while operating much faster. Therefore, improving the integration of LLM capabilities into conventional DLRMs represents a promising research direction. We hope that our established, reusable, and standardized RecBench will lower the evaluation barrier and accelerate the development of new models in the recommendation community.
2 PRELIMINARIES AND RELATED WORK

In this section, we provide an overview of the key techniques for integrating LLMs with recommender systems. We begin by describing various forms of item representation, which form the foundation of recommender systems. Given that existing LLM-as-RS approaches employ different representations across diverse tasks, we present an abstract framework to illustrate both LLM-based recommenders and DLRMs at a conceptual level. Subsequently, we review representative works within each subarea to highlight current advancements. Finally, we compare the proposed RecBench with existing benchmarks to underscore its unique contributions.

2.1 Item Representations

Item representation is a critical component of recommender systems. Since the introduction of deep learning in this field, the most prevalent approach [13, 60, 61] has been to use a unique identifier for each item. These identifiers initially lack intrinsic meaning, and their corresponding vectors are randomly initialized before training. Through user–item interactions, these vectors progressively learn and encode collaborative signals, which are used to infer unknown interactions.

With advancements in computational power and the advent of the big data era, item content, such as product images and news headlines, has increasingly been utilized for item representation. Incorporating content features significantly enhances the robustness of item representations, making their quality independent of the number of interactions. As long as content is available, any item can be represented equally. By employing pooling operations, convolutional neural networks [26], attention networks [57], or other shallow modules, the item text can be easily fused and serve as a unified item representation for the recommendation model.

Table 1: Comparison of RecBench with existing benchmarks within the LLM-as-RS paradigm. The notation “–” indicates that, despite its claims, LLMRec does not practically support list-wise recommendation.

| Category            | Attribute           | Zhang et al. | OpenP5 | LLMRec | PromptRec | Jiang et al. | RSBench | RecBench |
|---------------------|---------------------|--------------|--------|--------|-----------|--------------|---------|----------|
|                     | Year                | 2021         | 2024   | 2023c  | 2024b     | 2024         | 2024d   | (ours)   |
| Scale               | #DLRM               | 2            | 9      | 13     | 4         | 6            | 0       | 10       |
| Scale               | #LLM                | 4            | 2      | 7      | 4         | 7            | 1       | 17       |
| Scale               | #Dataset            | 1            | 3      | 1      | 3         | 4            | 3       | 5        |
| Scheme              | Zero-shot           | ✓            | ×      | ✓      | ✓         | ×            | ✓       | ✓        |
| Scheme              | Fine-tune           | ✓            | ✓      | ✓      | ×         | ✓            | ×       | ✓        |
| Item Representation | unique identifier   | ×            | ✓      | ✓      | ×         | ×            | ×       | ✓        |
| Item Representation | text                | ✓            | ×      | ✓      | ✓         | ✓            | ✓       | ✓        |
| Item Representation | semantic embedding  | ×            | ×      | ×      | ×         | ×            | ×       | ✓        |
| Item Representation | semantic identifier | ×            | ×      | ×      | ×         | ×            | ×       | ✓        |
| Scenario            | Pair-wise           | ×            | ✓      | ✓      | ✓         | ✓            | ✓       | ✓        |
| Scenario            | List-wise           | ✓            | ✓      | –      | ×         | ×            | ×       | ✓        |
| Metric              | Quality             | ✓            | ✓      | ✓      | ✓         | ✓            | ✓       | ✓        |
| Metric              | Efficiency          | ×            | ×      | ×      | ×         | ×            | ×       | ✓        |

In recent years, pretrained language models (PLMs), which are trained on general-purpose corpora and possess strong semantic understanding abilities, have been widely used to extract textual representations in various domains. In the recommendation domain, these open-source language models are also integrated with the recommendation model, serving as end-to-end item encoders that are fine-tuned on the recommendation tasks. The semantic embedding has been proven to be more effective than text, as the former introduces rich general semantics into the recommendation model [35, 42, 68].

Additionally, a new form of item representation, the semantic identifier, has been introduced in recent years. Based on semantic embeddings obtained from the LLMs, discrete encoding techniques like RQ-VAE [27, 36] are used to map all items into unique, shareable identifier combinations. Items with similar content will have longer common subsequences. The use of semantic identifiers not only efficiently compresses the item vocabulary, but also maintains solid semantic connections during training [38, 40].
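To make the semantic identifier idea more concrete, the sketch below shows residual quantization, the lookup step that RQ-VAE-style discretization is built around. It is a minimal illustration under stated assumptions, not the authors' pipeline: the codebooks here are random rather than learned, the encoder and decoder of RQ-VAE are omitted, the function name `residual_quantize` is hypothetical, and the embedding dimension of 8 is arbitrary. The 4 layers of 256 codes mirror the configuration the paper reports in its implementation details.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Greedy residual quantization: pick one code index per codebook layer."""
    residual = np.asarray(embedding, dtype=float)
    codes = []
    for codebook in codebooks:  # each codebook has shape (codebook_size, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next layer quantizes whatever this layer failed to explain.
        residual = residual - codebook[index]
    return codes

rng = np.random.default_rng(0)
# Assumption: random, untrained codebooks standing in for learned ones.
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]

item_embedding = rng.normal(size=8)  # stand-in for a content embedding of one item
semantic_id = residual_quantize(item_embedding, codebooks)
print(semantic_id)  # a list of 4 code indices, each in [0, 255]
```

With learned codebooks, items whose embeddings are close tend to agree on the early code indices, which is what lets similar items share identifier tokens.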
The emergence and advantages of the semantic identifier have reshaped sequential recommendation methods, also known as generative retrieval [48, 50, 62, 64]. They provide new input forms and alignment strategies between LLMs and recommender systems, paving the way for advancements in the LLM-as-RS paradigm [40, 78].

2.2 Evaluation Scenarios

As LLMs demonstrate significant reasoning capabilities across various domains [23, 25, 66], the recommendation community has begun to explore their direct application to recommendation tasks [6, 31]. This LLM-as-RS paradigm completely abandons conventional DLRMs, aiming to leverage the robust semantic understanding and deep Transformer architectures of LLMs to capture item features and model user preferences, ultimately generating recommendation results. To better understand how LLMs function within this paradigm, we consider two common recommendation evaluation scenarios, illustrated in Figure 1:

Pair-wise Recommendation, also known as straightforward recommendation [72], corresponds to the traditional Click-Through Rate (CTR) prediction task [13, 60]. The input consists of a user-item pair, and the LLM is expected to output a recommendation score for this pair (e.g., the predicted likelihood that the user will click on the item).

List-wise Recommendation typically corresponds to sequential recommendation tasks [22, 54]. The input comprises a sequence of items with positive feedback from a user, and the LLM is expected to predict the next item that the user is likely to engage with. In contrast to DLRMs that use structured feature inputs, the LLM-as-RS paradigm requires concatenating inputs in natural language and guiding the LLM to generate the final results.

2.3 LLMs as Recommender Systems

The progression of LLM-as-RS can be divided into three stages:

Stage One: Utilizing LLMs for Recommendations without Fine-tuning. In the initial stage, researchers explored whether general-purpose LLMs possess inherent recommendation abilities without any fine-tuning (a zero-shot setting). Experimental results indicated that while these methods [6, 31, 76] did not outperform conventional recommendation models, they were more effective than purely random recommendations, demonstrating a limited but noteworthy ability for LLMs to make recommendations. Representative works in this stage include LMasRS [76] based on BERT [24] and studies utilizing ChatGPT [46]. Since LLMs at this point could only process textual information, text was the sole form of item representation, serving as a bridge between LLMs and recommender systems across domains.

Stage Two: Fine-Tuning LLMs for Recommendation. In the second stage, researchers leveraged the deep reasoning abilities of LLMs by conducting supervised training on specific datasets to adapt them to recommendation scenarios. For example, Uni-CTR [9] and Recformer [28] continued to use semantic text as the medium for aligning recommender systems with LLMs, performing multi-scenario learning in pair-wise recommendation settings. Additionally, LLMs began to learn from non-textual signals during this phase. Models like P5 [10, 72] and VIP5 [11] used item unique identifiers for multi-task training on Amazon datasets [15], covering tasks such as score prediction, next-item prediction, and review generation. Furthermore, LLaRA [29] and LLM4IDRec [5] fine-tuned LLMs in sequential recommendation scenarios, enabling them to handle user behavior sequences more effectively.

Stage Three: Integration of Semantic Identifiers with LLMs. In the most recent stage, researchers combined semantic identifiers with LLMs to enhance recommendation performance [17].
For instance, LC-Rec [78] extended the multi-task learning paradigm of P5 but replaced item representations with semantic identifiers, achieving breakthrough results. STORE [40] innovatively proposed a unified framework that integrates discrete semantic encoding with generative recommendation, further advancing the capabilities of LLM-based recommender systems in multiple recommendation scenarios.

2.4 Comparison with Previous Benchmarks

Table 1 summarizes the existing benchmarks within the LLM-as-RS paradigm. Zhang et al. pioneered the use of language models for sequential recommendation, evaluating BERT’s recommendation abilities in both zero-shot and fine-tuning scenarios on the MovieLens dataset [14]. This work marked the inception of research into using LLMs directly as recommender systems.

OpenP5 [72] builds upon the P5 [10] method to evaluate multiple recommendation scenarios alongside conventional methods but utilizes only unique identifiers for item representation. PromptRec [71] focuses on cold-start scenarios, comparing LLMs and conventional DLRMs using solely semantic text for zero-shot recommendation, thereby highlighting the advantages of LLMs in content understanding. Jiang et al. [21] employ multidimensional evaluation metrics but fine-tune LLMs exclusively using semantic text. RSBench [33] primarily optimizes for conversational recommendation scenarios but uses only a single LLM and lacks comparisons with traditional recommendation models. LLMRec [32] imitates P5 by training LLMs through multitask learning and employs both unique identifiers and semantic text as item representations. However, LLMRec does not incorporate semantic identifiers into item representations. More importantly, it conducts experiments only on the Amazon Beauty dataset [15], limiting its generalizability and credibility as a benchmark.
Our RecBench provides a comprehensive evaluation of the recommendation abilities of seventeen LLMs across five datasets, encompassing both zero-shot and fine-tuning paradigms. Based on four forms of item representation and assessed in two recommendation scenarios, our benchmark uniquely evaluates the efficiency of recommendation models, aligning with the principles of Green AI [52] in the era of large models.

3 PROPOSED BENCHMARK: RECBENCH

In this section, we provide a comprehensive description of our benchmarking approaches in two recommendation scenarios.

3.1 Pair-wise Recommendation

Pair-wise recommendation estimates the probability $\hat{y}_{u,t}$ that a user $u$ interacts with (e.g., clicks on) an item $t$. Models are typically trained with binary cross-entropy loss:

$$\mathcal{L} = -\sum_{(u,t)\in\mathcal{D}} \left[ y_{u,t}\log\hat{y}_{u,t} + (1-y_{u,t})\log(1-\hat{y}_{u,t}) \right], \tag{1}$$

where $\mathcal{D}$ is the set of all user–item interactions. As illustrated in Figure 1, we use the user behavior sequence as the user-side feature.

Group A: Deep CTR models with unique identifier. For these models, each item embedding $\mathbf{t}$ is randomly initialized. A user is represented by averaging the embeddings of items in their behavior sequence:

$$\mathbf{u} = \frac{1}{N_u}\sum_{i=1}^{N_u}\mathbf{t}_{u_i}, \tag{2}$$

where $N_u$ is the sequence length. The CTR model $\Phi$ predicts the click probability as:

$$\hat{y}_{u,t} = \Phi(\mathbf{u}, \mathbf{t}). \tag{3}$$

The models to be benchmarked include DNN, PNN [49], DCN [60], DCNv2 [60], DeepFM [13], MaskNet [65], FinalMLP [43], AutoInt [53], and GDCN [59].

Group B: Deep CTR models with text. These models learn item representations from textual features:

$$\mathbf{t} = \frac{1}{N_t}\sum_{i=1}^{N_t}\mathbf{w}_{t_i}, \tag{4}$$

where $N_t$ is the text sequence length, and $\mathbf{w}_t$ denotes the item text sequence embeddings. Models include DNN_text, DCNv2_text, AutoInt_text, and GDCN_text.

Group C: Deep CTR models with semantic embedding. Here, item embeddings are initialized with pretrained semantic representations:

$$\mathbf{t} = g(\mathbf{w}_t), \tag{5}$$

where $g$ represents a large language model. We benchmark the DNN_emb, DCNv2_emb, AutoInt_emb, and GDCN_emb models.
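The pair-wise setup in Eqs. (1) to (3) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions rather than any benchmarked model: the function names are hypothetical, and the predictor standing in for Φ is just a sigmoid over a dot product, whereas the benchmarked deep CTR models (DNN, DCN, DeepFM, and so on) use learned networks.

```python
import math

def user_vector(item_embeddings, history):
    """Eq. (2): average the embeddings of the items in the behavior sequence."""
    dim = len(item_embeddings[history[0]])
    return [sum(item_embeddings[i][d] for i in history) / len(history)
            for d in range(dim)]

def click_probability(u, t):
    """Assumed stand-in for Phi in Eq. (3): sigmoid of the dot product."""
    score = sum(a * b for a, b in zip(u, t))
    return 1.0 / (1.0 + math.exp(-score))

def bce_loss(labels, predictions, eps=1e-12):
    """Eq. (1): binary cross-entropy summed over (user, item) interactions."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total

# Toy item table: item id -> embedding (randomly initialized in Group A).
items = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
u = user_vector(items, history=[0, 1])   # [0.5, 0.5]
p = click_probability(u, items[2])       # probability that the user clicks item 2
loss = bce_loss([1], [p])
```

Groups B and C keep the same loss and user pooling and only change how the item vector is produced, per Eqs. (4) and (5).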
Group D: LLM with unique identifier. Following P5 [10], we treat item unique identifiers as special tokens and fine-tune LLMs for recommendation. The classification logits $l_{\text{yes}}$ and $l_{\text{no}}$ for the YES and NO tokens are obtained from the final token. After softmax normalization over these two tokens, the click probability is:

$$\hat{y}_{u,t} = \frac{e^{l_{\text{yes}}}}{e^{l_{\text{yes}}} + e^{l_{\text{no}}}}. \tag{6}$$

Benchmarks include P5-BERT base, P5-OPT 350M, P5-OPT 1B, and P5-Llama-3 7B.

[Figure 2: (Left) Various forms of item representations. (Right) Benchmarking groups and their representative methods.]

Group E: LLM with text. In this group, items are represented solely by their textual features, without adding extra tokens. Owing to their natural language understanding, these LLMs are evaluated in both zero-shot and fine-tuned settings. Benchmarks include general-purpose models such as GPT-3.5 [46], the LLaMA series [7, 55, 56], Qwen [73], OPT [75], Phi [19], Mistral [20], GLM [12], DeepSeek-Qwen-2 [3], as well as recommendation-specific models like P5 [10] and RecGPT [44].

Group F: LLM with semantic identifier. Here, we replace the single unique identifier with multiple semantic identifiers per item.
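The two-token normalization of Eq. (6), used to turn the YES and NO logits of an LLM into a click probability, can be illustrated as follows. This is a minimal sketch with a hypothetical function name; in practice the two logits would be read from the model's output distribution at the final token position, which is not shown here.

```python
import math

def yes_no_click_probability(logit_yes, logit_no):
    """Eq. (6): softmax restricted to the YES and NO token logits."""
    # Subtracting the max keeps exp() from overflowing for large logits.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

print(yes_no_click_probability(0.0, 0.0))  # 0.5
```

Restricting the softmax to two tokens means the score depends only on the logit gap, so it equals a sigmoid applied to `logit_yes - logit_no`.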
We benchmark SID-BERT base and SID-OPT 350M, which use BERT base and OPT 350M as the LLM backbone, respectively.

3.2 List-wise Recommendation

List-wise recommendation predicts the next item $t_{u_x}$ that a user $u$ will interact with, given their historical behavior sequence $\mathbf{s}_u = \{s_{u_i}\}_{i=1}^{x-1}$. The model is trained using categorical cross-entropy loss:

$$\mathcal{L} = -\sum_{u\in\mathcal{U}} \log \frac{\exp\left(f(\mathbf{s}_u, t_{u_x})\right)}{\sum_{t'\in\mathcal{T}} \exp\left(f(\mathbf{s}_u, t')\right)}, \tag{7}$$

where $\mathcal{U}$ denotes the set of users, $t_{u_x}$ is the true next item, $\mathcal{T}$ is the candidate set, and $f(\mathbf{s}_u, t')$ computes the compatibility score.

Group G: SeqRec models with unique identifier. We benchmark a typical sequential recommendation model, SASRec [22], which uses item identifiers. The prediction score is defined as:

$$f(\mathbf{s}_u, t_{u_i}) = \mathbf{v}_{u_i}^{T}\mathbf{h}_{u_{i-1}}, \tag{8}$$

where $\mathbf{h}_{u_{i-1}}$ summarizes the user history up to $i-1$, and $\mathbf{v}_{u_i}$ denotes the latent classification vector for item $t_{u_i}$.

Group H: SeqRec models with semantic identifier. In this group, we extend the next-token prediction task (as in Group G) by representing each item with multiple semantic identifiers instead of a single unique identifier. This formulation decomposes an item into a sequence of tokens, where each valid token combination corresponds to a specific item. We benchmark the SID-SASRec model, which uses SASRec as its backbone.

During inference, we employ an autoregressive decoding strategy using beam search. At each decoding step, the model predicts a set of candidate tokens and maintains the top-K partial sequences (beams) based on their cumulative scores. However, since the item representation is structured as a path in a pre-constructed semantic identifier tree, standard beam search can produce token sequences that do not correspond to any valid item.

To overcome this limitation, we introduce a conditional beam search (CBS) technique. In our CBS approach, the semantic identifier tree organizes valid token sequences as paths from the root to a leaf node.
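The tree-constrained decoding idea in this passage can be sketched as a beam search whose candidates are filtered by a prefix tree of valid identifiers. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, `score_fn` is a stand-in for the model's per-token log-probability, and the three toy items use 3-token identifiers.

```python
def build_prefix_tree(item_codes):
    """Nested dicts: each root-to-leaf path is one valid semantic identifier."""
    root = {}
    for codes in item_codes:
        node = root
        for token in codes:
            node = node.setdefault(token, {})
    return root

def allowed_next_tokens(tree, prefix):
    """Tokens that extend `prefix` to another valid prefix in the tree."""
    node = tree
    for token in prefix:
        node = node.get(token)
        if node is None:
            return set()
    return set(node)

def constrained_beam_search(tree, score_fn, depth, beam_width):
    """Beam search that only ever expands tokens permitted by the tree."""
    beams = [((), 0.0)]  # (partial identifier, cumulative score)
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for token in allowed_next_tokens(tree, prefix):
                candidates.append((prefix + (token,),
                                   score + score_fn(prefix, token)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Toy catalog of three items, each identified by three tokens.
items = [(3, 5, 2), (3, 5, 1), (4, 1, 5)]
tree = build_prefix_tree(items)

# Assumed scorer: prefers smaller token ids (a real model would supply log-probs).
beams = constrained_beam_search(tree, lambda prefix, token: -float(token),
                                depth=3, beam_width=2)
print(beams[0])  # ((3, 5, 1), -9.0)
```

Because every expansion is drawn from the tree, each surviving beam is a prefix of some real item, so the final beams always decode to valid items.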
At every decoding step, the candidate tokens for each beam are filtered to retain only those that extend the current partial sequence to a valid prefix in the semantic identifier tree. This restriction ensures that each beam can eventually form a complete, valid item identifier. Only the tokens that lead to a leaf node (representing a complete and valid semantic identifier sequence) are allowed to contribute a positive prediction logit. We use the suffix -CBS to denote model inference with CBS.

Group I: LLMs with unique identifier. We extend the LLM-based framework to list-wise recommendation by incorporating item unique identifiers directly into the input prompt. The model is fine-tuned on the next-item prediction task by minimizing the categorical cross-entropy loss introduced in Group G. Benchmarks include P5-BERT base, P5-Qwen-2 0.5B, P5-Qwen-2 1.5B, and P5-OPT 1B.

Group J: LLMs with semantic identifier. Compared with Group I, we replace the item unique identifier with the semantic identifier in the input prompt. The model is fine-tuned using the categorical cross-entropy loss introduced in Group G. Conditional beam search is employed to ensure that the decoded semantic identifier sequence maps to a valid item. Benchmarks include SID-BERT base and SID-Llama-3 7B.

4 EXPERIMENTAL SETUP

4.1 Datasets

To avoid reliance on a single platform, we conduct all the experiments on five datasets from distinct domains and institutions: H&M for fashion recommendation, MIND for news recommendation, MicroLens for video recommendation, Goodreads for book recommendation, and CDs for music recommendation. Moreover, since the training and testing data sizes of the original datasets vary significantly, the comprehensive evaluation scores of the final models could be influenced by these discrepancies. To mitigate this issue, we perform uniform preprocessing on all datasets to obtain approximately similar dataset sizes. The specific details of the datasets are summarized in Table 2.

Table 2: Datasets statistics. “Micro.” and “Good.” represent the MicroLens and Goodreads dataset, respectively.

| Split                  | Statistic      | H&M     | MIND    | Micro.  | Good.   | CDs     |
|------------------------|----------------|---------|---------|---------|---------|---------|
|                        | Type           | Fashion | News    | Video   | Book    | Music   |
|                        | Text Attribute | desc    | title   | title   | name    | name    |
| Pair-wise Test set     | #Sample        | 20,000  | 20,006  | 20,000  | 20,009  | 20,003  |
| Pair-wise Test set     | #Item          | 26,270  | 3,088   | 15,166  | 26,664  | 36,765  |
| Pair-wise Test set     | #User          | 5,000   | 1,514   | 5,000   | 1,736   | 4,930   |
| Pair-wise Finetune set | #Sample        | 100,000 | 100,000 | 100,000 | 100,005 | 100,003 |
| Pair-wise Finetune set | #Item          | 60,589  | 17,356  | 19,111  | 74,112  | 113,671 |
| Pair-wise Finetune set | #User          | 25,000  | 8,706   | 25,000  | 8,604   | 24,618  |
| List-wise Test set     | #Seq           | 5,000   | 5,000   | 5,000   | 5,000   | 5,000   |
| List-wise Test set     | #Item          | 15,889  | 10,634  | 12,273  | 38,868  | 19,684  |
| List-wise Finetune set | #Seq           | 40,000  | 40,000  | 40,000  | 40,000  | 40,000  |
| List-wise Finetune set | #Item          | 35,344  | 24,451  | 18,841  | 136,296 | 95,409  |

4.2 Evaluation Metrics

Following common practice [8, 37, 39, 74], we evaluate recommendation performance using widely adopted metrics, including ranking metrics such as GAUC, nDCG, and MRR, as well as matching metrics like F1 and Recall. However, due to space limitations, we present only the GAUC metric for pair-wise recommendation tasks and nDCG@10 for list-wise recommendation scenarios. The full evaluation results will be available on our webpage.
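For readers less familiar with the two headline metrics, the sketch below gives one common way to compute them. The paper does not spell out its exact formulas, so the definitions here are assumptions: nDCG@k is computed for next-item prediction with a single relevant item (so the ideal DCG is 1), and GAUC is taken as the per-user AUC averaged with weights equal to each user's sample count, skipping users whose labels are all positive or all negative. All function names are hypothetical.

```python
import math
from collections import defaultdict

def ndcg_at_k(ranked_items, target, k=10):
    """nDCG@k with one relevant item: 1/log2(rank+1) if it is in the top k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gauc(user_ids, labels, scores):
    """Assumed definition: sample-weighted mean of per-user AUC."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    total, weight = 0.0, 0
    for ys, ss in groups.values():
        value = auc(ys, ss)
        if value is not None:
            total += value * len(ys)
            weight += len(ys)
    return total / weight if weight else None

print(ndcg_at_k(["a", "b", "c"], "a"))  # 1.0
print(gauc(["u1", "u1", "u2", "u2"], [1, 0, 1, 0], [0.9, 0.1, 0.2, 0.8]))  # 0.5
```

Grouping by user before computing AUC measures how well items are ranked within each user's candidates, rather than across users whose score scales may differ.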
Moreover, we use the latency (ms) metric to evaluate the model’s inference efficiency, calculated as the average time per inference over 1,000 runs on a single CPU device.

4.3 Implementation Details

Data Pre-processing. For datasets lacking user behavior sequences (i.e., HM, CDs, Goodreads, and MicroLens), we construct these sequences by arranging each user’s positive interactions in chronological order. In the pair-wise recommendation scenario, for datasets without provided negative samples (i.e., MicroLens and HM), we perform negative sampling for each user with a negative ratio of 2. Additionally, we truncate user behavior sequences to a maximum length of 20 to ensure consistency across datasets. For deep CTR models, i) under the text settings, we utilize the nltk package to tokenize the text data and subsequently retain only those tokens present in the GloVe vocabulary [47], and we did not use pretrained GloVe vectors during training; ii) under the semantic embedding settings, we use the Llama-1 7B model to extract the pretrained item embeddings.

Semantic Identifier Generation. We employ the pipeline proposed by TIGER [50] to generate semantic identifiers. First, we use an LLM, i.e., SentenceBERT [51], to extract embeddings for each item content. Then, we perform discretization training using the RQ-VAE [27] model on these embeddings. Following common practice [40, 50, 64], we utilize a 4-layer codebook, with each layer having a size of 256. The representation space of this codebook approximately reaches 4 billion ($256^4 \approx 4.3 \times 10^9$).

Identifier Vocabulary. Regardless of whether we use unique identifier or semantic identifier, we construct new identifier vocabularies for the LLM. Specifically, the vocabulary size $V$ matches the number of items when using unique identifier, or $V = 256 \times 4 = 1{,}024$ when using semantic identifier. We initialize a randomly generated embedding matrix $\mathbf{E}_{\text{id}} \in \mathbb{R}^{V \times d}$, where $d$ is the embedding dimension of the current LLM.

Model Fine-tuning.
We employ the low-rank adaptation (LoRA) technique [18] for parameter-efficient fine-tuning of large language models. For the pair-wise recommendation scenario, LoRA is configured with a rank of 32 and an alpha of 128, whereas for the list-wise recommendation scenario, both the rank and the alpha are set to 128. The learning rate is fixed at $1 \times 10^{-4}$ for LLM-based models and $1 \times 10^{-3}$ for other models. In addition, we set the batch size to 5,000 for all deep CTR models, 64 for models with fewer than 7B parameters, and 16 for models with 7B parameters.

5 PAIR-WISE RECOMMENDATION: FINDINGS

In this section, we present a comprehensive analysis of experimental results evaluating the recommendation abilities of LLMs in pair-wise recommendation scenarios.

5.1 Can LLMs Recommend in Zero-Shot Mode?

Most LLMs exhibit limited zero-shot recommendation abilities; however, models pretrained on data containing implicit recommendation signals, such as Mistral [20], GLM [12], and Qwen-2 [73], perform significantly better.

Table 3 presents the performance of various LLMs in the pair-wise recommendation scenario across multiple recommendation datasets. We report the AUC metric, where values closer to 0.5 indicate performance near random recommendation. Our findings reveal that most LLMs, from small-scale BERT [24] models to large-scale Llama [7, 55, 56] variants, struggle with general recommendation tasks. Although item representations are provided in text form that the LLMs can process, these models appear to have difficulty extracting user interests from behavior sequences and assessing the relevance between user interests and candidate items.

Table 3: LLM zero-shot performance in the pair-wise recommendation scenario. We display the AUC metric in this table. Latency is the averaged inference time per sample.

| Recommender | MIND | MicroLens | Goodreads | CDs | H&M | Overall | Latency |
|---|---|---|---|---|---|---|---|
| BERT base | 0.4963 | 0.4992 | 0.4958 | 0.5059 | 0.5204 | 0.5035 | 53.26 ms |
| OPT 350M | 0.5490 | 0.4773 | 0.5015 | 0.5093 | 0.4555 | 0.4985 | 332.34 ms |
| OPT 1B | 0.5338 | 0.5236 | 0.5042 | 0.4994 | 0.5650 | 0.5252 | 1.14 s |
| Llama-1 7B | 0.4583 | 0.4572 | 0.4994 | 0.4995 | 0.4035 | 0.4636 | 3.17 s |
| Llama-2 7B | 0.4945 | 0.4877 | 0.5273 | 0.5191 | 0.4519 | 0.4961 | 6.20 s |
| Llama-3 8B | 0.4904 | 0.5577 | 0.5191 | 0.5136 | 0.5454 | 0.5252 | 6.80 s |
| Llama-3.1 8B | 0.5002 | 0.5403 | 0.5271 | 0.5088 | 0.5462 | 0.5245 | 6.58 s |
| Mistral 7B | 0.6300 | 0.6579 | 0.5718 | 0.5230 | 0.7166 | 0.6199 | 7.68 s |
| GLM-4 9B | 0.6304 | 0.6647 | 0.5671 | 0.5213 | 0.7319 | 0.6231 | 9.69 s |
| Qwen-2 0.5B | 0.4868 | 0.5717 | 0.5148 | 0.5043 | 0.6287 | 0.5413 | 543.73 ms |
| Qwen-2 1.5B | 0.5411 | 0.6072 | 0.5264 | 0.5174 | 0.6615 | 0.5707 | 1.42 s |
| Qwen-2 7B | 0.5862 | 0.6640 | 0.5494 | 0.5256 | 0.7124 | 0.6075 | 6.15 s |
| DS-Qwen-2 7B | 0.5127 | 0.5631 | 0.5165 | 0.5146 | 0.5994 | 0.5413 | 7.52 s |
| Phi-2 3B | 0.4851 | 0.5078 | 0.5049 | 0.4991 | 0.5447 | 0.5083 | 2.10 s |
| GPT-3.5 | 0.5057 | 0.5110 | 0.5122 | 0.5046 | 0.5801 | 0.5227 | - |
| RecGPT 7B | 0.5078 | 0.4703 | 0.5083 | 0.5019 | 0.4875 | 0.4952 | 7.16 s |
| P5 Beauty | 0.4911 | 0.5017 | 0.5027 | 0.5447 | 0.4845 | 0.5049 | 74.11 ms |

Moreover, specialized recommendation models such as P5 [10] and RecGPT [44] also underperformed in our evaluations. P5, being an ID-based LLM recommender, effectively captures item semantics only on the datasets it was fine-tuned on (e.g., Beauty [15]), while RecGPT, a text-based recommendation model, suffers from similar limitations due to dataset-specific fine-tuning. This suggests that both models lack strong generalization and zero-shot inference capabilities.
Notably, the Mistral [20], GLM [12], and Qwen-2 [73] models demonstrated comparatively robust CTR prediction performance, with the recommendation effectiveness of Qwen-2 showing a positive correlation with model size.
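To make the 0.5 reference point in Table 3 concrete, the following minimal sketch computes AUC directly from its pairwise definition: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of RecBench.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p, n in product(positives, negatives)
    )
    return wins / (len(positives) * len(negatives))


# Toy click labels (1 = clicked) and three hypothetical scoring behaviours.
labels = [1, 0, 0, 1, 0, 0]
uninformative = pairwise_auc(labels, [0.5] * 6)                     # same score for every item
perfect = pairwise_auc(labels, [0.9, 0.1, 0.2, 0.8, 0.3, 0.4])      # clicked items on top
inverted = pairwise_auc(labels, [0.1, 0.9, 0.8, 0.2, 0.7, 0.6])     # clicked items at the bottom
print(uninformative, perfect, inverted)
```

Read this way, an overall AUC such as 0.4636 for Llama-1 7B means the model ranks clicked items below non-clicked ones slightly more often than not, whereas the Mistral, GLM, and Qwen-2 rows sit clearly above the 0.5 baseline.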
We hypothesize that these models may have been exposed to a broader mix of web content (including user interactions, reviews, and implicit recommendation signals), which could contribute to their enhanced generalization to recommendation tasks.

5.2 Can Fine-tuning Enhance LLM Recommendation Performance?

Fine-tuning significantly enhances the recommendation accuracy of LLMs. For instance, Llama-3 7B yields improvements of up to 43% after fine-tuning.

Subsequently, we perform instruction tuning on various LLMs across each dataset, aligning their capabilities with recommendation tasks through click-through rate prediction. Based on Table 3 and Table 4, our experiments indicate that such fine-tuning yields a relative improvement in overall AUC ranging from 22% to 43%, underscoring the importance of domain-specific alignment. Notably, Llama-3 7B outperformed Mistral-2 7B on the MicroLens and Goodreads datasets. Although Mistral-2 ranked among the top three in zero-shot scenarios, the overall performance of Llama-3 was comparable to that of Mistral-2, while smaller models such as BERT and OPT consistently lagged behind. These results highlight the superior semantic understanding and deep reasoning capabilities inherent in larger models.

Table 4: Comparison between fine-tuned LLMs and conventional DLRMs in the pair-wise recommendation scenario. We display the AUC metric in this table. Latency is the averaged inference time per sample.

| Recommender | MIND | MicroLens | Goodreads | CDs | H&M | Overall | Latency |
|---|---|---|---|---|---|---|---|
| DNN | 0.6692 | 0.7421 | 0.5831 | 0.5757 | 0.7952 | 0.6731 | 0.43 ms |
| PNN | 0.6581 | 0.7359 | 0.5801 | 0.5331 | 0.7648 | 0.6544 | 0.51 ms |
| DeepFM | 0.6670 | 0.7594 | 0.5782 | 0.5681 | 0.7749 | 0.6695 | 0.51 ms |
| DCN | 0.6625 | 0.7410 | 0.5902 | 0.5780 | 0.7913 | 0.6726 | 0.58 ms |
| DCNv2 | 0.6707 | 0.7578 | 0.5778 | 0.5664 | 0.7950 | 0.6735 | 4.43 ms |
| MaskNet | 0.6631 | 0.7179 | 0.5719 | 0.5532 | 0.7481 | 0.6508 | 3.12 ms |
| FinalMLP | 0.6649 | 0.7600 | 0.5807 | 0.5670 | 0.7858 | 0.6717 | 0.62 ms |
| AutoInt | 0.6690 | 0.7451 | 0.5879 | 0.5789 | 0.8027 | 0.6767 | 0.93 ms |
| GDCN | 0.6704 | 0.7571 | 0.5948 | 0.5784 | 0.8120 | 0.6825 | 0.69 ms |
| DNN text | 0.6867 | 0.7741 | 0.5857 | 0.5655 | 0.8475 | 0.6919 | 1.05 ms |
| DCNv2 text | 0.6802 | 0.7804 | 0.5789 | 0.5577 | 0.8560 | 0.6906 | 5.15 ms |
| AutoInt text | 0.6701 | 0.7761 | 0.5803 | 0.5687 | 0.8490 | 0.6888 | 1.41 ms |
| GDCN text | 0.6783 | 0.7842 | 0.5796 | 0.5641 | 0.8555 | 0.6923 | 1.21 ms |
| DNN emb | 0.7154 | 0.8141 | 0.5997 | 0.5848 | 0.8717 | 0.7171 | 1.32 ms |
| DCNv2 emb | 0.7167 | 0.8061 | 0.5999 | 0.5944 | 0.8626 | 0.7159 | 4.81 ms |
| AutoInt emb | 0.7081 | 0.8099 | 0.6015 | 0.5560 | 0.8594 | 0.7070 | 1.71 ms |
| GDCN emb | 0.7093 | 0.7997 | 0.5943 | 0.5828 | 0.8565 | 0.7085 | 1.63 ms |
| P5-BERT base | 0.5507 | 0.5850 | 0.5038 | 0.5162 | 0.5402 | 0.5392 | 38.01 ms |
| P5-OPT base | 0.6330 | 0.5099 | 0.5031 | 0.4989 | 0.4939 | 0.5278 | 255.70 ms |
| P5-OPT large | 0.6512 | 0.6984 | 0.5110 | 0.5281 | 0.6177 | 0.6013 | 950.89 ms |
| P5-Llama-3 7B | 0.6697 | 0.7457 | 0.5780 | 0.5688 | 0.7260 | 0.6576 | 6.35 s |
| BERT base | 0.7175 | 0.8066 | 0.5148 | 0.5789 | 0.8635 | 0.6962 | 53.26 ms |
| OPT 1B | 0.7346 | 0.8016 | 0.5889 | 0.5850 | 0.5121 | 0.6444 | 1.14 s |
| Llama-3 7B | 0.7345 | 0.8328 | 0.6826 | 0.6268 | 0.8771 | 0.7508 | 6.80 s |
| Mistral-2 7B | 0.7353 | 0.8295 | 0.6680 | 0.6754 | 0.8810 | 0.7578 | 7.68 s |
| SID-BERT base | 0.5704 | 0.5860 | 0.4914 | 0.5042 | 0.5401 | 0.5384 | 36.40 ms |
| SID-OPT base | 0.5987 | 0.4989 | 0.5004 | 0.4977 | 0.4957 | 0.5183 | 286.56 ms |

5.3 Performance Comparison: LLMs vs. Conventional Deep CTR Models

Large-scale LLMs (e.g., Llama, Mistral) achieve roughly a 5% relative improvement in overall AUC compared to the best conventional recommender (DNN emb) using semantic embedding.
However, these gains come with significant latency; the best conventional recommender retains 95% of the performance while being 5,800 times faster.

Table 5: Comparison between LLM recommenders and conventional DLRMs in the list-wise recommendation scenario. We display the NDCG@10 metric in this table. Latency is the averaged inference time per sample. “-CBS” denotes models applying the conditional beam search technique (described in Sec 3.2) during inference.

| Recommender | MIND | MicroLens | Goodreads | CDs | H&M | Overall | Latency |
|---|---|---|---|---|---|---|---|
| SASRec 3L | 0.0090 | 0.0000 | 0.0165 | 0.0016 | 0.0209 | 0.0096 | 23.30 ms |
| SASRec 6L | 0.0097 | 0.0006 | 0.0224 | 0.0012 | 0.0297 | 0.0127 | 38.43 ms |
| SASRec 12L | 0.0241 | 0.0297 | 0.0548 | 0.1041 | 0.1235 | 0.0672 | 51.77 ms |
| SASRec 24L | 0.0119 | 0.0312 | 0.0601 | 0.1267 | 0.1191 | 0.0698 | 103.41 ms |
| BERT base | 0.0430 | 0.1867 | 0.0557 | 0.1198 | 0.1075 | 0.1025 | 41.54 ms |
| Qwen-2 0.5B | 0.0549 | 0.0201 | 0.0322 | 0.0128 | 0.0234 | 0.0287 | 556.95 ms |
| Qwen-2 1.5B | 0.0506 | 0.0254 | 0.0316 | 0.0015 | 0.0217 | 0.0262 | 1.12 s |
| Llama-3 7B | 0.0550 | 0.0178 | 0.0134 | 0.0072 | 0.0353 | 0.0257 | 28.06 s |
| SID-SASRec 3L | 0.0266 | 0.0028 | 0.0029 | 0.0000 | 0.0084 | 0.0081 | 36.12 ms |
| SID-SASRec 3L-CBS | 0.0849 | 0.0123 | 0.0127 | 0.0007 | 0.0422 | 0.0306 | 66.67 ms |
| SID-SASRec 6L | 0.0225 | 0.0047 | 0.0038 | 0.0140 | 0.0097 | 0.0109 | 59.08 ms |
| SID-SASRec 6L-CBS | 0.0647 | 0.0179 | 0.0141 | 0.0331 | 0.0406 | 0.0341 | 90.41 ms |
| SID-SASRec 12L | 0.0201 | 0.0044 | 0.0039 | 0.0136 | 0.0165 | 0.0117 | 1.31 s |
| SID-SASRec 12L-CBS | 0.0695 | 0.0234 | 0.0140 | 0.0324 | 0.0598 | 0.0398 | 1.34 s |
| SID-BERT base | 0.0654 | 0.0022 | 0.0025 | 0.3539 | 0.0467 | 0.0941 | 1.83 s |
| SID-BERT base-CBS | 0.1682 | 0.1195 | 0.0059 | 0.4616 | 0.1834 | 0.1877 | 1.90 s |
| SID-Llama-3 7B | 0.0456 | 0.0255 | 0.0221 | 0.2443 | 0.0337 | 0.0742 | 167.25 s |
| SID-Llama-3 7B-CBS | 0.1677 | 0.0827 | 0.0508 | 0.3898 | 0.1125 | 0.1607 | 177.54 s |

Table 4 compares recommendation performance using various item representation forms for both conventional recommenders (i.e., DLRM) and LLM-based approaches.
The key findings are as follows:

Firstly, even without textual modalities, conventional unique identifier-based CTR models outperform the zero-shot LLM-based recommenders (Table 3), highlighting the importance of interaction data. Moreover, fine-tuned unique identifier-based LLMs still lag behind, likely because they struggle to capture explicit feature interactions.

Secondly, incorporating textual data into CTR models yields clear gains (e.g., the overall AUC rises from 0.6731 for DNN to 0.6919 for DNN text). We did not use pretrained word embeddings, as training on the item-side text alone already captures robust item relationships.

Thirdly, initializing item representations with embeddings from Llama-1 for semantic embedding-based CTR models introduces high-quality semantic information, outperforming both prior methods and small text-based LLMs (e.g., BERT, OPT) due to Llama’s superior semantic quality and deeper network architecture.

Fourthly, text-based LLMs using large models like Llama-3 and Mistral-2 outperform all baselines, demonstrating their disruptive potential in recommendation tasks.

Fifthly, by contrast, fine-tuning semantic identifier-based LLMs yields poor performance in CTR scenarios, likely due to smaller models’ limited ability to learn discrete semantic information.

Sixthly, in terms of efficiency, semantic embedding-based CTR models within the LLM-for-RS paradigm offer the best cost-effectiveness with minimal modifications to traditional architectures, making this approach one of the most practical in industry.

6 LIST-WISE RECOMMENDATION: FINDINGS

In this section, we present the results from the list-wise recommendation scenario. Notably, sequential recommenders [22, 50] typically rely on next-item prediction to map user histories to specific items, which is incompatible with using text as the item representation. Consequently, we focus on evaluating two forms: unique identifier and semantic identifier.
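The two identifier forms can be contrasted with a small sketch. It takes the 4-layer, 256-entry codebook from Section 4.3 and assumes, purely for illustration, that each layer's codes occupy their own contiguous block of the new vocabulary; the function names and the example item codes are hypothetical and are not taken from the RecBench code.

```python
NUM_LAYERS = 4        # codebook layers, as stated in Section 4.3
CODEBOOK_SIZE = 256   # entries per layer, as stated in Section 4.3


def unique_id_tokens(item_index):
    """Unique identifier: one new token per item, so V equals the number of items."""
    return [item_index]


def semantic_id_tokens(codes):
    """Semantic identifier: one token per codebook layer.

    Assumption: layer k uses token ids in [k * 256, (k + 1) * 256), which
    gives a flat vocabulary of 4 * 256 = 1,024 tokens shared by all items.
    """
    if len(codes) != NUM_LAYERS:
        raise ValueError("expected one code per codebook layer")
    if any(not 0 <= code < CODEBOOK_SIZE for code in codes):
        raise ValueError("code out of range for the codebook")
    return [layer * CODEBOOK_SIZE + code for layer, code in enumerate(codes)]


semantic_vocab_size = NUM_LAYERS * CODEBOOK_SIZE   # number of new tokens
addressable_items = CODEBOOK_SIZE ** NUM_LAYERS    # number of distinct code tuples

example_tokens = semantic_id_tokens([17, 200, 3, 255])  # hypothetical item codes
print(semantic_vocab_size, addressable_items, example_tokens)
```

In both cases the tokens are new to the LLM's vocabulary; the semantic form simply keeps that vocabulary at 1,024 entries regardless of catalogue size, while a unique-identifier vocabulary grows with the number of items.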
Since LLMs do not inherently recognize these unseen tokens, they exhibit no zero-shot recommendation abilities and require fine-tuning.

6.1 Unique ID vs. Semantic ID

Overall, semantic identifier has proven to be a more effective representation than unique identifier, whether integrated with LLMs or traditional recommenders, highlighting the value of incorporating item content knowledge into sequential recommenders.

Based on Table 5, which evaluates the recommendation abilities of LLMs and conventional DLRMs in the list-wise recommendation scenario, we can make the following observations:

Firstly, within the SASRec series, performance generally improves with an increasing number of transformer layers, reflecting the scaling behavior of conventional sequential recommenders. Notably, SID-SASRec with conditional beam search outperforms standard SASRec when using a smaller number of layers (overall NDCG@10 of 0.0306 vs. 0.0096 at 3 layers and 0.0341 vs. 0.0127 at 6 layers). This suggests that semantic identifier, by decomposing item representations into logically and hierarchically structured tokens, enables shallower networks to better capture user interests. However, as the number of layers increases, the advantage of semantic identifier diminishes, likely because deeper architectures in SASRec can more effectively learn user sequence patterns, even without pretrained semantic information.

Secondly, comparing the pairs (BERT base, SID-BERT base-CBS) and (Llama-3 7B, SID-Llama-3 7B-CBS), we observe that LLMs with semantic identifier outperform their unique identifier counterparts in overall NDCG@10: the BERT base pair improves by 83% (0.1025 to 0.1877), and the Llama-3 7B pair shows an even larger relative gain (0.0257 to 0.1607). This underscores the effectiveness and potential of the semantic identifier representation in enhancing recommendation performance.

6.2 Performance Comparison: LLMs vs. Conventional Sequential Recommenders

LLMs outperform traditional sequential recommenders in accuracy using either unique identifier or semantic identifier representations, but their inference efficiency remains a critical issue requiring urgent improvement.
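The relative gains behind these comparisons can be checked with a few lines of arithmetic on the Overall column of Table 5. The sketch below is only a worked calculation; the helper name `relative_gain` and the dictionary layout are illustrative choices.

```python
# Overall NDCG@10 values copied from Table 5.
OVERALL_NDCG10 = {
    "SASRec 24L": 0.0698,
    "BERT base": 0.1025,
    "SID-BERT base-CBS": 0.1877,
    "Llama-3 7B": 0.0257,
    "SID-Llama-3 7B-CBS": 0.1607,
}


def relative_gain(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Best LLM recommender vs. the strongest SASRec variant.
llm_vs_sasrec = relative_gain(
    OVERALL_NDCG10["SASRec 24L"], OVERALL_NDCG10["SID-BERT base-CBS"]
)
# Semantic identifier (with CBS) vs. unique identifier, same backbone.
sid_vs_uid_bert = relative_gain(
    OVERALL_NDCG10["BERT base"], OVERALL_NDCG10["SID-BERT base-CBS"]
)
sid_vs_uid_llama = relative_gain(
    OVERALL_NDCG10["Llama-3 7B"], OVERALL_NDCG10["SID-Llama-3 7B-CBS"]
)

print(f"LLM vs. SASRec 24L: {llm_vs_sasrec:.1f}%")
print(f"SID vs. UID (BERT base): {sid_vs_uid_bert:.1f}%")
print(f"SID vs. UID (Llama-3 7B): {sid_vs_uid_llama:.1f}%")
```

The first figure comes out at about 169%, in line with the "up to 170%" NDCG@10 improvement quoted in the abstract, and the BERT base pair gives about 83%.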
Based on unique identifier representations, the BERT base model outperforms both SASRec 12L (which shares the same network architecture as BERT base) and the deeper SASRec 24L. Despite the absence of textual features in item representations, this observation suggests that language patterns acquired during pretraining bear an abstract similarity to user interest patterns in recommender systems, thereby facilitating effective knowledge transfer.

Furthermore, LLM recommenders employing semantic identifier representations exhibit markedly superior performance compared to the SID-SASRec series. By incorporating semantic item knowledge, semantic identifier enables LLMs to more effectively interpret user sequences and capture high-quality user interests. Additionally, models utilizing conditional beam search constraints (the -CBS series) achieve further improvements in recommendation performance.

However, these gains come at a substantial cost in inference efficiency; overall, LLM recommenders require nearly 1,000 times more inference time than SASRec. This significant efficiency gap represents a critical challenge that should be addressed to ensure the practical deployment of LLM recommenders.

7 CONCLUSION

In this work, we introduced the RecBench platform, a comprehensive benchmark designed to evaluate the LLM-as-RS paradigm in recommender systems. By systematically investigating various item representation forms and covering both click-through rate prediction and sequential recommendation tasks, our study spans diverse datasets and a wide range of models. Our evaluation reveals that, while LLM-based recommenders (especially those leveraging large-scale models) can achieve significant performance gains across multiple recommendation scenarios, they continue to face substantial efficiency challenges relative to conventional DLRMs.
This trade-off underscores the need for further research into inference acceleration techniques, which are crucial for the practical deployment of LLM-based recommenders in high-throughput industrial settings.

REFERENCES

[1] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems. 1007–1014.
[2] Keqin Bao, Jizhi Zhang, Yang Zhang, Wang Wenjie, Fuli Feng, and Xiangnan He. 2023. Large language models for recommendation: Progresses and future directions. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 306–309.
[3] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024).
[4] Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. 2024. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 27, 4 (2024), 42.
[5] Lei Chen, Chen Gao, Xiaoyi Du, Hengliang Luo, Depeng Jin, Yong Li, and Meng Wang. 2024. Enhancing ID-based Recommendation with Large Language Models. arXiv preprint arXiv:2411.02041 (2024).
[6] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering chatgpt’s capabilities in recommender systems. In Proceedings of the 17th ACM Conference on Recommender Systems. 1126–1132.
[7] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.
arXiv preprint arXiv:2407.21783 (2024).
[8] Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Jie Wang, and Joemon M Jose. 2024. IISAN: Efficiently adapting multimodal representation for sequential recommendation with decoupled PEFT. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 687–697.
[9] Zichuan Fu, Xiangyang Li, Chuhan Wu, Yichao Wang, Kuicai Dong, Xiangyu Zhao, Mengchen Zhao, Huifeng Guo, and Ruiming Tang. 2023. A unified framework for multi-domain ctr prediction via large language models. ACM Transactions on Information Systems (2023).
[10] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
[11] Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. Vip5: Towards multimodal foundation models for recommendation. arXiv preprint arXiv:2305.14302 (2023).
[12] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024).
[13] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1725–1731. https://doi.org/10.24963/IJCAI.2017/239
[14] F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
[15] Ruining He and Julian McAuley. 2016.
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. 507–517.
[16] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 720–730.
[17] Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, et al. 2025. EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration. arXiv preprint arXiv:2502.14735 (2025).
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3.
[19] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog 1, 3 (2023), 3.
[20] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
[21] Chumeng Jiang, Jiayin Wang, Weizhi Ma, Charles LA Clarke, Shuai Wang, Chuhan Wu, and Min Zhang. 2024. Beyond Utility: Evaluating LLM as Recommender. arXiv preprint arXiv:2411.00331 (2024).
[22] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[23] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274.
[24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. Minneapolis, Minnesota, 2.
[25] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35 (2022), 22199–22213.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).
[27] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. 2022. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11523–11532.
[28] Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023. Text is all you need: Learning language representations for sequential recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1258–1267.
[29] Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2024. Llara: Large language-recommendation assistant. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1785–1795.
[30] Carlos García Ling, ElizabethHMGroup, FridaRim, inversion, Jaime Ferrando, Maggie, neuraloverflow, and xlsrln. 2022. H&M Personalized Fashion Recommendations.
https://kaggle.com/competitions/h-and-m-personalized-fashion-recommendations. Kaggle.
[31] Junling Liu, Chao Liu, Peilin Zhou, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is chatgpt a good recommender? a preliminary study. arXiv preprint arXiv:2304.10149 (2023).
[32] Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, et al. 2023. Llmrec: Benchmarking large language models on recommendation task. arXiv preprint arXiv:2308.12241 (2023).
[33] Jiao Liu, Zhu Sun, Shanshan Feng, and Yew-Soon Ong. 2024. Language Model Evolutionary Algorithms for Recommender Systems: Benchmarks and Algorithm Comparisons. arXiv preprint arXiv:2411.10697 (2024).
[34] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023. A first look at llm-powered generative news recommendation. CoRR (2023).
[35] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2024. Once: Boosting content-based recommendation with both open- and closed-source large language models. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 452–461.
[36] Qijiong Liu, Xiaoyu Dong, Jiaren Xiao, Nuo Chen, Hengchang Hu, Jieming Zhu, Chenxu Zhu, Tetsuya Sakai, and Xiao-Ming Wu. 2024. Vector quantization for recommender systems: a review and outlook. arXiv preprint arXiv:2405.03110 (2024).
[37] Qijiong Liu, Lu Fan, and Xiao-Ming Wu. 2025. Legommenders: A Comprehensive Content-Based Recommendation Library with LLM Support.
[38] Qijiong Liu, Hengchang Hu, Jiahao Wu, Jieming Zhu, Min-Yen Kan, and Xiao-Ming Wu. 2024. Discrete Semantic Tokenization for Deep CTR Prediction. In Companion Proceedings of the ACM on Web Conference 2024. 919–922.
[39] Qijiong Liu, Jieming Zhu, Quanyu Dai, and Xiao-Ming Wu. 2022. Boosting deep CTR prediction with a plug-and-play pre-trainer for news recommendation.
In Proceedings of the 29th International Conference on Computational Linguistics. 2823–2833.
[40] Qijiong Liu, Jieming Zhu, Lu Fan, Zhou Zhao, and Xiao-Ming Wu. 2024. STORE: Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM. arXiv preprint arXiv:2409.07276 (2024).
[41] Yucong Luo, Mingyue Cheng, Hao Zhang, Junyu Lu, Qi Liu, and Enhong Chen. 2023. Unlocking the potential of large language models for explainable recommendations. arXiv preprint arXiv:2312.15661 (2023).
[42] Itzik Malkiel, Oren Barkan, Avi Caciularu, Noam Razin, Ori Katz, and Noam Koenigstein. 2020. RecoBERT: A catalog language model for text-based recommendations. arXiv preprint arXiv:2009.13292 (2020).
[43] Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, and Zhenhua Dong. 2023. FinalMLP: an enhanced two-stream MLP model for CTR prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4552–4560.
[44] Hoang Ngo and Dat Quoc Nguyen. 2024. RecGPT: Generative Pre-training for Text-based Recommendation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Short Papers, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, 302–313. https://aclanthology.org/2024.acl-short.29
[45] Yongxin Ni, Yu Cheng, Xiangyan Liu, Junchen Fu, Youhua Li, Xiangnan He, Yongfeng Zhang, and Fajie Yuan. 2023. A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv preprint arXiv:2309.15379 (2023).
[46] OpenAI. 2023. GPT-3.5. https://openai.com/gpt Large language model.
[47] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[48] Haohao Qu, Wenqi Fan, Zihuai Zhao, and Qing Li. 2024.
TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation. arXiv preprint arXiv:2406.10450 (2024).
[49] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
[50] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. 2023. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems 36 (2023), 10299–10315.
[51] Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019).
[52] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020. Green ai. Commun. ACM 63, 12 (2020), 54–63.
[53] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
[54] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
[55] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[56] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288 (2023).
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[58] Mengting Wan and Julian J. McAuley. 2018. Item recommendation on monotonic behavior chains. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan (Eds.). ACM, 86–94. https://doi.org/10.1145/3240323.3240369
[59] Fangye Wang, Hansu Gu, Dongsheng Li, Tun Lu, Peng Zhang, and Ning Gu. 2023. Towards deeper, lighter and interpretable cross network for CTR prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2523–2533.
[60] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. 1–7.
[61] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021. 1785–1797.
[62] Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2024. Learnable Item Tokenization for Generative Recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 2400–2409.
[63] Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023. Rethinking the evaluation for conversational recommendation in the era of large language models. arXiv preprint arXiv:2305.13112 (2023).
[64] Ye Wang, Jiahao Xun, Minjie Hong, Jieming Zhu, Tao Jin, Wang Lin, Haoyuan Li, Linjun Li, Yan Xia, Zhou Zhao, et al. 2024. EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration.
In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3245–3254.
[65] Zhiqiang Wang, Qingyun She, and Junlin Zhang. 2021. Masknet: Introducing feature-wise multiplication to CTR ranking models by instance-guided mask. arXiv preprint arXiv:2102.07619 (2021).
[66] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
[67] Wei Wei, Xubin Ren, Jiabin Tang, Qinyong Wang, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2024. Llmrec: Large language models with graph augmentation for recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 806–815.
[68] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2021. Empowering news recommendation with pre-trained language models. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1652–1656.
[69] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and Ming Zhou. 2020. MIND: A Large-scale Dataset for News Recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Online. Presented at ACL 2020.
[70] Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2024. A survey on large language models for recommendation. World Wide Web 27, 5 (2024), 60.
[71] Xuansheng Wu, Huachi Zhou, Yucheng Shi, Wenlin Yao, Xiao Huang, and Ninghao Liu. 2024. Could Small Language Models Serve as Recommenders? Towards Data-centric Cold-start Recommendation. In Proceedings of the ACM on Web Conference 2024. 3566–3575.
[72] Shuyuan Xu, Wenyue Hua, and Yongfeng Zhang. 2024. Openp5: An open-source platform for developing, training, and evaluating llm-based recommender systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 386–394.
[73] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024).
[74] Chiyu Zhang, Yifei Sun, Minghao Wu, Jun Chen, Jie Lei, Muhammad Abdul-Mageed, Rong Jin, Angli Liu, Ji Zhu, Sem Park, et al. 2024. EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations. In Proceedings of the 18th ACM Conference on Recommender Systems. 1010–1015.
[75] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022).
[76] Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang. 2021. Language models as recommender systems: Evaluations and limitations. In NeurIPS 2021 Workshop on I (Still) Can’t Believe It’s Not Better. https://www.amazon.science/publications/language-models-as-recommender-systems-evaluations-and-limitations
[77] Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, and Qing Li. 2024. Recommender Systems in the Era of Large Language Models (LLMs). IEEE Trans. Knowl. Data Eng. 36, 11 (2024), 6889–6907. https://doi.org/10.1109/TKDE.2024.3392335
[78] Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. 2024. Adapting large language models by integrating collaborative semantics for recommendation.
In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 1435–1448.