Authors: Chiyu Zhang, Yifei Sun, Minghao Wu, Jun Chen, Jie Lei, Muhammad Abdul-Mageed, Rong Jin, Angli Liu, Ji Zhu, Sem Park, Ning Yao, Bo Long
Page 1:
EmbSum: Leveraging the Summarization Capabilities of Large
Language Models for Content-Based Recommendations
Chiyu Zhang†∗
University of British
Columbia, CanadaYifei Sun∗
Meta AI, USAMinghao Wu
Monash University,
AustraliaJun Chen
Meta AI, USA
Jie Lei
Meta AI, USAMuhammad
Abdul-Mageed
University of British
Columbia, Canada
MBZUAI, UAERong Jin
Meta AI, USAAngli Liu
Meta AI, USA
Ji Zhu
Meta AI, USASem Park
Meta AI, USANing Yao
Meta AI, USABo Long
Meta AI, USA
Abstract
Content-based recommendation systems play a crucial role in de-
livering personalized content to users in the digital world. In this
work, we introduce EmbSum, a novel framework that enables of-
fline pre-computations of users and candidate items while capturing
the interactions within the user engagement history. By utilizing
the pretrained encoder-decoder model and poly-attention layers,
EmbSum derives User Poly-Embedding (UPE) and Content Poly-
Embedding (CPE) to calculate relevance scores between users and
candidate items. EmbSum actively learns the long user engagement
histories by generating user-interest summary with supervision
from large language model (LLM). The effectiveness of EmbSum
is validated on two datasets from different domains, surpassing
state-of-the-art (SoTA) methods with higher accuracy and fewer pa-
rameters. Additionally, the model’s ability to generate summaries
of user interests serves as a valuable by-product, enhancing its
usefulness for personalized content recommendations.
CCS Concepts
•Information systems →Content ranking ;•Computing
methodologies→Natural language generation .
Keywords
Recommendation System, User Interest Summarization, Large Lan-
guage Model
ACM Reference Format:
Chiyu Zhang†, Yifei Sun, Minghao Wu, Jun Chen, Jie Lei, Muhammad Abdul-
Mageed, Rong Jin, Angli Liu, Ji Zhu, Sem Park, Ning Yao, and Bo Long. 2024.
∗Corresponding Authors: chiyuzh@mail.ubc.ca; sunyifei@meta.com
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
RecSys ’24, October 14–18, 2024, Bari, Italy
©2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/10.1145/nnnnnnn.nnnnnnnEmbSum: Leveraging the Summarization Capabilities of Large Language
Models for Content-Based Recommendations. In 18th ACM Conference on
Recommender Systems (RecSys ’24), October 14–18, 2024, Bari, Italy. ACM,
New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
In the thriving digital world, billions of users interact daily with a
diverse range of digital content, encompassing news, social media
updates, e-books, etc. Content-based recommendation systems, as
discussed in various studies [ 4,16–18], leverage the textual con-
tent, such as news articles and books, and the sequence of a user’s
interaction history. It facilitates the delivery of content recommen-
dations that are more precise, relevant, and customized to each
user’s preferences.
Recent studies have successfully incorporated Pretrained Lan-
guage Models (PLMs) into recommendation systems for processing
textual inputs [ 13,14,30]. This integration has significantly im-
proved the efficiency of content-based recommendations. Due to
the well-known memory limitation of the attention mechanism,
previous studies [ 11,33] typically encode each piece of user histor-
ical content separately and then aggregate them. This approach,
however, falls short in modeling the interactions among the user’s
historical contents. Addressing this, Mao et al . [17] introduce local
and global attention mechanisms to encode user histories hierarchi-
cally. However, this method needs to truncate the history sequence
to 1K tokens due to the limitation of PLMs, thus diminishing the
benefits of leveraging extensive engagement history to capture
users’ comprehensive interests. In another vein, to improve the
alignment between user and candidate content, many studies have
directly integrated candidate items into user modeling [ 11,20,32].
This online real-time strategy prevents recommendation systems
from performing offline pre-computations for efficient inference,
which restricts their real-world applications.
To tackle the aforementioned challenges, we introduce a new
framework, EmbSum, which enables offline pre-computations of
†Work done during Meta internship.arXiv:2405.11441v2 [cs.IR] 19 Aug 2024
Page 2:
RecSys ’24, October 14–18, 2024, Bari, Italy Zhang et al.
Figure 1: Overview of our EmbSum framework. Note that the user summaries generated by LLMs are only used in training.
embeddings of users and candidate items while capturing the in-
teractions within the user’s long engagement history. As Figure 1
shows, we utilize poly-attention [ 5] layers to derive multiple em-
beddings for the detailed features of both users and candidate
items, referred to as User Poly-Embedding (UPE) and Content Poly-
Embedding (CPE), respectively. These embeddings are then used to
calculate the relevance scores between users and candidate items.
More specifically, we use the pretrained T5 encoder [ 21] to en-
code user engagement sessions independently. We hypothesize that
merely concatenating the embeddings of history sequences does
not effectively model the interactions between user engagement
sessions. To address this, we use the T5 decoder to fuse session-
based encoded sequences by training it to generate user-interest
summaries, supervised by the large language models (LLMs) [ 9]
generated summaries of user’s holistic interests when modeling
user representations.
Our contributions in this work can be summarized as follows:
(1)We present a new framework, EmbSum , for embedding and
summarizing user interests in content-based recommenda-
tion systems. This framework employs an encoder-decoder
architecture to encode extensive user engagement histories
and produce summaries of user interests.
(2)We validate the effectiveness of EmbSum by testing it on
two popular datasets from different domains. Our approach
surpasses SoTA methods, delivering higher accuracy with
fewer parameters.
(3)Our model can generate summaries of user interests, which
serves as a beneficial by-product, thereby enhancing its use-
fulness for personalized content recommendations.
2 Methodology
In this section, we first describe the problem formulation of our
work (Section 2.1). Then, we provide an overview of our EmbSum
framework (Section 2.2). Next, we introduce the details of modeling
user engagements (Section 2.3), candidate contents (Section 2.4),
and the click-through predictor and training objectives (Section 2.5).2.1 Problem Formulation
Given a user 𝑢𝑖and a candidate content item ¯𝑒𝑗(such as news
articles or books), the objective is to derive a relevance score 𝑠𝑖
𝑗,
which indicates the likelihood of user 𝑢𝑖engaging with (e.g., click-
ing on) the content item ¯𝑒𝑗. Considering a set of candidate contents
𝐶={¯𝑒1,¯𝑒2,..., ¯𝑒𝑗}, these contents are ranked based on their rel-
evance scores{𝑠𝑖
1,𝑠𝑖
2,...,𝑠𝑖
𝑗}for user𝑢𝑖. It is crucial to effectively
extract user interests from their engagement history. User 𝑢𝑖is
characterized by a sequence of 𝑘historically engaged contents 𝐸𝑢𝑖
(such as browsed news articles or positively rated books), sorted in
descending order by engagement time.
2.2 Overview of EmbSum
Figure 1 presents an overview of our proposed model, EmbSum.
Both the user engagements and the candidate contents are encoded
using a pretrained encoder-decoder Transformer model. The click-
through rate (CTR) predictions between the users and the candidate
content are modeled using noisy contrastive estimation (NCE) loss.
Moreover, we also introduce a user-interest summarization objec-
tive that is supervised by LLM generations. By leveraging both
poly-embedding of user engagement history and candidate content,
EmbSum can identify patterns and preferences from user interac-
tions and match them with the most suitable content.
2.3 User Engagement Modeling
Session Encoding .The user engagement history, denoted as
𝐸𝑢𝑖, comprises a sequence of 𝑘content items that a user has pre-
viously engaged with. To address the high memory demands of
processing long sequences with attention mechanisms, we parti-
tion these𝑘content items into 𝑔distinct sessions, represented as
𝐸𝑢𝑖={𝜂1,𝜂2,...,𝜂 𝑔}. Each session 𝜂𝑖encapsulates 𝑙tokens from 𝑝
content items, expressed as 𝜂𝑖={𝑒1,𝑒2,...,𝑒 𝑝}. This structure is
designed to reflect the user’s interests over specific time periods and
to improve the interactions within each session. We encode each
session using a T5 encoder [ 21] independently. The representation
Page 3:
EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations RecSys ’24, October 14–18, 2024, Bari, Italy
of each piece of content is then derived from the hidden state out-
put corresponding to its first token, which is the start-of-sentence
symbol [SOS] . This process yields 𝑘representation vectors.
User Engagement Summarization .We posit that merely uti-
lizing these 𝑘representations falls short in capturing the sub-
tleties of user interests and the dynamics among long-range en-
gaged contents. To address this, we exploit the capabilities of SoTA
LLMs to synthesize a distilled summary of user interests, given
that recent studies [ 1,19] have demonstrated LLMs’ capacity for
summarizing long sequences. This summary encapsulates a rich
and comprehensive perspective of user preferences. We utilize
Mixtral-8x22B-Instruct [9] to generate these interest summaries
from the engagement histories. The resulting summaries are then
incorporated into the T5 decoder. Drawing inspiration from the
fusion-in-decoder concept [ 7], we concatenate the hidden states
of all tokens from all subsequences encoded in a session-based
manner and input this combined sequence into the T5 decoder. We
then train the model to produce a summary of the user’s interests,
employing the following loss:
Lsum=−|𝑦𝑢𝑖
𝑗|∑︁
𝑗=1𝑙𝑜𝑔(𝑝(𝑦𝑢𝑖
𝑗|𝐸,𝑦𝑢𝑖
<𝑗)), (1)
where𝑦𝑢𝑖
𝑗represents the summary generated for user 𝑢𝑖, and|𝑦𝑢𝑖
𝑗|
is the length of the user-interest summary.
User Poly-Embedding .Given that each session is encoded inde-
pendently, a global representation for all engaged contents is also
necessary. We hence acquire a global representation that is derived
from the last token of the decoder output (i.e., [EOS] token), repre-
senting all user engagements collectively. We then concatenate it
with the𝑘representation vectors from session encoding to form a
matrix𝑍∈R(𝑘+1)×𝑑. With these representations of user interaction
history, we employ a poly-attention layer [ 5] to extract the user’s
nuanced interests into multiple representations. The computation
of each user-interest vector 𝛼𝑎is as follows:
𝛼𝑎=softmaxh
𝑐𝑎tanh(𝑍𝑊𝑓)⊤i
𝑍, (2)
where𝑐𝑎∈R1×𝑝and𝑊𝑓∈R𝑑×𝑝are the trainable parameters. We
then concatenate 𝑚user-interest vectors into a matrix 𝐴∈R𝑚×𝑑,
which serves as the user representation, referring to as the User
Poly-Embedding (UPE) in our framework.
2.4 Candidate Content Modeling
Unlike the conventional practice of using only the first token of a
sequence for representation, we introduce the novel Content Poly-
Embedding (CPE). Similar to UPE, this method employs a set of
context codes, denoted as {𝑏1,𝑏2,...,𝑏 𝑛}, to create multiple embed-
dings for a piece of candidate content. For each piece of candidate
content, we generate a corresponding vector 𝛽𝑎through a poly-
attention layer defined by the equation referenced as Equation 2,
which includes a trainable parameter 𝑊𝑜. The resulting 𝑛vectors
for candidate content are then aggregated into a matrix 𝐵∈R𝑛×𝑑,
providing a more nuanced representation that is expected to im-
prove the performance of relevance scoring in the user-candidate
matching predictor.2.5 CTR Prediction and Training
CTR Prediction .To compute the relevance score 𝑠𝑖
𝑗, we first
establish the matching scores between the user representation em-
bedding𝐴𝑖and the candidate content representation embedding 𝐵𝑗.
This computation is performed using the inner product, followed
by flattening the resultant matrix:
𝐾𝑖
𝑗=flatten(𝐴⊤
𝑖𝐵𝑗). (3)
where𝐾𝑖
𝑗∈𝑅𝑚𝑛is the flattened attention matrix. Subsequently, an
attention mechanism is employed to aggregate the matching scores
represented by this flattened vector:
𝑊𝑝=softmax(flatten(𝐴·gelu(𝐵𝑊𝑠)⊤)),
𝑠𝑖
𝑗=𝑊𝑝·𝐾𝑖
𝑗,(4)
where𝑊𝑠∈𝑅𝑑×𝑑signifies a trainable parameter matrix, 𝑊𝑝∈
𝑅𝑚𝑛represents the attention weights obtained after flattening and
the softmax function, and 𝑠𝑖
𝑗is the scaled relevance score.
Training .We follow the method of training the end-to-end
recommendation models using the NCE loss [13, 28]:
LNCE=−log
exp(𝑠𝑖+)
exp(𝑠𝑖+)+Í
𝑗exp(𝑠𝑖
−,𝑗)!
, (5)
where𝑠𝑖+represents the score of the positive sample with which
the user engaged, and 𝑠𝑖−represents the scores of negative samples.
Therefore, the overall loss function is defined as:
L=LNCE+𝜆Lsum, (6)
where𝜆is a scaling factor determined to be 0.05 based on the
performance on the validation set.
3 Experiments
Baselines .We evaluate our EmbSum against a range of com-
monly used and SoTA neural network-based content recommenda-
tion approaches. These include methods that train text encoders
from scratch, such as (1) NAML [ 26], (2) NRMS [ 27], (3) Fast-
former [ 29], (4) CAUM [ 20], and (5) MINS [ 24]. We also consider sys-
tems that utilize PLMs, including (6) NAML-PLM, (7) UNBERT [ 33],
(8) MINER [ 11], and (9) UniTRec [ 17]. More Details on these base-
lines and our implementation are provided in Appendix A.
Dataset .We employ two publicly available benchmark datasets
for content-based recommendation. The first dataset, MIND [ 31],
consists of user engagement logs from Microsoft News, incorpo-
rating both positive and negative labels determined by user clicks
on news articles. We use the smaller version of this dataset, which
includes 94K users and 65K news articles. The second dataset is
obtained from Goodreads [ 23], which focuses on book recommen-
dations derived from user ratings. In this dataset, ratings above 3
are considered positive labels, while ratings below 3 are regarded
as negative labels. The Goodreads dataset comprises 50K users and
330K books. More details on dataset statistics are at Appendix B.
Evaluation .We utilize a variety of metrics to assess the perfor-
mance of content-based recommendation systems. These metrics
include the classification-based metric AUC [ 3], and ranking-based
metrics such as MRR [ 22] and nDCG@top 𝑁(with top𝑁= 5, 10) [ 8].
Page 4:
RecSys ’24, October 14–18, 2024, Bari, Italy Zhang et al.
Figure 2: Illustration of using an LLM for user interest profil-
ing. The input provided to the LLM is enclosed in a red box,
and the output generated by the LLM is shown in a green
box. The segment marked in orange within the input speci-
fies the instruction for the task, whereas the portion in blue
highlights the history of news browsed by the user.
Metric calculations are performed using the Python library Torch-
Metrics [ 2]. We determine the best model on the Dev set using AUC
and report the corresponding Test performance.
Content Formatting .As an example, for each content in MIND
dataset, we combine its fields into a single text sequence using the
following template: “News Title: ⟨𝑡𝑖𝑡𝑙𝑒⟩; News Abstract:⟨𝑎𝑏𝑠𝑡𝑟𝑎𝑐𝑡⟩;
News Category:⟨𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑦⟩”.
LLM Based User-Interest Summary .We leverage the open-
source Mixtral-8x22B-Instruct [9] to create summaries that re-
flect users’ interests based on their engagement history. Figure 2
illustrates an example of the input provided and the corresponding
summary generated for the MIND dataset. The process starts with
an instruction to frame the task, followed by a list of news items the
user has viewed, ordered from the most recent to the oldest. Each
item includes its title, abstract, and category. The input is limited
to 60 engagement items, with extended news abstracts or book de-
scriptions being condensed to 100 words. The instruction concludes
with a request for the model to condense the user’s interests into
three sentences. The latter part of Figure 2 shows a sample output
from the LLM. Additionally, we evaluate the LLM-generated user-
interest summaries using GPT-4 (i.e., gpt-4o API) as a judge. The
results indicate that most of the generated summaries accurately
capture the user’s interests. Further details of this experiment are
provided in Section C of Appendix. As shown in Table 4 in Appen-
dix, the average length of summaries generated by this process is
76 tokens for the MIND dataset and 115 tokens for the Goodreads
dataset.
Implementation .We utilize the pretrained T5-small model [ 21],
which comprises 61M parameters. After hyperparameter tuning,
we determined that the optimal codebook sizes for the UPE andMIND
AUC MRR nDCG@5 nDCG@10
NAML 66.10 34.65 32.80 39.14
NRMS 63.28 33.10 31.50 37.68
Fastformer 66.32 34.75 33.03 39.30
CAUM 62.56 34.40 32.88 38.90
MINS 61.43 35.99 34.13 40.54
NAML-PLM 67.01 35.67 34.10 40.32
UNBERT 71.73 38.06 36.67 42.92
MINER 70.20 38.10 36.35 42.63
UniTRec 69.38 37.62 36.01 42.20
EmbSum (ours) 71.95 38.58 36.75 42.97
Goodreads
NAML 59.35 72.16 53.49 67.81
NRMS 60.51 72.15 53.69 68.03
Fastformer 59.39 71.11 52.38 67.05
CAUM 55.13 73.06 54.97 69.02
MINS 53.02 71.81 53.72 68.00
NAML-PLM 59.57 72.54 53.98 68.41
UNBERT 61.40 73.34 54.67 68.71
MINER 60.72 72.72 54.17 68.42
UniTRec 60.00 72.60 53.73 67.96
EmbSum (ours) 61.64 73.75 54.86 69.08
Table 1: Results on MIND-small and Goodreads. The best
results are highlighted in bold. The second-best results are
highlighted in underscore .
CPE layers are 32 and 4, respectively, for both datasets. The model
is trained with a learning rate of 5𝑒−4and a batch size of 128
over 10 epochs. In all experiments, we cosnsider the latest 60 in-
teractions as the user’s engagement history. For the MIND dataset,
we apply a negative sampling ratio of 4, limit news titles to 32
tokens, and restrict news abstracts to 72 tokens. Including approx-
imately 20 additional tokens for the news category and template,
each user’s engaged history in MIND can total up to 7,440 tokens.
In the Goodreads dataset, we set the negative sampling ratio to 2,
limit book titles to 24 tokens, and constrain book descriptions to 85
tokens. Consequently, a user in the Goodreads dataset can have up
to 7,740 tokens for their engagement history.
4 Results
Main Results .Table 1 shows the test results of nine baselines
and our EmbSum. We can find that models initialized with PLMs ob-
tained much better performance than baselines trained from scratch.
We observe that EmbSum outperforms previous SoTA AUC scores
given by UNBERT [ 33]. Compared to UNBERT, EmbSum achieves
an improvement of 0.22 and 0.24 AUC on the MIND and Goodreads
datasets, respectively. Notably, EmbSum uses only T5-small as the
backbone, which has 61M parameters, significantly fewer than
the 125M parameters of BERT-based methods (e.g., UNBERT and
MINER). On other ranking-based metrics, EmbSum achieves the
Page 5:
EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations RecSys ’24, October 14–18, 2024, Bari, Italy
MIND
AUC MRR nDCG@5 nDCG@10
Ours 71.95 38.58 36.75 42.97
wo CPE 68.17 36.49 33.72 40.20
wo grouping 71.34 38.29 36.41 42.55
wo UPE 71.41 38.64 36.70 42.90
woL𝑠𝑢𝑚 71.43 38.42 36.39 42.60
Goodreads
Ours 61.64 73.75 54.86 69.08
wo CPE 60.97 72.94 54.39 68.53
wo grouping 61.39 73.53 54.79 68.86
wo UPE 61.35 73.55 54.67 68.81
woL𝑠𝑢𝑚 61.50 73.55 54.74 68.91
Table 2: Result of ablation study for EmbSum. The best results
are highlighted in bold.
best MRR and nDCG@10 scores on both datasets. EmbSum outper-
forms UniTRec which also uses an encoder-decoder architecture
but cannot produce standalone user and candidate embeddings.
Ablation Studies .To better understand the effectiveness of our
framework, we conduct ablation studies on both datasets, the re-
sults of which are presented in Table 2. We first remove the CPE
for the candidate item and use only the encoder hidden state of
the[SOS] token to represent the candidate content. This alter-
ation consistently results in the largest performance drop on both
datasets, decreasing the AUC scores by 3.78 and 0.67 on the MIND
and Goodreads datasets, respectively. These results suggest the ef-
fectiveness of utilizing multiple embeddings to represent candidate
content and enhance user-item interactions. Removing session-
based grouping and encoding each content separately leads to AUC
drops of 0.61 and 0.25 on the MIND and Goodreads datasets, re-
spectively. When we reduce the codebook size of the UPE layer to
1, meaning each user is represented by a single vector, it results
in a performance decrease of 0.54 and 0.29 AUC on the MIND and
Goodreads datasets, respectively. This also justifies the efficacy of
using multiple embeddings in EmbSum. Furthermore, we investi-
gate the efficacy of using LLM-generated user-interest summaries
by removingL𝑠𝑢𝑚. WithoutL𝑠𝑢𝑚, we only provide the [SOS] to-
ken as the decoder input and take the hidden state of the [SOS]
token as the global representation. As Table 2 shows, this change
results in a performance drop on both datasets (0.52 on MIND and
0.14 on Goodreads).
Influence of Hyperparameters. We investigate the model sen-
sitivity to the weight for summarization loss ( 𝜆) and the size of CPE
and UPE. We randomly sampled 20% Train data of MIND dataset
for training and validated performance on the Dev set. As Figure 3a
shows, the (𝜆) values of 0.05, 0.1, and 0.3 yielded Dev performances
of 70.76, 70.55, and 70.47, respectively. The CPE size values of 2,
4, and 8 resulted in Dev performances of 70.43, 70.76, and 70.56,
respectively. The UPE size values of 16, 32, and 48 achieved Dev
performances of 70.46, 70.76, and 70.58, respectively. These results
Figure 3: Influence of different hyperparameters.
ROUGE 1 ROUGE 2 ROUGE L
MIND 51.42 30.33 39.12
Goodreads 46.99 20.86 28.16
Table 3: Evaluation on EmbSum-generated user summary.
indicate that our EmbSum model is not highly sensitive to hyper-
parameters. For CPE size, our findings suggest that increasing the
number of context codes on the candidate item side does not im-
prove model performance. We believe that excessively numerous
codebooks for a single item might introduce superfluous parame-
ters, thereby negatively impacting performance.
EmbSum-Generated Summary .We also evaluate the quality
of the summaries generated by our model and report the ROUGE
scores in Table 3. Since CTR is our main task, we generate sum-
maries using the checkpoint that achieves the best AUC score on
the Dev set. We select the test users who are not included in the
training set and then generate their interest summaries based on
their engagement history. We use the summaries generated by
Mixtral-8x22B-Instruct as references for calculating ROUGE
scores. For the MIND and Goodreads datasets, our model achieves
ROUGE-L scores of 39.12 and 28.16, respectively. More concrete
examples of the EmbSum-generated summaries can be found at
Appendix D, demonstrating that EmbSum is capable of accurately
capturing the user’s diverse interests.
5 Conclusion
We present a novel framework EmbSum for content-based rec-
ommendation. EmbSum utilize encoder-decoder architecture and
poly-attention modules to learn independent user and candidate
content embeddings as well as generate user-interest summaries
based on long user engagement histories. Through our experiments
on two benchmark datasets, we have demonstrated that our frame-
work achieves SoTA performance while using fewer parameters,
and being able to generate a user-interest summary which can be
used for recommendation explainability/transparency.
References
[1]Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican,
David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrit-
twieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki
Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom
Page 6:
RecSys ’24, October 14–18, 2024, Bari, Italy Zhang et al.
Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan
Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem
Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr,
Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen,
Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha
Khalman, Jakub Sygnowski, and et al. 2023. Gemini: A Family of Highly Capable
Multimodal Models. ArXiv preprint abs/2312.11805 (2023).
[2]Nicki Skafte Detlefsen, Jirí Borovec, Justus Schock, Ananya Harsh Jha, Teddy
Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and
William Falcon. 2022. TorchMetrics - Measuring Reproducibility in PyTorch. J.
Open Source Softw. 7, 69 (2022), 4101. https://doi.org/10.21105/JOSS.04101
[3]Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 8
(2006), 861–874. https://doi.org/10.1016/J.PATREC.2005.10.010
[4]Youyang Gu, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Learning
to refine text based recommendations. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing . Association for Computational
Linguistics, Austin, Texas, 2103–2108. https://doi.org/10.18653/v1/D16-1227
[5]Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019.
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast
and Accurate Multi-sentence Scoring.
[6]Andreea Iana, Goran Glavaš, and Heiko Paulheim. 2023. NewsRecLib: A PyTorch-
Lightning Library for Neural News Recommendation. In Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing: System Demon-
strations , Yansong Feng and Els Lefever (Eds.). Association for Computational
Linguistics, Singapore, 296–310.
[7]Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with
Generative Models for Open Domain Question Answering. In Proceedings of the
16th Conference of the European Chapter of the Association for Computational
Linguistics: Main Volume . Association for Computational Linguistics, Online,
874–880. https://doi.org/10.18653/v1/2021.eacl-main.74
[8] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation
of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446. https://doi.org/10.
1145/582415.582418
[9]Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche
Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou
Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample,
Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep
Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet,
Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024.
Mixtral of Experts. ArXiv preprint abs/2401.04088 (2024).
[10] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART:
Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics . Association for Computational
Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
[11] Jian Li, Jieming Zhu, Qiwei Bi, Guohao Cai, Lifeng Shang, Zhenhua Dong, Xin
Jiang, and Qun Liu. 2022. MINER: Multi-Interest Matching Network for News
Recommendation. In Findings of the Association for Computational Linguistics:
ACL 2022 . Association for Computational Linguistics, Dublin, Ireland, 343–352.
https://doi.org/10.18653/v1/2022.findings-acl.29
[12] Qijiong Liu. 2023. Legommenders: A Modular Framework for Recommender
Systems.
[13] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023. ONCE: Boosting
Content-based Recommendation with Both Open- and Closed-source Large
Language Models.
[14] Rui Liu, Bin Yin, Ziyi Cao, Qianchen Xia, Yong Chen, and Dell Zhang. 2023. Per-
CoNet: News Recommendation with Explicit Persona and Contrastive Learning.
[15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A
Robustly Optimized BERT Pretraining Approach. ArXiv preprint abs/1907.11692
(2019).
[16] Itzik Malkiel, Oren Barkan, Avi Caciularu, Noam Razin, Ori Katz, and Noam
Koenigstein. 2020. RecoBERT: A Catalog Language Model for Text-Based Rec-
ommendations. In Findings of the Association for Computational Linguistics:
EMNLP 2020 . Association for Computational Linguistics, Online, 1704–1714.
https://doi.org/10.18653/v1/2020.findings-emnlp.154
[17] Zhiming Mao, Huimin Wang, Yiming Du, and Kam-Fai Wong. 2023. UniTRec:
A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework
for Text-based Recommendation. In Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short Papers) , Anna Rogers,
Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational
Linguistics, Toronto, Canada, 1160–1170. https://doi.org/10.18653/v1/2023.acl-
short.100
[18] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017.
Embedding-based News Recommendation for Millions of Users. In Proceed-
ings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017 . ACM, 1933–1942.https://doi.org/10.1145/3097983.3098108
[19] OpenAI. 2023. GPT-4 Technical Report. ArXiv preprint abs/2303.08774 (2023).
[20] Tao Qi, Fangzhao Wu, Chuhan Wu, and Yongfeng Huang. 2022. News Recommen-
dation with Candidate-aware User Modeling. In SIGIR ’22: The 45th International
ACM SIGIR Conference on Research and Development in Information Retrieval,
Madrid, Spain, July 11 - 15, 2022 , Enrique Amigó, Pablo Castells, Julio Gonzalo,
Ben Carterette, J. Shane Culpepper, and Gabriella Kazai (Eds.). ACM, 1917–1921.
https://doi.org/10.1145/3477495.3531778
[21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the
Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach.
Learn. Res. 21 (2020), 140:1–140:67.
[22] Ellen M. Voorhees. 1999. The TREC-8 Question Answering Track Report. In
Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg,
Maryland, USA, November 17-19, 1999 (NIST Special Publication, Vol. 500-246) ,
Ellen M. Voorhees and Donna K. Harman (Eds.). National Institute of Standards
and Technology (NIST).
[23] Mengting Wan and Julian J. McAuley. 2018. Item recommendation on mono-
tonic behavior chains. In Proceedings of the 12th ACM Conference on Recom-
mender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018 , Sole Pera,
Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan (Eds.). ACM, 86–94.
https://doi.org/10.1145/3240323.3240369
[24] Rongyao Wang, Shoujin Wang, Wenpeng Lu, and Xueping Peng. 2022. News Rec-
ommendation Via Multi-Interest News Sequence Modelling. In IEEE International
Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Sin-
gapore, 23-27 May 2022 . IEEE, 7942–7946. https://doi.org/10.1109/ICASSP43922.
2022.9747149
[25] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel
Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language
Models with Self-Generated Instructions. In Proceedings of ACL 2023 (Volume 1:
Long Papers) . ACL, 13484–13508. https://doi.org/10.18653/V1/2023.ACL-LONG.
754
[26] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang,
and Xing Xie. 2019. Neural News Recommendation with Attentive Multi-View
Learning. In Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019 , Sarit Kraus
(Ed.). ijcai.org, 3863–3869. https://doi.org/10.24963/ijcai.2019/536
[27] Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. 2019.
Neural News Recommendation with Multi-Head Self-Attention. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP) . Association for Computational Linguistics, Hong Kong, China, 6389–
6394. https://doi.org/10.18653/v1/D19-1671
[28] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2021. Empowering
News Recommendation with Pre-trained Language Models.
[29] Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, and Xing Xie. 2021. Fast-
former: Additive Attention Can Be All You Need.
[30] Chuhan Wu, Fangzhao Wu, Yang Yu, Tao Qi, Yongfeng Huang, and Qi Liu. 2021.
NewsBERT: Distilling Pre-trained Language Model for Intelligent News Applica-
tion. In Findings of the Association for Computational Linguistics: EMNLP 2021 .
Association for Computational Linguistics, Punta Cana, Dominican Republic,
3285–3295. https://doi.org/10.18653/v1/2021.findings-emnlp.280
[31] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian,
Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, and Ming Zhou. 2020. MIND:
A Large-scale Dataset for News Recommendation. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics . Association for
Computational Linguistics, Online, 3597–3606. https://doi.org/10.18653/v1/2020.
acl-main.331
[32] Liancheng Xu, Xiaoxiang Wang, Lei Guo, Jinyu Zhang, Xiaoqi Wu, and Xinhua
Wang. 2023. Candidate-Aware Dynamic Representation for News Recommenda-
tion. In Artificial Neural Networks and Machine Learning - ICANN 2023 - 32nd Inter-
national Conference on Artificial Neural Networks, Heraklion, Crete, Greece, Septem-
ber 26-29, 2023, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 14260) ,
Lazaros Iliadis, Antonios Papaleonidas, Plamen P. Angelov, and Chrisina Jayne
(Eds.). Springer, 272–284. https://doi.org/10.1007/978-3-031-44195-0_23
[33] Qi Zhang, Jingjie Li, Qinglin Jia, Chuyuan Wang, Jieming Zhu, Zhaowei Wang,
and Xiuqiang He. 2021. UNBERT: User-News Matching BERT for News Rec-
ommendation. In Proceedings of the Thirtieth International Joint Conference on
Artificial Intelligence . International Joint Conferences on Artificial Intelligence Or-
ganization, Montreal, Canada, 3356–3362. https://doi.org/10.24963/ijcai.2021/462
Page 7:
EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations RecSys ’24, October 14–18, 2024, Bari, Italy
Appendices
A Baselines
Our model is evaluated in comparison with the prior state-of-the-art
neural network-based methods for content-based recommendation:
(1)NAML [ 26] employs a content representation approach that
integrates CNNs, additive attention, and pretrained word
embeddings. It further uses additive attention to create a
user representation based on their engagement history.
(2)NRMS [ 27] combines pretrained word embeddings with
multi-head self-attention and additive attention mechanisms
to develop representations for user preferences and candi-
date content.
(3)Fastformer [ 29] presents an efficient Transformer model that
consists of additive attention.
(4)CAUM [ 20] extends NRMS by incorporating content title
entities into embeddings and utilizing candidate-aware self-
attention for generating user embeddings.
(5)MINS [ 24] improves upon NRMS with a multi-channel GRU-
based network designed to better capture the sequential
dynamics in user engagement history.
(6)NAML-PLM utilizes a PLM as the content encoder, lever-
aging a pretrained model rather than training from scratch.
RoBERTa-base [ 15] (with 125M parameters) is used as the
text encoder.
(7)UNBERT [ 33] employs a PLM for content encoding, deriving
user-content matching indicators at both item and word
levels, with RoBERTa-base [ 15] (with 125M parameters) as
the backbone model.
(8)MINER [ 11] uses a PLM for text encoding and introduces a
poly-attention mechanism to extract diverse user interest
vectors for enhanced user representation, standing out as a
leading model on the MIND dataset leaderboard.1RoBERTa-
base [ 15] (with 125M parameters) is the text encoder for
MINER.
(9)UniTRec [ 17] applies an encoder-decoder structure (i.e., BART)
to separately encode user history and candidate content, as-
sessing candidate content using perplexity scores from the
decoder and a discriminative scoring head.2BART-base [ 10]
(with 139M parameters) is the backbone in our implementa-
tion.
We utilize the optimal hyperparameters recommended for these
baselines and perform training and evaluations on our dataset splits.
For NAML, NRMS, Fastformer, and NAML-PLM, we employ the
implementations provided by Liu [12]. The CAUM and MINS imple-
mentations are obtained from Iana et al . [6]. For UNBERT, MINER,
and UniTRec, we use the original scripts released by their respective
authors.3
1https://msnews.github.io/
2Metrics are calculated using predictions from the discriminative scoring head.
3UNBERT: https://github.com/reczoo/RecZoo/tree/main/pretraining/news/UNBERT,
MINER: https://github.com/duynguyen-0203/miner, UniTRec: https://github.com/
Veason-silverbullet/UniTRec.B Dataset Statistics
In this work, we evaluate our model EmbSum on two public bench-
mark datasets for content-based recommendation. The dataset sta-
tistics is presented in Table 4.
C Quality of LLM Generated User-Interest
Summary
Due to the lack of gold labels for evaluating generated user-interest
summarization and the difficulty of human evaluation, we follow
recent works that use LLM as a judge to evaluate AI response. We
adapt the rubric from [ 25],4which scores the AI response in four
levels, A (accurate and comprehensive), B (acceptable but with few
minor errors), C (related but has significant errors), and D (invalid
and unrelated response). We prompt the SoTA LLM GPT-4 (i.e.,
gpt-4o API) to evaluate the generated user-interest summaries. To
be budget-friendly, we randomly sample 500 users for this evalu-
ation. We provide the original user engagement history and our
generated user-interest summary. The GPT-4 judge classifies 187
summaries to A, 305 to B, 8 to C, and 0 to D. We also conducted this
evaluation on the generated user-interest summaries of Goodreads
users. The GPT-4 judge classifies 156, 331, 13, and 0 summaries to
A, B, C and D, respectively. This result indicates that most of our
generated summaries are able to capture the user’s interests.
D EmbSum-Generated Summary Examples
We present the examples of EmbSum-generated summaries in Ta-
ble 5.
4https://github.com/yizhongw/self-instruct
Page 8:
RecSys ’24, October 14–18, 2024, Bari, Italy Zhang et al.
Dataset MIND GoodreadsDataset MIND Goodreads
Split Train Dev Test Train Dev Test
# content 51,283 21,352 41,496 309,047 234,232 247,242 # of history/user 22 47
# users 50,000 6,679 46,549 21,450 16,339 17,967 # category 18 11
# new users - 5,862 41,020 - 2,930 3,199 # tokens/title 17 18
# positive 236,344 10,775 100,608 198,403 75,445 93,156 # tokens/abstract 50 128
# negative 5,607,100 249,607 2,380,008 458,435 141,977 154,016 # tokens/user summary 76 115
Table 4: Dataset Statistics. “# new users” indicates the number of users not included in the Train set. “# tokens/user summary”
represents the average length of user interest summaries generated by LLM. The number of tokens are calculated using the
RoBERTa-base model’s vocabulary.
Historical Clicked News
(1) 45 Amazing Facts About Airplanes That Will Make Your Mind Soar travel
(2) 29 Foods Diabetics Should Avoid health
(3)This $12 million ’mansion yacht’ is made entirely of stainless steel and it’s a first for the industry. Take a peek inside. travel
(4) The #1 Worst Menu Option at 76 Popular Restaurants health
(5) Celebs celebrate Halloween 2019 entertainment
(6) Woman who made it on Delta flight without a ticket or boarding pass says ’it’s not my fault’ travel
(7) This Dog’s Terrifyingly Cute ’Killer’ Costume Just Won Halloween lifestyle
(8) Body of missing Alabama girl found; 2 being charged news
(9) Meghan Markle Always Stands the Exact Same Way at Events and There’s a Specific Reason Why lifestyle
(10) Prince William and Kate Middleton arrive in Pakistan for royal tour lifestyle
EmbSum-Generated Summary
Based on the user’s browsed news history, their interests seem to be focused on entertainment, lifestyle, and current events. They appear to enjoy reading
about celebrities, royal families, and unusual or quirky stories. Additionally, they seem to have an interest in health and wellness, specifically for weight loss
and fitness.
Table 5: Example of EmbSum-generated summary.