Paper Content:
Page 1:
HyperSeq: A Hyper-Adaptive Representation for Predictive
Sequencing of States
Roham Koohestani Maliheh Izadi
{rkoohestani, m.izadi}@tudelft.nl
Delft University of Technology
Delft, The Netherlands
Abstract
In the rapidly evolving world of software development, the surge in
developers’ reliance on AI-driven tools has transformed Integrated
Development Environments into powerhouses of advanced fea-
tures. This transformation, while boosting developers’ productivity
to unprecedented levels, comes with a catch: increased hardware
demands for software development. Moreover, the significant eco-
nomic and environmental toll of using these sophisticated models
necessitates mechanisms that reduce unnecessary computational
burdens. We propose HyperSeq — Hyper-Adaptive Representation
for Predictive Sequencing of States — a novel, resource-efficient
approach designed to model developers’ cognitive states. HyperSeq
facilitates precise action sequencing and enables real-time learn-
ing of user behavior. Our preliminary results show how HyperSeq
excels in forecasting action sequences and achieves remarkable
prediction accuracies that go beyond 70%. Notably, the model’s
online-learning capability allows it to substantially enhance its pre-
dictive accuracy in a majority of cases and increases its capability
in forecasting next user actions with sufficient iterations for adap-
tation. Ultimately, our objective is to harness these predictions to
refine and elevate the user experience dynamically within the IDE.
CCS Concepts
•Human-centered computing ;•Software and its engineering ;
Keywords
User Behavior Modeling, IDE Design, Formal Methods
ACM Reference Format:
Roham Koohestani Maliheh Izadi, {rkoohestani, m.izadi}@tudelft.nl, Delft
University of Technology, Delft, The Netherlands . 2025. HyperSeq: A Hyper-
Adaptive Representation for Predictive Sequencing of States. In Proceedings
of ACM International Conference on the Foundations of Software Engineering
(FSE ’25). ACM, New York, NY, USA, 7 pages. https://doi.org/XXXXXXX.
XXXXXXX
1 Introduction
Since the advent of artificial intelligence, the software development
life-cycle has become increasingly entangled with AI tooling. The
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
FSE ’25, June 23–27, 2025, Trondheim, Norway
©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/XXXXXXX.XXXXXXXintroduction of tools like Copilot [ 4] and JetBrains AI [ 5] is en-
hancing user productivity, while editors like Cursor [ 3] simplify
programming, thereby transforming the way developers engage
with software repositories.
Together with these benefits, it should be noted that the under-
lying models have added complexity to the systems we program,
consequently raising the computational demands of programming.
This increased computation demand has led to greater environmen-
tal repercussions; a recent technical blog by the team responsible for
the Llama 3.1 model series reveals that training the leading 405 bil-
lion parameter model consumed 30.84 million GPU hours [ 1]. This
translates into 7.92 kilotons of CO2 emissions, based on the setup
detailed in the article [ 11] and using the United States emission
factor of 367 grams of 𝐶𝑂2per kWh [ 2]. Furthermore, executing
inferences with this model demands a multi-node arrangement,
with each node equipped with 8 H100 GPUs operating at a 700W
TDP, highlighting that a substantial impact of these models occurs
during the post-deployment phase.
Previous studies have attempted to leverage user telemetry data
to model user behavior [ 15] and to filter out completion requests
in the IDE by taking into account the user’s present state [ 8,16].
Furthermore, earlier research has differentiated between two cate-
gories of user states: exploration, where a user seeks suggestions,
and acceleration, where the user aims to continue programming
without interruption. However, these existing methods are still
rather costly to train and require a large amount of training data.
Moreover, customizing these models for individual users presents
challenges, as it is quite improbable that a universal model could
accurately capture the behavior of all users to the same satisfactory
degree. To tailor a model to an individual’s behavior, online learning
is required to consistently enhance the existing model data from
that user, necessitating efficient implementation.
Hyperdimensional computing (HDC)[ 13] provides a framework
for holistic representation, manipulation, and querying of abstract
concepts. This approach encodes ideas within high-dimensional
vector spaces, taking advantage of the quasi-orthogonality inherent
in these vectors.
This paper introduces HyperSeq, a Hyper-Adaptive Represen-
tation for Predictive Sequencing of States. We assess our model’s
effectiveness under two distinct scenarios, considering both adap-
tive and non-adaptive conditions using an established dataset of
real-world user interactions and states. Our results indicate that
the model effectively predicts the states of the users, with an aver-
age accuracy of more than 70%. Moreover, leveraging adaptation
mode allows the model to substantially improve its accuracy by
employing online learning on user states.arXiv:2503.10254v1 [cs.SE] 13 Mar 2025
Page 2:
FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al.
In this paper, we (1) establish the theoretical basis for a hy-
perdimensional sequence model, (2) formally define predictive se-
quencing and demonstrate that the model allows for training in
𝑂(𝐷·|𝑇𝑟𝑎𝑖𝑛|), and online adaptation in 𝑂(𝐷)time, (3) conduct
an initial assessment of the model’s performance, and (4) outline
potential future applications for HyperSeq.
2 Motivation
There is a tremendous amount of cost associated with running these
large models, especially when dealing with large-scale inference. An
approach to addressing this issue involves minimizing the number
of incorrect invocations, meaning decreasing the volume of calls
that are likely to result in the rejection of generated responses. By
analyzing user states, we can determine the likelihood of a user
accepting or rejecting completions, thus alleviating server strain.
Moreover, recent research indicates [ 19] that developers prefer
to incorporate proactivity within their systems. By forecasting
upcoming states and user actions, we can strive to embed state-
aware mechanisms into IDEs.
For example, detecting when a user is about to enter refactoring
mode could allow the IDE to proactively initiate a search for po-
tential refactorings and suggest these options as soon as the mode
change is detected.
3 Background Literature
3.1 Hyper-dimensional Computing
Hyper-Dimensional Computing (HDC) is a computational paradigm
in which high-dimensional vectors, typically ranging from 104to
105, are utilized to represent, process and query abstract concepts.
Although HDC was coined by Pentti Kanerva [ 13], the concept dates
to the 1990s, with holographic reduced representation (HRR) [ 18]
and Vector-Symbolic Architectures [10].
In recent years, there has been growing interest in employing
HDC for intelligent tasks because of the efficiency of models based
on HDC. Such HDC-based methods have been applied to challeng-
ing problems like Raven’s Progressive Matrices [ 12], as well as in
encoding visual representations efficiently and retrieving this data
using what is referred to as resonator networks [ 14]. Recent studies
have explored the integration of HDC-inspired modifications into
current architectures. InfiniAttention [ 17] has demonstrated signif-
icant advancements in processing exceptionally lengthy contexts
by encoding them into a sparse hyperdimensional memory.
3.2 User-IDE interaction
Prior research has developed taxonomies of user interactions in
IDEs, focusing on AI features. One study [ 7] identified two modes:
exploration, where developers seek recommendations and solu-
tions, and acceleration, where they work confidently with minimal
distractions.
Various investigations have employed telemetry data from within
the IDE to determine programmers’ conditions. By analyzing fea-
tures like caret movements and typing speed, these studies inferred
the action states and developed a classification of 12 user states,
along with their transitional characteristics [ 15]. The concept ofleveraging telemetry data to enhance Human-AI interaction re-
mains prevalent, with numerous studies applying it to effectively
filter out requests that are highly likely to be declined [8, 16].
Researchers have examined how to design Human-AI eXperi-
ences (HAX) within the IDE by establishing a classification of meth-
ods through which developers seek assistance and identifying vari-
ous domains where distinct models can provide support, including
testing, maintenance, and optimization.
3.3 Next-action prediction
So far, limited research has focused on predicting users’ next ac-
tions within IDEs. Cursor, a newly developed AI-integrated IDE,
frequently references next-action prediction on its website, sug-
gesting significant investment in this area [ 3]. Moreover, existing
studies have explored employing gaze-tracking to dynamically ad-
just the IDE according to the user’s forthcoming actions [20].
4 Problem Definition
In this section, we formally define the problem statement to im-
prove clarity for subsequent sections. As noted in section 8, our
present work concentrates on predicting future user states given
the limited availability of data. Nevertheless, we plan to broaden the
investigation in future research to deduce subsequent user states
directly from user telemetry.
4.1 Predictive Sequencing
Let𝑆be the set containing all potential states, and let 𝑠indicate
an individual state within 𝑆. Similarly, let 𝑈signify the set of all
users, with 𝑢representing a single user in 𝑈. A session state se-
quence, denoted 𝑆𝑒𝑠𝑠, is defined as an 𝑚-tuple of states 𝑆𝑒𝑠𝑠 =
(𝑠0,𝑠1,...,𝑠𝑚),where each 𝑠𝑖∈𝑆.Furthermore, we denote the 𝑖-th
session of user 𝑢as𝑆𝑒𝑠𝑠𝑢,𝑖. The objective of next-state prediction
is to develop a model 𝑀capable of forecasting the following state
for a user𝑢based on an 𝑛-tuple of states. Formally, the model is
articulated as: 𝑀:𝑆𝑛↦→𝑆.Additionally, we define a user-specific
model as𝑀𝑢, tailored for the individual user 𝑢.
4.2 Hyper-Dimensional Computing (HDC)
In this section, we expand upon the formal structure introduced
earlier to lay the theoretical groundwork for our approach. We
adhere to the following notational conventions:
•The bundling operator is represented by ⊕,
•The binding operator is signified by ⊗, and
•The permutation operator is symbolized by 𝑃.
Further, we utilize the Bipolar Map framework, wherein hy-
pervectors are drawn from the domain 𝐷𝑂𝑀 ={−1,1}𝐷, with𝐷
denoting the dimensionality of the hyperspace. The elements of the
chosen hypervectors are sampled from the probability distribution
𝑃=2·(Ber(0.5)−0.5)[9].
A hypervector 𝑣is characterized as 𝑣∈𝐷𝑂𝑀 . A codebook
𝐶links the state space 𝑆to the hyperspace 𝐷𝑂𝑀 , i.e.,𝐶:𝑆↦→
𝐷𝑂𝑀 . Likewise,𝐶′provides the reverse mapping, defined as 𝐶′:
𝐷𝑂𝑀↦→𝑆. The function 𝐸𝑛𝑐𝑜𝑑𝑒 is designed to process an 𝑛-tuple of
hypervectors 𝑣𝑖while maintaining their sequential characteristics.
Formally,𝐸𝑛𝑐𝑜𝑑𝑒 :𝐷𝑂𝑀𝑚↦→𝐷𝑂𝑀 , and it is implemented as
Page 3:
HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway
following where 𝑉=(𝑣0,𝑣1,...,𝑣𝑛−1),
𝐸𝑛𝑐𝑜𝑑𝑒(𝑉)=𝑛−1Ì
𝑖=0(𝑃𝑛−𝑖−1(𝑣𝑖)).
Additionally, we define a function 𝑆𝑖𝑚 that takes two inputs
from the defined hyperspace and yields a real-number indicat-
ing the similarity between the two vectors. More formally, 𝑆𝑖𝑚 :
(𝐷𝑂𝑀,𝐷𝑂𝑀)↦→R. In our context, we measure similarity using co-
sine similarity, with values ranging from -1 to 1, where -1 represents
complete dissimilarity and 1 represents complete similarity.
5 HyperSeq
5.1 Representation
As stated earlier, the challenge of depicting action states involves
encoding every action subsequence of length 𝑛and then using the
model to predict the most probable action from a subsequence of
𝑛−1actions. Hence, a foundational model 𝑀Basecan be developed
utilizing the training dataset 𝑇𝑟𝑎𝑖𝑛
|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖))
where
𝑇𝑟𝑎𝑖𝑛 ={𝑆𝑒𝑞0,𝑆𝑒𝑞 1,···,𝑆𝑒𝑞|𝑇𝑟𝑎𝑖𝑛|−1}
𝑆𝑒𝑞𝑖={𝐶(𝑠0),𝐶(𝑠1),···,𝐶(𝑠𝑛)}
with𝑠𝑗being the ordered states within a session such that 𝑆𝑒𝑞𝑖is a
continuous subsequence of that session. In the following sections,
we explain the method used to derive the 𝑇𝑟𝑎𝑖𝑛 and𝑇𝑒𝑠𝑡 datasets
from our primary dataset 𝐷𝑎𝑡𝑎 .
The primary objective, naturally, of training a model of this kind
is to utilize it for querying future user states, essentially engaging
in predictive sequencing. This can be done by querying model
𝑀with the prefix of the anticipated user-state. In other words,
since the model is trained on n-grams of user states, the (n-1)-
gram preceding the anticipated action can be used to query for the
forthcoming action. The 𝐸𝑛𝑐𝑜𝑑𝑒 function allows for this (n-1)-gram
prefix to be represented within the n-gram’s context as a query
vector𝑞=𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞′
𝑘[0 :𝑛−1])), where𝑆𝑒𝑞′
𝑘is intended to
predict its n-th element.
By using our query vector 𝑞with our Model 𝑀through the
binding operator⊗, and due to the quasi orthogonality of the vec-
tors, we solely unbind the suffix vector with a similar prefix (re-
fer to ...). The resulting vector can be interpreted as a frequency
vector representing the states linked to that prefix, expressed as
𝑅=𝑀∗𝑞=(𝐶(𝑠0),𝐶(𝑠1),...,𝐶(𝑠𝑙)), where𝑙is determined by the
occurrences of specific patterns within the training dataset. Because
vector𝑅is a frequency vector, performing a similarity analysis of
𝑅against all valid keys yields a discrete distribution pointing to
the similarity with a specific state. A valid key 𝑘is characterized
by𝑘∈Domain(𝐶′), implying that the collection of all valid keys
corresponds to the items for which mapping is available in the
function𝐶′, namely Domain(𝐶′).
Thus, based on the frequency vector 𝑅, we can predict the n-th
element in the sequence using 𝑠𝑛=𝐶′(arg max𝑘(𝑆𝑖𝑚(𝑅,𝑘))). This
methodology enables us to effectively perform predictive sequenc-
ing connected to action states.5.2 Adaptiveness
Enhancing the foundation of the non-adaptive model, we propose
altering our model’s definition to enable greater flexibility. To ac-
complish this, we must revise the structure of our model 𝑀. This
new iteration, model 𝑀, consists of two segments: 𝑀base, akin to
what was previously mentioned. Additionally, our model now incor-
porates an adaptive section, 𝑀adap, which mirrors the base model’s
design. The complete model 𝑀is formed by combining these two
segments using the bundling ⊕operator. The inclusion of this adap-
tive component supports online learning and facilitates incremental
training, much like the base model. In the following section, we will
examine how this adaptation influences model performance and
evaluate the computational resources demanded by this additional
component. The implementation details of our methodology, along
with the evaluation procedure and results, are accessible in the
paper’s replication package[6].
6 Evaluation Methodology
6.1 Datasets
Due to the challenges in accessing actual user data, we decided to
utilize an already available dataset from a prior study, as outlined
in the background section. Mozannar et al. [ 15] developed a tax-
onomy comprising twelve user states and gathered data from 21
developers to compile a labeled dataset of user states. This dataset
can be employed to train and test the model in a constrained envi-
ronment, allowing future opportunities for more comprehensive
evaluation. Our preliminary analysis reveals that three labels pro-
vide limited information, leading us to exclude them from this study.
Consequently, we are left with a total of 9 labels.
6.2 Data Partitioning Strategies
Let𝐷𝑎𝑡𝑎 denote the complete dataset of user session state se-
quences, where 𝐷𝑎𝑡𝑎 ={𝑆𝑒𝑠𝑠𝑢,𝑖|𝑢∈𝑈′,𝑖∈{1,...,𝑛𝑢}},𝑈′⊂𝑈,
and𝑛𝑢is the number of sessions for user 𝑢. The task is to split 𝐷𝑎𝑡𝑎
into two disjoint subsets, 𝑇𝑟𝑎𝑖𝑛 and𝑇𝑒𝑠𝑡 , such that𝑇𝑟𝑎𝑖𝑛∪𝑇𝑒𝑠𝑡=
𝐷𝑎𝑡𝑎 and𝑇𝑟𝑎𝑖𝑛∩𝑇𝑒𝑠𝑡 =∅. We define three data partitioning
strategies as follows.
6.2.1 Disjoint Split. In a disjoint split, the dataset is divided by
users: one group for training and another for testing. We use data
from 18 developers for training and the rest for testing.
6.2.2 Overlapping Split. In the overlapping split, each user’s data
is divided between training and testing sets. For example, with 10
sessions, 8 are for training and 2 for testing.
6.2.3 K-Fold Cross-Validation. This strategy creates a fold for each
user by excluding them from training and using them only for
evaluation, repeating for every user in the dataset.
6.3 Hyper-Parameters
In addition to exploring various data-splitting strategies, we investi-
gate the impact of four other model (hyper-)parameters: dimension,
subsequence length, cyclic shift, and adaptiveness. The dimension 𝐷
was adjusted, selecting from fixed values {1000,5000,10000,20000}.
Similarly, the sequence length was altered using the set {3,5,7,9}.
Moreover, within the MAP framework, where the 𝑃operator acts
Page 4:
FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al.
Figure 1: Overall accuracies of the model across the different
splitting strategies and with adaptive on or off.
Figure 2: Best-performing model’s accuracy; Sliding window
size = 30, Dimension = 20,000, Sequence Length = 3, Cyclic
Shift = 4, Adaptive = True, and Split Strategy = Disjoint.
as a cyclic shift of the representation vector, we assess the effect
by altering the shift values among {2,4,6}. Lastly, we compare
two proposed models: one incorporating the adaptive model and
another without it.
6.4 metrics
6.4.1 Overall Accuracy. According to our setup, we assess the
model’s overall accuracy using the test set. When implementing
cross-validation with several test sets, we calculate the average ac-
curacy. For all other scenarios, we determine the model’s accuracy
considering every available user.
6.4.2 Sliding Window Accuracy. For adaptive models, we assess
the sliding window accuracy to understand how the model’s perfor-
mance evolves. Specifically, a window of size 𝑘is used to traverse
the prediction sequence, and the model’s accuracy is computed
within this window. This approach helps us determine the effec-
tiveness of employing online learning to tailor models to users.
7 Preliminary Results
7.1 Model Performance
Our findings indicate that the models, on average, can perform
within a range of 40 to 50 percent, with some configurations reach-
ing above 70% on average. Furthermore, as depicted in Figure 1,
the adaptive version of the model consistently surpasses the perfor-
mance of the base version without employing any online learning.
Our assessments indicate that the cyclic shift parameter has a
negligible impact on performance. Interestingly, our analysis of
the results demonstrate that longer subsequences do not enhance
predictive capabilities. In fact, an increase in sequence length corre-
sponds with a decline in accuracy. This may result from the vector’sinformation capacity being surpassed, suggesting further research
is needed to determine if this pattern persists in higher dimensions.
Moreover, when examining the top-performing model, we ob-
serve a noteworthy trend in two of the three test scenarios: there
is an upward trajectory in the model’s average accuracy. This pat-
tern suggests that online learning effectively enhances the model’s
performance. But more interestingly, there is not only an upward
trend but rather that adapting the model can achieve results that
make the predictions of the model near perfect.
7.2 Model Efficiency
7.2.1 Training Efficiency. Training the model involves a single
pass through the training data. N-grams can be constructed in
constant time, provided that previous sequences within a session
are available and that the cold start of the session is amortized (see
the appendix). Since the model operates on D-dimensional vectors,
the time complexity for training is 𝑂(𝐷·|𝑇𝑟𝑎𝑖𝑛|).
7.2.2 Update Efficiency. The update logic is implemented as a con-
stant amount of operations on D dimensional vector, making the
update𝑂(𝐷).
7.2.3 Storage Efficiency. The storage requirements for the model
depend on several components, which vary based on the specific
implementation of the approach. To optimize memory usage, the
domain of stored values per entry can be limited, allowing for stor-
age in fewer bits than the conventional 32 or 64 bits. For instance,
using 8 or 16 bits per entry can enhance storage efficiency. Let B
represent the number of bits per entry. The stored items include
the model memory (doubled in the case of the adaptive model) and
the codebook. Consequently, the space complexity of the model is
𝑂(𝐵·(|𝐶|+𝐷)).
8 Future Work
In future work, we intend to broaden our approach for more ex-
tensive, real-world applicability. We plan to enhance the proposed
framework to process incoming telemetry data from users similarly
to other current methodologies. Moreover, we seek to leverage
the predicted user state to adaptively optimize the user experience
within the IDE. Furthermore, we are interested in investigating the
impact of different Vector-Symbolic Architectures and assessing
how they influence model accuracy.
9 Conclusion
This paper introduced HyperSeq, a Hyper-Adaptive Representation
for Predictive Sequencing of States, as a novel, efficient approach
to predicting developer actions in IDEs. Using hyperdimensional
computing and online adaptation, HyperSeq achieved competitive
accuracy, surpassing 70% in certain configurations, while maintain-
ing low computational and storage requirements. Our evaluation
demonstrated that HyperSeq not only excels in forecasting user
actions but also dynamically improves its accuracy through online
learning, enabling personalized predictions for individual users.
This efficiency positions HyperSeq as a promising solution to re-
duce computational overhead in AI-powered IDEs and enhance
user experiences through context-aware proactive adaptations. Fu-
ture work will focus on integrating telemetry-based predictions,
Page 5:
HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway
exploring alternative vector-symbolic architectures, and deploy-
ing HyperSeq in real-world environments to optimize developer
productivity and promote sustainable computing practices.
References
[1]2024. Llama 3.1 - 405B, 70B and 8B with multilinguality and long context. https:
//huggingface.co/blog/llama31
[2]2024. U.S. Energy Information Administration (EIA). https://www.eia.gov/tools/
faqs/faq.php?id=74&t=11
[3] 2025. Cursor - the AI code editor. https://www.cursor.com/
[4]2025. GitHub Copilot ·Your AI pair programmer. https://github.com/features/
copilot
[5]2025. JetBrains AI service and In-IDE AI Assistant. https://www.jetbrains.com/ai/
[6]Anonymous Authors. 2025. Replication Package. https://github.com/hyperseq-
replication/package
[7]Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2022. Grounded
Copilot: How Programmers Interact with Code-Generating Models.
arXiv:2206.15000 [cs.HC] https://arxiv.org/abs/2206.15000
[8] Aral de Moor, Arie van Deursen, and Maliheh Izadi. 2024. A transformer-based
approach for smart invocation of automatic code completion. In Proceedings of
the 1st ACM International Conference on AI-Powered Software . 28–37.
[9]Ross W Gayler. 1998. Multiplicative binding, representation operators & analogy
(workshop poster). (1998).
[10] Ross W Gayler. 2004. Vector symbolic architectures answer Jackendoff’s chal-
lenges for cognitive neuroscience. arXiv preprint cs/0412059 (2004).
[11] Aaron Grattafiori and Abhimanyu Dubey et. al. 2024. The Llama 3 Herd of
Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
[12] Michael Hersche, Mustafa Zeqiri, Luca Benini, Abu Sebastian, and Abbas Rahimi.
2023. A Neuro-vector-symbolic Architecture for Solving Raven’s Progressive
Matrices. arXiv:2203.04571 [cs.LG] https://arxiv.org/abs/2203.04571
[13] Pentti Kanerva. 2009. Hyperdimensional computing: An introduction to com-
puting in distributed representation with high-dimensional random vectors.
Cognitive computation 1 (2009), 139–159.
[14] Spencer J. Kent, E. Paxon Frady, Friedrich T. Sommer, and Bruno A. Olshausen.
2020. Resonator Networks outperform optimization methods at solving high-
dimensional vector factorization. arXiv:1906.11684 [cs.NE] https://arxiv.org/
abs/1906.11684
[15] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading
Between the Lines: Modeling User Behavior and Costs in AI-Assisted Program-
ming. arXiv:2210.14306 [cs.SE] https://arxiv.org/abs/2210.14306
[16] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. When
to show a suggestion? Integrating human feedback in AI-assisted programming.
InProceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 10137–
10144.
[17] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. 2024. Leave No
Context Behind: Efficient Infinite Context Transformers with Infini-attention.
arXiv:2404.07143 [cs.CL] https://arxiv.org/abs/2404.07143
[18] Tony A Plate. 1995. Holographic reduced representations. IEEE Transactions on
Neural networks 6, 3 (1995), 623–641.
[19] Agnia Sergeyuk, Ekaterina Koshchenko, Ilya Zakharov, Timofey Bryksin, and
Maliheh Izadi. 2024. The Design Space of in-IDE Human-AI Experience.
arXiv:2410.08676 [cs.SE] https://arxiv.org/abs/2410.08676
[20] Thomas Weber, Rafael Vinicius Mourao Thiel, and Sven Mayer. 2023. Supporting
Software Developers Through a Gaze-Based Adaptive IDE. In Proceedings of
Mensch Und Computer 2023 (Rapperswil, Switzerland) (MuC ’23) . 267–276. https:
//doi.org/10.1145/3603555.3603571
Page 6:
FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al.
A The result of querying the model M with q is
a frequency vector R
We define a frequency vector as a vector𝐹in hyperspace that repre-
sents the weighted combination of a set of basis vectors 𝑠0,𝑠1,...,𝑠𝑚,
such that:
𝑅=𝑎·𝑓0⊕𝑏·𝑓1⊕···⊕𝑧·𝑓𝑚,
where𝑎,𝑏,...,𝑧≥0and each weight corresponds to the frequency
or relevance of the respective basis vector.
The model is trained as:
𝑀=|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖))
where:
𝑇𝑟𝑎𝑖𝑛 ={𝑆𝑒𝑞0,𝑆𝑒𝑞 1,...,𝑆𝑒𝑞|𝑇𝑟𝑎𝑖𝑛|−1},
and each sequence is defined as:
𝑆𝑒𝑞𝑖={𝐶(𝑠0),𝐶(𝑠1),...,𝐶(𝑠𝑛)}.
By the definition of the 𝐸𝑛𝑐𝑜𝑑𝑒 function, the model formulation
can be expanded as:
𝑀=|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0 𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖[:𝑛−1]))⊗𝑆𝑒𝑞𝑖,𝑛.
For brevity, we define 𝑆𝑒𝑞𝑖[𝑛−1]as𝑆𝑒𝑞𝑖,𝑛. Similarly, the query
vector𝑞is defined as:
𝑞=𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞′
𝑘[0 :𝑛−1])).
Due to the high quasi-orthogonality and the low probability of
collisions between vectors in hyperspace, multiplicative unbinding
(or querying, as described in the paper) results in the expression:
𝑀⊗𝑞=|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0 𝑞⊗𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖[:𝑛−1]))⊗𝑆𝑒𝑞𝑖,𝑛.
This simplifies further under the assumption that all subse-
quences𝑆𝑒𝑞𝑖[:𝑛−1]in the training data are unique and match the
queried subsequence. In this case, the formulation becomes:
𝑀⊗𝑞=|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0( I⊗𝑆𝑒𝑞𝑖,𝑛),
where Iis the identity vector.
However, this assumption is unlikely in practice. When the pre-
ceding subsequence does not match exactly, the quasi-orthogonality
ensures that the similarity between unrelated vectors is normally
distributed around zero, resulting in negligible noise. Thus, in the
absence of matching subsequences, 𝑀⊗𝑞simplifies to noise, which
is equally dissimilar to all vectors.
Therefore, in the general case, 𝑀⊗𝑞can be expressed as:
|𝑇𝑟𝑎𝑖𝑛|−1Ê
𝑖=0(𝑆𝑒𝑞𝑖,𝑛)+𝑛𝑜𝑖𝑠𝑒.
This represents a frequency vector of possible states. The result-
ing vector𝑅can be written as:
𝑅=𝑎·𝑠0⊕𝑏·𝑠1⊕···⊕𝑧·𝑠𝑚,
where𝑎,𝑏,...,𝑧≥0.This is by definition a frequency vector and by the nature of
the hyperspace, the vector 𝑅will be most similar to 𝑠𝑗, where the
corresponding factor (e.g., 𝑎,𝑏) is the greatest.
B N-grams can be constructed in amortized
constant time
Claim: Let 𝑠0,𝑠1,...,𝑠𝑚−1be a session of length 𝑚from which we
form𝑛-grams. Once the first 𝑛-gram is built (cold start) and amortized,
constructing each subsequent 𝑛-gram (i.e., sliding the window by one)
requires only constant time, irrespective of 𝑚.
Let𝑠𝑖denote the state at index 𝑖within a session of length 𝑚. We
define an𝑛-gram at index 𝑖by 𝑠𝑖,𝑠𝑖+1, ...,𝑠𝑖+𝑛−1. Let𝐺𝑖denote
the hypervector encoding of the 𝑛-gram starting at 𝑖:
𝐺𝑖=Encode 𝑠𝑖,𝑠𝑖+1,...,𝑠𝑖+𝑛−1.
Theorem B.1. Suppose we have a session of length 𝑚and we wish
to form every 𝑛-gram𝐺𝑖for𝑖=0,1,...,𝑚−𝑛. Once𝐺0is computed
and its cost amortized, 𝐺𝑖+1can be computed from 𝐺𝑖in constant
time. Consequently, forming all 𝑛-grams in the session costs O(𝑚)
(excluding operations in dimension 𝐷), i.e., constant time per 𝑛-gram.
B.1 Proof
1. Amortized Cold-Start. The first𝑛-gram𝐺0is computed by
encoding 𝑠0,𝑠1,...,𝑠𝑛−1. This cold-start takes O(𝑛)work plus
dimensional operations O(𝐷)(where𝐷is the dimension of the
hypervectors). We spread this cost over all 𝑚−𝑛+1𝑛-grams in
the session. Thus, amortized per 𝑛-gram, the cold-start overhead
becomes negligible as 𝑚≫𝑛. Formally,𝑛+O(𝐷)
𝑚−𝑛+1≈O 1
𝑚when
𝑚≫𝑛.
2. Sliding Window Update: 𝐺𝑖→𝐺𝑖+1.Assume𝐺𝑖is known. We
need𝐺𝑖+1=Encode(𝑠𝑖+1,...,𝑠𝑖+𝑛). Under typical vector-symbolic
architectures:
𝐺𝑖+1=Update 𝐺𝑖, 𝑠𝑖, 𝑠𝑖+𝑛,
where𝑈𝑝𝑑𝑎𝑡𝑒 “removes” the effect of 𝑠𝑖and “adds” the effect of
𝑠𝑖+𝑛via hyperdimensional operations. For instance, if Encode used
a binding operator ⊗that behaves like elementwise multiplication,
then removing a state amounts to multiplying by the inverse of
its hypervector (here, the hypervector inverse is the same as the
hypervector itself in {−1,+1}𝐷). Adding a new state is another
elementwise multiply or combine step.
Crucially, this sliding update from 𝐺𝑖to𝐺𝑖+1involves a constant
amount of small, discrete operations (unbinding one vector, binding
another). The number of operations does not scale with 𝑖or𝑚. Thus,
𝐺𝑖+1←−𝐺𝑖inO(1)(excluding dimension 𝐷).
3. Per-Step Cost is Constant. For each new position 𝑖, we spend
O(1)time to compute 𝐺𝑖+1from𝐺𝑖, apart from theO(𝐷)dimen-
sional operations that do not depend on 𝑚. Therefore, constructing
all𝑛-grams in a session of length 𝑚involves(𝑚−𝑛+1)incremental
steps, each inO(1)plus dimension 𝐷factors.
4. Total Complexity. Over one session, forming every 𝑛-gram
therefore costs:
O(𝑛+(𝑚−𝑛)×1)=O(𝑚),
Page 7:
HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway
plus the dimensional factor O(𝐷)per step. Importantly, it is inde-
pendent of𝑚in the sense that each step is constant. When summingover a large training set |Train|, this procedure extends naturally
to yieldO 𝐷×|Train|for the entire dataset in a single pass.
■