Open AGI Codes | Your Codes Reflect!

Updates

Generating audio...

arxiv

Paper 2503.10254

HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States

Authors: Roham Koohestani, Maliheh Izadi

Published: 2025-03-13

Abstract:

In the rapidly evolving world of software development, the surge in developers' reliance on AI-driven tools has transformed Integrated Development Environments into powerhouses of advanced features. This transformation, while boosting developers' productivity to unprecedented levels, comes with a catch: increased hardware demands for software development. Moreover, the significant economic and environmental toll of using these sophisticated models necessitates mechanisms that reduce unnecessary computational burdens. We propose HyperSeq - Hyper-Adaptive Representation for Predictive Sequencing of States - a novel, resource-efficient approach designed to model developers' cognitive states. HyperSeq facilitates precise action sequencing and enables real-time learning of user behavior. Our preliminary results show how HyperSeq excels in forecasting action sequences and achieves remarkable prediction accuracies that go beyond 70%. Notably, the model's online-learning capability allows it to substantially enhance its predictive accuracy in a majority of cases and increases its capability in forecasting next user actions with sufficient iterations for adaptation. Ultimately, our objective is to harness these predictions to refine and elevate the user experience dynamically within the IDE.

Paper Content:

Page 1: HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States Roham Koohestani Maliheh Izadi {rkoohestani, m.izadi}@tudelft.nl Delft University of Technology Delft, The Netherlands Abstract In the rapidly evolving world of software development, the surge in developers’ reliance on AI-driven tools has transformed Integrated Development Environments into powerhouses of advanced fea- tures. This transformation, while boosting developers’ productivity to unprecedented levels, comes with a catch: increased hardware demands for software development. Moreover, the significant eco- nomic and environmental toll of using these sophisticated models necessitates mechanisms that reduce unnecessary computational burdens. We propose HyperSeq — Hyper-Adaptive Representation for Predictive Sequencing of States — a novel, resource-efficient approach designed to model developers’ cognitive states. HyperSeq facilitates precise action sequencing and enables real-time learn- ing of user behavior. Our preliminary results show how HyperSeq excels in forecasting action sequences and achieves remarkable prediction accuracies that go beyond 70%. Notably, the model’s online-learning capability allows it to substantially enhance its pre- dictive accuracy in a majority of cases and increases its capability in forecasting next user actions with sufficient iterations for adap- tation. Ultimately, our objective is to harness these predictions to refine and elevate the user experience dynamically within the IDE. CCS Concepts •Human-centered computing ;•Software and its engineering ; Keywords User Behavior Modeling, IDE Design, Formal Methods ACM Reference Format: Roham Koohestani Maliheh Izadi, {rkoohestani, m.izadi}@tudelft.nl, Delft University of Technology, Delft, The Netherlands . 2025. HyperSeq: A Hyper- Adaptive Representation for Predictive Sequencing of States. In Proceedings of ACM International Conference on the Foundations of Software Engineering (FSE ’25). ACM, New York, NY, USA, 7 pages. https://doi.org/XXXXXXX. XXXXXXX 1 Introduction Since the advent of artificial intelligence, the software development life-cycle has become increasingly entangled with AI tooling. The Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. FSE ’25, June 23–27, 2025, Trondheim, Norway ©2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-XXXX-X/18/06 https://doi.org/XXXXXXX.XXXXXXXintroduction of tools like Copilot [ 4] and JetBrains AI [ 5] is en- hancing user productivity, while editors like Cursor [ 3] simplify programming, thereby transforming the way developers engage with software repositories. Together with these benefits, it should be noted that the under- lying models have added complexity to the systems we program, consequently raising the computational demands of programming. This increased computation demand has led to greater environmen- tal repercussions; a recent technical blog by the team responsible for the Llama 3.1 model series reveals that training the leading 405 bil- lion parameter model consumed 30.84 million GPU hours [ 1]. This translates into 7.92 kilotons of CO2 emissions, based on the setup detailed in the article [ 11] and using the United States emission factor of 367 grams of 𝐶𝑂2per kWh [ 2]. Furthermore, executing inferences with this model demands a multi-node arrangement, with each node equipped with 8 H100 GPUs operating at a 700W TDP, highlighting that a substantial impact of these models occurs during the post-deployment phase. Previous studies have attempted to leverage user telemetry data to model user behavior [ 15] and to filter out completion requests in the IDE by taking into account the user’s present state [ 8,16]. Furthermore, earlier research has differentiated between two cate- gories of user states: exploration, where a user seeks suggestions, and acceleration, where the user aims to continue programming without interruption. However, these existing methods are still rather costly to train and require a large amount of training data. Moreover, customizing these models for individual users presents challenges, as it is quite improbable that a universal model could accurately capture the behavior of all users to the same satisfactory degree. To tailor a model to an individual’s behavior, online learning is required to consistently enhance the existing model data from that user, necessitating efficient implementation. Hyperdimensional computing (HDC)[ 13] provides a framework for holistic representation, manipulation, and querying of abstract concepts. This approach encodes ideas within high-dimensional vector spaces, taking advantage of the quasi-orthogonality inherent in these vectors. This paper introduces HyperSeq, a Hyper-Adaptive Represen- tation for Predictive Sequencing of States. We assess our model’s effectiveness under two distinct scenarios, considering both adap- tive and non-adaptive conditions using an established dataset of real-world user interactions and states. Our results indicate that the model effectively predicts the states of the users, with an aver- age accuracy of more than 70%. Moreover, leveraging adaptation mode allows the model to substantially improve its accuracy by employing online learning on user states.arXiv:2503.10254v1 [cs.SE] 13 Mar 2025 Page 2: FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al. In this paper, we (1) establish the theoretical basis for a hy- perdimensional sequence model, (2) formally define predictive se- quencing and demonstrate that the model allows for training in 𝑂(𝐷·|𝑇𝑟𝑎𝑖𝑛|), and online adaptation in 𝑂(𝐷)time, (3) conduct an initial assessment of the model’s performance, and (4) outline potential future applications for HyperSeq. 2 Motivation There is a tremendous amount of cost associated with running these large models, especially when dealing with large-scale inference. An approach to addressing this issue involves minimizing the number of incorrect invocations, meaning decreasing the volume of calls that are likely to result in the rejection of generated responses. By analyzing user states, we can determine the likelihood of a user accepting or rejecting completions, thus alleviating server strain. Moreover, recent research indicates [ 19] that developers prefer to incorporate proactivity within their systems. By forecasting upcoming states and user actions, we can strive to embed state- aware mechanisms into IDEs. For example, detecting when a user is about to enter refactoring mode could allow the IDE to proactively initiate a search for po- tential refactorings and suggest these options as soon as the mode change is detected. 3 Background Literature 3.1 Hyper-dimensional Computing Hyper-Dimensional Computing (HDC) is a computational paradigm in which high-dimensional vectors, typically ranging from 104to 105, are utilized to represent, process and query abstract concepts. Although HDC was coined by Pentti Kanerva [ 13], the concept dates to the 1990s, with holographic reduced representation (HRR) [ 18] and Vector-Symbolic Architectures [10]. In recent years, there has been growing interest in employing HDC for intelligent tasks because of the efficiency of models based on HDC. Such HDC-based methods have been applied to challeng- ing problems like Raven’s Progressive Matrices [ 12], as well as in encoding visual representations efficiently and retrieving this data using what is referred to as resonator networks [ 14]. Recent studies have explored the integration of HDC-inspired modifications into current architectures. InfiniAttention [ 17] has demonstrated signif- icant advancements in processing exceptionally lengthy contexts by encoding them into a sparse hyperdimensional memory. 3.2 User-IDE interaction Prior research has developed taxonomies of user interactions in IDEs, focusing on AI features. One study [ 7] identified two modes: exploration, where developers seek recommendations and solu- tions, and acceleration, where they work confidently with minimal distractions. Various investigations have employed telemetry data from within the IDE to determine programmers’ conditions. By analyzing fea- tures like caret movements and typing speed, these studies inferred the action states and developed a classification of 12 user states, along with their transitional characteristics [ 15]. The concept ofleveraging telemetry data to enhance Human-AI interaction re- mains prevalent, with numerous studies applying it to effectively filter out requests that are highly likely to be declined [8, 16]. Researchers have examined how to design Human-AI eXperi- ences (HAX) within the IDE by establishing a classification of meth- ods through which developers seek assistance and identifying vari- ous domains where distinct models can provide support, including testing, maintenance, and optimization. 3.3 Next-action prediction So far, limited research has focused on predicting users’ next ac- tions within IDEs. Cursor, a newly developed AI-integrated IDE, frequently references next-action prediction on its website, sug- gesting significant investment in this area [ 3]. Moreover, existing studies have explored employing gaze-tracking to dynamically ad- just the IDE according to the user’s forthcoming actions [20]. 4 Problem Definition In this section, we formally define the problem statement to im- prove clarity for subsequent sections. As noted in section 8, our present work concentrates on predicting future user states given the limited availability of data. Nevertheless, we plan to broaden the investigation in future research to deduce subsequent user states directly from user telemetry. 4.1 Predictive Sequencing Let𝑆be the set containing all potential states, and let 𝑠indicate an individual state within 𝑆. Similarly, let 𝑈signify the set of all users, with 𝑢representing a single user in 𝑈. A session state se- quence, denoted 𝑆𝑒𝑠𝑠, is defined as an 𝑚-tuple of states 𝑆𝑒𝑠𝑠 = (𝑠0,𝑠1,...,𝑠𝑚),where each 𝑠𝑖∈𝑆.Furthermore, we denote the 𝑖-th session of user 𝑢as𝑆𝑒𝑠𝑠𝑢,𝑖. The objective of next-state prediction is to develop a model 𝑀capable of forecasting the following state for a user𝑢based on an 𝑛-tuple of states. Formally, the model is articulated as: 𝑀:𝑆𝑛↦→𝑆.Additionally, we define a user-specific model as𝑀𝑢, tailored for the individual user 𝑢. 4.2 Hyper-Dimensional Computing (HDC) In this section, we expand upon the formal structure introduced earlier to lay the theoretical groundwork for our approach. We adhere to the following notational conventions: •The bundling operator is represented by ⊕, •The binding operator is signified by ⊗, and •The permutation operator is symbolized by 𝑃. Further, we utilize the Bipolar Map framework, wherein hy- pervectors are drawn from the domain 𝐷𝑂𝑀 ={−1,1}𝐷, with𝐷 denoting the dimensionality of the hyperspace. The elements of the chosen hypervectors are sampled from the probability distribution 𝑃=2·(Ber(0.5)−0.5)[9]. A hypervector 𝑣is characterized as 𝑣∈𝐷𝑂𝑀 . A codebook 𝐶links the state space 𝑆to the hyperspace 𝐷𝑂𝑀 , i.e.,𝐶:𝑆↦→ 𝐷𝑂𝑀 . Likewise,𝐶′provides the reverse mapping, defined as 𝐶′: 𝐷𝑂𝑀↦→𝑆. The function 𝐸𝑛𝑐𝑜𝑑𝑒 is designed to process an 𝑛-tuple of hypervectors 𝑣𝑖while maintaining their sequential characteristics. Formally,𝐸𝑛𝑐𝑜𝑑𝑒 :𝐷𝑂𝑀𝑚↦→𝐷𝑂𝑀 , and it is implemented as Page 3: HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway following where 𝑉=(𝑣0,𝑣1,...,𝑣𝑛−1), 𝐸𝑛𝑐𝑜𝑑𝑒(𝑉)=𝑛−1Ì 𝑖=0(𝑃𝑛−𝑖−1(𝑣𝑖)). Additionally, we define a function 𝑆𝑖𝑚 that takes two inputs from the defined hyperspace and yields a real-number indicat- ing the similarity between the two vectors. More formally, 𝑆𝑖𝑚 : (𝐷𝑂𝑀,𝐷𝑂𝑀)↦→R. In our context, we measure similarity using co- sine similarity, with values ranging from -1 to 1, where -1 represents complete dissimilarity and 1 represents complete similarity. 5 HyperSeq 5.1 Representation As stated earlier, the challenge of depicting action states involves encoding every action subsequence of length 𝑛and then using the model to predict the most probable action from a subsequence of 𝑛−1actions. Hence, a foundational model 𝑀Basecan be developed utilizing the training dataset 𝑇𝑟𝑎𝑖𝑛 |𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖)) where 𝑇𝑟𝑎𝑖𝑛 ={𝑆𝑒𝑞0,𝑆𝑒𝑞 1,···,𝑆𝑒𝑞|𝑇𝑟𝑎𝑖𝑛|−1} 𝑆𝑒𝑞𝑖={𝐶(𝑠0),𝐶(𝑠1),···,𝐶(𝑠𝑛)} with𝑠𝑗being the ordered states within a session such that 𝑆𝑒𝑞𝑖is a continuous subsequence of that session. In the following sections, we explain the method used to derive the 𝑇𝑟𝑎𝑖𝑛 and𝑇𝑒𝑠𝑡 datasets from our primary dataset 𝐷𝑎𝑡𝑎 . The primary objective, naturally, of training a model of this kind is to utilize it for querying future user states, essentially engaging in predictive sequencing. This can be done by querying model 𝑀with the prefix of the anticipated user-state. In other words, since the model is trained on n-grams of user states, the (n-1)- gram preceding the anticipated action can be used to query for the forthcoming action. The 𝐸𝑛𝑐𝑜𝑑𝑒 function allows for this (n-1)-gram prefix to be represented within the n-gram’s context as a query vector𝑞=𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞′ 𝑘[0 :𝑛−1])), where𝑆𝑒𝑞′ 𝑘is intended to predict its n-th element. By using our query vector 𝑞with our Model 𝑀through the binding operator⊗, and due to the quasi orthogonality of the vec- tors, we solely unbind the suffix vector with a similar prefix (re- fer to ...). The resulting vector can be interpreted as a frequency vector representing the states linked to that prefix, expressed as 𝑅=𝑀∗𝑞=(𝐶(𝑠0),𝐶(𝑠1),...,𝐶(𝑠𝑙)), where𝑙is determined by the occurrences of specific patterns within the training dataset. Because vector𝑅is a frequency vector, performing a similarity analysis of 𝑅against all valid keys yields a discrete distribution pointing to the similarity with a specific state. A valid key 𝑘is characterized by𝑘∈Domain(𝐶′), implying that the collection of all valid keys corresponds to the items for which mapping is available in the function𝐶′, namely Domain(𝐶′). Thus, based on the frequency vector 𝑅, we can predict the n-th element in the sequence using 𝑠𝑛=𝐶′(arg max𝑘(𝑆𝑖𝑚(𝑅,𝑘))). This methodology enables us to effectively perform predictive sequenc- ing connected to action states.5.2 Adaptiveness Enhancing the foundation of the non-adaptive model, we propose altering our model’s definition to enable greater flexibility. To ac- complish this, we must revise the structure of our model 𝑀. This new iteration, model 𝑀, consists of two segments: 𝑀base, akin to what was previously mentioned. Additionally, our model now incor- porates an adaptive section, 𝑀adap, which mirrors the base model’s design. The complete model 𝑀is formed by combining these two segments using the bundling ⊕operator. The inclusion of this adap- tive component supports online learning and facilitates incremental training, much like the base model. In the following section, we will examine how this adaptation influences model performance and evaluate the computational resources demanded by this additional component. The implementation details of our methodology, along with the evaluation procedure and results, are accessible in the paper’s replication package[6]. 6 Evaluation Methodology 6.1 Datasets Due to the challenges in accessing actual user data, we decided to utilize an already available dataset from a prior study, as outlined in the background section. Mozannar et al. [ 15] developed a tax- onomy comprising twelve user states and gathered data from 21 developers to compile a labeled dataset of user states. This dataset can be employed to train and test the model in a constrained envi- ronment, allowing future opportunities for more comprehensive evaluation. Our preliminary analysis reveals that three labels pro- vide limited information, leading us to exclude them from this study. Consequently, we are left with a total of 9 labels. 6.2 Data Partitioning Strategies Let𝐷𝑎𝑡𝑎 denote the complete dataset of user session state se- quences, where 𝐷𝑎𝑡𝑎 ={𝑆𝑒𝑠𝑠𝑢,𝑖|𝑢∈𝑈′,𝑖∈{1,...,𝑛𝑢}},𝑈′⊂𝑈, and𝑛𝑢is the number of sessions for user 𝑢. The task is to split 𝐷𝑎𝑡𝑎 into two disjoint subsets, 𝑇𝑟𝑎𝑖𝑛 and𝑇𝑒𝑠𝑡 , such that𝑇𝑟𝑎𝑖𝑛∪𝑇𝑒𝑠𝑡= 𝐷𝑎𝑡𝑎 and𝑇𝑟𝑎𝑖𝑛∩𝑇𝑒𝑠𝑡 =∅. We define three data partitioning strategies as follows. 6.2.1 Disjoint Split. In a disjoint split, the dataset is divided by users: one group for training and another for testing. We use data from 18 developers for training and the rest for testing. 6.2.2 Overlapping Split. In the overlapping split, each user’s data is divided between training and testing sets. For example, with 10 sessions, 8 are for training and 2 for testing. 6.2.3 K-Fold Cross-Validation. This strategy creates a fold for each user by excluding them from training and using them only for evaluation, repeating for every user in the dataset. 6.3 Hyper-Parameters In addition to exploring various data-splitting strategies, we investi- gate the impact of four other model (hyper-)parameters: dimension, subsequence length, cyclic shift, and adaptiveness. The dimension 𝐷 was adjusted, selecting from fixed values {1000,5000,10000,20000}. Similarly, the sequence length was altered using the set {3,5,7,9}. Moreover, within the MAP framework, where the 𝑃operator acts Page 4: FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al. Figure 1: Overall accuracies of the model across the different splitting strategies and with adaptive on or off. Figure 2: Best-performing model’s accuracy; Sliding window size = 30, Dimension = 20,000, Sequence Length = 3, Cyclic Shift = 4, Adaptive = True, and Split Strategy = Disjoint. as a cyclic shift of the representation vector, we assess the effect by altering the shift values among {2,4,6}. Lastly, we compare two proposed models: one incorporating the adaptive model and another without it. 6.4 metrics 6.4.1 Overall Accuracy. According to our setup, we assess the model’s overall accuracy using the test set. When implementing cross-validation with several test sets, we calculate the average ac- curacy. For all other scenarios, we determine the model’s accuracy considering every available user. 6.4.2 Sliding Window Accuracy. For adaptive models, we assess the sliding window accuracy to understand how the model’s perfor- mance evolves. Specifically, a window of size 𝑘is used to traverse the prediction sequence, and the model’s accuracy is computed within this window. This approach helps us determine the effec- tiveness of employing online learning to tailor models to users. 7 Preliminary Results 7.1 Model Performance Our findings indicate that the models, on average, can perform within a range of 40 to 50 percent, with some configurations reach- ing above 70% on average. Furthermore, as depicted in Figure 1, the adaptive version of the model consistently surpasses the perfor- mance of the base version without employing any online learning. Our assessments indicate that the cyclic shift parameter has a negligible impact on performance. Interestingly, our analysis of the results demonstrate that longer subsequences do not enhance predictive capabilities. In fact, an increase in sequence length corre- sponds with a decline in accuracy. This may result from the vector’sinformation capacity being surpassed, suggesting further research is needed to determine if this pattern persists in higher dimensions. Moreover, when examining the top-performing model, we ob- serve a noteworthy trend in two of the three test scenarios: there is an upward trajectory in the model’s average accuracy. This pat- tern suggests that online learning effectively enhances the model’s performance. But more interestingly, there is not only an upward trend but rather that adapting the model can achieve results that make the predictions of the model near perfect. 7.2 Model Efficiency 7.2.1 Training Efficiency. Training the model involves a single pass through the training data. N-grams can be constructed in constant time, provided that previous sequences within a session are available and that the cold start of the session is amortized (see the appendix). Since the model operates on D-dimensional vectors, the time complexity for training is 𝑂(𝐷·|𝑇𝑟𝑎𝑖𝑛|). 7.2.2 Update Efficiency. The update logic is implemented as a con- stant amount of operations on D dimensional vector, making the update𝑂(𝐷). 7.2.3 Storage Efficiency. The storage requirements for the model depend on several components, which vary based on the specific implementation of the approach. To optimize memory usage, the domain of stored values per entry can be limited, allowing for stor- age in fewer bits than the conventional 32 or 64 bits. For instance, using 8 or 16 bits per entry can enhance storage efficiency. Let B represent the number of bits per entry. The stored items include the model memory (doubled in the case of the adaptive model) and the codebook. Consequently, the space complexity of the model is 𝑂(𝐵·(|𝐶|+𝐷)). 8 Future Work In future work, we intend to broaden our approach for more ex- tensive, real-world applicability. We plan to enhance the proposed framework to process incoming telemetry data from users similarly to other current methodologies. Moreover, we seek to leverage the predicted user state to adaptively optimize the user experience within the IDE. Furthermore, we are interested in investigating the impact of different Vector-Symbolic Architectures and assessing how they influence model accuracy. 9 Conclusion This paper introduced HyperSeq, a Hyper-Adaptive Representation for Predictive Sequencing of States, as a novel, efficient approach to predicting developer actions in IDEs. Using hyperdimensional computing and online adaptation, HyperSeq achieved competitive accuracy, surpassing 70% in certain configurations, while maintain- ing low computational and storage requirements. Our evaluation demonstrated that HyperSeq not only excels in forecasting user actions but also dynamically improves its accuracy through online learning, enabling personalized predictions for individual users. This efficiency positions HyperSeq as a promising solution to re- duce computational overhead in AI-powered IDEs and enhance user experiences through context-aware proactive adaptations. Fu- ture work will focus on integrating telemetry-based predictions, Page 5: HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway exploring alternative vector-symbolic architectures, and deploy- ing HyperSeq in real-world environments to optimize developer productivity and promote sustainable computing practices. References [1]2024. Llama 3.1 - 405B, 70B and 8B with multilinguality and long context. https: //huggingface.co/blog/llama31 [2]2024. U.S. Energy Information Administration (EIA). https://www.eia.gov/tools/ faqs/faq.php?id=74&t=11 [3] 2025. Cursor - the AI code editor. https://www.cursor.com/ [4]2025. GitHub Copilot ·Your AI pair programmer. https://github.com/features/ copilot [5]2025. JetBrains AI service and In-IDE AI Assistant. https://www.jetbrains.com/ai/ [6]Anonymous Authors. 2025. Replication Package. https://github.com/hyperseq- replication/package [7]Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2022. Grounded Copilot: How Programmers Interact with Code-Generating Models. arXiv:2206.15000 [cs.HC] https://arxiv.org/abs/2206.15000 [8] Aral de Moor, Arie van Deursen, and Maliheh Izadi. 2024. A transformer-based approach for smart invocation of automatic code completion. In Proceedings of the 1st ACM International Conference on AI-Powered Software . 28–37. [9]Ross W Gayler. 1998. Multiplicative binding, representation operators & analogy (workshop poster). (1998). [10] Ross W Gayler. 2004. Vector symbolic architectures answer Jackendoff’s chal- lenges for cognitive neuroscience. arXiv preprint cs/0412059 (2004). [11] Aaron Grattafiori and Abhimanyu Dubey et. al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783 [12] Michael Hersche, Mustafa Zeqiri, Luca Benini, Abu Sebastian, and Abbas Rahimi. 2023. A Neuro-vector-symbolic Architecture for Solving Raven’s Progressive Matrices. arXiv:2203.04571 [cs.LG] https://arxiv.org/abs/2203.04571 [13] Pentti Kanerva. 2009. Hyperdimensional computing: An introduction to com- puting in distributed representation with high-dimensional random vectors. Cognitive computation 1 (2009), 139–159. [14] Spencer J. Kent, E. Paxon Frady, Friedrich T. Sommer, and Bruno A. Olshausen. 2020. Resonator Networks outperform optimization methods at solving high- dimensional vector factorization. arXiv:1906.11684 [cs.NE] https://arxiv.org/ abs/1906.11684 [15] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Program- ming. arXiv:2210.14306 [cs.SE] https://arxiv.org/abs/2210.14306 [16] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. When to show a suggestion? Integrating human feedback in AI-assisted programming. InProceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 10137– 10144. [17] Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. 2024. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv:2404.07143 [cs.CL] https://arxiv.org/abs/2404.07143 [18] Tony A Plate. 1995. Holographic reduced representations. IEEE Transactions on Neural networks 6, 3 (1995), 623–641. [19] Agnia Sergeyuk, Ekaterina Koshchenko, Ilya Zakharov, Timofey Bryksin, and Maliheh Izadi. 2024. The Design Space of in-IDE Human-AI Experience. arXiv:2410.08676 [cs.SE] https://arxiv.org/abs/2410.08676 [20] Thomas Weber, Rafael Vinicius Mourao Thiel, and Sven Mayer. 2023. Supporting Software Developers Through a Gaze-Based Adaptive IDE. In Proceedings of Mensch Und Computer 2023 (Rapperswil, Switzerland) (MuC ’23) . 267–276. https: //doi.org/10.1145/3603555.3603571 Page 6: FSE ’25, June 23–27, 2025, Trondheim, Norway Koohestani et. al. A The result of querying the model M with q is a frequency vector R We define a frequency vector as a vector𝐹in hyperspace that repre- sents the weighted combination of a set of basis vectors 𝑠0,𝑠1,...,𝑠𝑚, such that: 𝑅=𝑎·𝑓0⊕𝑏·𝑓1⊕···⊕𝑧·𝑓𝑚, where𝑎,𝑏,...,𝑧≥0and each weight corresponds to the frequency or relevance of the respective basis vector. The model is trained as: 𝑀=|𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖)) where: 𝑇𝑟𝑎𝑖𝑛 ={𝑆𝑒𝑞0,𝑆𝑒𝑞 1,...,𝑆𝑒𝑞|𝑇𝑟𝑎𝑖𝑛|−1}, and each sequence is defined as: 𝑆𝑒𝑞𝑖={𝐶(𝑠0),𝐶(𝑠1),...,𝐶(𝑠𝑛)}. By the definition of the 𝐸𝑛𝑐𝑜𝑑𝑒 function, the model formulation can be expanded as: 𝑀=|𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖[:𝑛−1]))⊗𝑆𝑒𝑞𝑖,𝑛. For brevity, we define 𝑆𝑒𝑞𝑖[𝑛−1]as𝑆𝑒𝑞𝑖,𝑛. Similarly, the query vector𝑞is defined as: 𝑞=𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞′ 𝑘[0 :𝑛−1])). Due to the high quasi-orthogonality and the low probability of collisions between vectors in hyperspace, multiplicative unbinding (or querying, as described in the paper) results in the expression: 𝑀⊗𝑞=|𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0𝑞⊗𝑃(𝐸𝑛𝑐𝑜𝑑𝑒(𝑆𝑒𝑞𝑖[:𝑛−1]))⊗𝑆𝑒𝑞𝑖,𝑛. This simplifies further under the assumption that all subse- quences𝑆𝑒𝑞𝑖[:𝑛−1]in the training data are unique and match the queried subsequence. In this case, the formulation becomes: 𝑀⊗𝑞=|𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0( I⊗𝑆𝑒𝑞𝑖,𝑛), where Iis the identity vector. However, this assumption is unlikely in practice. When the pre- ceding subsequence does not match exactly, the quasi-orthogonality ensures that the similarity between unrelated vectors is normally distributed around zero, resulting in negligible noise. Thus, in the absence of matching subsequences, 𝑀⊗𝑞simplifies to noise, which is equally dissimilar to all vectors. Therefore, in the general case, 𝑀⊗𝑞can be expressed as: |𝑇𝑟𝑎𝑖𝑛|−1Ê 𝑖=0(𝑆𝑒𝑞𝑖,𝑛)+𝑛𝑜𝑖𝑠𝑒. This represents a frequency vector of possible states. The result- ing vector𝑅can be written as: 𝑅=𝑎·𝑠0⊕𝑏·𝑠1⊕···⊕𝑧·𝑠𝑚, where𝑎,𝑏,...,𝑧≥0.This is by definition a frequency vector and by the nature of the hyperspace, the vector 𝑅will be most similar to 𝑠𝑗, where the corresponding factor (e.g., 𝑎,𝑏) is the greatest. B N-grams can be constructed in amortized constant time Claim: Let𝑠0,𝑠1,...,𝑠𝑚−1be a session of length 𝑚from which we form𝑛-grams. Once the first 𝑛-gram is built (cold start) and amortized, constructing each subsequent 𝑛-gram (i.e., sliding the window by one) requires only constant time, irrespective of 𝑚. Let𝑠𝑖denote the state at index 𝑖within a session of length 𝑚. We define an𝑛-gram at index 𝑖by𝑠𝑖,𝑠𝑖+1, ...,𝑠𝑖+𝑛−1. Let𝐺𝑖denote the hypervector encoding of the 𝑛-gram starting at 𝑖: 𝐺𝑖=Encode𝑠𝑖,𝑠𝑖+1,...,𝑠𝑖+𝑛−1. Theorem B.1. Suppose we have a session of length 𝑚and we wish to form every 𝑛-gram𝐺𝑖for𝑖=0,1,...,𝑚−𝑛. Once𝐺0is computed and its cost amortized, 𝐺𝑖+1can be computed from 𝐺𝑖in constant time. Consequently, forming all 𝑛-grams in the session costs O(𝑚) (excluding operations in dimension 𝐷), i.e., constant time per 𝑛-gram. B.1 Proof 1. Amortized Cold-Start. The first𝑛-gram𝐺0is computed by encoding𝑠0,𝑠1,...,𝑠𝑛−1. This cold-start takes O(𝑛)work plus dimensional operations O(𝐷)(where𝐷is the dimension of the hypervectors). We spread this cost over all 𝑚−𝑛+1𝑛-grams in the session. Thus, amortized per 𝑛-gram, the cold-start overhead becomes negligible as 𝑚≫𝑛. Formally,𝑛+O(𝐷) 𝑚−𝑛+1≈O1 𝑚when 𝑚≫𝑛. 2. Sliding Window Update: 𝐺𝑖→𝐺𝑖+1.Assume𝐺𝑖is known. We need𝐺𝑖+1=Encode(𝑠𝑖+1,...,𝑠𝑖+𝑛). Under typical vector-symbolic architectures: 𝐺𝑖+1=Update𝐺𝑖, 𝑠𝑖, 𝑠𝑖+𝑛, where𝑈𝑝𝑑𝑎𝑡𝑒 “removes” the effect of 𝑠𝑖and “adds” the effect of 𝑠𝑖+𝑛via hyperdimensional operations. For instance, if Encode used a binding operator ⊗that behaves like elementwise multiplication, then removing a state amounts to multiplying by the inverse of its hypervector (here, the hypervector inverse is the same as the hypervector itself in {−1,+1}𝐷). Adding a new state is another elementwise multiply or combine step. Crucially, this sliding update from 𝐺𝑖to𝐺𝑖+1involves a constant amount of small, discrete operations (unbinding one vector, binding another). The number of operations does not scale with 𝑖or𝑚. Thus, 𝐺𝑖+1←−𝐺𝑖inO(1)(excluding dimension 𝐷). 3. Per-Step Cost is Constant. For each new position 𝑖, we spend O(1)time to compute 𝐺𝑖+1from𝐺𝑖, apart from theO(𝐷)dimen- sional operations that do not depend on 𝑚. Therefore, constructing all𝑛-grams in a session of length 𝑚involves(𝑚−𝑛+1)incremental steps, each inO(1)plus dimension 𝐷factors. 4. Total Complexity. Over one session, forming every 𝑛-gram therefore costs: O(𝑛+(𝑚−𝑛)×1)=O(𝑚), Page 7: HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States FSE ’25, June 23–27, 2025, Trondheim, Norway plus the dimensional factor O(𝐷)per step. Importantly, it is inde- pendent of𝑚in the sense that each step is constant. When summingover a large training set |Train|, this procedure extends naturally to yieldO𝐷×|Train|for the entire dataset in a single pass. ■