arxiv

Paper 2501.06252

Transformer-Squared: Self-adaptive LLMs

Authors: Qi Sun, Edoardo Cetin, Yujin Tang

Published: 2025-01-09

Abstract:

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer-Squared, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer-Squared employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific 'expert' vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Furthermore, Transformer-Squared demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.

Paper Content:
Published as a conference paper at ICLR 2025

TRANSFORMER-SQUARED: SELF-ADAPTIVE LLMS

Qi Sun¹,²*, Edoardo Cetin¹*, Yujin Tang¹*
¹Sakana AI, Japan ²Institute of Science Tokyo, Japan
{qisun,edo,yujintang}@sakana.ai
*Equal contribution

We provide our full source code at https://github.com/SakanaAI/self-adaptive-llms.
1 INTRODUCTION

[Figure 1 diagram; recoverable labels: SVD of Weights (U, Σ, V⊺); Self-Adaptation Vectors (Math, Coding, VLM, …); Dispatch; User Query; Hidden States; “This is a math question”; Answer to User Query; First pass; Second pass; Element-wise multiplication; Matrix multiplication; N layers inside an LLM.]

Figure 1: Overview of Transformer². In the training phase, we tune the scales of the singular values of the weight matrices to generate a set of “expert” vectors, each of which specializes in one type of task.
In the inference phase, a two-pass process is adopted where the first pass applies the task-specific expert and the second generates the answer.

arXiv:2501.06252v3 [cs.LG] 24 Jan 2025

Self-adaptive large language models (LLMs) would represent a significant advancement in artificial intelligence, providing a framework where models can adjust to varied tasks and dynamic contexts in real time. This concept draws inspiration from the long-standing idea of neural networks modifying their own weights to adapt to tasks dynamically (Schmidhuber, 1993; Irie et al., 2022) and neural networks generating weights for other networks, as popularized by HyperNetworks and related methods (Ha et al., 2017; Stanley et al., 2009). While compositionality and scalability are crucial for effective adaptation, current LLM training methodologies fall short of achieving both of these properties simultaneously. Our research aims to present a pioneering solution to realize this vision.

Traditionally, LLM post-training has sought to optimize a model for a wide range of capabilities in a single, extensive training session. While this “one-shot” fine-tuning framework is ideal from a simplicity perspective, it is also difficult to achieve in practice. For instance, post-training is still highly resource-intensive, leading to significant computational costs and training times. Additionally, there tend to be notable performance trade-offs when introducing additional breadth to the data, making it challenging to overcome overfitting and task interference at the same time.

In contrast, self-adaptive models offer a more flexible and efficient approach. Rather than attempting to train an LLM for all tasks in one step, expert modules can be developed offline and added to the base LLM on demand (Kang et al., 2024).
This allows the model to dynamically modify its behavior based on the task at hand, without the need for constant re-tuning. In addition to the benefit of having independent components, this modularity also supports continual learning, enabling the model to add new skills over time without catastrophic forgetting. Moreover, self-adaptive LLMs mirror a well-established principle in neuroscience and computational biology, where the brain activates specific regions depending on the task at hand (Loose et al., 2017) and dynamically reconfigures its functional networks in response to changing task demands (Davison et al., 2015).

In principle, the first step toward achieving self-adaptive LLMs can be realized through the development of specialized expert modules, each fine-tuned (Kaplan et al., 2020) via techniques such as low-rank adaptation (LoRA) (Hu et al., 2021). These expert modules can then be dynamically composed at runtime based on the task demands, a process that can be efficiently managed through Mixture of Experts (MoE)-like systems (Tianlong et al., 2024). However, several challenges need to be addressed to make this approach both scalable and compositional. First, fine-tuning LLMs to create multiple expert modules significantly increases the number of parameters that need to be trained. In practice, even with parameter-efficient methods like LoRA, the cumulative size of these modules can quickly escalate, leading to increased storage and computational demands. Second, these expert modules are often prone to overfitting, a phenomenon especially prevalent when training on smaller datasets or narrow task domains. Third, the flexible composition of these expert modules presents largely unresolved challenges that remain open research problems.

To overcome these limitations, we first propose Singular Value Fine-tuning (SVF), a novel parameter-efficient fine-tuning (PEFT) method to obtain effective building blocks for self-adaptation.
SVF works by extracting and tuning only the singular values within the model’s weight matrices. By focusing on this principled parameterization, our approach mitigates the risk of overfitting, drastically reduces computational demands, and allows for inherent compositionality. We show these properties enable us to cheaply obtain a set of effective domain-specific “expert” vectors by training on narrow datasets with RL, directly optimizing task performance on individual topics.

We then introduce our full Transformer² (Transformer-Squared) framework to empower LLMs through the underlying principles of self-adaptation. Given a prompt from an unknown task, Transformer² entails a two-pass inference mechanism, which we illustrate in Figure 1. During the first pass, Transformer² executes the model and observes its test-time behavior, gathering the relevant information to understand the necessary skills to tackle the current problem. During the second pass, our framework uses this information to combine the available expert vectors and provide a new modification to the base weights of the LLM specifically tailored to its test-time conditions. We design three different adaptation strategies that can be used within Transformer², which we show provide monotonic performance benefits with increasing access to the test-time conditions.

We evaluate SVF and the full Transformer² framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA while using orders of magnitude fewer parameters. Then we show that Transformer² is able to push performance further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering.
Finally, we analyze the properties of our new framework, validating that it provides increasing benefits with additional access to its current test-time conditions and even allows for recycling pre-trained SVF experts across model architectures.

In summary, our key technical contributions are the following:

• The development of Transformer² as a pivotal self-adaptation framework for LLMs, providing a universal blueprint to dynamically adapt the behavior of LLMs from a growing set of pre-trained skills.
• The introduction of SVF, a novel PEFT method trainable with RL on small datasets, producing compact expert vectors with inherent compositionality, all key properties necessary for our scalable self-adaptation framework.
• The implementation of three adaptation strategies within Transformer², effectively dispatching SVF-trained experts with properties designed to cope with different requirements and deployment scenarios.

2 RELATED WORKS

Self-adaptive LLMs. We define self-adaptive LLMs as a group of LLMs or a standalone LLM that can evaluate and modify its behavior in response to changes in its operating environment or internal state, without external intervention. This dynamic adjustment has parallels to concepts like fast-weight memories, which enable networks to update weights in response to task demands (Schmidhuber, 1992; Gomez & Schmidhuber, 2005), and neural network weights being treated as dynamic programs (Schmidhuber, 2015). Recently, Panigrahi et al. (2023) introduced an approach where a smaller auxiliary transformer is updated dynamically within a larger model, aligning with the principles of self-adaptive behavior.

This adaptation can be explored from two perspectives: a macroview, where multiple LLMs collaborate and/or compete, and a microview, where internal adaptations allow a single LLM to specialize in different tasks.
Macroview: From this perspective, the system directs queries to LLMs with domain-specific expertise, prioritizing outputs from expert models, thereby achieving higher accuracy and task-specific optimization. Such task-specific ensembles can be realized through various mechanisms: multiple LLMs playing distinct roles and coordinating toward a shared goal (Zhuge et al., 2023), engaging in mutual listening and debate (Du et al., 2023), or using meticulously crafted prompt constructions (Zhang et al., 2024) to integrate knowledge library and skill planning. Naturally, the improvement in the specialization and adaptive capabilities of individual LLMs in the ensemble enhances the collective performance. Thus, in this paper, we focus on the microview of self-adaptive LLMs.

Microview: MoE in LLMs plays a critical role in this perspective (Tianlong et al., 2024). In MoE systems, inputs are dynamically routed to a subset of specialized modules or layers (e.g., MLPs) containing domain-specific knowledge (Rajbhandari et al., 2022; Fedus et al., 2022). To reduce inference time, researchers introduced sparsely activated MoE, where only a subset of the experts is selected per token (Jiang et al., 2024; Qwen Team, 2024). While it is possible to view Transformer² loosely as a type of MoE, there are two major differences. First, in the aforementioned systems, self-adaptation is achieved through token-level routing, whereas Transformer² employs a sample-level module selection strategy. The second difference lies in the construction of expert modules. In traditional MoE systems, expert modules are either trained from scratch (Fedus et al., 2022; Jiang et al., 2024) or derived from dense models (e.g., upcycling) (Qwen Team, 2024; Zhu et al., 2024), without an auxiliary loss to ensure module specialization. In contrast, Transformer² specifically trains expert vectors with RL to acquire domain-specific knowledge, making them true experts.
Low-rank adaptation. PEFT methods such as LoRA (Hu et al., 2021) work by freezing the original model’s parameters and introducing small trainable low-rank matrices for task-specific updates. This significantly lowers the computational and memory costs while providing performance comparable to full fine-tuning. Inspired by LoRA’s design, various modifications have been proposed (Zhang et al., 2023; Kopiczko et al., 2023; Liu et al., 2024; Bałazy et al., 2024; Cetoli, 2024; ?). Transformer² does not rely on low-rank matrices, and instead scales the singular vectors of the original parameter matrix that span the full rank space.

SVD for LLM fine-tuning. SVD is increasingly being used as an inductive bias for PEFT in LLMs. For example, Wang et al. (2024) decompose a weight matrix and use the minor singular components, associated with noisy or long-tail information, to initialize low-rank matrices for LoRA fine-tuning. Earlier work proposed using compressed forms like DCT coefficients for generating weight matrices in neural networks (Koutnik et al., 2010), offering efficiency in memory-constrained environments, which resonates with our approach. In a similar vein, SVD is employed to approximate an original weight matrix with the top r singular vectors, corresponding to the highest singular values. A small trainable matrix is then introduced on top of the truncated singular value matrix to adjust the magnitude and orientations within this top-r subspace (Bałazy et al., 2024; Cetoli, 2024). However, the drawback of this approach is that retaining only the top singular components can result in the loss of important information, particularly when the distribution of singular values is less skewed. The work most similar to ours is a concurrent effort by Lingam et al. (2024), where they introduce various sparsification methods that utilize the SVD of the weights.
However, it is not for self-adaptive LLMs and does not use RL to enhance learning efficiency.

3 METHODS

3.1 PRELIMINARIES

Singular value decomposition (SVD) offers a fundamental view of matrix multiplications. In the context of neural networks, each weight matrix $W \in \mathbb{R}^{n \times m}$ can be decomposed into three components $W = U \Sigma V^\top$, yielding semi-orthogonal matrices $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ together with an ordered vector of $r$ singular values (in descending order) arranged in the diagonal matrix $\Sigma \in \mathbb{R}^{r \times r}$. The linear operation defined by applying $W$ onto $x$ can then be decomposed into a sum of independent terms, derived from mapping each column $v_i$ from $V$ into the corresponding column $u_i$ from $U$ as $y = \sum_{i=1}^{r} \sigma_i u_i v_i^\top x$. Hence, each singular component represented by the rank-1 matrix $u_i v_i^\top$ independently processes the input, providing an orthogonal contribution to the layer’s outputs, with the singular values $\sigma_i$ modulating the degree of the contributions.

Cross-entropy method (CEM) is a Monte Carlo method for importance sampling and optimization (Rubinstein & Kroese, 2004). The method is based on the concept of minimizing the KL divergence between two probability distributions $D_{\mathrm{KL}}(P \,\|\, Q)$, where $P$ is the target distribution and $Q$ is a maintained distribution. At its core, CEM repeatedly generates a set of samples from $Q$, evaluates these samples with a performance function, and then updates the distribution $Q$ with the characteristics of the elite samples that have performed best. In the standard setup employed in most applications, $Q$ is set to a diagonal multivariate Gaussian, reducing the problem to simply estimating the empirical mean and standard deviation of the latest elites until a stopping criterion is met. We illustrate a complete CEM step in the Python pseudocode below.

3.2 TRANSFORMER²

The construction of Transformer² comprises two main steps, for which we provide an illustrative overview in Figure 2.
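The CEM step described in Section 3.1 can be sketched in a few lines of NumPy. The following is an illustrative reconstruction under stated assumptions, not the authors' listing: the function names `cem_step` and `run_cem`, the toy quadratic objective, the sample and elite counts, and the stopping threshold are all hypothetical choices made for this example.

```python
import numpy as np


def cem_step(mu, sigma, score_fn, num_samples, num_elite, rng):
    """One CEM update for a diagonal Gaussian Q = N(mu, diag(sigma^2))."""
    # Draw candidates from the maintained distribution Q.
    samples = rng.normal(loc=mu, scale=sigma, size=(num_samples, mu.shape[0]))
    # Evaluate every candidate with the performance function (higher is better).
    scores = np.array([score_fn(s) for s in samples])
    # Keep the best-scoring candidates as the elite set.
    elites = samples[np.argsort(scores)[-num_elite:]]
    # Refit Q to the elites: per-dimension empirical mean and standard deviation.
    return elites.mean(axis=0), elites.std(axis=0)


def run_cem(score_fn, dim, num_iters=50, num_samples=256, num_elite=32, seed=0):
    """Repeat CEM steps until Q has collapsed or the iteration budget is spent."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(num_iters):
        mu, sigma = cem_step(mu, sigma, score_fn, num_samples, num_elite, rng)
        if sigma.max() < 1e-6:  # hypothetical stopping criterion
            break
    return mu, sigma


# Toy objective (an assumption for illustration): maximized at `target`.
target = np.array([1.0, -2.0, 0.5])
best_mu, best_sigma = run_cem(lambda x: -np.sum((x - target) ** 2), dim=3)
print("estimated optimum:", best_mu)
```

Because Q is a diagonal Gaussian, refitting it reduces to a per-dimension mean and standard deviation of the elites, which is the simplification noted in Section 3.1.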
First, we introduce Singular Value Fine-tuning (SVF), a method to learn compact and compositional expert vectors with RL, based on the SVD of the base model’s weights. Then, we describe three different adaptation strategies within Transformer², inspired by three orthogonal principles, which adaptively combine the SVF-trained expert vectors during inference. We motivate how the properties of SVF are highly complementary to our adaptation strategies, making Transformer² an effective and scalable framework for the design of new self-adaptive LLMs.

Singular value fine-tuning is a key building block in Transformer². It offers an extremely efficient parameterization for fine-tuning and provides inherent compositionality for adaptation. Conventional fine-tuning techniques often aim to augment pre-trained models with new capabilities by modifying their weight matrices. However, in large-scale transformers, these weights are already rich repositories of abstracted knowledge, thanks to the breadth of the pre-training data and expansive architectural design. In fact, as evidenced in much of the prior literature, the requisite capabilities for solving many downstream tasks appear to already exist within these pre-trained models (Sharma et al., 2023). Therefore, instead of seeking to add new features, an efficient fine-tuning approach should focus on making these latent capabilities more expressible. Motivated by these considerations, for any weight matrix $W$, SVF learns a simple vector $z \in \mathbb{R}^{r}$ that provides targeted modifications to each singular component of $W$ independently, yielding a new weight matrix $W' = U \Sigma' V^\top$, where $\Sigma' = \Sigma \otimes \mathrm{diag}(z)$.
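The SVF parameterization can be checked numerically on a toy matrix. The sketch below is a minimal NumPy illustration, assuming that Σ ⊗ diag(z) amounts to multiplying each singular value σ_i by z_i; the matrix sizes, the helper name `svf_weight`, and the example vector `z` are hypothetical and not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                      # toy sizes: W maps R^m to R^n
W = rng.normal(size=(n, m))

# Thin SVD: U is (n, r), s holds the r singular values in descending order,
# Vt is (r, m), with r = min(n, m).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = s.shape[0]


def svf_weight(U, s, Vt, z):
    """Return W' = U diag(s * z) V^T, i.e. singular value i scaled by z[i]."""
    return (U * (s * z)) @ Vt    # broadcasting scales column i of U by s[i] * z[i]


# The layer output is a sum of r independent rank-1 terms sigma_i u_i v_i^T x.
x = rng.normal(size=m)
y_sum = sum(s[i] * U[:, i] * (Vt[i] @ x) for i in range(r))
assert np.allclose(y_sum, W @ x)

# With z = 1 the base weights are recovered unchanged.
assert np.allclose(svf_weight(U, s, Vt, np.ones(r)), W)

# Hypothetical expert vector: amplify the first component, switch off the last.
z = np.array([1.5, 1.0, 1.0, 0.0])
W_prime = svf_weight(U, s, Vt, z)
print("singular values before:", s)
print("singular values after: ", np.linalg.svd(W_prime, compute_uv=False))
```

Only the r entries of `z` would be trained in this setup; `U`, `s`, and `Vt` are computed once from the frozen base weights.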
This essential parameterization enjoys several benefits:

[Figure 2 diagram; recoverable labels: Training Time panel with N layers (Layer Norm, Attention, Layer Norm, MLP), M matrices each decomposed into U, Σ, V⊺ with a vector Z; learnable parameters trained with RL; frozen parameters. Inference Time panel: A) Prompt-based adaptation, or B) Job classifier-based adaptation (replaced with one learned vector); C) Mixture-based adaptation (replaced with a mixture of the learned vectors, with coefficients α₁, α₂, …, αₖ).]

Figure 2: Method overview. Left) At training time, we employ SVF and RL to learn the “expert” vectors z that scale the singular values of the weight matrices. Right) At inference time, we propose three distinct methods to adaptively select/combine the learned expert vectors.

Negligible parameters: Learning only a vector $z$ for each weight matrix allows for very efficient fine-tuning with orders of magnitude fewer optimized parameters, even when compared to prior approaches specifically designed for efficiency. For example, the widely popular LoRA approach requires $(m + n) \times r'$ learnable parameters per weight matrix, where $r'$ is a hyper-parameter that generally needs to be set large enough for expressivity.
While recent extensions, such as LoRA-XS (Bałazy et al., 2024), try to push efficiency even further, they often introduce limiting assumptions that curb applicability in several practical scenarios (see examples in Appendix C). In contrast, while SVF only needs $r = \min(m, n)$ parameters, we show empirically that it does not display the same shortcomings, thanks to working in a highly meaningful space provided by the latent expressiveness compressed in the weights of modern LLMs. Although scaling only the singular values may seem to limit SVF's expressiveness, we wish to point out that the ability to affect the weight matrix in a full-rank manner technically provides more information than low-rank approaches.

High compositionality: Decomposing the weights into independent singular components makes the learned $z$ vectors highly composable and interpretable, opening numerous possibilities for adaptation via algebraic manipulations. In contrast, LoRA-based methods inherently lack these properties. For instance, even if two LoRAs learned on the same task were to learn exactly the same adjustments for each $W$, directly interpolating between their compressed $A$ and $B$ matrices is unlikely to preserve any of their original behavior, given the countless number of equivalent parameter permutations they might have converged to.

Principled regularization: Exclusively modifying the magnitude of pre-existing singular components provides a principled and effective form of regularization. In practice, this property enables us to fine-tune for arbitrary downstream tasks with only hundreds of data points without the risk of severe collapse or overfitting.

End-to-end optimization with RL. We train a set of SVF vectors $\theta_z = \{z_1, \cdots, z_{N \times M}\}$ to fine-tune an arbitrary language model $\pi_{\theta_W}$ parameterized by $\theta_W$ with RL, optimizing directly for task performance. Here, $\theta_W = \{W_1, \cdots, W_{N \times M}\}$ is the set of weight matrices, where $N$ is the number of layers and $M$ is the number of weight matrices to fine-tune per layer. We use the seminal REINFORCE algorithm (Williams, 1992) and label each generated answer $\hat{y}_i$ (for the prompt $x_i \in D$) with a unitary reward based on its correctness, $r \in \{-1, 1\}$. Inspired by related applications of RL for optimizing LLMs (Ouyang et al., 2022), we regularize the REINFORCE objective by adding a KL penalty for deviating from the original model's behavior, weighted by a small coefficient $\lambda \in \mathbb{R}^+$. Thus, our final objective function can be written as:

$$J(\theta_z) = \mathbb{E}\left[\log \pi_{\theta_{W'}}(\hat{y}_i \mid x_i)\, r(\hat{y}_i, y_i)\right] - \lambda D_{\mathrm{KL}}\left(\pi_{\theta_{W'}} \,\|\, \pi_{\theta_W}\right), \tag{1}$$

where we use $\pi_{\theta_{W'}}$ to denote the resulting language model after substituting the original weight matrices $W$ with $W'$. While RL is generally considered less stable than next-token prediction objectives, we find the regularization properties of SVF avoid many of the failure modes of prior less-constrained parameterizations (see Section 4.3). Thus, combining these complementary components effectively enables us to avoid relying on expensive fine-tuning procedures with large hand-designed datasets as proxies, and directly maximize task performance end-to-end.

In general, SVF with RL places lower requirements on the dataset it trains on. For example, LoRA fine-tuning requires “explaining texts” to perform next-token prediction, which puts a higher requirement on the dataset (e.g., imagine LoRA fine-tuning on a GSM8K dataset where no reasoning text but only the final number is provided). This benefit allows SVF to be more general and effective. One possible caveat SVF can face is the sparse rewards caused by a weak base model, which we discuss further in Section 5.

Self-adaptation is a critical mechanism in nature that has established itself as a core guiding principle in modern system design (Klös et al., 2015).
Our initial efforts toward self-adaptive foundation models focus on the inference stage of LLMs, where we devise a simple two-pass adaptation strategy that combines $K$ sets of base “expert” vectors $z_{1:K}$ trained with SVF to provide different kinds of capabilities (e.g., coding, math, etc.). The mapping between a capability and the dataset we train on can be acquired from the dataset's metadata. In the first inference pass, given a task or an individual input prompt, Transformer² executes the model and observes its test-time behavior to derive a new $z'$ vector tailored to its test-time conditions. This adapted $z'$ is then used in the second inference pass to provide an actual response with the newly adapted weights. The interaction between SVF-trained expert vectors and the adaptation strategies ensures seamless integration, where expert vectors provide modular capabilities, and the adaptation strategies dynamically determine and compose the most suitable combination to address the input task. In this first work, we propose three simple approaches to produce the vector $z'$ during the first inference pass, implementing self-adaptation with distinct methods and requirements. Below, we provide an outline of each method and refer to Appendix A for additional implementation details.

A) Prompt engineering: Our most basic approach involves constructing a new “adaptation” prompt, which we use to directly ask the LLM to categorize the input prompt. Based on its response, we then extract one category out of the set of domain topics used to pre-train each SVF expert and, thus, we select the corresponding $z'$ directly from $z_{1:K}$. In our adaptation prompt, we also explicitly provide the option for a generic “others” category, allowing the model to use its base weights in case no expert provides appropriate capabilities. We show the format used to construct the adaptation prompt in Figure 3.
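As a rough illustration of the dispatch step in approach A, the sketch below parses the classifier response and looks up the matching expert. It assumes the response follows the \boxed{category} format of Figure 3; the function names, the `experts` dictionary, and the fallback to "others" on an unparseable response are illustrative assumptions rather than details of the released implementation.

```python
import re

# Category names mirror the four options offered in the adaptation prompt.
CATEGORIES = ("code", "math", "reasoning", "others")


def parse_category(response: str) -> str:
    # Collect every boxed label and keep the last valid one;
    # fall back to "others" when nothing usable is found (an assumption).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    for label in reversed(matches):
        label = label.strip().lower()
        if label in CATEGORIES:
            return label
    return "others"


def select_expert(response: str, experts: dict):
    # `experts` is a hypothetical mapping from category name to its z vector.
    # Returning None stands for "keep the base weights" (the "others" case).
    return experts.get(parse_category(response))
```

Under these assumptions, a response of `Classification: \boxed{math}` selects the math expert, while an "others" label or a malformed response leaves the base model unchanged for the second pass.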
[Figure 3 prompt text:]
Analyze the given question and classify it into one of four categories: 'code', 'math', 'reasoning', or 'others'. Follow these guidelines:
1. Code: Questions asking for programming solutions...
2. Math: Questions involving mathematical calculations...
3. Reasoning: Questions requiring logical thinking...
4. Others: Questions not clearly fit into above categories...
Instructions:
- Consider the primary focus, skills, and knowledge required to answer the question.
- If a question spans multiple categories, choose the most dominant one.
- Provide your final classification within \\boxed{} notation. Example: \\boxed{reasoning}
Format your response as follows:
Classification: \\boxed{category}

Figure 3: Prompt-based adaptation. Self-adaptation prompt used by Transformer² to classify the task prompt into pre-defined categories.

B) Classification expert: A direct extension of the prompt engineering approach comes from using a specialized system to handle task identification. Following the principles of self-adaptation, we apply SVF to fine-tune the base LLM itself to handle this task. In particular, we collect a dataset $D = \{(x_{1,1}, 1), \cdots, (x_{i,k}, k), \cdots\}$ from the $K$ SVF training tasks, where $x_{i,k}$ is the $i$-th example from the $k$-th expert task. Each tuple $(x_{i,k}, k)$ then forms an example to pre-train an additional job classification expert $z^c$, learned in the same fashion as the others. During the first inference pass, we simply load $z^c$, intending to improve the inherent task classification capabilities of the base model to select a more appropriate $z'$ to handle the input prompt.

C) Few-shot adaptation: Our third approach leverages additional task information by assuming extended access to its test-time conditions beyond individual prompts.
Our approach is inspired by popular few-shot prompting techniques, which have been shown to provide consistent performance improvements and even allow LLMs to “in-context” learn tasks that were entirely unseen prior to inference (Brown, 2020). For each optimized $W$, our approach entails producing an entirely new $z' = \sum_{k=1}^{K} \alpha_k z_k$ by linearly interpolating between the $K$ learned SVF vectors, each weighted by the coefficients $\alpha_k$. We employ CEM to search over the possible values of each $\alpha_k$ based on the performance on a set of “few-shot prompts”, which are specifically held out from the rest of the test prompts and used to evaluate CEM's population samples. In the case of multiple population samples obtaining the same score on these held-out prompts, we break ties by favoring the one with the highest average log-likelihood across its own generated correct answers. Crucially, we only need to perform this process once for each target task, avoiding the need to increase the length of each question prompt, a relevant downside of traditional few-shot prompting. We refer to Section A.4 for additional details and an extended discussion of this final approach.

[Figure 4 panels: Math, Coding, Reasoning, Vision Language; x-axis: Epoch; y-axis: Score; curves: Train, Test.]

Figure 4: SVF learning curves. The dashed lines indicate the performance of LLAMA3-8B-INSTRUCT on the test split of each task. SVF effectively fine-tunes to surpass the base performance. While we use the best validation score to select our checkpoint for evaluation (marked by red dots), we present the entire training curve without early stopping to demonstrate SVF's learning capabilities. Tasks with only hundreds of training samples like Coding and Reasoning were stopped early. In our experiments, we update the parameters at the end of each epoch.
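The few-shot strategy can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation described in Section A.4: it uses a generic Gaussian cross-entropy method, and the population size, elite fraction, initial mean and standard deviation, and the `score_fn` callback (which would stand for accuracy on the held-out few-shot prompts) are all hypothetical. The log-likelihood tie-breaking is omitted.

```python
import numpy as np


def mix_experts(alphas, experts):
    # z' = sum_k alpha_k * z_k, with the K expert vectors stacked as rows (K x r).
    return np.asarray(alphas) @ np.asarray(experts)


def adapt_weight(W, z):
    # Rescale the singular values of W by z: W' = U diag(sigma * z) V^T.
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * (sigma * z)) @ Vt


def cem_search(score_fn, num_experts, iters=20, pop=32, elite_frac=0.25, seed=0):
    # Generic cross-entropy method: sample alphas, keep the best-scoring
    # fraction, and refit the sampling distribution to those elites.
    rng = np.random.default_rng(seed)
    mean = np.full(num_experts, 1.0 / num_experts)
    std = np.full(num_experts, 0.5)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, num_experts))
        scores = np.array([score_fn(a) for a in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean


# Toy usage: a stand-in score that peaks at a known alpha.
# A real score_fn would run the adapted LLM on the few-shot prompts.
target = np.array([0.2, 0.3, 0.5])
alpha = cem_search(lambda a: -np.sum((a - target) ** 2), num_experts=3)
```

Because the search runs once per target task, its cost is amortized over all subsequent prompts from that task; the resulting `alpha` would be passed to `mix_experts` and then to `adapt_weight` for each fine-tuned matrix.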
4 EXPERIMENTS

We extensively evaluate Transformer² on multiple tasks and models with the purpose of: (1) assessing the efficiency and effectiveness of SVF; (2) demonstrating self-adaptiveness through the three proposed adaptation strategies; (3) conducting in-depth analysis and ablation studies aimed at understanding and interpreting the properties of our new framework.

4.1 EXPERIMENTAL SETUPS

To validate the generality of Transformer², we consider three pre-trained LLMs ranging across different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained $z$ vectors to maximize performance for GSM8K (Cobbe et al., 2021), MBPP-Pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we also train a set of $z$ vectors for LLAMA3-8B-INSTRUCT, when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF's applicability to the vision-language modeling (VLM) domain. We provide SVF's main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer² adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), Humaneval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019). In all our adaptation experiments, we only consider experts obtained in the pure-language settings, assessing its test-time applicability even for the distinctive vision domain. Please refer to Appendix A for additional details and a summary of the hyper-parameters used in the experiments.

4.2 EXPERIMENTAL RESULTS

SVF performance. We provide results after training on each considered task with the LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT base models in Table 1. Remarkably, we find that SVF provides considerable and consistent performance gains across nearly all tasks and base models.
Instead, LoRA experts yield smaller gains and even sporadic performance degradation. (These LoRA experts are trained with next-token prediction. While we also have LoRA experts trained with RL in Table 4, RL seems to work less well with LoRA than with SVF.)

Table 1: Fine-tuning results. LLM performance on the test splits of math, coding and reasoning. Normalized scores are in the parentheses.

| Method | GSM8K | MBPP-Pro | ARC-Easy |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 75.89 (1.00) | 64.65 (1.00) | 88.59 (1.00) |
| + LoRA | 77.18 (1.02) | 67.68 (1.05) | 88.97 (1.00) |
| + SVF (Ours) | 79.15 (1.04) | 66.67 (1.03) | 89.56 (1.01) |
| MISTRAL-7B-INSTRUCT-V0.3 | 42.83 (1.00) | 49.50 (1.00) | 81.65 (1.00) |
| + LoRA | 44.66 (1.04) | 51.52 (1.04) | 81.19 (0.98) |
| + SVF (Ours) | 49.74 (1.16) | 51.52 (1.04) | 85.14 (1.04) |
| LLAMA3-70B-INSTRUCT | 85.29 (1.00) | 80.81 (1.00) | 89.10 (1.00) |
| + LoRA | 77.26 (0.91) | 68.69 (0.85) | 88.55 (0.99) |
| + SVF (Ours) | 88.32 (1.04) | 80.81 (1.00) | 88.47 (0.99) |

[Figure 5: bar chart over TextVQA and OKVQA; legend: Llama3-8B, LoRA, SVF/Transformer².]

Figure 5: Results for the VLM domain.

Table 2: Self-adaptation on unseen tasks. Normalized scores are in the parentheses.

| Method | MATH | Humaneval | ARC-Challenge |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT³ | 24.54 (1.00) | 60.98 (1.00) | 80.63 (1.00) |
| + LoRA | 24.12 (0.98) | 52.44 (0.86) | 81.06 (1.01) |
| + Transformer² (Prompt) | 25.22 (1.03) | 61.59 (1.01) | 81.74 (1.01) |
| + Transformer² (Cls-expert) | 25.18 (1.03) | 62.80 (1.03) | 81.37 (1.01) |
| + Transformer² (Few-shot) | 25.47 (1.04) | 62.99 (1.03) | 82.61 (1.02) |
| MISTRAL-7B-INSTRUCT-V0.3 | 13.02 (1.00) | 43.29 (1.00) | 71.76 (1.00) |
| + LoRA | 13.16 (1.01) | 37.80 (0.87) | 75.77 (1.06) |
| + Transformer² (Prompt) | 11.86 (0.91) | 43.90 (1.01) | 72.35 (1.01) |
| + Transformer² (Cls-expert) | 11.60 (0.89) | 43.90 (1.01) | 74.83 (1.04) |
| + Transformer² (Few-shot) | 13.39 (1.03) | 47.40 (1.09) | 75.47 (1.05) |
| LLAMA3-70B-INSTRUCT | 40.64 (1.00) | 78.66 (1.00) | 87.63 (1.00) |
| + LoRA | 25.40 (0.62) | 73.78 (0.94) | 83.70 (0.96) |
| + Transformer² (Prompt) | 40.44 (1.00) | 79.88 (1.02) | 88.48 (1.01) |
This observed trend also extends to the vision-language domain, as fine-tuning LLAMA3-LLAVA-NEXT-8B with SVF bolsters the base model's performance by over 39% (see Figure 5). To ensure a fair comparison, we provide extensive ablations of both our model and the LoRA baseline, considering different architectures and optimization objectives, in Section 4.3. Due to its essential parameterization, we would like to note that training SVF requires considerably fewer resources, with less than 10% of the training parameters of our LoRA implementation.

Adaptation performance. With the SVF-trained $z$ vectors, we assess the self-adaptation capability of Transformer² on unseen tasks. For a fair comparison with LoRA, we record the performance of this baseline using all checkpoints from the considered training tasks and report only its highest performance for each of the test tasks. As shown in Table 2, all of our Transformer² adaptation strategies demonstrate improvements across all tasks for the LLAMA3-8B-INSTRUCT base model, and in at least two out of three tasks for both MISTRAL-7B-INSTRUCT-V0.3 and LLAMA3-70B-INSTRUCT. In contrast, even the best training LoRAs only provide marginal improvements on the ARC-Challenge task and still significantly deteriorate performance on both MATH and Humaneval. This discrepancy suggests that LoRA's parameterization and optimization might be particularly sensitive to overfitting, especially when trained with the smaller GSM8K and MBPP-Pro datasets, the tasks that provide information most related to MATH and Humaneval. In Figure 5, we find a similar dichotomy in the OKVQA task, with the performance of the base LLAMA3-LLAVA-NEXT-8B VLM only improving after applying Transformer². We note that also in this setting, Transformer² performs self-adaptation only from the expert vectors from GSM8K, MBPP-Pro, and ARC-Easy.
Thus, this result further underscores the high flexibility of self-adaptation, transferring knowledge compressed for tasks entirely based on language even to unrelated vision-based problems.

Comparing the three proposed adaptation strategies, we highlight a clear monotonic trend: with more involved strategies and additional information about the test-time condition, self-adaptation appears to be increasingly effective. In particular, Transformer² with few-shot self-adaptation is almost always the highest-scoring method, providing notable improvements across all tested settings except for LLAMA3-70B-INSTRUCT on MATH, where we have only SVF-tuned half of the layers due to our limited GPU resources. This trend shows that providing additional or different kinds of information seems to be highly beneficial to our framework, suggesting that Transformer² could provide foundation models with new means to continually improve performance when deployed in lifelong settings.

Table 3: Time cost of 2-pass inference in the prompt adaptation strategy of Transformer² for the entire problem set. 1st to 2nd pass inference time ratios are shown in parentheses.

| Task | 1st (s) | 2nd (s) |
|---|---|---|
| MATH | 42.64 (13%) | 321.19 |
| Humaneval | 2.76 (19%) | 14.28 |
| ARC-Challenge | 13.40 (47%) | 28.51 |

Table 3 reports the inference time required by the prompt adaptation strategy of Transformer², with the time spent on solving the entire problem set presented separately for the 1st and 2nd passes. Notice that the 2nd pass inference time is the time spent on solving the problems, and the 1st pass inference time is the time for self-adaptation; the 1st to 2nd pass inference time ratios are in the parentheses.
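The percentages in Table 3 can be checked with a few lines of arithmetic. The snippet below is only a worked example on the reported timings; the variable and function names are hypothetical.

```python
# (first-pass seconds, second-pass seconds) per task, copied from Table 3.
timings = {
    "MATH": (42.64, 321.19),
    "Humaneval": (2.76, 14.28),
    "ARC-Challenge": (13.40, 28.51),
}


def pass_ratio(first: float, second: float) -> float:
    # First-pass (adaptation) time as a fraction of second-pass (solving) time.
    return first / second


# Ratios in whole percent, rounded as in the table.
ratios = {task: round(100 * pass_ratio(f, s)) for task, (f, s) in timings.items()}

# Total runtime relative to running the second pass alone.
slowdown = {task: (f + s) / s for task, (f, s) in timings.items()}
```

Read as overhead, the first pass adds roughly 13% to the MATH run and 19% to the Humaneval run on top of the second pass, well short of doubling the total.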
While the additional inference pass might appear to double the overall runtime, it is important to note that inference time primarily depends on the number of tokens generated. In our settings, it is $O(n)$, where $n$ is the length of the input. ARC-Challenge's cost ratio is large because its problems are single-choice, and therefore the cost of the 2nd pass is also $O(n)$. In general settings, we think it is reasonable to assume this ratio to be closer to those of MATH and Humaneval. For a detailed discussion on improving the efficiency of CEM few-shot adaptation methods, please see Appendix D.

[Figure 6 values, one confusion matrix per row; each cell lists the fractions predicted as Math, Code, Reasoning.]

| Model and strategy | MATH | Humaneval | ARC-Challenge |
|---|---|---|---|
| Llama3-8B, Prompt Engineering | 1.00, 0.00, 0.00 | 0.04, 0.96, 0.00 | 0.17, 0.00, 0.77 |
| Llama3-8B, Classification Expert | 0.95, 0.00, 0.05 | 0.02, 0.98, 0.00 | 0.03, 0.00, 0.97 |
| Mistral-7B, Prompt Engineering | 0.96, 0.00, 0.00 | 0.00, 0.99, 0.00 | 0.06, 0.00, 0.95 |
| Mistral-7B, Classification Expert | 0.99, 0.00, 0.01 | 0.00, 1.00, 0.00 | 0.04, 0.00, 0.95 |
| Llama3-70B, Prompt Engineering | 1.00, 0.00, 0.00 | 0.01, 0.99, 0.00 | 0.05, 0.00, 0.95 |

Figure 6: Confusion matrices. These matrices display the classification percentages, where rows represent the task classes (ground truth) and columns indicate the predicted categories. Some samples are misclassified as “Others,” which is reflected in rows where the totals do not sum to one.

4.3 ANALYSIS

Lastly, we analyze and discuss the properties of our adaptation strategies, for which we provide extensions and further discussion in Appendix B.

Analysis 1: Job dispatching accuracy. In Figure 6, we provide the confusion matrices of our classification-based adaptation strategies. These results validate the effectiveness of both our classification-based adaptation strategies in matching each prompt with experts trained in similar domains, as evidenced by the high values along the diagonals.
Furthermore, the results from LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 also show that using the classification expert consistently provides higher classification accuracy than vanilla prompt engineering. While this difference could explain the higher performance of the corresponding self-adaptation strategy, we also note that domain similarity might not be the only metric relevant to identifying the best expert for each prompt or task. To this end, we believe many unexplored extensions could be pursued in future work, using heuristics such as past expert performance or token-level analysis to further push our framework's scalability.

Analysis 2: Training tasks adaptation contribution. In Figure 7, we show the normalized adaptive coefficients $\alpha_k$ interpolating between our SVF vectors, learned via CEM for LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 across all the unseen downstream tasks. Intuitively, we find that the expert vectors from the training tasks sharing similar topics to the unseen ones are often the highest contributors to the produced adaptive weights. However, we observe that the MATH task appears as an interesting exception, as the $\alpha_k$ for the expert obtained from GSM8K training is actually the lowest out of the three in both models. We hypothesize this reflects the different nature of the mathematics competition problems from MATH as compared to the grade-school problems in GSM8K. In fact, not only is the difficulty of the MATH questions far beyond GSM8K, but a large portion of its problems also hinges mainly on logical reasoning, for which a task like ARC might actually be more aligned. Furthermore, we also note that the different $z$ vectors appear to contribute more uniformly to adaptation in the Llama model.
This difference might indicate that, due to its higher base performance, the Llama model does not need to rely on any particular set of skills as much as Mistral, and can harness more holistic benefits from self-adaptation. Note that applying $\alpha_k$ uniformly is not a universal solution for leveraging expert vectors. This becomes evident when we look at different model and task combinations (e.g., applying $\alpha_k$ uniformly on LLAMA3-8B-INSTRUCT for MATH tasks only achieves 24.47, while Transformer² (Few-shot) achieves 25.47).

Analysis 3: Ablation studies.

Module sensitivity: We first compare the performance of SVF when it is applied to different modules (see trials 1-3). Under consistent conditions, both individual MLP and attention updates improve performance, with MLP updates resulting in more pronounced gains. Simultaneous updates to both module types yield even more significant enhancements.

Objective function: We are interested in the performance impact from different objective functions, and we compare the RL objective with next-token prediction loss (see trials 2 and 4). For the latter, we use instruction fine-tuning with official GSM8K solutions as target tokens. Results show clear performance gains with RL, demonstrating its effectiveness in task-specific fine-tuning. Conversely, next-token prediction even hinders performance. This highlights RL's ability to handle cases lacking detailed solutions, suggesting its superiority in this context.

SVF vs LoRA: Finally, we also evaluate LoRA using the RL objective (see trials 2 and 5). A significant performance disparity is observed, primarily attributed to the severe instability of the LoRA training process. Despite exploring a wide range of learning rates, LoRA's performance consistently lagged behind. For further illustrations, see Figure 9 in the appendix.

Table 4: Ablation studies. We fine-tune LLAMA3-8B-INSTRUCT on the GSM8K training split with different settings and report the results on the test split along with zero-shot transfer results on MATH.

| # | Method | Objective Function | Module | #Params (↓) | GSM8K (↑) | MATH (↑) |
|---|---|---|---|---|---|---|
| 0 | LLAMA3-8B-INSTRUCT | | | | 75.89 (1.00) | 24.54 (1.00) |
| 1 | SVF | Policy gradient | MLP | 0.39M | 78.62 (1.04) | 24.20 (0.99) |
| 2 | SVF | Policy gradient | attention | 0.16M | 76.19 (1.00) | 24.20 (0.99) |
| 3 | SVF | Policy gradient | MLP + attention | 0.58M | 79.23 (1.04) | 25.04 (1.04) |
| 4 | SVF | Next token pred | attention | 0.16M | 60.50 (0.80) | 18.52 (0.75) |
| 5 | LoRA | Policy gradient | attention | 6.82M | 57.92 (0.76) | 15.72 (0.64) |
| 6 | LoRA | Next token pred | attention | 6.82M | 77.18 (0.98) | 24.12 (0.96) |
| 7 | LoRA | Next token pred | MLP + attention | 35.13M | 75.66 (0.96) | 22.12 (0.91) |

[Figure 7 values: share of each training-task expert in the learned mixture, per unseen task and model.]

| Unseen task | Model | GSM8K | MBPP | ARC-Easy |
|---|---|---|---|---|
| MATH | Llama3-8B | 25.8% | 26.2% | 48.0% |
| MATH | Mistral-7B | 31.1% | 36.2% | 32.8% |
| HumanEval | Llama3-8B | 31.2% | 35.1% | 33.7% |
| HumanEval | Mistral-7B | 33.3% | 64.1% | 2.6% |
| ARC-Challenge | Llama3-8B | 19.3% | 30.0% | 50.7% |
| ARC-Challenge | Mistral-7B | 5.4% | 7.1% | 87.5% |

Figure 7: $\alpha_k$ learned weights.

Analysis 4: Cross-model compatibility. Finally, we explore the potential for our self-adaptation framework to be applied across different LLMs. In particular, we evaluate whether the SVF expert vectors trained on LLAMA3-8B-INSTRUCT can benefit MISTRAL-7B-INSTRUCT-V0.3, and whether we can perform adaptation across the expert vectors of these two models. We present our main findings in Table 5 and refer to Appendix B for additional detailed results. Surprisingly, we find that positive transfer occurs across the two models, with visible benefits in 2 out of 3 tasks. We note these improvements are due to the inherent ordering of the SVF parameterization, as randomly shuffling each SVF vector before applying it to the Mistral model consistently degrades performance. This operation leads to notable performance degradation across each task.
Finally, by performing few-shot adaptation using the SVF vectors collected from both models, the performance of MISTRAL-7B-INSTRUCT-V0.3 further improves across the board. We observe that these gains even surpass the best score from adapting MISTRAL-7B-INSTRUCT-V0.3 with all the SVF vectors in the ARC-Challenge task reported in Table 2. While these results appear promising, we note that the surprising compatibility discovered through our naive transfer approach is potentially tied to the similarity between the architectures of the two considered LLMs. To this end, whether similar transfer can be replicated with models of different scales remains an open research question that could open the doors to disentangling and recycling task-specific skills for newer/larger models, with important implications for democratization and sustainability.

Table 5: Cross-model $z$ vector transfer. Results from transferring the expert vectors trained on LLAMA3-8B-INSTRUCT to MISTRAL-7B-INSTRUCT-V0.3 with cross-model few-shot adaptation.

| Method | MATH | Humaneval | ARC-Challenge |
|---|---|---|---|
| SVF training task | GSM8K | MBPP-Pro | ARC-Easy |
| MISTRAL-7B-INSTRUCT-V0.3 | 13.02 (1.00) | 43.29 (1.00) | 71.76 (1.00) |
| + Llama SVF (ordered $\sigma_i$) | 11.96 (0.92) | 45.12 (1.04) | 72.01 (1.00) |
| + Llama SVF (shuffled $\sigma_i$) | 10.52 (0.81) | 40.24 (0.93) | 70.82 (0.99) |
| + Few-shot adaptation (cross-model) | 12.65 (0.97) | 46.75 (1.08) | 75.64 (1.05) |

5 CONCLUSION

In this paper, we introduced Transformer², providing a novel blueprint toward realizing self-adaptive LLMs. Within this framework, we first proposed SVF, offering superior performance to prior fine-tuning recipes, together with reduced costs, high compositionality, and overfitting regularization, all crucial properties to achieve scalable self-adaptation. Leveraging a set of SVF experts as building blocks, we developed three effective strategies for self-adaptation, each offering unique benefits and monotonic performance benefits with increasing access to the test-time conditions.

While Transformer² demonstrates promising results, there remain exciting opportunities for future work. One limitation is that the capabilities of SVF experts are tied to the latent components of the base model. To address this, model merging offers a promising direction (Yu et al., 2024; Goddard et al., 2024; Akiba et al., 2024), enabling specialized models to be combined into a single, more capable model. Additionally, while our CEM-based adaptation effectively balances performance and efficiency, scaling to a large number of specialized domains may introduce increased one-time computational costs. However, this trade-off is offset by the benefits of improved performance and enhanced self-adaptation capabilities. Advances in model merging and efficient adaptation techniques have produced models dominating open leaderboards, making them strong candidates as base models for Transformer² and opening new possibilities for adaptive LLMs.

REFERENCES

Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv preprint arXiv:2405.17604, 2024.

Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Alberto Cetoli. Fine-tuning llms with singular value decomposition. Hugging Face Blog, June 2024. URL https://huggingface.co/blog/fractalego/svd-training.
Accessed: 2024-07-01.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Elizabeth N Davison, Kimberly J Schlesinger, Danielle S Bassett, Mary-Ellen Lynall, Michael B Miller, Scott T Grafton, and Jean M Carlson. Brain network adaptability across task states. PLoS computational biology, 11(1):e1004029, 2015.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee's mergekit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024.

Faustino Gomez and Jürgen Schmidhuber. Evolving modular fast-weight networks for control. In International Conference on Artificial Neural Networks, pp. 383–389. Springer, 2005.

David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=rkpACe1lx.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. A modern self-referential weight matrix that learns to modify itself. In International Conference on Machine Learning, pp. 9660–9677. PMLR, 2022.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, and Alan Ritter. Self-moe: Towards compositional large language models with self-specialized experts. arXiv preprint arXiv:2406.12034, 2024.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Prakhar Kaushik, Ankit Vaidya, Alan Yuille, et al. EigenloRA: Recycle trained adapters for resource efficient adaptation and inference, 2025. URL https://openreview.net/forum?id=KxGGZag9gW.
Verena Klös, Thomas Göthel, and Sabine Glesner. Adaptive knowledge bases in self-adaptive system design. In 2015 41st Euromicro Conference on Software Engineering and Advanced Applications, pp. 472–478, 2015. doi: 10.1109/SEAA.2015.48.
Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki Markus Asano. Vera: Vector-based random matrix adaptation.
arXiv preprint arXiv:2310.11454, 2023.
Jan Koutnik, Faustino Gomez, and Jürgen Schmidhuber. Evolving neural networks in compressed weight space. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pp. 619–626, 2010.
Vijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, and Sujay Sanghavi. Svft: Parameter-efficient fine-tuning with singular vectors. arXiv preprint arXiv:2405.19597, 2024.
Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
Lasse S Loose, David Wisniewski, Marco Rusconi, Thomas Goschke, and John-Dylan Haynes. Switch-independent task representations in frontal and parietal cortex. Journal of Neuroscience, 37(33):8033–8042, 2017.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 3195–3204, 2019.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, and Sanjeev Arora. Trainable transformer in transformer. arXiv preprint arXiv:2307.01189, 2023.
Qwen Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, March 2024.
URL https://qwenlm.github.io/blog/qwen-moe/. Blog post.
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pp. 18332–18346. PMLR, 2022.
Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133. Springer, 2004.
Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
Jürgen Schmidhuber. A ‘self-referential’ weight matrix. In ICANN’93: Proceedings of the International Conference on Artificial Neural Networks Amsterdam, The Netherlands 13–16 September 1993 3, pp. 446–450. Springer, 1993.
Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.
Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558, 2023.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019.
Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.
Chen Tianlong, Cheng Yu, Chen Beidi, Zhang Minjia, and Bansal Mohit. Mixture-of-experts in the era of llms: A new odyssey. ICML 2024 presentation slides, 2024.
International Conference on Machine Learning (ICML).
Hanqing Wang, Zeguan Xiao, Yixia Li, Shuo Wang, Guanhua Chen, and Yun Chen. Milora: Harnessing minor singular components for parameter-efficient llm finetuning. arXiv preprint arXiv:2406.09044, 2024.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024.
Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. Proagent: building proactive cooperative agents with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 17591–17599, 2024.
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.
Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. Llama-moe: Building mixture-of-experts from llama with continual pre-training. arXiv preprint arXiv:2406.16554, 2024.
Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023.

A IMPLEMENTATION DETAILS AND HYPER-PARAMETERS

A.1 SVF TRAINING

We obtain the expert vectors $z$ as the base components in Transformer² by training the SVF fine-tunes with a consistent recipe across the considered training tasks and language models.
We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing $\theta_z$ with AdamW using a learning rate of $2 \times 10^{-3}$ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best $\lambda$ (the coefficient of the KL divergence term) based on validation performance. For the LLAMA3-70B-INSTRUCT and vision-task experiments, we apply SVF to half of the layers to reduce memory usage while maintaining a considerable performance improvement. During the training of LLAMA3-8B-INSTRUCT on the vision-language tasks, we apply a small negative reward (-0.1) for training stability.

A.2 LORA TRAINING

We follow community best practices for LoRA fine-tuning, applying it to the query and value projection layers with learning rates around $5 \times 10^{-5}$. We train for 200 total iterations with a global batch size of 256 to ensure sufficient training. To make LoRA instruction training feasible, we collect solutions for all training tasks (GSM8K, MBPP, ARC-Easy, TextVQA) from official sources and append them to the question prompts. Figure 8 shows a sample math problem used for LoRA fine-tuning. Despite extensive hyperparameter tuning, we often observe degraded test performance, as discussed in the main text. This can be attributed to the small number of training samples and to the model potentially requiring instruction fine-tuning data that contains a highly detailed thinking process.

Figure 8: Sample problem and answer. Math data sample used for LoRA instruction fine-tuning; text in blue is the unmasked solution.

    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
    Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
    #### 72
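The phrase "unmasked solution" in the Figure 8 caption suggests that the loss is computed only on the solution tokens, with the prompt tokens masked out. A minimal sketch of that label construction follows. It assumes already-tokenized integer IDs and uses -100 as the masking value, which is the default `ignore_index` of PyTorch's cross-entropy loss; the function name `build_labels` and the example token IDs are hypothetical, and this is not taken from the authors' code.

```python
# Sketch of prompt masking for instruction fine-tuning (assumed setup,
# not the authors' implementation).

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, solution_ids):
    """Concatenate prompt and solution; mask the prompt in the labels.

    Positions labeled IGNORE_INDEX contribute nothing to the loss, so
    only the solution tokens are trained on.
    """
    input_ids = list(prompt_ids) + list(solution_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(solution_ids)
    return input_ids, labels


if __name__ == "__main__":
    # Hypothetical token IDs standing in for a tokenized question/answer.
    ids, labels = build_labels([11, 12, 13], [21, 22])
    print(ids)
    print(labels)
```

Keeping the inputs and labels the same length means the pair can be passed to a standard causal language-modeling loss without further alignment.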
A.3 HYPERPARAMETERS

We present a summary of the hyperparameters used in our experiments in Table 6. To optimize performance, we conducted sweeps across several hyperparameters and selected the most effective combination based on validation results. For SVF, our primary focus was on adjusting the KL coefficient to enhance training stability. For LoRA, we concentrated on sweeping the learning rate and maximum gradient clip norm to identify optimal settings.

A.4 FEW-SHOT ADAPTATION

As described in the main text, our few-shot adaptation approach entails producing an entirely new $z' = \sum_{k=1}^{K} \alpha_k z_k$ for each $W$ by linearly interpolating between the $K$ learned SVF vectors, each weighted by the coefficients $\alpha \in \mathbb{R}^{K}$. We employ CEM to search for the $\alpha_k$'s based on the performance on the few-shot prompts, which are specifically held out from the rest of the test prompts and used to obtain the elite set at each iteration. When multiple sample solutions obtain the same score on these held-out samples, we break ties by choosing the sample solution with the highest average log-likelihood across the tokens of its generated correct answers. In all of our main experiments, we reserve only 10 samples of data for self-adaptation and perform up to 100 CEM iterations. For each setting, we consider both per-layer and per-vector adaptation, where the latter strategy has the advantage of greatly simplifying search (as we only have 3 $\alpha$ coefficients). Moreover, we experiment with both normalizing across the $\alpha$ of different tasks (such that their sum would be fixed to 1) and keeping them unconstrained. Due to the lack of a validation set, we simply report the performance attained by our best sample from these test configurations at the end of optimization, on the remaining unseen samples for each task.

Table 6: Hyper-parameters used for SVF and LoRA training.
We perform a sweep on certain sensitive hyper-parameters across methods for fair comparison.

| Hyperparameter | Value |
|---|---|
| SVF Hyperparameters | |
| Initial mean value of $z$ | 0.1 |
| Initial variance value of $z$ | $1 \times 10^{-3}$ |
| Global batch size | 256 |
| Learning rate | $2 \times 10^{-3}$ |
| Clip max norm | $1 \times 10^{-3}$ |
| KL coefficient $\lambda$ | 0.0, 0.1, 0.2, 0.3 |
| LoRA Hyperparameters | |
| Rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Global batch size | 256 |
| Learning rate | $2 \times 10^{-4}$, $5 \times 10^{-4}$, $2 \times 10^{-5}$, $5 \times 10^{-5}$, $2 \times 10^{-6}$, $5 \times 10^{-6}$ |
| Clip max norm | $1 \times 10^{-3}$, 1.0 |

Table 7: Additional Comparison Experiment. Normalized scores are in the parentheses.

| Method | GSM8K | MBPP-Pro | ARC-Easy |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 75.89 (1.00) | 64.65 (1.00) | 88.59 (1.00) |
| + IA3 | 78.01 (1.03) | 67.68 (1.05) | 89.10 (1.01) |
| + DORA | 78.09 (1.03) | 64.65 (1.00) | 89.14 (1.01) |
| + SVF (Ours) | 79.15 (1.04) | 66.67 (1.03) | 89.56 (1.01) |

| Method | MATH | Humaneval | ARC-Challenge |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 24.54 (1.00) | 60.98 (1.00) | 80.63 (1.00) |
| + IA3 | 23.64 (0.96) | 59.76 (0.98) | 81.57 (1.01) |
| + DORA | 24.44 (0.99) | 52.44 (0.86) | 81.14 (1.01) |
| + Transformer² (Prompt) | 25.22 (1.03) | 61.59 (1.01) | 81.74 (1.01) |
| + Transformer² (Cls-expert) | 25.18 (1.03) | 62.80 (1.03) | 81.37 (1.01) |
| + Transformer² (Few-shot) | 25.47 (1.04) | 62.99 (1.03) | 82.61 (1.02) |

B ADDITIONAL RESULTS

B.1 BASELINE COMPARISON TO MORE PEFT METHODS

We conduct additional comparison studies against more parameter-efficient fine-tuning methods, including IA3 (Liu et al., 2022) and DORA (Liu et al., 2024).

As Table 7 shows, SVF outperforms the other methods on two of the three training tasks (IA3 scores higher on MBPP-Pro) and shows promising generalization, with the few-shot Transformer² variant achieving the best result on every test task.

B.2 IMPACT FROM NUMBER OF FEW-SHOTS

Table 8: Few-shot adaptation scaling on the ARC-Challenge task. Performance varies with number of examples.
| Method | Transformer² | IA3 (100 steps) | IA3 (1000 steps) |
|---|---|---|---|
| LLAMA3-8B-INSTRUCT | 80.63 (1.00) | 80.63 (1.00) | 80.63 (1.00) |
| + 3-shot adaptation | 82.18 (1.02) | 81.83 (1.01) | 79.01 (0.98) |
| + 5-shot adaptation | 82.38 (1.02) | 80.89 (1.00) | 79.41 (0.98) |
| + 10-shot adaptation | 82.61 (1.02) | 82.00 (1.02) | 79.78 (0.99) |
| + 20-shot adaptation | 82.61 (1.02) | 81.40 (1.01) | 79.61 (0.99) |

We investigate the relationship between the number of samples available for few-shot adaptation and downstream performance. Our analysis focused on the test task where LLAMA3-8B-INSTRUCT demonstrates the highest baseline performance, to prevent the potential for a null signal in our CEM-based search.

As Table 8 shows, substantial benefits of our few-shot strategy are evident with as few as 3 to 5 test samples. Moreover, performance appears to plateau beyond 10 samples, underscoring how our essential and inherently regularized SVF parameterization effectively complements self-adaptation. This efficiency enables optimal use of data to enhance understanding of the test task.

For completeness, we have also conducted experiments with identical settings on IA3 (Liu et al., 2022), another method that leverages few-shot examples. All experiments were conducted with full batch size and a learning rate of $5 \times 10^{-5}$, for 100 and 1000 training steps. Our results indicate that the performance of IA3 on the unseen test tasks is inferior to CEM-based adaptation for all numbers of few shots considered. We note that in our experiment, we have to considerably limit the number of optimization steps to avoid overfitting the 500,000 parameters of IA3 on the few-shot samples. However, we believe overfitting might still be occurring to some degree even after only 100 steps, as also validated by the model’s perfect training accuracy on this extremely small dataset.
This limitation of fine-tuning-based adaptation highlights the superior generalization capability of our CEM-based adaptation approach in Transformer².

B.3 CROSS-MODEL SVF TRANSFER ON THE TRAINING TASKS

We provide complementary results to Table 5 in the main text, where we analyze the SVF cross-model transfer performance from training on GSM8K, MBPP-pro, and ARC-Easy to our considered test tasks. In Table 9, we show the results in the same transfer setting, this time evaluating MISTRAL-7B-INSTRUCT-V0.3 on the same training tasks that the LLAMA3-8B-INSTRUCT SVF vectors were obtained from. Overall, we recognize a similar trend, albeit with less consistent improvement over the original model (only in 1 out of 3 tasks), but still much higher performance than the randomly shuffled baseline. These results further confirm that the canonical ordering of the SVF parameterization is key for cross-model transfer, highlighting once more its inherent suitability to empower self-adaptation.

Table 9: Cross-model $z$ Vector Transfer. Results from transferring the SVF expert vectors trained on LLAMA3-8B-INSTRUCT to MISTRAL-7B-INSTRUCT-V0.3 in the respective training tasks.

| Method | GSM8K | MBPP-pro | ARC-Easy |
|---|---|---|---|
| MISTRAL-7B-INSTRUCT-V0.3 | 42.83 (1.00) | 49.50 (1.00) | 81.65 (1.00) |
| + Llama SVF (ordered $\sigma_i$) | 42.61 (0.99) | 48.48 (0.98) | 81.78 (1.00) |
| + Llama SVF (shuffled $\sigma_i$) | 41.93 (0.98) | 46.34 (0.94) | 80.81 (0.99) |

B.4 TRAINING CURVE OF LORA AND POLICY GRADIENT

Figure 9 gives the learning curves for LoRA training on the GSM8K task.

[Figure 9 plot: "Learning Curve on GSM8K with LoRA and Policy gradient"; x-axis: Iterations (0 to 300); y-axis: Score (0.55 to 0.80); series: Train Accuracy, Test Accuracy, Base Model Performance.]

Figure 9: Training LoRA with policy gradient. The dashed line shows the performance of LLAMA3-8B-INSTRUCT on the test split. LoRA collapses at the beginning of the training stage and fails to recover, leading to negative effects on test performance.
We swept a wide range of learning rates ($2 \times 10^{-4}$, $5 \times 10^{-4}$, ..., $2 \times 10^{-2}$, $5 \times 10^{-2}$), and all learning curves were similar to the one presented.

C PCA ON LLAMA3 AND MISTRAL

To investigate whether the singular components with the highest singular values capture most of the information in a weight matrix, we conducted Principal Component Analysis (PCA) on the weight matrices in LLAMA3-8B-INSTRUCT and MISTRAL-7B-INSTRUCT-V0.3 (see Figures 10 and 11). In each figure, we plot the variance captured by the top $r$ components across all layers in each type of module for a weight matrix $W \in \mathbb{R}^{m \times n}$:

$$\text{ratio} = \frac{\sum_{i=1}^{r} \sigma_i}{\sum_{j=1}^{\min(m,n)} \sigma_j}$$

Here, the $\sigma$'s are the singular values of the weight matrix $W$, ordered from largest to smallest. The figures show that when $r = 256$, less than 50% of the variance is captured by these top components on average. For the MLP layers, this fraction is even lower than 20%. In contrast, the ranks adopted by LoRA-XS and similar methods are much lower than 256, resulting in even more information loss and restricting their modeling power, which relies mostly on these $r$ components.

[Figure 10 plot: one panel per module type (q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj); curves for r=16, r=64, r=256; x-axis: layer index (0 to 31); y-axis: ratio of variance captured.]

Figure 10: PCA of LLAMA3-8B-INSTRUCT. We show the ratio of the variance captured by the top $r$ singular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, small $r$ values only capture a tiny fraction of variance in singular values in the parameter matrices.

D EFFICIENCY CONSIDERATIONS AND IMPROVEMENTS

Table 10: 3-shot and light variants. Performance with different inference-time adaptation budgets.
| Method | ARC-Challenge |
|---|---|
| LLAMA3-8B-INSTRUCT | 80.63 (1.00) |
| + CEM 10-shot adaptation | 82.61 (1.02) |
| + CEM 3-shot (30% of prompts) | 82.18 (1.02) |
| + CEM light (3% of prompts) | 82.08 (1.02) |

Our CEM-based adaptation method involves running inference on a small number of samples for each target task (up to 10 in our experiments). In a typical configuration, this process is relatively efficient: for example, our CEM-light approach (3-shot with 10 generations) completes the ARC-Challenge task in approximately 11 minutes. As shown in Table 10, this lighter setup reduces the total number of samples to just 3% of the original setting while still delivering substantial performance improvements over the base model.

[Figure 11 plot: one panel per module type (q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj); curves for r=16, r=64, r=256; x-axis: layer index (0 to 31); y-axis: ratio of variance captured.]

Figure 11: PCA of MISTRAL-7B-INSTRUCT-V0.3. We show the ratio of the variance captured by the top $r$ singular components on the y-axis, and the layer indices on the x-axis. Except for the Query, Key and Value projection matrices, small $r$ values only capture a tiny fraction of variance in singular values in the parameter matrices.

We acknowledge that CEM-based adaptation entails a trade-off between performance and the one-time overhead spent searching for the optimal combination weights for the SVF-tuned vectors. Increasing the number of few-shot samples or the number of generations can yield higher performance, but this comes at the cost of additional computational overhead. However, it is important to note that this adaptation cost is a one-time overhead per task. The cost per prompt diminishes significantly when applied to tasks with a large number of prompts.
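The amortization argument can be written out explicitly. As a rough sketch, let $C_{\text{adapt}}$ be the one-time cost of the CEM search for a task, $c$ the cost of answering one prompt with the adapted model, and $N$ the number of prompts in the task. These symbols are illustrative and do not appear in the paper.

```latex
\text{cost per prompt}(N) \;=\; c + \frac{C_{\text{adapt}}}{N},
\qquad
\frac{C_{\text{adapt}}}{N} \to 0 \quad \text{as } N \to \infty .
```

For instance, if one reads the roughly 11 minutes quoted above as the adaptation time ($C_{\text{adapt}} \approx 660$ s), a hypothetical task with $N = 1000$ prompts would carry an overhead of about $660 / 1000 = 0.66$ s per prompt, whereas a task with only $N = 10$ prompts would carry $66$ s per prompt.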
Moreover, in practical scenarios, CEM-based adaptation offers better scalability than few-shot prompting methods, which require increasing the length of every prompt, leading to much worse scaling as task sizes grow. In contrast, our method focuses on determining optimal expert vector combinations efficiently and avoids repetitive inference-time costs. However, we note that the overhead might be significant for tasks with very few prompts. Thus, the other adaptation methods might be more appropriate for these particular settings.

We also highlight two immediate directions for improving efficiency:

1. Reducing the number of few-shot samples: As shown in our ablation study in Appendix B.2, substantial benefits can be seen even in the 3-shot setting, which requires evaluating only 30% of the number of prompts per generation.
2. Reducing the maximum number of generations: In the explored settings, the CEM parameters tend to converge early on, being very close to their final values after far fewer than 100 generations.

Finally, in this work we only considered CEM due to its simplicity. Several other evolutionary algorithms have empirically shown better efficiency and convergence properties, and we hope they will be explored in future research.
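To make the search procedure discussed in this appendix concrete, the following is a minimal sketch of a cross-entropy method over the combination weights $\alpha$, together with the linear mix $z' = \sum_k \alpha_k z_k$. It is not the authors' implementation: the diagonal Gaussian sampling distribution, the population size, the elite fraction, the initial mean and spread, and the names `mix_experts`, `cem_search`, and `score_fn` are all assumptions made for illustration. In practice `score_fn` would evaluate the adapted model on the held-out few-shot prompts; here it is left as a caller-supplied function, and the log-likelihood tie-breaking and optional normalization of $\alpha$ described in A.4 are omitted.

```python
import random
import statistics


def mix_experts(alpha, experts):
    """Return z' = sum_k alpha_k * z_k for equal-length expert vectors."""
    dim = len(experts[0])
    return [sum(a * z[i] for a, z in zip(alpha, experts)) for i in range(dim)]


def cem_search(score_fn, dim=3, iterations=20, population=16,
               elite_frac=0.25, seed=0):
    """Cross-entropy method over alpha; all defaults are illustrative."""
    rng = random.Random(seed)
    mean = [1.0 / dim] * dim          # assumed start: uniform mixture
    std = [0.5] * dim                 # assumed initial spread
    n_elite = max(2, int(population * elite_frac))
    best_alpha, best_score = list(mean), score_fn(mean)

    for _ in range(iterations):
        # Sample candidate weight vectors from a diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Score each candidate once, then keep the top-scoring elite set.
        scored = sorted(((score_fn(a), a) for a in samples),
                        key=lambda pair: pair[0], reverse=True)
        elites = [a for _, a in scored[:n_elite]]
        if scored[0][0] > best_score:
            best_score, best_alpha = scored[0]
        # Refit the sampling distribution to the elites.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]

    return best_alpha, best_score


if __name__ == "__main__":
    # Toy stand-in for few-shot accuracy: closeness to a made-up optimum.
    target = [0.2, 0.5, 0.3]

    def toy_score(alpha):
        return -sum((a - t) ** 2 for a, t in zip(alpha, target))

    alpha, score = cem_search(toy_score)
    print(alpha, score)
```

With only three coefficients, as in the per-vector setting, each generation costs one evaluation of the few-shot prompts per candidate, which is why reducing either the number of prompts or the number of generations directly reduces the one-time overhead.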

---