arxiv

Paper 2503.10617

Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models

Authors: Andy Zhou

Published: 2025-03-13

Abstract:

Adapting large language models to multiple tasks can cause cross-skill interference, where improvements for one skill degrade another. While methods such as LoRA impose orthogonality constraints at the weight level, they do not fully address interference in hidden-state representations. We propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based approach that learns multiple orthonormal subspace transformations, each specializing in a distinct skill, and composes them via a lightweight router. By isolating these subspace edits in the hidden state, rather than weight matrices, CS-ReFT prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring only 0.0098% of model parameters. These findings show that specialized representation edits, composed via a simple router, significantly enhance multi-task instruction following with minimal overhead.

Paper Content:
Zochi Generated Preprint.

Andy Zhou, Intology AI

1 INTRODUCTION

Large language models (LLMs) have become central to a wide range of NLP applications, yet adapting them to new tasks can be computationally expensive, often requiring hundreds of GPU hours and significant memory overhead. Parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024) tackle this challenge by updating only a small fraction of model parameters, typically 0.1–1% of the total. While this approach has enabled more practical deployment of adapted models, with methods like LoRA (Hu et al., 2021) reducing parameter counts by 1000x, current PEFT techniques still focus primarily on weight-based updates.
In contrast, representation editing methods like ReFT (Wu et al., 2024b) directly modify hidden states, achieving even lower parameter overhead; however, most have used a single global edit that struggles to handle multiple skills without interference.

A core problem in multi-task adaptation is cross-task interference, wherein changes aimed at improving one task degrade performance on another (Pfeiffer et al., 2023). Although recent LoRA variants impose orthogonality constraints to reduce conflicts (Wang et al., 2023; Hsu et al., 2024), none have extended these ideas to representation-based fine-tuning, where orthonormal subspaces can isolate skills more effectively at the hidden-state level. To address this gap, we propose Compositional Subspace Representation Fine-tuning (CS-ReFT), a framework that extends ReFT with multiple orthonormal subspace edits and a lightweight router for dynamic composition. Our contributions include:

• We learn separate low-rank transformations for each skill implicitly, preventing conflicts across tasks while requiring only 0.0098% of model parameters, a 12.7x reduction compared to LoRA. A small gating network is trained to selectively activate relevant subspaces for each input.
• By applying orthonormal constraints directly in hidden-state space, CS-ReFT isolates skills more cleanly than weight-based orthogonal LoRA methods.
• CS-ReFT attains a 93.94% win rate on AlpacaEval with Llama-2-7B, significantly outperforming both larger models (GPT-3.5-Turbo) and parameter-efficient baselines (LoReFT at 85.60%).

arXiv:2503.10617v1 [cs.CL] 13 Mar 2025

Figure 1: Illustration of CS-ReFT. (1) The left panel shows how Compositional Subspace Representation Fine-Tuning (CS-ReFT) applies specialized subspace transformations ($\Phi_1, \Phi_2, \Phi_3$) at specific positions in different layers to adapt a frozen model for multiple tasks.
Each subspace edit is task-specific, reducing interference while allowing composition when needed. (2) The right panel details the routing mechanism: a lightweight router determines which subspaces to activate based on the input, ensuring efficient and targeted modifications.

2 RELATED WORK

Parameter-efficient adaptation. Recent years have seen rapid progress in parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024; Lialin et al., 2023; Li & Liang, 2021). Low-rank adaptation approaches like LoRA (Hu et al., 2021) decompose weight updates into low-rank matrices $UV^{\top}$, typically achieving 1000x parameter reduction while maintaining performance. Other methods like BitFit (Ben-Zaken et al., 2021) modify only bias terms. Representation-based methods (Wu et al., 2024b; Kong et al., 2024; Zou et al., 2023) instead edit model activations. These methods have made LLM adaptation more practical and achieve lower parameter overhead than weight-based updates, but they use a single global edit function, which limits their effectiveness for multi-task adaptation.

Multi-task learning. Multi-task adaptation strategies span several approaches, from shared-parameter methods (Hu et al., 2021; Liu & Luo, 2024) that risk interference, to task-specific modules (Chronopoulou et al., 2022; Yang et al., 2024) requiring separate adapters, to dynamic routing systems (Araujo et al., 2024; Zhang et al., 2024) that often introduce significant overhead. Some recent work combines orthogonality constraints with multi-task setups (Hsu et al., 2024; Liu et al., 2023; Wang et al., 2023), but again these rely on weight-based modules inserted inside Transformer layers. By contrast, our method applies orthonormal constraints at the representation level, learning disjoint subspaces in the hidden state and dynamically routing between them. This design reduces cross-task interference compared to sharing a single low-rank factorization for all tasks.
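The parameter-reduction figures quoted so far can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the rank of 2 is an assumed value chosen to show one way a reduction on the order of 1000x can arise for a single square weight matrix, and 4096 is the hidden size of Llama-2-7B. The two percentages are the trainable-parameter fractions reported in Table 1.

```python
# Back-of-the-envelope parameter arithmetic (illustrative, not the authors' accounting).

def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in a full d_out x d_in weight update."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank update U V^T with U: d_out x rank, V: d_in x rank."""
    return rank * (d_in + d_out)


D_MODEL = 4096  # hidden size of Llama-2-7B
RANK = 2        # assumed rank, chosen only for illustration

# Reduction for one square D_MODEL x D_MODEL matrix.
per_matrix_reduction = dense_params(D_MODEL, D_MODEL) / lora_params(D_MODEL, D_MODEL, RANK)

# Trainable-parameter percentages as reported in Table 1.
LORA_PE = 0.1245
CS_REFT_PE = 0.0098
lora_to_cs_reft = LORA_PE / CS_REFT_PE

print(f"rank-{RANK} LoRA on one {D_MODEL}x{D_MODEL} matrix: {per_matrix_reduction:.0f}x fewer parameters")
print(f"LoRA vs CS-ReFT trainable parameters: {lora_to_cs_reft:.1f}x")
```

Under these assumptions the per-matrix reduction is 1024x, and the ratio of the two reported percentages reproduces the 12.7x figure stated in the introduction.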
3 METHOD

Compositional Subspace Representation Fine-tuning (CS-ReFT) learns multiple low-rank subspace interventions and a router to activate them on a per-input basis, addressing cross-task interference by dedicating separate subspaces to each skill. We selectively compose these subspaces at inference.

Let $M$ be a frozen pretrained model (e.g., a Transformer) of hidden dimension $d$. For each sequence of $n$ tokens $x = (x_1, \dots, x_n)$, the model produces hidden states $\{h^{(j)}_1, \dots, h^{(j)}_n\}$ in layer $j$. Our goal is to adapt $M$ to a set of $k$ tasks $\{T_1, \dots, T_k\}$ without modifying the original weights. Instead, we learn: (1) a collection of low-rank subspace transformations $\{\Phi_1, \dots, \Phi_k\}$, one per task, and (2) a router $R$ that decides which subset of $\{\Phi_i\}$ to activate given an input. Our design ensures that each task $T_i$ has a dedicated subspace edit $\Phi_i$, which prevents direct interference, yet also enables composition for inputs requiring multiple skills. In practice, the tasks are either defined manually by partitioning the data or learned implicitly during training.

3.1 SUBSPACE REPRESENTATION EDITING

Following ReFT (Wu et al., 2024b), each subspace intervention $\Phi$ modifies a hidden vector $h \in \mathbb{R}^d$ by editing only an $r$-dimensional subspace spanned by the rows of $R$. Concretely, we let

$$\Phi(h) = h + R^{\top}\bigl(\underbrace{W h + b}_{\text{desired subspace coords}} - R h\bigr),$$

where $R \in \mathbb{R}^{r \times d}$ is typically constrained to have orthonormal rows ($R R^{\top} = I_r$), and $W \in \mathbb{R}^{r \times d}$, $b \in \mathbb{R}^r$ are trainable parameters. In CS-ReFT, we have $\Phi_1, \dots, \Phi_k$, one per task, each with its own low-rank parameters $\{R_i, W_i, b_i\}$. Because an input requiring task $i$ can be edited by $\Phi_i$ without altering any other subspace, the learned directions are separated in hidden space, which mitigates interference across tasks.

3.2 ROUTER MECHANISM

Not every input belongs to a single task, nor do we want to dedicate a distinct subspace for every fine-grained skill.
Hence, we introduce a router that selects or composes the relevant subspaces at inference time. For example, an instruction might require both $\Phi_2$ (arithmetic) and $\Phi_3$ (sentiment analysis). We define a small routing network

$$\mathrm{Router}(x) = \alpha \in [0,1]^k,$$

which maps an embedding of the input (e.g., the first token's hidden state) to a gating vector $\alpha$. We then compose the subspace edits as:

$$h' = h + \sum_{i=1}^{k} \alpha_i \Bigl[ R_i^{\top}\bigl(W_i h + b_i - R_i h\bigr) \Bigr].$$

If $\alpha_i$ is discrete (e.g., thresholded), then each $\Phi_i$ is either on or off. Alternatively, we can keep $\alpha_i \in [0,1]$ for soft gating. In either case, the parameter overhead from the router is minimal, allowing dynamic composition without losing efficiency. Crucially, this router is jointly trained alongside the subspaces themselves. As a result, the model can implicitly discover how to route different inputs to different subspaces without any manual task partitioning.

3.3 TRAINING OBJECTIVE

We train CS-ReFT by minimizing:

$$\mathcal{L} = \sum_{i=1}^{k} \mathbb{E}_{(x,y)\sim T_i}\Bigl[\ell\bigl(M(x; \{\Phi_i\}, R),\, y\bigr)\Bigr] + \lambda\,\Omega(\alpha),$$

where $\ell(\cdot)$ is a task loss (e.g., cross-entropy), $R$ inside $M(\cdot)$ denotes the router (not the projection matrices $R_i$), and $\Omega(\alpha)$ can be a sparsity regularizer on the router outputs to encourage minimal subspace usage. In practice, we update only $\{\Phi_1, \dots, \Phi_k\}$ and the router's parameters while leaving all original model weights frozen. This design prevents cross-task interference by activating only relevant subspaces on each input, and the low-rank structure keeps parameter overhead minimal.

In summary, CS-ReFT provides three benefits. It prevents cross-task interference by keeping each skill's subspace disjoint so that changes to $\Phi_i$ do not overwrite $\Phi_j$. It also fosters compositional synergy, as the router composes subspaces on demand to enable multi-skill prompts. Finally, it ensures extreme parameter savings because each subspace $\Phi_i$ remains low-rank and the router is tiny, resulting in significantly fewer parameters than typical multi-head adapters.
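The intervention of Section 3.1 and the routed composition of Section 3.2 can be sketched numerically as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the sizes are toy values, the parameters are random rather than trained, the two-layer gate mirrors the router shape described in the experimental setup, and the L1 penalty is only one possible choice for $\Omega(\alpha)$. The two assertions at the end check a property that follows from $R R^{\top} = I_r$: applying $\Phi$ sets the subspace coordinates $R\,\Phi(h)$ to $W h + b$ and leaves the component of $h$ outside the row space of $R$ unchanged.

```python
import numpy as np


def orthonormal_rows(r, d, rng):
    """Return an r x d matrix R with orthonormal rows, so that R @ R.T == I_r."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # q: d x r, orthonormal columns
    return q.T


def edit_term(h, R, W, b):
    """Edit R^T (W h + b - R h) from Section 3.1, without the residual h."""
    return R.T @ (W @ h + b - R @ h)


def router(x, W1, b1, W2, b2, threshold=None):
    """Two-layer gate: ReLU hidden layer, sigmoid outputs in [0, 1]^k.

    If `threshold` is given, the gates are binarised (hard gating).
    """
    hidden = np.maximum(W1 @ x + b1, 0.0)
    alpha = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))
    if threshold is not None:
        alpha = (alpha > threshold).astype(float)
    return alpha


def compose(h, alpha, subspaces):
    """h' = h + sum_i alpha_i * R_i^T (W_i h + b_i - R_i h)."""
    return h + sum(a * edit_term(h, R, W, b) for a, (R, W, b) in zip(alpha, subspaces))


def sparsity_penalty(alpha):
    """Assumed form of Omega(alpha): the L1 norm of the gate vector."""
    return float(np.abs(alpha).sum())


# Toy sizes and random (untrained) parameters, chosen only for illustration.
d, r, k = 16, 2, 4
rng = np.random.default_rng(0)
subspaces = [
    (orthonormal_rows(r, d, rng), rng.standard_normal((r, d)), rng.standard_normal(r))
    for _ in range(k)
]
W1, b1 = rng.standard_normal((d // 2, d)), np.zeros(d // 2)
W2, b2 = rng.standard_normal((k, d // 2)), np.zeros(k)

h = rng.standard_normal(d)  # hidden state to edit
x = rng.standard_normal(d)  # router input, e.g. the first token's hidden state
alpha = router(x, W1, b1, W2, b2, threshold=0.5)
h_routed = compose(h, alpha, subspaces)

# With only subspace 0 active, its coordinates become W_0 h + b_0 ...
R0, W0, b0 = subspaces[0]
h_single = compose(h, np.eye(k)[0], subspaces)
assert np.allclose(R0 @ h_single, W0 @ h + b0)
# ... while the component of h outside the row space of R_0 is untouched.
P0 = R0.T @ R0
assert np.allclose(h_single - P0 @ h_single, h - P0 @ h)
```

Passing `threshold=None` to `router` keeps the gates soft, which corresponds to the $\alpha_i \in [0,1]$ variant; in a training loop, `sparsity_penalty(alpha)` would be scaled by $\lambda$ and added to the task loss.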
This compositional subspace design thus unifies the efficiency of representation editing with the modularity of multi-task routing, enabling high-quality, multi-task LLM adaptation with minimal overhead.

Table 1: Performance on AlpacaEval. Parameter Efficiency (PE) shows the fraction of trainable parameters relative to the base model. Win rate measures preference over reference responses. CS-ReFT on Llama-2-7B outperforms all baseline methods and is competitive with ReFT on parameter efficiency.

Model                   Win Rate (%)   PE (%)
Reference Models
  GPT-3.5 Turbo 1106    86.30          -
  Llama-2 Chat 13B      81.10          -
  Llama-2 Chat 7B       71.40          -
Parameter-Efficient Methods (Llama-2 7B)
  Full Fine-tuning      80.93          100.00
  LoRA                  81.48          0.1245
  RED                   81.69          0.0039
  DiReFT                84.85          0.0039
  LoReFT                85.60          0.0039
  CS-ReFT (Ours)        93.94          0.0098

4 EXPERIMENTS

Setup. We evaluate CS-ReFT using the AlpacaEval benchmark (Dubois et al., 2024), which measures instruction-following capabilities through win rates against reference responses. As a general task, instruction-following implicitly involves multiple subtasks, such as reasoning and commonsense understanding. Our experiments use Llama-2-7B (Touvron et al., 2023) as the base model, comparing CS-ReFT against both parameter-efficient methods and larger models. We evaluate using two metrics: Win Rate (percentage of model outputs preferred over reference responses) and Parameter Efficiency (percentage of trainable parameters relative to the full model). Our baselines include parameter-efficient methods (LoRA (Hu et al., 2021), RED (Wu et al., 2024a), DiReFT (Wu et al., 2024b), LoReFT (Wu et al., 2024b)) and larger models (GPT-3.5 Turbo (Brown et al., 2020), Llama-2-13B (Touvron et al., 2023)).

The CS-ReFT architecture implements four distinct low-rank transformations using the ReFT intervention mechanism (Wu et al., 2024b), each operating independently on the model's hidden states.
A lightweight two-layer router network processes the first token's embedding ($h \in \mathbb{R}^d$), with an input layer mapping $\mathbb{R}^d \to \mathbb{R}^{d/2}$ (ReLU activation) and an output layer mapping $\mathbb{R}^{d/2} \to \mathbb{R}^4$ (sigmoid activation), using a 0.5 threshold for binary gating.

Results. Table 1 presents performance comparisons across model sizes and adaptation methods. CS-ReFT achieves a 93.94% win rate while modifying only 0.0098% of model parameters. Specifically, it surpasses larger models such as GPT-3.5 Turbo (86.30%) and Llama-2-13B (81.10%), outperforms weight-based methods like LoRA (81.48%, 0.1245% parameters), and exceeds representation methods such as RED and the ReFT variants (81.69–85.60%, 0.0039% parameters), highlighting the effectiveness of specialized subspaces and dynamic routing.

5 CONCLUSION

We introduced Compositional Subspace Representation Fine-tuning (CS-ReFT), which addresses cross-task interference by assigning separate low-rank subspace transformations to each skill and using a lightweight router for dynamic composition. Unlike orthonormal LoRA variants that still operate on weight matrices, our approach enforces orthonormal subspace constraints directly on hidden states, thereby isolating learned features more effectively. Experiments on AlpacaEval demonstrate that CS-ReFT outperforms both larger models (GPT-3.5) and other parameter-efficient methods (LoRA, LoReFT). Future research should focus on scalability (subspace merging or sharing for large skill sets) and interpretability (shedding light on the router's gating decisions). We believe that the success of CS-ReFT highlights the promise of multi-module, compositional paradigms for flexible, efficient adaptation of large language models.

6 ACKNOWLEDGEMENTS

The hypothesis, idea, experiments, and writing were conducted by Zochi, an AI system. Results and the manuscript have been carefully checked by human experts.

REFERENCES

Vladimir Araujo, M. Moens, and T. Tuytelaars.
Learning to route for dynamic adapter composition in continual learning with language models. 2024.

Elad Ben-Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. 2021.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165, 2020. URL https://api.semanticscholar.org/CorpusID:218971783.

Alexandra Chronopoulou, Dario Stojanovski, and Alexander M. Fraser. Language-family adapters for low-resource multilingual neural machine translation. 2022.

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. ArXiv, abs/2404.04475, 2024.

Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. 2024.

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe lora: the silver lining of reducing safety risks when fine-tuning large language models. 2024.

J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021. URL https://api.semanticscholar.org/CorpusID:235458009.

Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, and Chao Zhang. Aligning large language models with representation editing: A control perspective. 2024.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, 2021.

Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning. 2023.

Boan Liu, Liang Ding, Li Shen, Keqin Peng, Yu Cao, Dazhao Cheng, and D. Tao. Diversifying the mixture-of-experts representation for language models with orthogonal optimizer. 2023.

Zefang Liu and Jiahua Luo. Adamole: Fine-tuning large language models with adaptive mixture of low-rank adaptation experts. 2024.

Jonas Pfeiffer, Sebastian Ruder, Ivan Vulic, and E. Ponti. Modular deep learning. 2023.

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melissa Hall, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
ArXiv, abs/2307.09288, 2023. URL https://api.semanticscholar.org/CorpusID:259950998.

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. ArXiv, abs/2310.14152, 2023.

Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in fine-tuning via representation editing. ArXiv, abs/2402.15179, 2024a.

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Daniel Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models. 2024b.

Yaming Yang, Dilixat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, and Yunhai Tong. Mtl-lora: Low-rank adaptation for multi-task learning. 2024.

Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, and Wei Zhu. Milora: Efficient mixture of low-rank adaptation for large language models fine-tuning. 2024.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Troy Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai transparency. ArXiv, abs/2310.01405, 2023. URL https://api.semanticscholar.org/CorpusID:263605618.

---