Zochi Generated Preprint.
COMPOSITIONAL SUBSPACE REPRESENTATION FINE-TUNING
FOR ADAPTIVE LARGE LANGUAGE MODELS
Andy Zhou
Intology AI
ABSTRACT
Adapting large language models to multiple tasks can cause cross-skill interference,
where improvements for one skill degrade another. While orthogonal variants of LoRA
impose orthogonality constraints at the weight level, they do not fully
address interference in hidden-state representations. We propose Compositional
Subspace Representation Fine-tuning (CS-ReFT), a novel representation-based
approach that learns multiple orthonormal subspace transformations, each special-
izing in a distinct skill, and composes them via a lightweight router. By isolating
these subspace edits in the hidden state, rather than weight matrices, CS-ReFT
prevents cross-task conflicts more effectively. On the AlpacaEval benchmark, ap-
plying CS-ReFT to Llama-2-7B achieves a 93.94% win rate, surpassing GPT-3.5
Turbo (86.30%) while requiring only 0.0098% of model parameters. These find-
ings show that specialized representation edits, composed via a simple router, sig-
nificantly enhance multi-task instruction following with minimal overhead.
1 INTRODUCTION
Large language models (LLMs) have become central to a wide range of NLP applications, yet adapt-
ing them to new tasks can be computationally expensive, often requiring hundreds of GPU hours and
significant memory overhead. Parameter-efficient fine-tuning (PEFT) methods (Han et al., 2024)
tackle this challenge by updating only a small fraction of model parameters, typically 0.1–1% of the
total. While this approach has enabled more practical deployment of adapted models, with methods
like LoRA (Hu et al., 2021) reducing parameter counts by 1000x, current PEFT techniques still fo-
cus primarily on weight-based updates. In contrast, representation editing methods like ReFT (Wu
et al., 2024b) directly modify hidden states, achieving even lower parameter overhead; however,
most have used a single global edit that struggles to handle multiple skills without interference.
A core problem in multi-task adaptation is cross-task interference, wherein changes aimed at
improving one task degrade performance on another (Pfeiffer et al., 2023). Although recent LoRA
variants impose orthogonality constraints to reduce conflicts (Wang et al., 2023; Hsu et al., 2024),
none have extended these ideas to representation-based fine-tuning, where orthonormal subspaces
can isolate skills more effectively at the hidden-state level. To address this gap, we propose
Compositional Subspace Representation Fine-tuning (CS-ReFT), a framework that extends ReFT with
multiple orthonormal subspace edits and a lightweight router for dynamic composition. Our
contributions include:
• We learn separate low-rank transformations for each skill implicitly, preventing conflicts
across tasks while requiring only 0.0098% of model parameters, a 12.7x reduction compared
to LoRA. A small gating network is trained to selectively activate relevant subspaces
for each input.
• By applying orthonormal constraints directly in hidden-state space, CS-ReFT isolates skills
more cleanly than weight-based orthogonal LoRA methods.
• CS-ReFT attains a 93.94% win rate on AlpacaEval with Llama-2-7B, significantly outperforming
both larger models (GPT-3.5 Turbo) and parameter-efficient baselines (LoReFT at 85.60%).
arXiv:2503.10617v1 [cs.CL] 13 Mar 2025
Figure 1: Illustration of CS-ReFT. (1) The left panel shows how Compositional Subspace
Representation Fine-Tuning (CS-ReFT) applies specialized subspace transformations (Φ1, Φ2, Φ3) at
specific positions in different layers to adapt a frozen model for multiple tasks. Each subspace edit
is task-specific, reducing interference while allowing composition when needed. (2) The right panel
details the routing mechanism: a lightweight router determines which subspaces to activate based
on the input, ensuring efficient and targeted modifications.
2 RELATED WORK
Parameter-efficient adaptation. Recent years have seen rapid progress in parameter-efficient
fine-tuning (PEFT) methods (Han et al., 2024; Lialin et al., 2023; Li & Liang, 2021). Low-rank
adaptation approaches like LoRA (Hu et al., 2021) decompose weight updates into low-rank ma-
trices UV⊤, typically achieving 1000x parameter reduction while maintaining performance. Other
methods like BitFit (Ben-Zaken et al., 2021) modify only bias terms. Representation-based methods
(Wu et al., 2024b; Kong et al., 2024; Zou et al., 2023) instead edit model activations. They make
LLM adaptation more practical by achieving lower parameter overhead than weight-based
updates, but most use a single global edit function, which limits their effectiveness for
multi-task adaptation.
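To make the low-rank factorization UV⊤ concrete, the following sketch counts the trainable parameters of a rank-r update against a dense update of a single square weight matrix. The dimension d = 4096 and rank r = 8 are illustrative assumptions rather than values reported here, and the helper names are hypothetical.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight update of shape (d_out, d_in)."""
    return d_out * d_in


def low_rank_update_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in a rank-r update U V^T with U: (d_out, r) and V: (d_in, r)."""
    return r * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the paper).
d, r = 4096, 8
full = full_update_params(d, d)
low_rank = low_rank_update_params(d, d, r)
reduction = full / low_rank
print(f"dense: {full}, low-rank: {low_rank}, reduction: {reduction:.0f}x")
```

This is a per-matrix figure; whole-model reduction factors also depend on how many weight matrices receive an update.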
Multi-task learning. Multi-task adaptation strategies span several approaches, from shared pa-
rameter methods (Hu et al., 2021; Liu & Luo, 2024) that risk interference, to task-specific modules
(Chronopoulou et al., 2022; Yang et al., 2024) requiring separate adapters, to dynamic routing sys-
tems (Araujo et al., 2024; Zhang et al., 2024) that often introduce significant overhead. Some recent
work combines orthogonality constraints with multi-task setups (Hsu et al., 2024; Liu et al., 2023;
Wang et al., 2023), but again these rely on weight-based modules inserted inside Transformer lay-
ers. By contrast, our method applies orthonormal constraints at the representation level, learning
disjoint subspaces in the hidden state and dynamically routing between them. This design reduces
cross-task interference compared to sharing a single low-rank factorization for all tasks.
3 METHOD
Compositional Subspace Representation Fine-tuning (CS-ReFT) learns multiple low-rank sub-
space interventions and a router to activate them on a per-input basis, addressing cross-task inter-
ference by dedicating separate subspaces to each skill. We selectively compose these subspaces at
inference. Let $M$ be a frozen pretrained model (e.g., a Transformer) of hidden dimension $d$. For
each sequence of $n$ tokens $x = (x_1, \dots, x_n)$, the model produces hidden states
$\{h^{(j)}_1, \dots, h^{(j)}_n\}$ in layer $j$. Our
goal is to adapt $M$ to a set of $k$ tasks $\{T_1, \dots, T_k\}$ without modifying the original weights. Instead,
we learn (1) a collection of low-rank subspace transformations $\{\Phi_1, \dots, \Phi_k\}$, one per task, and (2) a
router $R$ that decides which subset of $\{\Phi_i\}$ to activate given an input. Our design ensures that each
task $T_i$ has a dedicated subspace edit $\Phi_i$, which prevents direct interference, yet also enables
composition for inputs requiring multiple skills. In practice, the tasks are either defined manually by
partitioning the data or learned implicitly during training.
3.1 SUBSPACE REPRESENTATION EDITING
Following ReFT (Wu et al., 2024b), each subspace intervention $\Phi$ modifies a hidden vector $h \in \mathbb{R}^d$
by editing only an $r$-dimensional subspace spanned by the rows of $R$. Concretely, we let

$$\Phi(h) = h + R^\top \bigl( \underbrace{W h + b}_{\text{desired subspace coords}} - R h \bigr),$$

where $R \in \mathbb{R}^{r \times d}$ is typically constrained to have orthonormal rows ($R R^\top = I_r$), and
$W \in \mathbb{R}^{r \times d}$, $b \in \mathbb{R}^r$ are trainable parameters. In CS-ReFT, we have $\Phi_1, \dots, \Phi_k$, one per task, each with
its own low-rank parameters $\{R_i, W_i, b_i\}$. Because an input requiring task $i$ can be edited by
$\Phi_i$ without altering another subspace, this design separates the learned directions in hidden space,
mitigating interference across tasks.
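The intervention above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes d = 16 and r = 4, the random initialization, and the helper names `make_orthonormal_rows` and `subspace_edit` are all hypothetical, and orthonormal rows are obtained here with a QR decomposition purely for convenience.

```python
import numpy as np


def make_orthonormal_rows(r: int, d: int, rng: np.random.Generator) -> np.ndarray:
    """Return an (r, d) matrix with orthonormal rows via a reduced QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # q: (d, r), orthonormal columns
    return q.T


def subspace_edit(h: np.ndarray, R: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Phi(h) = h + R^T (W h + b - R h) for a single hidden vector h of shape (d,)."""
    return h + R.T @ (W @ h + b - R @ h)


rng = np.random.default_rng(0)
d, r = 16, 4  # toy sizes, assumed for illustration
R = make_orthonormal_rows(r, d, rng)
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

h_edited = subspace_edit(h, R, W, b)

# Inside the subspace, the coordinates are replaced by the target W h + b.
print(np.allclose(R @ h_edited, W @ h + b))
# Outside the subspace (the orthogonal complement of R's rows), h is unchanged.
P = np.eye(d) - R.T @ R
print(np.allclose(P @ h_edited, P @ h))
```

Both checks rely on $R R^\top = I_r$: the edit overwrites only the $r$ coordinates of $h$ in the row space of $R$ and leaves the remaining $d - r$ directions untouched.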
3.2 ROUTER MECHANISM
Not every input belongs to a single task, nor do we want to dedicate a distinct subspace for every
fine-grained skill. Hence, we introduce a router that selects or composes the relevant subspaces at
inference time. For example, an instruction might require both $\Phi_2$ (arithmetic) and $\Phi_3$ (sentiment
analysis). We define a small routing network

$$\mathrm{Router}(x) = \alpha \in [0,1]^k,$$

which maps an embedding of the input (e.g., the first token’s hidden state) to a gating vector $\alpha$. We
then compose the subspace edits as:

$$h' = h + \sum_{i=1}^{k} \alpha_i \Bigl[ R_i^\top \bigl( W_i h + b_i - R_i h \bigr) \Bigr].$$

If $\alpha_i$ is discrete (e.g., thresholded), then each $\Phi_i$ is on or off. Alternatively, we can keep
$\alpha_i \in [0,1]$ for a soft gating. In either case, the parameter overhead from the router is minimal, allowing
dynamic composition without losing efficiency. Crucially, this router is jointly trained alongside the
subspaces themselves. As a result, the model can implicitly discover how to route different inputs to
different subspaces without any manual task partitioning.
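The gated composition of subspace edits can be sketched as follows. This is an illustrative sketch only: the gate values are supplied by hand instead of coming from a trained router, the sizes k = 3, d = 12, r = 2 are toy assumptions, and `compose_edits` is a hypothetical helper name. Each $R_i$ has orthonormal rows, but the sketch does not enforce orthogonality between different $R_i$.

```python
import numpy as np


def compose_edits(h, alphas, Rs, Ws, bs):
    """h' = h + sum_i alpha_i * R_i^T (W_i h + b_i - R_i h)."""
    h_new = h.copy()
    for alpha, R, W, b in zip(alphas, Rs, Ws, bs):
        if alpha == 0.0:
            continue  # a gated-off subspace contributes nothing
        h_new += alpha * (R.T @ (W @ h + b - R @ h))
    return h_new


rng = np.random.default_rng(0)
k, d, r = 3, 12, 2  # toy sizes, assumed for illustration
Rs = [np.linalg.qr(rng.standard_normal((d, r)))[0].T for _ in range(k)]
Ws = [rng.standard_normal((r, d)) for _ in range(k)]
bs = [rng.standard_normal(r) for _ in range(k)]
h = rng.standard_normal(d)

off = compose_edits(h, np.zeros(k), Rs, Ws, bs)                  # no subspace active
hard = compose_edits(h, np.array([0.0, 1.0, 0.0]), Rs, Ws, bs)   # only Phi_2 is on
soft = compose_edits(h, np.array([0.2, 0.7, 0.1]), Rs, Ws, bs)   # soft gating

print(np.allclose(off, h))
print(np.allclose(Rs[1] @ hard, Ws[1] @ h + bs[1]))
```

With a one-hot gate the composition reduces to a single ReFT-style edit, which is why the second check recovers the target coordinates of the active subspace.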
3.3 TRAINING OBJECTIVE
We train CS-ReFT by minimizing:
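$$\mathcal{L} = \sum_{i=1}^{k} \mathbb{E}_{(x,y) \sim T_i} \Bigl[ \ell\bigl( M(x; \{\Phi_1, \dots, \Phi_k\}, R),\, y \bigr) \Bigr] + \lambda\, \Omega(\alpha),$$

where $\ell(\cdot)$ is a task loss (e.g., cross-entropy), $R$ here denotes the router (not the projection
matrix of Section 3.1), and $\Omega(\alpha)$ can be a sparsity regularizer on the router
outputs to encourage minimal subspace usage. In practice, we update only $\{\Phi_1, \dots, \Phi_k\}$ and the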
router’s parameters while leaving all original model weights frozen. This design prevents cross-task
interference by activating only relevant subspaces on each input, and the low-rank structure keeps
parameter overhead minimal.
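As a rough per-example illustration of this objective, the sketch below adds a sparsity penalty on the gate vector to a cross-entropy task loss. The text only says that Ω can be a sparsity regularizer, so the L1 norm used here, the value λ = 0.01, and the function names are assumptions made for the example.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])


def sparsity_penalty(alpha: np.ndarray) -> float:
    """Assumed choice of Omega: the L1 norm of the gate vector (alpha >= 0)."""
    return float(np.abs(alpha).sum())


def cs_reft_loss(logits: np.ndarray, target: int, alpha: np.ndarray, lam: float = 0.01) -> float:
    """Per-example objective: task loss plus lam * Omega(alpha)."""
    return cross_entropy(logits, target) + lam * sparsity_penalty(alpha)


logits = np.array([2.0, 0.5, -1.0])      # toy model outputs for one token
alpha = np.array([1.0, 0.0, 0.0, 1.0])   # two of four subspaces active
task_loss = cross_entropy(logits, target=0)
loss = cs_reft_loss(logits, target=0, alpha=alpha, lam=0.01)
print(task_loss, loss)
```

Under this choice of Ω, each additional active subspace raises the loss by λ, so the router is pushed to switch on only the subspaces that reduce the task loss by more than that amount.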
In addition, CS-ReFT provides several benefits. It prevents cross-task interference
by keeping each skill’s subspace disjoint so that changes to $\Phi_i$ do not overwrite $\Phi_j$. It also fosters
compositional synergy, as the router composes subspaces on demand to enable multi-skill prompts.
Finally, it ensures extreme parameter savings because each subspace $\Phi_i$ remains low-rank and the
router is tiny, resulting in significantly fewer parameters than typical multi-head adapters. This
compositional subspace design thus unifies the efficiency of representation editing with the modularity
of multi-task routing, enabling high-quality, multi-task LLM adaptation with minimal overhead.
Table 1: Performance on AlpacaEval. Parameter Efficiency (PE) shows the fraction of trainable
parameters relative to the base model. Win rate measures preference over reference responses.
CS-ReFT on Llama-2-7B outperforms all baseline methods and is competitive with ReFT on parameter
efficiency.

Model                      Win Rate (%)    PE (%)
Reference Models
GPT-3.5 Turbo 1106         86.30           n/a
Llama-2 Chat 13B           81.10           n/a
Llama-2 Chat 7B            71.40           n/a
Parameter-Efficient Methods (Llama-2 7B)
Full Fine-tuning           80.93           100.00
LoRA                       81.48           0.1245
RED                        81.69           0.0039
DiReFT                     84.85           0.0039
LoReFT                     85.60           0.0039
CS-ReFT (Ours)             93.94           0.0098
4 EXPERIMENTS
Setup. We evaluate CS-ReFT using the AlpacaEval benchmark (Dubois et al., 2024), which mea-
sures instruction-following capabilities through win rates against reference responses. As a general
task, instruction-following implicitly involves multiple subtasks, such as reasoning and common-
sense understanding. Our experiments use Llama-2-7B (Touvron et al., 2023) as the base model,
comparing CS-ReFT against both parameter-efficient methods and larger models. We evaluate us-
ing two metrics: Win Rate (percentage of model outputs preferred over reference responses) and
Parameter Efficiency (percentage of trainable parameters relative to full model). Our baselines
include parameter-efficient methods (LoRA (Hu et al., 2021), RED (Wu et al., 2024a), DiReFT (Wu
et al., 2024b), LoReFT (Wu et al., 2024b)) and larger models (GPT-3.5 Turbo (Brown et al., 2020),
Llama-2-13B (Touvron et al., 2023)).
The CS-ReFT architecture implements four distinct low-rank transformations using the ReFT inter-
vention mechanism (Wu et al., 2024b), each operating independently on the model’s hidden states.
A lightweight two-layer router network processes the first token’s embedding ($h \in \mathbb{R}^d$), with an
input layer mapping $\mathbb{R}^d \to \mathbb{R}^{d/2}$ (ReLU activation) and an output layer mapping
$\mathbb{R}^{d/2} \to \mathbb{R}^4$ (sigmoid activation), using a 0.5 threshold for binary gating.
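A forward pass through a router of this shape might look like the sketch below. Only the layer sizes, activations, number of subspaces (four), and the 0.5 threshold come from the description above; the use of bias terms, the initialization scale, the strict `>` comparison, the toy width d = 64, and the function names are assumptions. How gradients flow through the threshold during training is not specified in the text, so the sketch covers the forward pass only.

```python
import numpy as np


def init_router(d: int, k: int, rng: np.random.Generator) -> dict:
    """Two-layer MLP parameters for d -> d/2 -> k. The init scale is an assumption."""
    hidden = d // 2
    return {
        "W1": rng.standard_normal((hidden, d)) / np.sqrt(d),
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((k, hidden)) / np.sqrt(hidden),
        "b2": np.zeros(k),
    }


def router_forward(params: dict, h_first: np.ndarray, threshold: float = 0.5):
    """Return (soft gates in (0, 1), binary gates after thresholding)."""
    z = np.maximum(params["W1"] @ h_first + params["b1"], 0.0)  # ReLU hidden layer
    logits = params["W2"] @ z + params["b2"]
    soft = 1.0 / (1.0 + np.exp(-logits))                        # sigmoid output layer
    hard = (soft > threshold).astype(np.float64)                # binary gating
    return soft, hard


rng = np.random.default_rng(0)
d, k = 64, 4  # k = 4 subspaces as described; d = 64 is a toy width
params = init_router(d, k, rng)
soft, hard = router_forward(params, rng.standard_normal(d))
print(soft, hard)
```

The binary vector `hard` plays the role of the gating vector $\alpha$ when each subspace is switched fully on or off.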
Results. Table 1 presents performance comparisons across model sizes and adaptation methods.
CS-ReFT achieves a 93.94% win rate while adding trainable parameters amounting to only 0.0098% of
the base model. Specifically, it surpasses larger models such as GPT-3.5 Turbo (86.30%) and
Llama-2 Chat 13B (81.10%), outperforms weight-based methods like LoRA (81.48%, 0.1245% parameters),
and exceeds representation-editing methods such as RED, DiReFT, and LoReFT (81.69–85.60%, 0.0039%
parameters), highlighting the effectiveness of specialized subspaces and dynamic routing.
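The relative figures can be recomputed directly from Table 1. The short script below is only a convenience check on the reported numbers; the dictionary layout and variable names are illustrative.

```python
# Win rate (%) and parameter efficiency (%) copied from Table 1.
table1 = {
    "LoRA": (81.48, 0.1245),
    "RED": (81.69, 0.0039),
    "DiReFT": (84.85, 0.0039),
    "LoReFT": (85.60, 0.0039),
    "CS-ReFT": (93.94, 0.0098),
}

cs_win, cs_pe = table1["CS-ReFT"]
for name, (win, pe) in table1.items():
    if name == "CS-ReFT":
        continue
    print(f"{name}: win-rate gap = {cs_win - win:+.2f} points, "
          f"PE ratio (baseline / CS-ReFT) = {pe / cs_pe:.2f}x")

lora_ratio = table1["LoRA"][1] / cs_pe        # LoRA trains about 12.7x more parameters
loreft_ratio = cs_pe / table1["LoReFT"][1]    # CS-ReFT trains about 2.5x more than LoReFT
loreft_gap = cs_win - table1["LoReFT"][0]     # win-rate gap over the strongest baseline
```

The ratio against LoRA matches the 12.7x reduction stated in the introduction, while the comparison with LoReFT shows that the win-rate gain of about 8.3 points comes with roughly 2.5 times as many trainable parameters.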
5 CONCLUSION
We introduced Compositional Subspace Representation Fine-tuning (CS-ReFT), which addresses
cross-task interference by assigning separate low-rank subspace transformations to each skill and
using a lightweight router for dynamic composition. Unlike orthonormal LoRA variants that still op-
erate on weight matrices, our approach enforces orthonormal subspace constraints directly on hid-
den states, thereby isolating learned features more effectively. Experiments on AlpacaEval demon-
strate that CS-ReFT outperforms both larger models (GPT-3.5) and other parameter-efficient meth-
ods (LoRA, LoReFT). Future research should focus on scalability (subspace merging or sharing for
large skill sets) and interpretability (shedding light on the router’s gating decisions). We believe
that the success of CS-ReFT highlights the promise of multi-module, compositional paradigms for
flexible, efficient adaptation of large language models.
6 ACKNOWLEDGEMENTS
The hypothesis, idea, experiments, and writing were conducted by Zochi, an AI system. Results and
the manuscript have been carefully checked by human experts.
REFERENCES
Vladimir Araujo, M. Moens, and T. Tuytelaars. Learning to route for dynamic adapter composition
in continual learning with language models. 2024.
Elad Ben-Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning
for transformer-based masked language-models. 2021.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. ArXiv, abs/2005.14165, 2020. URL https://api.semanticscholar.org/CorpusID:218971783.
Alexandra Chronopoulou, Dario Stojanovski, and Alexander M. Fraser. Language-family adapters
for low-resource multilingual neural machine translation. 2022.
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori Hashimoto. Length-controlled
alpacaeval: A simple way to debias automatic evaluators. ArXiv, abs/2404.04475, 2024.
Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning
for large models: A comprehensive survey. 2024.
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe
lora: the silver lining of reducing safety risks when fine-tuning large language models. 2024.
J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and
Weizhu Chen. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685,
2021. URL https://api.semanticscholar.org/CorpusID:235458009.
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song,
Rongzhi Zhang, Kai Wang, and Chao Zhang. Aligning large language models with representation
editing: A control perspective. 2024.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) ,
pp. 4582–4597, 2021.
Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to
parameter-efficient fine-tuning. 2023.
Boan Liu, Liang Ding, Li Shen, Keqin Peng, Yu Cao, Dazhao Cheng, and D. Tao. Diversifying the
mixture-of-experts representation for language models with orthogonal optimizer. 2023.
Zefang Liu and Jiahua Luo. Adamole: Fine-tuning large language models with adaptive mixture of
low-rank adaptation experts. 2024.
Jonas Pfeiffer, Sebastian Ruder, Ivan Vulic, and E. Ponti. Modular deep learning. 2023.
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas
Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes,
Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal,
Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez,
Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux,
Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor
Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein,
Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian,
Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan,
Iliyan Zarov, Yuchen Zhang, Angela Fan, Melissa Hall Melanie Kambadur, Sharan Narang,
Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open
foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023. URL
https://api.semanticscholar.org/CorpusID:259950998.
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and
Xuanjing Huang. Orthogonal subspace learning for language model continual learning. ArXiv,
abs/2310.14152, 2023.
Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu,
Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in
fine-tuning via representation editing. ArXiv, abs/2402.15179, 2024a.
Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Daniel Jurafsky, Christopher D.
Manning, and Christopher Potts. Reft: Representation finetuning for language models. 2024b.
Yaming Yang, Dilixat Muhtar, Yelong Shen, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun,
Denvy Deng, Feng Sun, Qi Zhang, Weizhu Chen, and Yunhai Tong. Mtl-lora: Low-rank adapta-
tion for multi-task learning. 2024.
Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, and Wei Zhu. Milora: Efficient
mixture of low-rank adaptation for large language models fine-tuning. 2024.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander
Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li,
Michael J. Byun, Zifan Wang, Alex Troy Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt
Fredrikson, Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach
to ai transparency. ArXiv, abs/2310.01405, 2023. URL
https://api.semanticscholar.org/CorpusID:263605618.