Authors: Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke
Shifting Perspectives: Steering Vector Ensembles
for Robust Bias Mitigation in LLMs
Zara Siddique∗, Irtaza Khalid∗, Liam D. Turner∗, Luis Espinosa-Anke∗†
∗School of Computer Science and Informatics, Cardiff University, United Kingdom
†AMPLYFI, United Kingdom
{siddiquezs2,khalidmi,turnerl9,espinosa-ankel}@cardiff.ac.uk
Abstract
We present a novel approach to bias mitigation
in large language models (LLMs) by applying
steering vectors to modify model activations
in forward passes. We employ Bayesian op-
timization to systematically identify effective
contrastive pair datasets across nine bias axes.
When optimized on the BBQ dataset, our in-
dividually tuned steering vectors achieve aver-
age improvements of 12.2%, 4.7%, and 3.2%
over the baseline for Mistral, Llama, and Qwen,
respectively. Building on these promising re-
sults, we introduce Steering Vector Ensembles
(SVE), a method that averages multiple individ-
ually optimized steering vectors, each targeting
a specific bias axis such as age, race, or gender.
By leveraging their collective strength, SVE
outperforms individual steering vectors in both
bias reduction and maintaining model perfor-
mance. The work presents the first systematic
investigation of steering vectors for bias miti-
gation, and we demonstrate that SVE is a pow-
erful and computationally efficient strategy for
reducing bias in LLMs, with broader implica-
tions for enhancing AI safety.1
1 Introduction
Despite ongoing efforts to mitigate social bias in
large language models (LLMs), recent work shows
that representational harms such as stereotyping
continue to exist in both open and closed-source
models (Fort et al., 2024; Sahoo et al., 2024; Xu
et al., 2024, inter alia). As these models become increasingly
prevalent and integrated into high-stakes
applications, the impact of such biases becomes
only more concerning. Representational harms
in LLMs can reinforce systemic inequalities, in-
fluencing outcomes in areas such as employment
(Wan et al., 2023), creative expression (Cheng et al.,
2023), and dataset creation (Siddique et al., 2024),
among others. Addressing these biases is crucial
to ensure AI systems produce safe and inclusive
outputs in real-world applications.
1 The code is available at https://github.com/groovychoons/shifting-perspectives
[Figure 1 (diagram). Contrastive dataset setup, with parameters for Bayesian optimization: an instruction (“Act as if you are extremely”), a contrastive pair (“accepting” / “prejudiced”), prompt dataset items (“Do you think men handle stress better than women?”), and a dataset size. 50 contrastive dataset trials are run for each of 9 axes, giving 9 steering vectors optimized on different bias axes from BBQ (Age, Appearance, Disability, Gender, Nationality, Race, Religion, Sexuality, Socioeconomic), which are averaged to create a Steering Vector Ensemble.]
Figure 1: An overview of our methods: we dynamically
construct 50 contrastive datasets via Bayesian optimization
for each of 9 bias axes. The resulting steering
vectors are averaged to construct a Steering Vector
Ensemble (SVE).
The core challenge in addressing representa-
tional harm is developing interventions that are
effective, robust, and interpretable, without com-
promising on model utility. Prompt engineering
(Brown et al., 2020) offers a lightweight approach,
but lacks reliability, as LLMs are highly sensitive to
minor prompt variations (Hida et al., 2024; Salinas
and Morstatter, 2024).
arXiv:2503.05371v1 [cs.LG] 7 Mar 2025
More structured approaches, such as supervised
fine-tuning (Wei et al., 2021) and Reinforcement
Learning from Human Feedback (RLHF) (Ziegler
et al., 2019), offer greater control over model be-
havior. However, these methods are computation-
ally expensive, remain vulnerable to adversarial
attacks (Zhan et al., 2024), and risk false align-
ment, where models merely mimic certain aspects
of safety data without genuinely comprehending
human preferences (Wang et al., 2024b). For exam-
ple, Kung and Peng (2023) show that performance
gains in instruction tuned models may come from
learning superficial patterns, such as memorizing
output formats rather than truly understanding task
requirements.
To look deeper into a model’s decision-making
process, we must examine its internal activations.
Activation engineering (also known as representa-
tion engineering) offers a computationally efficient
and interpretable intervention by extracting and
modifying internal representations without costly
retraining (Zou et al., 2023; Turner et al., 2024;
Rimsky et al., 2024).
The core of this method is in identifying activa-
tion differences in contrastive input pairs. For ex-
ample, consider the following contrasting prompts:
"You are very accepting. Write about women’s rights."
"You are very prejudiced. Write about women’s rights."
By computing the difference in activations be-
tween these two inputs, we can isolate a direction
in the activation space that correlates with preju-
dice. Repeating this process over multiple con-
trastive pairs allows us to extract a more robust
and generalizable steering vector for the concept
of prejudice. Concepts can range from positive vs.
negative (Turner et al., 2024) to model refusal vs.
acceptance (Arditi et al., 2024). We provide more
detail on steering vector methods in Section 3.
Previous activation engineering work such as
Zou et al. (2023) and Rimsky et al. (2024) select
a fixed contrastive dataset, and compute steering
vectors for various behaviours such as hallucination,
sycophancy and honesty. We extend previous work
by systematically evaluating 50 different
dynamically constructed contrastive datasets
per bias axis, as well as examining the impact of
combining multiple steering vectors into a Steering
Vector Ensemble (SVE). Our results across three
models confirm that SVE consistently outperforms
individual steering vectors on both Bias Benchmark for QA (BBQ) (Parrish et al., 2022) and MMLU
(Hendrycks et al., 2021), demonstrating its poten-
tial as a generalizable and efficient strategy for
fairness interventions in LLMs.
From this, our work presents the following con-
tributions:
1. the first application of steering vectors to social
biases such as racial, gender, socioeconomic and age biases,
2. a framework to systematically identify effective
contrastive datasets via Bayesian optimization,
enhancing the robustness of previous
activation steering methods,
3. and Steering Vector Ensembles (SVE), a
method for modifying activations in forward
passes by combining individually tuned steering
vectors.
We highlight the importance of dataset selection
in activation steering, and provide a lightweight,
robust, and interpretable intervention that improves
fairness without the need for retraining or large-
scale data collection. Our findings demonstrate
that Steering Vector Ensembles (SVE) harness the
collective strength of multiple tuned steering vec-
tors, offering a more robust and effective approach
to bias mitigation than individual vectors alone. To-
gether, these contributions represent a meaningful
step forward in addressing societal biases in NLP
systems.
2 Related Work
Steering vectors The concept of steering vec-
tors has its roots in earlier work on manipulating
hidden states in language models. Dathathri et al.
(2020) introduced Plug and Play Language Mod-
els (PPLM), where attribute classifiers were used
to guide text generation by modifying activations.
Following this, Subramani et al. (2022) developed
a method for extracting steering vectors through
gradient-based optimization, maximizing the like-
lihood of the model producing a given target sen-
tence. Building on the success of these methods,
the field shifted toward using contrastive pairs to
derive steering vectors. Turner et al. (2024) first
demonstrated this approach, using a single con-
trastive pair of prompts to compute activation dif-
ferences within a transformer model, focusing on
sentiment and toxicity. Zou et al. (2023) improved
the robustness of this approach by using multiple
contrastive prompts, applying steering techniques
to areas of AI safety such as honesty and
power-seeking tendencies, with a primary focus on
learning linear representations. However,
existing research has not systematically tested dif-
ferent datasets to determine the optimal setup for
steering vectors. In this work, we address this gap
by applying Bayesian optimization to identify more
effective contrastive datasets.
Safety applications A small but growing body
of research has explored the application of steering
vectors for extracting and controlling specific con-
cepts, in areas such as truth and honesty (Azaria
and Mitchell, 2023; Li et al., 2024a; Marks and
Tegmark, 2024) and model refusal (Arditi et al.,
2024; Rimsky et al., 2024). We break new ground
in exploring the application of steering vectors to
social bias in areas such as race, gender, and sexu-
ality.
Generalization The aforementioned steering
vector work, and others such as Konen et al. (2024)
and Burns et al. (2024), focus primarily on isolated
interventions, where a single steering vector is used
to modify model behavior along a specific axis. Tan
et al. (2024) study the generalization and reliability
of steering vectors and find a dataset-dependent
steerability bias in these single steering vectors that
hinders out-of-distribution performance especially
when minor perturbations are applied to the prompt.
We show that averaging steering vectors over mul-
tiple concepts can overcome the steerability bias
by possibly capturing a more universal ‘steering’
property in line with the linear representation hy-
pothesis (Park et al., 2024).
3 Methods
3.1 Steering Vector Construction
We follow the Linear Artificial Tomography (LAT)
approach of Zou et al. (2023) to obtain our steering
vectors. Given a question prompt $X(t, a)$ that is
conditioned on a concept $t$ and a sentiment
$a \in \{o^-, o^+\}$, the language model produces a
hidden representation $h_l(X(t, a))$ per layer $l$
for the prompt. A dataset
$D = \{(X_i(t, o^+), X_i(t, o^-))\}_{i=1}^{|D|}$
consisting of many contrastive pairs produces
normalized hidden state representations per layer of
each contrastive example prompt (usually considering
the last token),
$\{(h^{t,+}_{i,l}, h^{t,-}_{i,l})\}_{i=1}^{|D|}$.
The primitive data matrix $X_{l,t}$ to compute the
steering vector is
$$X_{l,t} = \bigoplus_{i=1}^{|D|} \left( h^{t,+}_{i,l} - h^{t,-}_{i,l} \right) \qquad (1)$$
Then, the steering vector $w_{t,l}$ for concept $t$ and
layer $l$ is the first principal component of $X_{l,t}$:
$$w^{(1)}_{t,l} = \arg\max_{\|w\|=1} \|X_{l,t} w\|^2 \qquad (2)$$
Note that mean differencing instead of PCA (Rimsky
et al., 2024; Zou et al., 2023) can also be used
to a similar effect.
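As a rough illustration of Equations 1 and 2, the following NumPy sketch extracts a direction from synthetic activations. It is not the authors' implementation: the function names, the toy data, and the choice to skip mean-centering (so that the code matches Equation 2 as written) are all assumptions, and real hidden states would come from a model's layer $l$ at the last token.

```python
import numpy as np

def steering_vector_pca(h_pos, h_neg):
    """Unit vector maximizing ||X w||^2, where the rows of X are h+ - h-."""
    diffs = h_pos - h_neg  # shape (|D|, dim): one difference per contrastive pair
    # The top right-singular vector of X solves Eq. 2 (no centering, an assumption).
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    w = vt[0]
    # The sign of a singular vector is arbitrary; orient it toward the mean difference.
    if np.dot(w, diffs.mean(axis=0)) < 0:
        w = -w
    return w

def steering_vector_mean_diff(h_pos, h_neg):
    """Mean-difference alternative, normalized to unit length."""
    d = (h_pos - h_neg).mean(axis=0)
    return d / np.linalg.norm(d)

# Synthetic "activations": the positive prompts are shifted along one hidden direction.
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
h_neg = rng.normal(size=(n_pairs, dim))
h_pos = h_neg + 3.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, dim))

w_pca = steering_vector_pca(h_pos, h_neg)
w_mean = steering_vector_mean_diff(h_pos, h_neg)
print("cosine(PCA, true direction): ", float(w_pca @ true_dir))
print("cosine(PCA, mean difference):", float(w_pca @ w_mean))
```

On data like this, where every pair shifts along the same direction, the two estimators nearly coincide, which is consistent with the remark that mean differencing can be used to a similar effect.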
3.2 Steering Vector Application
To guide the model’s responses along a particular
conceptual axis, we add our chosen steering vector,
$\lambda w^{(1)}_{t,l}$, to the hidden state at selected layer(s) $l$.
Here, $\lambda$ is a scalar (which can be positive or negative)
that determines the strength and direction of
the intervention, and the choice of layer $l$ is treated
as a hyperparameter.
We demonstrate this approach on the BBQ
dataset. Consider the following multiple-choice
question:
The police pulled over a car with David Li and Timothy
Schneider for overshooting a red light and going over
the speed limit. Who was at the wheel?
(a) Can’t answer
(b) David Li
(c) Timothy Schneider
For this example, the baseline response from
Llama 3.1 8B Instruct was (b) David Li, whereas
a steering vector response shifted this to (a) Can’t
answer.
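The intervention in Section 3.2 amounts to a single vector addition per steered layer. The sketch below shows it on a plain NumPy array standing in for one layer's hidden states; the function name is hypothetical, and adding the vector at every token position is an assumption, since the text only specifies that $\lambda w^{(1)}_{t,l}$ is added to the hidden state.

```python
import numpy as np

def apply_steering(hidden, w, lam):
    """Return hidden + lam * w, broadcasting w over all token positions."""
    return hidden + lam * w

rng = np.random.default_rng(0)
seq_len, dim = 4, 8
hidden = rng.normal(size=(seq_len, dim))  # stand-in for one layer's hidden states

w = np.zeros(dim)
w[2] = 1.0  # a unit-norm steering direction (toy choice)

steered = apply_steering(hidden, w, lam=1.6)
reversed_ = apply_steering(hidden, w, lam=-1.6)

# For unit-norm w, the projection of every token onto w moves by exactly lam.
shift = steered @ w - hidden @ w
print(shift)
```

A negative coefficient moves the activations the same distance in the opposite direction, which is why $\lambda$ controls both strength and direction.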
3.3 Steering Vector Ensembles
Inspired by the linear representation hypothesis
(Park et al., 2024), which proposes the
existence of a latent space where abstract concepts
are linearly separable, we approximate
a universal representation of bias by leveraging
steering vectors across multiple bias concepts. A
natural choice for this is simply the mean of the
steering vectors over multiple concepts, which we
call the Steering Vector Ensemble (SVE),
$$\mathrm{SVE}_l = \mathbb{E}_t\left[ w^{(1)}_{t,l} \right] \qquad (3)$$
where $\mathbb{E}_t[\cdot]$ is the discrete expectation operator with
respect to the concept probability distribution.
Parameter            Description
Instruction          Instructional framing of the system prompt.
                     Example: “Express in your response that you are extremely”
Contrastive Pairs    Pre-defined contrastive pairs.
                     Examples: “inclusive, bigoted”, “accepting, prejudiced”
Prompt Dataset       Set of prompts that follow the system prompt.
                     Examples:
                     Generic: Write a mystery story set in an abandoned amusement park.
                     Gender Specific: Do you think men handle stress better than women?
                     Race Specific: Why do you think certain races are better at sports?
Dataset Size         The number of pairs in the contrastive dataset, selected from the
                     prompt dataset. Values: 100 to 500 with step 50.
Scalar Multiplier λ  Scaling coefficient of the steering vector. Values: -2 to 2 with step 0.2.
Table 1: The five parameters used for Bayesian optimization of Contrastive Pair Datasets, along with a description
of the parameter and either examples or value ranges, in the case of numeric parameters.
The motivation behind SVE is that averag-
ing across multiple bias concepts should ideally
smooth out variations that are unrelated to bias,
thus strengthening the underlying component that
captures the general concept of bias. Additionally,
individual steering vectors risk being
dataset-dependent (Tan et al., 2024), and incorporating
multiple datasets mitigates this issue to some
extent.
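The averaging in Equation 3 and its motivation can be illustrated with a toy model in which each per-axis vector is a shared direction plus axis-specific noise. This is a sketch under stated assumptions: a uniform distribution over axes, synthetic vectors rather than real steering vectors, and no renormalization of the mean (the text does not say whether the ensemble is rescaled).

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def steering_vector_ensemble(vectors_by_axis):
    """Eq. 3 with a uniform distribution over axes: the plain mean of the vectors."""
    return np.mean(np.stack(list(vectors_by_axis.values())), axis=0)

axes = ["age", "appearance", "disability", "gender", "nationality",
        "race", "religion", "sexuality", "socioeconomic"]

rng = np.random.default_rng(0)
dim = 256
shared = unit(rng.normal(size=dim))  # hypothetical common "bias" direction

# Each toy axis vector = shared direction + axis-specific noise of comparable norm.
vectors = {axis: unit(shared + rng.normal(size=dim) / np.sqrt(dim)) for axis in axes}

sve = steering_vector_ensemble(vectors)
cos_individual = {axis: float(v @ shared) for axis, v in vectors.items()}
cos_sve = float(unit(sve) @ shared)

print("best individual cosine to shared direction:", max(cos_individual.values()))
print("ensemble cosine to shared direction:       ", cos_sve)
```

In this toy setting the independent noise terms partly cancel in the mean, so the ensemble aligns more closely with the shared direction than any single vector does. Whether real per-axis steering vectors behave this way is an empirical question that the experiments address.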
4 Experimental Setup
4.1 Bayesian Optimization of Contrastive
Datasets
Since the effectiveness of activation engineering re-
lies heavily on the quality of the contrastive dataset,
we dynamically construct contrastive datasets using
Bayesian Optimization. We define each component
of a contrastive dataset as a parameter, namely, the
instruction followed by the contrastive pair words,
followed by a question or task from a prompt
dataset. A summary of these parameters and exam-
ples can be found in Table 1, and Figure 1 offers
a visual representation of the prompt construction.
The QA prompt datasets are taken from BiasLens
(Li et al., 2024b), and the generic task dataset is
generated by OpenAI’s GPT-4o. Additional param-
eters that we optimize during this process include
the number of contrastive pairs per dataset and the
scalar multiplier of the steering vector.
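To make the parameterization concrete, here is a minimal sketch of how one contrastive dataset could be assembled from an instruction, a contrastive word pair, and a prompt dataset. The function name and the exact string template (instruction, then word, then prompt) are assumptions for illustration; the example strings are taken from Figure 1 and Table 1.

```python
def build_contrastive_pairs(instruction, pair, prompts, size):
    """Return up to `size` (positive, negative) prompt pairs."""
    pos_word, neg_word = pair
    pairs = []
    for prompt in prompts[:size]:
        positive = f"{instruction} {pos_word}. {prompt}"
        negative = f"{instruction} {neg_word}. {prompt}"
        pairs.append((positive, negative))
    return pairs

prompts = [
    "Do you think men handle stress better than women?",
    "Write a mystery story set in an abandoned amusement park.",
]

dataset = build_contrastive_pairs(
    instruction="Act as if you are extremely",
    pair=("accepting", "prejudiced"),
    prompts=prompts,
    size=2,
)

for positive, negative in dataset:
    print(positive)
    print(negative)
```

The two prompts in each pair differ only in the contrastive word, so the activation difference between them can be attributed to that word rather than to the task text.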
In our approach, Bayesian Optimization plays a
crucial role in dataset selection. By parameteriz-
ing the components of the contrastive dataset, we
treat the dataset construction as an optimization
problem where each trial corresponds to a different
configuration of these parameters. The optimizer
builds a surrogate model using a Tree-structured
Parzen Estimator (TPE) sampler (Bergstra et al.,
2011), a tree-based approach that scales well to
high-dimensional parameter spaces, to predict the
expected accuracy, and then selects new configu-
rations that maximize the expected improvement
on this objective. This iterative process allows us
to efficiently explore the parameter space and iden-
tify dataset configurations that lead to improved
performance.
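The search loop described above can be sketched as follows. The paper does not name the optimization library, so using Optuna's `TPESampler` here is an assumption; when Optuna is not installed, the sketch falls back to plain random search so that it still runs, which is not Bayesian optimization. The objective `toy_accuracy` is a made-up stand-in for the real pipeline (build the dataset, extract the vector, score one BBQ axis), and the candidate lists are small illustrative subsets.

```python
import random

try:
    import optuna  # optional dependency; its use here is an assumption
except ImportError:
    optuna = None

INSTRUCTIONS = ["Act as if you are extremely",
                "Express in your response that you are extremely"]
PAIRS = ["inclusive/bigoted", "accepting/prejudiced"]
PROMPT_SETS = ["generic", "gender", "race"]
SIZES = list(range(100, 501, 50))                      # 100 to 500, step 50
LAMBDAS = [round(-2 + 0.2 * k, 1) for k in range(21)]  # -2 to 2, step 0.2

def toy_accuracy(cfg):
    """Hypothetical objective; a real run would evaluate BBQ accuracy."""
    score = 50.0
    score += 5.0 if cfg["pair"] == "accepting/prejudiced" else 0.0
    score += 3.0 if cfg["prompts"] == "gender" else 0.0
    score -= 4.0 * (cfg["lam"] - 1.2) ** 2
    score += cfg["size"] / 500
    return score

def run_search(n_trials=50, seed=0):
    if optuna is not None:
        optuna.logging.set_verbosity(optuna.logging.WARNING)

        def objective(trial):
            cfg = {
                "instruction": trial.suggest_categorical("instruction", INSTRUCTIONS),
                "pair": trial.suggest_categorical("pair", PAIRS),
                "prompts": trial.suggest_categorical("prompts", PROMPT_SETS),
                "size": trial.suggest_int("size", 100, 500, step=50),
                "lam": trial.suggest_float("lam", -2.0, 2.0, step=0.2),
            }
            return toy_accuracy(cfg)

        sampler = optuna.samplers.TPESampler(seed=seed)
        study = optuna.create_study(direction="maximize", sampler=sampler)
        study.optimize(objective, n_trials=n_trials)
        return study.best_params, study.best_value

    # Fallback: random search over the same space (NOT Bayesian optimization).
    rng = random.Random(seed)
    best_cfg, best_val = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "instruction": rng.choice(INSTRUCTIONS),
            "pair": rng.choice(PAIRS),
            "prompts": rng.choice(PROMPT_SETS),
            "size": rng.choice(SIZES),
            "lam": rng.choice(LAMBDAS),
        }
        val = toy_accuracy(cfg)
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

best_cfg, best_acc = run_search(n_trials=50)
print(best_cfg, best_acc)
```

The 50 trials mirror the per-axis budget used in the paper; the numeric ranges follow Table 1.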
We conduct 50 trials for each of the nine BBQ
bias axes (Parrish et al., 2022). In each trial, a steer-
ing vector is constructed based on the contrastive
dataset selected by the optimizer, with the overall
objective of maximizing accuracy for the respective
axis. Accuracy is defined as the percentage of cor-
rect outputs across all multiple-choice questions in
an axis (see Section 3.2 for an example). Through
this process, we discover that certain combinations
of instructions, contrastive pair words, and task
prompts lead to improved performance. Ultimately,
this optimization yields nine finely tuned steering
vectors, each optimized for its designated BBQ
axis.
4.2 Dataset Selection
We considered various benchmarks as the opti-
mization objective for this process, such as BOLD
(Dhamala et al., 2021), discrim-eval (Tamkin et al.,
2023) and CALM (Gupta et al., 2023). Bias Bench-
mark for QA (BBQ) was selected for its diverse
coverage of 11 bias axes, including two intersectional
axes, and its large scale, comprising 58,510
QA scenarios (Parrish et al., 2022). We use 9 of
these axes for training steering vectors, and 2 to
assess out-of-distribution performance. To assess
general model performance, we use the test set
of 18,849 questions from Massive Multitask Language
Understanding (MMLU) (Hendrycks et al.,
2021), following prior works such as Li et al.
(2024a) and Rimsky et al. (2024). We compute
baseline and steering vector accuracies on both
BBQ and MMLU using zero-shot prompting with
a temperature of 0 and evaluating the generated
model output.
[Figure 2 (heatmaps). Three panels: hidden layer 1, hidden layer 15 and hidden layer 27. Rows and columns are the nine axes (age, appearance, disability, gender, nationality, race, religion, sexuality, socioeconomic); the color scale is cosine similarity from −0.5 to 1.]
Figure 2: Pairwise cosine similarity matrix between the 9 BBQ axis steering vectors for Mistral shows that
concept similarity between the vectors representing biases for different concepts, e.g. sexuality and gender, becomes
most sensible in the middle layers.
[Figure 3 (line plot). x-axis: hidden layer (5 to 30); y-axis: cosine similarity (−1 to 1); series: "race, nationality" and "gender, sexuality".]
Figure 3: The evolution over the hidden layers of the
similarity between gender and sexuality vectors, and
race and nationality vectors, highlights a clear peak
in similarity in the middle layers, as we expect their
vectors to be similar.
4.3 Model Selection
To ensure our findings generalize across multi-
ple popular LLM families, we select a diverse set
of models from different research labs: Mistral
7B Instruct (mistralai/Mistral-7B-Instruct-v0.1;
Jiang et al. 2023), Llama 3.1 8B Instruct (meta-llama/Llama-3.1-8B-Instruct; AI@Meta 2024)
and Qwen 2.5 7B Instruct (Qwen/Qwen2.5-7B-Instruct;
Yang et al. 2025). The selected models
strike a balance between being large enough to cap-
ture nuanced biases and remaining practical for
running 50 optimization trials per bias axis, as well
as further SVE experiments.
4.4 Layer Selection
We analyze the steering vectors generated for the
nine BBQ bias axes by computing their cosine sim-
ilarity (dot product, given the vectors are normal-
ized) across the hidden layers of each model. Tak-
ing Mistral as an example, we reveal three distinct
latent space regimes in Figure 2. The full cosine
similarity matrices over all layers in the three mod-
els can be found in Appendix A. We observe that
the middle layers exhibit the most intuitive regime,
where bias concept representations naturally cor-
relate. This is consistent with observations made
by Park et al. (2024) and Rimsky et al. (2024). We
highlight this specifically for the race and national-
ity steering vectors, as well as gender and sexuality
in Mistral in Figure 3.
Additionally, we observe dataset-dependent clus-
tering in the pairwise cosine similarity of the steer-
ing vectors across the 31 hidden layers, as illus-
trated in Figure 8 in Appendix A. The largest clus-
ters typically appear in the middle layers, with sim-
ilarity decaying less in later layers. Based on these
insights, we restrict our interventions to the middle
layers when generating model outputs with steering
vectors in Section 5.
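The per-layer analysis above reduces to a pairwise cosine similarity matrix over the nine steering vectors. A minimal NumPy sketch follows, using random vectors as stand-ins for real steering vectors (an assumption); the normalization step makes the dot product equal to the cosine similarity even if the inputs are not unit length.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-normalize each row
    return v @ v.T

axes = ["age", "appearance", "disability", "gender", "nationality",
        "race", "religion", "sexuality", "socioeconomic"]

rng = np.random.default_rng(0)
layer_vectors = rng.normal(size=(len(axes), 64))  # toy stand-ins for one layer

sim = cosine_similarity_matrix(layer_vectors)
i, j = axes.index("gender"), axes.index("sexuality")
print("gender vs sexuality:", float(sim[i, j]))
```

Repeating this for every hidden layer and tracking a single entry, such as gender versus sexuality, gives the kind of curve summarized in Figure 3.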
BBQ Axis        Mistral                 Llama                   Qwen
                Baseline  ISV   SVE     Baseline  ISV   SVE     Baseline  ISV   SVE
Age             43.9      55.2  59.0    62.2      67.0  67.9    74.3      80.0  80.6
Appearance      52.2      62.0  67.3    63.1      65.1  66.9    75.6      77.1  77.2
Disability      50.4      66.4  65.4    68.4      74.3  74.7    77.6      79.7  77.9
Gender          51.6      63.9  64.4    66.2      76.1  72.6    77.5      83.2  82.1
Nationality     55.4      72.3  73.6    76.1      81.8  82.4    82.5      85.3  83.9
Race            56.5      66.2  71.7    80.7      84.1  86.8    88.6      91.0  91.1
Religion        56.5      66.6  70.3    75.8      78.3  79.9    78.2      80.7  81.1
Sexuality       49.1      61.8  68.3    79.7      82.5  81.6    84.7      87.4  86.1
Socioeconomic   52.4      63.7  69.3    68.9      74.5  75.2    86.0      89.4  89.0
Table 2: Baseline, ISV and SVE accuracies for 9 BBQ axes in Mistral, Llama and Qwen, shown as percentages.
The ISV column shows the accuracy for each axis on its respective steering vector, e.g. the accuracy for the Age
steering vector on the Age subset of BBQ.
5 Results
In this section, we present a comprehensive evalua-
tion of our bias mitigation methods across three
instruction-tuned models: Mistral, Llama, and
Qwen. We first assess the impact of individually
optimized steering vectors (ISVs) on bias reduc-
tion using the Bias Benchmark for QA (BBQ) and
on general language performance using MMLU.
Next, we compare these results to Steering Vector
Ensembles (SVEs), which average multiple ISVs
to capture a more universal bias representation. Fi-
nally, we analyze the interplay between bias miti-
gation and general performance, and evaluate the
out-of-distribution generalizability on unseen inter-
sectional bias axes.
5.1 Effectiveness of Individual Steering
Vectors
Our results show that individually tuned steering
vectors, denoted as ISV in Table 2, significantly
improve bias mitigation across all three models.
Averaged over the nine axes in Table 2, ISVs yield
improvements in BBQ accuracy of 12.2 percentage
points for Mistral, 4.73 for Llama, and 3.20 for
Qwen relative to their respective baselines. These
results align with prior work in AI safety, such as
toxicity reduction in Wang et al. (2024a) and
Turner et al. (2024).
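The average improvements quoted above can be reproduced from the Baseline and ISV columns of Table 2. The short script below copies those per-axis accuracies (row order Age to Socioeconomic) and computes the mean ISV gain per model in percentage points; the variable and function names are illustrative.

```python
# Per-axis BBQ accuracies copied from Table 2, rows Age ... Socioeconomic.
TABLE2 = {
    "Mistral": {
        "baseline": [43.9, 52.2, 50.4, 51.6, 55.4, 56.5, 56.5, 49.1, 52.4],
        "isv":      [55.2, 62.0, 66.4, 63.9, 72.3, 66.2, 66.6, 61.8, 63.7],
    },
    "Llama": {
        "baseline": [62.2, 63.1, 68.4, 66.2, 76.1, 80.7, 75.8, 79.7, 68.9],
        "isv":      [67.0, 65.1, 74.3, 76.1, 81.8, 84.1, 78.3, 82.5, 74.5],
    },
    "Qwen": {
        "baseline": [74.3, 75.6, 77.6, 77.5, 82.5, 88.6, 78.2, 84.7, 86.0],
        "isv":      [80.0, 77.1, 79.7, 83.2, 85.3, 91.0, 80.7, 87.4, 89.4],
    },
}

def mean(values):
    return sum(values) / len(values)

def average_gain(model):
    """Mean ISV accuracy minus mean baseline accuracy, in percentage points."""
    rows = TABLE2[model]
    return mean(rows["isv"]) - mean(rows["baseline"])

gains = {model: round(average_gain(model), 2) for model in TABLE2}
print(gains)
```

The gains come out at roughly 12.2, 4.73 and 3.20 points for Mistral, Llama and Qwen, so the headline figures are absolute differences in accuracy averaged over the nine axes, not relative changes.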
Building on these insights, we evaluate Steering
Vector Ensembles (SVE), which combine multiple
individual steering vectors via averaging. In Ta-
ble 2, we observe that in many cases, though not
all, SVE outperforms individually tuned steering
vectors on the axis they have been optimized on,
highlighting its effectiveness as a method.
Model    Steering Vector   BBQ    MMLU
Mistral  Baseline          53.6   50.3
         Average ISV       65.5   42.8
         Merged Datasets   40.5   30.5
         SVE               69.3   46.6
Llama    Baseline          75.9   52.9
         Average ISV       80.1   56.0
         Merged Datasets   69.5   42.7
         SVE               81.6   58.1
Qwen     Baseline          84.7   66.7
         Average ISV       86.1   66.8
         Merged Datasets   85.8   66.7
         SVE               86.9   66.9
Table 3: Comparison of performance on BBQ and
MMLU across three models. We compute baseline
performance alongside improvements achieved using
different steering vector methods: the average of indi-
vidual steering vectors (ISV), merged datasets, and our
proposed Steering Vector Ensemble (SVE).
5.2 Steering Vector Ensembles (SVE)
Outperform Other Methods
In Table 3, we compare various baselines on the full
BBQ dataset and MMLU. We compute accuracy
for BBQ and MMLU for each of the nine individual
steering vectors, and take the average score (Average
ISV). While BBQ scores improved, MMLU
performance varied across models: applying individual
steering vectors led to a 7.5-point decrease in
MMLU accuracy for Mistral (50.3 to 42.8) but a
3.1-point increase for Llama (52.9 to 56.0), and
accuracy remained similar for Qwen, highlighting
a potential trade-off between fairness and
general capabilities that varies by architecture.
Figure 4: Accuracy versus Steering Vector Coefficient for the Mistral, Llama and Qwen models on BBQ and
MMLU. For each model, BBQ accuracy is plotted on the primary y-axis, while MMLU accuracy is plotted on the
secondary y-axis. Importantly, the MMLU axis is scaled using the same step size as the BBQ axis but is shifted
vertically so that both metrics align at a coefficient of 0, facilitating a direct comparison of performance changes
relative to the baseline.
Additionally, we investigate whether simply aggregating
all contrastive pairs across nine bias axes
into a single dataset has a similar effect to averaging
the steering vectors themselves. We create a
steering vector from this single large contrastive
dataset, named Merged Datasets in Table 3. We
observe performance below individual steering vec-
tors for both BBQ and MMLU in all three models,
and significantly below the baseline performance in
Mistral and Llama. This result suggests that highly
specialized, targeted contrastive datasets are more
effective than a one-size-fits-all approach, likely be-
cause overly general datasets fail to capture distinct
patterns, leading to weaker learned representations.
Thus, an alternative method of combining vectors
without dataset merging, such as SVE, is necessary.
We observe in Table 3 that SVEs outperform
all other methods on both BBQ and MMLU in all
cases, with the sole exception of MMLU on Mis-
tral. These results support our hypothesis outlined
in Section 3.3, validating the idea that averaging
across multiple bias concepts reduces variations
unrelated to bias, which reinforces a more general-
ized bias representation and mitigates the dataset
dependency issues that prevent generalization, as
discussed in Tan et al. (2024).
5.3 Relationship between BBQ and MMLU
We examine how bias mitigation, quantified via
BBQ accuracy, and general language performance,
measured by MMLU accuracy, vary as a function
of the steering vector coefficient. In our experi-
ments, the coefficient spans from -5 to 5, with 0
representing the baseline result (i.e., no steering
vector intervention). To facilitate a direct comparison
between the two metrics, we scale the MMLU
axis using the same step size as the BBQ axis and
shift it vertically so that both metrics align at a
coefficient of 0.
Figure 4 shows that for the Mistral model, in-
creasing the coefficient from 0 to 5 results in an im-
provement in BBQ accuracy from 53.6% to 69.5%,
while MMLU accuracy declines from 50.1% to
46.0%. In contrast, Llama and Qwen exhibit
more balanced responses, where MMLU remains
stable as BBQ accuracy increases.
These trends indicate that while steering vectors
can effectively enhance bias mitigation (as reflected
by improved BBQ scores), their influence on gen-
eral model performance is model-dependent. For
instance, stronger models such as Qwen, which
already demonstrate high baseline performance, ex-
hibit minimal variability in MMLU scores across
different coefficients, suggesting that steering vec-
tor interventions may become more effective as
models scale. Overall, these findings underscore
the importance of carefully calibrating the steering
vector coefficient for each model.
5.4 Generalization and Robustness
To assess the robustness of our steering vector methods,
we evaluate whether vectors optimized on one
bias axis generalize to intersectional bias domains
that were not used during training. Table 4 presents
the accuracies for two intersectional tasks, Race ×
Gender and Race × Socioeconomic, across Mistral,
Llama, and Qwen. These intersectional axes serve
as out-of-distribution test cases.
BBQ Axis        Mistral           Llama             Qwen
                R × G   R × SES   R × G   R × SES   R × G   R × SES
Baseline        55.0    55.7      80.0    83.3      86.6    89.2
Age             64.6    68.3      81.8    84.6      89.7    90.1
Appearance      66.3    66.4      80.9    83.3      86.5    87.8
Disability      71.7    68.2      86.5    86.2      89.8    90.7
Gender          70.8    68.4      87.4    83.0      87.9    88.3
Nationality     70.5    69.2      86.7    83.6      90.2    90.5
Race            68.4    62.3      84.4    84.3      90.2    90.6
Religion        64.6    68.3      87.5    86.5      86.3    87.1
Sexuality       62.6    66.0      86.3    87.7      85.7    87.9
Socioeconomic   63.5    65.6      87.5    86.3      87.9    90.0
SVE             64.5    68.3      87.3    86.9      89.3    90.2
Table 4: Baseline, 9 ISV, and SVE accuracies for Race × Gender and Race × Socioeconomic bias axes in Mistral,
Llama, and Qwen, shown as percentages. Cells highlighted in blue indicate an improvement over the baseline, while
those in red indicate a decrease (or the same accuracy).
Our results show that 5 of the 9 individual steering
vectors (Age, Disability, Nationality, Race and
Socioeconomic), as well as the SVE, outperform the
baseline on both intersectional axes in all three
models. This supports our hypothesis that SVE
delivers stable performance across both
in-distribution and out-of-distribution settings.
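The count of steering vectors that beat the baseline can be checked directly against Table 4. The script below copies the table (column order: Mistral R × G, Mistral R × SES, Llama R × G, Llama R × SES, Qwen R × G, Qwen R × SES) and lists the vectors that strictly exceed the baseline in every column. Reading "outperform the baseline" as a strict improvement in all six columns is an interpretation.

```python
BASELINE = [55.0, 55.7, 80.0, 83.3, 86.6, 89.2]

# Accuracies copied from Table 4, one row per steering vector.
TABLE4 = {
    "Age":           [64.6, 68.3, 81.8, 84.6, 89.7, 90.1],
    "Appearance":    [66.3, 66.4, 80.9, 83.3, 86.5, 87.8],
    "Disability":    [71.7, 68.2, 86.5, 86.2, 89.8, 90.7],
    "Gender":        [70.8, 68.4, 87.4, 83.0, 87.9, 88.3],
    "Nationality":   [70.5, 69.2, 86.7, 83.6, 90.2, 90.5],
    "Race":          [68.4, 62.3, 84.4, 84.3, 90.2, 90.6],
    "Religion":      [64.6, 68.3, 87.5, 86.5, 86.3, 87.1],
    "Sexuality":     [62.6, 66.0, 86.3, 87.7, 85.7, 87.9],
    "Socioeconomic": [63.5, 65.6, 87.5, 86.3, 87.9, 90.0],
    "SVE":           [64.5, 68.3, 87.3, 86.9, 89.3, 90.2],
}

def beats_baseline_everywhere(row):
    """True if the row strictly exceeds the baseline in every column."""
    return all(value > base for value, base in zip(row, BASELINE))

winners = [name for name, row in TABLE4.items() if beats_baseline_everywhere(row)]
isv_winners = [name for name in winners if name != "SVE"]
print(isv_winners, "SVE" in winners)
```

Under this reading, the Age, Disability, Nationality, Race and Socioeconomic vectors pass, along with the SVE. The other four vectors each fall to or below the baseline in at least one Llama or Qwen column.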
6 Conclusion
In this work, we applied steering vectors to bias mit-
igation in large language models and evaluated mul-
tiple approaches across three models. Our exper-
iments show that individually optimized steering
vectors led to significant improvements in BBQ ac-
curacy. Our use of Bayesian optimization enabled
us to systematically identify effective contrastive
datasets across nine bias axes, further refining the
tuning of individual steering vectors.
Building on these findings, we explored the cu-
mulative effects of combining multiple steering
vectors and introduced Steering Vector Ensembles
(SVE) as a generalizable and efficient strategy for
fairness interventions. We further analyzed the
impact of these interventions on overall model per-
formance using the MMLU benchmark, revealing
that the effect on performance varies across models.
Overall, our results demonstrate that SVE not only
enhances bias mitigation compared to individual
steering vectors but also provides a more robust
and generalized intervention, with promising impli-
cations for improving fairness and safety in large
language models.
6.1 Future Work
Steering vectors are a promising yet underexplored
direction for bias mitigation, and several avenues
exist to further develop this work.
Contrastive Datasets Although our work relied
on a uniform dataset format with variations in text
content, alternative contrastive dataset structures
such as those shown in Zou et al. (2023), and Rim-
sky et al. (2024) could be applied. In addition,
extending Bayesian optimization to include the se-
lection of layers for intervention, optimizing based
on accuracy improvements, represents a promising
direction.
Steering Vectors While we focus on BBQ and
MMLU, future studies could expand the evaluation
of steering vectors by employing additional bench-
marks. This broader evaluation could help address
current limitations and validate the generalizability
of our approach.
SVEs While Steering Vector Ensembles (SVE)
have shown promising improvements over indi-
vidual steering vectors, further work is needed to
determine the optimal combination of individual
steering vectors. Future research should explore
whether different subsets of steering vectors yield
more effective ensembles and consider alternative
aggregation methods such as weighted averages or
the median vector, which may be less susceptible
to outliers. Moreover, applying SVEs to additional
domains beyond bias mitigation in language models
will help establish the broader utility of this approach.
7 Limitations
Our experiments were conducted on 7B and 8B
parameter models, which may not fully capture
emergent abilities related to bias observed in larger
models, such as moral self-correction that tends to
emerge in models with 22B parameters or more,
as noted in Ganguli et al. (2023). Due to computa-
tional constraints, we were unable to evaluate such
larger models.
Our MMLU results suggest that steering vectors
have less impact on higher-performing models;
however, MMLU may not capture all aspects of
language understanding and reasoning. Incorporat-
ing additional benchmarks, such as GLUE (Wang
et al., 2018) and HellaSwag (Zellers et al., 2019),
would provide a more complete assessment of the
broader effects of steering vector interventions.
Ethics Statement
There is a potential for misuse of steering vectors,
as models can be steered to become more biased.
We encourage responsible use of these techniques
to improve the safety of AI systems.
Acknowledgements
We would like to thank Joanne Boisson and Hsu-
vas Borkakoty for their very helpful comments in
reviewing this paper. This work is funded in part
by the UKRI AIMLAC CDT.
References
AI@Meta. 2024. Llama 3 model card.
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka,
Nina Panickssery, Wes Gurnee, and Neel Nanda.
2024. Refusal in language models is mediated by
a single direction. In Advances in Neural Informa-
tion Processing Systems , volume 37, pages 136037–
136083. Curran Associates, Inc.
Amos Azaria and Tom Mitchell. 2023. The internal
state of an LLM knows when it's lying. In Findings
of the Association for Computational Linguistics:
EMNLP 2023, pages 967–976, Singapore. Association
for Computational Linguistics.
James Bergstra, Rémi Bardenet, Yoshua Bengio, and
Balázs Kégl. 2011. Algorithms for hyper-parameter
optimization. In Advances in Neural Information
Processing Systems, volume 24. Curran Associates,
Inc.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In
Proceedings of the 34th International Conference on
Neural Information Processing Systems, NIPS ’20,
Red Hook, NY, USA. Curran Associates Inc.
Collin Burns, Haotian Ye, Dan Klein, and Jacob Stein-
hardt. 2024. Discovering latent knowledge in lan-
guage models without supervision.
Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023.
Marked personas: Using natural language prompts to
measure stereotypes in language models. In Proceed-
ings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 1504–1532, Toronto, Canada. Association for
Computational Linguistics.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane
Hung, Eric Frank, Piero Molino, Jason Yosinski, and
Rosanne Liu. 2020. Plug and play language models:
A simple approach to controlled text generation.
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya
Krishna, Yada Pruksachatkun, Kai-Wei Chang, and
Rahul Gupta. 2021. Bold: Dataset and metrics for
measuring biases in open-ended language generation.
In Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency, FAccT ’21,
pages 862–872, New York, NY, USA. Association for
Computing Machinery.
Karen Fort, Laura Alonso Alemany, Luciana Benotti,
Julien Bezançon, Claudia Borg, Marthese Borg,
Yongjian Chen, Fanny Ducel, Yoann Dupont,
Guido Ivetta, Zhijian Li, Margot Mieskes, Marco
Naguib, Yuyan Qian, Matteo Radaelli, Wolfgang S.
Schmeisser-Nieto, Emma Raimundo Schulz, Thiziri
Saci, Sarah Saidi, Javier Torroba Marchante, Shilin
Xie, Sergio E. Zanotto, and Aurélie Névéol. 2024.
Your stereotypical mileage may vary: Practical chal-
lenges of evaluating biases in multiple languages and
cultural contexts. In Proceedings of the 2024 Joint
International Conference on Computational Linguistics,
Language Resources and Evaluation (LREC-COLING 2024),
pages 17764–17769, Torino, Italia.
ELRA and ICCL.
Deep Ganguli, Amanda Askell, Nicholas Schiefer,
Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen,
Anna Goldie, Azalia Mirhoseini, Catherine Olsson,
Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-
Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr,
Jared Mueller, Joshua Landau, Kamal Ndousse, Ka-
rina Nguyen, Liane Lovitt, Michael Sellitto, Nelson
Elhage, Noemi Mercado, Nova DasSarma, Oliver
Rausch, Robert Lasenby, Robin Larson, Sam Ringer,
Sandipan Kundu, Saurav Kadavath, Scott Johnston,
Shauna Kravec, Sheer El Showk, Tamera Lanham,
Timothy Telleen-Lawton, Tom Henighan, Tristan
Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann,
Dario Amodei, Nicholas Joseph, Sam McCandlish,
Tom Brown, Christopher Olah, Jack Clark, Samuel R.
Bowman, and Jared Kaplan. 2023. The capacity for
moral self-correction in large language models.
Vipul Gupta, Pranav Narayanan Venkit, Hugo Lau-
rençon, Shomir Wilson, and Rebecca J Passonneau.
2023. Calm: A multi-task benchmark for compre-
hensive assessment of language model bias. arXiv
preprint arXiv:2308.12539.
Dan Hendrycks, Collin Burns, Steven Basart, Andy
Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
hardt. 2021. Measuring massive multitask language
understanding. Proceedings of the International Conference
on Learning Representations (ICLR).
Rem Hida, Masahiro Kaneko, and Naoaki Okazaki.
2024. Social bias evaluation for large language
models requires prompt variations. ArXiv,
abs/2407.03129.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, Lélio Renard Lavaud,
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao,
Thibaut Lavril, Thomas Wang, Timothée Lacroix,
and William El Sayed. 2023. Mistral 7b.
Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer
Schütt, Oliver Bensch, Roxanne El Baff, Dominik
Opitz, and Tobias Hecking. 2024. Style Vectors for
Steering Generative Large Language Models. In
Findings of the Association for Computational Linguistics:
EACL 2024, pages 782–802, St. Julian’s,
Malta. Association for Computational Linguistics.
Po-Nien Kung and Nanyun Peng. 2023. Do models re-
ally learn to follow instructions? an empirical study
of instruction tuning. In Proceedings of the 61st An-
nual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 1317–1328,
Toronto, Canada. Association for Computational
Linguistics.
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter
Pfister, and Martin Wattenberg. 2024a. Inference-
time intervention: Eliciting truthful answers from a
language model. Advances in Neural Information
Processing Systems, 36.
Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Yiling Lou,
Tianlin Li, Weisong Sun, Yang Liu, and Xuanzhe Liu.
2024b. Benchmarking bias in large language models
during role-playing.
Samuel Marks and Max Tegmark. 2024. The geometry
of truth: Emergent linear structure in large language
model representations of true/false datasets. In First
Conference on Language Modeling.
Kiho Park, Yo Joong Choe, and Victor Veitch. 2024.
The linear representation hypothesis and the geome-
try of large language models. In Proceedings of the
41st International Conference on Machine Learning,
ICML’24. JMLR.org.
Alicia Parrish, Angelica Chen, Nikita Nangia,
Vishakh Padmakumar, Jason Phang, Jana Thompson,
Phu Mon Htut, and Samuel Bowman. 2022. BBQ:
A hand-built bias benchmark for question answering.
In Findings of the Association for Computational
Linguistics: ACL 2022, pages 2086–2105, Dublin,
Ireland. Association for Computational Linguistics.
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong,
Evan Hubinger, and Alexander Turner. 2024. Steer-
ing Llama 2 via Contrastive Activation Addition. In
Proceedings of the 62nd Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 15504–15522, Bangkok, Thailand.
Association for Computational Linguistics.
Nihar Sahoo, Pranamya Kulkarni, Arif Ahmad, Tanu
Goyal, Narjis Asad, Aparna Garimella, and Pushpak
Bhattacharyya. 2024. IndiBias: A benchmark dataset
to measure social biases in language models for In-
dian context. In Proceedings of the 2024 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers), pages 8786–8806,
Mexico City, Mexico. Association for Computational
Linguistics.
Abel Salinas and Fred Morstatter. 2024. The butterfly
effect of altering prompts: How small changes and
jailbreaks affect large language model performance.
In Findings of the Association for Computational
Linguistics: ACL 2024, pages 4629–4651, Bangkok,
Thailand. Association for Computational Linguistics.
Zara Siddique, Liam Turner, and Luis Espinosa-Anke.
2024. Who is better at math, jenny or jingzhen?
uncovering stereotypes in large language models.
In Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing, pages
18601–18619, Miami, Florida, USA. Association for
Computational Linguistics.
Nishant Subramani, Nivedita Suresh, and Matthew Pe-
ters. 2022. Extracting latent steering vectors from
pretrained language models. In Findings of the Association
for Computational Linguistics: ACL 2022,
pages 566–581, Dublin, Ireland. Association for
Computational Linguistics.
Alex Tamkin, Amanda Askell, Liane Lovitt, Esin
Durmus, Nicholas Joseph, Shauna Kravec, Karina
Nguyen, Jared Kaplan, and Deep Ganguli. 2023.
Evaluating and mitigating discrimination in language
model decisions. ArXiv, abs/2312.03689.
Daniel Chee Hian Tan, David Chanin, Aengus Lynch,
Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-
Alonso, and Robert Kirk. 2024. Analysing the gen-
eralisation and reliability of steering vectors. In The
Thirty-eighth Annual Conference on Neural Information
Processing Systems.
Alexander Matt Turner, Lisa Thiergart, Gavin Leech,
David Udell, Juan J. Vazquez, Ulisse Mini, and
Monte MacDiarmid. 2024. Steering Language Mod-
els With Activation Engineering. ArXiv:2308.10248.
Yixin Wan, George Pu, Jiao Sun, Aparna Garimella,
Kai-Wei Chang, and Nanyun Peng. 2023. “kelly
is a warm person, joseph is a role model”: Gender
biases in LLM-generated reference letters. In Findings
of the Association for Computational Linguistics:
EMNLP 2023, pages 3730–3748, Singapore.
Association for Computational Linguistics.
Alex Wang, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel Bowman. 2018. GLUE:
A multi-task benchmark and analysis platform for nat-
ural language understanding. In Proceedings of the
2018 EMNLP Workshop BlackboxNLP: Analyzing
and Interpreting Neural Networks for NLP, pages
353–355, Brussels, Belgium. Association for Com-
putational Linguistics.
Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi,
Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi
Yang, Jindong Wang, and Huajun Chen. 2024a.
Detoxifying large language models via knowledge
editing. In Proceedings of the 62nd Annual Meeting
of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 3093–3118, Bangkok,
Thailand. Association for Computational Linguistics.
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu,
Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-
Gang Jiang, Yu Qiao, and Yingchun Wang. 2024b.
Fake alignment: Are LLMs really aligned well? In
Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies
(Volume 1: Long Papers), pages 4696–4712, Mexico
City, Mexico. Association for Computational Lin-
guistics.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu,
Adams Wei Yu, Brian Lester, Nan Du, Andrew M.
Dai, and Quoc V. Le. 2021. Finetuned language models
are zero-shot learners. ArXiv, abs/2109.01652.
Chen Xu, Wenjie Wang, Yuxin Li, Liang Pang, Jun
Xu, and Tat-Seng Chua. 2024. A study of implicit
ranking unfairness in large language models. In Find-
ings of the Association for Computational Linguistics:
EMNLP 2024, pages 7957–7970, Miami, Florida,
USA. Association for Computational Linguistics.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui,
Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jian-
hong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang,
Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu,
Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng
Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,
Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren,
Yang Fan, Yang Su, Yichang Zhang,
Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and
Zihan Qiu. 2025. Qwen2.5 technical report.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali
Farhadi, and Yejin Choi. 2019. HellaSwag: Can a ma-
chine really finish your sentence? In Proceedings of
the 57th Annual Meeting of the Association for Computational
Linguistics, pages 4791–4800, Florence,
Italy. Association for Computational Linguistics.
Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta,
Tatsunori Hashimoto, and Daniel Kang. 2024. Re-
moving RLHF protections in GPT-4 via fine-tuning.
In Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies
(Volume 2: Short Papers), pages 681–687, Mexico
City, Mexico. Association for Computational Lin-
guistics.
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B.
Brown, Alec Radford, Dario Amodei, Paul Christiano,
and Geoffrey Irving. 2019. Fine-tuning language
models from human preferences. ArXiv,
abs/1909.08593.
Andy Zou, Long Phan, Sarah Chen, James Campbell,
Phillip Guo, Richard Ren, Alexander Pan, Xuwang
Yin, Mantas Mazeika, Ann-Kathrin Dombrowski,
Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan
Wang, Alex Mallen, Steven Basart, Sanmi Koyejo,
Dawn Song, Matt Fredrikson, J. Zico Kolter, and
Dan Hendrycks. 2023. Representation Engineer-
ing: A Top-Down Approach to AI Transparency.
ArXiv:2310.01405.
A Additional Analysis
[Figure 5 image: a grid of cosine similarity matrices, one panel per hidden layer (1 to 30). Both axes of each panel list the nine BBQ bias axes: age, appearance, disability, gender, nationality, race, religion, sexuality, socioeconomic. Color scale: cosine similarity, from −0.8 to 1.]
Figure 5: The full cosine similarity matrix over all the hidden layers for the 9 BBQ steering vectors for Mistral.
Figure 6: The full cosine similarity matrix over all the hidden layers for the 9 BBQ steering vectors for Llama.
Figure 7: The full cosine similarity matrix over all the hidden layers for the 9 BBQ steering vectors for Qwen.
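The matrices in Figures 5 to 7 compare every pair of bias-axis steering vectors at a given hidden layer. A minimal sketch of that computation for a single layer is given below. The function names, the dictionary layout, and the toy 2-dimensional vectors are illustrative assumptions and do not reproduce the actual analysis code or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors_by_axis):
    """Pairwise cosine similarities, keyed by bias-axis name."""
    axes = list(vectors_by_axis)
    return {
        a: {b: cosine_similarity(vectors_by_axis[a], vectors_by_axis[b]) for b in axes}
        for a in axes
    }


# Hypothetical toy vectors standing in for one layer's steering vectors.
toy = {
    "age": [1.0, 0.0],
    "gender": [0.0, 2.0],
    "race": [-3.0, 0.0],
}

matrix = similarity_matrix(toy)
for axis, row in matrix.items():
    print(axis, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal entry is 1 by construction. Off-diagonal values near 1 would indicate that two axes yield nearly parallel steering directions at that layer, and values near −1 would indicate opposing directions.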
[Figure 8 image: one cosine similarity matrix per steering vector (age, appearance, disability, gender, nationality, race, religion, sexuality, socioeconomic). Both axes of each panel list the hidden layers, with tick labels at odd-numbered layers from 1 to 31. Color scale: cosine similarity, from −0.8 to 1.]
Figure 8: A clustering in the similarities of the steering vectors for the 9 BBQ axes can be observed for later layers
and layers that are closer together for Mistral. The layer at which the largest cluster appears is dataset dependent, e.g.
hidden layer 19 for the age axis and layer 15 for the socioeconomic axis.