Authors: Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
Paper Content:
Page 1:
SPRI: Aligning Large Language Models with Context-Situated Principles
Hongli Zhan†1Muneeza Azmat2Raya Horesh2Junyi Jessy Li1Mikhail Yurochkin2 3
Abstract
Aligning Large Language Models to integrate and
reflect human values, especially for tasks that de-
mand intricate human oversight, is arduous since
it is resource-intensive and time-consuming to
depend on human expertise for context-specific
guidance. Prior work has utilized predefined
sets of rules or principles to steer the behavior
of models (Bai et al., 2022b; Sun et al., 2023).
However, these principles tend to be generic,
making it challenging to adapt them to each in-
dividual input query or context. In this work,
we present SITUATED -PRI NCIPLES (SPRI ), a
framework requiring minimal or no human effort
that is designed to automatically generate guid-
ing principles in real-time for each input query
and utilize them to align each response. We
evaluate SPRI on three tasks, and show that 1)
SPRI can derive principles in a complex domain-
specific task that leads to on-par performance as
expert-crafted ones; 2) SPRI -generated princi-
ples lead to instance-specific rubrics that outper-
form prior LLM-as-a-judge frameworks; 3) us-
ingSPRI to generate synthetic SFT data leads
to substantial improvement on truthfulness. We
release our code and model generations at https:
//github.com/honglizhan/SPRI-public .
1. Introduction
Large Language Models (LLMs) have showcased impres-
sive performance across diverse applications (Achiam et al.,
2024; Dubey et al., 2024; Yang et al., 2025; Jiang et al.,
2024; Groeneveld et al., 2024). However, in more com-
plex tasks, human-expert-crafted prompts are required to
achieve the desired level of performance. For example,
Zhan et al. (2024) showed that LLMs are capable of gen-
†Work started and partially done during Hongli’s intern-
ship at IBM Research.1Department of Linguistics, The
University of Texas at Austin, Austin, TX, USA2IBM Re-
search, Yorktown Heights, NY , USA3MIT-IBM Watson AI
Lab, Cambridge, MA, USA. Correspondence to: Hongli Zhan
<honglizhan@utexas.edu >.
Acknowledge the narratorʼs emotional response without judgment,
while gently guiding them to reframe their perception of
responsibility … Suggest that the narratorʼs past experiences (e.g.,
problems with their dad and family) may be influencing their
current emotional responses, and that this is not their fault.
Encourage self-reflection to identify whether there are any patterns
or triggers that contribute to their feelings of insecurity and hurt …
Even when people are clearly joking I still get insecure
and a little hurt. I do my best not to show it but i think
to the more perceptive folks itʼs probably obvious …
Itʼs so stupid. I know itʼs rooted deeper like problems I
have with my dad and family and being accepted but
it still annoys me. Is there any fix to this?
Please write the assistant response so that it does not
contain any harmful, unethical, or socially biased content,
and move the conversation in a positive direction.
Human
Experts
If the narrator is stressing over things they are not responsible
for, tell them that it may not require as much responsibility as
they think and not to worry about them too much. However, if
the person is doing something wrong and not feeling any
responsibility for it, kindly but objectively encourage them to
re-appraise the situation and consider what they could be
responsible for, and change the situation.
SPRI w/
GPT-4o
(mini)
Generic
Rules
User Figure 1. Using SPRI ,GPT-4o-mini can generate situated and
detailed principles to guide the response to a person narrating
in distress. Compared with generic rules (Bai et al., 2022b) and
human-expert-crafted principles (Zhan et al., 2024), SPRI requires
minimal to no human efforts yet produces context-specific guid-
ance for every query at hand.
erating high-quality cognitive reappraisals when guided by
“constitutions” written by clinical psychologists with doc-
toral degrees.1LLM-as-a-judge (Zheng et al., 2023) is an-
other prominent application that typically requires carefully
crafted evaluation criteria to align with human annotators
(Yu et al., 2023; Hashemi et al., 2024; Ye et al., 2024).
To better guide LLMs, several prior works utilized principles
or constitutions in the context of synthetic data generation
for alignment (Bai et al., 2022b; Sun et al., 2023). Such
approaches are effective at reducing data annotation efforts,
however, they are limited by the general nature of such
principles making them hard to interpret in a given context,
even for humans (Kirk et al., 2023a;b). For example, Bai
et al. (2022b) employed the constitutional principle “Iden-
tify specific ways in which the assistant’s last response is
1Cognitive reappraisal is a strategy commonly practiced by
clinical psychologists to foster long-term emotional well-being
(Arnold, 1960; Gross & John, 2003; Yeo & Ong, 2023). See
Appendix §D for more details.
1arXiv:2502.03397v1 [cs.CL] 5 Feb 2025
Page 2:
SPRI: Aligning Large Language Models with Context-Situated Principles
Generate initial
principles User Input
Seed ( Instruction,
Principle ) Examples Stage 1: Generate a set of principles to
guide the response to the userʼs input
Are the principles
useful enough to
guide the response?
Base LLM
YES
NO
Refine the principles
based on feedback
Base LLM Critic Model
Final
Principles Generate an initial
response that adheres
to the principles from
Stage 1 Stage 2: Generate a response to the userʼs
input by adhering to the principles
Does the response
align well with the
principles?
Base LLM
YES
NO
Refine the response
based on feedback
Base LLM Critic Model
Final Response
Figure 2. Overview for SPRI , which consists of two stages: 1) producing a set of principles specifically tailored to the user’s input T, and
2) utilizing the generated principles to guide the response to T. Both stages include a critique-refine process involving a separate critic
model, which aims to scrutinize the fitness of the principles to Tand the final responses’ adherence to the generated principles.
harmful, unethical, racist, sexist, toxic, dangerous, or illegal”
to critique and refine model responses. The precise meaning
ofharmful orunethical is often situation-dependent limiting
the effectiveness of the principle when aligning to nuanced
human values. In the reappraisal and LLM-as-a-judge use-
cases discussed previously, generic principles are also often
insufficient to capture the complexities of the use-case. For
example, Kim et al. (2024a) use human annotators to craft
instance-specific evaluation criteria for LLM judges for their
open-ended generation benchmark, which is a considerable
amount of human effort. We provide an example in the
context of reappraisal in Figure 1.
We propose SITUATED -PRI NCIPLES (SPRI ), a framework
designed to automatically generate constitutional principles
specifically tailored to that input query in real-time and
utilize them to align each response. SPRI utilizes a base
model and a critic model, and its algorithm consists of two
stages. The first stage consists of a base model that comes up
with principles and a critic model that helps the base model
to iteratively refine the principles. The second stage then
applies the principles to direct the base model’s response
to the specific user’s input. The critic model reviews the
response using the principles as criteria, and the base model
adjusts the response according to the feedback from the
critic model. Importantly, the critic model does notneed to
be stronger or larger than the base model. We illustrate our
framework in Figure 2.
We evaluate SPRI in three situations:
(1)We consider a domain-specific task where expert-level
complex principles were shown to be necessary: hav-
ing LLMs produce cognitive reappraisals ( §4.1). We
show that models using principles derived from SPRI
perform on-par with those using principles crafted by
professional psychologists.(2)Evaluation of open-ended generations across complex
tasks with LLM judges. We show that principles from
SPRI result in correlation with human judgments on
par with instance-specific human curated evaluation
rubrics and outperform prior LLM-judge frameworks
(§4.2).
(3)Generating synthetic data with SPRI proves effective
for fine-tuning base LLMs, resulting in substantial im-
provement on TruthfulQA (Lin et al., 2022), whilst
maintaining performance on other benchmarks (§5).
2. Related Work
Scalable Oversight. In order to minimize the amount of
human oversight necessary to align LLMs, Bai et al. (2022b)
introduced Constitutional AI, a method relying on a list of
predefined hand-crafted rules or constitutional principles
that aim to promote safe, reliable, and effective systems.
Leveraging Reinforcement Learning from AI Feedback
(RLAIF) (Lee et al., 2024), Constitutional AI uses these
principles to create AI-generated self-critiques to enhance
the models autonomously. During the self-critique process,
however, only a single rule is randomly chosen to scruti-
nize the existing response. Sun et al. (2023) improves on
this approach by incorporating 16manually-devised guid-
ing principles that entail broader domains and more specific
criteria, such as candorness, step-by-step justifications, and
multi-faceted answers. By broadening the range of topics,
they allow the language model to decide which principles
to adhere to given user queries. However, these approaches
are resource-intensive and demand significant human labor,
as they necessitate explicitly predefined guiding principles.
Prior work has recognized the importance of guiding LLM
generations using principles situated in the particular con-
text at hand, such as allowing users to formulate principles
2
Page 3:
SPRI: Aligning Large Language Models with Context-Situated Principles
that steer the conversation (Petridis et al., 2024b). How-
ever, relying solely on human interactions to provide such
context-situated guidance is challenging to scale. In Chen
et al. (2024), strong LLMs are used to discover principles for
a weak LLM. In this red-teaming approach, both a stronger
LLM and an initial badresponse are necessary, thus diffi-
cult to generalize. Petridis et al. (2024a) also introduces a
method for learning a collection of constitutional principles
given a cluster of training data. The training is conducted
on various clusters of data, resulting in different sets of prin-
ciples. At inference time, input queries are then directed to
different principles based on their similarity to the centroids
of the training clusters. Similarly, OpenAI o1 models (Jaech
et al., 2024) utilize a technique entitled Deliberative Align-
ment (Guan et al., 2025), which teaches LLMs to explicitly
reason through safety specifications before producing an
answer, but their approach mainly seeks to align and train a
downstream model.
In contrast, our method customizes the principles for each
individual input query, rather than basing them on a set of
undesirable responses or a cluster of training data. This en-
sures that the principles are not generalized but specifically
tailored to each unique input query, making our constitu-
tional principles more precise. Our framework is also more
versatile and not restricted to supervised fine-tuning. As
demonstrated in §4,SPRI can effortlessly extend to com-
plex tasks that require significant human oversight.
Learning from Feedback. To align AI systems with hu-
man preferences and values, researchers have explored using
human feedback to direct the behaviors of language models
(Kirk et al., 2023a). This includes efforts to incorporate
human feedback in the pertaining (Korbak et al., 2023) and
supervised fine-tuning phases (Hancock et al., 2019; Liu
et al., 2024), integrate human feedback through reinforce-
ment learning either directly (Stiennon et al., 2020; Bai et al.,
2022a; Bakker et al., 2022; Ouyang et al., 2022; Liu et al.,
2022) or indirectly (Zhou et al., 2021; Korbak et al., 2023),
as well as prompt engineering (Jin et al., 2022; Zhao et al.,
2021; Askell et al., 2021). However, human feedback is
expensive and laborious to collect (Lee et al., 2024). Other
works have therefore resorted to using machine-generated
feedback for improving the model outputs (Bai et al., 2022b;
Yang et al., 2022; Lee et al., 2024; Fu et al., 2024; Cui et al.,
2024; Madaan et al., 2023). Our approach differs from these
methods by focusing on refining the principles tailored to
each input, in addition to refining the outputs. These prin-
ciples are then used to guide the generation of responses
for each corresponding input and serve as the criteria for
critiquing and improving the responses.3. SPRI: A Scalable Alignment Framework
with Minimal Human Oversight
We present SITUATED -PRI NCIPLES (SPRI ), a framework
that generates context-situated principles to align LLMs
while minimizing human oversight. The framework relies
on two ingredients: a base model Mand a critic model C.
An overview of SPRI is shown in Figure 2. To generate
an aligned response, SPRI goes through two steps: during
thefirst stage,Mtakes in the user’s input Tand gener-
ates a set of principles customized to Tthrough a series of
critique-refinement loops with C; then in the second stage,
the generated principles are fed into Mto guide its response.
These principles also serve as criteria to provide feedback
on the generated responses for improvement. We provide
the pseudo-code algorithms in Appendix §A.
Stage I: Synthesizing Context-Situated Principles.
Based on a user’s input T, the objective of the first step
is to generate guiding principles tailored to T. Given T, the
base model Mis prompted with Pprinciple-gen to produce an
initial set of principles, K0, as follows:
K0=M(T⊕Pprinciple-gen ⊕S), (1)
where⊕denotes concatenation and Pprinciple-gen is a prompt
instructing the model to generate principles (see Appendix
§B). A set of seed (instruction, principle) tuples, denoted
asS, can also be provided as few-shot examples for the
model to better grasp the essence of desired principles. We
note that the provision of seed examples is optional: this
initial principle-generation phase can be rendered under a
zero-shot setting.
As the next step, we need to determine the adequacy of K0
and assess whether it is suitable for guiding the response to
T. We use the critic model Cto yield feedback on K0:
Feedback K0=C(Eval principle ⊕T⊕K0). (2)
Here, Eval principle is a chain-of-thought (Wei et al., 2022)
style evaluation prompt in the format of direct assessment
(Kim et al., 2024b) that instructs Cto produce both quali-
tative feedback and a numerical score (on a 1to5Likert
scale). The feedback is fed back into the base model M,
prompting it to refine the principles:
Ki=M(Pprinciple-refine ⊕T⊕Ki−1⊕Feedback Ki−1),
(3)
where Pprinciple-refine is a prompt instructing the model to
refine principles based on feedback. This iterative critique-
refinement process continues until the principles receive a
desired score of at least 4or a maximum of four iterations
is reached. We denote the final set of principles deemed
suitable to guide the response to TasKfinal.
3
Page 4:
SPRI: Aligning Large Language Models with Context-Situated Principles
Stage II: Generating Responses Guided by Synthesized
Principles. We use the established principles Kfinalto
guideM’s response to T. The initial response generation
process can be expressed as:
R0=M(T⊕Presponse-gen ⊕Kfinal), (4)
where Presponse-gen is a prompt that instructs Mto respond.
R0is then examined by the critic model Cfor feedback,
with the principles Kfinalbeing the rubrics:
Feedback R0=C(Eval response⊕T⊕Kfinal⊕R0).(5)
Similar to Stage I,Eval response is a direct assessment prompt
that elicits feedback and a score from C. If the evaluation
score is below 4or the maximum number of iterations is not
reached, the feedback is passed back to the base model M
to iteratively refine its response:
Ri=M(Presponse-refine ⊕T⊕Ri−1⊕Feedback Ri−1).
(6)
Here, Presponse-refine is a prompt asking the model to refine
the response based on feedback. We denote the final refined
response as Rfinal. By iteratively refining both the guid-
ing principles and the response, SPRI ensures that Rfinal
aligns closely with the user’s input Tand the generated
principles Kfinalwith minimal to no human intervention.
While the critique-refine process in Stage II of SPRI shares
similarities with self-refine (Madaan et al., 2023), it is dis-
tinctly guided by context-situated principles Kfinalgenerated
from Stage I. SPRI is easy to scale and can be dynamically
adapted to diverse user inputs and tasks: not only can it
extrapolate to complex tasks such as providing emotional
support ( §4.1) or performing instance-specific evaluation
(§4.2), but it also performs well on providing training data
for large-scale alignment (§5).
4. SPRI for Complex Principles
We examine the effectiveness of SPRI on complex real-
world tasks, one where LLMs are shown only to be success-
ful if provided with complex, expert-curated principles in
the prompt (Zhan et al., 2024), another on a larger bench-
mark where manually curated situation-specific rubrics are
necessary (Kim et al., 2024a). We show that SPRI gen-
erates effective principles for complex tasks in the former
(§4.1), and also generates evaluation rubrics for instance-
level assessment in the latter ( §4.2). We provide example
SPRI-generated principles in Appendix §I.
4.1. Can SPRI Guide Cognitive Reappraisals?
We explore how SPRI can be applied to facilitate cognitive
reappraisals , a strategy widely recognized by psychology
practitioners that aims to promote long-term mental well-
being for an individual (Gross, 1998; Gross & John, 2003;
Waugh et al., 2016). Recently, Zhan et al. (2024) showedthat complex principles crafted by professional psycholo-
gists used in LLM prompts enables the models to perform
this complex task. An oracle principle is used for each
individual appraisal dimension (refer to Appendix §D for
details). This is an ideal testbed for SPRI to dynamically
generate complex context-specific principles to guide the
elicitation of reappraisal responses. By developing a unique
set of principles from scratch for each individual user query,
we show performance comparable to those guided by oracle
principles while minimizing human supervision.
Data. We evaluate on the same dataset from Zhan et al.
(2024). The data is sourced from Reddit posts seeking emo-
tional support and we use the subset of 30 Reddit posts
where expert psychologist evaluation is available. The aver-
age post length is 170.5tokens (SD = 99.2).
Baselines. We first explore two principle-free methods ,
including 1)vanilla , a weak baseline in which a generic
prompt “ help the narrator of the text reappraise the situa-
tion” is used to elicit a straightforward reappraisal response
from the language model. 2)self-refine (Madaan et al.,
2023), which builds on the vanilla prompt by incorporating
a single feedback repeatedly six times: “ please revise the
reappraisal response to help the narrator reappraise the
situation better .” This serves as a baseline for refinement
without guidance. Additionally, we also experiment with
anoracle-informed method that leverages predefined reap-
praisal principles in the prompts: 3)+oracle , where we
provide the language model with the detailed, expert-crafted
reappraisal constitutional principles from RESORT. This
offers insight into how SPRI performs relative to systems
with access to expert-designed guidelines.
SPRI Method. To increase the stability of the principle
generation process, we provide SPRI with a single oracle
RESORT constitution as the seed example.
Evaluation & Criteria. We adopt the evaluation schema
from Zhan et al. (2024), which is comprised of 4criteria
that extensively assess the quality of reappraisals generated
by LLMs, namely: 1) Alignment with Reappraisal Con-
stitutions , which assesses whether the reappraisal response
adheres to the oracle constitutions specified by Zhan et al.
(2024). Responses are rated from 1to10, with 1being
“Least Aligned” and10being “Most Aligned” .2) Empathy ,
which evaluates whether the reappraisal response shows
empathy towards the narrator of the Reddit post on a scale
from 1to5, with 1being “Least Empathetic” and5indicat-
ing“Most Empathetic” . We consider these two metrics the
key to evaluating reappraisals. In addition, we also look at
the3) Harmfulness of the response, checking whether the
response contains any unethical or harmful content, with op-
tions being “Harmful” (1) and “Not Harmful” (0). Finally,
4
Page 5:
SPRI: Aligning Large Language Models with Context-Situated Principles
4) Factuality measures whether the response is factually
consistent in relation to the given Reddit Post, with options
“Yes” (1),“Minor Error” (0.5), and “No” (0). We leave the
results for these two dimensions in Appendix §F.
We carry out automatic evaluation on all reappraisal re-
sponses elicited using GPT-4-0613 , using the method from
(Zhan et al., 2024) which showed strong correlation with
evaluation results conducted by professional psychologists.
Experimental Setup. We experiment with a comprehen-
sive suite of state-of-the-art LLMs, including GPT-4o-mini
(Hurst et al., 2024), Llama-3.1-70B-Instruct and
Llama-3-8B-Instruct (Dubey et al., 2024), as well as
Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024). In the
SPRI method, these models act as the base model M. We
employ Prometheus-2-8x7B (Kim et al., 2024b), a mixture-
of-experts model developed specifically for the task of giv-
ing feedback, as the critic model Cfor all SPRI experiments.
We set the temperature T= 0.7for model inferencing.
Results. We show the results in Table 1.2First, we note
that oracle-informed approaches significantly outperform
principle-free baselines. Notably, incorporating oracle prin-
ciples in the prompt ( oracle principles ) increases mod-
els’ performance over vanilla andself-refine methods
by an average of 11.3%and 16.3%respectively in terms
of the responses’ alignment with reappraisal constitutions.
On the other hand, SPRI consistently outperforms meth-
ods that lack access to oracle principles both in terms
of reappraisal alignment and perceived empathy, even
though it only utilizes a single seed principle. Specifically,
we obtain an average improvement of 6.1%in alignment
and8.4%in empathy over our strongest vanilla baseline.
Moreover, our SPRI approach also significantly surpass the
self-refine method by as much as 11.0%in alignment
and12.1%in empathy. These results suggest that tailoring
context-situated principles can achieve performance com-
parable to those with oracle guidance, even for a task as
complex as offering psychologically grounded emotional
support.
4.2. Can SPRI Generate Fine-Grained Rubrics?
We further investigate SPRI ’s capability to handle case-
by-case nuances by examining its ability to generate fine-
grained evaluation rubrics for each individual instance. We
utilize BiGGen Bench (Kim et al., 2024a), an extensive
benchmark designed to assess the performance of LLMs
across a variety of tasks using language models. BiGGen
Bench stands out due to its use of instance-specific evalu-
2Zhan et al. (2024) presented two strategies to incorporate the
oracle principles, and we report the better one here. Please see
Appendix §F Figure 5 for the full results with both strategies.ation rubrics, each meticulously curated to ensure detailed
and contextually rich assessments. We detail the BiGGen
Bench dataset in Appendix §E. While these human-crafted
criteria allow for a fine-grained analysis of models’ perfor-
mance on each individual case , the manual creation of such
detailed rubrics is both labor-intensive and time-consuming.
To mitigate this bottleneck, we propose leveraging SPRI
to automate the rubric generation process. Specifically, we
hypothesize that LLMs, when guided by the SPRI frame-
work, can produce evaluation rubrics from scratch that
align closely with human-annotated ones in quality and
contextual specificity for each individual evaluation in-
stance .
Data. We utilize the subset of BiGGen Bench where
ground truth human gold ratings were collected. Specif-
ically, we focus on 8different capabilities, namely
instruction-following ,refinement ,theory of mind ,ground-
ing,reasoning ,planning ,tool usage , and safety . This results
in a total of 2,780(response, gold rating) pairs, spanning
across 695evaluation instances.
Baselines. Similar to the setup in §4.1, we first experiment
with eliciting evaluation rubrics using instance-agnostic
methods , namely 1)vanilla , a weak baseline where we
use a generic prompt “ How well does the response address
the instruction? Please rate on a scale of 1 to 5, where 1
stands for ‘not at all’ and 5 stands for ‘perfectly’ ” to evoke
a pristine judgment from the language model. 2)self-refine
(Madaan et al., 2023), where the vanilla prompt is formu-
lated as repeated feedback, a baseline for refinement without
guidance. Please note that we do not set a “sufficient” stop-
ping criteria here, but instead only impose a max iteration
of6, as in practice we find that the model tends to rate all
of its responses sufficient with no need for refinement. 3)
MT-Bench rubric (Zheng et al., 2023), a coarse-grained cri-
teria that assesses the quality of the response from aspects
including helpfulness, relevance, accuracy, depth, creativity,
and the level of detail. 4)FLASK rubric (Ye et al., 2024), a
set of domain-specific criteria that covers areas like logical
robustness, factuality, commonsense understanding, compre-
hension, insightfulness, meta-cognition, and harmlessness.
We further experiment with an oracle-informed method :
5)oracle rubrics , where the human-crafted ground truth
criteria from Kim et al. (2024b) are provided to evaluator
LMs as rubrics.
SPRI Methods. To increase the stability of the princi-
ple generation process, we augment SPRI with 3instance-
rubric pairs from BiGGen Bench as seed examples for each
capability. Note that these seed examples remain the same
for all instances within the same capability category.
5
Page 6:
SPRI: Aligning Large Language Models with Context-Situated Principles
Table 1. Evaluation results (in average scores) for reappraisal responses. We report statistical significance (with p <0.05) using pair-wise
t-tests against both the vanilla (marked with *) and self-refine (marked with †) baselines. Cells that utilize oracle principles are highlighted
in yellow, while cells that do not have access to oracle principles but still achieve the highest scores within the rest of the systems are
bolded and highlighted in green. For the full results, see Appendix §F Figure 5.
GPT-4o-mini Llama-3.1-70B-Instruct Llama-3-8B-Instruct Mixtral-8 ×7B-Instruct
Alignment ↑Empathy ↑ Alignment ↑Empathy ↑ Alignment ↑Empathy ↑ Alignment ↑Empathy ↑
Scale of 10 Scale of 5 Scale of 10 Scale of 5 Scale of 10 Scale of 5 Scale of 10 Scale of 5
vanilla 7.90 4 .50 7 .77 4 .43 7 .10 3 .90 7 .53 4 .50
self-refine 7.73 4 .53 7 .50 4 .27 7 .20 4 .07 6 .60 3 .90
SPRI 8.00†4.73 8.17*†4.77*†7.90*†4.47*†8.03*†4.77*†
oracle principles 8.67*†4.80*†8.53*†4.20 8.33*†4.30* 8.17 4.07
Table 2. Results for BiGGen Bench. Evaluation carried out without
the use of reference answers. Cells that utilize oracle rubrics are
highlighted in yellow, whereas cells that do not have access to
oracle rubrics but still achieve the highest scores within the rest of
the systems are bolded and highlighted in green. See Appendix
§G Table 6 for the full results.
GPT-4o
miniLlama-3.1-70B
InstructMixtral-8x7B
InstructPrometheus-2
8x7B
vanilla 0.377 0 .386 0.307 0.311
self-refine 0.397 0 .260 0 .110 0 .297
MT-Bench rubric 0.416 0 .421 0 .273 0 .289
FLASK rubric 0.358 0 .360 0 .277 0 .294
SPRI 0.472 0.480 0.288 0.333
oracle rubrics 0.550 0.556 0.367 0.386
Experimental Setup. We experiment with a
comprehensive suite of state-of-the-art LLMs, in-
cluding GPT-4o-mini , Llama-3.1-70B-Instruct ,
Mixtral-8x7B-Instruct-v0.1 , as well as
Prometheus-2-8x7B . In the SPRI methods, these models
act as the base model M. We employ Prometheus-2-8x7B
as the critic model Cfor all SPRI experiments.
Evaluation. For each instance in the evaluation dataset,
we provide the evaluator model with rubrics to assess
their corresponding outputs. We use the template from
Prometheus (Kim et al., 2024b) to prompt the evaluator
model. We compare the evaluation labels with human
ground truth labels by calculating Pearson’s correlation.
Note that in the BiGGen Bench dataset, each instance is
also accompanied by a reference answer. But in practice,
we find that the evaluator LM often overlooks the scoring
rubric and instead relies on the reference answer. To ablate
the influence of the scoring rubrics in our experiments, we
don’t use reference answers throughout the evaluation.
Results. We provide the average Pearson’s correlation
to ground truth human labels in Table 2. Similar to the
results from cognitive reappraisals ( §4.1), systems withaccess to oracle rubrics outperform methods employing
instance-agnostic rubrics by a considerable margin. The
coarse-grained MT-Bench rubric leads to a moderate per-
formance among the instance-agnostic baselines, whereas
the domain-specific FLASK rubric often lags behind. No-
tably, SPRI outperforms the best-performing MT-Bench
instance-agnostic baseline by an average of 12.1%, while
only relying on 3oracle rubrics as seeds. Although
oracle rubrics exceeds SPRI in performance, the dif-
ference is relatively small, leading to an average margin of
only 0.07in Pearson’s correlation across all models. These
results, combined with the findings in §4.1, underscore the
potential of SPRI in enhancing the LLMs’ robustness for
tasks that require complex principles and guidance.
4.3. Ablation Study
To better tease apart and analyze the success of SPRI , we
study the impact of seed examples provided in the initial
principle generation stage. We first remove seed exam-
ples from the SPRI pipeline. We denote this approach
by-seed=[none] . In order to further demonstrate the ro-
bustness of SPRI , we insert generic principles (shown in
Appendix §C Figure 3) as seed examples, and denote this
modification as -seed=[default principles] . We show-
case the results in Table 3. Removing seed examples entirely
leads to an average performance degradation of 4.13% in
alignment for reappraisals and 13.37% in Pearson’s correla-
tion for rubric generation. On the other hand, substituting
the default principles as seeds leads to a similar average
performance decrease of 4.01% in alignment and 12.35%
in Pearson’s correlation for rubric generation. These results
highlight the robustness of SPRI to seed examples in the
initial principle-generation stage, as our default principles
are neither relevant to the tasks we evaluate nor fit to the
instances we aim to provide guidance with.
Additionally, to better understand the influence of the seed
principles on SPRI , we also experiment with a separate
condition default principles only , where we randomly se-
lect one of the six default principles and include it as both
6
Page 7:
SPRI: Aligning Large Language Models with Context-Situated Principles
Table 3. Ablation for SPRI on reappraisal responses (measured by their responses’ alignment to reappraisal constitutions), and BiGGen
Bench rubric generation. Reappraisal responses where the ratings are significantly worse than either of the vanilla andself-refine
baselines are shaded.
REAPPRAISAL ALIGNMENT RUBRIC GENERATION
GPT-4o
miniLlama-3.1-70B
InstructLlama-3-8B
InstructMixtral-8x7B
InstructGPT-4o
miniLlama-3.1-70B
InstructMixtral-8x7B
InstructPrometheus-2
8x7B
SPRI 8.00†8.17*†7.90*†8.03*†0.472 0 .480 0 .288 0.333
-seed=[none] 7.67* 7.77 7 .73*†7.60†0.410 0 .410 0 .245 0 .297
-seed=[default principles] 7.67 7 .87†7.70*†7.57†0.404 0 .391 0 .238 0.336
default principles only 2.13*†6.47*†6.07*†2.80*†0.176 0 .055 0 .260 0 .308
the final guiding principle for eliciting reappraisals and
the final rubrics for evaluating instances. This helps ab-
late the influence of the default principles within the SPRI
pipeline, as they are unrelated to both the reappraisal task
and the context at hand. As shown in Table 3, utiliz-
ing default principles alone in the prompt to guide LLMs
for the task of cognitive reappraisals leads to an average
performance decrease of 45.62% compared to SPRI , and
this degradation is most observed for GPT-4o-mini and
Mixtral-8x7B-Instruct . In terms of instance-specific
evaluation, employing default principles alone led to the
most performance degradation for the more capable models
GPT-4o-mini andLlama-3.1-70B-Instruct on this task,
where their Pearson’s correlation score go down by 62.7%
and 88.5%respectively compared to SPRI . These find-
ings further underscore the importance of utilizing context-
specific principles, especially for tasks where guidance is
needed.
5.Can SPRI Generate Large-Scale Alignment
Data for Supervised Fine-Tuning?
Finally, we apply SPRI to a more general setting: gener-
ating large-scale synthetic data for supervised fine-tuning
(SFT). Through evaluating language models fine-tuned on
our synthetically generated data, we indirectly assess the
capability of SPRI . Intrinsically, SPRI ’s context-situated
principles allow for a deeper ability to reject misleading
claims — as exhibited in Appendix §I.3, when provided
with questions that don’t have a definite answer (e.g., “Is
it true that if you don’t exercise your body will become
weaker?” ),SPRI often generates guiding principles that
asks the response to focus on both sides of the question.
Based on the nature of SPRI , we hypothesize that SPRI
would perform best on benchmarks that measure the rejec-
tion of falsehoods, whilst maintaining the performance in
the knowledge as well as problem-solving domains.
5.1. Task Formulation
Letϕ(x)be the pipeline we generate responses with, and
letFθbe a model that we want to align. We are interestedin aligning Fθusing the data ϕ(x)produces. To this end,
given an instruction-following dataset Dthat is composed of
prompt-response pairs D={(p1, r1),(p2, r2), ...,(pn, rn},
we aim to produce corresponding aligned responses condi-
tioned on the prompts: {ϕ(p1), ϕ(p2), ..., ϕ (pn)}. Subse-
quently, we construct a new dataset Dϕ, which consists
of the original prompts paired with their corresponding
aligned responses. We then train FθonDϕby optimizing
its weights θ, resulting in a trained model Fθ∗. We measure
the performance of Fθ∗as an indicator of the quality of Dϕ.
5.2. Experimental Setup
Data. To examine the generalizability of SPRI, we carry
out experiments on two different instruction-tuning datasets
D, namely Dolly (Conover et al., 2023) and MixInstruct
(Jiang et al., 2023). Dolly contains around 15k manually
curated prompt-response pairs, whereas MixInstruct con-
sists of 110k examples where the responses are primarily
sourced from GPT-3.5-turbo andGPT-4 . We randomly
split Dolly into a 10k/2k split for training and validation.
ForMixInstruct , we randomly select 50k examples from
its training set and 2k examples from its validations set.
Baseline Methods. We experiment with a variety of base-
lines, including 1)oracle response , where we fine-tune
directly on the oracle responses provided in the datasets. 2)
direct response , in which we collect responses by asking
the base model Mto directly respond to the instructions for
each instance in the dataset. 3)self-instruct , where we elicit
responses from Mby relying on a few-shot prompt with 11
(input, output) example pairs from Wang et al. (2023). 4)
topic-guided red-teaming , a prompt from Sun et al. (2023),
in which a set of 16general rules as well as few-shot ex-
amples demonstrating how to utilize these rules in a chain-
of-thought (Wei et al., 2022) fashion are used to elicit re-
sponses. 5)self-refine (Madaan et al., 2023), where we
ask the base model Mto critic and refine its own response.
During critiquing, we ask the model to provide feedback
followed by an integer assessment score from 1to5. We
iterate the critique-refine process until a minimal assessment
score of 4is met or the maximum number of iterations of
7
Page 8:
SPRI: Aligning Large Language Models with Context-Situated Principles
Table 4. Performance of supervised fine-tuned models on TruthfulQA (Lin et al., 2022).
Llama-3.1-8B Llama-3.1-8B-Instruct Mistral-7B-v0.3 Mistral-7B-v0.3-Instruct Gemma-2-9B Gemma-2-9B-it
Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct
oracle response 41.62% 51.94% 46.75% 49.28% 40.42% 50.90% 42.87% 49.64% 44.81% 51.21% 47.11% 57.48%
direct response 51.48% 50.82% 50.94% 50.99% 47.16% 52.64% 50.89% 55.09% 53.82% 53.94% 57.97% 57.73%
self-instruct 51.07% 52.02% 49.46% 50.76% 46.62% 51.87% 50.44% 52.81% 52.43% 52.85% 56.26% 54.70%
self-align 54.56% 54.97% 52.52% 51.96% 48.86% 53.95% 54.44% 56.85% 54.02% 51.70% 58.34% 55.11%
self-refine 53.76% 55.11% 52.11% 50.20% 49.40% 53.15% 52.35% 54.69% 55.01% 53.93% 58.86% 58.36%
seed principles 53.63% 53.83% 50.46% 52.90% 50.89% 54.24% 52.42% 56.53% 53.48% 52.22% 57.96% 58.24%
SPRI 55.92% 56.08% 54.69% 55.41% 51.85% 55.63% 56.43% 57.99% 55.72% 56.48% 62.62% 59.75%
off-the-shelf 45.03% 53 .02% 42 .54% 66.11% 45 .39% 60 .47%
post-trained 53.02% — 66.11% — 60.47% —
4is reached. In addition, we also experiment with 6)seed
principles , where we utilize the 6default principles (shown
in Appendix §C Figure 3) as the guiding principles for the
model to generate responses. We establish this as a baseline
where principles irrelevant to the input query are used for
model guidance.
SPRI Method. We supply SPRI with the 6Question–
Principle pairs shown in Figure 3 as seed examples during
the initial principle generation phase.
Models and Setup. We use Llama-3-70B-Instruct
(Dubey et al., 2024) as our base model Macross all
methods, and we employ Prometheus-2-8x7B as the critic
modelCinSPRI . We set the temperature value for all model
generations to 0.7, topkto50, toppto0.95. We also restrict
the maximum tokens of generation to 256.
We finetune with LoRA (Hu et al., 2022), and we compute
the loss on responses only. For base (i.e., non-instruction-
tuned) models, we use the Alpaca format template (Taori
et al., 2023) for training; for instruction-tuned models, we
fine-tune them on their own chat templates. We save the best
model checkpoint at validation loss as the final model. All
our fine-tuning experiments are carried out on 3NVIDIA
A100 40 GB GPUs.
5.3. Results
We evaluate the performance of fine-tuned models on sev-
eral benchmarks, namely TruthfulQA (Lin et al., 2022),
MUSR (Sprague et al., 2024), GPQA (Rein et al., 2024),
BBH (Suzgun et al., 2023), MMLU-Pro (Wang et al., 2024),
and Hellaswag (Zellers et al., 2019). We further provide
the performance of the off-the-shelf models as well as their
post-trained counterparts on these benchmarks. As shown
in Table 4, SPRI consistently outperforms the off-the-
shelf model as well as other synthetic response genera-
tion methods on the TruthfulQA dataset. In particular,
fine-tuning base models using SPRI leads to the most no-
table gains on the benchmark, surpassing the off-the-shelf
models’ performance by an average of 24.76% and modelsfine-tuned using oracle responses by an average of 19.09%.
While already instruction-tuned models benefit from smaller
gains with SPRI , their performance still exceeds all base-
line methods. In particular, Llama-3.1-8B-Instruct out-
performs its off-the-shelf and oracle-response fine-tuned
counterparts’ performance on TruthfulQA by a margin of
3.83% and14.71% respectively.
We further provide the results from SFT on other bench-
marks in Appendix §H Tables 7 and 8. In general, there is
less considerable difference across methods on these bench-
marks. While we observe the effect of alignment tax (Askell
et al., 2021; Ouyang et al., 2022) where post-trained models
are weaker than base counterparts on benchmarks such as
MUSR and Hellaswag, this effect is less observed for mod-
els fine-tuned using SPRI. Instead, SPRI’s performance is
often comparable to the best-performing method on MUSR,
GPQA, BBH, MMLU-Pro, and Hellaswag. These results
highlight the effectiveness of SPRI on aligning models,
particularly in terms of truthfulness.
6. Conclusion
We introduce SPRI , a framework that produces context-
situated principles tailored to each input query at hand.
Through a series of extensive evaluations on tasks including
cognitive reappraisals, instance-specific rubric generation,
and generating synthetic data for SFT, we demonstrate the
effectiveness of SPRI in guiding responses. By dynamically
generating principles in real time with minimal or no human
effort, SPRI addresses key limitations of prior approaches
that relied on generic, static principles. Our results show
that SPRI not only matches expert-level performance in
highly specialized tasks but also enhances alignment with
human judgment and improves synthetic data generation for
model fine-tuning. This work underscores the potential of
SPRI to enable more adaptable, context-aware, and scalable
alignment strategies for LLMs, paving the way for broader
applicability in tasks requiring nuanced human oversight
and guidance.
8
Page 9:
SPRI: Aligning Large Language Models with Context-Situated Principles
Acknowledgements
We thank Heloisa Candello for her valuable input to the de-
fault principles used in this paper. We acknowledge the IBM
Research Big AI Model (BAM) and the Texas Advanced
Computing Center (TACC) at UT Austin for the computa-
tion of many of the results within this paper. This work
was partially supported by NSF grants IIS-2107524, IIS-
2145479, and Good Systems, a UT Austin Grand Challenge
to develop responsible AI technologies.
References
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I.,
Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S.,
Anadkat, S., et al. Gpt-4 technical report. 2024. URL
https://arxiv.org/abs/2303.08774 .
Arnold, M. B. Emotion and personality. Columbia Univer-
sity Press, 1960.
Askell, A., Bai, Y ., Chen, A., Drain, D., Ganguli, D.,
Henighan, T., Jones, A., Joseph, N., Mann, B., Das-
Sarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D.,
Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown,
T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. A
general language assistant as a laboratory for alignment.
2021. URL https://arxiv.org/abs/2112.00861 .
Bai, Y ., Jones, A., Ndousse, K., Askell, A., Chen, A., Das-
Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan,
T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T.,
El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernan-
dez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L.,
Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark,
J., McCandlish, S., Olah, C., Mann, B., and Kaplan,
J. Training a helpful and harmless assistant with rein-
forcement learning from human feedback. 2022a. URL
https://arxiv.org/abs/2204.05862 .
Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion,
J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A.,
McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernan-
dez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson,
E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Lan-
dau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sell-
itto, M., Elhage, N., Schiefer, N., Mercado, N., Das-
Sarma, N., Lasenby, R., Larson, R., Ringer, S., John-
ston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T.,
Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T.,
Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei,
D., Joseph, N., McCandlish, S., Brown, T., and Ka-
plan, J. Constitutional ai: Harmlessness from ai feed-
back. ArXiv , abs/2212.08073, 2022b. URL https:
//api.semanticscholar.org/CorpusID:254823489 .Bakker, M., Chadwick, M., Sheahan, H., Tessler, M.,
Campbell-Gillingham, L., Balaguer, J., McAleese, N.,
Glaese, A., Aslanides, J., Botvinick, M., et al. Fine-tuning
language models to find agreement among humans with
diverse preferences. Advances in Neural Information
Processing Systems , 35:38176–38189, 2022.
Buhle, J. T., Silvers, J. A., Wager, T. D., Lopez, R., Onye-
mekwu, C., Kober, H., Weber, J., and Ochsner, K. N.
Cognitive reappraisal of emotion: a meta-analysis of hu-
man neuroimaging studies. Cerebral cortex , 24(11):2981–
2990, 2014.
Chen, X., Wen, H., Nag, S., Luo, C., Yin, Q., Li, R., Li, Z.,
and Wang, W. IterAlign: Iterative constitutional align-
ment of large language models. In Duh, K., Gomez, H.,
and Bethard, S. (eds.), Proceedings of the 2024 Confer-
ence of the North American Chapter of the Association for
Computational Linguistics: Human Language Technolo-
gies (Volume 1: Long Papers) , pp. 1423–1433, Mexico
City, Mexico, June 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.naacl-long.78. URL
https://aclanthology.org/2024.naacl-long.78/ .
Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah,
S., Ghodsi, A., Wendell, P., Zaharia, M., and Xin, R. Free
dolly: Introducing the world’s first truly open instruction-
tuned llm. Company Blog of Databricks , 2023.
Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W.,
Ni, Y ., Xie, G., Xie, R., Lin, Y ., Liu, Z., and Sun,
M. ULTRAFEEDBACK: Boosting language models
with scaled AI feedback. In Forty-first International
Conference on Machine Learning , 2024. URL https:
//openreview.net/forum?id=BOorDpKHiJ .
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle,
A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan,
A., et al. The llama 3 herd of models. 2024. URL
https://arxiv.org/abs/2407.21783 .
Ellsworth, P. C. and Scherer, K. R. Appraisal processes in
emotion. In Davidson, R. J., Scherer, K. R., and Gold-
smith, H. H. (eds.), Handbook of Affective Sciences , pp.
572–595. Oxford University Press, 2003.
Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. GPTScore: Evaluate
as you desire. In Duh, K., Gomez, H., and Bethard, S.
(eds.), Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume
1: Long Papers) , pp. 6556–6576, Mexico City, Mex-
ico, June 2024. Association for Computational Linguis-
tics. doi: 10.18653/v1/2024.naacl-long.365. URL https:
//aclanthology.org/2024.naacl-long.365/ .
9
Page 10:
SPRI: Aligning Large Language Models with Context-Situated Principles
Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kin-
ney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I.,
Wang, Y ., Arora, S., Atkinson, D., Authur, R., Chandu,
K., Cohan, A., Dumas, J., Elazar, Y ., Gu, Y ., Hessel,
J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N.,
Naik, A., Nam, C., Peters, M., Pyatkin, V ., Ravichander,
A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subra-
mani, N., Wortsman, M., Dasigi, P., Lambert, N., Richard-
son, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L.,
Smith, N., and Hajishirzi, H. OLMo: Accelerating the
science of language models. In Ku, L.-W., Martins, A.,
and Srikumar, V . (eds.), Proceedings of the 62nd Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers) , pp. 15789–15809, Bangkok,
Thailand, August 2024. Association for Computational
Linguistics. doi: 10.18653/v1/2024.acl-long.841. URL
https://aclanthology.org/2024.acl-long.841/ .
Gross, J. J. Antecedent-and response-focused emotion reg-
ulation: divergent consequences for experience, expres-
sion, and physiology. Journal of personality and social
psychology , 74(1):224, 1998.
Gross, J. J. and John, O. P. Individual differences in two
emotion regulation processes: implications for affect,
relationships, and well-being. Journal of personality and
social psychology , 85(2):348, 2003.
Guan, M. Y ., Joglekar, M., Wallace, E., Jain, S., Barak,
B., Helyar, A., Dias, R., Vallone, A., Ren, H., Wei, J.,
Chung, H. W., Toyer, S., Heidecke, J., Beutel, A., and
Glaese, A. Deliberative alignment: Reasoning enables
safer language models. 2025. URL https://arxiv.
org/abs/2412.16339 .
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J.
Learning from dialogue after deployment: Feed yourself,
chatbot! In Korhonen, A., Traum, D., and M `arquez,
L. (eds.), Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics , pp. 3667–
3684, Florence, Italy, July 2019. Association for Compu-
tational Linguistics. doi: 10.18653/v1/P19-1358. URL
https://aclanthology.org/P19-1358/ .
Hashemi, H., Eisner, J., Rosset, C., Van Durme, B., and
Kedzie, C. Llm-rubric: A multidimensional, calibrated
approach to automated evaluation of natural language
texts. arXiv preprint arXiv:2501.00274 , 2024.
Hu, E. J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y .,
Wang, S., Wang, L., and Chen, W. LoRA: Low-rank
adaptation of large language models. In International
Conference on Learning Representations , 2022. URL
https://openreview.net/forum?id=nZeVKeeFYf9 .
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh,
A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A.,Radford, A., et al. Gpt-4o system card. 2024. URL
https://arxiv.org/abs/2410.21276 .
Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky,
A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney,
A., et al. Openai o1 system card. 2024. URL https:
//arxiv.org/abs/2412.16720 .
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary,
B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna,
E. B., Bressand, F., Lengyel, G., Bour, G., Lample, G.,
Lavaud, L. R., Saulnier, L., Lachaux, M.-A., Stock, P.,
Subramanian, S., Yang, S., Antoniak, S., Scao, T. L.,
Gervet, T., Lavril, T., Wang, T., Lacroix, T., and Sayed,
W. E. Mixtral of experts. 2024. URL https://arxiv.
org/abs/2401.04088 .
Jiang, D., Ren, X., and Lin, B. Y . LLM-blender: En-
sembling large language models with pairwise ranking
and generative fusion. In Rogers, A., Boyd-Graber, J.,
and Okazaki, N. (eds.), Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers) , pp. 14165–14178, Toronto,
Canada, July 2023. Association for Computational Lin-
guistics. doi: 10.18653/v1/2023.acl-long.792. URL
https://aclanthology.org/2023.acl-long.792/ .
Jin, Z., Levine, S., Gonzalez Adauto, F., Kamal, O., Sap, M.,
Sachan, M., Mihalcea, R., Tenenbaum, J., and Sch ¨olkopf,
B. When to make exceptions: Exploring language mod-
els as accounts of human moral judgment. Advances in
neural information processing systems , 35:28458–28473,
2022.
Kim, S., Suk, J., Cho, J. Y ., Longpre, S., Kim, C., Yoon,
D., Son, G., Cho, Y ., Shafayat, S., Baek, J., Park, S. H.,
Hwang, H., Jo, J., Cho, H., Shin, H., Lee, S., Oh, H.,
Lee, N., Ho, N., Joo, S. J., Ko, M., Lee, Y ., Chae, H.,
Shin, J., Jang, J., Ye, S., Lin, B. Y ., Welleck, S., Neubig,
G., Lee, M., Lee, K., and Seo, M. The biggen bench:
A principled benchmark for fine-grained evaluation of
language models with language models. 2024a. URL
https://arxiv.org/abs/2406.05761 .
Kim, S., Suk, J., Longpre, S., Lin, B. Y ., Shin, J., Welleck,
S., Neubig, G., Lee, M., Lee, K., and Seo, M. Prometheus
2: An open source language model specialized in evalu-
ating other language models. ArXiv , abs/2405.01535,
2024b. URL https://api.semanticscholar.org/
CorpusID:269502688 .
Kirk, H. R., Bean, A. M., Vidgen, B., R ¨ottger, P., and Hale,
S. A. The past, present and better future of feedback
learning in large language models for subjective human
preferences and values. In Bouamor, H., Pino, J., and
Bali, K. (eds.), Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing , pp.
10
Page 11:
SPRI: Aligning Large Language Models with Context-Situated Principles
2409–2430, Singapore, December 2023a. Association
for Computational Linguistics. doi: 10.18653/v1/2023.
emnlp-main.148. URL https://aclanthology.org/
2023.emnlp-main.148/ .
Kirk, H. R., Vidgen, B., R ¨ottger, P., and Hale, S. A. The
empty signifier problem: Towards clearer paradigms for
operationalising “alignment” in large language models.
2023b. URL https://arxiv.org/abs/2310.02457 .
Korbak, T., Shi, K., Chen, A., Bhalerao, R. V ., Buckley,
C., Phang, J., Bowman, S. R., and Perez, E. Pretrain-
ing language models with human preferences. In Inter-
national Conference on Machine Learning , pp. 17506–
17533. PMLR, 2023.
Lazarus, R. S. Psychological stress and the coping process.
McGraw-Hill, 1966.
Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J.,
Lu, K. R., Bishop, C., Hall, E., Carbune, V ., Rastogi,
A., and Prakash, S. RLAIF vs. RLHF: Scaling reinforce-
ment learning from human feedback with AI feedback. In
Forty-first International Conference on Machine Learn-
ing, 2024. URL https://openreview.net/forum?id=
uydQ2W41KO .
Lin, S., Hilton, J., and Evans, O. TruthfulQA: Measur-
ing how models mimic human falsehoods. In Mure-
san, S., Nakov, P., and Villavicencio, A. (eds.), Pro-
ceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers) ,
pp. 3214–3252, Dublin, Ireland, May 2022. Association
for Computational Linguistics. doi: 10.18653/v1/2022.
acl-long.229. URL https://aclanthology.org/2022.
acl-long.229 .
Liu, H., Sferrazza, C., and Abbeel, P. Chain of hind-
sight aligns language models with feedback. In The
Twelfth International Conference on Learning Represen-
tations , 2024. URL https://openreview.net/forum?
id=6xfe4IVcOu .
Liu, R., Jia, C., Zhang, G., Zhuang, Z., Liu, T. X., and
V osoughi, S. Second thoughts are best: Learning to re-
align with human values from text edits. In Oh, A. H.,
Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances
in Neural Information Processing Systems , 2022. URL
https://openreview.net/forum?id=u6OfmaGIya1 .
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao,
L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S.,
Yang, Y ., Gupta, S., Majumder, B. P., Hermann, K.,
Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-
refine: Iterative refinement with self-feedback. In
Thirty-seventh Conference on Neural Information Pro-
cessing Systems , 2023. URL https://openreview.
net/forum?id=S37hOerQLB .Ochsner, K. N., Bunge, S. A., Gross, J. J., and Gabrieli,
J. D. Rethinking feelings: an fmri study of the cognitive
regulation of emotion. Journal of cognitive neuroscience ,
14(8):1215–1229, 2002.
Ortony, A., Clore, G. L., and Collins, A. The cognitive
structure of emotions . Cambridge university press, 2022.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.,
Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A.,
et al. Training language models to follow instructions
with human feedback. Advances in neural information
processing systems , 35:27730–27744, 2022.
Petridis, S., Wedin, B., Yuan, A., Wexler, J., and Thain, N.
ConstitutionalExperts: Training a mixture of principle-
based prompts. In Ku, L.-W., Martins, A., and Sriku-
mar, V . (eds.), Proceedings of the 62nd Annual Meeting
of the Association for Computational Linguistics (Vol-
ume 2: Short Papers) , pp. 574–582, Bangkok, Thai-
land, August 2024a. Association for Computational Lin-
guistics. doi: 10.18653/v1/2024.acl-short.52. URL
https://aclanthology.org/2024.acl-short.52/ .
Petridis, S., Wedin, B. D., Wexler, J., Pushkarna, M., Dons-
bach, A., Goyal, N., Cai, C. J., and Terry, M. Constitu-
tionmaker: Interactively critiquing large language models
by converting feedback into principles. In Proceedings
of the 29th International Conference on Intelligent User
Interfaces , IUI ’24, pp. 853–868, New York, NY , USA,
2024b. Association for Computing Machinery. ISBN
9798400705083. doi: 10.1145/3640543.3645144. URL
https://doi.org/10.1145/3640543.3645144 .
Ray, R. D., McRae, K., Ochsner, K. N., and Gross, J. J. Cog-
nitive reappraisal of negative affect: converging evidence
from emg and self-report. Emotion , 10(4):587, 2010.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y .,
Dirani, J., Michael, J., and Bowman, S. R. GPQA: A
graduate-level google-proof q&a benchmark. In First
Conference on Language Modeling , 2024. URL https:
//openreview.net/forum?id=Ti67584b98 .
Sprague, Z. R., Ye, X., Bostrom, K., Chaudhuri, S., and Dur-
rett, G. MuSR: Testing the limits of chain-of-thought with
multistep soft reasoning. In The Twelfth International
Conference on Learning Representations , 2024. URL
https://openreview.net/forum?id=jenyYQzue1 .
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R.,
V oss, C., Radford, A., Amodei, D., and Christiano,
P. F. Learning to summarize with human feedback. Ad-
vances in Neural Information Processing Systems , 33:
3008–3021, 2020.
11
Page 12:
SPRI: Aligning Large Language Models with Context-Situated Principles
Sun, Z., Shen, Y ., Zhou, Q., Zhang, H., Chen, Z., Cox, D. D.,
Yang, Y ., and Gan, C. Principle-driven self-alignment
of language models from scratch with minimal human
supervision. In Thirty-seventh Conference on Neural
Information Processing Systems , 2023. URL https://
openreview.net/forum?id=p40XRfBX96 .
Suzgun, M., Scales, N., Sch ¨arli, N., Gehrmann, S., Tay, Y .,
Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D.,
and Wei, J. Challenging BIG-bench tasks and whether
chain-of-thought can solve them. In Rogers, A., Boyd-
Graber, J., and Okazaki, N. (eds.), Findings of the As-
sociation for Computational Linguistics: ACL 2023 , pp.
13003–13051, Toronto, Canada, July 2023. Association
for Computational Linguistics. doi: 10.18653/v1/2023.
findings-acl.824. URL https://aclanthology.org/
2023.findings-acl.824/ .
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y ., Li, X.,
Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford
alpaca: An instruction-following llama model. https:
//github.com/tatsu-lab/stanford alpaca , 2023.
Wang, Y ., Kordi, Y ., Mishra, S., Liu, A., Smith, N. A.,
Khashabi, D., and Hajishirzi, H. Self-instruct: Align-
ing language models with self-generated instructions. In
Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Pro-
ceedings of the 61st Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers) ,
pp. 13484–13508, Toronto, Canada, July 2023. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/
2023.acl-long.754. URL https://aclanthology.org/
2023.acl-long.754 .
Wang, Y ., Ma, X., Zhang, G., Ni, Y ., Chandra, A., Guo, S.,
Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M.,
Wang, K., Zhuang, A., Fan, R., Yue, X., and Chen, W.
MMLU-pro: A more robust and challenging multi-task
language understanding benchmark. In The Thirty-eight
Conference on Neural Information Processing Systems
Datasets and Benchmarks Track , 2024. URL https:
//openreview.net/forum?id=y10DM6R2r3 .
Waugh, C. E., Zarolia, P., Mauss, I. B., Lumian, D. S., Ford,
B. Q., Davis, T. S., Ciesielski, B. G., Sams, K. V ., and
McRae, K. Emotion regulation changes the duration of
the bold response to emotional stimuli. Social Cognitive
and Affective Neuroscience , 11(10):1550–1559, 2016.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., brian ichter,
Xia, F., Chi, E. H., Le, Q. V ., and Zhou, D. Chain of
thought prompting elicits reasoning in large language
models. In Oh, A. H., Agarwal, A., Belgrave, D., and
Cho, K. (eds.), Advances in Neural Information Process-
ing Systems , 2022. URL https://openreview.net/
forum?id= VjQlMeSB J.Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B.,
Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu,
J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K.,
Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang,
P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren,
X., Ren, X., Fan, Y ., Su, Y ., Zhang, Y ., Wan, Y ., Liu, Y .,
Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report.
2025. URL https://arxiv.org/abs/2412.15115 .
Yang, K., Tian, Y ., Peng, N., and Klein, D. Re3: Gen-
erating longer stories with recursive reprompting and
revision. In Goldberg, Y ., Kozareva, Z., and Zhang,
Y . (eds.), Proceedings of the 2022 Conference on Em-
pirical Methods in Natural Language Processing , pp.
4393–4479, Abu Dhabi, United Arab Emirates, Decem-
ber 2022. Association for Computational Linguistics.
doi: 10.18653/v1/2022.emnlp-main.296. URL https:
//aclanthology.org/2022.emnlp-main.296/ .
Ye, S., Kim, D., Kim, S., Hwang, H., Kim, S., Jo, Y ., Thorne,
J., Kim, J., and Seo, M. FLASK: Fine-grained language
model evaluation based on alignment skill sets. In The
Twelfth International Conference on Learning Represen-
tations , 2024. URL https://openreview.net/forum?
id=CYmF38ysDa .
Yeo, G. and Ong, D. C. A meta-analytic review of the
associations between cognitive appraisals and emotions
in cognitive appraisal theory. PsyArXiv , 2023. URL
https://psyarxiv.com/ystxc .
Yu, D., Kaur, S., Gupta, A., Brown-Cohen, J., Goyal,
A., and Arora, S. Skill-mix: A flexible and expand-
able family of evaluations for ai models. arXiv preprint
arXiv:2310.17567 , 2023.
Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi,
Y . HellaSwag: Can a machine really finish your sen-
tence? In Korhonen, A., Traum, D., and M `arquez,
L. (eds.), Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics , pp. 4791–
4800, Florence, Italy, July 2019. Association for Compu-
tational Linguistics. doi: 10.18653/v1/P19-1472. URL
https://aclanthology.org/P19-1472/ .
Zhan, H., Zheng, A., Lee, Y . K., Suh, J., Li, J. J., and
Ong, D. Large language models are capable of offering
cognitive reappraisal, if guided. In Proceedings of the
First Conference on Language Modeling , 2024. URL
https://openreview.net/forum?id=yK8MT91dQY .
Zhao, J., Khashabi, D., Khot, T., Sabharwal, A., and Chang,
K.-W. Ethical-advice taker: Do language models un-
derstand natural language interventions? In Zong, C.,
Xia, F., Li, W., and Navigli, R. (eds.), Findings of the
Association for Computational Linguistics: ACL-IJCNLP
2021 , pp. 4158–4164, Online, August 2021. Association
12
Page 13:
SPRI: Aligning Large Language Models with Context-Situated Principles
for Computational Linguistics. doi: 10.18653/v1/2021.
findings-acl.364. URL https://aclanthology.org/
2021.findings-acl.364/ .
Zheng, L., Chiang, W.-L., Sheng, Y ., Zhuang, S., Wu, Z.,
Zhuang, Y ., Lin, Z., Li, Z., Li, D., Xing, E., Zhang,
H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-
a-judge with MT-bench and chatbot arena. In Thirty-
seventh Conference on Neural Information Processing
Systems Datasets and Benchmarks Track , 2023. URL
https://openreview.net/forum?id=uccHPGDlao .
Zhou, R., Deshmukh, S., Greer, J., and Lee, C. Narle:
Natural language models using reinforcement learning
with emotion feedback. 2021. URL https://arxiv.
org/abs/2110.02148 .
13
Page 14:
SPRI: Aligning Large Language Models with Context-Situated Principles
A. Pseudo-code for SPRI
Algorithm 1 Pseudo-code for SPRI
Require: user input T, base language model M, critic language model C, seed examples S(optional),
prompts {Pprinciple-gen , Pprinciple-refine , Presponse-gen , Presponse-refine }, evaluation prompts {Eval principle , Eval response},
max iterations nmax, desired score threshold τ.
STAGE I: S YNTHESIZING CONTEXT -SITUATED PRINCIPLES
1:Initialize M,C
2:K0=M(T⊕Pprinciple-gen ⊕S) {Generate the initial principles K0}
3:ResetM
4:fori= 1tonmaxdo
5: Feedback Ki−1=C(Eval principle ⊕T⊕Ki−1) {Evaluate Ki−1using the critic model C}
6: Extract score from Feedback Ki−1
7: ifscore≥τthen
8: Kfinal=Ki−1;break
9: end if
10: Ki=M(Pprinciple-refine ⊕T⊕Ki−1⊕Feedback Ki−1) {Refine principles Ki−1}
11: ResetM,C
12:end for
13:ifscore < τafternmaxiterations then
14: Kfinal=Knmax
15:end if
STAGE II: G ENERATING RESPONSES GUIDED BY SYNTHESIZED PRINCIPLES
16:R0=M(T⊕Presponse-gen ⊕Kfinal) {Generate the initial response R0}
17:ResetM
18:fori= 1tonmaxdo
19: Feedback Ri−1=C(Eval response⊕T⊕Kfinal⊕Ri−1) {Evaluate Ri−1using the critic model C}
20: Extract score from Feedback Ri−1
21: ifscore≥τthen
22: Rfinal=Ri−1;break
23: end if
24: Ri=M(Presponse-refine ⊕T⊕Ri−1⊕Feedback Ri−1) {Refine response Ri−1}
25:end for
26:ifscore < τafternmaxiterations then
27: Rfinal=Rnmax
28:end if
29:return Final guiding principles Kfinaland response Rfinal
14
Page 15:
SPRI: Aligning Large Language Models with Context-Situated Principles
B. Prompts for SPRI
We provide the full prompts at https://github.com/honglizhan/SPRI-public . As the prompts for the 3tasks that we
tackle in this paper contain slight differences, we only demonstrate the prompts for SFT data elicitation here. Please refer to
the GitHub repo for the prompts for the other tasks.
B.1. Stage I
a.Pprinciple-gen :a prompt instructing the base model Mto generate initial principles K0.
### Role : You are an expert at providing principles that oversight responses to questions .
You will be given a question , and you need to provide principles that guide the
response . Principles are defined as high - level constructs that a response should
follow . Keep in mind that principles are used to guide the responses , which means that
they should be different from the response itself . For instance , an example principle
can be: " When responding to the question , avoid discrimination based on gender , age ,
or socioeconomic status ". Please do not generate any other opening and closing remarks
, nor explanations . Importantly , * you should be succinct in your response and make
sure that the principle you come up with does not exceed 128 words *. ( When phrasing
principles , follow these examples :)
b.Eval principle :an evaluation prompt to produce feedback and a score on the generated principles.
### Task Description :
You will be given an instruction ( which includes an Input inside it), a response to
evaluate , and a score rubric representing an evaluation criteria . Adhere to the
following steps when conducting the evaluation process :
1. Write a detailed feedback that assesses the quality of the response strictly based on
the given score rubric , rather than evaluating in general .
2. After writing the feedback , write a score that is an integer between 1 and 5. You
should refer to the score rubric .
3. The output format should look as follows : " Feedback : ( write a feedback based on the
evaluation criteria ) [ RESULT ] (an integer number between 1 and 5)"
4. Please do not generate any other opening and closing remarks , nor explanations .
5. Importantly , * you should be succinct in your feedback and make sure that the feedback
you come up with does not exceed 128 words *.
### Instruction to Evaluate :
{ Fill in Pprinciple-gen here }
[ Question : { orig_question }]
### Principles to Evaluate :
{ orig_principle }
### Score Rubrics :
On a scale of 1 to 5, to what extent are the principles useful to guide the response to
the question ?
Score 1: The principles are irrelevant to the question , and they are not useful to guide
the response at all .
Score 2: The principles are minimally useful . They show some relevance to the question ,
but are vague , lacking in depth , or not directly applicable to guiding responses .
Score 3: The principles are somewhat useful . They provide a moderate level of guidance on
the responses .
Score 4: The principles are quite useful . They are clear , relevant , and offer solid
guidance on how to respond to the question . They effectively provide a good framework
for responding to similar questions . Minor improvements could make them more robust .
Score 5: The principles are highly useful . They are comprehensive , detailed , and provide
excellent guidance for responding to the question . They are also broadly applicable to
guiding responses to a wide range of similar questions .
### Feedback :
15
Page 16:
SPRI: Aligning Large Language Models with Context-Situated Principles
c.Pprinciple-refine :a prompt instructing the model to refine principles based on feedback.
### Role : You are an expert at providing principles that oversights responses to questions
. Please refine the principles based on the feedback . Do not generate any other
opening and closing remarks , nor explanations . Importantly , * you should be succinct in
your response and make sure that the principle you come up with does not exceed 128
words *.
B.2. Stage II
d.Presponse-gen :a prompt that instructs Mto respond by adhering to the generated principles.
### Role : You are an expert at following instructions . You will be given a question , and a
set of principles that guides the response . You need to generate a response to the
question that adheres closely to these principles . Please do not generate any other
opening and closing remarks , nor explanations . Importantly , you should be succinct in
your response and make sure that it does not exceed 128 words .
e.Eval response :a direct assessment prompt that elicits feedback and a score from Con the response.
### Task Description :
You will be given an instruction ( which includes an Input inside it), a response to
evaluate , and a score rubric representing an evaluation criteria . Adhere to the
following steps when conducting the evaluation process :
1. Write a detailed feedback that assesses the quality of the response strictly based on
the given score rubric , rather than evaluating in general .
2. After writing the feedback , write a score that is an integer between 1 and 5. You
should refer to the score rubric .
3. The output format should look as follows : " Feedback : ( write a feedback based on the
evaluation criteria ) [ RESULT ] (an integer number between 1 and 5)"
4. Please do not generate any other opening and closing remarks , nor explanations .
5. Importantly , you should be succinct in your feedback and make sure that the feedback
you come up with does not exceed 128 words .
### Instruction to Evaluate :
You are an expert at following instructions . You will be given a question , and a set of
principles that guides the response . You need to generate a response to the question
that adheres closely to these principles . Please do not generate any other opening and
closing remarks , nor explanations . Importantly , you should be succinct in your
response and make sure that it does not exceed 128 words .
[ Question : { orig_question }]
[ Principles : { orig_principle }]
### Response to Evaluate :
{ orig_response }
### Score Rubrics :
On a scale of 1 to 5, to what extent does the response adhere to the principles ?
Score 1: The response does not adhere to the principles at all .
Score 2: The response demonstrates minimal adherence to the principles .
Score 3: The response shows a moderate level of adherence to the principles .
Score 4: The response adheres quite well to the principles . Minor improvements could make
them more aligned .
Score 5: The response highly adheres to the principles .
### Feedback :
16
Page 17:
SPRI: Aligning Large Language Models with Context-Situated Principles
f.Presponse-refine :a prompt asking the model to refine the response based on feedback.
### Role : You are an expert at refining responses based on feedback . You will first be
given a question , and a response to the question . However , the response is deemed
unsatisfactory , and you 'll receive feedback explaining where it falls short . You task
is to refine the response based on the feedback so that it answers the question better
. When refining , make sure the response you come up does not deviate from the original
response . Please do not generate any other opening and closing remarks , nor
explanations . Importantly , you should be succinct in your response and make sure it
does not exceed 128 words .
17
Page 18:
SPRI: Aligning Large Language Models with Context-Situated Principles
C. Default Seed Principles
The 6seed default principles used in the SPRI framework are provided in Figure 3.
Seed Principles
Seed Example 1
Question : Why don’t more people with disabilities pursue careers in certain fields?
Principle : When considering the underrepresentation of people with disabilities in certain fields, it is essential to recognize and
address systemic barriers and biases that may discourage or prevent them from pursuing careers in those areas. These barriers
can include physical accessibility issues, discrimination, lack of accommodations, and negative stereotypes.
Seed Example 2
Question : In which continent can you find people with the best financial acumen and in which continent can you find people
with the worst financial acumen?
Principle : When considering people with low or high financial acumen, it’s crucial to consider that continent context and
realities are diverse. Several parameters might influence the financial acumen such as to what extent people understand and
interpret financial statements, apply financial concepts to better support judgment and business decision making, and incorporate
organizational financial information within their life and business. It’s essential not to discriminate population based on their
financial acumen, and the purpose of knowing the information of which continent has the lowest financial acumen, to avoid
predatory credit, for example. Please provide recommendations to avoid exploration and predatory actions considering the
financial acumen of populations.
Seed Example 3
Question : In some societies, mothers are expected to dedicate their time and energy to their children while the fathers provide
financially. How would this cultural norm be addressed in a different society where gender roles are shared?
Principle : It is important to address that in societies where gender roles are shared, both parents should dedicate time, financial
support and energy to their children. The gender norms of a cultural group might influence how mothers and fathers will divide
the responsibilities. Please include recommendations to balance the shared responsibilities of mothers and fathers.
Seed Example 4
Question : Should sex education be mandatory in American public schools?
Principle : Sex education covers a range of topics, inclusive, and medically accurate is critical for young people. This decision
should include medical professionals, educators and parents. It’s important that all kids deserve to feel safe and proud of who
they are. Provide what are the main factors that should be considered in this decision, and the pros and cons to have sex education
as mandatory in American public schools.
Seed Example 5
Question : What’s the appropriate way to greet a client from Saudi Arabia?
Principle : There are specific cultural and social norms to greet people from Saudi Arabia. Please provide examples of the most
polite and respectful ways to greet clients from Saudi Arabia.
Seed Example 6
Question : A state is passing a law that prohibits abortions if the unborn baby has a prenatal condition that is fatal. Should the
state make an exception for this situation?
Principle : Many exceptions are included in state abortion bans. which generally fall into four categories: to prevent the death of
the pregnant person, when there is risk to the health of the pregnant person, when the pregnancy is the result of rape or incest,
and when there is a lethal fetal anomaly. Provide the risks to the pregnant women to accessing abortion care, even in situations
where the exceptions they outline should apply.
Figure 3. The6default seed principles used in the SPRI framework.
18
Page 19:
SPRI: Aligning Large Language Models with Context-Situated Principles
D. Background of Cognitive Reappraisal
Cognitive reappraisal is an effective emotion regulation strategy that stemmed out of the appraisal theories of emotions
(Arnold, 1960; Lazarus, 1966; Ellsworth & Scherer, 2003; Ortony et al., 2022; Yeo & Ong, 2023), which suggests that
emotions arise from an individual’s subjective understanding and interpretation of a given situation. By zooming into the
specific dimensions, cognitive reappraisal can causally intervene in a precise, principled manner to help shift negative
appraisals towards more positive or neutral perspectives, subsequently allowing individuals to reinterpret the meaning of a
situation and feel better. Cognitive reappraisal has been shown to foster long-term mental well-being in individuals (Ochsner
et al., 2002; Ray et al., 2010; Gross, 1998; Gross & John, 2003; Buhle et al., 2014; Waugh et al., 2016).
Recently, Zhan et al. (2024) introduced the RESORT (REappraisals for emotional SuppORT) framework, leveraging
LLMs to perform cognitive reappraisal and assist in regulating individuals’ emotions. RESORT is grounded in 6appraisal
dimensions identified by Yeo & Ong (2023), each carefully selected to ensure broad applicability across diverse situations.
The framework is built on expert-crafted reappraisal constitutions, which act as guiding principles for LLMs to elicit
effective reappraisals. RESORT is implemented in two approaches: individual guided reappraisal (INDV) and iterative
guided refinement (ITER). The authors conducted extensive experiments involving clinical psychologists with advanced
degrees (M.S. or Ph.D.), and showed that LLMs, even smaller models like those with 7B parameters, can produce cognitive
reappraisals that significantly outperform both human-written responses and non-appraisal-based prompting.
E. Background of BiGGen Bench
The BiGGen Bench (Kim et al., 2024a) dataset is a robust and comprehensive benchmark designed to assess the capabilities
of LLMs across various tasks. Each input instance in BiGGen Bench is accompanied by a scoring rubric that outlines
the specific evaluation criteria and descriptions for each score, ranging from 1to5. The scoring rubrics are meticulously
manually curated to ensure detailed and contextually rich assessments, as they are unique to each input query. This allows
for a fine-grained analysis of model performance at a granular instance level.
In BiGGen Bench, there are multiple responses from different LLMs to the same input query. An evaluator LM, which
serves to judge the quality of responses, needs to assign a grade to the response based on the scoring rubric provided. To
ensure the evaluation reliability, BiGGen Bench further includes human-annotated judgments of the LLM responses based
on the same scoring rubric. Results show that their human-collected fine-grained scoring rubrics significantly enhance the
accuracy of Evaluator LMs’ judgments, outperforming both coarse-grained (Zheng et al., 2023) and domain-specific (Ye
et al., 2024) criteria.
19
Page 20:
SPRI: Aligning Large Language Models with Context-Situated Principles
F. Full Results for Cognitive Reappraisals
We showcase the full results for cognitive reappraisals in Table 5.
Table 5. Evaluation results (in average scores) for reappraisal responses. We report statistical significance (with p <0.05) using pair-wise
t-tests against both the vanilla (marked with *) and self-refine (marked with †) baselines. Responses where the ratings are significantly
worse than either of the baselines are shaded. In addition, we also show the average number of model calls required to produce each
response.
Alignment ↑ Empathy ↑ Harmfulness ↓ Factuality ↑
# Model Calls 10-POINT SCALE 5-POINT SCALE YES/NO YES/MINOR /NO
INDV ITER INDV ITER INDV ITER INDV ITER INDV ITER
GPT-4 O-MINIvanilla 1 7 .90 4 .50 0 .00 1.00
self-refine 6 7 .73 4 .53 0 .00 0 .93
default principles only 1 6 5.67*†2.13*†3.23*†1.53*†0.00 0 .04 0.55*†0.08*†
[no seeds] SPRI 5.3 7.67* 4.73 0.00 0 .97
[seed=default principles] SPRI 4.3 7 .67 4 .67 0 .00 1.00†
[seed=one oracle] SPRI 4.5 8.00†4.73 0.00 1.00†
oracle principles 1 6 8.90*†8.67*†4.37 4.80*†0.00 0.00 0.90* 1.00†
LLAMA -3.1
70B-INSTRUCTvanilla 1 7 .77 4 .43 0 .00 1.00
self-refine 6 7 .50 4 .27 0 .00 0 .93
default principles only 1 6 6.73* 6.47*†3.83*†3.67*†0.00 0 .00 0.65*†0.65*†
[no seeds] SPRI 4.3 7 .77 4 .73*†0.00 1.00†
[seed=default principles] SPRI 4.5 7 .87†4.80*†0.00 0 .97
[seed=one oracle] SPRI 4.3 8.17*†4.77*†0.00 0 .98
oracle principles 1 6 8.80*†8.53*†4.07* 4.20 0.00 0.00 0.90* 0.95
LLAMA -3
8B-INSTRUCTvanilla 1 7 .10 3 .90 0 .00 0 .88
self-refine 6 7 .20 4 .07 0 .00 0 .87
default principles only 1 6 6 .70 6.07*†4.13 3 .80 0 .00 0 .00 0.60*†0.38*†
[no seeds] SPRI 5.5 7 .73*†4.30* 0.00 0.92
[seed=default principles] SPRI 5.5 7 .70*†4.53*†0.00 0.92
[seed=one oracle] SPRI 6.0 7.90*†4.47*†0.00 0 .90
oracle principles 1 6 8.47*†8.33*†4.17 4.30* 0.00 0.00 0.85 0.83
MIXTRAL
8×7B-INSTRUCT
(V0.1)vanilla 1 7 .53 4 .50 0 .00 0 .92
self-refine 6 6 .60 3 .90 0 .00 0 .80
default principles only 1 6 5.47*†2.80*†3.77* 2.27*†0.00 0 .00 0 .28 0.02*†
[no seeds] SPRI 4.5 7 .60†4.67†0.00 0.95†
[seed=default principles] SPRI 5.9 7 .57†4.57†0.00 0 .88
[seed=one oracle] SPRI 4.7 8.03*†4.77*†0.00 0 .93†
oracle principles 1 6 8.57*†8.17 4.43†4.07 0.00 0.00 0.92 0.72
20
Page 21:
SPRI: Aligning Large Language Models with Context-Situated Principles
G. Full Results for BigGen Bench
We provide the full results for instance-specific rubric evaluation in Table 6.
Table 6. Results for BiGGen Bench, measured with Pearson’s correlation against the human ground truth labels. Evaluation carried out
without the use of reference answers. Values that are not significant ( p <0.001) are shaded.
# Calls Inst. Follow. Ground. Reason. Plan. Refine. Safety ToM Tool. Average
GPT-4 O-MINIgold rubrics 1 0.597* 0.612* 0.631* 0.641*0.432*0.664*0.378*0.448* 0.550
vanilla 1 0 .358* 0.361* 0.478* 0.620*0.222*0.112 0 .380*0.481* 0.377
self-refine 6 0 .375* 0.379* 0.491*0.622*0.266*0.156 0 .427*0.460* 0.397
MT-Bench rubric 1 0 .330* 0.389* 0.527* 0.569*0.313*0.266*0.426*0.506*0.416
FLASK rubric 1 0 .348* 0.369* 0.496* 0.318*0.297*0.339*0.204*0.489* 0.358
default principles as rubrics 1 0.128 0.075 0 .323* 0.242*0.173 0.046 0.159 0 .264* 0.176
[no seeds] SPRI 5.3 0 .368* 0.429* 0.523* 0.569*0.325*0.175 0 .447*0.440* 0.410
[seeds=default principles] SPRI 5.5 0 .380* 0.437* 0.451* 0.596*0.316*0.207*0.401*0.446* 0.404
[seeds=3 gold rubrics] SPRI 4.9 0.398* 0.506*0.553*0.618*0.326*0.385*0.500*0.492*0.472
LLAMA -3.1
70B-INSTRUCTgold rubrics 1 0.569* 0.594* 0.574* 0.574*0.420*0.679*0.535*0.500* 0.556
vanilla 1 0 .368* 0.338* 0.462* 0.606*0.244*0.121 0.497*0.448* 0.386
self-refine 6 0.149 0.015 0 .396* 0.558*0.131 0.138 0 .324*0.365* 0.260
MT-Bench rubric 1 0 .299* 0.337* 0.488*0.612*0.267*0.388*0.474*0.505*0.421
FLASK rubric 1 0.409* 0.277* 0.422* 0.419*0.315*0.365*0.168*0.503* 0.360
default principles as rubrics 1 0.053 0.130 0.144 0.119 0.038−0.069 0.049−0.024 0 .055
[no seeds] SPRI 4.9 0 .276* 0.441* 0.438* 0.503*0.316*0.328*0.494*0.484* 0.410
[seeds=default principles] SPRI 5.1 0 .244* 0.474* 0.409* 0.510*0.255*0.313*0.454*0.471* 0.391
[seeds=3 gold rubrics] SPRI 4.6 0.409* 0.555* 0.474*0.611*0.402*0.440*0.450*0.500*0.480
MIXTRAL
8×7B-INSTRUCT
(V0.1)gold rubrics 1 0.377* 0.410* 0.409* 0.417*0.167 0.410*0.335*0.407* 0.367
vanilla 1 0 .222* 0.262* 0.355* 0.435*0.203*0.186*0.356*0.440*0.307
self-refine 6 0.050 0.076 0.122 0.174 0.071 0.093 0.119 0.174 0 .110
MT-Bench rubric 1 0.247* 0.213* 0.179* 0.280*0.135 0.310*0.384*0.437* 0.273
FLASK rubric 1 0 .186* 0.279* 0.282* 0.316*0.197*0.284*0.258*0.413* 0.277
default principles as rubrics 1 0 .176* 0.218* 0.399*0.342*0.151 0 .219*0.252*0.326* 0.260
[no seeds] SPRI 5.2 0 .196* 0.305* 0.308* 0.268*0.116 0.147 0 .231*0.392* 0.245
[seeds=default principles] SPRI 5.4 0 .191* 0.297* 0.267* 0.231*0.111 0 .242*0.215*0.348* 0.238
[seeds=3 gold rubrics] SPRI 4.7 0 .184* 0.312* 0.216*0.450*0.116 0 .295*0.271*0.457*0.288
PROMETHEUS -2
8×7Bgold rubrics 1 0.346* 0.460* 0.401* 0.398*0.241*0.486*0.371*0.385* 0.386
vanilla 1 0 .273* 0.267* 0.333*0.415*0.177 0 .239*0.386*0.394* 0.311
self-refine 6 0 .247* 0.282* 0.332* 0.385*0.166 0 .272*0.349*0.346* 0.297
MT-Bench rubric 1 0 .316* 0.264* 0.200* 0.412*0.158 0 .255*0.337*0.366* 0.289
FLASK rubric 1 0 .249* 0.261* 0.262* 0.361*0.242*0.333*0.288*0.353* 0.294
default principles as rubrics 1 0 .269* 0.240* 0.387*0.404*0.226*0.208*0.329*0.398* 0.308
[no seeds] SPRI 4.9 0.323* 0.243* 0.246* 0.368*0.211*0.233*0.292*0.457* 0.297
[seeds=default principles] SPRI 5.0 0 .306* 0.353* 0.320* 0.399*0.190*0.286*0.405*0.427*0.336
[seeds=3 gold rubrics] SPRI 4.6 0 .218* 0.360*0.387*0.411*0.198*0.200*0.408*0.485*0.333
21
Page 22:
SPRI: Aligning Large Language Models with Context-Situated Principles
H. Full Results for SFT
In Table 7, we showcase the full results from fine-tuning base models that only went through the pre-training phase. In Table
8, we provide the full results for fine-tuning models that have gone through post-training.
Table 7. SFT results for base models.
TRUTHFUL QA MUSR GPQA BBH MMLU-PRO H ELLASWAG
Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Average
LLAMA -3.1-8Boff-the-shelf 45.03% 38.25% 29.32% 46.51% 32.67% 81.45% 45.54%
Llama-3.1-8B-Instruct 53.02% 37.90% 30.66% 48.72% 36.47% 76.89% 47.28%
oracle response 41.62% 51.94% 42.49% 40.80% 27.54% 28.79% 47.29% 47.26% 31.23% 30.53% 81.18% 81.08% 45.98%
direct response 51.48% 50.82% 41.91% 39.43% 27.12% 29.46% 48.71% 47.35% 31.11% 32.14% 80.63% 81.16% 46.78%
self-instruct 51.07% 52.02% 44.59% 39.29% 27.49% 25.45% 49.78% 46.38% 31.25% 31.31% 80.12% 81.00% 46.65%
self-align 54.56% 54.97% 41.54% 40.13% 28.21% 27.23% 49.28% 46.11% 31.47% 31.44% 80.09% 80.50% 47.13%
self-refine 53.76% 55.11% 43.63% 39.56% 27.33% 28.47% 49.49% 47.85% 32.60% 33.47% 79.99% 80.40% 47.64%
seed principles 53.63% 53.83% 39.96% 37.74% 28.16% 26.86% 49.77% 48.01% 31.57% 32.62% 79.70% 80.60% 46.87%
SPRI 55.92% 56.08% 37.56% 39.20% 28.00% 27.13% 48.79% 46.98% 31.71% 30.31% 79.96% 79.91% 46.80%
MISTRAL -7B- V0.3off-the-shelf 42.54% 40.18% 29.84% 45.11% 29.57% 82.90% 45.02%
Mistral-7B-Instruct-v0.3 66.11% 36.47% 27.65% 48.35% 30.89% 81.87% 48.56%
oracle response 40.42% 50.90% 43.86% 42.95% 29.23% 28.65% 46.26% 45.26% 28.00% 27.19% 82.94% 81.75% 45.62%
direct response 47.16% 52.64% 43.19% 39.87% 27.10% 26.02% 47.39% 45.78% 27.78% 27.35% 81.56% 81.57% 45.62%
self-instruct 46.62% 51.87% 46.92% 39.34% 26.22% 28.38% 47.32% 44.56% 28.37% 27.17% 80.95% 81.16% 45.74%
self-align 48.86% 53.95% 44.82% 40.29% 31.64% 27.64% 45.34% 44.63% 28.37% 26.55% 81.26% 81.18% 46.21%
self-refine 49.40% 53.15% 42.93% 40.91% 28.51% 27.97% 47.00% 45.20% 26.83% 27.41% 81.52% 81.26% 46.01%
seed principles 50.89% 54.24% 45.06% 41.08% 28.30% 30.96% 46.51% 44.76% 27.81% 27.78% 81.37% 80.55% 46.61%
SPRI 51.85% 55.63% 44.79% 43.31% 29.26% 28.30% 45.18% 45.39% 28.61% 28.10% 81.20% 80.13% 46.81%
GEMMA -2-9Boff-the-shelf 45.39% 44.58% 32.89% 53.74% 41.03% 81.90% 49.92%
Gemma-2-9B-it 60.47% 40.59% 33.85% 59.93% 38.60% 78.11% 51.93%
oracle response 44.81% 51.21% 47.09% 46.20% 30.76% 31.87% 56.64% 55.45% 41.76% 40.43% 83.38% 83.00% 51.05%
direct response 53.82% 53.94% 46.97% 45.39% 30.50% 30.77% 56.42% 54.80% 41.09% 40.47% 81.79% 81.44% 51.45%
self-instruct 52.43% 52.85% 45.38% 45.92% 29.80% 29.00% 56.56% 55.55% 41.06% 40.59% 80.99% 82.17% 51.03%
self-align 54.02% 51.70% 42.22% 43.40% 30.62% 30.01% 55.44% 54.55% 40.08% 39.57% 80.65% 81.59% 50.32%
self-refine 55.01% 53.93% 46.99% 47.64% 28.85% 30.07% 56.21% 54.85% 40.95% 40.38% 81.39% 81.61% 51.49%
seed principles 53.48% 52.22% 42.60% 41.42% 29.59% 28.58% 55.46% 54.58% 40.17% 40.47% 80.37% 81.58% 50.04%
SPRI 55.72% 56.48% 45.38% 47.24% 30.59% 31.72% 56.50% 55.14% 41.22% 40.23% 81.08% 80.89% 51.85%
Table 8. SFT results for post-trained models.
TRUTHFUL QA MUSR GPQA BBH MMLU-PRO H ELLASWAG
Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Dolly MixInstruct Average
LLAMA -3.1-8B
INSTRUCToff-the-shelf 53.02% 37.90% 30.66% 48.72% 36.47% 76.89% 47.28%
oracle response 46.75% 49.28% 42.21% 36.35% 24.71% 28.02% 51.20% 45.71% 36.12% 33.83% 79.75% 74.41% 45.70%
direct response 50.94% 50.99% 38.18% 39.11% 30.42% 30.12% 46.49% 46.15% 37.23% 35.11% 72.70% 72.18% 45.80%
self-instruct 49.46% 50.76% 37.78% 34.63% 29.96% 30.42% 46.23% 45.86% 35.95% 35.11% 70.72% 70.53% 44.78%
self-align 52.52% 51.96% 34.62% 35.55% 28.40% 31.16% 47.50% 44.91% 34.45% 35.29% 73.10% 74.12% 45.30%
self-refine 52.11% 50.20% 36.98% 39.53% 31.05% 30.33% 46.69% 46.19% 37.23% 35.89% 72.20% 72.34% 45.90%
seed principles 50.46% 52.90% 35.01% 35.42% 27.57% 29.18% 45.93% 45.52% 35.18% 35.65% 70.34% 70.13% 44.44%
SPRI 54.69% 55.41% 41.70% 40.38% 24.71% 24.71% 50.66% 50.21% 36.99% 36.45% 78.51% 78.55% 47.75%
MISTRAL -7B- V0.3
INSTRUCToff-the-shelf 66.11% 36.47% 27.65% 48.35% 30.89% 81.87% 48.56%
oracle response 42.87% 49.64% 46.86% 44.41% 27.71% 27.53% 45.99% 44.66% 27.38% 26.26% 82.40% 80.67% 45.53%
direct response 50.89% 55.09% 45.17% 44.39% 25.80% 26.69% 45.56% 45.65% 27.49% 27.57% 81.46% 80.91% 46.39%
self-instruct 50.44% 52.81% 46.93% 44.09% 26.08% 27.23% 44.58% 45.50% 28.56% 28.41% 80.86% 80.27% 46.31%
self-align 54.44% 56.85% 46.11% 43.33% 27.72% 27.17% 45.47% 43.97% 28.90% 28.75% 80.67% 80.31% 46.97%
self-refine 52.35% 54.69% 44.76% 42.66% 27.30% 26.15% 46.04% 44.65% 26.92% 27.91% 81.63% 80.31% 46.28%
seed principles 52.42% 56.53% 48.62% 42.43% 26.69% 28.44% 45.99% 45.51% 28.04% 27.92% 81.20% 80.20% 47.00%
SPRI 56.43% 57.99% 46.64% 44.79% 26.28% 27.38% 46.75% 44.35% 28.38% 28.66% 81.16% 79.52% 47.36%
GEMMA -2-9B- IToff-the-shelf 60.47% 40.59% 33.85% 59.93% 38.60% 78.11% 51.93%
oracle response 47.11% 57.48% 49.12% 51.39% 32.64% 31.21% 58.78% 58.68% 40.92% 39.26% 81.91% 80.41% 52.41%
direct response 57.97% 57.73% 46.31% 47.51% 31.31% 30.63% 59.02% 57.66% 39.80% 38.95% 78.46% 78.43% 51.98%
self-instruct 56.26% 54.70% 47.37% 46.73% 31.58% 31.31% 57.72% 57.97% 40.19% 39.19% 78.08% 78.31% 51.62%
self-align 58.34% 55.11% 45.93% 46.19% 32.49% 29.73% 58.42% 57.75% 39.70% 38.67% 78.35% 78.84% 51.63%
self-refine 58.86% 58.36% 46.85% 50.03% 30.64% 32.37% 58.80% 57.05% 39.91% 37.92% 78.12% 77.84% 52.23%
seed principles 57.96% 58.24% 45.51% 45.53% 31.00% 31.94% 57.96% 56.77% 39.54% 39.93% 78.34% 76.70% 51.62%
SPRI 62.62% 59.75% 46.86% 47.38% 31.94% 33.03% 58.04% 56.93% 40.13% 39.24% 78.35% 78.61% 52.74%
22
Page 23:
SPRI: Aligning Large Language Models with Context-Situated Principles
I. Example Principles Generated by SPRI
I.1. Examples from Cognitive Reappraisal
(i) • User input : I’m currently completing my A levels (a series of exam you do in the UK at the age of 17/18, that
determine whether you get into university)... as you can imagine, I have been stressed. I’m under a tremendous
amount of pressure to get very high grades (straight A’s). I’ve completed 2 exams, and have 5 left to go, 3 of
which I’ll be sitting tomorrow, the next day, and the day after that...
I’m shocked at how this stress has effected me physically. I’ve always been fairly neurotic, but the anxiety I tend
to feel is transient, and is rarely severe enough to manifest in anything physically significant, beyond a raised
pulse and slight breathlessness.
I knew I was getting myself *way* too worked up when I started to pull out hair in the shower. I have very thick
hair, so a lot of it was coming out. I’ve had severe brain fog, which hasn’t been the least bit helpful during a time
when I have to be the most alert and *with it*’. I’ve had no appetite, and now my trousers are all loose, with one
pair literally falling down when I walk. The most bizarre symptom I’ve had is this weird jaw tension - my jaw is
very clicky, and it feels incredibly tense and stiff. As a hypochondriac, I was convinced this was a heart attack, but
I was reassured that it’s probably just anxiety. I’ve had weird gastrointestinal symptoms, where I feel nauseous,
and have this pit in my stomach all of the time, which could be caused by the lack of eating, but it persists even
when I have eaten.
And then of course feelings of impending doom, etc etc.
I just can’t wait for this to be over.
•Reappraisal Goal : To guide the narrator over their perception of their responsibility for causing the situation.
•Oralce Principles : If the narrator is stressing over things they are not responsible for, tell them that it may
not require as much responsibility as they think and not to worry about them too much (depending on how
high they perceive their level of responsibility in the situation). However, if the person is doing something
wrong/inappropriate and not feeling any responsibility or it (low responsibility), you should kindly but objectively
encourage them to re-appraise the situation (or maybe think in the other person’s perspective) and consider what
they could be responsible for, and change the situation. Provide realistic and specific guidelines.
•SPRI-Generated Principles (using GPT-4o-mini ): Recognize that the pressure and stress experienced during
exams are shared by many and not solely the narrator’s responsibility. Encourage the narrator to separate their
self-worth from academic performance, suggesting that external expectations should not dictate their wellbeing.
It’s important to acknowledge feelings of anxiety without assigning blame to oneself. If engaging in unhealthy
coping mechanisms, gently prompt the narrator to consider healthier strategies for managing stress, such as
mindfulness or physical activity. Reframe the experience as an opportunity for growth rather than a measure of
success or failure, reinforcing that asking for help is a sign of strength, not weakness.
•SPRI-Generated Response (using GPT-4o-mini ): It’s entirely normal to feel overwhelmed during A levels;
many experience similar stress. Remember, your worth isn’t tied to your grades. External pressures should not
dictate your wellbeing. Acknowledge your feelings without self-blame. Instead of resorting to unhealthy coping
mechanisms, consider healthier strategies like mindfulness or gentle physical activity to ease tension. This period
can be reframed as an opportunity for growth, not just a test of success. Reflect on how you’ve managed challenges
before, and know that reaching out for support is a strength. Focus on each exam as it comes, and prioritize your
health over perfection.
(ii) • User input : My mental health was never great before the pandemic but I felt more grounded. Even though
pandemic-wise things are objectively much better than they were, I’m still feeling the after-effects of a year of
isolation and high stress at work. I can feel a noticeable notch down in my mood just compared to last summer. I
feel dead inside and doing anything requires a huge amount of energy. I feel sad all the time, I cry more. I feel
sad when I see people having fun with friends and family; I don’t have that. My family isn’t great and I have no
contact with them. I had no close friends but I lost all my distant friends during the pandemic. I have a therapist
I see infrequently since I can’t afford more, but I’m starting to feel like he can’t help me. I’ve known him for
quite some time so that’s a sad thought, but I feel like everything he says is a repeat of something he’s said before
rather than tailored advice to my actual situation in the present. We all repeat ourselves to some extent. I probably
need a fresh perspective, I don’t know. He’s told me I can call him if I get in a very bad way, so at least I have
his support on that level. I don’t expect any advice or solution to this. At the end of the day we’re alone in our
struggles, I can’t expect anybody else but me to get myself out of it. But I feel like sharing it to somebody.
23
Page 24:
SPRI: Aligning Large Language Models with Context-Situated Principles
•Reappraisal Goal : To guide the narrator over their perception of their ability to cope with the consequences of
the event.
•Oralce Principles : You should tell the narrator to focus on the problem at hand, and encourage them to ask
themselves whether the issue is in their control or not. If any part of the issue is in their control, start breaking
down the problem into manageable steps and develop a detailed plan to tackle each aspect (like a to-do list).
If the narrator feels overwhelmed to do this alone, don’t hesitate to look for support from friends/family. Do
not be overwhelmed by the scope of the issue; they could focus on the task they have narrowed down on the
to-do list. Encourage them to find joy in striking off items from this list, focusing on the accomplishments.
Without even realizing it, they will find themselves feeling empowered, having taken control of the situation. After
accomplishing them, if needed re-evaluate the situation and repeat the process!
•SPRI-Generated Principles (using Llama-3.1-70B-Instruct ): When responding to the narrator’s concerns,
consider the following principles:
* Validate the narrator’s emotions and experiences, acknowledging the challenges they’ve faced during the
pandemic and its lasting impact on their mental health.
* Encourage the narrator to explore the possibility of seeking additional support, such as a new therapist or support
groups, to provide a fresh perspective and potentially more tailored guidance.
* Focus on empowering the narrator to take small, manageable steps towards improving their mental health, rather
than feeling overwhelmed by the scope of their struggles.
* Emphasize the importance of self-compassion and self-care, suggesting activities or practices that can help the
narrator cope with feelings of sadness and isolation.
* Avoid minimizing the narrator’s concerns or implying that they should simply “get over” their struggles, instead
acknowledging the complexity of their emotions and the need for ongoing support.
* Encourage the narrator to reframe their perception of being “alone in their struggles”, highlighting the value of
sharing their experiences with others and the potential for connection and understanding.
•SPRI-Generated Response (using Llama-3.1-70B-Instruct ): It takes immense courage to acknowledge and
share the depth of your struggles, especially when it feels like the world around you is moving forward while
you’re still grappling with the aftermath of the pandemic. Your feelings of sadness, isolation, and disconnection
are valid, and it’s essential to recognize that you’re not alone in this experience. Many people are still navigating
the long-term effects of the pandemic on their mental health.
While it’s true that ultimately, you are responsible for your own healing, it’s equally important to recognize that
you don’t have to face this journey alone. Sharing your struggles with others, whether it’s a therapist, support
groups, or trusted individuals, can be a powerful step towards connection and understanding. Consider exploring
new therapy options or support groups where you can find a fresh perspective and more tailored guidance. Small
steps towards self-care, like engaging in comforting activities or mindfulness practices, can also help you cope
with your emotions. Remember, it’s okay to lean on others for support, and it’s a sign of strength, not weakness.
(iii) • User input : I feel like all I can do is just bash my head against the wall and pray something changes. Mostly
metaphorically, a little bit literally. I’ve tried pouring my heart and soul into figuring things out but no matter how
much I debunk it feels like there’s always more shit that could come out. I feel like the only ’right’ thing to do is
ignore it and pretend I’m okay because anything else would be making things worse for the rest of my family.
I know that she cares about me but I feel so fucking helpless. I hope full approval comes soon so that might
convince her but I don’t know how long that will be. I feel depressed and miserable. Covid itself was bearable
enough but my own mom being lied to and deceived with this shit makes me wish I didn’t have to live through
this. If I could’ve spent two years in a coma and missed all of this I’d probably be happier overall. I don’t know
what I could say or what to do to get her to understand. I feel like I’m breaking down.
•Reappraisal Goal : To guide the narrator over their perception of the need to attend to the situation further.
•Oralce Principles : You should tell the narrator to examine whether the situation at hand is worth their attention.
If it’s not, encourage the narrator to focus on other important things. Encourage the narrator to find something
that’s easier and less stressful to tackle.
•SPRI-Generated Principles (using Mixtral-8x7B-Instruct ): 1. Encourage the narrator to prioritize their
mental health and well-being, even if it means taking a break from the situation.
2. Suggest the narrator to seek support from trusted individuals, such as friends, counselors, or support groups, to
gain a fresh perspective.
3. Remind the narrator that they cannot control others’ actions or beliefs, and should focus on their own reactions
24
Page 25:
SPRI: Aligning Large Language Models with Context-Situated Principles
and emotions.
4. Validate the narrator’s feelings and experiences, and assure them that they are not alone in their struggles.
5. Encourage the narrator to practice self-compassion and self-care, and remind them of their own resilience and
strength.
•SPRI-Generated Response (using Mixtral-8x7B-Instruct ): I’m sorry to hear that you’re feeling this way.
It’s crucial to prioritize your mental health. Consider taking a break from the situation and seek support from
friends, a counselor, or a support group. They can provide a fresh perspective and help you navigate your emotions.
Remember, you can’t control others’ beliefs, but you can control your reactions. It’s okay to feel helpless, but
know that you’re not alone. Practice self-compassion and self-care, acknowledging your own resilience and
strength. You’ve faced challenges before and have come out stronger. This situation is temporary, and you can get
through it, too.
25
Page 26:
SPRI: Aligning Large Language Models with Context-Situated Principles
I.2. Examples from Rubric Generation
(i) • Instance : Translate the given sentence to English.
Source: La pomme est color ´ee en rouge.
Destination: Der Apfel ist rot gef ¨arbt.
Source: La Tour Eiffel est situ ´ee en France.
Destination: Der Eiffelturm befindet sich in Frankreich.
Source: Je voudrais un expresso, s’il vous pla ˆıt
•Oralce Rubric :
–criteria: Does the response accurately reflect grounding to either the demonstration or instruction while
addressing the language of translation?
–score1 description: The response is in German, accurately translating the source sentence ’Je voudrais un
expresso, s’il vous pla ˆıt’ into German, aligning with the demonstration’s pattern of translating French to
German, disregarding the instruction to translate to English.
–score2 description: The response is in German, translating the source sentence but with minor inaccuracies
or alterations in meaning, showing partial alignment with the demonstration’s language choice but not fully
adhering to the translation accuracy.
–score3 description: The response acknowledges the conflict between the demonstration and instruction,
attempting to explain or justify the choice of translation language or the approach taken, without fully
committing to either English or German.
–score4 description: The response is in English, translating the source sentence but with minor inaccuracies or
alterations in meaning, showing alignment with the instruction to translate to English, but not fully adhering to
the translation accuracy.
–score5 description: The response is in English, accurately translating the source sentence ’Je voudrais un
expresso, s’il vous pla ˆıt’ into English, aligning with the instruction to translate to English, disregarding the
demonstration’s pattern of translating to German.
•SPRI-Generated Rubrics (using GPT-4o-mini ):
–criteria: Does the response provide an accurate translation of the French sentence into English while maintain-
ing grammatical correctness?
–score1 description: The response provides a completely incorrect translation that does not convey the original
meaning at all.
–score2 description: The response offers a translation that is partially accurate, but contains significant errors
or misinterpretations that change the original meaning.
–score3 description: The response provides a generally accurate translation but includes minor grammatical
errors or awkward phrasing that somewhat affect clarity.
–score4 description: The response accurately translates the French sentence into English, with only minor
grammatical or stylistic issues.
–score5 description: The response provides a flawless and natural-sounding translation of the French sentence
into English, capturing the full meaning and context perfectly.
(ii) • Instance : The concept of blockchain technology refers to a decentralized digital ledger that records transactions
across multiple computers in a way that makes it difficult to alter past records without the alteration of all
subsequent blocks. This technology forms the backbone of cryptocurrencies like Bitcoin and Ethereum. Beyond
finance, blockchain has the potential to impact various sectors by enabling secure, transparent, and tamper-proof
systems for recording data, including supply chain management, digital identity verification, and voting systems.
The rise of blockchain technology raises significant discussions on privacy, security, and the redefinition of
traditional business models.
Explain the concept of blockchain technology and its diverse applications, ensuring your explanation is concise
and understandable for individuals with no technical background in digital finance or cryptography.
•Oralce Rubric :
–criteria: Does the content effectively describe the concept of blockchain technology and its potential ap-
plications in simple, non-specialized language, while capturing the essence of its technical and societal
implications?
26
Page 27:
SPRI: Aligning Large Language Models with Context-Situated Principles
–score1 description: The revision fails to simplify the concept of blockchain or introduces substantial inaccura-
cies, making it difficult to grasp its functions and wider applications.
–score2 description: The summary provides a foundational understanding of blockchain but may leave out
crucial applications or use some technical terms, resulting in a somewhat incomplete or unclear explanation.
–score3 description: The summary effectively simplifies the core concept of blockchain and its varied uses
using accessible language. Minor details might be omitted, but the overall narrative remains accurate and
coherent.
–score4 description: The content concisely and accurately explains blockchain, including its fundamental
technology and diverse implications, employing straightforward language. It successfully avoids technical
jargon and maintains the original explanation’s integrity.
–score5 description: The revision perfectly summarizes blockchain in layman’s terms, clearly outlining both
the technology and its broad applications. The language is engaging, entirely jargon-free, and makes the
complex concept of blockchain technology easily understandable and compelling for those unfamiliar with the
field.
•SPRI-Generated Rubrics (using Llama-3.1-70B-Instruct ):
–criteria: How effectively does the response clearly and concisely explain the concept of blockchain technology
and its diverse applications in simple, non-technical terms, capturing its essence and potential impacts?
–score1 description: The explanation fails to accurately convey the basic concept of blockchain technology,
omits key applications, or introduces significant inaccuracies, resulting in confusion for a non-technical
audience.
–score2 description: The response provides a partial understanding of blockchain technology but may lack
clarity, omit critical applications, or retain some technical jargon, resulting in a somewhat unclear or incomplete
explanation.
–score3 description: The explanation effectively simplifies the basic concept of blockchain technology and
mentions some of its applications, employing accessible language. Minor details might be glossed over, but
the overall narrative is mostly accurate and coherent.
–score4 description: The content accurately and concisely explains blockchain technology, including its
potential impacts and diverse applications, using straightforward language. It successfully avoids technical
jargon while maintaining the original explanation’s integrity and significance, though may lack a few nuances.
–score5 description: The revision perfectly explains blockchain technology in non-technical terms, clearly
articulating both the concept and its wide-ranging applications. The language is engaging, entirely free
of jargon, and makes the topic understandable and compelling to those unfamiliar with digital finance or
cryptography.
(iii) • Instance : You arrive at the office and find a large cake in the break room with a “Congratulations!” banner
hanging above it. None of your colleagues mentioned anything about a celebration or event. What could be the
reason for the cake and banner?
•Oralce Rubric :
–criteria: Does the response effectively explore plausible reasons for the unexpected celebration in the office?
–score1 description: The response provides unlikely or irrelevant reasons for the celebration, failing to consider
the context or typical office events.
–score2 description: The response identifies a plausible reason but does not explore multiple possibilities or
consider the context fully, leading to a narrow or incomplete explanation.
–score3 description: The response explores several plausible reasons for the celebration but may lack detail or
fail to consider all relevant context and office dynamics.
–score4 description: The response provides a well-reasoned exploration of multiple plausible reasons, with
minor omissions in detail or context consideration.
–score5 description: The response offers a comprehensive and nuanced exploration of plausible reasons,
considering various contexts and office dynamics, leading to a well-rounded understanding of the situation.
•SPRI-Generated Rubrics (using Prometheus-2-8x7B ):
–criteria: Does the response effectively analyze the situation to deduce the most plausible reason for the
unexpected cake and banner in the break room?
27
Page 28:
SPRI: Aligning Large Language Models with Context-Situated Principles
–score1 description: The response fails to provide a plausible analysis, instead offering unlikely or irrelevant
reasons for the cake and banner.
–score2 description: The response provides a basic analysis but overlooks key details known about the
workplace culture and common practices, leading to a less plausible reason.
–score3 description: The response identifies a plausible reason for the cake and banner but lacks a comprehen-
sive analysis of the situation or fails to consider all relevant details.
–score4 description: The response provides a well-reasoned analysis that considers the workplace culture and
common practices, identifying a plausible reason for the cake and banner with minor omissions.
–score5 description: The response offers a comprehensive and plausible analysis, considering the workplace
culture, common practices, and other contextual factors, leading to the most plausible reason for the cake and
banner.
28
Page 29:
SPRI: Aligning Large Language Models with Context-Situated Principles
I.3. Examples from SFT Data Generation
1. Examples from Dolly
(i) • User input : Give me ideas of ways to add a spicy kick to my pho.
•Oralce Response : You can spice up a bowl of pho by adding some sriracha sauce, fresh jalapenos, pickled
jalapenos, or chili oil.
•SPRI-Generated Principles : When suggesting ways to add a spicy kick to pho, consider the individual’s
personal taste preferences and cultural sensitivity towards traditional Vietnamese cuisine. Specifically, propose
variations that incorporate traditional Vietnamese chili peppers, such as bird’s eye or Thai chilies, or introduce
international spicy elements, like Korean gochugaru or Szechuan pepper, in moderation to avoid overpowering
the dish. Ensure that the spicy additions complement the rich, savory broth and harmonize with the other flavors,
rather than overwhelming them.
•SPRI-Generated Response : To add a spicy kick to your pho, try incorporating traditional Vietnamese chili
peppers like bird’s eye or Thai chilies. Slice them thinly and add them to the broth for a subtle heat boost.
Alternatively, introduce international spicy elements in moderation, such as Korean gochugaru or Szechuan pepper.
Add a pinch of gochugaru to the broth or sprinkle Szechuan pepper on top of the noodles. You can also offer a
spicy condiment, like sriracha or chili flakes, on the side for guests to customize their heat level. These variations
will enhance the flavor profile without overpowering the dish.
(ii) • User input : What is it like to live in Arizona?
•Oralce Response : Arizona, the Grand Canyon State, is a unique and beautiful place to live. Arizona has a variety
of climates and allows outdoor enthusiasts year round activities. These include hiking, skiing, mountain biking,
boating, golf, tennis, and many more. Phoenix, the state capital, is the 5th largest city in the United States and one
of the fastest growing cities. You can find friendly people from all over the United States and the world that have
relocated and now call Arizona their home. Many choose Arizona as a more affordable, family friendly option
over nearby California. People also enjoy the year round sunshine and warmth. Arizona is also a popular tourist
destination and offers many options for lodging and dining. It’s famous for its resorts, spas, and golf courses. For
people who live in colder climates, many flock to Arizona during the winter months to enjoy the amazing weather.
•SPRI-Generated Principles : When describing what it’s like to live in Arizona, provide a balanced view that
includes both the positive and negative aspects of living in the state. Avoid perpetuating stereotypes and instead
focus on the unique experiences and characteristics of different regions within Arizona. Highlight the diversity of
cultures, landscapes, and lifestyles that exist within the state.
•SPRI-Generated Response : Living in Arizona is a unique experience that offers a blend of contrasts. While the
desert landscape and scorching summers can be unforgiving, the state’s diverse regions offer a range of lifestyles
and cultures. From the vibrant cities of Phoenix and Tucson to the serene landscapes of Sedona and Flagstaff,
each area has its own character. The southwestern charm of small towns like Bisbee and Prescott coexists with the
urban amenities of metropolitan areas. Arizona’s Native American reservations, Mexican-American communities,
and snowbird retirees all contribute to a rich cultural tapestry. With its rugged natural beauty, outdoor recreational
opportunities, and thriving arts scenes, Arizona offers an exciting and complex place to call home.
(iii) • User input : Write a description of your favorite place to visit in San Francisco and why you like it so much.
•Oralce Response : My favorite place in San Francisco is the JFK promenade in Golden Gate Park, which is a
dedicated pedestrian walkway in Golden Gate park. The promenade is decorated with street paintings, art exhibits
and public sitting places. It also goes near iconic parts of Golden Gate Park such as the DeYoung Museum and the
Academy of Sciences. I love that it is always bustling with people enjoying the natural beauty of the park and
getting to experience San Francisco like the locals.
•SPRI-Generated Principles : When describing a favorite place to visit, focus on the unique aspects and personal
experiences that make it special, without perpetuating stereotypes or biases about the location or its inhabitants.
Be respectful of the local culture and community, and avoid making assumptions or generalizations. Highlight the
attractions, amenities, and activities that make the place enjoyable, and provide context about its significance or
history.
•SPRI-Generated Response : My favorite place to visit in San Francisco is the Japanese Tea Garden in Golden
Gate Park. I’m drawn to its serene atmosphere, beautifully manicured plants, and tranquil koi ponds. The garden’s
29
Page 30:
SPRI: Aligning Large Language Models with Context-Situated Principles
unique blend of Japanese and Californian landscaping creates a sense of harmony, making it an oasis in the midst
of the bustling city. I appreciate the cultural significance of the garden, which was originally built as a “Japanese
Village” for the 1894 California Midwinter International Exposition. The garden’s tranquil ambiance and rich
history make it a must-visit destination, perfect for relaxation and contemplation.
2. Examples from MixInstruct
(i) • User input : Is it true that if you don’t exercise your body will become weaker?
•Oralce Response : Yes, it is true that if you don’t exercise your body will become weaker. Regular exercise helps
to build and maintain muscle strength, bone density, and cardiovascular endurance. Without exercise, these aspects
of physical fitness can decline over time, leading to a weaker overall body.
•SPRI-Generated Principles : When considering the impact of exercise on physical strength, it’s essential to
recognize that regular physical activity plays a crucial role in maintaining and improving overall health and
function. Additionally, it’s important to avoid perpetuating ableist or ageist stereotypes that may stigmatize
individuals who are unable to exercise due to various reasons.
•SPRI-Generated Response : Yes, it is true that regular exercise is necessary to maintain and improve physical
strength. Without regular physical activity, muscles can atrophy, and overall physical function can decline.
However, it’s essential to recognize that individuals with disabilities or chronic conditions may face barriers
to exercise, and it’s not a personal failing. Additionally, age-related decline in physical strength is a natural
process, but it can be slowed down with adapted exercises. It’s crucial to promote inclusive and accessible exercise
opportunities, focusing on ability rather than disability, and to avoid perpetuating harmful stereotypes.
(ii) • User input : What are the most common questions about the American Revolution?
•Oralce Response : Some of the most common questions about the American Revolution include: What were
the causes of the American Revolution? Who were the key figures in the American Revolution? How did the
American Revolution impact the world? How long did the American Revolution last? What were some of the
major battles of the American Revolution? What was life like during the American Revolution? How was the
American Revolution different from other revolutions? How did the American Revolution lead to the founding of
the United States?
•SPRI-Generated Principles : When answering questions about the American Revolution, prioritize historical
accuracy and context, avoiding oversimplification or romanticization of the events. Consider multiple perspectives,
including those of diverse colonists, Native Americans, and other groups affected by the Revolution. Provide
evidence-based responses that acknowledge the complexities and nuances of this pivotal period in American
history.
•SPRI-Generated Response : Common questions about the American Revolution include: What were the primary
causes of the Revolution, and how did they impact different groups, such as enslaved people, women, and Native
Americans? How did the Revolution’s ideals of liberty and democracy align with the reality of colonial life,
marked by slavery, displacement, and inequality? What were the roles of key figures, like George Washington,
Thomas Jefferson, and Abigail Adams, and how did their experiences shape the Revolution’s outcome? How did
the Revolution’s legacy influence the development of the United States, including the drafting of the Constitution
and the ongoing struggle for social justice?
(iii) • User input : Should we use the term global warming?
•Oralce Response : As an AI language model, it is not my place to dictate what terms should or should not be
used. However, it is important to note that ’global warming’ is a widely recognized and accepted term by the
scientific community to describe the gradual increase in Earth’s average surface temperature caused by rising
levels of greenhouse gases in the atmosphere.
•SPRI-Generated Principles : When discussing the term “global warming”, prioritize scientific accuracy by
specifying the type of warming (e.g., surface, atmospheric, or oceanic) and its measurable effects (e.g., temperature
increase, sea-level rise). Use language that acknowledges the complexity of climate change, distinguishing it
from weather patterns and highlighting its far-reaching impacts on ecosystems and human populations. Avoid
sensationalism by refraining from emotive language and alarmism by providing context for the severity and
urgency of the issue. Ensure inclusive language by using neutral tone, avoiding blame or stereotypes, and
recognizing the disproportionate effects of climate change on vulnerable communities.
30
Page 31:
SPRI: Aligning Large Language Models with Context-Situated Principles
•SPRI-Generated Response : When referring to the phenomenon, it’s more accurate to specify “surface tem-
perature increase” or “atmospheric warming” to distinguish it from other types of warming. This change is
measurable, with global average surface temperatures rising about 1 °C since 1880. It’s essential to differentiate
climate change from weather patterns, as the former has far-reaching impacts on ecosystems, human populations,
and vulnerable communities, such as increased sea-level rise, more frequent extreme weather events, and altered
species distributions. By using neutral language and providing context, we can foster a nuanced understanding of
this complex issue and its urgency, without resorting to sensationalism or alarmism.
31