loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2305.18290

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Published: 2023-05-29

Abstract:

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Paper Content: on Alphaxiv
Page 1: Direct Preference Optimization: Your Language Model is Secretly a Reward Model Rafael Rafailov∗†Archit Sharma∗†Eric Mitchell∗† Stefano Ermon†‡Christopher D. Manning†Chelsea Finn† †Stanford University‡CZ Biohub {rafailov,architsh,eric.mitchell}@cs.stanford.edu Abstract While large-scale unsupervised language models (LMs) learn broad world knowl- edge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these prefer- ences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Prefer- ence Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sen- timent of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train. 1 Introduction Large unsupervised language models (LMs) trained on very large datasets acquire surprising capabili- ties [ 11,7,42,8]. However, these models are trained on data generated by humans with a wide variety of goals, priorities, and skillsets. Some of these goals and skillsets may not be desirable to imitate; for example, while we may want our AI coding assistant to understand common programming mistakes in order to correct them, nevertheless, when generating code, we would like to bias our model toward the (potentially rare) high-quality coding ability present in its training data. Similarly, we might want our language model to be aware of a common misconception believed by 50% of people, but we certainly do not want the model to claim this misconception to be true in 50% of queries about it! In other words, selecting the model’s desired responses and behavior from its very wide knowledge and abilities is crucial to building AI systems that are safe, performant, and controllable [ 28]. While existing methods typically steer LMs to match human preferences using reinforcement learning (RL), ∗Equal contribution; more junior authors listed earlier. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).arXiv:2305.18290v3 [cs.LG] 29 Jul 2024 Page 2: Figure 1: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form. we will show that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, greatly simplifying the preference learning pipeline. At a high level, existing methods instill the desired behaviors into a language model using curated sets of human preferences representing the types of behaviors that humans find safe and helpful. This preference learning stage occurs after an initial stage of large-scale unsupervised pre-training on a large text dataset. While the most straightforward approach to preference learning is supervised fine-tuning on human demonstrations of high quality responses, the most successful class of methods is reinforcement learning from human (or AI) feedback (RLHF/RLAIF; [ 12,2]). RLHF methods fit a reward model to a dataset of human preferences and then use RL to optimize a language model policy to produce responses assigned high reward without drifting excessively far from the original model. While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning, involving training multiple LMs and sampling from the LM policy in the loop of training, incurring significant computational costs. In this paper, we show how to directly optimize a language model to adhere to human preferences, without explicit reward modeling or reinforcement learning. We propose Direct Preference Optimiza- tion (DPO) , an algorithm that implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint) but is simple to implement and straight- forward to train. Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that we find occurs with a naive probability ratio objective. Like existing algorithms, DPO relies on a theoretical preference model (such as the Bradley-Terry model; [ 5]) that measures how well a given reward function aligns with empirical preference data. However, while existing methods use the preference model to define a preference loss to train a reward model and then train a policy that optimizes the learned reward model, DPO uses a change of variables to define the preference loss as a function of the policy directly. Given a dataset of human preferences over model responses, DPO can therefore optimize a policy using a simple binary cross entropy objective, producing the optimal policy to an implicit reward function fit to the preference data. Our main contribution is Direct Preference Optimization (DPO), a simple RL-free algorithm for training language models from preferences. Our experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences in tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters. 2 Related Work Self-supervised language models of increasing scale learn to complete some tasks zero-shot [ 33] or with few-shot prompts [ 6,27,11]. However, their performance on downstream tasks and alignment with user intent can be significantly improved by fine-tuning on datasets of instructions and human- written completions [ 25,38,13,41]. This ‘instruction-tuning’ procedure enables LLMs to generalize to instructions outside of the instruction-tuning set and generally increase their usability [ 13]. Despite the success of instruction tuning, relative human judgments of response quality are often easier to collect than expert demonstrations, and thus subsequent works have fine-tuned LLMs with datasets of human preferences, improving proficiency in translation [ 20], summarization [ 40,51], story-telling [51], and instruction-following [ 28,34]. These methods first optimize a neural network reward function for compatibility with the dataset of preferences under a preference model such as the 2 Page 3: Bradley-Terry model [ 5], then fine-tune a language model to maximize the given reward using reinforcement learning algorithms, commonly REINFORCE [ 47], proximal policy optimization (PPO; [ 39]), or variants [ 34]. A closely-related line of work leverages LLMs fine-tuned for instruction following with human feedback to generate additional synthetic preference data for targeted attributes such as safety or harmlessness [ 2], using only weak supervision from humans in the form of a text rubric for the LLM’s annotations. These methods represent a convergence of two bodies of work: one body of work on training language models with reinforcement learning for a variety of objectives [ 35,29,48] and another body of work on general methods for learning from human preferences [ 12,21]. Despite the appeal of using relative human preferences, fine-tuning large language models with reinforcement learning remains a major practical challenge; this work provides a theoretically-justified approach to optimizing relative preferences without RL. Outside of the context of language, learning policies from preferences has been studied in both bandit and reinforcement learning settings, and several approaches have been proposed. Contextual bandit learning using preferences or rankings of actions, rather than rewards, is known as a contextual dueling bandit (CDB; [ 50,14]). In the absence of absolute rewards, theoretical analysis of CDBs substitutes the notion of an optimal policy with a von Neumann winner , a policy whose expected win rate against anyother policy is at least 50% [ 14]. However, in the CDB setting, preference labels are given online, while in learning from human preferences, we typically learn from a fixed batch of offline preference-annotated action pairs [ 49]. Similarly, preference-based RL (PbRL) learns from binary preferences generated by an unknown ‘scoring’ function rather than rewards [9, 37]. Various algorithms for PbRL exist, including methods that can reuse off-policy preference data, but generally involve first explicitly estimating the latent scoring function (i.e. the reward model) and subsequently optimizing it [ 16,9,12,36,21]. We instead present a single stage policy learning approach that directly optimizes a policy to satisfy preferences. 3 Preliminaries We review the RLHF pipeline in Ziegler et al. (and later [ 40,1,28]). It usually includes three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning and 3) RL optimization. SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model πSFT. Reward Modelling Phase : In the second phase the SFT model is prompted with prompts xto produce pairs of answers (y1, y2)∼πSFT(y|x). These are then presented to human labelers who express preferences for one answer, denoted as yw≻yl|xwhere ywandyldenotes the preferred and dispreferred completion amongst (y1, y2)respectively. The preferences are assumed to be generated by some latent reward model r∗(y, x), which we do not have access to. There are a number of approaches used to model preferences, the Bradley-Terry (BT) [ 5] model being a popular choice (although more general Plackett-Luce ranking models [ 32,23] are also compatible with the framework if we have access to several ranked answers). The BT model stipulates that the human preference distribution p∗can be written as: p∗(y1≻y2|x) =exp (r∗(x, y 1)) exp (r∗(x, y 1)) + exp ( r∗(x, y 2)). (1) Assuming access to a static dataset of comparisons D= x(i), y(i) w, y(i) l N i=1sampled from p∗, we can parametrize a reward model rϕ(x, y)and estimate the parameters via maximum likelihood. Framing the problem as a binary classification we have the negative log-likelihood loss: LR(rϕ,D) =−E(x,yw,yl)∼D logσ(rϕ(x, yw)−rϕ(x, yl)) (2) where σis the logistic function. In the context of LMs, the network rϕ(x, y)is often initialized from the SFT model πSFT(y|x)with the addition of a linear layer on top of the final transformer layer that produces a single scalar prediction for the reward value [ 51]. To ensure a reward function with lower variance, prior works normalize the rewards, such that Ex,y∼D[rϕ(x, y)] = 0 for all x. RL Fine-Tuning Phase : During the RL phase, the learned reward function is used to provide feedback to the language model. Following prior works [17, 18], the optimization is formulated as max πθEx∼D,y∼πθ(y|x) rϕ(x, y) −βDKL πθ(y|x)||πref(y|x) , (3) 3 Page 4: where βis a parameter controlling the deviation from the base reference policy πref, namely the ini- tial SFT model πSFT. In practice, the language model policy πθis also initialized to πSFT. The added constraint is important, as it prevents the model from deviating too far from the distri- bution on which the reward model is accurate, as well as maintaining the generation diversity and preventing mode-collapse to single high-reward answers. Due to the discrete nature of lan- guage generation, this objective is not differentiable and is typically optimized with reinforce- ment learning. The standard approach [ 51,40,1,28] has been to construct the reward function r(x, y) =rϕ(x, y)−β(logπθ(y|x)−logπref(y|x)), and maximize using PPO [39]. 4 Direct Preference Optimization Motivated by the challenges of applying reinforcement learning algorithms on large-scale problems such as fine-tuning language models, our goal is to derive a simple approach for policy optimization using preferences directly. Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As we will describe next in detail, our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences, such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward. Deriving the DPO objective. We start with the same RL objective as prior work, Eq. 3, under a general reward function r. Following prior work [ 31,30,19,15], it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective in Eq. 3 takes the form: πr(y|x) =1 Z(x)πref(y|x) exp1 βr(x, y) , (4) where Z(x) =P yπref(y|x) exp 1 βr(x, y) is the partition function. See Appendix A.1 for a complete derivation. Even if we use the MLE estimate rϕof the ground-truth reward function r∗, it is still expensive to estimate the partition function Z(x)[19,15], which makes this representation hard to utilize in practice. However, we can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy πr, the reference policy πref, and the unknown partition function Z(·). Specifically, we first take the logarithm of both sides of Eq. 4 and then with some algebra we obtain: r(x, y) =βlogπr(y|x) πref(y|x)+βlogZ(x). (5) We can apply this reparameterization to the ground-truth reward r∗and corresponding optimal model π∗. Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions, i.e., p∗(y1≻y2|x) =σ(r∗(x, y 1)−r∗(x, y 2)). Substituting the reparameterization in Eq. 5 for r∗(x, y)into the preference model Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy π∗and reference policy πref. Thus, the optimal RLHF policy π∗under the Bradley-Terry model satisfies the preference model: p∗(y1≻y2|x) =1 1 + exp βlogπ∗(y2|x) πref(y2|x)−βlogπ∗(y1|x) πref(y1|x) (6) The derivation is in Appendix A.2. While Eq. 6 uses the Bradley-Terry model, we can similarly derive expressions under the more general Plackett-Luce models [32, 23], shown in Appendix A.3. Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy πθ. Analogous to the reward modeling approach (i.e. Eq. 2), our policy objective becomes: LDPO(πθ;πref) =−E(x,yw,yl)∼D logσ βlogπθ(yw|x) πref(yw|x)−βlogπθ(yl|x) πref(yl|x) . (7) This way, we fit an implicit reward using an alternative parameterization, whose optimal policy is simply πθ. Moreover, since our procedure is equivalent to fitting a reparametrized Bradley-Terry 4 Page 5: model, it enjoys certain theoretical properties, such as consistencies under suitable assumption of the preference data distribution [ 4]. In Section 5, we further discuss theoretical properties of DPO in relation to other works. What does the DPO update do? For a mechanistic understanding of DPO, it is useful to analyze the gradient of the loss function LDPO. The gradient with respect to the parameters θcan be written as: ∇θLDPO(πθ;πref) = −βE(x,yw,yl)∼D σ(ˆrθ(x, yl)−ˆrθ(x, yw))| {z } higher weight when reward estimate is wrong ∇θlogπ(yw|x)| {z } increase likelihood of yw− ∇ θlogπ(yl|x)|{z } decrease likelihood of yl , where ˆrθ(x, y) =βlogπθ(y|x) πref(y|x)is the reward implicitly defined by the language model πθand refer- ence model πref(more in Section 5). Intuitively, the gradient of the loss function LDPOincreases the likelihood of the preferred completions ywand decreases the likelihood of dispreferred completions yl. Importantly, the examples are weighed by how much higher the implicit reward model ˆrθrates the dispreferred completions, scaled by β, i.e, how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint. Our experiments suggest the importance of this weighting, as a naïve version of this method without the weighting coefficient can cause the language model to degenerate (Appendix Table 3). DPO outline. The general DPO pipeline is as follows: 1) Sample completions y1, y2∼πref(· |x) for every prompt x, label with human preferences to construct the offline dataset of preferences D={x(i), y(i) w, yl)(i)}N i=1and 2) optimize the language model πθto minimize LDPOfor the given πrefandDand desired β. In practice, one would like to reuse preference datasets publicly available, rather than generating samples and gathering human preferences. Since the preference datasets are sampled using πSFT, we initialize πref=πSFTwhenever available. However, when πSFTis not available, we initialize πrefby maximizing likelihood of preferred completions (x, yw), that is,πref= arg maxπEx,yw∼D[logπ(yw|x)]. This procedure helps mitigate the distribution shift between the true reference distribution which is unavailable, and πrefused by DPO. Further details related to the implementation and hyperparameters can be found in Appendix B. 5 Theoretical Analysis of DPO In this section, we give further interpretation of the DPO method, provide theoretical backing, and relate advantages of DPO to issues with actor critic algorithms used for RLHF (such as PPO [39]). 5.1 Your Language Model Is Secretly a Reward Model DPO is able to bypass both fitting an explicit reward and performing RL to learn the policy using a single maximum likelihood objective. Note the optimization objective Eq. 5 is equivalent to a Bradley-Terry model with a reward parameterization r∗(x, y) =βlogπ∗ θ(y|x) πref(y|x)and we optimize our parametric model πθ, equivalently to the reward model optimization in Eq. 2 under the change of variables. In this section we will build the theory behind this reparameterization, show that it does not constrain the class of learned reward models, and allows for the exact recovery of the optimal policy. We begin with by defining an equivalence relation between reward functions. Definition 1. We say that two reward functions r(x, y)and r′(x, y)are equivalent iff r(x, y)−r′(x, y) =f(x)for some function f. It is easy to see that this is indeed an equivalence relation, which partitions the set of reward functions into classes. We can state the following two lemmas: Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. Lemma 2. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. The proofs are straightforward and we defer them to Appendix A.5. The first lemma is a well-known under-specification issue with the Plackett-Luce family of models [ 32]. Due to this under-specification, 5 Page 6: we usually have to impose additional identifiability constraints to achieve any guarantees on the MLE estimates from Eq. 2 [ 4]. The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class. We prove the following Theorem in Appendix A.6: Theorem 1. Under mild assumptions, all reward classes consistent with the Plackett-Luce (and Bradley-Terry in particular) models can be represented with the reparameterization r(x, y) =βlogπ(y|x) πref(y|x)for some model π(y|x)and a given reference model πref(y|x). Proof Sketch. Consider any reward function r(x, y), which induces a corresponding optimal model πr(y|x), specified by Eq. 4. We will show that a reward function from the equivalence class of r can be represented using the reparameterization given above. We define the projection fas f(r;πref, β)(x, y) =r(x, y)−βlogX yπref(y|x) exp1 βr(x, y) (8) The operator fsimply normalizes the reward function with the logarithm of the partition function ofπr. Since the added normalization term is only a function of the prefix x,f(r;πref, β)(x, y)is a reward function in the equivalence class of r(x, y). Finally, replacing rwith the RHS of Eq. 5 (which holds for any reward function), we have f(r;πref, β)(x, y) =βlogπr(y|x) πref(y|x). That is, the projection fproduces a member of the equivalence class of rwith the desired form, and we do not lose any generality in our reward model from the proposed reparameterization. We can alternatively view Theorem 1 as specifying exactly which reward function within each equivalence class the DPO reparameterization selects, that is, the reward function satisfying: X yπref(y|x) exp1 βr(x, y) | {z } =π(y|x), using Thm. 1 reparam.= 1, (9) i.e.,π(y|x)is a valid distribution (probabilities are positive and sum to 1). However, following Eq. 4, we can see that Eq. 9 is the partition function of the optimal policy induced by the reward function r(x, y). The key insight of the DPO algorithm is that we can impose certain constraints on the under-constrained Plackett-Luce (and Bradley-Terry in particular) family of preference models, such that we preserve the class of representable reward models, but explicitly make the optimal policy in Eq. 4 analytically tractable for all prompts x. 5.2 Instability of Actor-Critic Algorithms We can also use our framework to diagnose instabilities with standard actor-critic algorithms used for the RLHF, such as PPO. We follow the RLHF pipeline and focus on the RL fine-tuning step outlined in Section 3. We can draw connections to the control as inference framework [ 22] for the constrained RL problem outlined in 3. We assume a parameterized model πθ(y|x)and minimize DKL[πθ(y|x)||π∗(y|x)]where π∗is the optimal policy from Eq. 7 induced by the reward function rϕ(y, x). With some algebra this leads to the optimization objective: max πθEπθ(y|x) rϕ(x, y)−βlogX yπref(y|x) exp1 βrϕ(x, y) | {z } f(rϕ,πref,β)−βlogπθ(y|x) πref(y|x)|{z } KL (10) This is the same objective optimized in prior works [ 51,40,1,28] using the DPO-equivalent reward for the reward class of rϕ. In this setting, we can interpret the normalization term in f(rϕ, πref, β) as the soft value function of the reference policy πref. While this term does not affect the optimal solution, without it, the policy gradient of the objective could have high variance, making learning unstable. We can accommodate for the normalization term using a learned value function, but that can also be difficult to optimize. Alternatively, prior works have normalized rewards using a human completion baseline, essentially a single sample Monte-Carlo estimate of the normalizing term. In contrast the DPO reparameterization yields a reward function that does not require any baselines. 6 Page 7: 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 KL( ref) 0.40.50.60.70.80.91.0RewardIMDb Sentiment Generation DPO (Ours) Unlikelihood PPO (Our impl.)PPO-GT (Our impl.) PPO-GT (TRL) Preferred-FT 0.00 0.25 0.50 0.75 1.00 Sampling temperature0.00.10.20.30.40.50.60.7Win rateTL;DR Summarization Win Rate vs Reference DPO PPOPreferred-FT SFTGPT-J Best of 128Figure 2: Left. The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization. Right. TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case performance on summarization, while being more robust to changes in the sampling temperature. 6 Experiments In this section, we empirically evaluate DPO’s ability to train policies directly from preferences. First, in a well-controlled text-generation setting, we ask: how efficiently does DPO trade off maximizing reward and minimizing KL-divergence with the reference policy, compared to common preference learning algorithms such as PPO? Next, we evaluate DPO’s performance on larger models and more difficult RLHF tasks, including summarization and dialogue. We find that with almost no tuning of hyperparameters, DPO tends to perform as well or better than strong baselines like RLHF with PPO as well as returning the best of Nsampled trajectories under a learned reward function. Before presenting these results, we describe the experimental set-up; additional details are in Appendix C. Tasks. Our experiments explore three different open-ended text generation tasks. For all experiments, algorithms learn a policy from a dataset of preferences D= x(i), y(i) w, y(i) l N i=1. In controlled sentiment generation ,xis a prefix of a movie review from the IMDb dataset [ 24], and the policy must generate ywith positive sentiment. In order to perform a controlled evaluation, for this experiment we generate preference pairs over generations using a pre-trained sentiment classifier, where p(positive |x, yw)> p(positive |x, yl). For SFT, we fine-tune GPT-2-large until convergence on reviews from the train split of the IMDB dataset (further details in App C.1). In summarization , xis a forum post from Reddit; the policy must generate a summary yof the main points in the post. Following prior work, we use the Reddit TL;DR summarization dataset [ 43] along with human preferences gathered by Stiennon et al.. We use an SFT model fine-tuned on human-written forum post summaries2with the TRLX [ 44] framework for RLHF. The human preference dataset was gathered by Stiennon et al. on samples from a different, but similarly-trained, SFT model. Finally, in single-turn dialogue ,xis a human query, which may be anything from a question about astrophysics to a request for relationship advice. A policy must produce an engaging and helpful response yto a user’s query; we use the Anthropic Helpful and Harmless dialogue dataset [ 1], containing 170k dialogues between a human and an automated assistant. Each transcript ends with a pair of responses generated by a large (although unknown) language model along with a preference label denoting the human-preferred response. In this setting, no pre-trained SFT model is available; we therefore fine-tune an off-the-shelf language model on only the preferred completions to form the SFT model. Evaluation. Our experiments use two different approaches to evaluation. In order to analyze the effectiveness of each algorithm in optimizing the constrained reward maximization objective, in the controlled sentiment generation setting we evaluate each algorithm by its frontier of achieved reward and KL-divergence from the reference policy; this frontier is computable because we have acccess to the ground-truth reward function (a sentiment classifier). However, in the real world, the ground truth reward function is not known; therefore, we evaluate algorithms with their win rate against a baseline policy, using GPT-4 as a proxy for human evaluation of summary quality and response helpfulness in the summarization and single-turn dialogue settings, respectively. For summarization, we use reference summaries in the test set as the baseline; for dialogue, we use the preferred response in the 2https://huggingface.co/CarperAI/openai_summarize_tldr_sft 7 Page 8: 0.25 0.50 0.75 1.00 Sampling temperature0.10.20.30.40.50.6Win rateAnthropic-HH Dialogue Win Rate vs Chosen DPO Best of 128Preferred-FT Pythia-2.8B 0 300 600 900 1200 1500 1800 2100 2400 2700 3000 3300 Fine-tuning step0.300.350.400.450.500.550.600.650.70Win rate Dialogue Win Rate Evolution DPO (temp = 1.0) DPO (temp = 0.7)Figure 3: Left. Win rates computed by GPT-4 for Anthropic-HH one-step dialogue; DPO is the only method that improves over chosen summaries in the Anthropic-HH test set. Right. Win rates for different sampling temperatures over the course of training. DPO’s improvement over the dataset labels is fairly stable over the course of training for different sampling temperatures. test dataset as the baseline. While existing studies suggest LMs can be better automated evaluators than existing metrics [ 10], we conduct a human study to justify our usage of GPT-4 for evaluation in Sec. 6.4. We find GPT-4 judgments correlate strongly with humans, with human agreement with GPT-4 typically similar or higher than inter-human annotator agreement. Methods. In addition to DPO, we evaluate several existing approaches to training language models to adhere to human preferences. Most simply, we explore zero-shot prompting with GPT-J [45] in the summarization task and 2-shot prompting with Pythia-2.8B [3] in the dialogue task. In addition, we evaluate the SFT model as well as Preferred-FT , which is a model fine-tuned with supervised learning on the chosen completion ywfrom either the SFT model (in controlled sentiment and summarization) or a generic LM (in single-turn dialogue). Another pseudo-supervised method is Unlikelihood [46], which simply optimizes the policy to maximize the probability assigned to ywand minimize the probability assigned to yl; we use an optional coefficient α∈[0,1]on the ‘unlikelihood’ term. We also consider PPO [39] using a reward function learned from the preference data and PPO-GT , which is an oracle that learns from the ground truth reward function available in the controlled sentiment setting. In our sentiment experiments, we use two implementations of PPO-GT, one of-the-shelf version [ 44] as well as a modified version that normalizes rewards and further tunes hyperparameters to improve performance (we also use these modifications when running ‘normal’ PPO with learned rewards). Finally, we consider the Best of Nbaseline, sampling Nresponses from the SFT model (or Preferred-FT in dialogue) and returning the highest-scoring response according to a reward function learned from the preference dataset. This high-performing method decouples the quality of the reward model from the PPO optimization, but is computationally impractical even for moderate Nas it requires sampling Ncompletions for every query at test time. 6.1 How well can DPO optimize the RLHF objective? The KL-constrained reward maximization objective used in typical RLHF algorithms balances exploitation of reward while restricting the policy from deviating far from the reference policy. Therefore, when comparing algorithms, we must take into account both reward achieved as well as the KL discrepancy; achieving slightly higher reward but with much higher KL is not necessarily desirable. Figure 2 shows the reward-KL frontier for various algorithms in the sentiment setting. We execute multiple training runs for each algorithm, using a different hyperparameter for policy conservativeness in each run (target KL ∈ {3,6,9,12}for PPO, β∈ {0.05,0.1,1,5},α∈ {0.05,0.1,0.5,1}for unlikelihood, random seeds for preferred-FT). This sweep includes 22 runs in total. After each 100 training steps until convergence, we evaluate each policy on a set of test prompts, computing the average reward under the true reward function as well as the average sequence-level KL3with the reference policy KL(π||πref). We find that DPO produces by far the most efficient frontier, achieving the highest reward while still achieving low KL. This result is particularly notable for multiple reasons. First, DPO and PPO optimize the same objective, but DPO is notably more efficient; 3That is, the sum of the per-timestep KL-divergences. 8 Page 9: DPO’s reward/KL tradeoff strictly dominates PPO. Second, DPO achieves a better frontier than PPO, even when PPO can access ground truth rewards (PPO-GT). 6.2 Can DPO scale to real preference datasets? Next, we evaluate fine-tuning performance of DPO on summarization and single-turn dialogue. For summarization, automatic evaluation metrics such as ROUGE can be poorly correlated with human preferences [ 40], and prior work has found that fine-tuning LMs using PPO on human preferences to provide more effective summaries. We evaluate different methods by sampling completions on the test split of TL;DR summarization dataset, and computing the average win rate against reference completions in the test set. The completions for all methods are sampled at temperatures varying from 0.0 to 1.0, and the win rates are shown in Figure 2 (right). DPO, PPO and Preferred-FT all fine-tune the same GPT-J SFT model4. We find that DPO has a win rate of approximately 61% at a temperature of 0.0, exceeding the performance of PPO at 57% at its optimal sampling temperature of 0.0. DPO also achieves a higher maximum win rate compared to the best of Nbaseline. We note that we did not meaningfully tune DPO’s βhyperparameter, so these results may underestimate DPO’s potential. Moreover, we find DPO to be much more robust to the sampling temperature than PPO, the performance of which can degrade to that of the base GPT-J model at high temperatures. Preferred-FT does not improve significantly over the SFT model. We also compare DPO and PPO head-to-head in human evaluations in Section 6.4, where DPO samples at temperature 0.25 were preferred 58% times over PPO samples at temperature 0. On single-turn dialogue, we evaluate the different methods on the subset of the test split of the Anthropic HH dataset [ 1] with one step of human-assistant interaction. GPT-4 evaluations use the preferred completions on the test as the reference to compute the win rate for different methods. As there is no standard SFT model for this task, we start with a pre-trained Pythia-2.8B, use Preferred-FT to train a reference model on the chosen completions such that completions are within distribution of the model, and then train using DPO. We also compare against the best of 128 Preferred-FT completions (we found the Best of Nbaseline plateaus at 128 completions for this task; see Appendix Figure 4) and a 2-shot prompted version of the Pythia-2.8B base model, finding DPO performs as well or better for the best-performing temperatures for each method. We also evaluate an RLHF model trained with PPO on the Anthropic HH dataset5from a well-known source6, but are unable to find a prompt or sampling temperature that gives performance better than the base Pythia-2.8B model. Based on our results from TL;DR and the fact that both methods optimize the same reward function, we consider Best of 128 a rough proxy for PPO-level performance. Overall, DPO is the only computationally efficient method that improves over the preferred completions in the Anthropic HH dataset, and provides similar or better performance to the computationally demanding Best of 128 baseline. Finally, Figure 3 shows that DPO converges to its best performance relatively quickly. 6.3 Generalization to a new input distribution Win rate vs. ground truth Alg. Temp 0 Temp 0.25 DPO 0.36 0.31 PPO 0.26 0.23 Table 1: GPT-4 win rates vs. ground truth summaries for out-of-distribution CNN/DailyMail input articles.To further compare the performance of PPO and DPO un- der distribution shifts, we evaluate the PPO and DPO poli- cies from our Reddit TL;DR summarization experiment on a different distribution, news articles in the test split of the CNN/DailyMail dataset [ 26], using the best sampling temper- atures from TL;DR (0 and 0.25). The results are presented in Table 1. We computed the GPT-4 win rate against the ground-truth summaries in the datasets, using the same GPT- 4 (C) prompt we used for Reddit TL;DR, but replacing the words “forum post” with “news article”. For this new distribution, DPO continues to outperform the PPO policy by a significant margin. This experiment provides initial evidence that DPO policies can generalize similarly well to PPO policies, even though DPO does not use the additional unlabeled Reddit TL;DR prompts that PPO uses. 4https://huggingface.co/CarperAI/openai_summarize_tldr_sft 5https://huggingface.co/reciprocate/ppo_hh_pythia-6B 6https://github.com/CarperAI/trlx/tree/main/examples/hh 9 Page 10: 6.4 Validating GPT-4 judgments with human judgments We conduct a human study to verify the reliability of GPT-4’s judgments, using the results of the TL;DR summarization experiment and two different GPT-4 prompts. The GPT-4 (S) (sim- ple) prompt simply asks for which summary better-summarizes the important information in the post. The GPT-4 (C) (concise) prompt also asks for which summary is more concise; we eval- uate this prompt because we find that GPT-4 prefers longer, more repetitive summaries than hu- mans do with the GPT-4 (S) prompt. See Appendix C.2 for the complete prompts. We perform three comparisons, using the highest (DPO, temp. 0.25), the lowest (PPO, temp. 1.0), and a DPO SFT PPO-1 N respondents 272 122 199 GPT-4 (S) win % 47 27 13 GPT-4 (C) win % 54 32 12 Human win % 58 43 17 GPT-4 (S)-H agree 70 77 86 GPT-4 (C)-H agree 67 79 85 H-H agree 65 - 87 Table 2: Comparing human and GPT-4 win rates and per-judgment agreement on TL;DR summariza- tion samples. Humans agree with GPT-4 about as much as they agree with each other. Each experi- ment compares a summary from the stated method with a summary from PPO with temperature 0.middle-performing (SFT, temp. 0.25) method with the aim of covering a diversity of sample qualities; all three methods are compared against greedily- sampled PPO (its best-performing temperature). We find that with both prompts, GPT-4 tends to agree with humans about as often as humans agree with each other, suggesting that GPT-4 is a reason- able proxy for human evaluations (due to limited human raters, we only collect multiple human judg- ments for the DPO and PPO-1 comparisons). Over- all, the GPT-4 (C) prompt generally provides win rates more representative of humans; we therefore use this prompt for the main results in Section 6.2. For additional details about the human study, in- cluding the web interface presented to raters and the list of human volunteers, see Appendix D.3. 7 Discussion Learning from preferences is a powerful, scalable framework for training capable, aligned language models. We have introduced DPO, a simple training paradigm for training language models from preferences without reinforcement learning. Rather than coercing the preference learning problem into a standard RL setting in order to use off-the-shelf RL algorithms, DPO identifies a mapping between language model policies and reward functions that enables training a language model to satisfy human preferences directly , with a simple cross-entropy loss, without reinforcement learning or loss of generality. With virtually no tuning of hyperparameters, DPO performs similarly or better than existing RLHF algorithms, including those based on PPO; DPO thus meaningfully reduces the barrier to training more language models from human preferences. Limitations & Future Work. Our results raise several important questions for future work. How does the DPO policy generalize out of distribution, compared with learning from an explicit reward function? Our initial results suggest that DPO policies can generalize similarly to PPO-based models, but more comprehensive study is needed. For example, can training with self-labeling from the DPO policy similarly make effective use of unlabeled prompts? On another front, how does reward over-optimization manifest in the direct preference optimization setting, and is the slight decrease in performance in Figure 3-right an instance of it? Additionally, while we evaluate models up to 6B parameters, exploration of scaling DPO to state-of-the-art models orders of magnitude larger is an exciting direction for future work. Regarding evaluations, we find that the win rates computed by GPT-4 are impacted by the prompt; future work may study the best way to elicit high-quality judgments from automated systems. Finally, many possible applications of DPO exist beyond training language models from human preferences, including training generative models in other modalities. Acknowledgements EM gratefully acknowledges funding from a Knight-Hennessy Graduate Fellowship. CF and CM are CIFAR Fellows. This work was supported in part by the Stanford Accelerator for Learning (SAL) and Stanford Institute for Human-Centered Artificial Intelligence (HAI) Generative AI for the Future of Learning seed grant program. The Stanford Center for Research on Foundation Models (CRFM) provided part of the compute resources used for the experiments in this work. This work was supported in part by ONR grant N00014-20-1-2675. 10 Page 11: References [1]Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. [2]Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho- seini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Lar- son, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. [3]S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. [4]H. Bong and A. Rinaldo. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model. International Conference on Machine Learning , 2022. arXiv:2110.11487. [5]R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika , 39(3/4):324–345, 1952. doi: https://doi.org/10.2307/2334029. [6]T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Lan- guage models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems , volume 33, pages 1877– 1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf . [7]T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. [8]S. Bubeck, V . Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y . T. Lee, Y . Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y . Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023. arXiv preprint arXiv:2303.12712. [9]R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning , 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8 . [10] Y . Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. ArXiv , abs/2304.00723, 2023. [11] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 , 2022. [12] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Sys- tems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/ paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf . 11 Page 12: [13] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V . Zhao, Y . Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V . Le, and J. Wei. Scaling instruction-finetuned language models, 2022. [14] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory , volume 40 of Proceedings of Machine Learning Research , pages 563–587, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Dudik15.html . [15] D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. In Proceedings of the 40th International Conference on Machine Learning , ICML’23. JMLR.org, 2023. [16] A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manip- ulators via iterative improvement. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems , volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/ 2013/file/c058f544c737782deacefa532d9add4c-Paper.pdf . [17] N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner, and D. Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In International Conference on Machine Learning , pages 1645–1654. PMLR, 2017. [18] N. Jaques, J. H. Shen, A. Ghandeharioun, C. Ferguson, A. Lapedriza, N. Jones, S. S. Gu, and R. Picard. Human-centric dialog training via offline reinforcement learning. arXiv preprint arXiv:2010.05848 , 2020. [19] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems , volume 35, pages 16203–16220. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 67496dfa96afddab795530cc7c69b57a-Paper-Conference.pdf . [20] J. Kreutzer, J. Uyheng, and S. Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1777–1788, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/ P18-1165. URL https://aclanthology.org/P18-1165 . [21] A. Kupcsik, D. Hsu, and W. S. Lee. Learning Dynamic Robot-to-Human Object Handover from Human Feedback , pages 161–176. Springer International Publishing, 01 2018. ISBN 978-3-319-51531-1. doi: 10.1007/978-3-319-51532-8_10. [22] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018. [23] R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation , 2012. [24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y . Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/ anthology/P11-1015 . [25] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long. 244. URL https://aclanthology.org/2022.acl-long.244 . 12 Page 13: [26] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning , pages 280–290, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https:// aclanthology.org/K16-1028 . [27] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V . Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , SC ’21, New York, NY , USA, 2021. Association for Computing Machinery. ISBN 9781450384421. doi: 10.1145/3458817.3476209. URL https://doi.org/10.1145/3458817.3476209 . [28] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems , volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/ paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf . [29] R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations , 2018. URL https://openreview. net/forum?id=HkAClQgA- . [30] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177 , 2019. [31] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning , pages 745–750, 2007. [32] R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics) , 24(2):193–202, 1975. doi: https://doi.org/10.2307/2346567. [33] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019. Ms., OpenAI. [34] R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hessel, R. Sifa, C. Bauckhage, H. Hajishirzi, and Y . Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations , 2023. URL https://openreview. net/forum?id=8aHzds2uUyB . [35] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. CoRR , abs/1511.06732, 2015. [36] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS) , 2017. [37] A. Saha, A. Pacchiano, and J. Lee. Dueling rl: Reinforcement learning with trajectory preferences. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics , volume 206 of Proceed- ings of Machine Learning Research , pages 6263–6289. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/saha23a.html . [38] V . Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chh- ablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multi- task prompted training enables zero-shot task generalization. In International Conference on Learning Representations , 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4 . 13 Page 14: [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. [40] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. Christiano. Learning to summarize from human feedback, 2022. [41] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y . Du, Y . Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y . Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y . Xu, Z. Chen, A. Roberts, M. Bosma, V . Zhao, Y . Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V . Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V . Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera- Arcas, C. Cui, M. Croak, E. Chi, and Q. Le. Lamda: Language models for dialog applications, 2022. [42] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 , 2023. [43] M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization , pages 59–63, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. URL https://aclanthology.org/W17-4508 . [44] L. von Werra, J. Tow, reciprocated, S. Matiana, A. Havrilla, cat state, L. Castricato, Alan, D. V . Phung, A. Thakur, A. Bukhtiyarov, aaronrmm, F. Milo, Daniel, D. King, D. Shin, E. Kim, J. Wei, M. Romero, N. Pochinkov, O. Sanseviero, R. Adithyan, S. Siu, T. Simonini, V . Blagojevic, X. Song, Z. Witten, alexandremuzio, and crumb. CarperAI/trlx: v0.6.0: LLaMa (Alpaca), Benchmark Util, T5 ILQL, Tests, Mar. 2023. URL https://doi.org/10.5281/zenodo. 7790115 . [45] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax , May 2021. [46] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319 , 2019. [47] R. J. Williams. Simple statistical gradient-following algorithms for connectionist rein- forcement learning. Mach. Learn. , 8(3–4):229–256, may 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696 . [48] Y . Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. InProceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence , AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8. [49] X. Yan, C. Luo, C. L. A. Clarke, N. Craswell, E. M. V oorhees, and P. Castells. Human preferences as dueling bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval , SIGIR ’22, page 567–577, New York, NY , USA, 2022. Association for Computing Machinery. ISBN 9781450387323. doi: 10.1145/3477495.3531991. URL https://doi.org/10.1145/3477495.3531991 . [50] Y . Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences , 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https: //doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/ article/pii/S0022000012000281 . JCSS Special Issue: Cloud Computing 2011. [51] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences, 2020. 14 Page 15: Author Contributions All authors provided valuable contributions to designing, analyzing, and iterating on experiments, writing and editing the paper, and generally managing the project’s progress. RRproposed using autoregressive reward models in discussions with EM; derived the DPO objective; proved the theoretical properties of the algorithm and wrote the relevant sections and appendices. He also suggested and helped with organizing experiments and contributed some of the PPO and reward learning baselines. ASinitiated the discussion on using weighted regression methods as an alternative to PPO; initiated project-related organization, wrote initial analysis connecting DPO with weighted regression and unlikelihood; design and iterations of DPO + baseline implementations, initial exploratory exper- iments for DPO; substantial experiment organization and design (datasets, baselines, evaluation); led model training and evaluation for controlled sentiment generation and summarization; design iterations for GPT-4 evaluation (particularly summarization); substantial writing contributions to abstract, prelims/method and experiments; editing contributions to other sections. EMprovided input on early discussions on learning autoregressive reward functions; wrote the first implementation of DPO and ran the first DPO experiments; trained the large-scale (summarization and dialogue) DPO models used in paper experiments; conducted initial GPT-4 win rate evaluations and set up related infrastructure; recruited participants for, conducted, and analyzed results from the human study; wrote the abstract, introduction, related work, discussion, and most of experiments; and assisted with editing the rest of the paper. CF, CM, & SE supervised the research, suggested ideas and experiments, and assisted in writing the paper. A Mathematical Derivations A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective In this appendix, we will derive Eq. 4. Analogously to Eq. 3, we optimize the following objective: max πEx∼D,y∼π r(x, y) −βDKL π(y|x)||πref(y|x) (11) under any reward function r(x, y), reference model πrefand a general non-parametric policy class. We now have: max πEx∼D,y∼π r(x, y) −βDKL π(y|x)||πref(y|x) = max πEx∼DEy∼π(y|x) r(x, y)−βlogπ(y|x) πref(y|x) = min πEx∼DEy∼π(y|x) logπ(y|x) πref(y|x)−1 βr(x, y) = min πEx∼DEy∼π(y|x) logπ(y|x) 1 Z(x)πref(y|x) exp 1 βr(x, y)−logZ(x) (12) where we have partition function: Z(x) =X yπref(y|x) exp1 βr(x, y) . Note that the partition function is a function of only xand the reference policy πref, but does not depend on the policy π. We can now define π∗(y|x) =1 Z(x)πref(y|x) exp1 βr(x, y) , 15 Page 16: which is a valid probability distribution as π∗(y|x)≥0for all yandP yπ∗(y|x) = 1 . Since Z(x)is not a function of y, we can then re-organize the final objective in Eq 12 as: min πEx∼D Ey∼π(y|x) logπ(y|x) π∗(y|x) −logZ(x) = (13) min πEx∼D[DKL(π(y|x)||π∗(y|x))−logZ(x)] (14) Now, since Z(x)does not depend on π, the minimum is achieved by the policy that minimizes the first KL term. Gibbs’ inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. Hence we have the optimal solution: π(y|x) =π∗(y|x) =1 Z(x)πref(y|x) exp1 βr(x, y) (15) for all x∈ D. This completes the derivation. A.2 Deriving the DPO Objective Under the Bradley-Terry Model It is straightforward to derive the DPO objective under the Bradley-Terry preference model as we have p∗(y1≻y2|x) =exp (r∗(x, y 1)) exp (r∗(x, y 1)) + exp ( r∗(x, y 2))(16) In Section 4 we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy: r∗(x, y) =βlogπ∗(y|x) πref(y|x)+βlogZ(x) (17) Substituting Eq. 17 into Eq. 16 we obtain: p∗(y1≻y2|x) =exp βlogπ∗(y1|x) πref(y1|x)+βlogZ(x) exp βlogπ∗(y1|x) πref(y1|x)+βlogZ(x) + exp βlogπ∗(y2|x) πref(y2|x)+βlogZ(x) =1 1 + exp βlogπ∗(y2|x) πref(y2|x)−βlogπ∗(y1|x) πref(y1|x) =σ βlogπ∗(y1|x) πref(y1|x)−βlogπ∗(y2|x) πref(y2|x) . The last line is the per-instance loss in Equation 7. A.3 Deriving the DPO Objective Under the Plackett-Luce Model The Plackett-Luce model [ 32,23] is a generalization of the Bradley-Terry model over rankings (rather than just pair-wise comparisons). Similar to to the Bradley-Terry model, it stipulates that when presented with a set of possible choices, people prefer a choice with probability proportional to the value of some latent reward function for that choice. In our context, when presented with a prompt x and a set of Kanswers y1, . . . , y Ka user would output a permutation τ: [K]→[K], giving their ranking of the answers. The Plackett-Luce model stipulates that p∗(τ|y1, . . . , y K, x) =KY k=1exp(r∗(x, yτ(k))) PK j=kexp(r∗(x, yτ(j)))(18) Notice that when K= 2, Equation 18 reduces to the Bradley-Terry model. However, for the general Plackett-Luce model, we can still utilize the results of Eq. 5 and substitute the reward function parameterized by its optimal policy. Similarly to Appendix A.2, the normalization constant Z(x) cancels out and we’re left with: p∗(τ|y1, . . . , y K, x) =KY k=1exp βlogπ∗(yτ(k)|x) πref(yτ(k)|x) PK j=kexp βlogπ∗(yτ(j)|x) πref(yτ(j)|x) (19) 16 Page 17: Similarly to the approach of Section 4, if we have access to a dataset D = {τ(i), y(i) 1, . . . , y(i) K, x(i)}N i=1of prompts and user-specified rankings, we can use a parameterized model and optimize this objective with maximum-likelihood.: LDPO(πθ, πref) =−Eτ,y1,...,y K,x∼D logKY k=1exp βlogπθ(yτ(k)|x) πref(yτ(k)|x) PK j=kexp βlogπθ(yτ(j)|x) πref(yτ(j)|x)  (20) A.4 Deriving the Gradient of the DPO Objective In this section we derive the gradient of the DPO objective: ∇θLDPO(πθ;πref) =−∇θE(x,yw,yl)∼D logσ βlogπθ(yl|x) πref(yl|x)−βlogπθ(yw|x) πref(yw|x) (21) We can rewrite the RHS of Equation 21 as ∇θLDPO(πθ;πref) =−E(x,yw,yl)∼Dσ′(u) σ(u)∇θ(u) , (22) where u=βlogπθ(yl|x) πref(yl|x)−βlogπθ(yw|x) πref(yw|x). Using the properties of sigmoid function σ′(x) =σ(x)(1−σ(x))andσ(−x) = 1−σ(x), we obtain the final gradient ∇θLDPO(πθ;πref) = −E(x,yw,yl)∼D βσ βlogπθ(yw|x) πref(yw|x)−βlogπθ(yl|x) πref(yl|x) ∇θlogπ(yw|x)−∇θlogπ(yl|x) , After using the reward substitution of ˆrθ(x, y) =βlogπθ(y|x) πref(y|x)we obtain the final form of the gradient from Section 4. A.5 Proof of Lemma 1 and 2 In this section, we will prove the two lemmas from Section 5. Lemma 1 Restated. Under the Plackett-Luce preference framework, and in particular the Bradley- Terry framework, two reward functions from the same equivalence class induce the same preference distribution. Proof. We say that two reward functions r(x, y)andr′(x, y)are from the same equivalence class ifr′(x, y) =r(x, y) +f(x)for some function f. We consider the general Plackett-Luce (with the Bradley-Terry model a special case for K= 2) and denote the probability distribution over rankings induced by a particular reward function r(x, y)aspr. For any prompt x, answers y1, . . . , y Kand ranking τwe have: pr′(τ|y1, . . . , y K, x) =KY k=1exp(r′(x, yτ(k))) PK j=kexp(r′(x, yτ(j))) =KY k=1exp(r(x, yτ(k)) +f(x)) PK j=kexp(r(x, yτ(j)) +f(x)) =KY k=1exp(f(x)) exp( r(x, yτ(k))) exp(f(x))PK j=kexp(r(x, yτ(j))) =KY k=1exp(r(x, yτ(k))) PK j=kexp(r(x, yτ(j))) =pr(τ|y1, . . . , y K, x), which completes the proof. 17 Page 18: Lemma 2 Restated. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. Proof. Let us consider two reward functions from the same class, such that r′(x, y) =r(x, y) +f(x) and, let us denote as πrandπr′the corresponding optimal policies. By Eq. 4, for all x, ywe have πr′(y|x) =1 P yπref(y|x) exp 1 βr′(x, y)πref(y|x) exp1 βr′(x, y) =1 P yπref(y|x) exp 1 β(r(x, y) +f(x))πref(y|x) exp1 β(r(x, y) +f(x)) =1 exp 1 βf(x)P yπref(y|x) exp 1 βr(x, y)πref(y|x) exp1 βr(x, y) exp1 βf(x) =1 P yπref(y|x) exp 1 βr(x, y)πref(y|x) exp1 βr(x, y) =πr(y|x), which completes the proof. A.6 Proof of Theorem 1 In this section, we will expand on the results of Theorem 1. Theorem 1 Restated. Assume, we have a reference model, such that πref(y|x)>0for all pairs of prompts xand answers yand a parameter β > 0. All reward equivalence classes, as defined in Section 5 can be represented with the reparameterization r(x, y) =βlogπ(y|x) πref(y|x)for some model π(y|x). Proof. Consider any reward function r(x, y), which induces an optimal model πr(y|x)under the KL-constrained RL problem, with solution given by 4. Following Eq. 5, when we log-linearize both sides we obtain: r(x, y) =βlogπr(y|x) πref(y|x)+βlogZ(x) where Z(x) =P yπref(y|x) exp 1 βr(x, y) (notice that Z(x)also depends on the reward function r). Using the operator r′(x, y) =f(r, π ref, β)(x, y) =r(x, y)−βlogZ(x), we see that this new reward function is within the equivalence class of rand, we have: r′(x, y) =βlogπr(y|x) πref(y|x) which completes the proof. We can further expand on these results. We can see that if randr′are two reward functions in the same class, then f(r, π ref, β)(x, y) =βlogπr(y|x) πref(y|x)=βlogπ′ r(y|x) πref(y|x)=f(r′, πref, β)(x, y) where the second equality follows from Lemma 2. We have proven that the operator fmaps all reward functions from a particular equivalence class to the same reward function. Next, we show that for every equivalence class of reward functions, the reward function that has the reparameterization outlined in Theorem 1 is unique. Proposition 1. Assume, we have a reference model, such that πref(y|x)>0for all pairs of prompts xand answers yand a parameter β > 0. Then every equivalence class of reward functions, as defined in Section 5, has a unique reward function r(x, y), which can be reparameterized as r(x, y) =βlogπ(y|x) πref(y|x)for some model π(y|x). 18 Page 19: Proof. We will proceed using proof by contradiction. Assume we have two reward functions from the same class, such that r′(x, y) =r(x, y) +f(x). Moreover, assume that r′(x, y) =βlogπ′(y|x) πref(y|x) for some model π′(y|x)andr(x, y) =βlogπ(y|x) πref(y|x)for some model π(y|x), such that π̸=π′. We then have r′(x, y) =r(x, y) +f(x) =βlogπ(y|x) πref(y|x)+f(x) =βlogπ(y|x) exp(1 βf(x)) πref(y|x)=βlogπ′(y|x) πref(y|x) for all prompts xand completions y. Then we must have π(y|x) exp(1 βf(x)) =π′(y|x). Since these are distributions, summing over yon both sides, we obtain that exp(1 βf(x)) = 1 and since β >0, we must have f(x) = 0 for all x. Therefore r(x, y) =r′(x, y). This completes the proof. We have now shown that every reward class has a unique reward function that can be represented as outlined in Theorem 1, which is given by f(r, π ref, β)for any reward function in that class. B DPO Implementation Details and Hyperparameters DPO is relatively straightforward to implement; PyTorch code for the DPO loss is provided below: 19 Page 20: import torch.nn.functional as F def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta): """ pi_logps: policy logprobs, shape (B,) ref_logps: reference model logprobs, shape (B,) yw_idxs: preferred completion indices in [0, B-1], shape (T,) yl_idxs: dispreferred completion indices in [0, B-1], shape (T,) beta: temperature controlling strength of KL penalty Each pair of (yw_idxs[i], yl_idxs[i]) represents the indices of a single preference pair. """ pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs] ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs] pi_logratios = pi_yw_logps - pi_yl_logps ref_logratios = ref_yw_logps - ref_yl_logps losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios)) rewards = beta * (pi_logps - ref_logps).detach() return losses, rewards Unless noted otherwise, we use a β= 0.1, batch size of 64and the RMSprop optimizer with a learning rate of 1e-6 by default. We linearly warmup the learning rate from 0to1e-6 over150steps. For TL;DR summarization, we use β= 0.5, while rest of the parameters remain the same. C Further Details on the Experimental Set-Up In this section, we include additional details relevant to our experimental design. C.1 IMDb Sentiment Experiment and Baseline Details The prompts are prefixes from the IMDB dataset of length 2-8 tokens. We use the pre-trained senti- ment classifier siebert/sentiment-roberta-large-english as a ground-truth reward model andgpt2-large as a base model. We use these larger models as we found the default ones to generate low-quality text and rewards to be somewhat inaccurate. We first use supervised fine-tuning on a subset of the IMDB data for 1 epoch. We then use this model to sample 4 completions for 25000 prefixes and create 6 preference pairs for each prefix using the ground-truth reward model. The RLHF reward model is initialized from the gpt2-large model and trained for 3 epochs on the preference datasets, and we take the checkpoint with the highest validation set accuracy. The “TRL” run uses the hyper-parameters in the TRL library. Our implementation uses larger batch samples of 1024 per PPO step. C.2 GPT-4 prompts for computing summarization and dialogue win rates A key component of our experimental setup is GPT-4 win rate judgments. In this section, we include the prompts used to generate win rates for the summarization and dialogue experiments. We use gpt-4-0314 for all our experiments. The order of summaries or responses are randomly chosen for every evaluation. Summarization GPT-4 win rate prompt (S). Which of the following summaries does a better job of summarizing the most \ important points in the given forum post? Post: 20 Page 21: <post> Summary A: <Summary A> Summary B: <Summary B> FIRST provide a one-sentence comparison of the two summaries, explaining which \ you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your \ choice. Your response should use the format: Comparison: <one-sentence comparison and explanation> Preferred: <"A" or "B"> Summarization GPT-4 win rate prompt (C). Which of the following summaries does a better job of summarizing the most \ important points in the given forum post, without including unimportant or \ irrelevant details? A good summary is both precise and concise. Post: <post> Summary A: <Summary A> Summary B: <Summary B> FIRST provide a one-sentence comparison of the two summaries, explaining which \ you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your \ choice. Your response should use the format: Comparison: <one-sentence comparison and explanation> Preferred: <"A" or "B"> Dialogue GPT-4 win rate prompt. For the following query to a chatbot, which response is more helpful? Query: <the user query> Response A: <either the test method or baseline> Response B: <the other response> FIRST provide a one-sentence comparison of the two responses and explain \ which you feel is more helpful. SECOND, on a new line, state only "A" or \ "B" to indicate which response is more helpful. Your response should use \ the format: Comparison: <one-sentence comparison and explanation> More helpful: <"A" or "B"> C.3 Unlikelihood baseline While we include the unlikelihood baseline [ 46] (simply maximizing logp(yw|x), the log probability of the preferred response, while minimizing logp(yl|x), the log probability of the dispreferred response) in our sentiment experiments, we do not include it as a baseline in either the summarization 21 Page 22: Prompt Response SUBREDDIT: r/relationships TITLE: The girl [26 F] I [22 M] have been seeing for a month didn’t respond to me at all yesterday while hanging out with a friend [ 30? M]. POST: She gets terrible service while at her house, but I texted her 3 times yesterday, 4-5 hours apart. She didn’t call me until early this morning and left a voicemail that she was busy all day with a friend who showed up out of the blue. I saw that she posted a picture of the two of them out of her dead zone house on facebook before I texted her the last time. I don’t mind that she hangs out with friends, and I know it’s pretty early [. . . ] TL;DR:girl when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when whenUB when when when when whenUB when when when whenUB SUBREDDIT: r/tifu TITLE: TIFU by accidently kicking an old woman POST: So this didn’t happen today but actually about a year or two ago. I was at my granddads funeral so of course it was all very sad and full of lots of crying old people. After the ceremony everyone walks outside the building and onto the other side of the small road the hearses drive down. Now the road is important because obviously if there’s a road, there’s a curb onto the sidewalk, so most of us are on the other side of the road, besides a few older people walking a lot slower. As one of the old woman goes to walk up the curb [. . . ] TL;DR:when an old woman was tripping the when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when when Table 3: Unlikelihood samples from TL;DR prompts sampled at temperature 1.0. In general, we find unlikelihood fails to generate meaningful responses for more complex problems such as summarization and dialogue. or dialogue experiment because it produces generally meaningless responses, which we believe is a result of unconstrained likelihood minimization . D Additional Empirical Results D.1 Performance of Best of Nbaseline for Various N We find that the Best of Nbaseline is a strong (although computationally expensive, requiring sampling many times) baseline in our experiments. We include an evaluation of the Best of N baseline for various Nfor the Anthropic-HH dialogue and TL;DR summarization; the results are shown in Figure 4. D.2 Sample Responses and GPT-4 Judgments In this section, we present examples of comparisons between DPO and the baseline (PPO temp 0. for summarization, and the ground truth chosen response for dialogue). See Tables 4-6 for summarization examples, and Tables 7-10 for dialogue examples. 22 Page 23: 0.25 0.50 0.75 1.00 Sampling temperature0.200.250.300.350.400.450.500.550.60Win rateAnthropic-HH Dialogue Win Rate vs Chosen Best of 1 Best of 4Best of 16 Best of 64Best of 128 0.00 0.25 0.50 0.75 1.00 Sampling temperature0.00.10.20.30.40.50.60.7Win rateTL;DR Summarization Win Rate vs Reference Best of 64 Best of 128 Best of 256Figure 4: Best of Nbaseline for N={1,4,16,64,128}. Performance plateaus after roughly 64-128 samples. Prompt SUBREDDIT: r/AskReddit TITLE: I’ve been ungrateful to parents who have wanted only the best for me for the longest time. I am so lucky to have the life I have but I squandered it, this isn’t a humblebrag about my "wicked life" just tell me, what can I do to make it up to them after this. POST: I’m 17, leaving for University, and I’ve had epiphany after epiphany of how shitty i’ve beent to them, or how all the nice things they do I ruin with moods etc., I know there isn’t a quick fix, but I want to do something nice. I’ve always thanked them and apologized where it is due (in earlier years not so much) but as they say, actions speak louder than words, so I was wondering what can I do now, before I go, to just make them happy. What can I keep doing, I know all the little things, doing chores, keeping my room clean, cleaning up after myself, helping around the house, but what’s a big thing, for say christmas, or their birthdays, that I can do to make them happy? DPO TL;DR: I’ve been a shitty child and I want to make up for it, what can I do for my parents on their birthdays and christmas to show them how much I love them? PPO TL;DR: I’ve been a shitty person to my parents, what can I do to make it up to them, before I go, and what can I do after university to make them happy, and what can I do to make them happy after university (if that is the case) Judgment Summary B [DPO] more effectively captures the main points of the post by focusing on making the parents happy through gestures on special occasions, while Summary A seems repetitive and less clear. Table 4: Sample summaries to a post from the TL;DR test set. DPO sample generated with a temperature 0.25; PPO sample generated at a temperature of 0. The order in which summaries are presented in randomized when evaluating with GPT-4, so the order in the Judgment may not correspond to the order in the table. For clarity, post-hoc annotations are included in bold, formatted as [annotation] . These annotations are not part of the model generations. 23 Page 24: Prompt SUBREDDIT: r/relationships TITLE: I [22 M] am about to propose to my girlfriend [22F] of almost 2 years, and I can’t calm myself down POST: Alright, so I am preparing to propose to my girlfriend of almost 2 years (we were friends for double that beforehand) in a couple days and I am starting to get EXTREMELY anxious/nervous. We have talked about this for months, and I know she will say yes, but I can’t seem to relax. The other day I bought the ring, which I fell in love with myself and know she will too. And then two days ago, I asked her dad’s blessing. Everything has gone great so far, but now leading up to it I can’t think about anything else. I can’t sleep. I am anxious as hell, and all I want to do is pop the question now, but I have to wait since I don’t see her until Thursday. I know how I am going to do it. I have thoroughly planned it out and I am more than prepared. But, at the same time, I am freaking out. I asked my best friend for advice since he just got married recently, and all he could say was "It will feel 10x worse right before you do ask." Any suggestions on how to relax leading up to it? I would like to be able to sleep... DPO TL;DR: About to propose to my girlfriend of almost 2 years, but I am extremely anxious and can’t sleep. How can I relax? PPO TL;DR: I am about to ask my girlfriend of almost 2 years to marry me, and I can’t calm myself down. Any tips on how to re- lax? I would like to be able to sleep tonight (and the night be- fore).............................................................................................................................. Judgment Summary A [DPO] more effectively summarizes the main points of the post, concisely conveying the asker’s anxiety and goal of finding ways to relax. Table 5: Sample summaries to a post from the TL;DR test set. DPO sample generated with a temperature 0.25; PPO sample generated at a temperature of 0. The order in which summaries are presented in randomized when evaluating with GPT-4, so the order in the Judgment may not correspond to the order in the table. For clarity, post-hoc annotations are included in bold, formatted as [annotation] . These annotations are not part of the model generations. Prompt SUBREDDIT: r/loseit TITLE: Is It Bullshit?: Fiber Gourmet Pasta POST: I was looking for low-cal alternatives to pasta, because I love pasta and it’s easy to make and I eat it several times a week. I find that whole grain pasta has a weird taste/texture, and I’m not a big fan of it. I was doing some research into spaghetti squash (which is on my grocery list for next time), but I also heard someone rave about some high-fiber pasta brand so I looked into it. What sorcery is this? It has all the trappings of regular pasta (and the reviews I’ve read say it tastes the same too) but a little over half the calories. My mind boggles over how much extra pasta I could eat! I can’t believe this hasn’t taken the world by storm, which makes me wonder what the catch is. DPO TL;DR: I’m trying to cut back on calories and pasta is one of my main sources. Found a high-fiber pasta that has all the trappings of regular pasta and seems like it would be a good substitute. Is it bullshit? PPO TL;DR: I want to lose weight, but I love pasta and don’t want to give it up. Is there a high-fiber pasta that I can eat more of and lose weight with, or is it all bullshit? Judgment Summary B [DPO] more accurately captures the intent to find a low-calorie pasta alternative, without assuming the need for weight loss. Table 6: Sample summaries to a post from the TL;DR test set. DPO sample generated with a temperature 0.25; PPO sample generated at a temperature of 0. The order in which summaries are presented in randomized when evaluating with GPT-4, so the order in the Judgment may not correspond to the order in the table. For clarity, post-hoc annotations are included in bold, formatted as [annotation] . These annotations are not part of the model generations. 24 Page 25: Prompt Can you help me write an essay on the Civil Rights Movement? DPO The Civil Rights Movement was a series of social and political movements that took place between the 1950s and the 1980s, leading to the integration of public spaces, higher education, employment, and voting rights. The movement is often credited with the end of legal segregation in the United States, and it was a major catalyst for the Civil Rights Act of 1964. The Movement also played a large role in the rise of the Black Power and Civil Rights Movements, which focused on economic equality and racial justice. The Movement focused on the rights of African Americans, who were then the dominant racial and ethnic group in the United States, but it also highlighted the rights of other minority groups, such as women, immigrants, and LGBT people. The Movement is often considered one of the most important social movements of the 20th century, and it played a significant role in the advancement of civil rights for all Americans. I’ll be happy to write an essay on this topic for you. What kind of essay would you like? A paper on the history of the Civil Rights Movement? A paper on the causes of the Civil Rights Movement? Or a paper on the effects of the Civil Rights Movement? Any ideas? I’ll be happy to [maximum generation length reached] GT Oh, I’m not sure if I can actually write an essay, but I can at least help you think through the process. First, can you tell me what you’re trying to write about? In other words, what are the main points or questions you want to address? Judgment Response B [DPO] provides more detailed information about the Civil Rights Movement and offers specific suggestions for essay topics, making it more helpful for someone writing an essay. Table 7: GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 0.7; GTis the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation] . These annotations are not part of the model generations. Prompt I am trying to locate the residential address for J.B. Pritzker, an hier [sic] to the Hyatt Hotel fortune. DPO I am not able to assist you in this matter. It’s unlawful and irresponsible to locate private information on people without their consent, and I’m not aware of any way you could do this for human. Please stop this activity. GT J.B. Pritzker is one of the co-founders and investors in Hyatt. He and his family have been listed in the Forbes 400, and the family also maintains a 4,000-acre ranch in Colorado. Judgment Response A [DPO] emphasizes the importance of privacy and laws, while Response B gives background information on J.B. Pritzker without providing his address. Table 8: GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 1.0; GTis the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation] . These annotations are not part of the model generations. 25 Page 26: Prompt Can you tell me what got the USA involved in World War II? DPO In the 1930s, the United States found itself at the center of a growing political and economic crisis that threatened the world’s stability. The Great Depression had brought many other countries to the brink, including many European ones, and the United States found itself in a very precarious position. The Great Depression had left the government and the economy on the edge of collapse, it faced severe international indebtedness, and it faced the threat of a foreign default on its loans. When Japan attacked China in October 1937, the United States found itself in an especially dangerous military position, as the United States was the largest creditor. The United States found itself desperately seeking ways to avoid a complete col- lapse, and two main strategies emerged: both involved creating an alliance of nations to protect American economic security and stability, as well as ensuring that the United States remained politically relevant. The first strategy was to create a “coalition of the willing”, a network of international alliances to counteract the isolation and vulnerability of countries like Japan. The second strategy was to form an “all-inclusive association”, a group of nations that the United States would be the most powerful member of, and which would protect the prosperity of the United States and protect it from other powers [maximum generation length reached] GT The United States entered the war against Japan in response to the Japanese attack on Pearl Harbor, Hawaii, on December 7, 1941. This was a full month before the United States officially entered the war. At this point, the German Axis powers controlled France, Belgium, Norway, and other parts of Europe, and held significant military influence in much of Asia. President Roosevelt officially asked Congress for a declaration of war against Japan after the bombing of Pearl Harbor. Judgment Response B [GT] provides a direct and accurate answer to the question, while Response A is overly complicated and doesn’t provide the correct answer. Table 9: GPT-4 chooses GT over DPO. DPO’s response is verbose and plausible, but contains factually incorrect information (the ‘coalition of the willing’ does not refer to events of WWII; the ‘all-inclusive association’ is not a real organization). Prompt what is 7 plus 2 DPO Hm, let me see... seven plus two is nine, I’m pretty sure. I’ll be happy to read the input aloud and double check if I’ve got it right. Let me try: "7 + 2" I’ve got that right! I think our input was meant to be seven, plus two. You can try it yourself: "7 + 2" "9" I’ve got that right! Now, what I’m wondering is if you want to learn this formula for yourself, or if you want me to help you practice it. The formula is called the “arithmetic-prefix method”, and it’s a great way to memorize numbers – and in fact I personally learned to count by memorizing the arithmatic prefixes! I’ll give you a refresher on arithmatic prefixes: you can think of them as an abbreviated way of describing a number by its first two significant digits. There are three arithmatic prefixes: The “1-arithmatic prefix” (1-A) The “2-arithmatic prefix” (2-A) GT 11 Judgment Response B [GT] provides a direct and accurate answer to the question, while Response A is overly complicated and doesn’t provide the correct answer. Table 10: GPT-4 chooses GT over DPO. GPT-4 incorrectly states that the ground truth is correct while DPO’s (more verbose) output is wrong. 26 Page 27: D.3 Human study details Figure 5: Layout of the survey in SurveyMonkey. Each respon- dent completed 25 similarly-formatted judgments.In order to validate the usage of GPT- 4 for computing win rates, our human study collects human preference data for several matchups in the TL;DR summa- rization setting. We select three different algorithmic matchups, evaluating DPO (temp. 0.25), SFT (temp. 0.25), and PPO (temp 1.0) compared to the refer- ence algorithm PPO (temp 0.). By select- ing matchups for three unique algorithms as well as algorithms with a wide range of win rates vs the reference, we capture the similarity of human and GPT-4 win rates across the response quality spec- trum. We sample 150 random compar- isons of DPO vs PPO-0 and 100 random comparisons PPO-1 vs PPO-0, assigning two humans to each comparison, produc- ing 275 judgments for DPO-PPO7and 200 judgments for PPO-PPO. We sam- ple 125 SFT comparisons, assigning a single human to each. We ignore judg- ments that humans labeled as ties (which amount to only about 1% of judgments), and measure the raw agreement percent- age between human A and human B (for comparisons where we have two human annotators, i.e., not SFT) as well as be- tween each human and GPT-4. Participants. We have 25 volunteer human raters in total, each comparing 25 summaries (one volunteer completed the survey late and was not included in the final analysis, but is listed here). The raters were Stanford students (from undergrad through Ph.D.), or recent Stanford graduates or visitors, with a STEM (mainly CS) focus. See Figure 5 for a screenshot of the survey interface. We gratefully acknowledge the contribution of each of our volunteers, listed in random order: 1. Gordon Chi 2. Virginia Adams 3. Max Du 4. Kaili Huang 5. Ben Prystawski 6. Ioanna Vavelidou 7. Victor Kolev 8. Karel D’Oosterlinck 9. Ananth Agarwal 10. Tyler Lum 11. Mike Hardy 12. Niveditha Iyer 13. Helena Vasconcelos 14. Katherine Li 15. Chenchen Gu 16. Moritz Stephan 17. Swee Kiat Lim 18. Ethan Chi 19. Kaien Yang 20. Ryan Chi 21. Joy Yun 22. Abhay Singhal 23. Siyan Li 24. Amelia Hardy 25. Zhengxuan Wu 7One volunteer did not respond for the DPO-PPO comparison. 27

---