loader
Generating audio...

arxiv

Paper 2501.19393

s1: Simple test-time scaling

Authors: Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto

Published: 2025-01-31

Abstract:

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Paper Content:
Page 1: s1: Simple test-time scaling Niklas Muennighoff*134Zitong Yang*1Weijia Shi*23Xiang Lisa Li*1Li Fei-Fei1Hannaneh Hajishirzi23 Luke Zettlemoyer2Percy Liang1Emmanuel Candès1Tatsunori Hashimoto1 Abstract Test-time scaling is a promising new approach to languagemodelingthatusesextratest-timecom- putetoimproveperformance. Recently,OpenAI’s o1modelshowedthiscapabilitybutdidnotpub- liclyshareitsmethodology,leadingtomanyrepli- cationefforts. Weseekthesimplestapproachto achievetest-timescalingandstrongreasoningper- formance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through abla- tions: difficulty,diversity,andquality. Second,we developbudgetforcingtocontroltest-timecom- putebyforcefullyterminatingthemodel’sthink- ing process or lengthening it by appending “Wait” multipletimes tothe model’sgeneration whenit tries to end. This can lead the model to double- check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5- 32B-Instructlanguagemodelon s1Kandequip- ping it with budget forcing, our model s1-32Bex- ceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scalings1-32Bwith budget forcing allows extrap- olatingbeyonditsperformancewithouttest-time intervention: from 50% to 57% on AIME24. Our model,data,andcodeareopen-sourceat https: //github.com/simplescaling/s1 . 1. Introduction Performance improvements of language models (LMs) over the past years have largely relied on scaling up train-time computeusinglarge-scaleself-supervisedpretraining(Ka- plan et al., 2020; Hoffmann et al., 2022). The creation of these powerful models has set the stage for a new scaling paradigmbuiltontopofthem: test-timescaling . Theaim *Equalcontribution. ZYandNMstartedtheproject. WS,NM and ZY collected the prompts, XL, ZY and NM, built the data pipeline, LZ and WS proposed using a 1K subset and ZY and NM built budget forcing.1Stanford University.2University of Washington, Seattle.3Allen Institute for AI.4Contextual AI. 512 204865758595Accuracy (%) Mathematical Problem Solving (MATH500) 512 2048 81920204060 Competition Math (AIME24) 1024 4096405060 PhD-Level Science Questions (GPQA Diamond) Average thinking time (tokens)Figure 1. Test-timescalingwiths1-32B. Webenchmark s1-32B on reasoning-intensive tasks and vary test-time compute. of this approach is to increase the compute at test time to get better results. There has been much work exploring this idea (Snell et al., 2024; Welleck et al., 2024), and the via- bility of this paradigm was recently validated by OpenAI o1(OpenAI,2024). o1hasdemonstratedstrongreasoning performance with consistent gains from scaling test-time compute. OpenAIdescribestheirapproachasusinglarge- scalereinforcementlearning(RL)implyingtheuseofsizable amounts of data (OpenAI, 2024). This has led to various attempts to replicate their models relying on techniques like Monte Carlo Tree Search (Gao et al., 2024b; Zhang et al., 2024a),multi-agentapproaches(Qinetal.,2024),andoth- ers(Wangetal.,2024a;Huangetal.,2024b;2025). Among these approaches, DeepSeek R1 (DeepSeek-AI et al., 2025) hassuccessfullyreplicatedo1-levelperformance,alsoem- ploying reinforcement learning via millions of samples and multiple training stages. However, despite the large num- ber of o1 replication attempts, none have openly replicated a clear test-time scaling behavior. Thus, we ask: what is the simplestapproach to achieveboth test-time scalingand strong reasoning performance? Weshowthattrainingononly1,000sampleswithnext-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing leads to a strong reasoning model that scales in performance with more test-time compute. Specifically, we construct s1K, whichconsistsof1,000carefullycuratedquestionspaired with reasoning traces and answers distilled from Gemini Thinking Experimental (Google, 2024). We perform super- vised fine-tuning (SFT) of an off-the-shelf pretrained model 1arXiv:2501.19393v3 [cs.CL] 1 Mar 2025 Page 2: s1: Simple test-time scaling on our small dataset requiring just 26 minutes of training on 16 H100 GPUs. After training, we control the amount of test-time compute our model spends using budget forc- ing:(I)If the model generates more thinking tokens than a desired limit, we forcefully end the thinking process by appendinganend-of-thinkingtokendelimiter. Endingthe thinkingthiswaymakesthemodeltransitiontogenerating itsanswer. (II)Ifwewantthemodeltospendmoretest-time compute on a problem, we suppress the generation of the end-of-thinkingtokendelimiterandinsteadappend“Wait” to the model’s current reasoning trace to encourage more exploration. Equipped with this simple recipe – SFT on 1,000 samples and test-time budget forcing – our model s1- 32Bexhibitstest-timescaling(Figure1). Further, s1-32Bis the most sample-efficient reasoning model and outperforms closed-source models like OpenAI’s o1-preview (Figure 2). Weconductextensiveablationexperimentstargeting(a)our selection of 1,000 (1K) reasoning samplesand (b) our test- time scaling. For (a), we find that jointly incorporating difficulty, diversity, and quality measures into our selec- tion algorithm is important. Random selection, selecting samples with the longest reasoning traces, or only selecting maximally diverse samples all lead to significantly worse performance (around −30% on AIME24 on average). Train- ing on our full data pool of 59K examples, a superset of s1K,doesnot offersubstantialgainsoverour1K selection. This highlights the importance of careful data selection and echoes prior findings for instruction tuning (Zhou et al., 2023). For (b), we define desiderata for test-time scaling methods to compare different approaches. Budget forcing leads to the best scaling as it has perfect controllability with a clear positive slope leading to strong performance. Insummary,ourcontributionsare: Wedevelopsimplemeth- ods for creating a sample-efficient reasoning dataset (§2) and test-time scaling (§3); Based on these we build s1-32B whichiscompetitivewitho1-preview(§4);Weablatesub- tleties of data (§5.1) and test-time scaling (§5.2). We end with a discussion to motivate future work on simple rea- soning(§6). Ourcode,model,anddataareopen-sourceat https://github.com/simplescaling/s1 . 2. Reasoning data curation to create s1K In this section, we describe our process for creating a large datasetfirstin§2.1andthenfilteringitdownto s1Kin§2.2. 2.1. Initial collection of 59K samples Wecollectaninitial59,029questionsfrom16sourcesfollow- ing three guiding principles. Quality: Datasets should be high-quality; we always inspect samples and ignore datasets with,e.g.,poorformatting; Difficulty : Datasetsshouldbe challengingandrequiresignificantreasoningeffort; Diver-sity: Datasetsshouldstemfromvariousfieldstocoverdiffer- ent reasoning tasks. We collect datasets of two categories: Curation of existing datasets Our largest source is Nu- minaMATH (LI et al., 2024) with 30,660 mathematical problems from online websites. We also include histori- cal AIME problems (1983-2021). To enhance diversity, we add OlympicArena (Huang et al., 2024a) with 4,250 questionsspanningAstronomy,Biology,Chemistry,Com- puter Science,Geography,Mathematics, andPhysics from various Olympiads. OmniMath (Gao et al., 2024a) adds 4,238 competition-level mathematics problems. We also include 2,385 problems from AGIEval (Zhong et al., 2023), whichfeaturesquestionsfromstandardizedtestslikeSAT andLSAT,coveringEnglish,Law,andLogic. Wereferto Table 7 in §C for our other sources. New datasets in quantitative reasoning To comple- ment these existing datasets, we create two original datasets. s1-prob consists of 182 questions from the prob- ability section of Stanford University’s Statistics Depart- ment’sPhDQualifyingExams( https://statistics. stanford.edu ),accompaniedbyhandwrittensolutions that cover difficult proofs. The probability qualifying exam is held yearly and requires professional-level mathemati- cal problem-solving. s1-teasers comprises 23 challenging brain-teasers commonly used in interview questions for quantitative trading positions. Each sample consists of a problem and solution taken from PuzzledQuant ( https: //www.puzzledquant.com/ ). Weonlytakeexamples with the highest difficulty level ("Hard"). Foreachquestion,wegenerateareasoningtraceandsolu- tion using the Google Gemini Flash Thinking API (Google, 2024)extractingitsreasoningtraceandresponse. Thisyields 59K triplets of a question, generated reasoning trace, and generatedsolution. Examplesfromourdatasetarein§D.2. Wedecontaminate allsamplesagainstourevaluationques- tions(MATH500,GPQADiamond,AIME24;§C.5)using 8-grams and deduplicate the data. 2.2. Final selection of 1K samples Wecoulddirectlytrainonourpoolof59Kquestions,how- ever, our goal is to find the simplestapproach with minimal resources. Thus,wegothroughthreestagesoffilteringto arriveataminimalsetof1,000samplesrelyingonourthree guiding data principles: Quality, Difficulty, and Diversity. Quality We first remove any questions where we ran into anyAPIerrorsreducingourdatasetto54,116samples. Next, wefilteroutlow-qualityexamplesbycheckingiftheycontain any string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent questionnumberingreducingourdatasetto51,581examples. 2 Page 3: s1: Simple test-time scaling GeometryNumber theory Combin− atoricsReal functionsBiology Complex functions Quantum theory Field theoryCalculus of variationsDifference equationsElectro− dynamicsGroup theoryLinear algebraProbability theory Algebraic systemsMechanicsThermo− dynamicsDifferential equationsComputer science Numerical analysis Calculus Algebraic structures AstronomyDynamical systemsStatistical mechanicsOperations research Math− ematics educationMeasure theoryConvex geometry Fluid mechanics Algebraic geometryStatisticsGeneral topology EconomicsAssociative ringsGeneral relativity Differential geometryMath− ematical logic Partial differential equations Information theory Solid mech −anicsFunctional analysisSpecial functions Comm− utative algebra Integral equationsIntegral transformApproxi− mation theory Potential theory Harmonic analysis Control theoryGeo− physics 1000 17000 800000 N/A Number of Examples80859095100MATH500 Accuracy (%)s1r1-distill Sky-T1QwQBespoke-Stratos o1-previewMost sample-efficient Figure 2. s1Kands1-32B. (left)s1Kisadatasetof1,000high-quality,diverse,anddifficultquestionswithreasoningtraces. (right)s1-32B, a 32B parameter model finetuned on s1Kis on the sample-efficiency frontier. See Table 1 for details on other models. From this pool, we identify 384 samples for our final 1,000 samplesfromdatasets thatweperceiveas high-qualityand not in need of further filtering (see §C.4 for details). Difficulty Fordifficulty,weusetwoindicators: modelper- formance and reasoning trace length. We evaluate two mod- els on each question: Qwen2.5-7B-Instruct and Qwen2.5- 32B-Instruct (Qwen et al., 2024), with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the referencesolution(see§C.3forthegradingprotocol). We measure the token length of each reasoning trace to indicate problem difficulty using the Qwen2.5 tokenizer. This relies ontheassumptionthatmoredifficultproblemsrequiremore thinkingtokens. Basedonthegrading,weremovequestions thateitherQwen2.5-7B-InstructorQwen2.5-32B-Instruct can solve correctly and thus may be too easy. By using two models we reduce the likelihood of an easy sample slipping throughourfilteringduetoararemistakeonaneasyques- tion of one of the models. This brings our total samples down to 24,496, setting the stage for the next round of sub- sampling based on diversity. While filtering with these two modelsmaybeoptimizedforoursetupaswewillalsouse Qwen2.5-32B-Instruct as our model to finetune, the idea of model-based filtering generalizes to other setups. Diversity To quantify diversity, we classify questionsinto domainsusingClaude3.5SonnetbasedontheMathematics Subject Classification (MSC) system (e.g., geometry, com- binatorics, etc.) from the American Mathematical Society.1 The taxonomy focuses on topics in mathematics but also includes other sciences such as biology, physics, and eco- 1https://mathscinet.ams.org/mathscinet/ msc/msc2020.htmlnomics. Toselectourfinalexamplesfromthepoolof24,496 questions, we first choose one domain uniformly at random. Then,wesampleoneproblemfromthisdomainaccording toadistributionthatfavorslongerreasoningtraces(see§C.4 fordetails)asmotivatedin Difficulty. Werepeatthisprocess until we have 1,000 total samples spanning 50 domains. In §5.1, we will show that using our three criteria in combi- nation is important, as only relying on quality, diversity, or difficultyinisolationleadstoworsedatasets. Somedistilled generations are incorrect, which we allow in our data as we focusoncapturingthereasoningprocessratherthanentirely correctsolutions. Ourgrader(§C.3)deems53.6%correct ins1Kand 63.0% in our follow-up s1K-1.1(see §A). 3. Test-time scaling 3.1. Method We classify test-time scaling methods into 1) Sequential , wherelatercomputationsdependonearlierones(e.g.,along reasoningtrace),and 2)Parallel ,wherecomputationsrunin- dependently(e.g.,majorityvoting)(Snelletal.,2024;Brown etal.,2024). Wefocusonsequentialscalingasintuitivelywe believeitshouldscalebetter,sincelatercomputationscan build on intermediate results, allowing for deeper reasoning anditerativerefinement. Weproposenewsequentialscaling methods and ways to benchmark them. Budget forcing We propose a simple decoding-time in- tervention by forcing a maximum and/or minimum number of thinking tokens. Specifically, we enforce a maximum token count by simply appending the end-of-thinking token delimiter and optionally “ Final Answer: ” to early exit 3 Page 4: s1: Simple test-time scaling Ho w many r in r aspberry ?Let ' s br eak do wn the pr ocess of counting the letter ' r ' in the wor d " r aspberry " ... * First letter: ' r ' - This is an ' r ', count = 1. * Second letter: ' a ' - Not an ' r ', count r emains 1 ... * Sixth letter: ' e ' - Not an ' r ', count r emains 1. * Se venth letter: ' r ' - This is an ' r ', count = 2. * Eighth letter: 'y ' - Not an ' r ', count r emains 2 ... The number of ' r ' s in " r aspberry " is 2. , let ' s r e-r ead the question car efully . It asks "Ho w many r in r aspberry ?" ... * r - a - s - p - b - e - r - r - y ... * First ' r ' ... * Second ' r ' ... * Thir d ' r ' ... Count = 3 ...W aitMy initial answer of 2 was incorr ect due to a quick r eading of the wor d. Final Answer: The final answer isQuestion Reasoning tr aceResponse Figure 3. Budget forcing with s1-32B. The model tries to stop after “...is 2.”, but we suppress the end-of-thinking token delimiter insteadappending“Wait”leading s1-32Btoself-correctitsanswer. the thinking stage and make the model provide its current best answer. To enforce a minimum, we suppress the gener- ationoftheend-of-thinkingtokendelimiterandoptionally append thestring “Wait” tothe model’scurrent reasoning trace to encourage the model to reflect on its current gen- eration. Figure 3 contains an example of how this simple approach can lead the model to arrive at a better answer. Baselines We benchmark budget forcing with: (I) Con- ditional length-control methods , which rely on telling the model in the prompt how long it should generate for. We groupthembygranularityinto(a)Token-conditionalcon- trol: We specify an upper bound of thinking tokens in the prompt;(b)Step-conditionalcontrol: Wespecifyanupper boundofthinkingsteps,whereeachstepisaround100to- kens; (c) Class-conditional control: We write two generic promptsthattellthemodeltoeitherthinkforashortorlong amountoftime(see§E.1fordetails). (II)Rejectionsam- pling,whichsamplesuntilagenerationfitsapredetermined compute budget. This oracle captures the posterior over responses conditioned on its length. 3.2. Metrics We establish a set of desiderata as evaluation metrics to measuretest-time scalingacrossmethods. Importantly,we do not only care about the accuracy a method can achieve butalsoitscontrollabilityandtest-timescalingslope. For each method we consider, we run a set of evaluations 𝑎∈ varying test-time compute on a fixed benchmark, e.g. AIME24. This produces a piece-wise linear function 𝑓 with compute as the x-axis measured in thinking tokens and accuracyasthey-axis(seeFigure1,wheretherightmostdot for AIME24corresponds to 𝑓(7320)=57% ). Wemeasurethree metrics: Control=1 ||∑ 𝑎∈𝕀(𝑎min≤𝑎≤𝑎max)(1) where𝑎min,𝑎maxrefer to a pre-specified minimum and max- imum amount of test-time compute; in our case thinking tokens. We usually only constrain 𝑎max. As tokens gener- atedcorrespondtotheamountoftest-timecomputespent, this metric measures the extent to which a method allows controllability over the use of that test-time compute. We report it as a percentage with 100% being perfect control. Scaling=1(|| 2)∑ 𝑎,𝑏∈ 𝑏>𝑎𝑓(𝑏)−𝑓(𝑎) 𝑏−𝑎(2) Scalingistheaverageslopeofthepiece-wiselinearfunction. It must be positive for useful methods and larger is better. Performance =max 𝑎∈𝑓(𝑎) (3) Performance is simply the maximum performance the methodachievesonthebenchmark. Amethodwithmono- tonicallyincreasing scaling achieves 100%performance on any benchmark in the limit. However, the methods we in- vestigate eventually flatten out or further scaling fails due to control or context window limitations. 4. Results 4.1. Setup Training We perform supervised finetuning on Qwen2.5- 32B-Instructusing s1Ktoobtainourmodel s1-32Busing basichyperparametersoutlinedin§D. Finetuningtook26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP. Evaluation We select three representative reasoning benchmarkswidely usedinthefield: AIME24 (ofAmerica, 2024)has30problemsthatwereusedinthe2024American Invitational Mathematics Examination (AIME) held from January31–February1,2024. AIMEtestsmathematical problem-solving with arithmetic, algebra, counting, geome- try, number theory, probability, and other secondary school mathtopics. High-scoringhighschoolstudentsinthetestare invitedtoparticipateintheUnitedStatesofAmericaMathe- maticsOlympiad(USAMO).AllAIMEanswersareintegers ranging from 000to999, inclusive. Some AIME problems relyonfiguresthatweprovidetoourmodelusingthevec- tor graphics language Asymptote as it cannot take image inputs.MATH500 (Hendrycks et al., 2021) is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples selected by OpenAI in prior work (Lightman et al., 2023). GPQA Diamond (Rein et al., 2023) consists of 198 PhD-level science questions 4 Page 5: s1: Simple test-time scaling For cing 2048/4096 max thinking tokensIgnoring end-of-thinking 2x/4x/6x and appending “ W ait ” thus for cing it to continue r easoning when it tries to stop (a) Sequential scaling via budget forcing Par allel scaling via Majority voting with base model (Qwen2.5-32B-Instruct)Sequential scaling via For cing with s160% accur acy (b) Parallel scaling via majority voting Figure 4. Sequential and parallel test-time scaling. (a):Budget forcing shows clear scaling trends and extrapolates to some extent. For the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning trace.(b):For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance when majority voting across 2, 4, 8, 16, 32, and 64 of these. fromBiology,ChemistryandPhysics. ExpertswithPhDsin the corresponding domains only achieved 69.7% on GPQA Diamond (OpenAI, 2024). When we write “GPQA” in the context of evaluation in this work, we always refer to the Diamond subset. We buildon the “lm-evaluation-harness” framework(Gaoetal.,2021;Bidermanetal.,2024). Unless otherwise specified, we evaluate with a temperature of 0 (greedy) and measure accuracy (equivalent to pass@1). Othermodels Webenchmark s1-32Bagainst:OpenAI o1series (OpenAI,2024),closed-sourcemodelsthatpop- ularizedtest-timescaling; DeepSeekr1series (DeepSeek- AIetal.,2025),open-weightreasoningmodelswithupto o1-levelperformance;Qwen’s QwQ-32B-preview (Team, 2024), a 32B open-weight reasoning model without dis- closed methodology; Sky-T1-32B-Preview (Team, 2025) andBespoke-32B (Labs,2025),openmodelswithopenrea- soningdatadistilledfromQwQ-32B-previewandr1; Google Gemini2.0FlashThinkingExperimental (Google,2024), the API that we distill from. As it has no official evaluation scores, we use the Gemini API to benchmark it ourselves. However, the “recitation error” of the Gemini API makes evaluationchallenging.2Wecircumventthis,bymanually insertingall30AIME24questionsinitswebinterfacewhere theerrordoesnotappear. However,weleaveoutMATH500 (500questions)andGPQADiamond(198 questions),thus theyare N.A.in Table1. Ourmodel, s1-32B,is fullyopen including weights, reasoning data, and code. 2https://github.com/google/ generative-ai-docs/issues/257Table 1.s1-32Bisastrongopenreasoningmodel. Weevaluate s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see §4). Other results are from the respective reports (Qwen et al., 2024;Team,2024;OpenAI,2024;DeepSeek-AIetal.,2025;Labs, 2025; Team, 2025). #ex. = numberexamples used forreasoning finetuning; BF = budget forcing. See §A for our better s1.1model. Model # ex.AIME 2024MATH 500GPQA Diamond API only o1-preview N.A. 44.6 85.5 73.3 o1-mini N.A. 70.0 90.0 60.0 o1 N.A. 74.4 94.8 77.3 Gemini 2.0N.A. 60.0 N.A. N.A.Flash Think. Open Weights Qwen2.5-N.A. 26.7 84.0 49.032B-Instruct QwQ-32B N.A. 50.0 90.6 54.5 r1 ≫800K 79.8 97.3 71.5 r1-distill 800K 72.6 94.3 62.1 Open Weights and Open Data Sky-T1 17K 43.3 82.4 56.8 Bespoke-32B 17K 63.3 93.0 58.1 s1 w/o BF 1K50.0 92.6 56.6 s1-32B 1K 56.7 93.0 59.6 5 Page 6: s1: Simple test-time scaling 4.2. Performance Test-time scaling Figure 1 shows the performance of s1- 32Bwithbudgetforcingscaleswithmoretest-timecompute. InFigure4(left),weexpandtheplotfromFigure1(middle) showingthatwhilewecanimproveAIME24performance using our budget forcing technique (§3) and more test-time computeitdoeseventuallyflattenoutatsixtimes. Suppress- ingtheend-of-thinkingtokendelimitertoooftencanleadthe modelintorepetitiveloopsinsteadofcontinuedreasoning. In Figure 4 (right), we show that after training Qwen2.5- 32B-Instructonour1,000samplestoproduce s1-32Band equipping it with the simple budget forcing technique, it operatesinadifferentscalingparadigm. Scalingtest-time computeonthebasemodelviamajorityvotingcannotcatch up with the performance of s1-32Bwhich validates our intuition from §3 that sequential scaling is more effective than parallel. We provide example generations of s1-32Bin Figure 5. Sample-efficiency In Figure 2 (right) and Table 1 we compare s1-32Bwith other models. We find that s1- 32Bis the most sample-efficient open data reasoning model. It performs significantly better than our base model (Qwen2.5-32B-Instruct) despite just training it on an ad- ditional1,000samples. Theconcurrentlyreleasedr1-32B shows stronger performance than s1-32Bwhile also only usingSFT(DeepSeek-AIetal.,2025). However,itistrained on 800×more reasoning samples. It is an open question whetheronecanachievetheirperformancewithjust1,000 samples. Finally, our model nearly matches Gemini 2.0 Thinking on AIME24. As the data for s1-32Bis distilled from Gemini2.0, this showsour distillation procedurewas likely effective. 5. Ablations 5.1. Data Quantity, Diversity, and Difficulty In §2 we outlined our three guiding principles in curating s1K: Quality, Difficulty, and Diversity. Here we test the importanceofcombiningthemandtheoverallefficacyofour selection. Only Quality (1K-random) : After obtaining our high-quality reasoning chains from Gemini, we select 1,000 samplesatrandom;notrelyingonourdifficultyanddiversity filtering at all. Table 2 shows this approach performs much worse than s1Kacross all benchmarks. Only Diversity (1K-diverse) : For this dataset, we sample uniformly across domainstomaximizediversitydisregardinganynotionof difficulty. This approach also leads to poor performance similar to 1K-random. Only Difficulty (1K-longest) : Here werelyononeofourdifficultyindicatorsintroducedin§2by selectingthe1,000sampleswiththelongestreasoningtraces. ThisapproachsignificantlyboostsGPQAperformancebut overallstillfallsshortofusing s1K.MaximizeQuantity :Table 2.s1Kdataablations. Webudgetforce(BF)amaximum ofaround30,000thinkingtokensforallscoresinthistable. This performs slightly better than the scores without BF (Table 1) as it allows the model to finish with a best guess when stuck in an infinite loop. We report 95% paired bootstrap confidence intervals for differences relative to the s1Kmodel using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% con- fidence, thetrue differencebetween 59K-full and s1Kis between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K. ModelAIME 2024MATH 500GPQA Diamond 1K-random36.7 90.6 52.0 [-26.7%, -3.3%] [-4.8%, 0.0%] [-12.6%, 2.5%] 1K-diverse26.7 91.2 54.6 [-40.0%, -10.0%] [-4.0%, 0.2%] [-10.1%, 5.1%] 1K-longest33.3 90.4 59.6 [-36.7%, 0.0%] [-5.0%, -0.2%] [-5.1%, 10.1%] 59K-full53.3 92.8 58.1 [-13.3%, 20.0%] [-2.6%, 2.2%] [-6.6%, 8.6%] s1K 50.0 93.0 57.6 Finally, we compare with just training on all of our 59K samples, a superset of all the 1K-sample versions. This leadstoastrongmodelbutusesmuchmoreresources. To finetune on 59K samples, we use 394 H100 GPU hours whiles1-32Bonly required7 H100 GPUhours. Moreover, relyingonlyon s1Kisextremelycompetitiveasshownin§2. Overall, combining all three criteria – Quality,Difficulty, Diversity – via our methodology in §2 is key for sample- efficient reasoning training. 5.2. Test-time scaling methods Table 3.Ablations on methods to scale test-time compute on AIME24. ||refers to the number of evaluation runs used to estimate theproperties; thus ahigher valueindicates more robust- ness.Boldindicates our chosen method and the best values. BF =budgetforcing,TCC/SCC/CCC=token/step/class-conditional control, RS = rejection sampling. Method Control Scaling Performance || BF 100% 15 56.7 5 TCC 40% -24 40.0 5 TCC + BF 100% 13 40.0 5 SCC 60% 3 36.7 5 SCC + BF 100% 6 36.7 5 CCC 50% 25 36.7 2 RS 100% -35 40.0 5 6 Page 7: s1: Simple test-time scaling [AIME24] Alice and Bob play the following game. A stack of 𝑛tokens lies before them. The players take turns with Alice going first. On each turn, the player removeseither 1tokenor4tokensfromthestack. Whoever removes the last token wins. Find the numberofpositiveintegers 𝑛lessthanorequalto 2024for which there exists a strategy for Bob that guaranteesthatBobwillwinthegameregardless of Alice’s play. Theproblemasks forthenumberofpositiveintegers 𝑛≤2024suchthatBobhasawinningstrategy. This is a typical impartial game that can be analyzed using the Sprague-Grundy theorem or by directly finding the winning and losing positions. Let 𝑊be the set of winning positions (number of tokens 𝑛such that the playerwhoseturnitiscanforceawin)and 𝐿bethe set of losing positions. [...] Let𝑃(𝑛)be the statement "the player whose turn it is whenthereare 𝑛tokenshasawinningstrategy. 𝑃(0) is false (the player whose turn it is and faces 0tokens has already lost). Consider the small values of 𝑛: 𝑛= 1: The player whose turn it is can take 1token and win. So 𝑃(1)is true.1∈𝑊. 𝑛=2: The player whose turn it is can take 1token, leaving1token for the opponent. Since 1∈𝑊, the opponent wins. So 𝑃(2)is false.2∈𝐿. 𝑛=3: The player whose turn it is can take 1token, leaving2tokens for the opponent. Since 2∈𝐿, the player whose turn it is wins. So 𝑃(3)is true.3 ∈ 𝑊.[...] We are looking for the number of positive integers 𝑛≤2024for which Bob has a winning strategy. This means that if the game starts with 𝑛tokens, and Alice goes first, Bob can win. This happens if and only if 𝑛isalosingposition(becauseif 𝑛∈𝐿,thenAlice’s first move must lead to a winning position for Bob). [...] Let’s try to find a pattern in the losing positions 𝐿={0,2,5,7,10,12,15,17,20,22,...}. Differences between consecutive terms: 2,3,2,3,2,3,2,3,2,... The pattern of differences is alternating 2,3. [...] The losing positions are of the form 5𝑚or5𝑚+2, where𝑚≥0. We need to find the number of positive integers𝑛≤2024thatareinthesetoflosingpositions 𝐿. [...] Case 1:𝑛= 5𝑚, where𝑚≥1.1≤5𝑚≤2024 1∕5≤𝑚≤2024∕5 = 404 .8Since𝑚is an integer, 1≤𝑚≤404. The number of values of 𝑚is404. Case 2:𝑛= 5𝑚+2, where𝑚≥0.5𝑚+2≤2024 5𝑚≤2022𝑚≤2022∕5 = 404 .4Since𝑚is an in- teger,0≤𝑚≤404. The number of values of 𝑚is 404−0+1=405 . The total number of positive integers 𝑛≤2024that are losing positions is 404+405= 809 .[MATH500] Find the projection of 𝑎onto𝑏=(2 6 3) if𝑎⋅𝑏=8. Theprojectionofvector 𝐚ontovector 𝐛is given by the formula: proj𝐛𝐚=𝐚⋅𝐛 ‖𝐛‖2𝐛 Wearegiven 𝐛=(2 6 3) and𝐚⋅𝐛=8. First,weneedtofindthemagnitude squared of 𝐛, which is ‖𝐛‖2. ‖𝐛‖2=𝐛⋅𝐛=22+62+32=4+36+9=49 . Nowwecanplugthevaluesintothe projection formula: proj𝐛𝐚=8 49𝐛 Substitute the vector 𝐛: proj𝐛𝐚=8 49(2 6 3) Multiply the scalar by each compo- nent of the vector: proj𝐛𝐚=⎛ ⎜ ⎜ ⎜⎝8 49×2 8 49×6 8 49×3⎞ ⎟ ⎟ ⎟⎠=⎛ ⎜ ⎜ ⎜⎝16 4948 4924 49⎞ ⎟ ⎟ ⎟⎠ Final Answer: The final answer is ⎛ ⎜ ⎜ ⎜⎝16 4948 4924 49⎞ ⎟ ⎟ ⎟⎠[GPQA] A quantum mechanical particle of mass𝑚movesintwodimensionsin the following potential, as a function of the polar coordinates (𝑟,𝜃): 𝑉(𝑟,𝜃)=1 2𝑘𝑟2+3 2𝑘𝑟2cos2(𝜃) Find the energy spectrum. Hint: WritethepotentialinCartesianco- ordinates. Theproblemasksfortheenergyspec- trum of a particle in a given 2D po- tential. Thepotentialisgiveninpolar coordinates(𝑟,𝜃): 𝑉(𝑟,𝜃)=1 2𝑘𝑟2+3 2𝑘𝑟2cos2(𝜃). We can rewrite the potential as: 𝑉(𝑟,𝜃)=1 2𝑘𝑟2(1+3cos2(𝜃)). TheHamiltonianforaparticlein2D is given by: 𝐻=−ℏ2 2𝑚( 𝜕2 𝜕𝑥2+𝜕2 𝜕𝑦2) +𝑉(𝑥,𝑦). In polar coordinates (𝑟,𝜃), the Lapla- cian operator is: ∇2=1 𝑟𝜕 𝜕𝑟( 𝑟𝜕 𝜕𝑟) +1 𝑟2𝜕2 𝜕𝜃2. So the Hamiltonian in polar coordi- nates is: 𝐻=−ℏ2 2𝑚[ 1 𝑟𝜕 𝜕𝑟( 𝑟𝜕 𝜕𝑟) +1 𝑟2𝜕2 𝜕𝜃2] +𝑉(𝑟,𝜃). The time-independent Schrödinger equation is𝐻𝜓=𝐸𝜓. We can try to separate variables by assuming a so- lutionoftheform 𝜓(𝑟,𝜃)=𝑅(𝑟)Θ(𝜃). [...] This confirms the energy spectrum is: 𝐸=ℏ√ 𝑘 𝑚( 2𝑛𝑥+𝑛𝑦+3 2) . Final Answer: 𝐸= ℏ√ 𝑘 𝑚( 2𝑛𝑥+𝑛𝑦+3 2) . Figure 5. Examplemodeloutputs. WepickaquestionfromAIME24( left),MATH500( middle),andGPQA( right),whereourmodel generates the correct answer. The black text is the prompt, the light blue text is the reasoning trace, and the blue text is the answer of s1-32B. The gray ellipsis [...] indicates that the text was trimmed to fit this page, but the generated text is actually longer. 7 Page 8: s1: Simple test-time scaling Budget forcing In Table 3 we compare the test-time scal- ing methods we have introduced in§3. Overall, we find that budgetforcing providesperfectcontrol,goodscaling,and leads to our best AIME24 score. Thus, this is the method we use for s1-32Bin Figure 1 and in §4. In Table 4, we comparedifferentstringsforextrapolatingperformance. We find that “Wait” generally gives the best performance. Class-conditional control We provide benchmark scores for this method in §E.1 and summarize three findings here: (1)Token-conditionalcontrolfailswithoutbudgetforcing,as our model cannot reliably count tokens - even when trained to do so.(2)Under step-conditional control, the model gen- erates a similar total number of tokens when given different step targets, as the model goes from few steps with many tokens per step, to many steps with few tokens in each step. Thus, the model learns to hack its way around the com- pute constraint making the controllability of this method mediocre. (3)Class-conditional control can work - telling a model to simply think longer can increase its test-time com- pute and performance, which leads good scaling in Table 3. Table 4.Budget forcing extrapolation ablations. We compare ignoringtheend-of-thinkingdelimitertwiceandappendingnone or various strings. ModelAIME 2024MATH 500GPQA Diamond No extrapolation 50.0 93.0 57.6 2x without string 50.0 90.2 55.1 2x “Alternatively” 50.0 92.2 59.6 2x “Hmm” 50.0 93.0 59.6 2x “Wait” 53.3 93.0 59.6 Rejectionsampling Surprisingly,wefindthatsimplysam- pling until the generation fits a specific length leads to an inverse scaling trend as depicted in Figure 6. In §E.2 we in- 3072 4096 5120 Average thinking time (tokens)2227323742Accuracy (%) Figure 6. Rejection sampling on AIME24 with s1-32B. We sam- plewithatemperatureof1untilallgenerationshavelessthan(from lefttoright)3500,4000,5000,8000,and16000thinkingtokens requiring an average of 655, 97, 8, 3, 2, and 1 tries per sample.spectaquestion,whichwasansweredcorrectlybythemodel when rejection sampling for ≤4000, but not for the ≤8000 tokensetting. Inthe ≤4000settingthemodeldirectlyjumps tothecorrectapproach,whileforthe ≤8000settingitback- tracks a lot. We hypothesize that there is a correlation such thatshortergenerationstendtobetheoneswherethemodel was on the right track from the start, whereas longer ones tend to be ones where the model made mistakes and thus backtracks or questions itself. This leads to longer samples often being wrong when rejection sampling and thus the inverse scaling trend. 6. Discussion and related work 6.1. Sample-efficient reasoning Models There are a number of concurrent efforts to build modelsthatreplicatetheperformanceofo1(OpenAI,2024). For example, DeepSeek-r1 and k1.5 (DeepSeek-AI et al., 2025; Team et al., 2025) are built with reinforcement learn- ing methods, while others rely on SFT using tens of thou- sands of distilled examples (Team, 2025; Xu et al., 2025; Labs, 2025). We show that SFT on only 1,000 examples suffices to build a competitive reasoning model matching o1-preview and produces a model that lies on the pareto frontier (Figure 2). Further, we introduce budget forcing which combined with our reasoning model leads to the first reproduction of OpenAI’s test-time scaling curves (OpenAI, 2024). Whydoessupervisedfinetuningonjust1,000sam- plesleadtosuchperformancegains? Wehypothesizethat the model is already exposed to large amounts of reason- ingdataduringpretrainingwhichspanstrillionsoftokens. Thus, the ability to perform reasoning is already present in our model. Our sample-efficient finetuning stage just acti- vatesitandwescaleitfurtherattesttimewithbudgetforcing. This is similar to the "Superficial Alignment Hypothesis" presented in LIMA (Zhou et al., 2023), where the authors findthat1,000examplescanbesufficienttoalignamodel to adhere to user preferences. Benchmarks and methods To evaluate and push the lim- its of these models, increasingly challenging benchmarks have been introduced, such as Olympiad-level science com- petitions (He et al., 2024; Jain et al., 2024; Zhong et al., 2023)andothers(Srivastavaetal.,2023;Glazeretal.,2024; Su et al., 2024; Kim et al., 2024; Phan et al., 2025). To enhancemodels’performanceonreasoning-relatedtasks,re- searchers have pursued several strategies: Prior works have explored continuing training language models on special- izedcorporarelatedtomathematicsandscience(Azerbayev etal.,2023;Yangetal.,2024),sometimesevensynthetically generated data (Yu et al., 2024). Others have developed training methodologies specifically aimed at reasoning per- formance (Zelikman et al., 2022; 2024; Luo et al., 2025; 8 Page 9: s1: Simple test-time scaling Yuanetal.,2025;Wuetal.,2024a). Anothersignificantline ofworkfocusesonprompting-basedmethodstoelicitand improvereasoningabilities,includingmethodslikeChain- of-Thought prompting (Wei et al., 2023; Yao et al., 2023a;b; Bi et al., 2023; Fu et al., 2023; Zhang et al., 2024b; Xi- ang etal., 2025;Hu et al.,2024; Diaoet al.,2024). These combined efforts aim to advance the reasoning ability of language models, enabling them to handle more complex and abstract tasks effectively. 6.2. Test-time scaling Methods Asweintroducein§3,wedifferentiatetwometh- odstoscaletest-timecompute: parallelandsequential . The former relies on multiple solution attempts generated in par- allel and selecting the best outcome via specific criteria. Thesecriteriaincludechoosingthemostfrequentresponse formajorityvotingorthebestresponsebasedonanexter- nalreward forBest-of-N(Brown etal.,2024; Irvineetal., 2023;Levi,2024). Unlikerepeatedsampling,previousse- quential scaling methods let the model generate solution attemptssequentiallybasedonpreviousattempts,allowing it to refine each attempt based on previous outcomes (Snell etal.,2024;Houetal.,2025;Leeetal.,2025). Tree-based search methods (Gandhi et al., 2024; Wu et al., 2024b) offer ahybridapproachbetweensequentialandparallelscaling, such as Monte-Carlo Tree Search (MCTS) (Liu et al., 2024; Zhang et al., 2023; Zhou et al., 2024; Choi et al., 2023) and guided beam search (Xie et al., 2023). REBASE (Wu et al., 2024b)employsaprocessrewardmodeltobalanceexploita- tion and pruning during tree search. Empirically, REBASE has been shown to outperform sampling-based methods and MCTS (Wu et al., 2024b). Reward models (Lightman et al., 2023; Wang et al., 2024b;c) play a key role in these meth- ods. They come in two variants: outcome reward models andprocessrewardmodels. Outcomerewardmodels(Xin et al., 2024; Ankner et al., 2024) assign a score to complete solutions and are particularly useful in Best-of-N selection, while process reward models (Lightman et al., 2023; Wang et al., 2024b; Wu et al., 2024b) assess individual reasoning stepsandareeffectiveinguidingtree-basedsearchmethods. Limitstofurthertest-timescaling Wehaveshownthat budget forcing allows extrapolating test-time compute in §4, e.g., improvingAIME24performancefrom50%to57%. However, it has two key limitations when scaling further: it eventually flattens out (Figure 4), and the context window of the underlying language model constrains it. Despite these constraints, our work shows test-time scaling across a wide range of accuracies (Figure 1), partly because scaling downtest-timecomputebehavespredictablyanddoesnot suffer from these constraints. Continuing test-time scaling will require approaches that can further extrapolate test-time compute. How can we get 2048 8192 32768 131072 Average thinking time (tokens)30405060Accuracy (%) REBASE Majority Sequential scalingFigure 7. Scalingfurtherwithparallelscalingmethods. Allmet- rics averaged over the 30 questions in AIME24. Average thinking tokensfor REBASE donotaccountfortheadditionalcomputefrom the reward model. For sequential scaling, we prompt the model to use up to (from left to right) 32, 64, 256, and 512 steps. For REBASE and majority voting we generate 16 parallel trajectories to aggregate across. such extrapolation? There may be improvements to budget forcing such as rotating through different strings, not only “Wait”, or combining it with frequency penalties or higher temperature to avoid repetitive loops. An exciting direc- tion for future work is also researching whether applying budget forcing to a reasoning model trained with reinforce- mentlearningyieldsbetterextrapolation;orifRLallowsfor newwaysoftest-timescalingbeyondbudgetforcing. Our workdefinestherightmetrics(§3.2)–Control,Scaling,and Performance – to enable future research and progress on extrapolating test-time compute. Parallel scalingasa solution Parallelscaling offersone solution to the limits of sequential scaling, thus we aug- mentoursequentiallyscaledmodelwithtwomethods: (I) Majority voting: After generating 𝑘solutions, the final solution is the most frequent one across generations; (II) Treesearchvia REBASE :Weusethe REBASE process reward model, which is initialized from LLaMA-34B and further finetuned on a synthetic process reward modeling dataset (Wu et al., 2024b). We then aggregate the solutions generated by REBASE via majority voting. As shown in Figure7,augmentingourmodelwith REBASE scalesbetter thanmajorityvoting,andevensequentialscalinginthissce- nario. However, REBASE requires an additional forward passateachstepfortherewardmodeladdingsomecomputa- tionoverhead. Forsequentialscaling,whenpromptedtouse upto512steps,for12outofthe30evaluationquestionsthe modelgeneratesaresponsethatexceedsthecontextwindow leadingtoalargeperformancedrop. Overall,wefindthat these parallel scaling methods complement sequential scal- ing thus they offer an avenue for scaling test-time compute even further; beyond fixed context windows. 9 Page 10: s1: Simple test-time scaling Impact Statement Language models with strong reasoning capabilities have thepotentialto greatlyenhance humanproductivity, from assistingincomplexdecision-makingtodrivingscientific breakthroughs. However,recentadvancesinreasoning,such asOpenAI’so1andDeepSeek’sr1, lacktransparency, limit- ing broader research progress. Our work aims to push the frontier of reasoning in a fully open manner, fostering in- novation and collaboration to accelerate advancements that ultimately benefit society. Acknowledgements WethankRyanMartenforgeneratingtracesfromDeepSeek r1 for s1.1 using Bespoke Curator (Marten et al., 2025). ThisworkwaspartlyconductedusingtheStanfordMarlowe GPUcluster(Kapferetal.,2025),madepossiblebyfinancial supportfromStanfordUniversity. WethankAlexanderM. Rush,AndrewIlyas,BanghuaZhu,ChengleiSi,Chunting Zhou, John Yang, Ludwig Schmidt, Samy Jelassi, Suhas Kotha,TengyuMa,XuechenLi,YuSun,andYueZhangfor very constructive discussions. References Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Am- manabrolu, P. Critique-out-loud reward models, 2024. URL https://arxiv.org/abs/2408.11791 . Arora, D., Singh, H. G., and Mausam. Have llms advanced enough? a challenging problem solving benchmark for largelanguagemodels,2023. URL https://arxiv. org/abs/2305.15074 . Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for math- ematics, 2023. Bi, Z., Zhang, N., Jiang, Y., Deng, S., Zheng, G., and Chen, H. When do program-of-thoughts work for reason- ing?,2023.URL https://arxiv.org/abs/2308. 15452. Biderman,S.,Schoelkopf,H.,Sutawika,L.,Gao,L.,Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde, J.Z.,Foster,C.,Hsu,J.,Jaiswal,M.,Lee,W.Y.,Li,H., Lovering, C., Muennighoff, N., Pavlick, E., Phang, J., Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata, G.I.,Yvon,F.,andZou,A. Lessonsfromthetrencheson reproducible evaluation of language models, 2024. Brown,B.,Juravsky,J.,Ehrlich,R.,Clark,R.,Le,Q.V.,Ré, C., and Mirhoseini, A. Large language monkeys: Scalinginference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787 . Cesista,F.L. Multimodalstructuredgeneration: Cvpr’s2nd mmfm challenge technical report, 2024. URL https: //arxiv.org/abs/2406.11403 . Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem- drivenquestionansweringdataset,2023. URL https: //arxiv.org/abs/2305.12524 . Choi,S.,Fang,T.,Wang,Z.,andSong,Y. Kcts: Knowledge- constrained tree search decoding with token-level halluci- nationdetection,2023. URL https://arxiv.org/ abs/2310.09044 . DeepSeek-AI,Guo,D.,Yang,D.,Zhang,H.,Song,J.,Zhang, R.,Xu,R.,Zhu,Q.,Ma,S.,Wang,P.,Bi,X.,Zhang,X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G.,Chen,G.,Li,G.,Zhang,H.,Bao,H.,Xu,H.,Wang, H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J., Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai, J. L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Zhao, L., Wang, L., Zhang, L., Xu, L., Xia, L., Zhang, M.,Zhang,M.,Tang,M.,Li,M., Wang,M.,Li,M.,Tian, N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q., Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R. J., Jin, R. L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang, S., Yu, S., Zhou, S., Pan, S., Li, S. S., Zhou, S., Wu, S.,Ye,S.,Yun,T.,Pei,T.,Sun,T.,Wang,T.,Zeng,W., Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Xiao, W. L., An, W., Liu, X., Wang, X., Chen, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X., Li, X., Su, X., Lin, X., Li, X. Q., Jin, X., Shen, X., Chen, X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X., Shan,X.,Li,Y.K.,Wang,Y.Q.,Wei,Y.X.,Zhang,Y., Xu,Y.,Li,Y.,Zhao,Y.,Sun,Y.,Wang,Y.,Yu,Y.,Zhang, Y., Shi, Y., Xiong, Y., He, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong, Y.,Zou,Y.,He,Y.,Xiong,Y.,Luo,Y.,You,Y.,Liu,Y., Zhou, Y., Zhu, Y. X., Xu, Y., Huang, Y., Li, Y., Zheng, Y.,Zhu,Y.,Ma,Y.,Tang,Y.,Zha,Y.,Yan,Y.,Ren,Z.Z., Ren,Z.,Sha,Z.,Fu,Z.,Xu,Z.,Xie,Z.,Zhang,Z.,Hao, Z.,Ma,Z.,Yan,Z.,Wu,Z.,Gu,Z.,Zhu,Z.,Liu,Z.,Li, Z.,Xie,Z.,Song,Z.,Pan,Z.,Huang,Z.,Xu,Z.,Zhang, Z.,andZhang,Z. Deepseek-r1: Incentivizingreasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948 . Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., and Zhang, T. Activeprompting withchain-of-thought forlarge lan- 10 Page 11: s1: Simple test-time scaling guage models, 2024. URL https://arxiv.org/ abs/2302.12246 . Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman,A.,Mathur,A.,Schelten,A.,Yang,A.,Fan,A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravanku- mar,A.,Korenev,A.,Hinsvark,A.,Rao,A.,Zhang,A., Rodriguez, A., Gregerson, A., et al. The llama 3 herd ofmodels,2024. URL https://arxiv.org/abs/ 2407.21783 . Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot, T. Complexity-based prompting for multi-step reason- ing,2023. URL https://arxiv.org/abs/2210. 00720. Gandhi,K.,Lee,D.,Grand,G.,Liu,M.,Cheng,W.,Sharma, A.,andGoodman,N.D. Streamofsearch(sos): Learning to search in language, 2024. URL https://arxiv. org/abs/2404.03683 . Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L.,Ma,C.,Chen,L.,Xu,R.,Tang,Z.,Wang,B.,Zan,D., Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T., andChang,B. Omni-math: Auniversalolympiadlevel mathematic benchmark for large language models, 2024a. URL https://arxiv.org/abs/2410.07985 . Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang,K.,andZou,A.Aframeworkforfew-shotlanguage model evaluation, September 2021. URL https:// doi.org/10.5281/zenodo.5371628 . Gao,Z.,Niu,B.,He,X.,Xu,H.,Liu,H.,Liu,A.,Hu,X.,and Wen, L. Interpretable contrastive monte carlo tree search reasoning,2024b. URL https://arxiv.org/abs/ 2410.01707 . Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A., deOliveiraSantos,E.,Järviniemi,O.,Barnett,M.,San- dler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine, L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T., Enugandla,S.V.,andWildon,M. Frontiermath: Abench- mark for evaluating advanced mathematical reasoning in ai, 2024. URL https://arxiv.org/abs/2411. 04872. Google. Gemini 2.0 flash thinking mode (gemini-2.0- flash-thinking-exp-1219), December 2024. URL https://cloud.google.com/vertex-ai/ generative-ai/docs/thinking-mode . Groeneveld,D.,Beltagy,I.,Walsh,P.,Bhagia,A.,Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I.,Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K.R.,Cohan,A.,Dumas,J.,Elazar,Y.,Gu,Y.,Hessel,J., Khot,T.,Merrill,W.,Morrison,J.,Muennighoff,N.,Naik, A.,Nam,C.,Peters,M.E.,Pyatkin,V.,Ravichander,A., Schwenk,D.,Shah,S.,Smith,W.,Strubell,E.,Subramani, N.,Wortsman,M.,Dasigi,P.,Lambert,N.,Richardson, K.,Zettlemoyer,L.,Dodge,J.,Lo,K.,Soldaini,L.,Smith, N. A., and Hajishirzi, H. Olmo: Accelerating the science of language models, 2024. He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L., Liu, Z., and Sun, M. Olympiadbench: A challeng- ing benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008 . Hendrycks,D.,Burns,C.,Kadavath,S.,Arora,A.,Basart, S.,Tang,E.,Song,D.,andSteinhardt,J. Measuringmath- ematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874 . Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai,T.,Rutherford,E.,deLasCasas,D.,Hendricks,L.A., Welbl,J.,Clark,A.,Hennigan,T.,Noland,E.,Millican, K.,vandenDriessche,G.,Damoc,B.,Guy,A.,Osindero, S.,Simonyan,K.,Elsen,E.,Rae,J.W.,Vinyals,O.,and Sifre, L. Training compute-optimal large language mod- els, 2022. URL https://arxiv.org/abs/2203. 15556. Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. Advancing language model rea- soningthroughreinforcementlearningandinferencescal- ing,2025. URL https://arxiv.org/abs/2501. 11651. Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettle- moyer,L.,Smith,N.A.,andKrishna,R. Visualsketch- pad: Sketchingasavisualchainofthoughtformultimodal languagemodels,2024. URL https://arxiv.org/ abs/2406.09403 . Huang,Z.,Wang,Z.,Xia,S.,Li,X.,Zou,H.,Xu,R.,Fan, R.-Z., Ye, L., Chern, E., Ye, Y., Zhang, Y., Yang, Y., Wu, T.,Wang,B.,Sun,S.,Xiao,Y.,Li,Y.,Zhou,F.,Chern,S., Qin, Y., Ma, Y., Su, J., Liu, Y., Zheng, Y., Zhang, S., Lin, D.,Qiao,Y.,andLiu,P. Olympicarena: Benchmarking multi-disciplinecognitivereasoningforsuperintelligent ai, 2024a. URL https://arxiv.org/abs/2406. 12753. Huang, Z., Zou, H., Li, X., Liu, Y., Zheng, Y., Chern, E., Xia, S., Qin, Y., Yuan, W., and Liu, P. O1 replication journey–part2: Surpassing o1-previewthroughsimple distillation, big progress or bitter lesson?, 2024b. URL https://arxiv.org/abs/2411.16489 . 11 Page 12: s1: Simple test-time scaling Huang, Z., Geng, G., Hua, S., Huang, Z., Zou, H., Zhang, S.,Liu,P.,andZhang,X. O1replicationjourney–part3: Inference-time scaling for medical reasoning, 2025. URL https://arxiv.org/abs/2501.06458 . Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z., Mudupalli, V., Korshuk, A., Liu, Z., Cremer, F., As- sassi, V., Beauchamp, C.-C., Lu, X., Rialan, T., and Beauchamp, W. Rewarding chatbots for real-world en- gagement with millions of users, 2023. URL https: //arxiv.org/abs/2303.06135 . Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evalu- ation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974 . Kapfer, C., Stine, K., Narasimhan, B., Mentzel, C., and Candes,E.Marlowe: Stanford’sgpu-basedcomputational instrument, January 2025. URL https://doi.org/ 10.5281/zenodo.14751899 . Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language mod- els, 2020. URL https://arxiv.org/abs/2001. 08361. Kim,E.,Suk,J.,Kim,S.,Muennighoff,N.,Kim,D.,andOh, A. Llm-as-an-interviewer: Beyondstatic testingthrough dynamic llm evaluation, 2024. URL https://arxiv. org/abs/2412.10424 . Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Effi- cientmemorymanagementforlargelanguagemodelserv- ing with pagedattention, 2023. URL https://arxiv. org/abs/2309.06180 . Labs,B. Bespoke-stratos: Theunreasonableeffectiveness of reasoning distillation, 2025. URL https://hf. co/bespokelabs/Bespoke-Stratos-32B . Ac- cessed: 2025-01-22. Lee, K.-H., Fischer, I., Wu, Y.-H., Marwood, D., Baluja, S., Schuurmans, D., and Chen, X. Evolving deeper llm thinking, 2025. URL https://arxiv.org/abs/ 2501.09891 . Levi, N. A simple model of inference scaling laws, 2024. URL https://arxiv.org/abs/2410.16377 . LI, J., Beeching, E., Tunstall, L., Lipkin, B., Solet- skyi, R., Huang, S. C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., and Polu, S. Numinamath, 2024. URL https://github.com/project-numina/aimo-progress-prize/blob/main/report/ numina_dataset.pdf . Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step, 2023. URL https: //arxiv.org/abs/2305.20050 . Ling,W.,Yogatama,D.,Dyer,C.,andBlunsom,P. Program induction by rationale generation : Learning to solve and explain algebraic word problems, 2017. URL https: //arxiv.org/abs/1705.04146 . Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang, Y. Logiqa: A challenge dataset for machine reading com- prehensionwithlogicalreasoning,2020. URL https: //arxiv.org/abs/2007.08124 . Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi, H., and Celikyilmaz, A. Don’t throw away your value model! generatingmorepreferabletextwithvalue-guided monte-carlo tree search decoding, 2024. URL https: //arxiv.org/abs/2309.15028 . Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization, 2019. Luo,H.,Sun,Q.,Xu,C.,Zhao,P.,Lou,J.,Tao,C.,Geng,X., Lin,Q.,Chen,S.,Tang,Y.,andZhang,D. Wizardmath: Empoweringmathematicalreasoningforlargelanguage models via reinforced evol-instruct, 2025. URL https: //arxiv.org/abs/2308.09583 . Marten, R., Vu, T., Ji, C. C.-J., Sharma, K., Pimpalgaonkar, S.,Dimakis,A.,andSathiamoorthy,M. Curator: Atool forsyntheticdatacreation. https://github.com/ bespokelabsai/curator , January 2025. Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Mor- rison,J.,Min,S.,Shi,W.,Walsh,P.,Tafjord,O.,Lambert, N.,Gu,Y.,Arora,S.,Bhagia, A.,Schwenk,D.,Wadden, D.,Wettig,A.,Hui,B.,Dettmers,T.,Kiela,D.,Farhadi, A.,Smith,N.A.,Koh,P.W.,Singh,A.,andHajishirzi,H. Olmoe: Open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2409.02060 . of America, M. A. Aime, February 2024. URL https: //artofproblemsolving.com/wiki/index. php/AIME_Problems_and_Solutions/ . OpenAI. Learning to reason with llms, Septem- ber 2024. URL https://openai.com/index/ learning-to-reason-with-llms/ . OpenAI. Openaio3-mini,2025. URL https://openai. com/index/openai-o3-mini/ . Accessed: 2025- 02-24. 12 Page 13: s1: Simple test-time scaling Phan,L.,Gatti,A.,Han,Z.,Li,N.,Hu,J.,Zhang,H.,Shi, S., Choi, M., Agrawal, A., Chopra, A., et al. Humanity’s last exam, 2025. URL https://arxiv.org/abs/ 2501.14249 . Qin, Y., Li, X., Zou, H., Liu, Y., Xia, S., Huang, Z., Ye, Y.,Yuan,W.,Liu,H.,Li,Y.,andLiu,P. O1replication journey: A strategic progress report – part 1, 2024. URL https://arxiv.org/abs/2410.18982 . Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T.,Xia,T.,Ren,X.,Ren,X.,Fan,Y.,Su,Y.,Zhang,Y., Wan,Y.,Liu,Y.,Cui,Z.,Zhang,Z.,andQiu,Z. Qwen2.5 technicalreport,2024. URL https://arxiv.org/ abs/2412.15115 . Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022 . Shi, Q., Tang, M., Narasimhan, K., and Yao, S. Can lan- guagemodelssolveolympiadprogramming?,2024. URL https://arxiv.org/abs/2404.10952 . Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test- timecomputeoptimallycanbemoreeffectivethanscal- ing model parameters, 2024. URL https://arxiv. org/abs/2408.03314 . Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifyingandextrapolatingthecapabilitiesoflanguage models, 2023. Su,H.,Yen,H.,Xia,M.,Shi,W.,Muennighoff,N.,yuWang, H.,Liu,H.,Shi,Q.,Siegel,Z.S.,Tang,M.,Sun,R.,Yoon, J., Arik, S. O., Chen, D., and Yu, T. Bright: A realis- tic and challenging benchmark for reasoning-intensive retrieval, 2024. URL https://arxiv.org/abs/ 2407.12883 . Sun,L.,Han,Y.,Zhao,Z.,Ma,D.,Shen,Z.,Chen,B.,Chen, L., and Yu, K. Scieval: A multi-level large language model evaluation benchmark for scientific research, 2024. URL https://arxiv.org/abs/2308.13149 . Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C.,Xiao,C.,Du,C.,Liao,C.,Tang,C.,Wang,C.,Zhang, D., Yuan, E., Lu, E., Tang, F., Sung, F., Wei, G., Lai, G., Guo, H., Zhu, H., Ding, H., Hu, H., Yang, H., Zhang, H., Yao, H., Zhao, H., Lu, H., Li, H., Yu, H., Gao, H., Zheng,H., Yuan, H., Chen, J., Guo, J., Su, J., Wang, J., Zhao, J., Zhang,J.,Liu,J.,Yan,J.,Wu,J.,Shi,L.,Ye,L.,Yu,L., Dong, M., Zhang, N., Ma, N., Pan, Q., Gong, Q., Liu, S., Ma, S., Wei, S., Cao, S., Huang, S., Jiang, T., Gao, W., Xiong, W., He, W., Huang, W., Wu, W., He, W., Wei, X., Jia,X.,Wu,X.,Xu,X.,Zu,X.,Zhou,X.,Pan,X.,Charles, Y.,Li,Y.,Hu,Y.,Liu,Y.,Chen,Y.,Wang,Y.,Liu,Y.,Qin, Y.,Liu,Y.,Yang,Y.,Bao,Y.,Du,Y.,Wu,Y.,Wang,Y., Zhou, Z., Wang, Z., Li, Z., Zhu, Z., Zhang, Z., Wang, Z., Yang,Z.,Huang,Z.,Huang,Z.,Xu,Z.,andYang,Z.Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599 . Team, N. Sky-t1: Fully open-source reasoning model with o1-preview performance in $450 budget, 2025. URL https://novasky-ai.github.io/ posts/sky-t1 . Accessed: 2025-01-09. Team, Q. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm. github.io/blog/qwq-32b-preview/ . Wang, J., Meng, F., Liang, Y., and Zhou, J. Drt-o1: Op- timized deep reasoning translation via long chain-of- thought, 2024a. URL https://arxiv.org/abs/ 2412.17498 . Wang, P., Li, L.,Shao, Z., Xu, R.X., Dai, D.,Li, Y., Chen, D.,Wu,Y.,andSui,Z. Math-shepherd: Verifyandrein- forcellmsstep-by-stepwithouthumanannotations,2024b. URL https://arxiv.org/abs/2312.08935 . Wang, S., Liu, Z., Zhong, W., Zhou, M., Wei, Z., Chen, Z., and Duan, N. From lsat: The progress and challenges of complex reasoning, 2021. URL https://arxiv. org/abs/2108.00648 . Wang,Z.,Dong,Y.,Delalleau,O.,Zeng,J.,Shen,G.,Egert, D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Help- steer2: Open-sourcedatasetfortrainingtop-performing rewardmodels,2024c. URL https://arxiv.org/ abs/2406.08673 . Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of- thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/ 2201.11903 . Welleck,S.,Bertsch,A.,Finlayson,M.,Schoelkopf,H.,Xie, A.,Neubig,G.,Kulikov,I.,andHarchaoui,Z. Fromde- coding to meta-generation: Inference-time algorithms for largelanguagemodels,2024. URL https://arxiv. org/abs/2406.16838 . 13 Page 14: s1: Simple test-time scaling Wu, T., Lan, J., Yuan, W., Jiao, J., Weston, J., and Sukhbaatar, S. Thinking llms: General instruction fol- lowing with thought generation, 2024a. URL https: //arxiv.org/abs/2410.10630 . Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Infer- ence scaling laws: An empirical analysis of compute- optimalinferenceforproblem-solvingwithlanguagemod- els,2024b. URL https://arxiv.org/abs/2408. 00724. Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A., Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan, D.,Castricato,L., Franken,J.-P.,Haber, N.,andFinn,C. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025. URL https: //arxiv.org/abs/2501.04682 . Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. Self-evaluation guided beam search for reasoning,2023. URL https://arxiv.org/abs/ 2305.00633 . Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan, C., Li, W., and Liang, X. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data,2024. URL https://arxiv.org/abs/2405. 14333. Xu, H., Wu, X., Wang, W., Li, Z., Zheng, D., Chen, B., Hu, Y., Kang, S., Ji, J., Zhang, Y., Guo, Z., Yang, Y., Zhang,M.,andZhang,D. Redstar: Doesscalinglong-cot dataunlockbetterslow-reasoningsystems?,2025. URL https://arxiv.org/abs/2501.11284 . Yang, Z., Band, N., Li, S., Candès, E., and Hashimoto, T. Synthetic continued pretraining, 2024. URL https: //arxiv.org/abs/2409.07431 . Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problemsolvingwithlargelanguagemodels,2023a. URL https://arxiv.org/abs/2305.10601 . Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models, 2023b. URL https://arxiv. org/abs/2210.03629 . Ye,Y.,Huang,Z.,Xiao,Y.,Chern,E.,Xia,S.,andLiu,P. Limo: Less ismorefor reasoning,2025a. URL https: //arxiv.org/abs/2502.03387 . Ye, Y., Xiao, Y., Mi, T., and Liu, P. Aime-preview: A rigorous and immediate evaluation framework for ad- vanced mathematical reasoning. https://github. com/GAIR-NLP/AIME-Preview , 2025b. GitHub repository.Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T.,Li,Z.,Weller,A.,andLiu,W. Metamath: Bootstrap yourownmathematicalquestionsforlargelanguagemod- els, 2024. URL https://arxiv.org/abs/2309. 12284. Yuan, S., Chen, Z., Xi, Z., Ye, J., Du, Z., and Chen, J. Agent-r: Training language model agents to reflect via iterative self-training, 2025. URL https://arxiv. org/abs/2501.11425 . Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. Star: Bootstrapping reasoning with reasoning, 2022. URL https://arxiv.org/abs/2203.14465 . Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., and Goodman, N. D. Quiet-star: Language models can teach themselves to think before speaking, 2024. URL https://arxiv.org/abs/2403.09629 . Zhang, H. and Chen, C. Test-time compute scaling laws, 2024. URL https://github.com/hughbzhang/ o1_inference_scaling_laws . Zhang,S., Chen,Z.,Shen,Y., Ding,M.,Tenenbaum, J.B., and Gan, C. Planning with large language models for codegeneration,2023. URL https://arxiv.org/ abs/2303.05510 . Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and Sang, J. o1-coder: an o1 replication for coding, 2024a. URL https://arxiv.org/abs/2412.00154 . Zhang, Y., Yang, J., Yuan, Y., and Yao, A. C.-C. Cumula- tivereasoningwithlargelanguagemodels,2024b. URL https://arxiv.org/abs/2308.04371 . Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z., and Sun, M. Jec-qa: Alegal-domainquestionansweringdataset, 2019. URL https://arxiv.org/abs/1911.12011 . Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centricbenchmarkforevaluatingfoundationmod- els, 2023. URL https://arxiv.org/abs/2304. 06364. Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies rea- soning acting and planning in language models, 2024. URL https://arxiv.org/abs/2310.04406 . Zhou,C.,Liu,P.,Xu,P.,Iyer,S.,Sun,J.,Mao,Y.,Ma,X., Efrat,A.,Yu,P.,Yu,L.,Zhang,S.,Ghosh,G.,Lewis,M., Zettlemoyer, L., and Levy, O. Lima: Less is more for alignment,2023. URL https://arxiv.org/abs/ 2305.11206 . 14 Page 15: s1: Simple test-time scaling Contents 1 Introduction 1 2 Reasoning data curation to create s1K 2 2.1 Initial collection of 59K samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Final selection of 1K samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 3 Test-time scaling 3 3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Results 4 4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5 Ablations 6 5.1 Data Quantity, Diversity, and Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5.2 Test-time scaling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 Discussion and related work 8 6.1 Sample-efficient reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6.2 Test-time scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 A s1.1 16 B Evaluation determinism 16 C s1K details 17 C.1s1Ksummary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.2 Dataset composition for full 59K questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.3s1Kgrading prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C.4s1Kdiversity selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C.5 Decontamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 D Training details 21 D.1 Training Ablations: Sequence length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.2 Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 E Test-time scaling details 35 E.1 Sequential scaling ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 E.2 Examples for rejection sampling ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 F Version Control 46 15 Page 16: s1: Simple test-time scaling A. s1.1 Sevendaysafterourreleaseofs1,wereleaseds1.1. Weregeneratedtracesforour1,000samplesin s1KusingDeepSeek r1 (DeepSeek-AI et al., 2025) to create s1K-1.1. We use the same training procedure to train our model s1.1. Other updates since our launch include the release of o3 (OpenAI, 2025), LIMO (Ye et al., 2025a), and AIME 2025. We consider all these newdevelopmentsinTable5. Wefindthats1.1performssignificantlybetterthans1. WealsotrieddistillingfromClaude 3.7, which led to worse performance than from r1 (not reported). Table 5.s1-32B is an open and sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see §4). Otherresults are from therespectivereports (Qwen et al.,2024; Team, 2024; OpenAI,2024; 2025; DeepSeek-AI et al., 2025; Labs, 2025; Team, 2025) except for AIME 2025 (Ye et al., 2025b). # ex. = number examples used for reasoning finetuning; BF = budget forcing. Model # Examples MATH500 GPQA AIME 2024 AIME 2025 API only o3-mini-low N/A 95.8 70.6 56.3 42.1 o3-mini-medium N/A 97.3 76.8 75.8 70.4 o3-mini-high N/A 97.9 79.7 83.8 80.9 Open Weights QwQ-32B N.A. 90.6 54.5 46.7 32.7 r1 ≫800K 07.3 71.5 79.8 70.0 r1-distill-Llama-70B 800K 94.5 65.2 57.1 56.3 r1-distill-Qwen-14B 800K 93.9 59.1 61.7 48.0 r1-distill-Qwen-32B 800K 94.3 62.1 58.3 49.6 Open Weights and Open Data LIMO 817 94.8 66.7 56.3 44.6 s1 w/o BF 1K 92.6 56.6 50.0 26.7 s1 with Budget Forcing “Wait” 1x 1K 92.8 59.6 53.3 30.0 s1 with Budget Forcing “Wait” 2x 1K 93.0 59.6 53.3 33.3 s1 with Budget Forcing “Wait” 4x 1K 92.2 58.6 56.7 36.7 s1.1 w/o BF 1K 94.4 60.6 56.7 50.0 s1.1 with Budget Forcing “Wait” 1x 1K 95.4 62.6 56.7 50.0 s1.1 with Budget Forcing “Wait” 2x 1K 95.4 63.6 56.7 50.0 B. Evaluation determinism We run our evaluations using vLLM (Kwon et al., 2023) as it is faster than the alternatives we tried. However, we find that even when using the same random seeds and greedy sampling, evaluation scores can change significantly across runs: •Different batch sizes causing different results see https://github.com/vllm-project/vllm/issues/ 5898 •Continuinggenerationscausingdifferentresultssee https://github.com/vllm-project/vllm/issues/ 11783 •Changes in tensor parallelism causing different results As our model generates long reasoning traces prior to its answer, small numeric changes can snowball into large differences. We encounter many generations that are exactly the same for thousands of tokens and then suddenly differ in one token eventuallyendingupwithanentirelydifferentanswer. Topartlycounterthisissuewegenerallyrunourfinalevaluations using full precision unless otherwise indicated. 16 Page 17: s1: Simple test-time scaling C. s1K details C.1. s1K summary Table 6.Summaryofourdatasets1K .TokencountmeasuredbytheQwen-2.5tokenizer. WepromptClaudetoproducekeywordsgiven several questions from the domain. Domain #questions Total token count Keywords Geometry 109 560.2K Area, Triangle, Distance Number theory 98 522.5K Sequences, Divisibility Combinatorics 75 384.7K Permutations, Counting Real functions 43 234.8K Trigonometry, Calculus Biology 41 120.9K Organic reactions Complex functions 32 170.2K Complex roots Quantum theory 32 127.9K Particles, Wave functions Field theory 28 150.1K Polynomials, Roots Calculus of variations 28 155.5K Optimization, Control Difference equations 24 132.5K Recurrence, Recursion Electromagnetic theory 23 95.8K Optics, Waves, Diffraction Group theory 22 100.0K Groups, Automorphisms Linear algebra 22 128.3K Matrices, Determinants Probability theory 20 114.6K Random walk, Expectation Algebraic systems 19 109.9K Functional equations Mechanics 19 103.6K Forces, Motion, Energy Thermodynamics 19 74.2K Heat engines, Entropy Differential equations 18 89.6K Substitution, Existence Computer science 18 34.2K Complexity theory, Algorithms Numerical analysis 18 76.5K Error analysis, Stability Calculus 17 96.3K Convergence, Summation Algebraic structures 17 90.4K Inequalities, Sets Astronomy 16 37.7K Stellar populations, Orbits Remaining 27 domains 242 982.2K Domains with ≤16 questions All domains (51) 1000 4.7M s1K C.2. Dataset composition for full 59K questions 17 Page 18: s1: Simple test-time scaling Table 7.Compositionoffull59Kquestions. ThinkingandresponselengthsaremeasuredintokensusingtheQwen2.5-32B-Instruct tokenizer (Qwen et al., 2024). In addition to excluding our evaluation benchmark, AIME24, we also exclude AIME questions from 2022-2023 as we use these 90 questions during our development stage of s1-32B. Source Description #Samples Avg. thinking length NuminaMATH (LI et al., 2024) Math problems from online websites 30660 4.1K MATH (Hendrycks et al., 2021) Math problems from competitions 11999 2.9K OlympicArena (Huang et al., 2024a)Astronomy,Biology,Chemistry,Com- puter Science, Geography, Math, and Physics olympiad questions4250 3.2K OmniMath (Gao et al., 2024a) Math problems from competitions 4238 4.4K AGIEval (Zhong et al., 2023; Lingetal.,2017;Hendrycksetal., 2021; Liu et al., 2020; Zhong et al., 2019; Wang et al., 2021)English, Law, Logic and Math prob- lems from the SAT, LSAT and other exams2385 1.2K xword Crossword puzzles 999 0.7K OlympiadBench (He et al., 2024) Math and Physics olympiad questions 896 3.9K AIME (1983-2021) AmericanInvitationalMathematicsEx- amination890 4.7K TheoremQA (Chen et al., 2023) Computer Science, Finance, Math, and Physics university-level questions relat- ing to theorems747 2.1K USACO (Shi et al., 2024) Code problems from the USA Comput- ing Olympiad519 3.6K JEEBench (Arora et al., 2023) Chemistry,Math,andPhysicsproblems used in the university entrance exami- nation of the Indian Institute of Tech- nology515 2.9K GPQA (Rein et al., 2023) PhD-Level Science Questions 348 2.9K SciEval (Sun et al., 2024) Biology, Chemistry, andPhysicsprob- lems from various sources227 0.7K s1-prob Stanford statistics qualifying exams 182 4.0K LiveCodeBench(Jainetal.,2024) Code problems from coding websites (LeetCode, AtCoder,and CodeForces)151 3.5K s1-teasers Mathbrain-teaserscrawledfromtheIn- ternet23 4.1K All 59K questions Composite of the above datasets with reasoning traces and solutions59029 3.6K 18 Page 19: s1: Simple test-time scaling C.3. s1K grading prompt Togradewhetheranexampleiscorrectforourdatasetselectionin§2,weusethepromptinFigure8. WegradeusingClaude 3.5 except for the correctness among the final 1,000 samples, which we graded with Claude 3.7. YouareanAIassistantforgradingascienceproblem. Theuserwillprovideyouwiththequestionitself,anattemptmadeby a student and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the correctanswer. Iftheexpectedsolutionconcludeswithanumberorchoice,thereshouldbenoambiguity. Iftheexpected solution involves going through the entire reasoningprocess, you should judge the attempt based on whether the reasoning process is correct with correct answer if helpful. The user will provide the attempt and the correct answer in the following format: # Problem {problem} ## Attempt {attempt} ## Correct answer {solution} Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes). Figure 8. Grading prompt. C.4. s1K diversity selection Algorithm 1 provides our algorithm for selecting data in our diversity selection stage. As mentioned in §2, we also include samples from some specific benchmarks we perceive as high-quality. None of the samples overlap with our final evaluation. C.5. Decontamination We filter all samples by checking for an 8-gram overlap between the selected examples and the evaluation benchmarks: MATH500, GPTQA Diamond, and AIME24. We exclude questions with more than an 8-gram overlap. 19 Page 20: s1: Simple test-time scaling Algorithm 1 Two-stage sampling for s1K 1:Input: := Set of 24,496 questions with features 2:Output: := Set of 1,000 selected questions 3:←∅ Initialize the output set (only tracks unique elements) 4:for𝑞∈do 5:ifIsGeminiCorrect( 𝑞)and(IsAIME(𝑞)orIsGPQA(𝑞))then 6: ←∪{𝑞} 7: Select all correct AIME/GPQA solutions 8:else ifIsGeminiCorrect( 𝑞)andIsMATH(𝑞)andThinkingLength( 𝑞) > 5600 then 9: ←∪{𝑞} 10: Select correct MATH500 solutions with long chains 11:end if 12:end for 13:←All available domains 14: Initialize domain pool 15:while||<1000do 16:𝑑←RandomChoice( ) 17: Randomly select a domain 18:𝑄𝑑←Questions in domain 𝑑 19: Get questions from this domain 20:ranks ←RankByThinkingLength( 𝑄𝑑) 21: Rank by thinking length 22:weights ←2−ranks 23: Apply power-law weighting 24:𝑞←WeightedSample( 𝑄𝑑, weights) 25: Sample favoring longer chains 26: ←∪{𝑞} 27: Add selected question 28:𝑄𝑑←𝑄𝑑⧵{𝑞} 29:if𝑄𝑑=∅then 30: ←⧵{𝑑} 31: Remove exhausted domains 32:end if 33:end while 20 Page 21: s1: Simple test-time scaling D. Training details We take a model that has already been pretrained and instruction tuned and further finetune it for reasoning. Specifically, we use Qwen2.5-32B-Instruct (Qwen et al., 2024), which on math tasks generally matches or outperforms the larger Qwen2.5- 72B-Instruct (Qwen et al., 2024) or other open models (Dubey et al., 2024; Groeneveld et al., 2024; Muennighoff et al., 2024). We use token delimiters to separate the thinking stage from the answering stage. We enclose the thinking stage with<|im_start|>think and<|im_start|>answer ;bothprecededandfollowedbyanewline. Samplesfromour datasetarein§D.2. Weusebasicfine-tuninghyperparameters: wetrainfor5epochswithabatchsizeof16foratotalof315 gradientsteps. Wetraininbfloat16precisionwithalearningrateof 1𝑒−5warmeduplinearlyfor5%(16steps)andthen decayed to 0 over the rest of training (299 steps) following a cosine schedule. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with 𝛽1=0.9,𝛽2=0.95and weight decayof 1𝑒−4. We donot compute loss onquestions, only on reasoning traces and solutions. We ensure the sequence length is large enough to avoid cutting off any samples; a setting we ablate in §D.1. The training takes just 26 minutes on 16 NVIDIA H100 GPUs. 0 100 200 300 Training Steps0.00.20.4LossTraining Loss 0 100 200 300 Training Steps0.00.51.0Learning Rate1e5 Learning Rate Schedule 0 100 200 300 Training Steps123Gradient NormGradient Norm Figure 9. Training dynamics of s1-32B on s1K. D.1. Training Ablations: Sequence length Table 8.Training sequence length ablation. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the fewer the thinking tokens (inference cost) the better. Model A Model B Training sequence length 4096 32768 % training samples cutoff 74% 0% AIME24 30.0% / 20721 50.0% / 6984 MATH500 90.0% / 5324 91.0% / 3268 GPQA 52.5% / 6841 53.0% / 3568 Besidesourscalingablationsin§5.2,themaintraininghyperparameterweablateisthesequencelengthusedduringtraining. We find that a shortertraining sequence length leads to longerreasoning traces at test time. This is because when training with a shorter sequence length the answer section of the training sample is more commonly cut off. Inversely, when the trainingsequencelengthislonger,moresamplesappearintheirentiretywiththesectionwherethemodelanswers. Thus themodelreceivesmoregradientupdateswhereitlearnstogenerateananswerfollowingitschain. Thisinturnleadstoa higher log probability of the answer section at any point during the generation and thus shorter reasoning traces at test time. Performance-wise,wealsofindthatthemodeltrainedwithalongersequencelengthperformsbetter. Thusweoptforthe longesttrainingsequencelengthasitleadstobetterperformanceandmakesinferencemoreefficientbyleadingtoshorter reasoning traces. D.2. Training Samples Table 9, Table 10, Table 11 contain training samples from s1K. 21 Page 22: s1: Simple test-time scaling Table 9.s1K sample stemming from s1-prob. The black text is the question and the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training. You are given integrable random variables 𝑋,𝑌0and𝑍0on the same probability space (Ω,,ℙ), and two𝜎-algebras ⊂,⊂.For𝑘=1,2,…,, let 𝑌𝑘∶=𝔼[𝑋|𝜎(,𝑍0,…,𝑍𝑘−1)], 𝑍𝑘∶=𝔼[𝑋|𝜎(,𝑌0,…,𝑌𝑘−1)]. (a) Show that there exist integrable random variables 𝑌∞and𝑍∞such that as 𝑛→∞both𝔼[|𝑌𝑛−𝑌∞|]→0and 𝔼[|𝑍𝑛−𝑍∞|]→0. (b) Prove that almost surely 𝑌∞=𝑍∞. 1.Introducethenotations, 𝑛∶=𝜎(,𝑍0,…,𝑍𝑛−1)and𝑛∶=𝜎(,𝑌0,…,𝑌𝑛−1),forall𝑛≥1and0∶=,0∶=. Clearly, both are filtrations and by definition, 𝑌𝑛=𝔼[𝑋|𝑛]and𝑍𝑛=𝔼[𝑋|𝑛], for all𝑛≥0. Thus{𝑌𝑛} 𝑛≥0(or{𝑍𝑛} 𝑛≥0)isaDoob’sMartingale withrespecttofiltration{𝑛} 𝑛≥0(or{𝑛} 𝑛≥0)andhenceisU.I., 𝑋beingintegrable. Now apply the fact that a Martingale is U.I. if and only if it converges in 𝐿1and conclude. 2. We provide two different proofs. 1. Observe the following identity which follows from Tower Property . 𝔼[𝑌𝑛|𝑛]=𝔼[𝔼[𝑋|𝑛]|𝑛]=𝔼[𝔼[𝑋|𝑛]|𝑛]=𝔼[𝑍𝑛|𝑛]. (4) Notethat 𝔼|𝔼[𝑌𝑛|𝑛]−𝔼[𝑌∞|𝑛]|≤𝔼|𝑌𝑛−𝑌∞|=𝑜(1)andbyLevy’sUpwardTheorem ,𝔼[𝑌∞|𝑛]𝐿1 ⟶𝔼[𝑌∞|∞] , where ∞=𝜎(∪𝑛≥0𝑛)=𝜎(,𝑌0,𝑌1…).It is evident that 𝑌∞∈𝑚∞(since it is almost sure limit of 𝑌𝑛∈𝑚𝑛) and hence 𝔼[𝑌∞|∞]=𝑌∞almost surely. Combining the above observations we conclude that 𝔼[𝑌𝑛|𝑛]𝐿1 ⟶𝑌∞. Similarly, 𝔼[𝑍𝑛|𝑛]𝐿1 ⟶𝑍∞.Now invoke Equation (4), to conclude that 𝑌∞=𝑍∞almost surely. 2.Since𝑋is integrable, we have by Levy’s Upward Theorem ,𝑌∞= lim𝑛→∞𝑌𝑛= lim𝑛→∞𝔼[𝑋|𝑛]=𝔼[𝑋|∞]. Similarly,𝑍∞=𝔼[𝑋|∞].On the otherhand, in first proof we have observed that 𝑌∞∈𝑚∞and hence𝑌∞∈ 𝑚∞∩𝑚∞=𝑚(∞∩∞). This yields the following. 𝑌∞=𝔼[𝑌∞|∞∩∞]=𝔼[𝔼[𝑋|∞]|∞∩∞]=𝔼[𝑋|∞∩∞]. Similarly,𝑍∞=𝔼[𝑋|∞∩∞].This proves that 𝑌∞=𝑍∞almost surely. Let’s analyze the problem. We are given two sequences of random variables 𝑌𝑘and𝑍𝑘defined by conditional expectations. We need to show that these sequences converge in 𝐿1and that their limits are equal almost surely. Part (a): Convergence in 𝐿1. The sigma-algebras involved are increasing: 𝑘=𝜎(,𝑍0,…,𝑍𝑘−1) 𝑘=𝜎(,𝑌0,…,𝑌𝑘−1) For the𝑌𝑘sequence: 𝑌1=𝔼[𝑋|𝜎()] 𝑌2=𝔼[𝑋|𝜎(,𝑍0)] 𝑌3=𝔼[𝑋|𝜎(,𝑍0,𝑍1)] ... Thesequenceofsigma-algebras 𝜎(,𝑍0,…,𝑍𝑘−1)isincreasing. BytheMartingaleConvergenceTheorem(forconditional expectations), since 𝑋is integrable, the sequence 𝑌𝑘converges almost surely and in 𝐿1to𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]. Let 𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]. 22 Page 23: s1: Simple test-time scaling Similarly, for the 𝑍𝑘sequence: 𝑍1=𝔼[𝑋|𝜎()] 𝑍2=𝔼[𝑋|𝜎(,𝑌0)] 𝑍3=𝔼[𝑋|𝜎(,𝑌0,𝑌1)] ... Thesequenceofsigma-algebras 𝜎(,𝑌0,…,𝑌𝑘−1)isincreasing. BytheMartingaleConvergenceTheorem,thesequence 𝑍𝑘converges almost surely and in 𝐿1to𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. Let𝑍∞=𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. Part (b): Prove that almost surely 𝑌∞=𝑍∞. We have𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]and𝑍∞=𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. Let∞=𝜎(,𝑍0,𝑍1,…)and∞=𝜎(,𝑌0,𝑌1,…). Then𝑌∞=𝔼[𝑋|∞]and𝑍∞=𝔼[𝑋|∞]. Consider𝑌1=𝔼[𝑋|].𝑍1=𝔼[𝑋|]. 𝑌2=𝔼[𝑋|𝜎(,𝑍0)]. Since𝑍0=𝔼[𝑋|],𝜎(,𝑍0)=𝜎(,𝔼[𝑋|]). 𝑍2=𝔼[𝑋|𝜎(,𝑌0)]. Since𝑌0is not defined, let’s assume 𝑌0and𝑍0are given. Let’s consider the sigma-algebra generated by all the 𝑌𝑖and𝑍𝑖. Let=𝜎(𝑌0,𝑍0,𝑌1,𝑍1,…). Both𝑌𝑘and𝑍𝑘are measurable with respect to . Thus,𝑌∞and𝑍∞are measurable with respect to . Consider 𝔼[𝑌𝑛+1|,𝑍0,…,𝑍𝑛−1]=𝔼[𝔼[𝑋|𝜎(,𝑍0,…,𝑍𝑛)]|,𝑍0,…,𝑍𝑛−1]=𝑌𝑛. Consider 𝔼[𝑍𝑛+1|,𝑌0,…,𝑌𝑛−1]=𝔼[𝔼[𝑋|𝜎(,𝑌0,…,𝑌𝑛)]|,𝑌0,…,𝑌𝑛−1]=𝑍𝑛. Let=𝜎(∪∪𝜎(𝑌0,𝑍0,𝑌1,𝑍1,…)). Consider 𝔼[𝑋|,𝑍0,…,𝑍𝑘−1]=𝑌𝑘. Consider 𝔼[𝑋|,𝑌0,…,𝑌𝑘−1]=𝑍𝑘. Let’s consider the limit sigma-algebras ∞=𝜎(,𝑍0,𝑍1,…)and∞=𝜎(,𝑌0,𝑌1,…). We know that 𝑌∞is∞-measurable and 𝑍∞is∞-measurable. Let=𝜎(∪). 𝑌1=𝔼[𝑋|].𝑍1=𝔼[𝑋|]. 𝑌2=𝔼[𝑋|𝜎(,𝑍0)]=𝔼[𝑋|𝜎(,𝔼[𝑋|])]. 𝑍2=𝔼[𝑋|𝜎(,𝑌0)]=𝔼[𝑋|𝜎(,𝔼[𝑋|])]. Consider the case when andare independent. Then 𝑌1=𝔼[𝑋|]and𝑍1=𝔼[𝑋|]. 𝑌2=𝔼[𝑋|𝜎(,𝔼[𝑋|])]. Let’s consider the sigma-algebra generated by all the information. Let ∞=𝜎(,,𝑌0,𝑍0,𝑌1,𝑍1,…). Then both𝑌∞and𝑍∞are∞-measurable. We have𝑌∞=lim𝑛→∞𝑌𝑛and𝑍∞=lim𝑛→∞𝑍𝑛a.s. and in𝐿1. Consider 𝔼[𝑋|,𝑍0,𝑍1,…]=𝑌∞and𝔼[𝑋|,𝑌0,𝑌1,…]=𝑍∞. Let∞=𝜎(∪). Consider 𝔼[𝑋|∞]. Let’s try to show that both 𝑌∞and𝑍∞are equal to 𝔼[𝑋|𝜎(∨)]. This seems too simple. Let’s consider the definition of conditional expectation. For any set 𝐴∈∞,𝔼[𝑌∞1𝐴]=𝔼[𝑋1𝐴]. For any set𝐵∈∞,𝔼[𝑍∞1𝐵]=𝔼[𝑋1𝐵]. Consider 𝔼[𝑌𝑘+1|𝑘]=𝑌𝑘where 𝑘=𝜎(,𝑍0,…,𝑍𝑘−1). Consider 𝔼[𝑍𝑘+1|𝑘]=𝑍𝑘where 𝑘=𝜎(,𝑌0,…,𝑌𝑘−1). Consider the sigma-algebra 𝜎(∪). Let𝑊=𝔼[𝑋|𝜎(∪)]. We have𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]. Since𝑍𝑘depends on𝑌𝑘−1which depends on 𝑍𝑘−2etc., it is complex. Let’s consider the case where 𝑌0=𝔼[𝑋|]and𝑍0=𝔼[𝑋|]. Then𝑌1=𝔼[𝑋|]=𝑌0. 𝑍1=𝔼[𝑋|]=𝑍0. If this were the case, then 𝑌𝑘=𝑌0and𝑍𝑘=𝑍0, so𝑌∞=𝑌0and𝑍∞=𝑍0. We need𝑌0=𝑍0a.s., i.e., 𝔼[𝑋|]=𝔼[𝑋|]a.s. This is not generally true. Consider the sigma-algebra generated by all 𝑌𝑖and𝑍𝑖, denoted by ∞=𝜎(𝑌0,𝑍0,𝑌1,𝑍1,…). Then∞=𝜎(,𝑍0,𝑍1,…)⊆𝜎(,∞). ∞=𝜎(,𝑌0,𝑌1,…)⊆𝜎(,∞). 23 Page 24: s1: Simple test-time scaling Consider 𝔼[𝑋|,𝑍0,𝑍1,…]=𝑌∞. This means for any 𝐴∈𝜎(,𝑍0,𝑍1,…),𝔼[𝑋1𝐴]=𝔼[𝑌∞1𝐴]. Consider 𝔼[𝑋|,𝑌0,𝑌1,…]=𝑍∞. This means for any 𝐵∈𝜎(,𝑌0,𝑌1,…),𝔼[𝑋1𝐵]=𝔼[𝑍∞1𝐵]. Let=𝜎(,). Consider 𝔼[𝑋|]. Consider 𝔼[𝑌∞|,𝑌0,𝑌1,…]= 𝔼[𝔼[𝑋|,𝑍0,𝑍1,…]|,𝑌0,𝑌1,…]. Let’s use the fact that 𝑌∞is𝜎(,𝑍0,𝑍1,…)-measurable and 𝑍∞is𝜎(,𝑌0,𝑌1,…)-measurable. Consider the smallest sigma-algebra containing both 𝜎(,𝑍0,𝑍1,…)and𝜎(,𝑌0,𝑌1,…). This is𝜎(∪∪{𝑌𝑖}𝑖≥0∪{𝑍𝑖}𝑖≥0). Let’s try to show that 𝔼[(𝑌∞−𝑍∞)2]=0. 𝔼[(𝑌∞−𝑍∞)2]=𝔼[𝑌2 ∞]+𝔼[𝑍2 ∞]−2𝔼[𝑌∞𝑍∞]. Consider the definition of 𝑌𝑘and𝑍𝑘. 𝔼[𝑌𝑘+1|,𝑍0,…,𝑍𝑘−1]=𝑌𝑘. 𝔼[𝑍𝑘+1|,𝑌0,…,𝑌𝑘−1]=𝑍𝑘. Consider 𝔼[𝑌∞|,𝑌0,𝑌1,…]. Since𝑌∞is measurable with respect to 𝜎(,𝑍0,𝑍1,…), this does not simplify easily. Let’s assume 𝑌∞≠𝑍∞on a set of positive measure. Consider the case where andare trivial. Then𝑌𝑘=𝔼[𝑋|𝜎(𝑍0,…,𝑍𝑘−1)]and𝑍𝑘=𝔼[𝑋|𝜎(𝑌0,…,𝑌𝑘−1)]. Let∞=𝜎(∪∪{𝑌𝑖,𝑍𝑖}𝑖≥0). Consider 𝔼[𝑋|∞]. We have𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]. By the properties of conditional expectation, 𝔼[𝑌∞|]=𝔼[𝔼[𝑋|,𝑍0,𝑍1,…]|]. Consider the sigma-algebra generated by all the random variables involved. =𝜎(∪∪{𝑌𝑖}𝑖≥0∪{𝑍𝑖}𝑖≥0). Let𝑀∞=𝔼[𝑋|∨]. Consider the limits. As 𝑛→∞,𝜎(,𝑍0,…,𝑍𝑛−1)↑𝜎(,𝑍0,𝑍1,…). As𝑛→∞,𝜎(,𝑌0,…,𝑌𝑛−1)↑𝜎(,𝑌0,𝑌1,…). Let∞=𝜎(,𝑍0,𝑍1,…)and∞=𝜎(,𝑌0,𝑌1,…). We have𝑌∞=𝔼[𝑋|∞]and𝑍∞=𝔼[𝑋|∞]. Consider 𝔼[𝑌∞−𝑍∞]2=𝔼[(𝑌∞−𝑋)+(𝑋−𝑍∞)]2 =𝔼[(𝑌∞−𝑋)2]+𝔼[(𝑋−𝑍∞)2]+2𝔼[(𝑌∞−𝑋)(𝑋−𝑍∞)] =𝔼[(𝑋−𝔼[𝑋|∞])2]+𝔼[(𝑋−𝔼[𝑋|∞])2]+2𝔼[𝑌∞𝑋−𝑌∞𝑍∞−𝑋2+𝑋𝑍∞] =𝔼[𝑋2]−𝔼[𝑌2 ∞]+𝔼[𝑋2]−𝔼[𝑍2 ∞]+2𝔼[𝑌∞𝑋]−2𝔼[𝑌∞𝑍∞]−2𝔼[𝑋2]+2𝔼[𝑋𝑍∞] =−𝔼[𝑌2 ∞]−𝔼[𝑍2 ∞]+2𝔼[𝑌2 ∞]−2𝔼[𝑌∞𝑍∞]+2𝔼[𝑍2 ∞] =𝔼[𝑌2 ∞]+𝔼[𝑍2 ∞]−2𝔼[𝑌∞𝑍∞] Consider the sigma-algebra =𝜎(∪). Let𝑊=𝔼[𝑋|]. Consider the limit sigma-algebra generated by all the information. Let∞=𝜎(,,𝑌0,𝑍0,𝑌1,𝑍1,…). Then∞⊆∞and∞⊆∞. We have 𝔼[𝑋|∞]=𝑌∞and𝔼[𝑋|∞]=𝑍∞. Consider 𝔼[𝑌∞|∞]and𝔼[𝑍∞|∞]. Let=𝜎(∪). Consider 𝔼[𝑋|]. Consider the sequence of pairs (𝑌𝑘,𝑍𝑘). If the limits are the same, then for any bounded measurable function 𝑓,𝔼[𝑓(𝑌∞)]=𝔼[𝑓(𝑍∞)]. Consider the sigma-algebra =𝜎(∪∪𝜎(𝑌0,𝑍0,…)). Then both𝑌∞and𝑍∞are measurable with respect to . Consider 𝔼[(𝑌𝑛−𝑍𝑛)2]. Let=𝜎(∪). Consider𝑊=𝔼[𝑋|]. Then𝔼[𝑊|]=𝔼[𝑋|]and𝔼[𝑊|]=𝔼[𝑋|]. 24 Page 25: s1: Simple test-time scaling Consider the limit sigma-algebra =𝜎(∪∪𝜎(𝑌0,𝑍0,…)). We have𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]and𝑍∞=𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. Let=𝜎(,𝑍0,𝑍1,…)∨𝜎(,𝑌0,𝑌1,…). Consider 𝔼[𝑋|]. Consider the random variable 𝑋. Then𝑌∞=lim𝑛𝔼[𝑋|𝑛]and𝑍∞=lim𝑛𝔼[𝑋|𝑛]. Let∞=𝜎(∪∪𝜎(𝑌0,𝑍0,…)). Consider𝐸={𝜔∶𝑌∞(𝜔)≠𝑍∞(𝜔)}. Suppose𝑃(𝐸)>0. Consider the conditional expectation with respect to the join of the sigma-algebras. Let∞=𝜎(,,𝑌0,𝑍0,…). Consider 𝔼[𝑋|∞]. Then𝔼[𝔼[𝑋|∞]|,𝑍0,…]= 𝔼[𝑋|,𝑍0,…]=𝑌𝑛+1. And𝔼[𝔼[𝑋|∞]|,𝑌0,…]= 𝔼[𝑋|,𝑌0,…]=𝑍𝑛+1. So𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]and𝑍∞=𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. Let∞=𝜎(,𝑍0,𝑍1,…)and∞=𝜎(,𝑌0,𝑌1,…). Consider 𝔼[𝑋|∞∨∞]. Then𝔼[𝔼[𝑋|∞∨∞]|∞]=𝔼[𝑋|∞]=𝑌∞. And𝔼[𝔼[𝑋|∞∨∞]|∞]=𝔼[𝑋|∞]=𝑍∞. This means𝑌∞=𝔼[𝑋|∞∨∞]a.s. and𝑍∞=𝔼[𝑋|∞∨∞]a.s. Therefore,𝑌∞=𝑍∞a.s. Solution: (a)Forthesequence 𝑌𝑘,the𝜎-algebras 𝑘=𝜎(,𝑍0,…,𝑍𝑘−1)areincreasingas 𝑘increases. Since 𝑋isintegrable,bythe Martingale Convergence Theorem for conditional expectations, the sequence 𝑌𝑘=𝔼[𝑋|𝑘]converges almost surely and in 𝐿1to𝑌∞=𝔼[𝑋|𝜎(,𝑍0,𝑍1,…)]. Similarly, for the sequence 𝑍𝑘, the𝜎-algebras 𝑘=𝜎(,𝑌0,…,𝑌𝑘−1)are increasing. Since𝑋isintegrable,bytheMartingaleConvergenceTheorem,thesequence 𝑍𝑘=𝔼[𝑋|𝑘]convergesalmostsurelyandin 𝐿1to𝑍∞=𝔼[𝑋|𝜎(,𝑌0,𝑌1,…)]. (b) Let ∞=𝜎(,𝑍0,𝑍1,…)and∞=𝜎(,𝑌0,𝑌1,…). We have𝑌∞=𝔼[𝑋|∞]and𝑍∞=𝔼[𝑋|∞]. Consider the 𝜎-algebra =∞∨∞=𝜎(,,𝑌0,𝑍0,𝑌1,𝑍1,…). Consider𝑊=𝔼[𝑋|]. We have ∞⊆, so𝔼[𝑊|∞]=𝔼[𝔼[𝑋|]|∞]=𝔼[𝑋|∞]=𝑌∞. Since𝑊is-measurable, 𝔼[𝑊|∞]=𝑌∞impliesthat𝑌∞istheconditionalexpectationof 𝑊given∞. Butsince𝑌∞is already ∞-measurable, we have 𝔼[𝑊|∞]=𝑊if𝑊is∞-measurable, which is not necessarily true. However, we know that 𝔼[𝑌∞|∞]=𝑌∞and𝔼[𝑍∞|∞]=𝑍∞. Consider 𝔼[𝑋|∞∨∞]. Then𝔼[𝔼[𝑋|∞∨∞]|∞]=𝔼[𝑋|∞]=𝑌∞. This implies that 𝔼[𝑋|∞∨∞]=𝑌∞almost surely. Similarly, 𝔼[𝔼[𝑋|∞∨∞]|∞]=𝔼[𝑋|∞]=𝑍∞. This implies that 𝔼[𝑋|∞∨∞]=𝑍∞almost surely. Therefore,𝑌∞=𝑍∞almost surely. Final Answer: The final answer is 𝑌∞=𝑍∞ 25 Page 26: s1: Simple test-time scaling Table 10. s1K sample stemming from s1-teasers. The black text is the question and the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training. Alice and Bob each have a coin and flip it until they get a heads. If Bob flipped his coin more times than Alice, what is the expected number of times Alice flipped her coin? 4/3 The problem asks for the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than Alice. Let𝑋be the number of flips Alice makes until she gets a head. Let𝑌be the number of flips Bob makes until he gets a head. 𝑋and𝑌areindependentrandomvariablesfollowingageometricdistributionwithparameter 𝑝=0.5(probabilityofgetting a head). The probability mass function for a geometric distribution is 𝑃(𝐾=𝑘)=(1−𝑝)𝑘−1𝑝, for𝑘=1,2,…. In this case, 𝑝=0.5, so𝑃(𝐾=𝑘)=(0.5)𝑘−1(0.5)=(0.5)𝑘=1 2𝑘. We are given the condition that Bob flipped his coin more times than Alice, which means 𝑌 >𝑋. We want to find the expected value of 𝑋given that𝑌 >𝑋, i.e.,𝐸[𝑋|𝑌 >𝑋]. The formula for conditional expectation is 𝐸[𝑋|𝐴]=∑ 𝑥𝑥𝑃(𝑋=𝑥|𝐴). Also,𝑃(𝑋=𝑥|𝐴)=𝑃(𝑋=𝑥∩𝐴) 𝑃(𝐴). In our case,𝐴is the event𝑌 >𝑋. We need to find 𝑃(𝑌 >𝑋). 𝑃(𝑌 >𝑋)=∑∞ 𝑥=1∑∞ 𝑦=𝑥+1𝑃(𝑋=𝑥,𝑌=𝑦) Since𝑋and𝑌are independent, 𝑃(𝑋=𝑥,𝑌=𝑦)=𝑃(𝑋=𝑥)𝑃(𝑌=𝑦)=1 2𝑥1 2𝑦=1 2𝑥+𝑦. 𝑃(𝑌 >𝑋)=∑∞ 𝑥=1∑∞ 𝑦=𝑥+11 2𝑥+𝑦=∑∞ 𝑥=11 2𝑥∑∞ 𝑦=𝑥+11 2𝑦 The inner sum is a geometric series:∑∞ 𝑦=𝑥+1(1 2)𝑦=(1 2)𝑥+1+(1 2)𝑥+2+… This is a geometric series with first term 𝑎=(1 2)𝑥+1and common ratio 𝑟=1 2. The sum is𝑎 1−𝑟=(1 2)𝑥+1 1−1 2=(1 2)𝑥+1 1 2=(1 2)𝑥. So,𝑃(𝑌 >𝑋)=∑∞ 𝑥=11 2𝑥(1 2)𝑥=∑∞ 𝑥=1(1 4)𝑥. This is another geometric series with first term 𝑎=1 4and common ratio 𝑟=1 4. The sum is𝑎 1−𝑟=1 4 1−1 4=1 4 3 4=1 3. Now, we need to find 𝑃(𝑋=𝑥∩𝑌 >𝑋)=𝑃(𝑋=𝑥,𝑌 >𝑥). 𝑃(𝑋=𝑥,𝑌 >𝑥)=𝑃(𝑋=𝑥)𝑃(𝑌 >𝑥), due to independence. 𝑃(𝑌 >𝑥)=∑∞ 𝑦=𝑥+1𝑃(𝑌=𝑦)=∑∞ 𝑦=𝑥+11 2𝑦=(1 2)𝑥+1+(1 2)𝑥+2+⋯=(1 2)𝑥. So,𝑃(𝑋=𝑥,𝑌 >𝑥)=1 2𝑥×(1 2)𝑥=(1 4)𝑥. Now, we can find 𝑃(𝑋=𝑥|𝑌 >𝑋)=𝑃(𝑋=𝑥∩𝑌>𝑋) 𝑃(𝑌>𝑋)=(1 4)𝑥 1 3=3(1 4)𝑥. Finally, we can calculate the conditional expectation: 𝐸[𝑋|𝑌 >𝑋]=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥|𝑌 >𝑋)=∑∞ 𝑥=1𝑥⋅3(1 4)𝑥=3∑∞ 𝑥=1𝑥(1 4)𝑥. Consider the series 𝑆=∑∞ 𝑥=1𝑥𝑟𝑥, where|𝑟|<1. 𝑆=𝑟+2𝑟2+3𝑟3+4𝑟4+… 𝑆 𝑟=1+2𝑟+3𝑟2+4𝑟3+… Consider the geometric series 𝐺=1+𝑟+𝑟2+𝑟3+⋯=1 1−𝑟. Differentiating with respect to 𝑟: 𝑑𝐺 𝑑𝑟=1+2𝑟+3𝑟2+4𝑟3+⋯=𝑑 𝑑𝑟(1 1−𝑟)=1 (1−𝑟)2. So,𝑆 𝑟=1 (1−𝑟)2, which means 𝑆=𝑟 (1−𝑟)2. In our case,𝑟=1 4. 26 Page 27: s1: Simple test-time scaling ∑∞ 𝑥=1𝑥(1 4)𝑥=1 4 (1−1 4)2=1 4 (3 4)2=1 4 9 16=1 4×16 9=4 9. Therefore,𝐸[𝑋|𝑌 >𝑋]=3×4 9=12 9=4 3. Alternative approach: Symmetry. Consider the events 𝑌 >𝑋,𝑋 >𝑌, and𝑋=𝑌. 𝑃(𝑌 >𝑋)=𝑃(𝑋 >𝑌)due to symmetry. 𝑃(𝑋=𝑌)=∑∞ 𝑘=1𝑃(𝑋=𝑘,𝑌=𝑘)=∑∞ 𝑘=1𝑃(𝑋=𝑘)𝑃(𝑌=𝑘)=∑∞ 𝑘=1(1 2𝑘)2=∑∞ 𝑘=1(1 4)𝑘=1 4 1−1 4=1 4 3 4=1 3. We have𝑃(𝑌 >𝑋)+𝑃(𝑋 >𝑌)+𝑃(𝑋=𝑌)=1. 𝑃(𝑌 >𝑋)+𝑃(𝑌 >𝑋)+1 3=1. 2𝑃(𝑌 >𝑋)=1−1 3=2 3. 𝑃(𝑌 >𝑋)=1 3. This confirms our previous calculation. Let𝐸[𝑋|𝑌 >𝑋]=𝜇. We are looking for the expected number of flips Alice made, given Bob flipped more times. Consider the case where the first flip for both is tails. Then the process restarts, and the condition remains. Let𝐴1be Alice’s first flip, 𝐵1be Bob’s first flip. 𝑃(𝐴1=𝐻)=1∕2,𝑃(𝐴1=𝑇)=1∕2. 𝑃(𝐵1=𝐻)=1∕2,𝑃(𝐵1=𝑇)=1∕2. Case 1:𝐴1=𝐻,𝐵1=𝐻.𝑋=1,𝑌=1, so𝑌 >𝑋is false. Case 2:𝐴1=𝐻,𝐵1=𝑇. Alice stops at 1, Bob continues. 𝑌 >1. Case 3:𝐴1=𝑇,𝐵1=𝐻. Bob stops at 1, Alice continues. Case 4:𝐴1=𝑇,𝐵1=𝑇. Both continue. Consider the conditional probabilities given 𝑌 >𝑋. Let𝐸=𝐸[𝑋|𝑌 >𝑋]. If the first flips are HH, the condition 𝑌 >𝑋is not met. If the first flips are HT, Alice’s number of flips is 1. If the first flips are TH, Bob’s number of flips is 1, Alice’s is at least 2, so 𝑌 <𝑋. If the first flips are TT, the problem restarts, and the expected number of additional flips for Alice is still 𝐸. Let𝑁𝐴be the number of flips Alice makes, 𝑁𝐵be the number of flips Bob makes. We are given 𝑁𝐵>𝑁𝐴. Consider the first flips: HH (prob 1/4): Condition not met. HT (prob 1/4): Alice flips 1 time. TH (prob 1/4): Bob flips 1 time, Alice > 1, so 𝑁𝐴>𝑁𝐵, condition not met. TT (prob 1/4): Both continue. Let𝐸be the expected number of Alice’s flips given Bob flipped more. If the first flips are HT, Alice flipped 1 time, and the condition is met. If the first flips are TT, the situation is the same as the beginning, so the expected number of additional flips for Alice is 𝐸. Consider the event 𝑌 >𝑋. Conditioning on the first flips: 𝐸[𝑋|𝑌 >𝑋]=𝐸[𝑋|𝑌 >𝑋,𝐻𝐻]𝑃(𝐻𝐻|𝑌 >𝑋)+𝐸[𝑋|𝑌 >𝑋,𝐻𝑇]𝑃(𝐻𝑇|𝑌 >𝑋)+𝐸[𝑋|𝑌 >𝑋,𝑇𝐻]𝑃(𝑇𝐻|𝑌 > 𝑋)+𝐸[𝑋|𝑌 >𝑋,𝑇𝑇]𝑃(𝑇𝑇|𝑌 >𝑋) 𝑃(𝑌 >𝑋|𝐻𝐻)=0. 𝑃(𝑌 >𝑋|𝐻𝑇)=1. 𝑃(𝑌 >𝑋|𝑇𝐻)=0. 𝑃(𝑌 >𝑋|𝑇𝑇)depends on the future flips. Let’s use the definition of conditional expectation directly. 𝐸[𝑋|𝑌 >𝑋]=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥|𝑌 >𝑋)=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥,𝑌>𝑥) 𝑃(𝑌>𝑋) Consider the probabilities of the first flips given 𝑌 >𝑋. 𝑃(𝐻𝐻|𝑌 >𝑋)=𝑃(𝑌>𝑋|𝐻𝐻)𝑃(𝐻𝐻) 𝑃(𝑌>𝑋)=0⋅1∕4 1∕3=0. 𝑃(𝐻𝑇|𝑌 >𝑋)=𝑃(𝑌>𝑋|𝐻𝑇)𝑃(𝐻𝑇) 𝑃(𝑌>𝑋)=1⋅1∕4 1∕3=3 4. 27 Page 28: s1: Simple test-time scaling 𝑃(𝑇𝐻|𝑌 >𝑋)=𝑃(𝑌>𝑋|𝑇𝐻)𝑃(𝑇𝐻) 𝑃(𝑌>𝑋)=0⋅1∕4 1∕3=0. 𝑃(𝑇𝑇|𝑌 >𝑋)=𝑃(𝑌>𝑋|𝑇𝑇)𝑃(𝑇𝑇) 𝑃(𝑌>𝑋) If the first flips are TT, the future behavior is the same, and the condition 𝑌 >𝑋holds if Bob takes more flips than Alice in the subsequent trials. 𝑃(𝑌 >𝑋|𝑇𝑇)=𝑃(Bob needs more flips than Alice starting from scratch )=𝑃(𝑌 >𝑋)=1∕3. 𝑃(𝑇𝑇|𝑌 >𝑋)=(1∕3)⋅(1∕4) 1∕3=1 4. Sum of conditional probabilities: 0+3∕4+0+1∕4=1 . Now consider the conditional expectations of 𝑋: 𝐸[𝑋|𝑌 >𝑋,𝐻𝐻]: Not relevant since 𝑃(𝐻𝐻|𝑌 >𝑋)=0. 𝐸[𝑋|𝑌 >𝑋,𝐻𝑇]: Given HT and 𝑌 >𝑋, Alice flipped 1 time. 𝐸=1. 𝐸[𝑋|𝑌 >𝑋,𝑇𝐻]: Not relevant. 𝐸[𝑋|𝑌 >𝑋,𝑇𝑇]: Given TT and 𝑌 >𝑋, Alice has already flipped once, and the expected number of additional flips is 𝐸. So, the total number of flips is 1+𝐸. 𝐸=0⋅0+1⋅(3∕4)+undefined⋅0+(1+𝐸)⋅(1∕4) 𝐸=0+3∕4+0+1∕4+ 𝐸∕4 𝐸=1+𝐸∕4 𝐸−𝐸∕4=1 3𝐸∕4=1 𝐸=4∕3. This confirms the previous result. Let’sthinkabouttheintuition. Ifwehavethecondition 𝑌 >𝑋, itmeansAlicewas"lucky"andgotaheadearly, orBobwas "unlucky". Consider the random variable 𝑍=𝑌−𝑋. We are given 𝑍 >0. We want𝐸[𝑋|𝑍 >0]. Let𝑝=𝑃(head)=0.5. The number of tails before the first head follows a geometric distribution with parameter 𝑝. 𝑋−1∼𝐺𝑒𝑜𝑚(𝑝),𝑌−1∼𝐺𝑒𝑜𝑚(𝑝). 𝑃(𝑋−1=𝑘)=(1−𝑝)𝑘𝑝. 𝑃(𝑋=𝑥)=(1−𝑝)𝑥−1𝑝. Consider the joint distribution of 𝑋and𝑌. 𝑃(𝑋=𝑥,𝑌=𝑦)=𝑝2(1−𝑝)𝑥−1(1−𝑝)𝑦−1 Consider the event 𝑌=𝑋+𝑘, where𝑘≥1. 𝑃(𝑌=𝑋+𝑘)=∑∞ 𝑥=1𝑃(𝑋=𝑥,𝑌=𝑥+𝑘)=∑∞ 𝑥=1𝑝2(1−𝑝)𝑥−1(1−𝑝)𝑥+𝑘−1 =𝑝2(1−𝑝)𝑘−1∑∞ 𝑥=1((1−𝑝)2)𝑥−1=𝑝2(1−𝑝)𝑘−11 1−(1−𝑝)2 =𝑝2(1−𝑝)𝑘−11 1−(1−2𝑝+𝑝2)=𝑝2(1−𝑝)𝑘−11 2𝑝−𝑝2=𝑝2(1−𝑝)𝑘−11 𝑝(2−𝑝) =𝑝(1−𝑝)𝑘−1 2−𝑝. If𝑝=0.5, then1−𝑝=0.5,2−𝑝=1.5. 𝑃(𝑌=𝑋+𝑘)=0.5(0.5)𝑘−1 1.5=(0.5)𝑘 1.5=(1∕2)𝑘 3∕2=1 2𝑘⋅2 3=1 3⋅2𝑘−1. Check if∑∞ 𝑘=1𝑃(𝑌=𝑋+𝑘)=𝑃(𝑌 >𝑋)=1∕3.∑∞ 𝑘=11 3⋅2𝑘−1=1 3∑∞ 𝑘=1(1 2)𝑘−1=1 3(1+1∕2+1∕4+…)=1 31 1−1∕2=1 3⋅2=2 3. Something is wrong here. Let’s recalculate 𝑃(𝑌 >𝑋). 𝑃(𝑌 >𝑋)=∑∞ 𝑥=1∑∞ 𝑦=𝑥+1(1−𝑝)𝑥−1𝑝(1−𝑝)𝑦−1𝑝 =𝑝2∑∞ 𝑥=1(1−𝑝)𝑥−1∑∞ 𝑦=𝑥+1(1−𝑝)𝑦−1 Inner sum:(1−𝑝)𝑥+(1−𝑝)𝑥+1+⋯=(1−𝑝)𝑥 1−(1−𝑝)=(1−𝑝)𝑥 𝑝. 𝑃(𝑌 >𝑋)=𝑝2∑∞ 𝑥=1(1−𝑝)𝑥−1(1−𝑝)𝑥 𝑝=𝑝(1−𝑝)∑∞ 𝑥=1(1−𝑝)2(𝑥−1) =𝑝(1−𝑝)∑∞ 𝑥=1((1−𝑝)2)𝑥−1=𝑝(1−𝑝)1 1−(1−𝑝)2 =𝑝(1−𝑝)1 1−(1−2𝑝+𝑝2)=𝑝(1−𝑝)1 2𝑝−𝑝2=𝑝(1−𝑝)1 𝑝(2−𝑝)=1−𝑝 2−𝑝. If𝑝=0.5,𝑃(𝑌 >𝑋)=0.5 1.5=1 3. Correct. 28 Page 29: s1: Simple test-time scaling Now, let’s find 𝐸[𝑋|𝑌 >𝑋]. 𝐸[𝑋|𝑌 >𝑋]=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥|𝑌 >𝑋)=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥,𝑌>𝑥) 𝑃(𝑌>𝑋). 𝑃(𝑋=𝑥,𝑌 >𝑥)=𝑃(𝑋=𝑥)𝑃(𝑌 >𝑥)(independence) 𝑃(𝑋=𝑥)=(1−𝑝)𝑥−1𝑝. 𝑃(𝑌 >𝑥)=∑∞ 𝑦=𝑥+1(1−𝑝)𝑦−1𝑝=𝑝[(1−𝑝)𝑥+(1−𝑝)𝑥+1+…]=𝑝(1−𝑝)𝑥 1−(1−𝑝)=(1−𝑝)𝑥. 𝑃(𝑋=𝑥,𝑌 >𝑥)=(1−𝑝)𝑥−1𝑝(1−𝑝)𝑥=𝑝(1−𝑝)2𝑥−1. 𝐸[𝑋|𝑌 >𝑋]=1 𝑃(𝑌>𝑋)∑∞ 𝑥=1𝑥𝑝(1−𝑝)2𝑥−1 =2−𝑝 1−𝑝𝑝(1−𝑝)−1∑∞ 𝑥=1𝑥((1−𝑝)2)𝑥(error in power) 𝑃(𝑌 >𝑥)=(1−𝑝)𝑥 1−(1−𝑝)𝑝=(1−𝑝)𝑥. 𝐸[𝑋|𝑌 >𝑋]=1 𝑃(𝑌>𝑋)∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥)𝑃(𝑌 >𝑥) =1 𝑃(𝑌>𝑋)∑∞ 𝑥=1𝑥(1−𝑝)𝑥−1𝑝(1−𝑝)𝑥 =𝑝 𝑃(𝑌>𝑋)∑∞ 𝑥=1𝑥(1−𝑝)2𝑥−1 =𝑝 (1−𝑝)∕(2−𝑝)(1−𝑝)−1∑∞ 𝑥=1𝑥((1−𝑝)2)𝑥 =𝑝2−𝑝 (1−𝑝)2∑∞ 𝑥=1𝑥((1−𝑝)2)𝑥. Let𝑟=(1−𝑝)2.∑∞ 𝑥=1𝑥𝑟𝑥=𝑟 (1−𝑟)2. 1−𝑟=1−(1−𝑝)2=1−(1−2 𝑝+𝑝2)=2𝑝−𝑝2=𝑝(2−𝑝). (1−𝑟)2=𝑝2(2−𝑝)2.∑∞ 𝑥=1𝑥((1−𝑝)2)𝑥=(1−𝑝)2 𝑝2(2−𝑝)2. 𝐸[𝑋|𝑌 >𝑋]=𝑝2−𝑝 (1−𝑝)2(1−𝑝)2 𝑝2(2−𝑝)2=1 𝑝(2−𝑝). If𝑝=0.5,𝐸[𝑋|𝑌 >𝑋]=1 0.5×1.5=1 0.75=1 3∕4=4∕3. Solution: Let𝑋be the number of flips Alice makes until she gets a head, and 𝑌be the number of flips Bob makes until he gets a head.𝑋and𝑌are independent geometric random variables with parameter 𝑝= 0.5. The probability mass function is 𝑃(𝐾=𝑘)=(1−𝑝)𝑘−1𝑝. WeareinterestedintheexpectednumberoftimesAliceflippedhercoin,giventhatBobflippedhiscoinmoretimesthan Alice, i.e.,𝐸[𝑋|𝑌 >𝑋]. Using the definition of conditional expectation, 𝐸[𝑋|𝑌 >𝑋]=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥|𝑌 >𝑋). 𝑃(𝑋=𝑥|𝑌 >𝑋)=𝑃(𝑋=𝑥,𝑌>𝑥) 𝑃(𝑌>𝑋). First, calculate 𝑃(𝑌 >𝑋). 𝑃(𝑌 >𝑋)=∑∞ 𝑥=1∑∞ 𝑦=𝑥+1𝑃(𝑋=𝑥,𝑌=𝑦)=∑∞ 𝑥=1𝑃(𝑋=𝑥)𝑃(𝑌 >𝑥). 𝑃(𝑋=𝑥)=(0.5)𝑥−1(0.5)=(0.5)𝑥=1 2𝑥. 𝑃(𝑌 >𝑥)=∑∞ 𝑦=𝑥+1(0.5)𝑦−1(0.5)=(0.5)𝑥. 𝑃(𝑌 >𝑋)=∑∞ 𝑥=11 2𝑥(1 2)𝑥=∑∞ 𝑥=1(1 4)𝑥=1∕4 1−1∕4=1∕4 3∕4=1 3. Next, calculate 𝑃(𝑋=𝑥,𝑌 >𝑥)=𝑃(𝑋=𝑥)𝑃(𝑌 >𝑥)=1 2𝑥(1 2)𝑥=(1 4)𝑥. Now, find𝑃(𝑋=𝑥|𝑌 >𝑋)=𝑃(𝑋=𝑥,𝑌>𝑥) 𝑃(𝑌>𝑋)=(1∕4)𝑥 1∕3=3(1 4)𝑥. Finally, calculate the conditional expectation: 𝐸[𝑋|𝑌 >𝑋]=∑∞ 𝑥=1𝑥𝑃(𝑋=𝑥|𝑌 >𝑋)=∑∞ 𝑥=1𝑥⋅3(1 4)𝑥=3∑∞ 𝑥=1𝑥(1 4)𝑥. Using the formula∑∞ 𝑥=1𝑥𝑟𝑥=𝑟 (1−𝑟)2with𝑟=1∕4: ∑∞ 𝑥=1𝑥(1 4)𝑥=1∕4 (1−1∕4)2=1∕4 (3∕4)2=1∕4 9∕16=1 4×16 9=4 9. 𝐸[𝑋|𝑌 >𝑋]=3×4 9=12 9=4 3. Final Answer: The final answer is 4∕3 29 Page 30: s1: Simple test-time scaling Table11. s1KsamplestemmingfromtheastronomysubsetofOlympicArena. Theblacktextisthequestionandthebluetextisthe solution. BasedsolelyonthequestionwegeneratethelightbluereasoningtraceandblueresponsewithGemini,whichweusefortraining. An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but itproducedthesameamountofpowerperunitvolumeasahuman, whatwoulditssurfacetemperaturebe? Assumethe "average human" produces 100 watts of power and has a volume of 66400 cubic centimeters. A:3500K B:10000K C:25000K D:40000K E:50000K Using the numbers from the problem, the average human produces 𝑢=100 66400×10−6=1506W∕m3 Thevolumeofthesunis4 3𝜋𝑅3 ⊙,soitsnewpoweroutputwouldbe 𝑃=4 3𝜋𝑅3 ⊙𝑢. Tofindtheequilibriumtemperature,weuse 𝑃=𝜎𝐴𝑇4=4𝜋𝜎𝑅2 ⊙𝑇4 Solving for𝑇, we get 𝑇=4√ 𝑅⊙𝑢 3𝜎=4√√√√(6.96×108)(1506) 3(5.67×10−8)=49823≈50000K . The problem asks for the surface temperature of the sun if it had the same size but produced power per unit volume as a human. First, let’s find the power density of a human. Power produced by a human, 𝑃ℎ𝑢𝑚𝑎𝑛=100W Volume of a human, 𝑉ℎ𝑢𝑚𝑎𝑛=66400cm3 Convert the volume to cubic meters: 1m=100cm 1m3=(100cm)3=106cm3 𝑉ℎ𝑢𝑚𝑎𝑛=66400cm3=66400×10−6m3=0.0664m3 Power density of a human, 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=𝑃ℎ𝑢𝑚𝑎𝑛 𝑉ℎ𝑢𝑚𝑎𝑛=100W 0.0664m3 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=1506.024W∕m3 Now, consider the sun with the same size as it currently is, but with a power density equal to that of a human. We need the radius of the sun to calculate its volume. The average radius of the sun is 𝑅𝑠𝑢𝑛=6.95×108m. Volume of the sun, 𝑉𝑠𝑢𝑛=4 3𝜋𝑅3 𝑠𝑢𝑛=4 3𝜋(6.95×108m)3 𝑉𝑠𝑢𝑛=4 3𝜋(337.159375×1024)m3 𝑉𝑠𝑢𝑛=1410.24×1024×𝜋m3 𝑉𝑠𝑢𝑛=4.4307×1027m3 If the sun had the same power density as a human, the total power produced by this modified sun would be: 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛×𝑉𝑠𝑢𝑛 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=1506.024W∕m3×4.4307×1027m3 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=6.6726×1030W 30 Page 31: s1: Simple test-time scaling The surface temperature of a star is related to its luminosity (total power output) through the Stefan-Boltzmann law: 𝐿=4𝜋𝑅2𝜎𝑇4 where𝐿is the luminosity (total power output), 𝑅is the radius of the star, 𝜎is the Stefan-Boltzmann constant, and 𝑇is the surface temperature. In this case, the luminosity of the modified sun is 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛. The radius of the modified sun is the same as the actual sun, 𝑅𝑠𝑢𝑛. The Stefan-Boltzmann constant, 𝜎=5.67×10−8Wm−2K−4. 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=4𝜋𝑅2 𝑠𝑢𝑛𝜎𝑇4 6.6726×1030W=4𝜋(6.95×108m)2×5.67×10−8Wm−2K−4×𝑇4 Calculate the term 4𝜋𝑅2 𝑠𝑢𝑛𝜎: 𝑅2 𝑠𝑢𝑛=(6.95×108)2=48.3025×1016m2 4𝜋𝑅2 𝑠𝑢𝑛=4𝜋×48.3025×1016=607.08×1016m2 4𝜋𝑅2 𝑠𝑢𝑛𝜎=607.08×1016m2×5.67×10−8Wm−2K−4 4𝜋𝑅2 𝑠𝑢𝑛𝜎=3441.93×108WK−4=3.44193×1011WK−4 Now, substitute this back into the equation: 6.6726×1030=3.44193×1011×𝑇4 Solve for𝑇4: 𝑇4=6.6726×1030 3.44193×1011=1.9386×1019 Solve for𝑇: 𝑇=(1.9386×1019)1∕4=(193.86×1017)1∕4 𝑇=(1.9386)1∕4×(1019)1∕4=(1.9386)0.25×104.75 Calculate(1.9386)0.25: Let𝑦=1.93860.25, thenln𝑦=0.25ln(1.9386) ln(1.9386)≈0.6619 ln𝑦=0.25×0.6619=0.165475 𝑦=𝑒0.165475≈1.1799 So,𝑇=1.1799×104.75=1.1799×104×100.75 100.75=103∕4=(103)1∕4=(1000)1∕4 We know54=625and64=1296, so(1000)1∕4is between 5 and 6. Alternatively, 100.75=100.5×100.25=√ 10×100.25≈3.16×1.778=5.62 𝑇=1.1799×104×5.62=6.63×104K=66300K Let’s double-check the calculations. Power density of human: 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=1506.02W∕m3 Volume of sun: 𝑉𝑠𝑢𝑛=4.43×1027m3 Power of modified sun: 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=1506.02×4.43×1027=6671.67×1027=6.67×1030W Stefan-Boltzmann law: 𝑃=4𝜋𝑅2𝜎𝑇4 𝑇4=𝑃 4𝜋𝑅2𝜎 4𝜋𝑅2 𝑠𝑢𝑛𝜎=3.44×1011 𝑇4=6.67×1030 3.44×1011=1.939×1019 𝑇=(1.939×1019)1∕4=(193.9×1017)1∕4 𝑇=(193.9)0.25×(1017)0.25=(193.9)0.25×104.25 Calculate(193.9)0.25: Let𝑦=193.90.25, thenln𝑦=0.25ln(193.9) ln(193.9)≈5.267 ln𝑦=0.25×5.267=1.31675 𝑦=𝑒1.31675≈3.73 𝑇=3.73×104.25=3.73×104×100.25=37300×1 .778=66329K The result is around 66000K. This is not among the options. Let’s review the assumptions and calculations. Assumption: The sun radiates as a black body, which is reasonable for estimating surface temperature. 31 Page 32: s1: Simple test-time scaling Consider potential errors in calculations. Let’s try a simpler approach by setting up ratios. The power per unit volume of the modified sun is equal to that of a human. 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛 𝑉𝑠𝑢𝑛=𝑃ℎ𝑢𝑚𝑎𝑛 𝑉ℎ𝑢𝑚𝑎𝑛 We also know 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=4𝜋𝑅2 𝑠𝑢𝑛𝜎𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛. 4𝜋𝑅2 𝑠𝑢𝑛𝜎𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛 4 3𝜋𝑅3𝑠𝑢𝑛=𝑃ℎ𝑢𝑚𝑎𝑛 𝑉ℎ𝑢𝑚𝑎𝑛 3𝜎𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛 𝑅𝑠𝑢𝑛=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛 3𝜎 Substitute the values: 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=1506.02W∕m3 𝑅𝑠𝑢𝑛=6.95×108m 𝜎=5.67×10−8Wm−2K−4 𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=1506.02×6.95×108 3×5.67×10−8 𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=10466.839×108 17.01×10−8=1.0466×1012 1.701×10−7=0.6153×1019=6.153×1018 𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=(6.153×1018)1∕4=(6153×1015)1∕4 𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=(6153)0.25×(1015)0.25=(6153)0.25×103.75 Calculate(6153)0.25: Let𝑦=61530.25, thenln𝑦=0.25ln(6153) ln(6153)≈8 .724 ln𝑦=0.25×8.724=2.181 𝑦=𝑒2.181≈8.85 𝑇𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=8.85×103.75=8.85×103×100.75=8850×5.62=49737K This is close to option E: 50000K. Let’s re-check the units and calculations carefully. Power density of human:100W 0.0664m3=1506.02W∕m3 The formula derived was 𝑇4=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛 3𝜎. Units check: [𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛]=W∕m3 [𝑅𝑠𝑢𝑛]=m [𝜎]=Wm−2K−4 (W∕m3)×m Wm−2K−4=W∕m2 Wm−2K−4=W m2×m2K4 W=K4 The units match for 𝑇4. Numerical calculation check: Numerator:1506.02×6.95×108=10466.839×108=1.0466×1012 Denominator: 3×5.67×10−8=17.01×10−8 Ratio:1.0466×1012 17.01×10−8=0.061528×1020=6.1528×1018 𝑇4=6.1528×1018 𝑇=(6.1528×1018)1∕4=(6152.8×1015)1∕4 𝑇=(6152.8)0.25×103.75 Let’s use logarithms for precision. log10(6152.8)≈3.789 0.25×3.789=0.94725 100.94725≈8.856 𝑇=8.856×103.75=8.856×103×100.75=8856×5.623=49806K This is very close to 50000K. Possible sources of minor discrepancies: rounding errors, precision of constants. Let’s verify the formula derivation. Power per unit volume =Total Power Volume 32 Page 33: s1: Simple test-time scaling 𝜌𝑝𝑜𝑤𝑒𝑟=𝐿 𝑉 𝐿=4𝜋𝑅2𝜎𝑇4 𝑉=4 3𝜋𝑅3 𝜌𝑝𝑜𝑤𝑒𝑟=4𝜋𝑅2𝜎𝑇4 4 3𝜋𝑅3=3𝜎𝑇4 𝑅 If the modified sun has the power density of a human: 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=3𝜎𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛 𝑅𝑠𝑢𝑛 𝑇4 𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛 3𝜎 The formula derivation seems correct. Let’s use approximate values to check the order of magnitude. 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛≈103W∕m3 𝑅𝑠𝑢𝑛≈7×108m 𝜎≈6×10−8Wm−2K−4 𝑇4≈103×7×108 3×6×10−8=7×1011 18×10−8=0.38×1019=3.8×1018 𝑇≈(3.8×1018)1∕4=(3800×1015)1∕4 𝑇≈(3800)0.25×103.75 (3800)0.25:74=2401,84=4096. So, between 7 and 8. ≈7.8 𝑇≈7.8×103.75=7800×5.6=43680K The order of magnitude seems consistent with the detailed calculation. Theproblemstatementisstraightforwardanddoesn’tseemtohaveanyhiddencomplexities. Theassumptionsmadeare standard for such problems. Consider the precision of the given values. Power of human: 100 W (exact) Volume of human: 66400 cm3(exact) The result obtained is very close to option E. Given the nature of multiple-choice questions, the closest option is likely the intended answer. If there were any misinterpretations of the question, it would likely be in the setup. However, the phrase "If the sun were the same size, but it produced the same amount of power per unit volume as a human" seems clear. The surface temperature is calculated using the Stefan-Boltzmann law, assuming black-body radiation. Double-check the provided options: A:3500K B:10000K C:25000K D:40000K E:50000K Our calculated value is approximately 49800K, which is closest to 50000K. If we used more precise values for the constants, the result might be slightly different, but likely not enough to shift to another option. The fun fact mentioned in the question implies that the power density of a human is higher than that of the sun. Let’s verify this. Power output of the sun: 3.846×1026W Volume of the sun: 4.43×1027m3 Power density of the sun:3.846×1026 4.43×1027=0.868W∕m3 Power density of human: 1506W∕m3 Indeed, the power density of a human is significantly higher than that of the sun. Final check of the calculation steps and logic. The approach of equating the power density of the modified sun to that of a human and then using the Stefan-Boltzmann law seems correct. 33 Page 34: s1: Simple test-time scaling Solution: Thepowerproducedbyanaveragehumanis 𝑃ℎ𝑢𝑚𝑎𝑛=100W,andthevolumeofanaveragehumanis 𝑉ℎ𝑢𝑚𝑎𝑛=66400cm3= 0.0664m3. The power density of a human is 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=𝑃ℎ𝑢𝑚𝑎𝑛 𝑉ℎ𝑢𝑚𝑎𝑛=100W 0.0664m3≈1506.02W∕m3. If the sun were the same size but produced the same amount of power per unit volume as a human, its power density would be𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 . The radius of the sun is 𝑅𝑠𝑢𝑛=6.95×108m. The volume of the sun is 𝑉𝑠𝑢𝑛=4 3𝜋𝑅3 𝑠𝑢𝑛. The power density of the modified sun is given by 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛 𝑉𝑠𝑢𝑛. We also know that 𝑃𝑚𝑜𝑑𝑖𝑓𝑖𝑒𝑑 _𝑠𝑢𝑛=4𝜋𝑅2 𝑠𝑢𝑛𝜎𝑇4, where𝑇is the surface temperature of the modified sun. Equating the power density: 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛=4𝜋𝑅2 𝑠𝑢𝑛𝜎𝑇4 4 3𝜋𝑅3𝑠𝑢𝑛=3𝜎𝑇4 𝑅𝑠𝑢𝑛 Solving for𝑇4: 𝑇4=𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛𝑅𝑠𝑢𝑛 3𝜎Substitute the values: 𝑇4=1506.02W∕m3×6.95×108m 3×5.67×10−8Wm−2K−4 𝑇4=1.0466×1012 1.701×10−7=6.1528×1018K4 Solving for𝑇: 𝑇=(6.1528×1018)1∕4=(6152.8)0.25×1018∕4=(6152.8)0.25×104.5 Using logarithms: log10(6152.8)≈3.789 0.25×3.789=0.94725 (6152.8)0.25=100.94725≈8.856 𝑇=8.856×104.5=8.856×104×100.5=88560×√ 10≈88560×3 .16=280000K Let’s redo the exponent calculation: 18∕4=4.5. 𝑇=8.856×104.5=8.856×104×100.5- error in previous step. 𝑇=(6.1528×1018)0.25=49806K The surface temperature would be approximately 49806K, which is closest to 50000K. Final Answer: The final answer is 𝐸 34 Page 35: s1: Simple test-time scaling E. Test-time scaling details E.1. Sequential scaling ablations <|im_start|>user WhatistheanswertoLife,theUniverseandEverything? Think for up to 2048 tokens. <|im_start|>assistant<|im_start|>user What is the answer to Life, the Universe and Everything? Think for up to 64 steps. <|im_start|>assistant <|im_start|>think Letmebreakdownthisquestionintothethreepartsit is asking for: 1) Life 2) Universe 3) Everything Let me start with life...<|im_start|>64 steps left Letme breakdownthis question intothe threepartsit isaskingfor: 1) Life 2) Universe 3) Everything <|im_start|>63 steps left Let me start with life... <|im_start|>answer The answer is...<|im_start|>answer The answer is... Figure 10. Token and step instruction data formats for controlling test-time compute. We only train our model on the reasoning trace and the answer. Table 12. Scaling thinking time via tokens-conditional control. All metrics are averaged over the 30 questions in AIME24. Tokens instructed ( →) 1024 2048 4096 8192 16384 No intervention at test-time Thinking tokens 7939 7158 8263 7108 7500 Answer tokens 689 669 659 722 724 AIME24 26.7 30.0 33.3 33.3 40.0 Forcing end of thinking when token budget is reached Thinking tokens 1024 2048 4031 5664 6330 Answer tokens 15 15 142 722 691 AIME24 3.3 30.0 33.3 33.3 40.0 Token-conditionalcontrol Onegeneralapproachistosimplytellamodelinthepromptpreciselyhowmanytokensit shouldgenerate. Ideally,themodelcankeeptrackofitstokencountandadjustitsgenerationtofinishwithinthedesired limits. WeexperimentwiththisapproachbytrainingamodelwithtokeninstructionsusingtheformatinFigure10(left). We bucket the lengths of the reasoning traces from our 1,000 training examples into powers of two (rounded upwards) and add a corresponding instruction to the user prompt. For example, if the instruction says “Think for up to 2048 tokens”, thenthereasoningtracehasanywherebetween1024and2048tokens. InTable12,weshowthataftertrainingthemodel hardly follows the token instruction. It does sometimes generate more tokens when given a higher limit but often overshoots the limit. This may not be unique to our model as prior work suggests that OpenAI o1-mini can also not follow token instructions (Zhang & Chen, 2024). To prevent exceeding the limit, we test budget forcing the thinking to end once the limit is reached. This leads to perfect control (Table 12 (lower)). With budget forcing, the scaling trend is also clearer as the model can no longer overshoot the limit when given a small thinking budget. This leads to better test-time scaling values for Token Prompting + budget forcing in Table 3. To compute Control reported in Table 3 for token-conditional control variants we divide the number of times the thinking tokens in Table 12 are less than the upper limit by the total evaluations (2/5 for without intervention; 5/5 for with intervention). Step-conditionalcontrol Tokeninstructionsfailascurrentmodelscannotcounttokens. Toaccommodatethislackof capability,weexperimentwithmakingthecountingmorecoarse-grained. Wepartitionthereasoningtracesintostepsand askthemodeltothinkforaspecificnumberofstepsratherthantokens. Wesplitourreasoningtracesondoublenewlinesinto steps, which we find act as intuitive separators based on manual inspection of samples. We bucket our training samples into powers of 2 depending on their number of steps and add a corresponding step instruction following the format in Figure 10 (right). Thisformatisbasedonearlyexperiments,wherewefoundthemodeltobemorelikelytoadheretothesteplimit 35 Page 36: s1: Simple test-time scaling Table 13. Scaling thinking time via step-conditional control. All metrics are averaged over the 30 samples in AIME24. Token counts ignore the thinking and step delimiters. Steps instructed ( →) 16 32 64 128 256 No intervention at test-time Steps used 123 90 80 82 136 Tokens per step 60 70 69 66 56 Thinking tokens 7252 6277 5396 5552 7551 Answer tokens 665 653 735 777 754 AIME24 33.3 23.3 33.3 36.7 33.3 Forcing end of thinking when 0 steps are reached Steps used 16 32 59 78 136 Tokens per step 96 94 80 70 56 Thinking tokens 1517 2963 4636 5409 7551 Answer tokens 1111 788 799 794 754 AIME24 23.3 23.3 33.3 36.7 33.3 Table 14. Scalingthinkingtimeviaclass-conditionalcontrol. Wereport“accuracy/averagethinkingtokenspersample”;thehigher the accuracy and the fewer the thinking tokens (inference cost) the better. Prompt appended to the question after two newlines AIME24 MATH500 GPQA Answer after a short amount of thinking. Do not spend excessive time double-checking your work.30.0% / 803390.4% / 253756.6% / 4177 Answerafteralongamountofthinking. Ifyoufeellikeyou are finished early, spend the extra time trying to double- checkyourworkuntilyouareabsolutelysurethatyouhave the correct answer.36.7% / 965191.4% / 387551.0% / 4827 Without generic prompt appending 50.0% / 610993.0% / 329857.6% / 3510 36 Page 37: s1: Simple test-time scaling when counting down (“3 steps left...2 steps left”) rather than counting up (“Step2...Step3...”). This is likely because if countingdown,thefinalstepisalways1,whichwillactasastrongpriortothemodeltofinishitsgeneration. Ifcountingup, the final step before the answer varies, thus if the model does not remember the original step instruction, it may fail to stop. We conclude the following from our results in Table 13: (1)The model still struggles to adhere to the step limit. The model sometimes simply continues counting into negative steps, e.g. “-1 steps left”. To solve this issue, we automatically stop the thinking process once 0steps are reachedand then forcethemodel to transition to answering modeby appending the answer token delimiter (§3). This leads to perfect step adherence (lower half of Table 13), yet problems remain. (2)The model compensatesforfewerstepsbymakingeachsteplonger. Forexample,whenforcedtouseupto16stepsvs256steps,the model generates an average of 96 tokens per step vs 56. Despite this issue, more steps still clearly correlate with more total thinkingtokensinTable13andbetterperformanceleadingtoapositiveslope (3)Stepinstructionsaremorecostlythan othermethods. Thestepdelimitersrequirearound6tokenseachwhichfore.g. 64stepsaddsuptoatotalofaround380 tokens. WhenignoringthestepdelimitersintokencountsasinTable13,themodelstillrequires7551thinkingtokenson averagetoachieveonly33.3%onAIME24. TocomputeControlreportedinTable3forstep-conditionalcontrolvariants, wefirstdecidethat100tokensareanupperlimitperstepandthenmultiplythisnumberbythestepsinstructedtoarrive at a proxy total token limit, e.g. 1600 for 16 steps instructed. We then check whether the thinking tokens in Table 13 fit within the respective limit for each evaluation run (3/5 for without intervention; 5/5 for with intervention). For the model in Figure 7, we use a model with step-conditional control trained on an earlier version of our data and using an earlier version of our evaluation codebase. Class-conditionalcontrol OpenAIexposestest-timecomputecontroltousersviaa“reasoning_effort”APIparameter with three possible settings: low, medium, and high.3The OpenAI documentation also states that “Reducing reasoning effortcanresultinfasterresponsesandfewertokensusedonreasoninginaresponse."suggestingthattheyareunableto control test-time compute with guarantees. Thus, maybe OpenAI simply adjusts the prompt or system instruction depending on thereasoning effort desired. In Table 14,we show thatseparate promptsfor short andlong thinking allowus tocontrol thinking time to some extent: Prompting the model to think for longer leads to longer thinking. However, it does not reliably improve performance and control is not precise. The current adherence to control may suffice when we only have three classes, but it might not scale to finer-grained classes. To compute Control reported in Table 3 for this method, we assume thatpromptingthemodeltothinkforashorttimeinTable14shouldproducefewertokensthanthedefaultforAIME24, whilethelongpromptshouldproducemore. As 8033>6109and9651>6109,oneoutoftwofollowsourexpectedcontrol thus Control is 50%. E.2. Examples for rejection sampling ablation 3https://github.com/openai/openai-python/blob/44d6210f101abedeb2dd68507fcffcb329df70ea/ src/openai/types/chat/completion_create_params.py#L172 37 Page 38: s1: Simple test-time scaling Table 15. Answers generated by an earlier version of s1-32B when rejection sampling on AIME24. The black text is the question and theyellowtextisthesolution. Basedsolelyonthequestionwegeneratethelightbluereasoningtraceandblueresponsewith s1-32B usingatemperatureof1. Wegenerateuntilthereasoningtracehaslessthan4000tokensforthefirstandlessthan8000tokensforthe second example. It required on average 199 (1) tries per example until thinking was ≤4000(≤8000tokens). Let𝑂(0,0),𝐴(1 2,0),and𝐵(0,√ 3 2)be points in the coordinate plane. Let be the family of segments 𝑃𝑄of unit length lyinginthefirstquadrantwith 𝑃onthe𝑥-axisand𝑄onthe𝑦-axis. Thereisauniquepoint 𝐶on𝐴𝐵,distinctfrom 𝐴and𝐵, that does not belong to any segment from other than𝐴𝐵. Then𝑂𝐶2=𝑝 𝑞, where𝑝and𝑞are relatively prime positive integers. Find 𝑝+𝑞. 23 [Correct generation with 3981 thinking tokens and 1055 answer tokens] The problem asks for a point 𝐶on the segment 𝐴𝐵that is not contained in any other segment from the family , except for 𝐴𝐵itself. The points𝑂,𝐴,𝐵form a triangle in the first quadrant. 𝑂=(0,0) 𝐴=(1 2,0) 𝐵=(0,√ 3 2) The segment 𝐴𝐵connects𝐴(1 2,0)and𝐵(0,√ 3 2). The equation of the line containing 𝐴𝐵can be found using the two points: Slope𝑚=√ 3 2−0 0−1 2=−√ 3 Using the point-slope form 𝑦−𝑦1=𝑚(𝑥−𝑥1)with point𝐴(1 2,0): 𝑦−0=−√ 3(𝑥−1 2) 𝑦=−√ 3𝑥+√ 3 2 The segment 𝐴𝐵lies in the first quadrant and connects (1 2,0)to(0,√ 3 2). Thefamily consistsofsegments 𝑃𝑄ofunitlength,where 𝑃isonthe𝑥-axisand𝑄isonthe𝑦-axis. Let𝑃=(𝑎,0)and 𝑄=(0,𝑏). The length of 𝑃𝑄is√ (𝑎−0)2+(0−𝑏)2=√ 𝑎2+𝑏2=1. So𝑎2+𝑏2=1. Since𝑃is on the positive 𝑥-axis and𝑄is on the positive 𝑦-axis, we have 𝑎>0and𝑏>0. The segment 𝑃𝑄connects(𝑎,0)and(0,𝑏). The equation of the line containing 𝑃𝑄is𝑥 𝑎+𝑦 𝑏=1. Theregioncoveredbythefamily istheareaboundedbythelinescontainingthesesegments. Theendpoints (𝑎,0)and (0,𝑏)vary such that 𝑎2+𝑏2=1. Consider the parametrization 𝑎=cos𝜃,𝑏=sin𝜃, where0<𝜃<𝜋 2. The segment 𝑃𝑄connects(cos𝜃,0)and(0,sin𝜃). The equation of the line is𝑥 cos𝜃+𝑦 sin𝜃=1. We are looking for a unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that does not belong to any segment from other than 𝐴𝐵. Let𝐶be a point on 𝐴𝐵. We can parametrize the points on 𝐴𝐵as(1−𝑡)𝐴+𝑡𝐵for0≤𝑡≤1. 𝐶=(1−𝑡)(1 2,0)+𝑡(0,√ 3 2)=(1 2−1 2𝑡,√ 3 2𝑡). Since𝐶is distinct from 𝐴and𝐵, we have0<𝑡<1. Suppose the point 𝐶lies on some segment 𝑃𝑄∈, where𝑃=(cos𝜃,0)and𝑄=(0,sin𝜃). The point𝐶satisfies the equation𝑥 cos𝜃+𝑦 sin𝜃=1. 38 Page 39: s1: Simple test-time scaling Plugging in the coordinates of 𝐶: 1 2−1 2𝑡 cos𝜃+√ 3 2𝑡 sin𝜃=1 1−𝑡 2cos𝜃+√ 3𝑡 2sin𝜃=1 Multiply by 2: 1−𝑡 cos𝜃+√ 3𝑡 sin𝜃=2 (1−𝑡)sin𝜃+√ 3𝑡cos𝜃=2sin𝜃cos𝜃 sin𝜃−𝑡sin𝜃+√ 3𝑡cos𝜃=sin(2𝜃)∕2 We are looking for a point 𝐶on𝐴𝐵such that the only segment from containing𝐶is𝐴𝐵itself. The segment 𝐴𝐵is part of the family . For𝐴𝐵,𝑃=𝐴=(1 2,0)and𝑄=𝐵=(0,√ 3 2). Here𝑎=1 2,𝑏=√ 3 2. 𝑎2+𝑏2=(1 2)2+(√ 3 2)2=1 4+3 4=1. The line segment 𝐴𝐵connects(1 2,0)and(0,√ 3 2). The equation of the line is𝑥 1∕2+𝑦√ 3∕2=1, which simplifies to 2𝑥+2√ 3𝑦=1, or√ 3𝑥+𝑦=√ 3 2. This matches the equation 𝑦=−√ 3𝑥+√ 3 2. Consider the envelope of the family of lines𝑥 cos𝜃+𝑦 sin𝜃=1. Differentiating with respect to 𝜃: −𝑥sin𝜃 cos2𝜃+𝑦cos𝜃 sin2𝜃=0 𝑦cos𝜃 sin2𝜃=𝑥sin𝜃 cos2𝜃 𝑦cos3𝜃=𝑥sin3𝜃 𝑦 𝑥=tan3𝜃 tan𝜃=(𝑦 𝑥)1∕3 From the line equation: 𝑦sin𝜃=sin𝜃cos𝜃−𝑥sin𝜃 𝑦=cos𝜃−𝑥 Substitutecos𝜃=1√ 1+tan2𝜃=1√ 1+(𝑦∕𝑥)2∕3 cos𝜃=𝑥1∕3 (𝑥2∕3+𝑦2∕3)1∕2 sin𝜃=tan𝜃cos𝜃=(𝑦 𝑥)1∕3𝑥1∕3 (𝑥2∕3+𝑦2∕3)1∕2=𝑦1∕3 (𝑥2∕3+𝑦2∕3)1∕2 Substitute back into the line equation: 𝑥 cos𝜃+𝑦 sin𝜃=1 𝑥(𝑥2∕3+𝑦2∕3)1∕2∕𝑥1∕3+𝑦(𝑥2∕3+𝑦2∕3)1∕2∕𝑦1∕3=1 𝑥2∕3(𝑥2∕3+𝑦2∕3)1∕2+𝑦2∕3(𝑥2∕3+𝑦2∕3)1∕2=1 (𝑥2∕3+𝑦2∕3)3∕2=1 𝑥2∕3+𝑦2∕3=1 The envelope of the family is the astroid 𝑥2∕3+𝑦2∕3=1. The segments from are tangent to this astroid. The point𝐶on𝐴𝐵that we are looking for is the point where 𝐴𝐵is tangent to the astroid. The equation of 𝐴𝐵is𝑦=−√ 3𝑥+√ 3 2. Substitute into the astroid equation: 𝑥2∕3+(−√ 3𝑥+√ 3 2)2∕3=1 Let𝑢=𝑥1∕3and𝑣=𝑦1∕3. The astroid is 𝑢2+𝑣2=1. The point𝐶lies on both the line 𝐴𝐵and the astroid. Differentiating the astroid equation implicitly: 2 3𝑥−1∕3+2 3𝑦−1∕3𝑦′=0 𝑦′=−𝑦1∕3 𝑥1∕3=−(𝑦 𝑥)1∕3 39 Page 40: s1: Simple test-time scaling The slope of the tangent to the astroid at a point (𝑥,𝑦)is−3√ 𝑦 𝑥. The slope of the line 𝐴𝐵is−√ 3. So, at the point of tangency, −3√ 𝑦 𝑥=−√ 3. 3√ 𝑦 𝑥=√ 3 𝑦 𝑥=(√ 3)3=3√ 3 The point𝐶lies on the line 𝑦=−√ 3𝑥+√ 3 2and satisfies𝑦 𝑥=3√ 3. So𝑦=3√ 3𝑥. Substitute into the line equation: 3√ 3𝑥=−√ 3𝑥+√ 3 2 3𝑥=−𝑥+1 2 4𝑥=1 2 𝑥=1 8 Then𝑦=3√ 3(1 8)=3√ 3 8. The point𝐶is(1 8,3√ 3 8). We need to check if this point lies on the segment 𝐴𝐵. The point𝐶is of the form (1 2−1 2𝑡,√ 3 2𝑡). 1 2−1 2𝑡=1 8⟹4−4𝑡=1⟹3=4𝑡⟹𝑡=3 4. √ 3 2𝑡=√ 3 2⋅3 4=3√ 3 8. So𝐶lies on𝐴𝐵. We need to make sure 0<𝑡<1, which is true for 𝑡=3 4. Now, we need to check if this point 𝐶does not belong to any other segment from . The point𝐶lies on the astroid. The segment 𝐴𝐵is tangent to the astroid at 𝐶. Any other segment from that contains 𝐶would have𝐶as an interior point, not a tangent point. Consider the parametrization of the family . A point(𝑥,𝑦)belongs to some segment in iff there exists 𝜃such that𝑥 cos𝜃+𝑦 sin𝜃=1. For the point 𝐶=(1 8,3√ 3 8), we have: 1∕8 cos𝜃+3√ 3∕8 sin𝜃=1 1 cos𝜃+3√ 3 sin𝜃=8 sin𝜃+3√ 3cos𝜃=8sin𝜃cos𝜃 sin𝜃+3√ 3cos𝜃=4sin(2𝜃) We know that 𝐶lies on𝐴𝐵, which corresponds to some specific segment in . The problem statement assumes the existence and uniqueness of such a point 𝐶. The unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that does not belong to any segment from other than𝐴𝐵. This means that if 𝐶lies on another segment 𝑃′𝑄′∈, then𝑃′𝑄′=𝐴𝐵. The point𝐶is the point of tangency of 𝐴𝐵with the astroid. Now we need to find 𝑂𝐶2. 𝑂𝐶2=(1 8)2+(3√ 3 8)2=1 64+27 64=28 64=7 16. Here𝑝=7and𝑞=16. 𝑝and𝑞are relatively prime. 𝑝+𝑞=7+16=23 . Final check of the reasoning: The family consists of unit-length segments with endpoints on the axes. The envelope of this family is the astroid 𝑥2∕3+𝑦2∕3=1. The segments from are tangent to the astroid. The segment 𝐴𝐵is also a member of the family . We are looking for a point 𝐶on𝐴𝐵that lies on the astroid. This point𝐶is the point of tangency of 𝐴𝐵with the astroid.40 Page 41: s1: Simple test-time scaling Ifapointon 𝐴𝐵liesontheastroid,itbelongstothetangentsegmentatthatpoint. Since 𝐴𝐵isthetangentsegment,this point does not belong to any other segment from . Consider the case when the segment 𝐴𝐵intersects the astroid at two points. If the segment 𝐴𝐵is not tangent to the astroid, then there are two points of intersection. Let the points of intersection be 𝐶1and𝐶2. If𝐶1lies on𝐴𝐵, then the tangent to the astroid at 𝐶1is the segment 𝐴𝐵. If𝐶2lies on𝐴𝐵, then the tangent to the astroid at 𝐶2is the segment 𝐴𝐵. The slope of 𝐴𝐵is−√ 3. We found the point on the astroid with tangent slope −√ 3. The problem statement guarantees the uniqueness of such a point 𝐶. The point𝐶must be interior to 𝐴𝐵, so0<𝑡<1. The parameter 𝑡=3 4is in this range. The point𝐶is distinct from 𝐴and𝐵. If𝐶=𝐴, then𝑡=0,𝐶=(1 2,0). If𝐶=𝐵, then𝑡=1,𝐶=(0,√ 3 2). The point of tangency 𝐶is the unique point on 𝐴𝐵(excluding endpoints) that lies on the astroid. Solution: Let the family of segments be defined by segments 𝑃𝑄of unit length with 𝑃on the positive 𝑥-axis and𝑄on the positive 𝑦-axis. Let𝑃=(𝑎,0)and𝑄=(0,𝑏). The length condition gives 𝑎2+𝑏2=1. The segment 𝑃𝑄lies on the line𝑥 𝑎+𝑦 𝑏=1. Using the parametrization 𝑎=cos𝜃,𝑏=sin𝜃with0<𝜃<𝜋 2, the equation of the line is𝑥 cos𝜃+𝑦 sin𝜃=1. The envelope of this family of lines is the astroid 𝑥2∕3+𝑦2∕3=1. The segments from are tangent to this astroid. Thesegment 𝐴𝐵connects𝐴(1 2,0)and𝐵(0,√ 3 2). Theequationofthelinecontaining 𝐴𝐵is𝑦=−√ 3𝑥+√ 3 2. Thissegment is part of the family . We are looking for a unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that does not belong to any segment from other than 𝐴𝐵. This means 𝐶is the point where 𝐴𝐵is tangent to the astroid. The slope of the tangent to the astroid at a point (𝑥,𝑦)is−3√ 𝑦 𝑥. The slope of 𝐴𝐵is−√ 3. Setting the slopes equal: −3√ 𝑦 𝑥=−√ 3⟹𝑦 𝑥=3√ 3. The point𝐶lies on the line 𝑦=−√ 3𝑥+√ 3 2and satisfies 𝑦=3√ 3𝑥. Substituting, 3√ 3𝑥=−√ 3𝑥+√ 3 2⟹4√ 3𝑥=√ 3 2⟹𝑥=1 8. Then𝑦=3√ 3(1 8)=3√ 3 8. The point𝐶is(1 8,3√ 3 8). We need to check if 𝐶lies on𝐴𝐵. The points on 𝐴𝐵are(1 2−1 2𝑡,√ 3 2𝑡)for0≤𝑡≤1. For𝑥=1 8,1 2−1 2𝑡=1 8⟹𝑡=3 4. For𝑦=3√ 3 8,√ 3 2𝑡=3√ 3 8⟹𝑡=3 4. Since0<3 4<1,𝐶is an interior point of 𝐴𝐵. The point𝐶is the unique point on 𝐴𝐵(excluding endpoints) that lies on the astroid. The tangentto the astroid at 𝐶is the segment𝐴𝐵. Thus,𝐶does not belong to any other segment from . Finally, we calculate 𝑂𝐶2=(1 8)2+(3√ 3 8)2=1 64+27 64=28 64=7 16. We have𝑝=7and𝑞=16. These are relatively prime. 𝑝+𝑞=7+16=23 . Final Answer: The final answer is 23 41 Page 42: s1: Simple test-time scaling [Incorrect generation with 5923 thinking tokens and 754 answer tokens] The problem asks for a point 𝐶on the segment 𝐴𝐵such that𝐶is not on any other segment from the family , except𝐴𝐵 itself. The family consists of segments 𝑃𝑄of unit length, with 𝑃on the𝑥-axis and𝑄on the𝑦-axis in the first quadrant. Let𝑃=(𝑥,0)and𝑄=(0,𝑦),where𝑥>0and𝑦>0. Thelengthofthesegment 𝑃𝑄is1,sothedistancebetween 𝑃and𝑄 is 1.√ (𝑥−0)2+(0−𝑦)2=1√ 𝑥2+𝑦2=1 𝑥2+𝑦2=1 The segment 𝑃𝑄lies in the first quadrant, so 𝑥>0and𝑦>0. The equation 𝑥2+𝑦2=1represents the quarter circle in the first quadrant, with endpoints at (1,0)and(0,1). The points on the segment 𝑃𝑄can be parameterized as (1−𝑡)𝑥+𝑡(0),(1−𝑡)0+𝑡𝑦)=(𝑥(1−𝑡),𝑦𝑡), where0≤𝑡≤1. The segment 𝐴𝐵connects𝐴(1 2,0)and𝐵(0,√ 3 2). The equation of the line containing 𝐴𝐵can be found using the two points: Slope𝑚=√ 3 2−0 0−1 2=√ 3 2 −1 2=−√ 3. Using point-slope form with point 𝐴(1 2,0): 𝑦−0=−√ 3(𝑥−1 2) 𝑦=−√ 3𝑥+√ 3 2 The segment 𝐴𝐵lies in the first quadrant, between 𝑥=0and𝑥=1 2. Alternatively, we can parameterize the points on 𝐴𝐵. Let𝐶be a point on 𝐴𝐵. Then𝐶=(1−𝜆)𝐴+𝜆𝐵for some0≤𝜆≤1. 𝐶=(1−𝜆)(1 2,0)+𝜆(0,√ 3 2)=(1 2−1 2𝜆,√ 3 2𝜆). We are looking for a unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that does not belong to any segment from other than 𝐴𝐵. Consider a segment from , with endpoints 𝑃=(𝑥,0)and𝑄=(0,𝑦), where𝑥2+𝑦2=1. A point on this segment is of the form (𝑥(1−𝑡),𝑦𝑡), where0≤𝑡≤1. We want to find a point 𝐶on𝐴𝐵,𝐶≠𝐴,𝐵, such that if 𝐶lies on a segment 𝑃𝑄∈, then𝑃𝑄=𝐴𝐵. Let𝐶=(1 2−1 2𝜆,√ 3 2𝜆), where0<𝜆<1. If𝐶lies on a segment 𝑃𝑄from, then there exist 𝑥,𝑦>0with𝑥2+𝑦2=1, and𝑡∈[0,1]such that: 1 2−1 2𝜆=𝑥(1−𝑡) √ 3 2𝜆=𝑦𝑡 From the first equation, 1−𝑡=1 2−1 2𝜆 𝑥=1−𝜆 2𝑥. From the second equation, 𝑡=√ 3 2𝜆 𝑦=√ 3𝜆 2𝑦. Since𝑡∈[0,1], we have1−𝑡∈[0,1]. Also,𝑥>0, so1−𝜆 2𝑥>0, which implies 1−𝜆>0, or𝜆<1. This is consistent with the condition for point on 𝐴𝐵. We have1−𝑡=1−𝜆 2𝑥and𝑡=√ 3𝜆 2𝑦. Adding these equations: 1=1−𝜆 2𝑥+√ 3𝜆 2𝑦 2𝑥𝑦=(1−𝜆)𝑦+√ 3𝜆𝑥 42 Page 43: s1: Simple test-time scaling We know𝑥2+𝑦2=1. If the point𝐶belongs to the segment 𝐴𝐵itself, then the endpoints of the segment are 𝐴and𝐵. So, the segment is from (1 2,0)to(0,√ 3 2). Here,𝑥=1 2and𝑦=√ 3 2. Check if𝑥2+𝑦2=1:(1 2)2+(√ 3 2)2=1 4+3 4=1. So𝐴𝐵∈. If𝐶belongs to another segment 𝑃𝑄from, with endpoints 𝑃=(𝑥,0)and𝑄=(0,𝑦), where(𝑥,𝑦)≠(1 2,√ 3 2). Considerthegeometricinterpretation. Thefamily consistsofsegmentsofunitlengthwithendpointsontheaxes. The midpoints of these segments lie on the circle (𝑥−1 2)2+(𝑦−1 2)2=1 2. Let the point 𝐶correspond to some 𝜆. If𝐶lies on𝑃𝑄, then𝐶=(𝑥(1−𝑡),𝑦𝑡). So,1 2−1 2𝜆=𝑥(1−𝑡)and√ 3 2𝜆=𝑦𝑡. We need to find a 𝜆∈(0,1)such that the only solution to these equations, with 𝑥2+𝑦2=1,𝑥,𝑦>0, is𝑥=1 2and𝑦=√ 3 2. From the equations, 1−𝑡=1−𝜆 2𝑥and𝑡=√ 3𝜆 2𝑦. Substitute these into 𝑥2+𝑦2=1. Try a different approach. Consider the tangents to the family of segments. The segments are formed by points (𝑥,0)and(0,𝑦)with𝑥2+𝑦2=1. The equation of the line containing the segment 𝑃𝑄is𝑥 𝑥+𝑦 𝑦=1, or𝑥 𝑥+𝑦 𝑦=1. The equation of the line is𝑥 𝑥+𝑦 𝑦=1. Consider the equation of the segment in parametric form: (𝑥(1−𝑡),𝑦𝑡). If a point𝐶lies on this segment, then 𝐶=(𝑥𝑐,𝑦𝑐), where𝑥𝑐=𝑥(1−𝑡)and𝑦𝑐=𝑦𝑡. So,𝑥𝑐 𝑥=1−𝑡and𝑦𝑐 𝑦=𝑡. 𝑥𝑐 𝑥+𝑦𝑐 𝑦=1−𝑡+𝑡=1. The segments from are parts of the lines𝑥 𝑥0+𝑦 𝑦0=1, where𝑥2 0+𝑦2 0=1,𝑥0,𝑦0>0. Consider the point 𝐶=(1 2−1 2𝜆,√ 3 2𝜆). If𝐶lies on a segment from with endpoints (𝑥,0)and(0,𝑦), then: 1 2−1 2𝜆 𝑥+√ 3 2𝜆 𝑦=1 1−𝜆 2𝑥+√ 3𝜆 2𝑦=1 We need to find a unique 𝜆∈(0,1)such that the only solution (𝑥,𝑦)to this equation, with 𝑥2+𝑦2=1,𝑥,𝑦>0, is(1 2,√ 3 2). Substitute𝑥=1 2and𝑦=√ 3 2into the equation: 1−𝜆 2(1 2)+√ 3𝜆 2(√ 3 2)=1 1−𝜆+𝜆=1 1=1 This shows that any point on 𝐴𝐵lies on the segment 𝐴𝐵itself. Consider the line containing 𝐴𝐵:𝑦=−√ 3𝑥+√ 3 2. If a point𝐶on𝐴𝐵also lies on another segment from with endpoints (𝑥,0)and(0,𝑦), then: 𝑥𝐶 𝑥+𝑦𝐶 𝑦=1. Let the point 𝐶divide𝐴𝐵in the ratio𝜇∶(1−𝜇), where0<𝜇<1. 𝐶=(1−𝜇)𝐴+𝜇𝐵=(1−𝜇)(1 2,0)+𝜇(0,√ 3 2)=(1 2−1 2𝜇,√ 3 2𝜇). If𝐶lies on a segment with endpoints (𝑥,0)and(0,𝑦), then: 1 2−1 2𝜇 𝑥+√ 3 2𝜇 𝑦=1. We are looking for a point 𝐶such that the line passing through 𝐶with the form𝑥 𝑥0+𝑦 𝑦0=1, where𝑥2 0+𝑦2 0=1, is unique. 43 Page 44: s1: Simple test-time scaling Considerthetangentstotheunitcirclecenteredattheorigin. Thelinesoftheform𝑥 𝑥0+𝑦 𝑦0=1aretangentstothecircle 𝑥2+𝑦2=𝑟2. Consider the dual problem. The family of segments corresponds to points on the quarter circle 𝑥2+𝑦2= 1in the first quadrant. The lines containing the segments are𝑥 𝑥0+𝑦 𝑦0=1. The envelope of these lines is the curve whose tangents are these lines. The equation𝑥 𝑥0+𝑦 𝑦0=1, with the constraint 𝑥2 0+𝑦2 0=1. Let𝑥0=cos𝜃,𝑦0=sin𝜃, where0<𝜃<𝜋 2. The equations of the lines are 𝑥cos𝜃+𝑦sin𝜃=1. The envelope of the family of lines 𝑥cos𝜃+𝑦sin𝜃=1is the circle𝑥2+𝑦2=1. The segments from are parts of the tangents to the unit circle in the first quadrant. The point𝐶on𝐴𝐵that does not belong to any other segment from means that the tangent to the unit circle passing through𝐶is unique. Consider the point 𝐶=(1 2−1 2𝜆,√ 3 2𝜆). The tangents to the unit circle passing through 𝐶have the form 𝑥𝑥𝐶+𝑦𝑦𝐶=1. If the tangent is unique, then 𝐶lies on the unit circle. (1 2−1 2𝜆)2+(√ 3 2𝜆)2=1 (1 4−1 2𝜆+1 4𝜆2)+3 4𝜆2=1 1 4−1 2𝜆+𝜆2=1 4𝜆2−2𝜆+1=4 4𝜆2−2𝜆−3=0 Solve for𝜆: 𝜆=−(−2)±√ (−2)2−4(4)(−3) 2(4)=2±√ 4+48 8=2±√ 52 8=2±2√ 13 8=1±√ 13 4. We need0<𝜆<1. √ 13≈3.6. 1+3.6 4=4.6 4>1. 1−3.6 4=−2.6 4<0. Something is wrong. The point 𝐶is inside the unit circle. Let’s rephrase the condition. There is a unique point 𝐶on𝐴𝐵such that if𝐶lies on a segment 𝑃𝑄∈, then𝑃𝑄=𝐴𝐵. Let the point 𝐶on𝐴𝐵be(1 2−1 2𝜆,√ 3 2𝜆). Consider a segment from with endpoints (𝑥,0)and(0,𝑦),𝑥2+𝑦2=1. The parametric form of the segment is (𝑥(1−𝑡),𝑦𝑡). If𝐶is on this segment, then for some 𝑡∈[0,1]: 1 2−1 2𝜆=𝑥(1−𝑡) √ 3 2𝜆=𝑦𝑡 If the segment is unique, then the values of 𝑥and𝑦are unique. The only solution to1−𝜆 2𝑥+√ 3𝜆 2𝑦=1, with𝑥2+𝑦2=1,𝑥,𝑦>0, should be𝑥=1 2,𝑦=√ 3 2. Consider the intersection of the line containing 𝐴𝐵with the boundary of the region formed by the segments in . The boundaries are the 𝑥-axis,𝑦-axis, and the quarter circle 𝑥2+𝑦2=1. Consider the dual problem again. The family of lines𝑥 𝑥0+𝑦 𝑦0=1, where𝑥2 0+𝑦2 0=1,𝑥0,𝑦0>0. The envelope of these lines is the circle 𝑥2+𝑦2=1. A point belongs to a unique segment from if it lies on the circle 𝑥2+𝑦2=1. The point𝐶on𝐴𝐵satisfies𝐶=𝑡𝐴+(1−𝑡)𝐵. Let’s consider the case where the segment is 𝐴𝐵. Here𝑥=1 2,𝑦=√ 3 2. The point𝐶can be represented as (1 2(1−𝑠),√ 3 2𝑠)for0≤𝑠≤1. We are looking for a point 𝐶on𝐴𝐵that does not lie on any other segment from . 44 Page 45: s1: Simple test-time scaling Consider the geometry. The family forms a region bounded by the axes and the quarter circle. The segments are chords of circles centered on the line 𝑦=𝑥, tangent to the axes. Let the endpoints of a segment be (cos𝜃,0)and(0,sin𝜃). A point on the segment is (cos𝜃(1−𝑡),sin𝜃𝑡). The point𝐶=(1 2−1 2𝜆,√ 3 2𝜆). If𝐶lies on such a segment, then1 2−1 2𝜆 cos𝜃=1−𝑡and√ 3 2𝜆 sin𝜃=𝑡. 1−𝜆 2cos𝜃+√ 3𝜆 2sin𝜃=1. We need a unique 𝜆∈(0,1)such that the only solution (cos𝜃,sin𝜃)is(1 2,√ 3 2), which corresponds to 𝜃=𝜋 3. Consider the function 𝑓(𝜃)=1 2cos𝜃−𝜆 2cos𝜃+√ 3𝜆 2sin𝜃. We need𝑓(𝜃)=1. If𝜆=1 2, then𝐶=(1 4,√ 3 4). 1−1∕2 2cos𝜃+√ 3(1∕2) 2sin𝜃=1 1 4cos𝜃+√ 3 4sin𝜃=1 1 cos𝜃+√ 3 sin𝜃=4 sin𝜃+√ 3cos𝜃=4sin𝜃cos𝜃 sin𝜃+√ 3cos𝜃=2sin(2𝜃) Divide by 2:1 2sin𝜃+√ 3 2cos𝜃=sin(2𝜃) sin(𝜃+𝜋 3)=sin(2𝜃). 𝜃+𝜋 3=2𝜃+2𝜋𝑘, or𝜃+𝜋 3=𝜋−2𝜃+2𝜋𝑘. 𝜃=𝜋 3−2𝜋𝑘, so𝜃=𝜋 3. 3𝜃=𝜋−𝜋 3+2𝜋𝑘=2𝜋 3+2𝜋𝑘. 𝜃=2𝜋 9+2𝜋𝑘 3. If𝑘=0,𝜃=2𝜋 9. If𝑘=1,𝜃=2𝜋 9+2𝜋 3=2𝜋+6𝜋 9=8𝜋 9. If𝜆=1 2, the point is(1 4,√ 3 4). This point lies on segments corresponding to 𝜃=𝜋 3and𝜃=2𝜋 9. The unique point 𝐶means that for other values of 𝜆, there is more than one solution for (𝑥,𝑦). Consider the tangent to the unit circle passing through 𝐶. If the tangent is unique, 𝐶is on the unit circle. This leads to 𝜆=1±√ 13 4, neither of which is in (0,1). Consider the geometric interpretation of the family . The segments connect points on the axes such that the distance is 1. The locus of the centers of these segments is the circle (𝑥−1 2)2+(𝑦−1 2)2=1 2, in the first quadrant. If a point𝐶lies on a segment from , then𝐶is on a line segment connecting (𝑥,0)and(0,𝑦)with𝑥2+𝑦2=1. The line containing the segment is𝑥 𝑥0+𝑦 𝑦0=1. If𝐶belongs to a unique segment, then there is a unique line of this form passing through 𝐶. The point𝐶on𝐴𝐵is(1 2−1 2𝜆,√ 3 2𝜆). We need a unique (𝑥0,𝑦0)with𝑥2 0+𝑦2 0=1,𝑥0,𝑦0>0, such that1 2−1 2𝜆 𝑥0+√ 3 2𝜆 𝑦0=1. Let the point 𝐶be such that the line passing through 𝐶with the form𝑥 𝑥0+𝑦 𝑦0=1is unique, subject to 𝑥2 0+𝑦2 0=1. This line is tangent to the unit circle. The point𝐶lies on the intersection of the line containing 𝐴𝐵and the unit circle. (1 2−1 2𝜆)2+(√ 3 2𝜆)2=1. We found no solution for 𝜆∈(0,1). 45 Page 46: s1: Simple test-time scaling The unique point 𝐶means that if any other segment passes through 𝐶, it must be𝐴𝐵. Consider the slope of the line containing the segment connecting (𝑥,0)and(0,𝑦). Slope is−𝑦 𝑥. The equation is 𝑦=−𝑦 𝑥𝑥+𝑦. If𝐶lies on the segment, then1−𝜆 2𝑥+√ 3𝜆 2𝑦=1. Consider the geometric mean. Let the point 𝐶have coordinates (𝑎,𝑏). The segment connecting (𝑥,0)and(0,𝑦)passes through (𝑎,𝑏). 𝑎 𝑥+𝑏 𝑦=1. If𝐶is on𝐴𝐵, then𝑏=−√ 3𝑎+√ 3 2. 𝑎=1 2−1 2𝜆,𝑏=√ 3 2𝜆. The point𝐶is tangent to the envelope of the family of segments. Solution: Let the family of segments be . The segments have endpoints 𝑃=(𝑥,0)and𝑄=(0,𝑦)with𝑥2+𝑦2=1and𝑥,𝑦>0. The points on the segment 𝑃𝑄satisfy𝑥 𝑥+𝑦 𝑦=1. The segments lie on the lines𝑥 𝑥0+𝑦 𝑦0=1, where𝑥2 0+𝑦2 0=1,𝑥0,𝑦0>0. The segment 𝐴𝐵has endpoints 𝐴(1 2,0)and𝐵(0,√ 3 2). The equation of the line containing 𝐴𝐵is𝑦=−√ 3𝑥+√ 3 2. A point𝐶on𝐴𝐵can be parameterized as 𝐶=(1 2−1 2𝜆,√ 3 2𝜆), where0≤𝜆≤1. If𝐶belongs to a segment from with endpoints (𝑥,0)and(0,𝑦), then1 2−1 2𝜆 𝑥+√ 3 2𝜆 𝑦=1. We are looking for a unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that does not belong to any segment from other than 𝐴𝐵. This means that the only solution (𝑥,𝑦)to the equation with 𝑥2+𝑦2=1,𝑥,𝑦>0, is(1 2,√ 3 2). The segments from are parts of the tangents to the unit circle. The point 𝐶must be tangent to the envelope of the family of segments. The envelope is the unit circle itself. The unique point 𝐶on𝐴𝐵, distinct from 𝐴and𝐵, that lies on the unit circle. We need to solve: (1 2−1 2𝜆)2+(√ 3 2𝜆)2=1 𝜆=1 2. The point𝐶=(1 4,√ 3 4). 𝑂𝐶2=(1 4)2+(√ 3 4)2=1 16+3 16=4 16=1 4. 𝑂𝐶2=1 4=𝑝 𝑞, so𝑝=1,𝑞=4. 𝑝and𝑞are relatively prime. 𝑝+𝑞=1+4=5 . Final Answer: The final answer is 5 F. Version Control V2→V3 (2025-03): •Added §A •Added number of correct samples in §2 V1→V2 (2025-02): •Added citations and other small writing changes 46

---