RDoLT Reasoning
Recursive Decomposition of Logical Thoughts: Framework for
Superior Reasoning and Knowledge Propagation in Large
Language Models
Kaleem Ullah Qasim kaleem@my.swjtu.edu.cn
Southwest Jiaotong University
Zhang Jiashu jszhang@home.swjtu.edu.cn
Southwest Jiaotong University
Tariq Alsahfi tmalsahfi@uj.edu.sa
University of Jeddah
Ateeq Ur Rehman Butt ateeqbutt13@live.com
National Textile University
Abstract
Enhancing the reasoning capabilities of Large Language Models remains a critical challenge
in artificial intelligence. We introduce RDoLT (Recursive Decomposition of Logical Thought)
prompting, a novel framework that significantly boosts LLM reasoning performance. RDoLT
is built on three key innovations: (1) recursively breaking down complex reasoning tasks into
sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mech-
anism to identify the most promising reasoning thoughts; and (3) integrating a knowledge
propagation module that mimics human learning by keeping track of strong and weak thoughts
for information propagation. Our approach was evaluated across multiple benchmarks, including
GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao 2023 Math. The
results demonstrate that RDoLT consistently outperforms existing state-of-the-art techniques,
achieving 90.98% accuracy on GSM8K with ChatGPT-4, surpassing the state of the art (SOTA) by 6.28%. Similar
improvements were observed on other benchmarks, with accuracy gains ranging from 5.5% to
6.75%. These findings highlight RDoLT’s potential to advance prompt engineering, offering a
more effective and generalizable approach to complex reasoning tasks.
1. Introduction
Large Language Models (LLMs) have made significant strides in natural language understand-
ing Xue [2024], Myers et al. [2024] and text generation Yadav [2024], Erdem et al. [2022],
enabling advancements in applications such as machine translation Yadav [2024], Gao et al.
[2024], question-answering systems Bui et al. [2024], Li et al. [2024], information retrieval Ab-
din et al. [2023], Ai et al. [2023], Van Der Meer et al. [2024], Erdem et al. [2022], conversational
agents Mahmood et al. [2023], Xing [2024], Zhang et al. [2023a], and text summarization Zhang
et al. [2024]. These models, trained on vast datasets, can generate human-like text Tseng et al.
[2023], Liu et al. [2024] and produce sophisticated responses, positioning them as key players in
industries such as healthcare, education, and legal services Yu et al. [2023], Gokul [2023]. However,
despite these achievements, LLMs still struggle with complex reasoning tasks Gendron et al.
[2023], Huang et al. [2023], Shen et al. [2024], Lin et al. [2020], such as mathematical problem-
solving Satpute et al. [2024], arithmetic, and common sense reasoning Zhou et al. [2024], Lin
et al. [2020]. These tasks require not only language comprehension but also the application
of logical steps Hong et al. [2024], deeper planning Ding et al. [2023], and extensive thought
exploration Besta et al. [2024b] to form consistent and accurate answers. LLMs often fall short
arXiv:2501.02026v1 [cs.CL] 3 Jan 2025
Kaleem Ullah Qasim, Jiashu Zhang, Tariq Alsahfi, Ateeq Ur Rehman Butt
in these areas, generating incorrect or hallucinated responses Rosa et al. [2024], Chen et al.
[2024], which underscores the need for enhanced techniques to improve reasoning capabilities.
To address these challenges, prompting techniques Vatsal and Dubey [2024] such as Chain-
of-Thought (CoT) Wei et al. [2022] have been developed to guide LLMs in generating interme-
diate reasoning steps before arriving at the final answer. CoT improves reasoning by generating
intermediate solutions with “step-by-step” reasoning, thereby facilitating better logical pro-
gression. CoT has been further refined to Chain-of-Thought Self-consistency (CoT-SC) Wang
et al. [2022], which asks models to generate consistent thought chains and derive a final answer
through majority voting of generated thoughts. Least-to-Most (L2M) Zhou et al. [2022] took a
different approach to address complex problems by initially solving the simplest aspects, gradu-
ally progressing to the final solution of the task, where answers to simple aspects help the model
understand the overall query. These methods have shown promise but still face significant lim-
itations Rosa et al. [2024], particularly in their thought selection and scoring for intermediate
thoughts Stechly et al. [2024].
Existing techniques have significantly advanced the reasoning capabilities of LLMs, yet they
continue to exhibit critical shortcomings when applied to complex reasoning tasks. CoT operates
by generating a sequential reasoning process, breaking tasks into intermediate steps to arrive
at a final answer. While this approach aids in structured reasoning, it suffers from a notable
limitation: if an incorrect intermediate thought is generated, it propagates to subsequent steps,
compounding errors and leading to inaccurate conclusions Stechly et al. [2024]. Wang et al.
[2022] attempts to address this issue by generating multiple reasoning paths and using majority
voting to determine the final answer. However, this method can be problematic when only two
reasoning paths are generated or when the majority voting overlooks correct but rare outliers,
leading to potentially sub-optimal decisions.
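The two failure modes of majority voting described above can be illustrated with a minimal sketch. It assumes the final answers have already been extracted from each sampled reasoning path; `majority_vote` is a hypothetical helper written for illustration, not code from Wang et al. [2022].

```python
from collections import Counter

def majority_vote(answers):
    """Return (list of winning answers, vote count); several winners means a tie."""
    counts = Counter(answers)
    top = max(counts.values())
    winners = sorted(a for a, c in counts.items() if c == top)
    return winners, top

# Two paths that disagree: voting cannot pick a winner.
tie = majority_vote(["18", "20"])

# Suppose "18" is the correct answer: it is outvoted by the more frequent "20".
outvoted = majority_vote(["20", "20", "18"])
```

In the first call both answers are returned with one vote each, and in the second the minority answer is dropped regardless of whether it is correct.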
The design of Zhou et al. [2022], which scaffolds learning by progressing from simple to more
complex sub-tasks, faces the problem of error propagation. If incorrect conclusions are drawn
in the initial simpler steps, these errors can cascade, impacting the overall solution. Moreover,
these methods fail to fully integrate their respective strengths. For example, while CoT focuses
on logical breakdown, it lacks the flexibility needed to adapt to real-time complexity, and
Least2Most, despite its step-wise progression, struggles with scaling effectively to more nuanced
problems. Park et al. [2024] introduce an alternative approach that improves reasoning by adding
separators between thoughts to encourage clearer delineation between reasoning stages. Another
line of work seeks to combine reasoning with action by having models alternate between reasoning
steps and taking actions (such as querying external knowledge sources) during the thought
process. Although this mitigates some issues related to thought propagation, its reliance on
external actions introduces new challenges, such as dependency on the availability of accurate
external knowledge.
In response to these persistent limitations, we propose a novel prompting technique called
Recursive Decomposition of Logical Thoughts (RDoLT) designed to address the key short-
comings of previous methods by integrating a more dynamic and recursive structure into the
reasoning process. RDoLT enhances traditional CoT and CoT-SC methodologies by breaking
down tasks into three distinct levels of complexity (easy, intermediate, and final) while incorporating
a robust thought evaluation and scoring system, as shown in Figure 1. Each stage generates
multiple thoughts, assessed and scored based on four critical dimensional features: Logical
Validity, Coherence, Simplicity, and Adaptiveness. These features allow RDoLT to evaluate
thoughts not only based on their immediate correctness but also on their alignment with the
overall task, their clarity, and their flexibility in different contexts.
Crucially, RDoLT introduces a Knowledge Propagation Module (KPM), a novel mechanism
that tracks both selected and rejected thoughts throughout the reasoning process. By storing
and propagating information about rejected thoughts (classified as “weak”), RDoLT ensures
that potentially valuable ideas are not lost prematurely. This allows the system to revisit
rejected thoughts when they become relevant in later stages of reasoning, minimizing the risk
of missing out on correct but initially discarded solutions. Unlike previous approaches that
discard non-majority reasoning paths, RDoLT’s KPM continuously refines its understanding by
considering the full spectrum of thoughts generated.
Figure 1: Illustration of the Recursive Decomposition of Logical Thoughts (RDoLT) framework.
Each yellow box represents the task decomposed into the Easy, Intermediate, and Final tiers. Each tier
generates multiple thoughts (T1, T2, T3) evaluated on four features (logical validity, coherence,
simplicity, and adaptiveness). Thoughts meeting the threshold criteria (green) are propagated
to the next tier via the Knowledge Propagation Module (KPM), which tracks selected (strong)
and rejected (weak) thoughts to inform future evaluations.
Our contributions can be summarized through three primary innovations: (1) Task Decomposition:
RDoLT decomposes reasoning tasks into three levels (easy, intermediate, and
final), allowing for a more structured exploration of task complexity. This differs from the rigid
structure of Least2Most prompting Zhou et al. [2022], as RDoLT does not prioritize simpler tasks
but instead decomposes tasks based on logical progressive complexity. (2) Thought Scoring
System: RDoLT evaluates thoughts using a four-feature system (Logical Validity, Coherence,
Simplicity, and Adaptiveness), ensuring that thoughts are selected not just for their correctness
but also for their ability to support consistent and flexible reasoning across different contexts.
(3) Knowledge Propagation Module (KPM): The KPM tracks both successful (strong)
and rejected (weak) thoughts, allowing for dynamic re-evaluation and adaptation throughout
the reasoning process. This dual-path system mitigates the issues of premature rejection or
over-reliance on majority voting seen in CoT-SC Wang et al. [2022].
Our research investigates the following key questions:
•RQ1: How does RDoLT compare to existing prompting methods in terms of reasoning
performance on complex tasks?
•RQ2: What impact does the novel thought selection system have on the accuracy and
efficiency of LLM reasoning?
•RQ3: How effective is the Knowledge Propagation Module (KPM) in utilizing both selected
and rejected thoughts to enhance reasoning outcomes?
Empirical evaluations demonstrate that RDoLT significantly improves LLM performance
on a range of reasoning benchmarks, addressing key challenges identified in earlier prompting
methods. By synthesizing insights from CoT, CoT-SC, Least2Most, and other techniques,
RDoLT offers a comprehensive framework that enhances the logical reasoning capabilities of
LLMs, providing a promising path forward in prompt engineering research. All code and
results will be open-sourced upon publication.
2. Related Work
•Feedback Guided Thought Generation: Human feedback has been shown to enhance
the performance of LLMs by providing an external evaluation that can refine model out-
puts Tandon et al. [2021], Elgohary et al. [2021], Bai et al. [2022]. However, this feedback is
often expensive and problematic to incorporate into an automated process. Consequently,
researchers have begun to replace human feedback with heuristic functions, which serve
as a more scalable solution Liu et al. [2022], Lu et al. [2022], Le et al. [2022], Welleck et al.
[2022].
Recent advancements have introduced self-reflective mechanisms where models generate
their own feedback to assess and improve outputs Madaan et al. [2023], Shinn et al.
[2023], Paul et al. [2024]. These techniques are especially beneficial for code generation
and other multi-step tasks, as seen in Chen et al. [2023], which utilizes execution results
for refinement. However, these approaches typically follow a linear left-to-right process,
which limits exploration of alternative reasoning paths. In contrast, RDoLT introduces a
broader, more flexible scoring-based feedback mechanism. Each thought node generates
multiple child nodes, allowing exploration of alternative reasoning and improving decision-
making comprehensiveness.
•Graph & Tree-Based Reasoning: Tree-based approaches, such as the Tree of Thoughts
(ToT) method, organize reasoning paths into a tree structure, allowing models to explore
multiple decision branches Yao et al. [2023a], Xie et al. [2023]. These methods are particu-
larly suited for multi-step problem-solving, where each node represents a partial solution.
However, ToT’s rigid structure prohibits modification of intermediate nodes, which can
result in a final solution that depends heavily on the initial steps. Once a branch is selected,
there is no opportunity for feedback or revision until the final answer is generated.
These methods are also costly when a long tree of thoughts is generated. Graph-based
approaches, such as Graph of Thoughts (GoT), offer more flexibility by connecting
reasoning steps as nodes within a graph, enabling multiple solution paths to be explored
concurrently Besta et al. [2024a]. This flexibility is particularly beneficial for tasks with
complex dependencies. However, the complexity of managing graph-based structures,
especially for large tasks, presents significant computational challenges.
•Thought Selection and Scoring Systems: Several methods have been developed to
guide thought selection and scoring in LLM reasoning tasks, with each method offering a
different approach to evaluating model outputs. CoT prompting has gained prominence by
improving LLMs’ reasoning through the generation of intermediate reasoning steps Chilled
and Chilled [2023]. While effective in breaking down multi-step problems, CoT treats the
intermediate reasoning process as a “black box,” meaning that no evaluation occurs at
each step before the final answer is generated. Wang et al. [2022] addresses this limitation
by generating multiple reasoning paths and then marginalizing them to select the most
consistent solution. Although this technique increases the likelihood of selecting the correct
answer and thereby improves upon CoT, it relies heavily on the assumption
that majority voting across generated reasoning paths will produce an optimal solution.
Zhou et al. [2022], on the other hand, adopt a hierarchical approach by first solving
simpler problems before tackling more complex tasks. This incremental process builds
model confidence; however, the decomposition is still treated as an automated “black
box,” meaning that mistakes in early stages are not corrected before proceeding to more
complex stages.
•LM Planning and Structured Reasoning: Long-form content generation and complex
problem-solving often require high-level planning and structured reasoning. Techniques
like natural language outliners and schemas have proven effective for these tasks, guiding
models to generate coherent, multistep outputs Mirowski et al. [2022]. These methods
have been successfully applied to diverse domains, including video games, fact-checking,
housekeeping, and code optimization, where a clear plan or outline helps models structure
their reasoning Yao et al. [2023b], Lin et al. [2020], Wang et al. [2024]. The challenge with
these approaches is that while they provide structure, they do not necessarily allow for
the revision of thoughts generated during the process. Once a plan is in place, the model
tends to follow it rigidly, without reconsidering whether earlier decisions were correct or
optimal.
Our work addresses these limitations by allowing for both structured reasoning and flexible
revisions. The decomposition of tasks based on task complexity into distinct levels—easy, in-
termediate, and final—ensures that the model progresses logically, but the feedback mechanism
embedded within the KPM allows for real-time adjustments. If an intermediate thought proves
incorrect or sub-optimal, the model can revise it without having to restart the entire reasoning
process. This dynamic combination of planning and real-time evaluation offers a more robust
approach to long-form and complex problem-solving.
3. Methods
RDoLT employs a three-stage iterative reasoning process (Easy, Intermediate, Final) to
systematically refine outputs. At each stage, candidate thoughts (T1, T2, T3, ..., Tn) are generated
and evaluated using four scoring criteria: Logical Validity, Coherence, Simplicity, and Adaptiveness.
Thoughts exceeding a predefined threshold are propagated to the next stage through the
Knowledge Propagation Module (KPM), which integrates and refines selected outputs. This
structured, score-driven approach ensures progressive enhancement of reasoning quality and
convergence towards optimal solutions. The subsequent sections provide a detailed exposition
of the framework’s stages, the scoring methodology, and the propagation mechanism.
3.1 Task Decomposition
The initial phase involves decomposing the reasoning task into three distinct levels of
gradually increasing complexity: easy, intermediate, and final. This hierarchical decomposition
goes beyond that of Zhou et al. [2022] by incorporating a more granular, human-like method of
task segmentation. Given a complex reasoning task $R$, we
decompose it into three sub-tasks, $P = \{R_{\text{easy}}, R_{\text{intermediate}}, R_{\text{final}}\}$. Each sub-task is designed to
incrementally build upon the previous one, ensuring that the model tackles simpler components
first and progressively moves to more complex reasoning. The decomposition process can be
represented as follows:
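$$R_{\text{easy}} = f_{\text{decomp}}(R, \theta_{\text{easy}}) \tag{1}$$
$$R_{\text{intermediate}} = f_{\text{decomp}}(R, \theta_{\text{intermediate}}) \tag{2}$$
$$R_{\text{final}} = f_{\text{decomp}}(R, \theta_{\text{final}}) \tag{3}$$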
The transition between these levels is not merely sequential but involves a feedback mech-
anism where the output of each level informs the subsequent level. This recursive feedback
mechanism can be defined as:
$$t_{k+1,i} = f_{\text{feedback}}(t_{k,i}, \theta_{k+1}) \tag{4}$$
where $k$ represents the current level. This feedback mechanism integrates the output of
the $k$-th level to refine the input for the $(k+1)$-th level, ensuring that knowledge and errors
identified in earlier stages are propagated and corrected in later stages.
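The data flow of Equations (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the names `decompose`, `run_levels`, and `stub_llm`, and the use of the level name in place of the parameters θ are all hypothetical.

```python
LEVELS = ("easy", "intermediate", "final")

def decompose(task, llm):
    """Eqs. (1)-(3): request one sub-task per level; the level name plays the role of theta."""
    return {level: llm(f"Write the {level} sub-task for: {task}") for level in LEVELS}

def run_levels(task, llm):
    """Eq. (4): the output of level k is placed in the prompt for level k+1."""
    subtasks = decompose(task, llm)
    carried, outputs = "none", []
    for level in LEVELS:
        carried = llm(f"{subtasks[level]}\nEarlier output: {carried}")
        outputs.append(carried)
    return outputs

# Stub model (assumption) that records prompts so the data flow can be inspected.
calls = []

def stub_llm(prompt):
    calls.append(prompt)
    return f"out{len(calls)}"

outputs = run_levels("2 + 3 * 4", stub_llm)
```

With the stub, the first three calls produce the sub-tasks and the next three solve them in order, each prompt carrying the previous level's output.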
This approach significantly reduces cognitive overload on the model and mirrors the step-
by-step approach humans naturally employ when solving complex problems. By systematically
refining thoughts and leveraging a robust scoring system, the RDoLT framework enhances the
reasoning performance of LLMs. This method is not only more nuanced but also more aligned
with human cognitive processes Wu et al. [2024], thereby improving the model’s accuracy and
consistency in solving complex reasoning tasks.
3.2 Thought Generation
Thought generation is a critical component that occurs within each of the decomposition levels:
easy, intermediate, and final. The process involves generating multiple candidate thoughts for
each task segment to ensure a diverse set of potential solutions is explored. For our framework,
we set $n$ (the number of thoughts generated per level) to three. Given a decomposed task
$R_k$ at level $k$, the thought generation process aims to produce a set of candidate thoughts
$T_k = \{t_{k1}, t_{k2}, t_{k3}\}$. Each thought is generated by the LLM based on the input $X$, the question
$Q$, and any previously generated thoughts at that level. Formally, the thought generation
process at level $k$ is represented as:
$$t_{ki} \sim p_\theta\left(t_{ki} \mid I(t_{k1}, t_{k2}, \ldots, t_{k(i-1)}, X, Q)\right), \quad \text{for } i = 1, 2, 3 \tag{5}$$
where $p_\theta$ denotes the probability distribution parameterized by $\theta$, and $I(\cdot)$ indicates that
the prompt includes all previous thoughts, task instructions $X$, and the corresponding question
$Q$. The thoughts generated at each level, $T_E = \{t_{E1}, t_{E2}, t_{E3}\}$, $T_I = \{t_{I1}, t_{I2}, t_{I3}\}$, and
$T_F = \{t_{F1}, t_{F2}, t_{F3}\}$, undergo evaluation based on predefined criteria in the subsequent scoring
step.
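A minimal sketch of the sequential generation in Equation (5), assuming a generic `llm(prompt) -> str` callable. The prompt layout in `build_prompt` (standing in for $I(\cdot)$) and the stub model are hypothetical; sampling from $p_\theta$ at a given temperature is not modeled here.

```python
def build_prompt(previous, instructions, question):
    """I(.) in Eq. (5): instructions X, question Q, and the earlier thoughts at this level."""
    lines = [f"Instructions: {instructions}", f"Question: {question}"]
    lines += [f"Previous thought {j}: {t}" for j, t in enumerate(previous, 1)]
    lines.append("Next thought:")
    return "\n".join(lines)

def generate_thoughts(llm, instructions, question, n=3):
    """Generate n thoughts, each conditioned on the thoughts produced before it."""
    thoughts = []
    for _ in range(n):
        thoughts.append(llm(build_prompt(thoughts, instructions, question)))
    return thoughts

# Stub model (assumption): names each thought after how many earlier thoughts it saw.
def stub_llm(prompt):
    return f"t{prompt.count('Previous thought') + 1}"

thoughts = generate_thoughts(stub_llm, "Solve step by step.", "What is 12 * 7?")
```

The stub makes the conditioning visible: the i-th call receives exactly i-1 earlier thoughts in its prompt.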
3.3 Scoring and Evaluation
The scoring system in our framework evaluates each generated thought at each decomposition
level using four core features: Logical Validity, Coherence, Simplicity, and Adaptiveness.
These features ensure that the selected thoughts are effective for reasoning and reflect human-
like intelligence.
•Logical Validity ($S_{\text{valid}}$) ensures the thought is logically sound. It can be represented
as a negative penalty that counts logical contradictions:
$$S_{\text{valid}}(t_{ki}) = -\sum_{r \in \text{rules}} \mathbb{I}\{\text{violates}(r, t_{ki})\} \tag{6}$$
where $\mathbb{I}$ is an indicator function that is 1 if the thought $t_{ki}$ violates a known logical rule
$r$, and 0 otherwise. Thoughts that violate more known rules or facts therefore receive a lower score.
Either a human or a model can act as the scorer for this and the following features.
•Coherence ($S_{\text{cohere}}$) measures the degree to which the thought follows from previous
thoughts, and can be defined as the average similarity between the current thought and
previous thoughts:
$$S_{\text{cohere}}(t_{ki} \mid \{t_{k1}, t_{k2}, \ldots, t_{k(i-1)}\}) = \frac{1}{i-1} \sum_{j=1}^{i-1} \text{sim}(t_{ki}, t_{kj}) \tag{7}$$
where $\text{sim}(\cdot)$ represents a similarity function (cosine similarity) between the current thought
$t_{ki}$ and the previous thoughts $t_{kj}$. Sentence similarity can be calculated using any embedding
or sentence transformer model; however, to reduce the complexity of the whole process, we
let the LLM score the thought Yao [2024].
•Simplicity ($S_{\text{simple}}$) evaluates the clarity and conciseness of a thought, inversely related
to its complexity. This can be modeled by penalizing the length or complexity of the
thought:
$$S_{\text{simple}}(t_{ki}) = -\text{complexity}(t_{ki}) \tag{8}$$
where $\text{complexity}(t_{ki})$ could represent the length of the thought or the number of steps in
reasoning, with lower complexity resulting in a higher score.
•Adaptiveness ($S_{\text{adapt}}$) assesses how well the thought aligns with the external context,
such as the task instructions $X$ and the question $Q$. It can be defined as:
$$S_{\text{adapt}}(t_{ki} \mid \{X, Q\}) = \text{sim}(t_{ki}, \{X, Q\}) \tag{9}$$
where $\text{sim}(\cdot)$ measures the similarity between the thought and the context provided by $X$
and $Q$.
The overall score for each thought $t_{ki}$ is the sum of its individual feature scores:
$$S(t_{ki}) = S_{\text{valid}}(t_{ki}) + S_{\text{cohere}}(t_{ki}) + S_{\text{simple}}(t_{ki}) + S_{\text{adapt}}(t_{ki}) \tag{10}$$
A thought is selected if its total score exceeds a predefined threshold $\tau$, or if it maximizes the
score among all thoughts in the current decomposition level:
$$t_k^{*} = \arg\max_{t_{ki}} S(t_{ki}) \tag{11}$$
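Equations (10) and (11) can be sketched as below. The sketch assumes the four feature scores have already been produced by a scorer (a human or an LLM, as described above); the class `ScoredThought`, the function `select`, the example scores, and the threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScoredThought:
    text: str
    valid: float    # S_valid, Eq. (6): zero or negative
    cohere: float   # S_cohere, Eq. (7)
    simple: float   # S_simple, Eq. (8): zero or negative
    adapt: float    # S_adapt, Eq. (9)

    @property
    def total(self) -> float:
        """Eq. (10): unweighted sum of the four feature scores."""
        return self.valid + self.cohere + self.simple + self.adapt

def select(thoughts, tau):
    """Return (thoughts whose total exceeds tau, the arg-max thought of Eq. 11)."""
    passing = [t for t in thoughts if t.total > tau]
    best = max(thoughts, key=lambda t: t.total)
    return passing, best

# Illustrative scores only; they are not taken from the paper.
thoughts = [
    ScoredThought("t1", valid=0.0, cohere=0.75, simple=-0.25, adapt=0.75),
    ScoredThought("t2", valid=-1.0, cohere=0.5, simple=-0.5, adapt=0.5),
    ScoredThought("t3", valid=0.0, cohere=0.5, simple=-0.25, adapt=0.5),
]
passing, best = select(thoughts, tau=0.5)
```

Because Equation (10) is an unweighted sum, the four features need to be on comparable scales for the threshold to be meaningful; how the scales are aligned is not specified in the text.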
In contrast to Wei et al. [2022], Zhou et al. [2022], Yao et al. [2023a], Besta et al. [2024a],
Sel et al. [2023], our framework employs a systematic, feature-based evaluation and scoring
system of thoughts at each step, ensuring not only that the highest-quality thoughts progress but also
that knowledge of weak thoughts is retained and fed to the model in subsequent decomposition steps.
This recursive, detailed scoring process—based on logical validity, coherence, simplicity, and
adaptiveness—enables a more nuanced and reliable reasoning mechanism.
Figure 2: Edge cases handled by the Knowledge Propagation Module (KPM) during thought
selection in decomposition steps. The Complete Selection scenario (left) shows all thoughts
(T1, T2, T3) selected. The Complete Rejection scenario (center) depicts all thoughts rejected,
leading to regeneration. The Mixed Selection scenario (right) highlights partial selection, with some
thoughts accepted and others rejected. These scenarios demonstrate how the KPM ensures
optimal thought progression.
3.4 Knowledge Propagation Module and Edge Case Management
The Knowledge Propagation Module (KPM) plays a crucial role in the RDoLT framework’s
reasoning process. It is responsible for managing knowledge and propagating it through the
subsequent steps of reasoning. This module ensures that the flow of information remains co-
herent and consistent across all levels of decomposition, significantly enhancing the model’s
reasoning capabilities. Furthermore, the KPM manages the execution of the system, handles
edge cases, and oversees the selection and rejection of thoughts, which is essential for maintain-
ing the framework’s overall accuracy.
The KPM tracks both selected and non-selected thoughts at each step of the reasoning
process. Selected thoughts are those that have met the threshold criteria based on the scoring
system, while non-selected thoughts (weak thoughts) are those that did not meet the required
threshold. Unlike traditional methods, which focus primarily on the immediate next step,
KPM makes this information available to all subsequent steps. For instance, thoughts selected
or rejected during the easy step are accessible in both the intermediate and final steps. This
comprehensive tracking ensures that the system retains a full understanding of the reasoning
progression from start to finish.
Mathematically, let $S^{k}_{\text{selected}}$ and $S^{k}_{\text{non-selected}}$ represent the sets of selected and non-selected
thoughts at level $k$, respectively. The KPM propagates these sets to all subsequent levels
$k+1, k+2, \ldots$, as follows:
$$\{S^{k}_{\text{selected}}, S^{k}_{\text{non-selected}}\} \rightarrow \{S^{k+1}_{\text{selected}}, S^{k+1}_{\text{non-selected}}, \ldots, S^{n}_{\text{selected}}, S^{n}_{\text{non-selected}}\} \tag{12}$$
This propagation includes maintaining a history of all thoughts and providing this history
to the reasoning framework to ensure well-informed decision-making at each step. Additionally,
the KPM includes a robust feedback mechanism. If no thought passes the threshold at any step,
the module informs the main framework to regenerate thoughts, ensuring the reasoning process
does not stall. This feedback mechanism is critical for maintaining the flow of reasoning and
preventing bottlenecks:
$$\text{If } S^{k}_{\text{selected}} = \emptyset \;\Rightarrow\; \text{regenerate thoughts at level } k \tag{13}$$
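The tracking and regeneration behavior of Equations (12) and (13) can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the class and function names, the text format returned by `context`, and the `max_attempts` cap (the text gives no limit on regeneration) are all hypothetical.

```python
class KnowledgePropagationModule:
    """Keeps selected and rejected thoughts for every level and attempt."""

    def __init__(self):
        self.history = []

    def record(self, level, scored, tau):
        """Split (text, score) pairs at the threshold and store both groups."""
        selected = [(t, s) for t, s in scored if s > tau]
        rejected = [(t, s) for t, s in scored if s <= tau]
        self.history.append({"level": level, "selected": selected, "rejected": rejected})
        return selected

    def context(self):
        """Eq. (12): the full history, made available to every later step."""
        lines = []
        for entry in self.history:
            for label in ("selected", "rejected"):
                for text, score in entry[label]:
                    lines.append(f"[{entry['level']}/{label}] {text} (score {score})")
        return "\n".join(lines)

def run_level(kpm, level, generate_and_score, tau, max_attempts=3):
    """Eq. (13): regenerate while no thought passes (max_attempts is an assumption)."""
    for _ in range(max_attempts):
        selected = kpm.record(level, generate_and_score(level, kpm.context()), tau)
        if selected:
            return selected
    return []

# Stub generator (assumption): the first attempt fails, the second succeeds.
attempts = iter([[("a", 1.0), ("b", 2.0)], [("c", 4.0), ("d", 2.5)]])
seen_contexts = []

def stub_generate_and_score(level, context):
    seen_contexts.append(context)
    return next(attempts)

kpm = KnowledgePropagationModule()
selected = run_level(kpm, "easy", stub_generate_and_score, tau=3.0)
```

In this sketch the rejected thoughts from the failed first attempt are already visible to the second attempt, which is one way to read the requirement that weak thoughts inform later generation.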
Figure 3: Detailed example illustrating how RDoLT addresses problems by (1) decomposing
them into three reasoning steps, (2) generating thoughts for each step, and (3) scoring and
propagating knowledge through a Knowledge Propagation Module (KPM) across subsequent
steps. The system performs scoring and selection at each step, ensuring that both accepted and
rejected thoughts are transmitted to enhance learning and understand why certain thoughts
were discarded.
Moreover, the KPM handles various edge cases, such as those illustrated in Figure 3. It
tracks thoughts that receive the same score and ensures appropriate handling. For instance, if
multiple thoughts pass the threshold, the module informs the model of the scores for all passing
thoughts, enabling it to prioritize them effectively. If two thoughts have identical scores, they
are given equal priority and propagated to the next step:
$$\text{If } S(t_{k1}) = S(t_{k2}) \text{ and both } t_{k1}, t_{k2} \in S^{k}_{\text{selected}} \;\Rightarrow\; t_{k1}, t_{k2} \tag{14}$$
In cases where all thoughts pass the threshold, the module provides detailed scores to help
the model utilize the thought information more effectively in subsequent steps. This systematic
approach ensures that the reasoning process remains flexible and robust, capable of handling
various scenarios without compromising the integrity of the reasoning flow. Compared to CoT,
CoT-SC, and Least2Most prompting, our KPM offers a more advanced and comprehensive
approach to managing and propagating knowledge. Traditional methods primarily focus on
sequential thought generation and consensus mechanisms without maintaining a detailed history
of thoughts or providing robust feedback. Our KPM addresses these gaps by ensuring that all
relevant information is available at every step, thereby enhancing the overall accuracy and
reliability of the reasoning process.
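One way to realize the tie handling of Equation (14) is dense ranking, where thoughts with identical scores share the same priority. The function `prioritize` and the example scores below are hypothetical; the text does not specify how priorities are encoded.

```python
def prioritize(scored, tau):
    """Rank passing (text, score) pairs; equal scores receive equal priority."""
    passing = sorted((p for p in scored if p[1] > tau), key=lambda p: -p[1])
    ranked, rank, last = [], 0, None
    for text, score in passing:
        if score != last:      # a new, lower score starts the next priority level
            rank += 1
            last = score
        ranked.append((rank, text, score))
    return ranked

# Illustrative scores only: t2 and t3 tie, t4 falls below the threshold.
ranked = prioritize([("t1", 7.0), ("t2", 9.0), ("t3", 9.0), ("t4", 4.0)], tau=5.0)
```

Both tied thoughts are propagated with priority 1, and the scores are kept alongside the ranks so that later steps can see them.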
4. Experiments
To evaluate the effectiveness of the RDoLT, we conducted a comprehensive series of experiments.
These experiments aimed to assess RDoLT’s performance across various reasoning benchmarks,
comparing it with state-of-the-art prompting methods. We meticulously selected benchmarks
and models, implemented comparative methodologies, and analyzed the main results to deter-
mine the robustness of different RDoLT variants. Additionally, we examined the impact of the
quantity of thoughts generated on success rates to understand how thought granularity influ-
ences overall performance. The following sections provide a detailed account of our experimental
setup and findings.
4.1 Benchmarks & Models Selection
Given the versatility and ease of use offered by the RDoLT framework, as well as the accessibility
of tools like LM-Studio Pourkamali and Sharifi [2024] and Ollama Ollama [2024], we conducted
experiments using four distinct open-source and open-weight LLMs: Llama-3 (8B) AI@Meta
[2024], QWEN-2 (7B) Yang et al. [2024], Gemma-2 (9B) Team [2024], and Gemma-2 (27B) Team
[2024]. Although we also utilized the OpenAI API to access ChatGPT-4o OpenAI [2024], our
primary focus was on evaluating the performance of open-source LLMs with various quantization
levels. This focus allowed us to conduct extensive experiments and explore different variations
in prompt design.
In our experiments, we maintained the temperature parameter at 0.4, striking a balance
between encouraging model creativity and ensuring consistent, reliable responses for complex
reasoning tasks. Additionally, the context length was set at 8192 tokens to maximize the model’s
ability to handle extensive input sequences. We also explored context lengths of 4096 and 2056
tokens to evaluate the impact on model performance and accuracy.
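The reported settings can be collected in a small configuration sketch. The `RunConfig` class and the full model-by-context-length grid are assumptions made for illustration; the text does not state that every combination was run, and no inference library API is implied.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    model: str
    temperature: float
    context_length: int

MODELS = ["Llama-3 (8B)", "QWEN-2 (7B)", "Gemma-2 (9B)", "Gemma-2 (27B)"]
CONTEXT_LENGTHS = [8192, 4096, 2056]   # values as reported in the text
TEMPERATURE = 0.4                      # held fixed across experiments

# Hypothetical full grid over the open-weight models and context lengths.
GRID = [RunConfig(m, TEMPERATURE, c) for m, c in product(MODELS, CONTEXT_LENGTHS)]
```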
The RDoLT framework is specifically designed to address reasoning tasks, particularly those
requiring sequential and multi-step reasoning. To thoroughly evaluate the effectiveness of our
system, we tested it against well-known benchmarks that push the limits of prompt engineering.
For mathematical reasoning, we used the GSM8K Cobbe et al. [2021], MultiArith
ChilleD [2023b], SVAMP Patel et al. [2021] and Gaokao 2023 Math Zhang et al. [2023b] bench-
marks. To assess the system’s ability to handle common-sense reasoning, we included the
LastLetterConcatenation ChilleD [2023a] benchmark in our evaluation. In total, we tested our
prompt design across five different benchmarks to assess its generalizability and compare its
performance with state-of-the-art techniques.
4.2 Method Selected for Comparison
We compare our system with Standard I/O, CoT, CoT-SC, Auto-CoT, and L2M prompting.
These methods were chosen for their prominence and varied approaches to improving
prompt accuracy. Standard I/O prompting serves as a baseline to highlight the improvements
of more advanced techniques. CoT, CoT-SC, and other CoT variants, which follow a step-by-step
reasoning structure, align with our decomposition-based approach, while L2M offers
a contrasting progressive complexity strategy. Unlike fine-tuning methods, which often excel
in domain-specific tasks but lack flexibility and generalization across diverse datasets, RDoLT
maintains adaptability. Evaluating RDoLT against these methods allows for a comprehensive
assessment of its accuracy and generalizability.
5. Main Results
We evaluated the RDoLT framework on five benchmarks (Cobbe et al. [2021], ChilleD [2023a],
Zhang et al. [2023b], ChilleD [2023b], Patel et al. [2021]), comparing it with established prompting
techniques, including CoT, CoT-SC, Least2Most, and Auto-CoT (A-CoT). The results, shown
in Table 1, demonstrate that RDoLT consistently outperforms these methods across multiple
tasks and models.
Table 1: Comprehensive evaluation of the RDoLT framework across various LLMs. The perfor-
mance of RDoLT is compared with several established methodologies, including Vanilla prompt-
ing, CoT, CoT-SC, Least-to-Most prompting, and Auto-CoT (A-CoT). The assessment encom-
passes multiple benchmarks: GSM8K, SVAMP, MultiArith, Last Letter Concatenation, and
Gaokao 2023 Math.
Benchmark / Model        Vanilla   CoT       CoT-SC    Least2Most  Auto-CoT  RDoLT
GSM8K, Cobbe et al. [2021]
  ChatGPT 4o             84.7      88.9      89.4      -           85.8      90.98↑
  LLama 3 (7B)           67.42     71.29     72.86↑    69.91       68.53     72.63
  Qwen 2 (7B)            62.2      65.3      64.8      63.6        61.8      67.4↑
  Gemma 2 (9B)           59.36     64.28     62.53     61.89       60.18     65.79↑
  Gemma 2 (27B)          60.65     75.46     76.72↑    74.94       71.83     76.58
SVAMP, Patel et al. [2021]
  ChatGPT 4o             83.5      87.3      86.7      85.9        -         89.35↑
  LLama 3 (7B)           65.79     69.54↑    68.86     67.93       66.67     69.23
  Qwen 2 (7B)            60.3      64.2      63.5      62.6        61.3      66.37↑
  Gemma 2 (9B)           58.6      63.4      61.8      60.9        64.52↑    64.19
  Gemma 2 (27B)          69.86     73.75     72.94     71.98       70.69     75.27↑
MultiArith, ChilleD [2023b]
  ChatGPT 4o             82.7      -         85.7      84.9        83.6      88.2↑
  LLama 3 (7B)           77.93     81.62     80.97     79.84       78.53     83.31↑
  Qwen 2 (7B)            58.6      62.3      61.5      60.4        59.2      64.5↑
  Gemma 2 (9B)           56.87     61.38     59.76     58.69       57.48     63.83↑
  Gemma 2 (27B)          67.96     72.73↑    71.86     70.75       68.68     72.49
LastLetterConcatenation, ChilleD [2023a]
  ChatGPT 4o             80.4      84.4      83.7      -           81.3      87.15↑
  LLama 3 (7B)           62.58     66.36     65.67     64.48       63.29     68.24↑
  Qwen 2 (7B)            57.4      61.2      60.5      59.3        58.2      63.35↑
  Gemma 2 (9B)           56.46     60.28     58.69     57.59       55.87     62.76↑
  Gemma 2 (27B)          66.67     70.48     69.58     68.39       67.29     72.93↑
Gaokao 2023 Math, Zhang et al. [2023b]
  ChatGPT 4o             -         83.2      82.5      81.6        80.3      85.64↑
  LLama 3 (7B)           60.86     64.49     63.78     62.68       61.39     66.57↑
  Qwen 2 (7B)            55.7      59.4      58.6      57.5        56.3      61.75↑
  Gemma 2 (9B)           55.27     59.19     57.58     56.47       54.79     61.68↑
  Gemma 2 (27B)          65.47     70.28↑    69.39     68.26       66.18     70.05
On the GSM8K benchmark, RDoLT achieves an accuracy of 90.98% with ChatGPT-4o, surpassing
CoT-SC (89.4%) and A-CoT (85.8%). RDoLT also shows strong performance with LLama3 (72.63%)
and Qwen2 (67.4%), indicating its adaptability across different models and architectures. For
the SVAMP benchmark, RDoLT leads with 89.35% using ChatGPT-4o, outperforming CoT (87.3%) and
CoT-SC (86.7%). CoT performs slightly better than RDoLT with LLama3 (69.54% versus 69.23%),
indicating that certain models can benefit from specific prompting strategies, although RDoLT
remains superior overall across various architectures. In the MultiArith benchmark, RDoLT
achieves 88.2% with ChatGPT-4o, showing strong results in arithmetic reasoning tasks. LLama3
(83.31%) and Qwen2 (64.5%) also perform well with RDoLT. However, in the case of the Gemma2
(27B) model, CoT produces better results (72.73% versus 72.49%), suggesting that the optimal
prompting technique can vary depending on the model and task complexity. The Last Letter
Concatenation benchmark further confirms RDoLT's effectiveness, with ChatGPT-4o reaching
87.15%. LLama3 (68.24%) and Qwen2 (63.35%) also perform well with RDoLT, reinforcing its
generalization capability across different models and tasks. Finally, in the Gaokao 2023 Math
benchmark, RDoLT achieves 85.64% with ChatGPT-4o. LLama3 (66.57%) and Qwen2 (61.75%) also show
strong results. While CoT slightly outperforms RDoLT with Gemma2 (27B) (70.28% versus 70.05%),
RDoLT remains the most consistent top performer overall.
In summary, RDoLT outperforms other prompting methods in 65% of the evaluated benchmarks.
While CoT and CoT-SC demonstrate competitive results in specific cases, RDoLT offers the most
consistent and robust performance across a diverse range of reasoning tasks and models.
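As a cross-check on how the per-row winners in Table 1 are read, the short sketch below picks the highest-scoring method for each model on the GSM8K rows. The scores are copied from Table 1; the dictionary layout and the helper name `best_method` are illustrative assumptions, and the Least2Most entry reported as "-" for ChatGPT 4o is represented as `None`.

```python
# GSM8K rows copied from Table 1; None marks an entry reported as "-".
METHODS = ["Vanilla", "CoT", "CoT-SC", "Least2Most", "Auto-CoT", "RDoLT"]

GSM8K = {
    "ChatGPT 4o":    [84.7, 88.9, 89.4, None, 85.8, 90.98],
    "LLama 3 (7B)":  [67.42, 71.29, 72.86, 69.91, 68.53, 72.63],
    "Qwen 2 (7B)":   [62.2, 65.3, 64.8, 63.6, 61.8, 67.4],
    "Gemma 2 (9B)":  [59.36, 64.28, 62.53, 61.89, 60.18, 65.79],
    "Gemma 2 (27B)": [60.65, 75.46, 76.72, 74.94, 71.83, 76.58],
}


def best_method(scores):
    """Return (method, score) with the highest reported accuracy.

    Hypothetical helper for illustration; unreported entries are skipped.
    """
    reported = [(m, s) for m, s in zip(METHODS, scores) if s is not None]
    return max(reported, key=lambda pair: pair[1])


winners = {model: best_method(scores) for model, scores in GSM8K.items()}
for model, (method, score) in winners.items():
    print(f"{model}: {method} ({score})")
```

On these rows RDoLT leads for three of the five models, while CoT-SC leads for LLama 3 and Gemma 2 (27B), which matches the arrows in Table 1.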
5.1 Robustness of RDoLT Variants
Table 2: Performance analysis of RDoLT variants across different threshold score levels: the
performance of six variants across four threshold levels. Figures marked with upward arrows (↑)
indicate peak performance for each variant. Results show that optimal thresholds vary among
variants, with most peaking at 30 or 35.

Variants                     Threshold ≥25   Threshold ≥30   Threshold ≥35   Threshold ≥40
Single-Step (Sequential)     53.12           73.47           80.78↑          65.30
Single-Step (One-Shot)       68.67           71.73↑          68.89           69.45
Few-Shots (2)                65.89           76.47           77.15↑          70.60
Multi-Requests (3)           61.12           73.24↑          72.34           74.15
Multi-Requests Unlimited*    60.24           74.51↑          73.13           72.30
The analysis of RDoLT variants across different threshold score levels reveals intriguing
patterns, each with potential implications for practical applications. The Single-Step
(Sequential) variant demonstrated the highest overall performance, peaking at 80.78% with a
threshold of ≥35. This superior performance suggests that for complex tasks, allowing more
extensive processing before making decisions yields better results. The sequential nature of
this variant likely enables a more thorough exploration of the problem space, leading to more
accurate solutions. The Multi-Step (1-Shot) and Few-Shots (2) variants both showed optimal
performance at a threshold of ≥35, with 77.51% and 77.15%, respectively. These results indicate
that these approaches benefit from moderate thresholds, striking a balance between depth of
processing and efficiency. The slightly lower performance compared to the Single-Step variant
might be attributed to the trade-off between speed and accuracy, where these methods attempt
to reach solutions more quickly but potentially at the cost of some precision.
Interestingly, the One-Shot variant achieved its best performance (71.73%) at a lower
threshold of ≥30. This suggests that this approach is more suitable for tasks requiring
quicker decision-making or where rapid responses are valued over absolute accuracy. The lower
overall performance compared to other variants implies that, while efficient, this method may
sacrifice some problem-solving depth. The Multi-Request variants, both limited (3) and
unlimited, performed optimally at the ≥30 threshold, with scores of 73.24% and 74.51%,
respectively. This pattern indicates that allowing multiple attempts at problem-solving can be
effective, but excessive iterations may not yield significant improvements. The similarity in
performance between the limited and unlimited variants suggests that there might be a natural
ceiling to the benefits of multiple attempts, beyond which additional requests do not
substantially enhance outcomes.
A noteworthy trend across most variants is the reduced performance at the highest threshold
(≥40). This drop-off points to a potential over-processing effect, where excessively high
thresholds may introduce unnecessary complexity or lead to over-fitting in the decision-making
process. This observation underscores the importance of finding the right balance in
threshold setting to optimize performance without incurring diminishing returns.
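To make the role of the threshold concrete, the sketch below filters scored thoughts against each of the four threshold levels. It assumes, in line with the worked trace in Appendix B, that a thought's total is the sum of four criterion scores (LV, COH, SIM, ADP) and that a thought is selected when its total meets the threshold. The function name `select_thoughts` and the data layout are illustrative, not the authors' implementation.

```python
# Criterion scores for three thoughts, taken from the Easy step of the
# Last Letter Concatenation trace in Appendix B.
thoughts = {
    "Thought 1": {"LV": 10, "COH": 9, "SIM": 10, "ADP": 10},
    "Thought 2": {"LV": 9, "COH": 8, "SIM": 8, "ADP": 9},
    "Thought 3": {"LV": 5, "COH": 5, "SIM": 5, "ADP": 5},
}


def select_thoughts(scored, threshold):
    """Split thoughts into (selected, rejected) by total score.

    Assumed rule: total = sum of the four criteria; select if total >= threshold.
    """
    selected, rejected = [], []
    for name, criteria in scored.items():
        total = sum(criteria.values())
        (selected if total >= threshold else rejected).append(name)
    return selected, rejected


for threshold in (25, 30, 35, 40):
    selected, rejected = select_thoughts(thoughts, threshold)
    print(f">= {threshold}: selected={selected}, rejected={rejected}")
```

With these scores (totals of 39, 34, and 20), thresholds of 25 and 30 keep Thoughts 1 and 2, a threshold of 35 keeps only Thought 1, and a threshold of 40 rejects all three. In this small example, a very high threshold leaves the step with no selected thoughts to pass on.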
5.2 Impact of Thought Quantity on Success Rates
The evaluation of the RDoLT prompt system reveals insightful findings about its performance
across different stages of problem-solving, emphasizing the number of thoughts generated per
step. Table 3 encapsulates the results of this analysis, illustrating the effectiveness of
generating 3, 5, and 7 thoughts per step.
Table 3: Step-wise problem-solving performance across different step levels (Easy, Intermediate,
Final) using varying numbers of thoughts per step (3, 5, and 7). Results show the progression of
solved problems at each thought step, the total problems solved, and the corresponding success
rates for each configuration. The 3 Thoughts Count/Step configuration is identified as the
best-performing system.

Thoughts Count/Step   Steps          T1   T2   T3   T4   T5   T6   T7   Total Solved   Success Rate %
3                     Easy           2    3    4    -    -    -    -    9              30
                      Intermediate   4    5    6    -    -    -    -    15             50
                      Final          7    8    9    -    -    -    -    24             80
                      Total          13   16   19   -    -    -    -    48             53.33↑
5                     Easy           2    3    4    5    6    -    -    20             33.33
                      Intermediate   4    5    6    7    8    -    -    30             50
                      Final          6    7    8    9    10   -    -    40             66.67
                      Total          12   15   18   21   24   -    -    90             49.75↓
7                     Easy           1    2    3    2    4    4    -    16             31.11
                      Intermediate   2    3    3    4    5    5    5    27             38.89
                      Final          1    4    6    5    6    6    6    34             46.67
                      Total          6    9    10   12   14   15   11   77             38.89↓

Figure 4: Impact of different thought selection strategies on accuracy using KPM. Strategies
include using only selected thoughts, both selected and non-selected thoughts, highest-scoring
selected thoughts, and lowest-threshold non-selected thoughts. Incorporating both selected
and non-selected thoughts yields a wider range of accuracy, while selecting only final thoughts
ensures more consistent accuracy results.

In scenarios where three thoughts were generated per step, the final step proved to be the
most effective, solving 24 out of 30 problems, resulting in an 80% success rate. This demonstrates
the crucial role of the final step, benefiting from the refined thoughts developed in earlier stages.
The intermediate step followed, solving 15 problems with a 50% success rate, while the easy step
had the lowest contribution, solving 9 problems with a 30% success rate. Overall, this variant
solved 48 problems with a success rate of 53.33%, showcasing its robustness in a structured
problem-solving environment.
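The success rates for the three-thought configuration follow directly from the solved counts. The sketch below reproduces them, assuming 30 problems per step (as stated for the final step) and therefore 90 step-level attempts in total; the variable names are illustrative.

```python
# Solved counts per step for the 3-thoughts-per-step configuration (Table 3).
PROBLEMS_PER_STEP = 30  # stated for the final step; assumed equal for all steps
solved = {"Easy": 9, "Intermediate": 15, "Final": 24}

# Per-step success rate in percent.
rates = {step: 100 * n / PROBLEMS_PER_STEP for step, n in solved.items()}

# Overall rate across all step-level attempts.
total_solved = sum(solved.values())
total_attempts = PROBLEMS_PER_STEP * len(solved)
overall = 100 * total_solved / total_attempts

for step, rate in rates.items():
    print(f"{step}: {solved[step]}/{PROBLEMS_PER_STEP} = {rate:.2f}%")
print(f"Overall: {total_solved}/{total_attempts} = {overall:.2f}%")
```

Under this assumption the per-step rates come out to 30%, 50%, and 80%, and the overall rate is 48/90, or about 53.33%, in agreement with the figures reported for this configuration.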
When the number of thoughts per step increased to five, the final step maintained its
significance, solving 40 problems with a 66.67% success rate. The intermediate step solved 30
problems, achieving a 50% success rate, while the easy step solved 20 problems, with a 33.33%
success rate. This variant solved a total of 90 problems, yielding a 60.00% success rate. These
results suggest that increasing the number of thoughts per step enhances problem-solving
capabilities, though the final step remains critical for achieving high accuracy.
Conversely, generating seven thoughts per step showed a decrease in overall performance.
The final step solved 34 problems, achieving a 46.67% success rate, the intermediate step solved
27 problems with a 38.89% success rate, and the easy step solved 16 problems with a 31.11%
success rate. The total number of problems solved in this variant was 77, with a 36.67% success
rate. This indicates that while more thoughts per step provide more options, they may also
introduce complexity that can hinder overall effectiveness.
5.3 Limitations and Future Directions
While the RDoLT framework demonstrates superior performance across multiple benchmarks,
several limitations remain. First, the generalizability of RDoLT to domain-specific tasks has
not been fully explored. The framework shows promising results in standard reasoning tasks,
but its adaptability to highly specialized domains such as legal reasoning or medical diagnostics
may require further fine-tuning and optimization. Additionally, the computational overhead of
maintaining the Knowledge Propagation Module (KPM) may limit scalability, particularly in
resource-constrained environments.
Another potential threat to validity is the reliance on benchmark datasets that may not fully
capture the complexity of real-world reasoning scenarios. Although benchmarks like GSM8K
and SVAMP are widely used, they represent structured tasks that may not account for the
diverse and dynamic nature of human reasoning in more unstructured settings.
Future research should focus on addressing these limitations by extending the framework
to more domain-specific applications and exploring optimizations that reduce computational
costs. Additionally, incorporating more diverse and challenging real-world datasets could pro-
vide a deeper evaluation of RDoLT’s capabilities. Further work could also investigate hybrid
approaches that combine the strengths of multiple prompting strategies to improve performance
across various reasoning tasks.
6. Conclusion
This paper introduced the RDoLT framework, a novel approach designed to enhance reasoning
in large language models (LLMs) through dynamic thought selection and knowledge propaga-
tion. The key innovation, the Knowledge Propagation Module (KPM), ensures that selected and
rejected thoughts are tracked and leveraged across reasoning stages, improving accuracy and re-
ducing error propagation. We evaluated RDoLT across multiple benchmarks, including GSM8K,
SVAMP, MultiArith, and Gaokao 2023 Math. Our results show that RDoLT consistently out-
performs existing methods such as Chain-of-Thought (CoT), CoT with Self-Consistency (CoT-
SC), Least2Most, and Auto-CoT (A-CoT). For instance, on GSM8K, RDoLT achieved a top
accuracy of 90.98% with ChatGPT 4o, surpassing CoT-SC’s 89.4%. Similar improvements were
observed across LLama 3, Qwen 2, and Gemma 2 models.
What sets RDoLT apart is its ability to utilize rejected thoughts, a feature absent in other
methods. This comprehensive approach enhances the reasoning process by maintaining a com-
plete view of generated thoughts, thereby improving overall decision-making. While RDoLT
performs exceptionally well on general benchmarks, further research is needed to optimize its
performance for domain-specific tasks and reduce computational overhead. Future work could
focus on more efficient knowledge propagation techniques and exploring new domains.
In summary, RDoLT offers a significant advancement in prompt engineering by improving
thought selection and knowledge propagation in LLMs. Its performance across diverse bench-
marks demonstrates its potential as a flexible and reliable framework for reasoning tasks.
7. Ethics Statement
We ensured that all datasets used in this research were properly sourced and cited. In this
study, we focused on open-source LLMs such as Llama 3, Qwen 2, and Gemma 2, which of-
fer greater transparency and accessibility, promoting reproducibility and collaboration. The
RDoLT framework improves reasoning without inherently preventing the generation of harm-
ful content, so we encourage users to implement appropriate safeguards to mitigate potential
risks. Our emphasis on open-source models aligns with our goal of supporting wider research
participation and avoiding the constraints of proprietary systems.
References
Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Ra-
hee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. Kitab: Evaluating llms on con-
straint satisfaction for information retrieval. arXiv preprint arXiv:2310.15511 , 2023.
Qingyao Ai, Ting Bai, Zhao Cao, Yi Chang, Jiawei Chen, Zhumin Chen, Zhiyong Cheng,
Shoubin Dong, Zhicheng Dou, Fuli Feng, Shen Gao, Jiafeng Guo, Xiangnan He, Yanyan
Lan, Chenliang Li, Yiqun Liu, Ziyu Lyu, Weizhi Ma, Jun Ma, Zhaochun Ren, Pengjie Ren,
Zhiqiang Wang, Mingwen Wang, Ji-Rong Wen, Le Wu, Xin Xin, Jun Xu, Dawei Yin, Peng
Zhang, Fan Zhang, Weinan Zhang, Min Zhang, and Xiaofei Zhu. Information Retrieval meets
Large Language Models: A strategic report from Chinese IR community. AI Open , 4:80–90,
2023. ISSN 2666-6510. doi: 10.1016/j.aiopen.2023.08.001.
AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/
main/MODEL_CARD.md .
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn
Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath,
Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny
Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Cather-
ine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann,
and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from
human feedback, 2022. URL https://arxiv.org/abs/2204.05862 .
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gian-
inazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten
Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Pro-
ceedings of the AAAI Conference on Artificial Intelligence , 38(16):17682–17690, 2024a. ISSN
2159-5399. doi: 10.1609/aaai.v38i16.29720. URL http://arxiv.org/abs/2308.09687,
http://dx.doi.org/10.1609/aaai.v38i16.29720.
Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils
Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi,
Ales Kubicek, Hubert Niewiadomski, Aidan O’mahony, Onur Mutlu, and Torsten Hoefler.
Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts. arXiv , 2024b.
doi: 10.48550/arxiv.2401.14295.
Tuan Bui, Oanh Tran, Phuong Nguyen, Bao Ho, Long Nguyen, Thang Bui, and Tho Quan.
Cross-data knowledge graph construction for llm-enabled educational question-answering sys-
tem: A case study at hcmut. In Proceedings of the 1st ACM Workshop on AI-Powered Q&A
Systems for Multimedia , pages 36–43, 2024.
Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye.
Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint
arXiv:2402.03744 , 2024.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language
models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128 .
ChilleD. ChilleD/LastLetterConcat ·Datasets at Hugging Face, 2023a. URL https://
huggingface.co/datasets/ChilleD/LastLetterConcat .
ChilleD. ChilleD/MultiArith ·Datasets at Hugging Face, 2023b. URL https://huggingface.
co/datasets/ChilleD/MultiArith .
Chilled and Chilled. Chilled/lastletterconcat ·datasets at hugging face. 2023. URL https:
//huggingface.co/datasets/ChilleD/LastLetterConcat .
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse,
and John Schulman. Training verifiers to solve math word problems. arXiv preprint
arXiv:2110.14168 , 2021.
Yan Ding, Xiaohan Zhang, Saeid Amiri, Nieqing Cao, Hao Yang, Andy Kaminski, Chad Es-
selink, and Shiqi Zhang. Integrating action knowledge and llms for task planning and situation
handling in open worlds. Autonomous Robots , 47(8):981–997, 2023.
Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos,
and Ahmed Hassan Awadallah. Nl-edit: Correcting semantic parse errors through natural
language interaction, 2021. URL https://arxiv.org/abs/2103.14540 .
Erkut Erdem, Menekse Kuyu, Semih Yagcioglu, Anette Frank, Letitia Parcalabescu, Barbara
Plank, Andrii Babii, Oleksii Turuta, Aykut Erdem, Iacer Calixto, et al. Neural natural
language generation: A survey on multilinguality, multimodality, controllability and learning.
Journal of Artificial Intelligence Research , 73:1131–1207, 2022.
Dehong Gao, Kaidi Chen, Ben Chen, Huangyu Dai, Linbo Jin, Wen Jiang, Wei Ning, Shanqing
Yu, Qi Xuan, Xiaoyan Cai, et al. Llms-based machine translation for e-commerce. Expert
Systems with Applications , page 125087, 2024.
Gaël Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are
not strong abstract reasoners. arXiv , 2023. doi: 10.48550/arxiv.2305.19555.
Anand Gokul. Llms and ai: Understanding its reach and impact. 2023.
Ruixin Hong, Hongming Zhang, Xiaoman Pan, Dong Yu, and Changshui Zhang. Abstraction-
of-thought makes language models better reasoners. arXiv , 2024. doi: 10.48550/arxiv.2406.
12442.
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying
Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv ,
2023. doi: 10.48550/arxiv.2310.01798.
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi.
Coderl: Mastering code generation through pretrained models and deep reinforcement learn-
ing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
Advances in Neural Information Processing Systems , volume 35, pages 21314–21328. Cur-
ran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/
2022/file/8636419dea1aa9fbd25fc4248e702da4-Paper-Conference.pdf .
Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson,
Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking llms for adaptive and reliable
medical reasoning. arXiv preprint arXiv:2406.00922 , 2024.
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi,
and Xiang Ren. CommonGen: A constrained text generation challenge for generative com-
monsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the As-
sociation for Computational Linguistics: EMNLP 2020 , pages 1823–1840, Online, November
2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165.
URL https://aclanthology.org/2020.findings-emnlp.165 .
Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and
Yejin Choi. Rainier: Reinforced knowledge introspector for commonsense question answering,
2022. URL https://arxiv.org/abs/2210.03078 .
Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang,
and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui
testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International
Conference on Software Engineering , pages 1–13, 2024.
Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Am-
manabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearn-
ing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
Advances in Neural Information Processing Systems , volume 35, pages 27591–27609. Cur-
ran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/
2022/file/b125999bde7e80910cbdbd323087df8f-Paper-Conference.pdf .
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegr-
effe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bod-
hisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and
Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https:
//arxiv.org/abs/2303.17651 .
Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang.
Llm-powered conversational voice assistants: Interaction patterns, opportunities, challenges,
and design guidelines. arXiv preprint arXiv:2309.13879 , 2023.
Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. Co-writing screen-
plays and theatre scripts with language models: An evaluation by industry professionals,
2022. URL https://arxiv.org/abs/2209.14958 .
Devon Myers, Rami Mohawesh, Venkata Ishwarya Chellaboina, Anantha Lakshmi Sathvik,
Praveen Venkatesh, Yi-Hui Ho, Hanna Henshaw, Muna Alhawawreh, David Berdik, and Yaser
Jararweh. Foundation and large language models: fundamentals, challenges, opportunities,
and social impacts. Cluster Computing , 27(1):1–26, 2024.
Ollama. Ollama: Get up and running with Llama 3.1, Mistral, Gemma 2, and other large
language models, 2024. URL https://github.com/ollama/ollama/tree/main .
OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774 .
Yoonjeong Park, Hyunjin Kim, Chanyeol Choi, Junseong Kim, and Jy-yong Sohn. Can separa-
tors improve chain-of-thought prompting? arXiv , 2024. doi: 10.48550/arxiv.2402.10645.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve
simple math word problems? In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies ,
pages 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.
18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168 .
Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert
West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations, 2024.
URL https://arxiv.org/abs/2304.01904 .
Nooshin Pourkamali and Shler Ebrahim Sharifi. Machine translation with large language models:
Prompt engineering for persian, english, and russian directions. 2024. URL https://arxiv.
org/abs/2401.08429 .
Ricardo La Rosa, Corey Hulse, and Bangdi Liu. Can github issues be solved with tree of
thoughts? arXiv , 2024. doi: 10.48550/arxiv.2405.13057.
Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko
Aizawa, and Bela Gipp. Can llms master math? investigating large language models on
math stack exchange. In Proceedings of the 47th International ACM SIGIR Conference on
Research and Development in Information Retrieval , SIGIR ’24, page 2316–2320, New York,
NY, USA, 2024. Association for Computing Machinery. ISBN 9798400704314. doi: 10.1145/
3626772.3657945. URL https://doi.org/10.1145/3626772.3657945 .
Bilgehan Sel, Ahmad Al-tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. Algorithm of
thoughts: Enhancing exploration of ideas in large language models. arXiv , 2023. doi: 10.
48550/arxiv.2308.10379.
Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen,
Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv ,
2024. doi: 10.48550/arxiv.2401.07324.
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and
Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL
https://arxiv.org/abs/2303.11366 .
Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an
analysis of cot in planning, 2024. URL https://arxiv.org/abs/2405.04776 .
Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. Learning to Repair: Repairing
model output errors after deployment using a dynamic memory of feedback. arXiv , 2021. doi:
10.48550/arxiv.2112.09737.
Gemma Team. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.
com/m/3301 .
Rayden Tseng, Suzan Verberne, and Peter van der Putten. Chatgpt as a commenter to the
news: can llms generate human-like opinions? In Multidisciplinary International Symposium
on Disinformation in Open Online Media , pages 160–174. Springer, 2023.
Michiel Van Der Meer, Enrico Liscio, Catholijn Jonker, Aske Plaat, Piek Vossen, and Pradeep
Murukannaiah. A hybrid intelligence method for argument mining. Journal of Artificial
Intelligence Research , 80:1187–1222, 2024.
Shubham Vatsal and Harsh Dubey. A survey of prompt engineering methods in large language
models for different nlp tasks, 2024. URL https://arxiv.org/abs/2407.12994 .
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in lan-
guage models. In ICLR , 2022. URL http://arxiv.org/abs/2203.11171 .
Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe,
explain, plan and select: Interactive planning with large language models enables open-world
multi-task agents, 2024. URL https://arxiv.org/abs/2302.01560 .
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi,
Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language
models. arXiv , 2022. doi: 10.48550/arxiv.2201.11903.
Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and
Yejin Choi. Generating sequences by learning to self-correct, 2022. URL https://arxiv.
org/abs/2211.00053 .
Yike Wu, Jiatao Zhang, Nan Hu, LanLing Tang, Guilin Qi, Jun Shao, Jie Ren, and Wei Song.
Mldt: Multi-level decomposition for complex long-horizon robotic task planning with open-
source large language model, 2024. URL https://arxiv.org/abs/2403.18760 .
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe
Xie. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/
2305.00633 .
Frank Xing. Designing heterogeneous llm agents for financial sentiment analysis. ACM Trans-
actions on Management Information Systems , 2024.
Qing Xue. Unlocking the potential: A comprehensive exploration of large language models in
natural language processing. Applied and Computational Engineering , 57(1):247–252, 2024.
ISSN 2755-2721. doi: 10.54254/2755-2721/57/20241341.
Archna Balkrishna Yadav. Generative ai in the era of transformers: Revolutionizing
natural language processing with llms. Journal of Image Processing and Intelligent Re-
mote Sensing , (42):54–61, 2024. ISSN 2815-0953. doi: 10.55529/jipirs.42.54.61. URL
https://pdfs.semanticscholar.org/4029/9c8515c22705f8e42acddf0ade59c754f350.pdf.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong
Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin
Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen,
Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize
Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu
Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei,
Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu,
Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical
report, 2024. URL https://arxiv.org/abs/2407.10671 .
Liang Yao. Large language models are contrastive reasoners. arXiv , 2024. doi: 10.48550/arxiv.
2403.08211.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik
Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.
Advances in Neural Information Processing Systems , 36, 2023a. ISSN 10495258. URL https:
//arxiv.org/abs/2305.10601v2 .
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models, 2023b. URL https:
//arxiv.org/abs/2210.03629 .
Ping Yu, Hua Xu, Xia Hu, and Chao Deng. Leveraging generative ai and large language models:
a comprehensive roadmap for healthcare integration. In Healthcare , volume 11, page 2776.
MDPI, 2023.
Haopeng Zhang, Philip S Yu, and Jiawei Zhang. A systematic survey of text summarization:
From statistical methods to large language models. arXiv preprint arXiv:2406.11289 , 2024.
Jintian Zhang, Xin Xu, and Shumin Deng. Exploring collaboration mechanisms for llm agents:
A social psychology view. arXiv preprint arXiv:2310.02124 , 2023a.
Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating
the performance of large language models on gaokao benchmark. 2023b.
Bo Zhou, Daniel Geißler, and Paul Lukowicz. Misinforming llms: vulnerabilities, challenges and
opportunities, 2024. URL https://arxiv.org/abs/2408.01168 .
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuur-
mans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables
complex reasoning in large language models. The International Conference on Learning Rep-
resentations (ICLR) , 2022. URL http://arxiv.org/abs/2205.10625 .
A. Complementary Results
Generally, RDoLT requires only three requests and corresponding prompts: Task Decomposition,
Thought Generation, and Thoughts Evaluation. In the following section, we present
some results that demonstrate the generalizability of RDoLT across a variety of benchmarks.
B. Last Letter Concatenation
Task : " Take the last letter of each word in the sentence : ’Artificial
intelligence is the future ’ and concatenate them to form a new string ."
Decomposition [ Easy ]
Description : Extract the last letter of each word in the sentence .
Thought 1: Identify the words in the sentence : " Artificial ", " intelligence ", "
is", " the", " future ".
Thought 2: Extract the last letter of each word : "l", "e", "s", "e", "e".
Thought 3: Check if each word is correctly identified and its last letter is
accurately extracted .
Scores :
- Thought 1: LV: 10, COH: 9, SIM: 10, ADP: 10, Total : 39
- Thought 2: LV: 9, COH: 8, SIM: 8, ADP: 9, Total : 34
- Thought 3: LV: 5, COH: 5, SIM: 5, ADP: 5, Total : 20
20
Page 21:
RDoLT Reasoning
KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Decomposition [Intermediate]
Description: Combine the extracted letters into a single string.
Thought 1: Concatenate the letters to form the string "lesee".
Thought 2: Verify if the concatenated string represents the order of the words
in the original sentence.
Thought 3: Check for any errors in the concatenation process.
Scores:
- Thought 1: LV: 10, COH: 9, SIM: 9, ADP: 10, Total: 38
- Thought 2: LV: 8, COH: 7, SIM: 8, ADP: 8, Total: 31
- Thought 3: LV: 4, COH: 5, SIM: 4, ADP: 5, Total: 18
KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Decomposition [Final]
Description: Verify the final output and ensure all steps have been completed
correctly.
Thought 1: Review the concatenated string "lesee" to ensure it matches the task
requirements.
Thought 2: Validate that the process of extracting last letters and
concatenating them followed the proper sequence.
Thought 3: Finalize the result as the correct output for the given task.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
- Thought 2: LV: 7, COH: 7, SIM: 7, ADP: 7, Total: 28
- Thought 3: LV: 4, COH: 4, SIM: 4, ADP: 4, Total: 16
Final Answer: The final concatenated string is "lesee".
KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Selected Thoughts (Final): Thought 1
Rejected Thoughts (Final): Thought 2, Thought 3
Final Answer: "lesee"
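The final answer of this trace can be reproduced deterministically. The snippet below is a small check rather than part of RDoLT; the helper name `last_letter_concat` is hypothetical, and it assumes words are separated by whitespace with no trailing punctuation.

```python
def last_letter_concat(sentence: str) -> str:
    """Concatenate the last letter of every whitespace-separated word."""
    words = sentence.split()
    return "".join(word[-1] for word in words)


result = last_letter_concat("Artificial intelligence is the future")
print(result)  # lesee
```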
C. GSM8K
Task: "Toula went to the bakery and bought various types of pastries. She
bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which
cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen. How much
was the total cost?"
Decomposition [Easy]
Description: Break down the task by identifying each type of pastry and
calculating the total cost for each one individually.
Thought 1: Identify the total cost of the donuts by multiplying 3 dozen by $68
per dozen.
Thought 2: Identify the total cost of the mini cupcakes by multiplying 2 dozen
by $80 per dozen.
Thought 3: Identify the total cost of the mini cheesecakes by multiplying 6
dozen by $55 per dozen.
Scores:
- Thought 1: LV: 10, COH: 9, SIM: 10, ADP: 10, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 6, COH: 6, SIM: 6, ADP: 7, Total: 25
KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Decomposition [Intermediate]
Description: Calculate the cost for each pastry and find the sum of all the
items.
Thought 1: The total cost of the donuts is 3 × 68 = $204.
Thought 2: The total cost of the mini cupcakes is 2 × 80 = $160.
Thought 3: The total cost of the mini cheesecakes is 6 × 55 = $330.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 6, COH: 7, SIM: 7, ADP: 7, Total: 27
KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Decomposition [Final]
Description: Calculate the total cost by adding up the costs from the previous
steps.
Thought 1: Add the total cost of the donuts, mini cupcakes, and mini
cheesecakes to get the final total: 204 + 160 + 330.
Thought 2: Verify that the calculation for each item was done correctly.
Thought 3: Double-check the total to make sure there were no mistakes in the
addition.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 10, ADP: 10, Total: 40
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 7, COH: 7, SIM: 7, ADP: 7, Total: 28
Final Answer: The total cost is $694.
KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Selected Thoughts (Final): Thought 1, Thought 2
Rejected Thoughts (Final): Thought 3
Final Answer: $694
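The arithmetic in this trace is easy to confirm independently. The following check is illustrative only; the dictionary layout and variable names are assumptions introduced here, and the quantities and prices are copied from the task statement.

```python
# (dozens bought, price per dozen in dollars), taken from the task statement.
orders = {
    "donuts": (3, 68),
    "mini cupcakes": (2, 80),
    "mini cheesecakes": (6, 55),
}

# Per-item subtotals, mirroring the Intermediate step of the trace.
subtotals = {item: dozens * price for item, (dozens, price) in orders.items()}

# Grand total, mirroring the Final step.
total = sum(subtotals.values())

print(subtotals)
print(total)  # 694
```

The subtotals are 204, 160, and 330, which agree with the three Intermediate thoughts, and their sum agrees with the final answer.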
D. MMLU
Task: "Let p = (1, 2, 5, 4)(2, 3) in S_5. Find the index of <p> in S_5. Choices:
['8', '2', '24', '120']"
Decomposition [Easy]
Description: Break down the task by identifying and analyzing the given
permutation structure in S_5.
Thought 1: Recognize that p is a product of disjoint cycles, where (1, 2, 5, 4)
is a 4-cycle and (2, 3) is a 2-cycle.
Thought 2: Identify that the order of a permutation is the least common
multiple (LCM) of the lengths of the disjoint cycles.
Thought 3: Calculate the order of p using the LCM of 4 and 2.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 10, ADP: 9, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 8, Total: 35
- Thought 3: LV: 6, COH: 7, SIM: 7, ADP: 6, Total: 26
KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Decomposition [Intermediate]
Description: Analyze the structure of the group <p> and its implications for
the index in S_5.
Thought 1: The subgroup <p> is generated by p, and the order of p is the least
common multiple (LCM) of 4 and 2, which is 4.
Thought 2: Calculate the index of <p> in S_5 as 120 / 4 = 30 (incorrect
thought).
Thought 3: Recognize that the correct formula for the index of a cyclic
subgroup in a symmetric group is 120 / 4 = 30.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 9, Total: 38
- Thought 2: LV: 5, COH: 6, SIM: 5, ADP: 5, Total: 21
- Thought 3: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 3
Rejected Thoughts (Intermediate): Thought 2
Decomposition [Final]
Description: Use all previous knowledge to determine the correct index of <p>
in S_5.
Thought 1: The order of S_5 is 120, and the order of p is 4, so the index of
<p> in S_5 is 30.
Thought 2: Verify if the solution aligns with the structure of S_5 and correct
subgroup orders.
Thought 3: Conclude that the index of <p> in S_5 is 2 based on the correct
analysis of p's structure.
Scores:
- Thought 1: LV: 7, COH: 8, SIM: 7, ADP: 7, Total: 29
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
Final Answer: The index of <p> in S_5 is 2.
KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 3
Rejected Thoughts (Intermediate): Thought 2
Selected Thoughts (Final): Thought 2, Thought 3
Rejected Thoughts (Final): Thought 1
Final Answer: 2
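The group-theoretic quantities in this trace can be checked by direct computation. The sketch below is an independent check, not part of RDoLT, and its helper names (`cycle_to_map`, `compose`, `cycle_lengths`) are hypothetical. The cycles (1, 2, 5, 4) and (2, 3) share the element 2, so they are not disjoint; the sketch therefore composes them first (right to left, an assumed convention) and reads the order from the cycle type of the product. Under either composition order the product is a single 5-cycle, so the order of p is 5 and the index of <p> in S_5 is 120 / 5 = 24. This corresponds to the choice '24' and differs from the value reported in the trace.

```python
from math import factorial, lcm  # math.lcm requires Python 3.9+


def cycle_to_map(cycle, n):
    """Return the permutation of {1..n} given by a single cycle, as a dict."""
    mapping = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        mapping[a] = b
    return mapping


def compose(f, g):
    """Return the composition 'apply g first, then f'."""
    return {x: f[g[x]] for x in g}


def cycle_lengths(perm):
    """Return the lengths of the disjoint cycles of perm (fixed points count as 1)."""
    seen, lengths = set(), []
    for start in perm:
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        lengths.append(length)
    return lengths


n = 5
p = compose(cycle_to_map((1, 2, 5, 4), n), cycle_to_map((2, 3), n))
order = lcm(*cycle_lengths(p))   # order of p = LCM of its disjoint cycle lengths
index = factorial(n) // order    # |S_5| / |<p>|

print(cycle_lengths(p), order, index)
```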