arxiv

Paper 2501.02026

Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models

Authors: Kaleem Ullah Qasim, Jiashu Zhang, Tariq Alsahfi, Ateeq Ur Rehman Butt

Published: 2025-01-03

Abstract:

Enhancing the reasoning capabilities of Large Language Models remains a critical challenge in artificial intelligence. We introduce RDoLT, Recursive Decomposition of Logical Thought prompting, a novel framework that significantly boosts LLM reasoning performance. RDoLT is built on three key innovations: (1) recursively breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning by keeping track of strong and weak thoughts for information propagation. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao2023 Math. The results demonstrate that RDoLT consistently outperforms existing state-of-the-art techniques, achieving a 90.98 percent accuracy on GSM8K with ChatGPT-4, surpassing state-of-the-art techniques by 6.28 percent. Similar improvements were observed on other benchmarks, with accuracy gains ranging from 5.5 percent to 6.75 percent. These findings highlight RDoLT's potential to advance prompt engineering, offering a more effective and generalizable approach to complex reasoning tasks.

Paper Content:

Kaleem Ullah Qasim (kaleem@my.swjtu.edu.cn), Southwest Jiaotong University
Zhang Jiashu (jszhang@home.swjtu.edu.cn), Southwest Jiaotong University
Tariq Alsahfi (tmalsahfi@uj.edu.sa), University of Jeddah
Ateeq Ur Rehman Butt (ateeqbutt13@live.com), National Textile University

arXiv:2501.02026v1 [cs.CL], 3 Jan 2025

1. Introduction

Large Language Models (LLMs) have made significant strides in natural language understanding Xue [2024], Myers et al. [2024] and text generation Yadav [2024], Erdem et al. [2022], enabling advancements in applications such as machine translation Yadav [2024], Gao et al. [2024], question-answering systems Bui et al. [2024], Li et al. [2024], information retrieval Abdin et al. [2023], Ai et al. [2023], Van Der Meer et al. [2024], Erdem et al. [2022], conversational agents Mahmood et al. [2023], Xing [2024], Zhang et al. [2023a], and text summarization Zhang et al. [2024]. These models, trained on vast datasets, can generate human-like text Tseng et al. [2023], Liu et al. [2024] and produce sophisticated responses, positioning them as key players in industries such as healthcare, education, and legal services Yu et al. [2023], Gokul [2023]. However, despite these achievements, LLMs still struggle with complex reasoning tasks Gendron et al. [2023], Huang et al. [2023], Shen et al. [2024], Lin et al. [2020], such as mathematical problem-solving Satpute et al. [2024], arithmetic, and common sense reasoning Zhou et al. [2024], Lin et al. [2020]. These tasks require not only language comprehension but also the application of logical steps Hong et al. [2024], deeper planning Ding et al. [2023], and extensive thought exploration Besta et al. [2024b] to form consistent and accurate answers. LLMs often fall short in these areas, generating incorrect or hallucinated responses Rosa et al. [2024], Chen et al. [2024], which underscores the need for enhanced techniques to improve reasoning capabilities.

To address these challenges, prompting techniques Vatsal and Dubey [2024] such as Chain-of-Thought (CoT) Wei et al. [2022] have been developed to guide LLMs in generating intermediate reasoning steps before arriving at the final answer. CoT improves reasoning by generating intermediate solutions with "step-by-step" reasoning, thereby facilitating better logical progression. CoT has been further refined into Chain-of-Thought Self-consistency (CoT-SC) Wang et al. [2022], which asks models to generate consistent thought chains and derives a final answer through majority voting over the generated thoughts. Least-to-Most (L2M) Zhou et al. [2022] takes a different approach to complex problems: it solves the simplest aspects first and gradually progresses to the final solution, so that answers to the simple aspects help the model understand the overall query. These methods have shown promise but still face significant limitations Rosa et al. [2024], particularly in thought selection and in the scoring of intermediate thoughts Stechly et al. [2024].

Existing techniques have significantly advanced the reasoning capabilities of LLMs, yet they continue to exhibit critical shortcomings when applied to complex reasoning tasks. CoT operates by generating a sequential reasoning process, breaking tasks into intermediate steps to arrive at a final answer. While this approach aids structured reasoning, it suffers from a notable limitation: if an incorrect intermediate thought is generated, it propagates to subsequent steps, compounding errors and leading to inaccurate conclusions Stechly et al. [2024]. CoT-SC Wang et al. [2022] attempts to address this issue by generating multiple reasoning paths and using majority voting to determine the final answer. However, this method can be problematic when only two reasoning paths are generated or when the majority vote overlooks correct but rare outliers, leading to potentially sub-optimal decisions. The design of L2M Zhou et al. [2022], which scaffolds learning by progressing from simple to more complex sub-tasks, also faces the problem of error propagation: if incorrect conclusions are drawn in the initial simpler steps, these errors can cascade and affect the overall solution. Moreover, these methods fail to fully integrate their respective strengths.
For example, while CoT focuses on logical breakdown, it lacks the flexibility needed to adapt to real-time complexity, and Least2Most, despite its step-wise progression, struggles to scale effectively to more nuanced problems. Park et al. [2024] introduce an alternative approach that improves reasoning by adding separators between thoughts to encourage clearer delineation between reasoning stages. Another line of work seeks to combine reasoning with action by having models alternate between reasoning steps and taking actions (such as querying external knowledge sources) during the thought process. While this mitigates some issues related to thought propagation, its reliance on external actions introduces new challenges, such as dependency on the availability of accurate external knowledge.

In response to these persistent limitations, we propose a novel prompting technique called Recursive Decomposition of Logical Thoughts (RDoLT), designed to address the key shortcomings of previous methods by integrating a more dynamic and recursive structure into the reasoning process. RDoLT enhances traditional CoT and CoT-SC methodologies by breaking down tasks into three distinct levels of complexity (easy, intermediate, and final) while incorporating a robust thought evaluation and scoring system, as shown in Figure 1. Each stage generates multiple thoughts, which are assessed and scored on four features: Logical Validity, Coherence, Simplicity, and Adaptiveness. These features allow RDoLT to evaluate thoughts not only on their immediate correctness but also on their alignment with the overall task, their clarity, and their flexibility in different contexts.

Crucially, RDoLT introduces a Knowledge Propagation Module (KPM), a novel mechanism that tracks both selected and rejected thoughts throughout the reasoning process. By storing and propagating information about rejected thoughts (classified as "weak"), RDoLT ensures that potentially valuable ideas are not lost prematurely. This allows the system to revisit rejected thoughts when they become relevant in later stages of reasoning, minimizing the risk of missing out on correct but initially discarded solutions. Unlike previous approaches that discard non-majority reasoning paths, RDoLT's KPM continuously refines its understanding by considering the full spectrum of thoughts generated.

Figure 1: Illustration of the Recursive Decomposition of Logical Thoughts (RDoLT) framework. Each yellow box represents the task decomposed into Easy, Intermediate, and Final tiers. Each tier generates multiple thoughts (T1, T2, T3) evaluated on four features (logical validity, coherence, simplicity, and adaptiveness). Thoughts meeting the threshold criteria (green) are propagated to the next tier via the Knowledge Propagation Module (KPM), which tracks selected (strong) and rejected (weak) thoughts to inform future evaluations.

Our contributions can be summarized through three primary innovations:

(1) Task Decomposition: RDoLT decomposes reasoning tasks into three levels (easy, intermediate, and final), allowing for a more structured exploration of task complexity. This differs from the rigid structure of Least2Most prompting Zhou et al. [2022], as RDoLT does not prioritize simpler tasks but instead decomposes tasks based on logically progressive complexity.

(2) Thought Scoring System: RDoLT evaluates thoughts using a four-feature system (Logical Validity, Coherence, Simplicity, and Adaptiveness), ensuring that thoughts are selected not just for their correctness but also for their ability to support consistent and flexible reasoning across different contexts.

(3) Knowledge Propagation Module (KPM): The KPM tracks both successful (strong) and rejected (weak) thoughts, allowing for dynamic re-evaluation and adaptation throughout the reasoning process. This dual-path system mitigates the issues of premature rejection and over-reliance on majority voting seen in CoT-SC Wang et al. [2022].

Our research investigates the following key questions:

• RQ1: How does RDoLT compare to existing prompting methods in terms of reasoning performance on complex tasks?
• RQ2: What impact does the novel thought selection system have on the accuracy and efficiency of LLM reasoning?
• RQ3: How effective is the Knowledge Propagation Module (KPM) in utilizing both selected and rejected thoughts to enhance reasoning outcomes?

Empirical evaluations demonstrate that RDoLT significantly improves LLM performance on a range of reasoning benchmarks, addressing key challenges identified in earlier prompting methods. By synthesizing insights from CoT, CoT-SC, Least2Most, and other techniques, RDoLT offers a comprehensive framework that enhances the logical reasoning capabilities of LLMs, providing a promising path forward in prompt engineering research. All code and results will be open-sourced upon publication.

2. Related Work

• Feedback Guided Thought Generation: Human feedback has been shown to enhance the performance of LLMs by providing an external evaluation that can refine model outputs Tandon et al. [2021], Elgohary et al. [2021], Bai et al. [2022]. However, this feedback is often expensive and difficult to incorporate into an automated process. Consequently, researchers have begun to replace human feedback with heuristic functions, which serve as a more scalable solution Liu et al. [2022], Lu et al. [2022], Le et al. [2022], Welleck et al. [2022]. Recent advancements have introduced self-reflective mechanisms in which models generate their own feedback to assess and improve outputs Madaan et al. [2023], Shinn et al. [2023], Paul et al. [2024]. These techniques are especially beneficial for code generation and other multi-step tasks, as seen in Chen et al. [2023], which utilizes execution results for refinement.
However, these approaches typically follow a linear left-to-right process, which limits exploration of alternative reasoning paths. In contrast, RDoLT introduces a broader, more flexible scoring-based feedback mechanism. Each thought node generates multiple child nodes, allowing exploration of alternative reasoning and making decisions more comprehensive.

• Graph & Tree-Based Reasoning: Tree-based approaches, such as the Tree of Thoughts (ToT) method, organize reasoning paths into a tree structure, allowing models to explore multiple decision branches Yao et al. [2023a], Xie et al. [2023]. These methods are particularly suited for multi-step problem-solving, where each node represents a partial solution. However, ToT's rigid structure prohibits modification of intermediate nodes, which can result in a final solution that depends heavily on the initial steps. Once a branch is selected, there is no opportunity for feedback or revision until the final answer is generated. These methods are also very cost-intensive if a long tree of thoughts is generated. Graph-based approaches, such as Graph of Thoughts (GoT), offer more flexibility by connecting reasoning steps as nodes within a graph, enabling multiple solution paths to be explored concurrently Besta et al. [2024a]. This flexibility is particularly beneficial for tasks with complex dependencies. However, the complexity of managing graph-based structures, especially for large tasks, presents significant computational challenges.

• Thought Selection and Scoring Systems: Several methods have been developed to guide thought selection and scoring in LLM reasoning tasks, each offering a different approach to evaluating model outputs. CoT prompting has gained prominence by improving LLMs' reasoning through the generation of intermediate reasoning steps Chilled and Chilled [2023]. While effective in breaking down multi-step problems, CoT treats the intermediate reasoning process as a "black box," meaning that no evaluation occurs at each step before the final answer is generated. CoT-SC Wang et al. [2022] addresses this limitation by generating multiple reasoning paths and then marginalizing over them to select the most consistent solution. This technique increases the likelihood of selecting the correct answer. However, although CoT-SC improves upon CoT's limitations, it relies heavily on the assumption that majority voting across generated reasoning paths will produce an optimal solution. L2M Zhou et al. [2022], on the other hand, adopts a hierarchical approach by first solving simpler problems before tackling more complex tasks. This incremental process builds model confidence; however, the decomposition is still treated as an automated "black box," meaning that mistakes in early stages are not corrected before proceeding to more complex stages.

• LM Planning and Structured Reasoning: Long-form content generation and complex problem-solving often require high-level planning and structured reasoning. Techniques like natural language outliners and schemas have proven effective for these tasks, guiding models to generate coherent, multistep outputs Mirowski et al. [2022]. These methods have been successfully applied to diverse domains, including video games, fact-checking, housekeeping, and code optimization, where a clear plan or outline helps models structure their reasoning Yao et al. [2023b], Lin et al. [2020], Wang et al. [2024]. The challenge with these approaches is that while they provide structure, they do not necessarily allow thoughts generated during the process to be revised. Once a plan is in place, the model tends to follow it rigidly, without reconsidering whether earlier decisions were correct or optimal.

Our work addresses these limitations by allowing for both structured reasoning and flexible revisions.
The decomposition of tasks by complexity into distinct levels (easy, intermediate, and final) ensures that the model progresses logically, while the feedback mechanism embedded within the KPM allows for real-time adjustments. If an intermediate thought proves incorrect or sub-optimal, the model can revise it without having to restart the entire reasoning process. This dynamic combination of planning and real-time evaluation offers a more robust approach to long-form and complex problem-solving.

3. Methods

RDoLT employs a three-stage iterative reasoning process (Easy, Intermediate, Final) to systematically refine outputs. At each stage, candidate thoughts ($T_1, T_2, T_3, \ldots, T_n$) are generated and evaluated using four scoring criteria: Logical Validity, Coherence, Simplicity, and Adaptiveness. Thoughts exceeding a predefined threshold are propagated to the next stage through the Knowledge Propagation Module (KPM), which integrates and refines selected outputs. This structured, score-driven approach is intended to improve reasoning quality progressively and to converge towards better solutions. The subsequent sections describe the framework's stages, the scoring methodology, and the propagation mechanism in detail.

3.1 Task Decomposition

The initial phase involves decomposing the reasoning task into three distinct levels based on gradual and progressive complexity: easy, intermediate, and final. This hierarchical decomposition goes beyond that of Least-to-Most prompting (Zhou et al. [2022]) by incorporating a more granular, human-like method of task segmentation. Given a complex reasoning task $R$, we decompose it into three sub-tasks, $P = \{R_{\text{easy}}, R_{\text{intermediate}}, R_{\text{final}}\}$. Each sub-task is designed to build incrementally upon the previous one, ensuring that the model tackles simpler components first and progressively moves to more complex reasoning.
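The three-level decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SubTask`, `LEVELS`, and `decompose`, as well as the wording of the per-level instructions, are hypothetical and do not come from the paper, which does not publish its prompts in this section. In the paper the decomposition is performed by the model itself, whereas this sketch uses fixed templates so that it runs without an LLM.

```python
from dataclasses import dataclass

# The three tiers named in Section 3.1, in the order they are solved.
LEVELS = ("easy", "intermediate", "final")


@dataclass
class SubTask:
    level: str   # one of LEVELS
    prompt: str  # text that would be sent to the model for this tier


def decompose(task: str) -> list[SubTask]:
    """Build one sub-task per level. The instruction wording is a placeholder."""
    instructions = {
        "easy": "Identify the given quantities and the simplest first step.",
        "intermediate": "Combine the earlier results into the main computation.",
        "final": "Derive and state the final answer.",
    }
    return [SubTask(level, f"{instructions[level]}\nTask: {task}") for level in LEVELS]


subtasks = decompose("A shop sells 3 pens for $2. How much do 12 pens cost?")
for sub in subtasks:
    print(sub.level, "->", sub.prompt.splitlines()[0])
```

Keeping the levels in an ordered tuple makes the easy, intermediate, final progression explicit, so later stages can iterate over the sub-tasks in order.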
The decomposition process can be represented as follows:

$$R_{\text{easy}} = f_{\text{decomp}}(R, \theta_{\text{easy}}) \tag{1}$$

$$R_{\text{intermediate}} = f_{\text{decomp}}(R, \theta_{\text{intermediate}}) \tag{2}$$

$$R_{\text{final}} = f_{\text{decomp}}(R, \theta_{\text{final}}) \tag{3}$$

The transition between these levels is not merely sequential but involves a feedback mechanism in which the output of each level informs the subsequent level. This recursive feedback mechanism can be defined as:

$$t_{k+1,i} = f_{\text{feedback}}(t_{k,i}, \theta_{k+1}) \tag{4}$$

where $k$ represents the current level. This feedback mechanism integrates the output of the $k$-th level to refine the input for the $(k+1)$-th level, ensuring that knowledge and errors identified in earlier stages are propagated and corrected in later stages.

This approach significantly reduces cognitive overload on the model and mirrors the step-by-step approach humans naturally employ when solving complex problems. By systematically refining thoughts and leveraging a robust scoring system, the RDoLT framework enhances the reasoning performance of LLMs. This method is not only more nuanced but also more aligned with human cognitive processes Wu et al. [2024], thereby improving the model's accuracy and consistency in solving complex reasoning tasks.

3.2 Thought Generation

Thought generation is a critical component that occurs within each of the decomposition levels: easy, intermediate, and final. The process involves generating multiple candidate thoughts for each task segment to ensure that a diverse set of potential solutions is explored. For our framework, we set $n$ (the number of thoughts generated per level) to three. Given a decomposed task $R_k$ at level $k$, the thought generation process aims to produce a set of candidate thoughts $T_k = \{t_{k1}, t_{k2}, t_{k3}\}$. Each thought is generated by the LLM based on the input $X$, the question $Q$, and any previously generated thoughts at that level. Formally, the thought generation process at level $k$ is represented as:

$$t_{ki} \sim p_\theta\big(t_{ki} \mid I(t_{k1}, t_{k2}, \ldots, t_{k(i-1)}, X, Q)\big), \quad \text{for } i = 1, 2, 3 \tag{5}$$

where $p_\theta$ denotes the probability distribution parameterized by $\theta$, and $I(\cdot)$ indicates that the prompt includes all previous thoughts, the task instructions $X$, and the corresponding question $Q$. The thoughts generated at each level, $T_E = \{t_{E1}, t_{E2}, t_{E3}\}$, $T_I = \{t_{I1}, t_{I2}, t_{I3}\}$, and $T_F = \{t_{F1}, t_{F2}, t_{F3}\}$, undergo evaluation based on predefined criteria in the subsequent scoring step.

3.3 Scoring and Evaluation

The scoring system in our framework evaluates each generated thought at each decomposition level using four core features: Logical Validity, Coherence, Simplicity, and Adaptiveness. These features ensure that the selected thoughts are effective for reasoning and reflect human-like intelligence.

• Logical Validity ($S_{\text{valid}}$) ensures the thought is logically sound. This can be represented as a negative penalty based on logical contradictions:

$$S_{\text{valid}}(t_{ki}) = -\sum_{r \in \text{rules}} \mathbb{I}\{\text{violates}(r, t_{ki})\} \tag{6}$$

where $\mathbb{I}$ is an indicator function that is 1 if the thought $t_{ki}$ violates a known logical rule $r$, and 0 otherwise. Thoughts that violate known rules or facts therefore receive a lower score. Either a human or a model can act as the scorer for this and the following features.

• Coherence ($S_{\text{cohere}}$) measures the degree to which the thought follows from previous thoughts, and can be defined as a similarity measure between the current thought and previous thoughts:

$$S_{\text{cohere}}(t_{ki} \mid \{t_{k1}, t_{k2}, \ldots, t_{k(i-1)}\}) = \frac{1}{i-1} \sum_{j=1}^{i-1} \text{sim}(t_{ki}, t_{kj}) \tag{7}$$

where $\text{sim}(\cdot)$ represents a similarity function (cosine similarity) between the current thought $t_{ki}$ and the previous thoughts $t_{kj}$. Sentence similarity can be calculated using any embedding or sentence-transformer model; however, to reduce the complexity of the whole process, we let the LLM score the thought Yao [2024].

• Simplicity ($S_{\text{simple}}$) evaluates the clarity and conciseness of a thought, inversely related to its complexity.
This can be modeled by penalizing the length or complexity of the thought:

$$S_{\text{simple}}(t_{ki}) = -\text{complexity}(t_{ki}) \tag{8}$$

where $\text{complexity}(t_{ki})$ could represent the length of the thought or the number of steps in reasoning, with lower complexity resulting in a higher score.

• Adaptiveness ($S_{\text{adapt}}$) assesses how well the thought aligns with the external context, such as the task instructions $X$ and the question $Q$. It can be defined as:

$$S_{\text{adapt}}(t_{ki} \mid \{X, Q\}) = \text{sim}(t_{ki}, \{X, Q\}) \tag{9}$$

where $\text{sim}(\cdot)$ measures the similarity between the thought and the context provided by $X$ and $Q$.

The overall score for each thought $t_{ki}$ is the sum of its individual feature scores:

$$S(t_{ki}) = S_{\text{valid}}(t_{ki}) + S_{\text{cohere}}(t_{ki}) + S_{\text{simple}}(t_{ki}) + S_{\text{adapt}}(t_{ki}) \tag{10}$$

A thought is selected if its total score exceeds a predefined threshold $\tau$, or if it maximizes the score among all thoughts in the current decomposition level:

$$t_k^{*} = \arg\max_{t_{ki}} S(t_{ki}) \tag{11}$$

In contrast to Wei et al. [2022], Zhou et al. [2022], Yao et al. [2023a], Besta et al. [2024a], Sel et al. [2023], our framework employs a systematic, feature-based evaluation and scoring of thoughts at each step. It ensures not only that the highest-quality thoughts progress but also that knowledge of weak thoughts is retained and fed to the model in subsequent decomposition steps. This recursive, detailed scoring process, based on logical validity, coherence, simplicity, and adaptiveness, enables a more nuanced and reliable reasoning mechanism.

Figure 2: Edge cases handled by the Knowledge Propagation Module (KPM) during thought selection in decomposition steps. The Complete Selection scenario (left) shows all thoughts (T1, T2, T3) selected. The Complete Rejection scenario (center) depicts all thoughts rejected, leading to regeneration. The Mixed Selection scenario (right) highlights partial selection, with some thoughts accepted and others rejected. These scenarios demonstrate how the KPM ensures optimal thought progression.

3.4 Knowledge Propagation Module and Edge Case Management

The Knowledge Propagation Module (KPM) plays a crucial role in the RDoLT framework's reasoning process. It is responsible for managing knowledge and propagating it through the subsequent steps of reasoning. This module ensures that the flow of information remains coherent and consistent across all levels of decomposition, significantly enhancing the model's reasoning capabilities. Furthermore, the KPM manages the execution of the system, handles edge cases, and oversees the selection and rejection of thoughts, which is essential for maintaining the framework's overall accuracy.

The KPM tracks both selected and non-selected thoughts at each step of the reasoning process. Selected thoughts are those that have met the threshold criteria of the scoring system, while non-selected thoughts (weak thoughts) are those that did not meet the required threshold. Unlike traditional methods, which focus primarily on the immediate next step, the KPM makes this information available to all subsequent steps. For instance, thoughts selected or rejected during the easy step are accessible in both the intermediate and final steps. This comprehensive tracking ensures that the system retains a full understanding of the reasoning progression from start to finish.

Mathematically, let $S^{k}_{\text{selected}}$ and $S^{k}_{\text{non-selected}}$ represent the sets of selected and non-selected thoughts at level $k$, respectively. The KPM propagates these sets to all subsequent levels $k+1, k+2, \ldots$, as follows:

$$\{S^{k}_{\text{selected}}, S^{k}_{\text{non-selected}}\} \rightarrow \{S^{k+1}_{\text{selected}}, S^{k+1}_{\text{non-selected}}, \ldots, S^{n}_{\text{selected}}, S^{n}_{\text{non-selected}}\} \tag{12}$$

This propagation includes maintaining a history of all thoughts and providing this history to the reasoning framework to ensure well-informed decision-making at each step. Additionally, the KPM includes a robust feedback mechanism.
If no thought passes the threshold at any step, the module informs the main framework to regenerate thoughts, ensuring the reasoning process does not stall. This feedback mechanism is critical for maintaining the flow of reasoning and preventing bottlenecks:

$$\text{If } S^{k}_{\text{selected}} = \emptyset \Rightarrow \text{Regenerate thoughts at level } k \tag{13}$$

Moreover, the KPM handles various edge cases, such as those illustrated in Figure 2. It tracks thoughts that receive the same score and ensures appropriate handling. For instance, if multiple thoughts pass the threshold, the module informs the model of the scores for all passing thoughts, enabling it to prioritize them effectively. If two thoughts have identical scores, they are given equal priority and propagated to the next step:

$$\text{If } S(t_{k1}) = S(t_{k2}) \text{ and both } t_{k1}, t_{k2} \in S^{k}_{\text{selected}} \Rightarrow t_{k1}, t_{k2} \tag{14}$$

In cases where all thoughts pass the threshold, the module provides detailed scores to help the model utilize the thought information more effectively in subsequent steps. This systematic approach ensures that the reasoning process remains flexible and robust, capable of handling various scenarios without compromising the integrity of the reasoning flow.

Figure 3: Detailed example illustrating how RDoLT addresses problems by (1) decomposing them into three reasoning steps, (2) generating thoughts for each step, and (3) scoring and propagating knowledge through a Knowledge Propagation Module (KPM) across subsequent steps. The system performs scoring and selection at each step, ensuring that both accepted and rejected thoughts are transmitted to enhance learning and to make clear why certain thoughts were discarded.

Compared to CoT, CoT-SC, and Least2Most prompting, our KPM offers a more advanced and comprehensive approach to managing and propagating knowledge. Traditional methods primarily focus on sequential thought generation and consensus mechanisms without maintaining a detailed history of thoughts or providing robust feedback.
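The selection, tie handling, and regeneration rules above can be made concrete with a short Python sketch. It is a minimal illustration, not the authors' implementation: the names `Thought`, `KPM`, `run_level`, and `max_retries` are hypothetical, the feature scores are hard-coded rather than produced by an LLM or a human scorer, and the paper does not say how many regeneration attempts are allowed or whether rejected batches are stored, so the retry cap and the choice to record only successful rounds are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Thought:
    text: str
    scores: dict  # feature name -> score for the four features

    @property
    def total(self) -> float:
        # Equation (10): the overall score is the sum of the feature scores.
        return sum(self.scores.values())


@dataclass
class KPM:
    threshold: float                             # tau from Section 3.3
    history: list = field(default_factory=list)  # one record per finished level

    def select(self, level, thoughts):
        """Split thoughts into strong and weak; return None if none pass (Eq. 13)."""
        strong = [t for t in thoughts if t.total > self.threshold]
        weak = [t for t in thoughts if t.total <= self.threshold]
        if not strong:
            return None
        # Stable sort: thoughts with identical scores are all kept (Eq. 14).
        strong.sort(key=lambda t: t.total, reverse=True)
        self.history.append({"level": level, "strong": strong, "weak": weak})
        return strong


def run_level(kpm, level, generate, max_retries=3):
    """Generate and select thoughts for one level, regenerating on empty selection."""
    for _ in range(max_retries):
        # The full history (strong and weak thoughts of earlier levels) is passed
        # to the generator, mirroring Equation (12).
        selected = kpm.select(level, generate(level, kpm.history))
        if selected is not None:
            return selected
    raise RuntimeError(f"no thought passed the threshold at level {level!r}")


# Stub generator standing in for an LLM call: the first batch fails, the second passes.
batches = iter([
    [Thought("guess", {"valid": -1.0, "cohere": 0.5, "simple": -0.5, "adapt": 0.5})],
    [
        Thought("a", {"valid": 0.0, "cohere": 0.75, "simple": -0.25, "adapt": 0.75}),
        Thought("b", {"valid": 0.0, "cohere": 0.25, "simple": -0.5, "adapt": 0.5}),
        Thought("c", {"valid": 0.0, "cohere": 0.75, "simple": -0.25, "adapt": 0.75}),
    ],
])

kpm = KPM(threshold=1.0)
selected = run_level(kpm, "easy", lambda level, history: next(batches))
print([t.text for t in selected])                # strong thoughts, highest score first
print([t.text for t in kpm.history[0]["weak"]])  # weak thoughts kept for later levels
```

In this run the first batch is rejected outright, which triggers regeneration; in the second batch the tied thoughts "a" and "c" are both propagated, while "b" is stored as a weak thought that later levels can still inspect through `kpm.history`.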
Our KPM addresses these gaps by ensuring that all relevant information is available at every step, thereby enhancing the overall accuracy and reliability of the reasoning process.

4. Experiments

To evaluate the effectiveness of RDoLT, we conducted a comprehensive series of experiments. These experiments aimed to assess RDoLT's performance across various reasoning benchmarks, comparing it with state-of-the-art prompting methods. We selected benchmarks and models, implemented the comparison methods, and analyzed the main results to determine the robustness of different RDoLT variants. Additionally, we examined the impact of the number of thoughts generated on success rates to understand how thought granularity influences overall performance. The following sections provide a detailed account of our experimental setup and findings.

4.1 Benchmarks & Models Selection

Given the versatility and ease of use offered by the RDoLT framework, as well as the accessibility of tools like LM-Studio Pourkamali and Sharifi [2024] and Ollama Ollama [2024], we conducted experiments using four distinct open-source and open-weight LLMs: Llama-3 (8B) AI@Meta [2024], Qwen-2 (7B) Yang et al. [2024], Gemma-2 (9B) Team [2024], and Gemma-2 (27B) Team [2024]. Although we also utilized the OpenAI API to access ChatGPT-4o OpenAI [2024], our primary focus was on evaluating the performance of open-source LLMs with various quantization levels. This focus allowed us to conduct extensive experiments and explore different variations in prompt design. In our experiments, we maintained the temperature parameter at 0.4, striking a balance between encouraging model creativity and ensuring consistent, reliable responses for complex reasoning tasks. Additionally, the context length was set at 8192 tokens to maximize the model's ability to handle extensive input sequences.
We also explored context lengths of 4096 and 2056 tokens to evaluate the impact on model performance and accuracy.

The RDoLT framework is specifically designed to address reasoning tasks, particularly those requiring sequential and multi-step reasoning. To thoroughly evaluate the effectiveness of our system, we tested it against well-known benchmarks that push the limits of prompt engineering. For mathematical reasoning, we used the GSM8K Cobbe et al. [2021], Multi-Arithmetic ChilleD [2023b], SVAMP Patel et al. [2021], and Gaokao 2023 Math Zhang et al. [2023b] benchmarks. To assess the system's ability to handle common-sense reasoning, we included the LastLetterConcatenation ChilleD [2023a] benchmark in our evaluation. In total, we tested our prompt design across five different benchmarks to assess its generalizability and compare its performance with state-of-the-art techniques.

4.2 Methods Selected for Comparison

We compare our system with Standard I/O, CoT, CoT-SC, Auto-CoT, and L2M prompting. These methods were chosen for their prominence and varied approaches to improving prompt accuracy. Standard I/O prompting serves as a baseline to highlight the improvements of more advanced techniques. CoT, CoT-SC, and their variants, which follow a step-by-step reasoning structure, align with our decomposition-based approach, while L2M offers a contrasting progressive complexity strategy. Unlike fine-tuning methods, which often excel in domain-specific tasks but lack flexibility and generalization across diverse datasets, RDoLT maintains adaptability. Evaluating RDoLT against these methods allows for a comprehensive assessment of its accuracy and generalizability.

5. Main Results

We evaluated the RDoLT framework on five benchmarks (Cobbe et al. [2021], ChilleD [2023a], Zhang et al. [2023b], ChilleD [2023b], Patel et al. [2021]), comparing it with established prompting techniques, including CoT, CoT-SC, Least2Most, and Auto-CoT (A-CoT).
The results, shown in Table 1, demonstrate that RDoLT outperforms these methods across most tasks and models.

Table 1: Comprehensive evaluation of the RDoLT framework across various LLMs. The performance of RDoLT is compared with several established methodologies, including Vanilla prompting, CoT, CoT-SC, Least-to-Most prompting, and Auto-CoT (A-CoT). The assessment encompasses multiple benchmarks: GSM8K, SVAMP, MultiArith, Last Letter Concatenation, and Gaokao 2023 Math. An upward arrow (↑) marks the best score in each row; "-" marks a value that is not reported.

Benchmark / Model                          Vanilla  CoT     CoT-SC  Least2Most  Auto-CoT  RDoLT
GSM8K Cobbe et al. [2021]
  ChatGPT 4o                               84.7     88.9    89.4    -           85.8      90.98↑
  Llama 3 (8B)                             67.42    71.29   72.86↑  69.91       68.53     72.63
  Qwen 2 (7B)                              62.2     65.3    64.8    63.6        61.8      67.4↑
  Gemma 2 (9B)                             59.36    64.28   62.53   61.89       60.18     65.79↑
  Gemma 2 (27B)                            60.65    75.46   76.72↑  74.94       71.83     76.58
SVAMP Patel et al. [2021]
  ChatGPT 4o                               83.5     87.3    86.7    85.9        -         89.35↑
  Llama 3 (8B)                             65.79    69.54↑  68.86   67.93       66.67     69.23
  Qwen 2 (7B)                              60.3     64.2    63.5    62.6        61.3      66.37↑
  Gemma 2 (9B)                             58.6     63.4    61.8    60.9        64.52↑    64.19
  Gemma 2 (27B)                            69.86    73.75   72.94   71.98       70.69     75.27↑
MultiArith ChilleD [2023b]
  ChatGPT 4o                               82.7     -       85.7    84.9        83.6      88.2↑
  Llama 3 (8B)                             77.93    81.62   80.97   79.84       78.53     83.31↑
  Qwen 2 (7B)                              58.6     62.3    61.5    60.4        59.2      64.5↑
  Gemma 2 (9B)                             56.87    61.38   59.76   58.69       57.48     63.83↑
  Gemma 2 (27B)                            67.96    72.73↑  71.86   70.75       68.68     72.49
LastLetterConcatenation ChilleD [2023a]
  ChatGPT 4o                               80.4     84.4    83.7    -           81.3      87.15↑
  Llama 3 (8B)                             62.58    66.36   65.67   64.48       63.29     68.24↑
  Qwen 2 (7B)                              57.4     61.2    60.5    59.3        58.2      63.35↑
  Gemma 2 (9B)                             56.46    60.28   58.69   57.59       55.87     62.76↑
  Gemma 2 (27B)                            66.67    70.48   69.58   68.39       67.29     72.93↑
Gaokao 2023 Math Zhang et al. [2023b]
  ChatGPT 4o                               -        83.2    82.5    81.6        80.3      85.64↑
  Llama 3 (8B)                             60.86    64.49   63.78   62.68       61.39     66.57↑
  Qwen 2 (7B)                              55.7     59.4    58.6    57.5        56.3      61.75↑
  Gemma 2 (9B)                             55.27    59.19   57.58   56.47       54.79     61.68↑
  Gemma 2 (27B)                            65.47    70.28↑  69.39   68.26       66.18     70.05

On the GSM8K benchmark, RDoLT achieves an accuracy of 90.98% with ChatGPT-4o, surpassing CoT-SC (89.4%) and A-CoT (85.8%). RDoLT also shows strong performance with Llama 3 (72.63%) and Qwen 2 (67.4%), indicating its adaptability across different models and architectures.

For the SVAMP benchmark, RDoLT leads with 89.35% using ChatGPT-4o, outperforming CoT (87.3%) and CoT-SC (86.7%). CoT performs relatively well with Llama 3 (69.54%), indicating that certain models can benefit from specific prompting strategies, although RDoLT remains superior overall across various architectures.

In the MultiArith benchmark, RDoLT achieves 88.2% with ChatGPT-4o, showing strong results in arithmetic reasoning tasks. Llama 3 (83.31%) and Qwen 2 (64.5%) also perform well with RDoLT. However, with the Gemma 2 (27B) model, CoT produces better results (72.73% versus 72.49%), suggesting that the optimal prompting technique can vary depending on the model and task complexity.

The Last Letter Concatenation benchmark further confirms RDoLT’s effectiveness, with ChatGPT-4o reaching 87.15%. Llama 3 (68.24%) and Qwen 2 (63.35%) also perform well with RDoLT, reinforcing its generalization capability across different models and tasks.

Finally, in the Gaokao 2023 Math benchmark, RDoLT achieves 85.64% with ChatGPT-4o. Llama 3 (66.57%) and Qwen 2 (61.75%) also show strong results. While CoT slightly outperforms RDoLT with Gemma 2 (27B) (70.28% versus 70.05%), RDoLT remains the most consistent top performer overall.

In summary, RDoLT achieves the best score in 19 of the 25 model-benchmark pairs in Table 1 (76%). While CoT and CoT-SC demonstrate competitive results in specific cases, RDoLT offers the most consistent and robust performance across a diverse range of reasoning tasks and models.
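One way to read Table 1 is to determine, for each benchmark and model pair, which method has the highest reported accuracy. The sketch below does this with the scores transcribed from Table 1. The names `TABLE_1`, `best_method`, and `count_wins` are illustrative and not from the paper, and entries reported as "-" are treated as missing.

```python
# Scores transcribed from Table 1; None stands for an entry reported as "-".
METHODS = ("Vanilla", "CoT", "CoT-SC", "Least2Most", "Auto-CoT", "RDoLT")

TABLE_1 = {
    ("GSM8K", "ChatGPT 4o"): (84.7, 88.9, 89.4, None, 85.8, 90.98),
    ("GSM8K", "Llama 3"): (67.42, 71.29, 72.86, 69.91, 68.53, 72.63),
    ("GSM8K", "Qwen 2 (7B)"): (62.2, 65.3, 64.8, 63.6, 61.8, 67.4),
    ("GSM8K", "Gemma 2 (9B)"): (59.36, 64.28, 62.53, 61.89, 60.18, 65.79),
    ("GSM8K", "Gemma 2 (27B)"): (60.65, 75.46, 76.72, 74.94, 71.83, 76.58),
    ("SVAMP", "ChatGPT 4o"): (83.5, 87.3, 86.7, 85.9, None, 89.35),
    ("SVAMP", "Llama 3"): (65.79, 69.54, 68.86, 67.93, 66.67, 69.23),
    ("SVAMP", "Qwen 2 (7B)"): (60.3, 64.2, 63.5, 62.6, 61.3, 66.37),
    ("SVAMP", "Gemma 2 (9B)"): (58.6, 63.4, 61.8, 60.9, 64.52, 64.19),
    ("SVAMP", "Gemma 2 (27B)"): (69.86, 73.75, 72.94, 71.98, 70.69, 75.27),
    ("MultiArith", "ChatGPT 4o"): (82.7, None, 85.7, 84.9, 83.6, 88.2),
    ("MultiArith", "Llama 3"): (77.93, 81.62, 80.97, 79.84, 78.53, 83.31),
    ("MultiArith", "Qwen 2 (7B)"): (58.6, 62.3, 61.5, 60.4, 59.2, 64.5),
    ("MultiArith", "Gemma 2 (9B)"): (56.87, 61.38, 59.76, 58.69, 57.48, 63.83),
    ("MultiArith", "Gemma 2 (27B)"): (67.96, 72.73, 71.86, 70.75, 68.68, 72.49),
    ("LastLetter", "ChatGPT 4o"): (80.4, 84.4, 83.7, None, 81.3, 87.15),
    ("LastLetter", "Llama 3"): (62.58, 66.36, 65.67, 64.48, 63.29, 68.24),
    ("LastLetter", "Qwen 2 (7B)"): (57.4, 61.2, 60.5, 59.3, 58.2, 63.35),
    ("LastLetter", "Gemma 2 (9B)"): (56.46, 60.28, 58.69, 57.59, 55.87, 62.76),
    ("LastLetter", "Gemma 2 (27B)"): (66.67, 70.48, 69.58, 68.39, 67.29, 72.93),
    ("Gaokao", "ChatGPT 4o"): (None, 83.2, 82.5, 81.6, 80.3, 85.64),
    ("Gaokao", "Llama 3"): (60.86, 64.49, 63.78, 62.68, 61.39, 66.57),
    ("Gaokao", "Qwen 2 (7B)"): (55.7, 59.4, 58.6, 57.5, 56.3, 61.75),
    ("Gaokao", "Gemma 2 (9B)"): (55.27, 59.19, 57.58, 56.47, 54.79, 61.68),
    ("Gaokao", "Gemma 2 (27B)"): (65.47, 70.28, 69.39, 68.26, 66.18, 70.05),
}


def best_method(scores):
    """Return the name of the highest-scoring method, skipping missing entries."""
    reported = [(s, m) for s, m in zip(scores, METHODS) if s is not None]
    return max(reported)[1]


def count_wins(table, method="RDoLT"):
    """Count the rows in which `method` has the highest reported score."""
    return sum(1 for scores in table.values() if best_method(scores) == method)


wins = count_wins(TABLE_1)
# Rows where another method has the highest score, with that method's name.
exceptions = {k: best_method(v) for k, v in TABLE_1.items() if best_method(v) != "RDoLT"}
print(f"RDoLT has the top score in {wins} of {len(TABLE_1)} settings")
for (benchmark, model), winner in exceptions.items():
    print(f"  {benchmark} / {model}: best is {winner}")
```

The same dictionary can be reused to compute margins, for example the gap between RDoLT and the strongest baseline in a given row.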
5.1 Robustness of RDoLT Variants

Table 2: Performance Analysis of RDoLT Variants Across Different Threshold Score Levels: the performance of six variants across four threshold levels. Upward arrows (↑) indicate peak performance for each variant. Results show that optimal thresholds vary among variants, with most peaking at 30 or 35.

Variant                      Threshold ≥25  Threshold ≥30  Threshold ≥35  Threshold ≥40
Single-Step (Sequential)     53.12          73.47          80.78↑         65.30
Single-Step (One-Shot)       68.67          71.73↑         68.89          69.45
Few-Shots (2)                65.89          76.47          77.15↑         70.60
Multi-Requests (3)           61.12          73.24↑         72.34          74.15
Multi-Requests Unlimited*    60.24          74.51↑         73.13          72.30

Evaluating the RDoLT variants across different threshold score levels revealed several patterns, each with potential implications for practical applications. The Single-Step (Sequential) variant demonstrated the highest overall performance, peaking at 80.78% with a threshold of ≥35. This superior performance suggests that for complex tasks, allowing more extensive processing before making decisions yields better results. The sequential nature of this variant likely enables a more thorough exploration of the problem space, leading to more accurate solutions. The Multi-Step (1-Shot) and Few-Shots (2) variants both showed optimal performance at the ≥35 threshold (77.51% and 77.15%, respectively). These results indicate that these approaches benefit from moderate thresholds, striking a balance between depth of processing and efficiency. The slightly lower performance compared to the Single-Step (Sequential) variant might be attributed to the trade-off between speed and accuracy, where these methods attempt to reach solutions more quickly but potentially at the cost of some precision.

Interestingly, the One-Shot variant achieved its best performance (71.73%) at a lower threshold of ≥30.
This suggests that this approach is more suitable for tasks requiring quicker decision-making or where rapid responses are valued over absolute accuracy. The lower overall performance compared to other variants implies that, while efficient, this method may sacrifice some problem-solving depth. Multi-Request variants, both limited (3) and unlimited, performed optimally at the ≥30 threshold, with scores of 73.24% and 74.51% respectively. This pattern indicates that allowing multiple attempts at problem-solving can be effective, but excessive iterations may not yield significant improvements. The similarity in performance between the limited and unlimited variants suggests that there might be a natural ceiling to the benefits of multiple attempts, beyond which additional requests do not substantially enhance outcomes.

A noteworthy trend across most variants is the reduced performance at the highest threshold (≥40) relative to the peak. This drop-off points to a potential over-processing effect, where excessively high thresholds may introduce unnecessary complexity or lead to over-fitting in the decision-making process. This observation underscores the importance of finding the right balance in threshold setting to optimize performance without incurring diminishing returns.

5.2 Impact of Thought Quantity on Success Rates

The evaluation of the RDoLT prompt system reveals how performance varies across the stages of problem-solving as a function of the number of thoughts generated per step. Table 3 summarizes the results of this analysis, illustrating the effectiveness of generating 3, 5, and 7 thoughts per step.

In scenarios where three thoughts were generated per step, the final step proved to be the most effective, solving 24 out of 30 problems, resulting in an 80% success rate. This demonstrates the crucial role of the final step, which benefits from the refined thoughts developed in earlier stages.
The intermediate step followed, solving 15 problems with a 50% success rate, while the easy step had the lowest contribution, solving 9 problems with a 30% success rate. Overall, this variant solved 48 problems with a success rate of 53.33%, showcasing its robustness in a structured problem-solving environment.

Table 3: Step-wise problem-solving performance across different step levels (Easy, Intermediate, Final) using varying numbers of thoughts per step (3, 5, and 7). Results show the progression of solved problems at each thought step, the total problems solved, and the corresponding success rates for each configuration. The 3 Thoughts Count/Step configuration is identified as the best-performing system.

Thoughts Count/Step  Steps         T1  T2  T3  T4  T5  T6  T7  Total Solved  Success Rate %
3                    Easy          2   3   4   -   -   -   -   9             30
                     Intermediate  4   5   6   -   -   -   -   15            50
                     Final         7   8   9   -   -   -   -   24            80
                     Total         13  16  19  -   -   -   -   48            53.33↑
5                    Easy          2   3   4   5   6   -   -   20            33.33
                     Intermediate  4   5   6   7   8   -   -   30            50
                     Final         6   7   8   9   10  -   -   40            66.67
                     Total         12  15  18  21  24  -   -   90            49.75↓
7                    Easy          1   2   3   2   4   4   -   16            31.11
                     Intermediate  2   3   3   4   5   5   5   27            38.89
                     Final         1   4   6   5   6   6   6   34            46.67
                     Total         6   9   10  12  14  15  11  77            38.89↓

Figure 4: Impact of different thought selection strategies on accuracy using KPM. Strategies include using only selected thoughts, both selected and non-selected thoughts, highest-scoring selected thoughts, and lowest-threshold non-selected thoughts. Incorporating both selected and non-selected thoughts yields a wider range of accuracy, while selecting only final thoughts ensures more consistent accuracy results.

When the number of thoughts per step increased to five, the final step maintained its significance, solving 40 problems with a 66.67% success rate. The intermediate step solved 30 problems, achieving a 50% success rate, while the easy step solved 20 problems, with a 33.33% success rate.
This variant solved a total of 90 problems, yielding a 60.00% success rate. These results suggest that increasing the number of thoughts per step enhances problem-solving capabilities, though the final step remains critical for achieving high accuracy.

Conversely, generating seven thoughts per step showed a decrease in overall performance. The final step solved 34 problems, achieving a 46.67% success rate, the intermediate step solved 27 problems with a 38.89% success rate, and the easy step solved 16 problems with a 31.11% success rate. The total number of problems solved in this variant was 77, with a 36.67% success rate. This indicates that while more thoughts per step provide more options, they may also introduce complexity that can hinder overall effectiveness.

5.3 Limitations and Future Directions

While the RDoLT framework demonstrates superior performance across multiple benchmarks, several limitations remain. First, the generalizability of RDoLT to domain-specific tasks has not been fully explored. The framework shows promising results in standard reasoning tasks, but its adaptability to highly specialized domains such as legal reasoning or medical diagnostics may require further fine-tuning and optimization. Additionally, the computational overhead of maintaining the Knowledge Propagation Module (KPM) may limit scalability, particularly in resource-constrained environments.

Another potential threat to validity is the reliance on benchmark datasets that may not fully capture the complexity of real-world reasoning scenarios. Although benchmarks like GSM8K and SVAMP are widely used, they represent structured tasks that may not account for the diverse and dynamic nature of human reasoning in more unstructured settings.

Future research should focus on addressing these limitations by extending the framework to more domain-specific applications and exploring optimizations that reduce computational costs.
Additionally, incorporating more diverse and challenging real-world datasets could provide a deeper evaluation of RDoLT’s capabilities. Further work could also investigate hybrid approaches that combine the strengths of multiple prompting strategies to improve performance across various reasoning tasks.

6. Conclusion

This paper introduced the RDoLT framework, a novel approach designed to enhance reasoning in large language models (LLMs) through dynamic thought selection and knowledge propagation. The key innovation, the Knowledge Propagation Module (KPM), ensures that selected and rejected thoughts are tracked and leveraged across reasoning stages, improving accuracy and reducing error propagation. We evaluated RDoLT across multiple benchmarks, including GSM8K, SVAMP, MultiArith, and Gaokao 2023 Math. Our results show that RDoLT consistently outperforms existing methods such as Chain-of-Thought (CoT), CoT with Self-Consistency (CoT-SC), Least2Most, and Auto-CoT (A-CoT). For instance, on GSM8K, RDoLT achieved a top accuracy of 90.98% with ChatGPT 4o, surpassing CoT-SC’s 89.4%. Similar improvements were observed across Llama 3, Qwen 2, and Gemma 2 models.

What sets RDoLT apart is its ability to utilize rejected thoughts, a feature absent in other methods. This comprehensive approach enhances the reasoning process by maintaining a complete view of generated thoughts, thereby improving overall decision-making. While RDoLT performs exceptionally well on general benchmarks, further research is needed to optimize its performance for domain-specific tasks and reduce computational overhead. Future work could focus on more efficient knowledge propagation techniques and exploring new domains.

In summary, RDoLT offers a significant advancement in prompt engineering by improving thought selection and knowledge propagation in LLMs.
Its performance across diverse benchmarks demonstrates its potential as a flexible and reliable framework for reasoning tasks.

7. Ethics Statement

We ensured that all datasets used in this research were properly sourced and cited. In this study, we focused on open-source LLMs such as Llama 3, Qwen 2, and Gemma 2, which offer greater transparency and accessibility, promoting reproducibility and collaboration. The RDoLT framework improves reasoning without inherently preventing the generation of harmful content, so we encourage users to implement appropriate safeguards to mitigate potential risks. Our emphasis on open-source models aligns with our goal of supporting wider research participation and avoiding the constraints of proprietary systems.

References

Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, and Besmira Nushi. Kitab: Evaluating llms on constraint satisfaction for information retrieval. arXiv preprint arXiv:2310.15511, 2023.

Qingyao Ai, Ting Bai, Zhao Cao, Yi Chang, Jiawei Chen, Zhumin Chen, Zhiyong Cheng, Shoubin Dong, Zhicheng Dou, Fuli Feng, Shen Gao, Jiafeng Guo, Xiangnan He, Yanyan Lan, Chenliang Li, Yiqun Liu, Ziyu Lyu, Weizhi Ma, Jun Ma, Zhaochun Ren, Pengjie Ren, Zhiqiang Wang, Mingwen Wang, Ji-Rong Wen, Le Wu, Xin Xin, Jun Xu, Dawei Yin, Peng Zhang, Fan Zhang, Weinan Zhang, Min Zhang, and Xiaofei Zhu. Information Retrieval meets Large Language Models: A strategic report from Chinese IR community. AI Open, 4:80–90, 2023. ISSN 2666-6510. doi: 10.1016/j.aiopen.2023.08.001.

AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. URL https://arxiv.org/abs/2204.05862.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024a. ISSN 2159-5399. doi: 10.1609/aaai.v38i16.29720. URL http://arxiv.org/abs/2308.09687, http://dx.doi.org/10.1609/aaai.v38i16.29720.

Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Lukas Gianinazzi, Ales Kubicek, Hubert Niewiadomski, Aidan O’mahony, Onur Mutlu, and Torsten Hoefler. Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts. arXiv, 2024b. doi: 10.48550/arxiv.2401.14295.

Tuan Bui, Oanh Tran, Phuong Nguyen, Bao Ho, Long Nguyen, Thang Bui, and Tho Quan. Cross-data knowledge graph construction for llm-enabled educational question-answering system: A case study at hcmut. In Proceedings of the 1st ACM Workshop on AI-Powered Q&A Systems for Multimedia, pages 36–43, 2024.

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye.
Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023. URL https://arxiv.org/abs/2304.05128.

ChilleD. ChilleD/LastLetterConcat · Datasets at Hugging Face, 2023a. URL https://huggingface.co/datasets/ChilleD/LastLetterConcat.

ChilleD. ChilleD/MultiArith · Datasets at Hugging Face, 2023b. URL https://huggingface.co/datasets/ChilleD/MultiArith.

Chilled and Chilled. Chilled/lastletterconcat · datasets at hugging face. 2023. URL https://huggingface.co/datasets/ChilleD/LastLetterConcat.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Yan Ding, Xiaohan Zhang, Saeid Amiri, Nieqing Cao, Hao Yang, Andy Kaminski, Chad Esselink, and Shiqi Zhang. Integrating action knowledge and llms for task planning and situation handling in open worlds. Autonomous Robots, 47(8):981–997, 2023.

Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo Ramos, and Ahmed Hassan Awadallah. Nl-edit: Correcting semantic parse errors through natural language interaction, 2021. URL https://arxiv.org/abs/2103.14540.

Erkut Erdem, Menekse Kuyu, Semih Yagcioglu, Anette Frank, Letitia Parcalabescu, Barbara Plank, Andrii Babii, Oleksii Turuta, Aykut Erdem, Iacer Calixto, et al. Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning. Journal of Artificial Intelligence Research, 73:1131–1207, 2022.

Dehong Gao, Kaidi Chen, Ben Chen, Huangyu Dai, Linbo Jin, Wen Jiang, Wei Ning, Shanqing Yu, Qi Xuan, Xiaoyan Cai, et al. Llms-based machine translation for e-commerce. Expert Systems with Applications, page 125087, 2024.

Gaël Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are not strong abstract reasoners. arXiv, 2023. doi: 10.48550/arxiv.2305.19555.

Anand Gokul. Llms and ai: Understanding its reach and impact. 2023.

Ruixin Hong, Hongming Zhang, Xiaoman Pan, Dong Yu, and Changshui Zhang. Abstraction-of-thought makes language models better reasoners. arXiv, 2024. doi: 10.48550/arxiv.2406.12442.

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv, 2023. doi: 10.48550/arxiv.2310.01798.

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 21314–21328. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/8636419dea1aa9fbd25fc4248e702da4-Paper-Conference.pdf.

Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking llms for adaptive and reliable medical reasoning. arXiv preprint arXiv:2406.00922, 2024.

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URL https://aclanthology.org/2020.findings-emnlp.165.

Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, and Yejin Choi.
Rainier: Reinforced knowledge introspector for commonsense question answering, 2022. URL https://arxiv.org/abs/2210.03078.

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.

Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27591–27609. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b125999bde7e80910cbdbd323087df8f-Paper-Conference.pdf.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651.

Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien-Ming Huang. Llm-powered conversational voice assistants: Interaction patterns, opportunities, challenges, and design guidelines. arXiv preprint arXiv:2309.13879, 2023.

Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. Co-writing screenplays and theatre scripts with language models: An evaluation by industry professionals, 2022. URL https://arxiv.org/abs/2209.14958.

Devon Myers, Rami Mohawesh, Venkata Ishwarya Chellaboina, Anantha Lakshmi Sathvik, Praveen Venkatesh, Yi-Hui Ho, Hanna Henshaw, Muna Alhawawreh, David Berdik, and Yaser Jararweh.
Foundation and large language models: fundamentals, challenges, opportunities, and social impacts. Cluster Computing, 27(1):1–26, 2024.

Ollama. Ollama: Get up and running with Llama 3.1, Mistral, Gemma 2, and other large language models, 2024. URL https://github.com/ollama/ollama/tree/main.

OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.

Yoonjeong Park, Hyunjin Kim, Chanyeol Choi, Junseong Kim, and Jy-yong Sohn. Can separators improve chain-of-thought prompting? arXiv, 2024. doi: 10.48550/arxiv.2402.10645.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations, 2024. URL https://arxiv.org/abs/2304.01904.

Nooshin Pourkamali and Shler Ebrahim Sharifi. Machine translation with large language models: Prompt engineering for persian, english, and russian directions. 2024. URL https://arxiv.org/abs/2401.08429.

Ricardo La Rosa, Corey Hulse, and Bangdi Liu. Can github issues be solved with tree of thoughts? arXiv, 2024. doi: 10.48550/arxiv.2405.13057.

Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, and Bela Gipp. Can llms master math? investigating large language models on math stack exchange. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, page 2316–2320, New York, NY, USA, 2024.
Association for Computing Machinery. ISBN 9798400704314. doi: 10.1145/3626772.3657945. URL https://doi.org/10.1145/3626772.3657945.

Bilgehan Sel, Ahmad Al-tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv, 2023. doi: 10.48550/arxiv.2308.10379.

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv, 2024. doi: 10.48550/arxiv.2401.07324.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.

Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an analysis of cot in planning, 2024. URL https://arxiv.org/abs/2405.04776.

Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. Learning to Repair: Repairing model output errors after deployment using a dynamic memory of feedback. arXiv, 2021. doi: 10.48550/arxiv.2112.09737.

Gemma Team. Gemma. 2024. doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.

Rayden Tseng, Suzan Verberne, and Peter van der Putten. Chatgpt as a commenter to the news: can llms generate human-like opinions? In Multidisciplinary International Symposium on Disinformation in Open Online Media, pages 160–174. Springer, 2023.

Michiel Van Der Meer, Enrico Liscio, Catholijn Jonker, Aske Plaat, Piek Vossen, and Pradeep Murukannaiah. A hybrid intelligence method for argument mining. Journal of Artificial Intelligence Research, 80:1187–1222, 2024.

Shubham Vatsal and Harsh Dubey. A survey of prompt engineering methods in large language models for different nlp tasks, 2024. URL https://arxiv.org/abs/2407.12994.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou.
Self-consistency improves chain of thought reasoning in language models. In ICLR, 2022. URL http://arxiv.org/abs/2203.11171.

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents, 2024. URL https://arxiv.org/abs/2302.01560.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv, 2022. doi: 10.48550/arxiv.2201.11903.

Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct, 2022. URL https://arxiv.org/abs/2211.00053.

Yike Wu, Jiatao Zhang, Nan Hu, LanLing Tang, Guilin Qi, Jun Shao, Jie Ren, and Wei Song. Mldt: Multi-level decomposition for complex long-horizon robotic task planning with open-source large language model, 2024. URL https://arxiv.org/abs/2403.18760.

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/2305.00633.

Frank Xing. Designing heterogeneous llm agents for financial sentiment analysis. ACM Transactions on Management Information Systems, 2024.

Qing Xue. Unlocking the potential: A comprehensive exploration of large language models in natural language processing. Applied and Computational Engineering, 57(1):247–252, 2024. ISSN 2755-2721. doi: 10.54254/2755-2721/57/20241341.

Archna Balkrishna Yadav. Generative ai in the era of transformers: Revolutionizing natural language processing with llms. Journal of Image Processing and Intelligent Remote Sensing, (42):54–61, 2024. ISSN 2815-0953. doi: 10.55529/jipirs.42.54.61. URL https://pdfs.semanticscholar.org/4029/9c8515c22705f8e42acddf0ade59c754f350.pdf.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671.

Liang Yao. Large language models are contrastive reasoners. arXiv, 2024. doi: 10.48550/arxiv.2403.08211.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2023a. ISSN 10495258. URL https://arxiv.org/abs/2305.10601v2.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023b. URL https://arxiv.org/abs/2210.03629.

Ping Yu, Hua Xu, Xia Hu, and Chao Deng. Leveraging generative ai and large language models: a comprehensive roadmap for healthcare integration. In Healthcare, volume 11, page 2776. MDPI, 2023.

Haopeng Zhang, Philip S Yu, and Jiawei Zhang. A systematic survey of text summarization: From statistical methods to large language models. arXiv preprint arXiv:2406.11289, 2024.

Jintian Zhang, Xin Xu, and Shumin Deng. Exploring collaboration mechanisms for llm agents: A social psychology view.
arXiv preprint arXiv:2310.02124, 2023a.

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on gaokao benchmark. 2023b.

Bo Zhou, Daniel Geißler, and Paul Lukowicz. Misinforming llms: vulnerabilities, challenges and opportunities, 2024. URL https://arxiv.org/abs/2408.01168.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. The International Conference on Learning Representations (ICLR), 2022. URL http://arxiv.org/abs/2205.10625.

A. Complementary Results

Generally, RDoLT requires only three requests and corresponding prompts: Task Decomposition, Thought Generation, and Thoughts Evaluation. In the following section, we present some results that demonstrate the generalizability of RDoLT across a variety of benchmarks.

B. Last Letter Concatenation

Task: "Take the last letter of each word in the sentence: ’Artificial intelligence is the future’ and concatenate them to form a new string."

Decomposition [Easy]
Description: Extract the last letter of each word in the sentence.
Thought 1: Identify the words in the sentence: "Artificial", "intelligence", "is", "the", "future".
Thought 2: Extract the last letter of each word: "l", "e", "s", "e", "e".
Thought 3: Check if each word is correctly identified and its last letter is accurately extracted.
Scores:
- Thought 1: LV: 10, COH: 9, SIM: 10, ADP: 10, Total: 39
- Thought 2: LV: 9, COH: 8, SIM: 8, ADP: 9, Total: 34
- Thought 3: LV: 5, COH: 5, SIM: 5, ADP: 5, Total: 20

KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3

Decomposition [Intermediate]
Description: Combine the extracted letters into a single string.
Thought 1: Concatenate the letters to form the string "lesee".
Thought 2: Verify if the concatenated string represents the order of the words in the original sentence.
Thought 3: Check for any errors in the concatenation process.
Scores:
- Thought 1: LV: 10, COH: 9, SIM: 9, ADP: 10, Total: 38
- Thought 2: LV: 8, COH: 7, SIM: 8, ADP: 8, Total: 31
- Thought 3: LV: 4, COH: 5, SIM: 4, ADP: 5, Total: 18

KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3

Decomposition [Final]
Description: Verify the final output and ensure all steps have been completed correctly.
Thought 1: Review the concatenated string "lesee" to ensure it matches the task requirements.
Thought 2: Validate that the process of extracting last letters and concatenating them followed the proper sequence.
Thought 3: Finalize the result as the correct output for the given task.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
- Thought 2: LV: 7, COH: 7, SIM: 7, ADP: 7, Total: 28
- Thought 3: LV: 4, COH: 4, SIM: 4, ADP: 4, Total: 16
Final Answer: The final concatenated string is "lesee".

KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Selected Thoughts (Final): Thought 1
Rejected Thoughts (Final): Thought 2, Thought 3
Final Answer: "lesee"

C. GSM8K

Task: "Toula went to the bakery and bought various types of pastries. She bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen. How much was the total cost?"
Decomposition [Easy]
Description: Break down the task by identifying each type of pastry and calculating the total cost for each one individually.
Thought 1: Identify the total cost of the donuts by multiplying 3 dozen by $68 per dozen.
Thought 2: Identify the total cost of the mini cupcakes by multiplying 2 dozen by $80 per dozen.
Thought 3: Identify the total cost of the mini cheesecakes by multiplying 6 dozen by $55 per dozen.
Scores:
- Thought 1: LV: 10, COH: 9, SIM: 10, ADP: 10, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 6, COH: 6, SIM: 6, ADP: 7, Total: 25

KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3

Decomposition [Intermediate]
Description: Calculate the cost for each pastry and find the sum of all the items.
Thought 1: The total cost of the donuts is 3 × 68 = $204.
Thought 2: The total cost of the mini cupcakes is 2 × 80 = $160.
Thought 3: The total cost of the mini cheesecakes is 6 × 55 = $330.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 6, COH: 7, SIM: 7, ADP: 7, Total: 27

KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3

Decomposition [Final]
Description: Calculate the total cost by adding up the costs from the previous steps.
Thought 1: Add the total cost of the donuts, mini cupcakes, and mini cheesecakes to get the final total: 204 + 160 + 330.
Thought 2: Verify that the calculation for each item was done correctly.
Thought 3: Double-check the total to make sure there were no mistakes in the addition.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 10, ADP: 10, Total: 40
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 7, COH: 7, SIM: 7, ADP: 7, Total: 28
Final Answer: The total cost is $694.

KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 2
Rejected Thoughts (Intermediate): Thought 3
Selected Thoughts (Final): Thought 1, Thought 2
Rejected Thoughts (Final): Thought 3
Final Answer: $694

D. MMLU

Task: "Let p = (1, 2, 5, 4)(2, 3) in S_5. Find the index of ⟨p⟩ in S_5. Choices: ['8', '2', '24', '120']"

Decomposition [Easy]
Description: Break down the task by identifying and analyzing the given permutation structure in S_5.
Thought 1: Recognize that p is a product of disjoint cycles, where (1, 2, 5, 4) is a 4-cycle and (2, 3) is a 2-cycle.
Thought 2: Identify that the order of a permutation is the least common multiple (LCM) of the lengths of the disjoint cycles.
Thought 3: Calculate the order of p using the LCM of 4 and 2.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 10, ADP: 9, Total: 39
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 8, Total: 35
- Thought 3: LV: 6, COH: 7, SIM: 7, ADP: 6, Total: 26

KPM [Easy Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3

Decomposition [Intermediate]
Description: Analyze the structure of the group and its implications for the index in S_5.
Thought 1: The subgroup is generated by p, and the order of p is the least common multiple (LCM) of 4 and 2, which is 4.
Thought 2: Calculate the index of ⟨p⟩ in S_5 as 120 / 4 = 30 (incorrect thought).
Thought 3: Recognize that the correct formula for the index of a cyclic subgroup in a symmetric group is 120 / 4 = 30.
Scores:
- Thought 1: LV: 10, COH: 10, SIM: 9, ADP: 9, Total: 38
- Thought 2: LV: 5, COH: 6, SIM: 5, ADP: 5, Total: 21
- Thought 3: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36

KPM [Intermediate Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 3
Rejected Thoughts (Intermediate): Thought 2

Decomposition [Final]
Description: Use all previous knowledge to determine the correct index of ⟨p⟩ in S_5.
Thought 1: The order of S_5 is 120, and the order of p is 4, so the index of ⟨p⟩ in S_5 is 30.
Thought 2: Verify if the solution aligns with the structure of S_5 and correct subgroup orders.
Thought 3: Conclude that the index of ⟨p⟩ in S_5 is 2 based on the correct analysis of p's structure.
Scores:
- Thought 1: LV: 7, COH: 8, SIM: 7, ADP: 7, Total: 29
- Thought 2: LV: 9, COH: 9, SIM: 9, ADP: 9, Total: 36
- Thought 3: LV: 10, COH: 10, SIM: 9, ADP: 10, Total: 39
Final Answer: The index of ⟨p⟩ in S_5 is 2.

KPM [Final Step]
Selected Thoughts (Easy): Thought 1, Thought 2
Rejected Thoughts (Easy): Thought 3
Selected Thoughts (Intermediate): Thought 1, Thought 3
Rejected Thoughts (Intermediate): Thought 2
Selected Thoughts (Final): Thought 2, Thought 3
Rejected Thoughts (Final): Thought 1
Final Answer: 2
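
The group-theoretic quantities in the MMLU transcript can be checked independently. The sketch below assumes the task asks for the index of the cyclic subgroup generated by p in S_5, that is, 120 divided by the order of p; the helper names (cycle_to_map, compose, cycle_lengths, order) are illustrative and not part of RDoLT. Because the cycles (1, 2, 5, 4) and (2, 3) share the element 2, they are not disjoint, so the product is first expanded into an explicit mapping before its cycle lengths are read off. Both composition conventions are evaluated.

```python
from math import factorial, lcm


def cycle_to_map(cycle, n):
    """Permutation of {1..n}, as a dict, defined by a single cycle."""
    mapping = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        mapping[a] = b
    return mapping


def compose(f, g):
    """Return the permutation 'apply g first, then f'."""
    return {x: f[g[x]] for x in g}


def cycle_lengths(perm):
    """Lengths of the disjoint cycles of a permutation given as a dict."""
    seen, lengths = set(), []
    for start in perm:
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        lengths.append(length)
    return lengths


def order(perm):
    """Order of a permutation: LCM of its disjoint cycle lengths."""
    return lcm(*cycle_lengths(perm))


n = 5
a = cycle_to_map((1, 2, 5, 4), n)
b = cycle_to_map((2, 3), n)

# Evaluate both conventions for reading the product (1, 2, 5, 4)(2, 3).
for name, p in (("right-to-left", compose(a, b)),
                ("left-to-right", compose(b, a))):
    print(name, "order:", order(p), "index:", factorial(n) // order(p))
```

Under either convention the product is a single 5-cycle, so this check gives order 5 and index 120 / 5 = 24. That value is among the listed choices but differs from the final answer of 2 recorded in the transcript.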
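
The final answers in the Last Letter Concatenation and GSM8K transcripts (Appendices B and C) are simple enough to recompute directly. The following is a minimal sketch for that purpose; the function names last_letter_concat and total_cost are hypothetical helpers introduced here, and the inputs are copied from the two task statements.

```python
def last_letter_concat(sentence: str) -> str:
    """Join the last letter of every whitespace-separated word."""
    return "".join(word[-1] for word in sentence.split())


def total_cost(items):
    """Sum quantity * unit price over (dozens, price_per_dozen) pairs."""
    return sum(dozens * price for dozens, price in items)


# Appendix B: last letters of the five words.
print(last_letter_concat("Artificial intelligence is the future"))  # lesee

# Appendix C: 3 dozen at $68, 2 dozen at $80, 6 dozen at $55.
print(total_cost([(3, 68), (2, 80), (6, 55)]))  # 694
```

Both values agree with the final answers reported in the corresponding transcripts.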
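
The bookkeeping visible in the transcripts above (four sub-scores summed into a total, then a per-stage list of selected and rejected thoughts carried forward by the KPM) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class names Thought and KPM and the cutoff THRESHOLD = 30 are hypothetical. The cutoff was chosen only because every selected thought in Appendices B to D has a total of at least 31 and every rejected one at most 29; the selection rule actually used by RDoLT may differ.

```python
from dataclasses import dataclass, field

# Assumption: a cutoff of 30 out of 40 happens to separate the selected and
# rejected thoughts in the appendix examples; it is not taken from the method.
THRESHOLD = 30


@dataclass
class Thought:
    label: str
    lv: int
    coh: int
    sim: int
    adp: int

    @property
    def total(self) -> int:
        """Total score as shown in the transcripts: LV + COH + SIM + ADP."""
        return self.lv + self.coh + self.sim + self.adp


@dataclass
class KPM:
    """Keeps selected and rejected thought labels for every stage seen so far."""
    selected: dict = field(default_factory=dict)
    rejected: dict = field(default_factory=dict)

    def update(self, stage, thoughts, threshold=THRESHOLD):
        self.selected[stage] = [t.label for t in thoughts if t.total >= threshold]
        self.rejected[stage] = [t.label for t in thoughts if t.total < threshold]


# Scores copied from the MMLU transcript (Appendix D).
kpm = KPM()
kpm.update("Easy", [Thought("Thought 1", 10, 10, 10, 9),
                    Thought("Thought 2", 9, 9, 9, 8),
                    Thought("Thought 3", 6, 7, 7, 6)])
kpm.update("Intermediate", [Thought("Thought 1", 10, 10, 9, 9),
                            Thought("Thought 2", 5, 6, 5, 5),
                            Thought("Thought 3", 9, 9, 9, 9)])
kpm.update("Final", [Thought("Thought 1", 7, 8, 7, 7),
                     Thought("Thought 2", 9, 9, 9, 9),
                     Thought("Thought 3", 10, 10, 9, 10)])

for stage in kpm.selected:
    print(stage, "selected:", kpm.selected[stage], "rejected:", kpm.rejected[stage])
```

With these inputs the selected and rejected lists match the KPM [Final Step] block of the MMLU transcript. Keeping both lists per stage, rather than discarding rejected thoughts, mirrors the way each KPM block in the transcripts repeats the outcome of all earlier stages.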