arXiv paper 2502.07165

Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification

Authors: Peipei Wei, Dimitris Dimitriadis, Yan Xu, Mingwei Shen

Published: 2025-02-11

Abstract:

We present PRINCIPLE-BASED PROMPTING, a simple but effective multi-agent prompting strategy for text classification. It first asks multiple LLM agents to independently generate candidate principles based on analysis of demonstration samples with or without labels, consolidates them into final principles via a finalizer agent, and then sends them to a classifier agent to perform downstream classification tasks. Extensive experiments on binary and multi-class classification datasets with different sizes of LLMs show that our approach not only achieves substantial performance gains (1.55% - 19.37%) over zero-shot prompting on macro-F1 score but also outperforms other strong baselines (CoT and stepback prompting). Principles generated by our approach help LLMs perform better on classification tasks than human-crafted principles on two private datasets. Our multi-agent PRINCIPLE-BASED PROMPTING approach also shows on-par or better performance compared to demonstration-based few-shot prompting approaches, yet with substantially lower inference costs. Ablation studies show that label information and the multi-agent cooperative LLM framework play an important role in generating high-quality principles to facilitate downstream classification tasks.

Paper Content:
Amazon
{peipeiw, dbdim, yanxuml, mingweis}@amazon.com

Introduction

In recent years, transformer-based language models with attention mechanisms have revolutionized the field of NLP. In particular, decoder-only transformer language models, such as the GPT-series models, demonstrate impressive emergent capabilities after scaling up the pre-training corpora and model sizes, capabilities not seen in their smaller predecessors such as BERT-based models (Zheng et al. 2023).
One of these capabilities is In-Context Learning (ICL) (Brown et al. 2020). Equipped with knowledge acquired during the pre-training stage, these large language models (LLMs) are able to perform various tasks with only task instructions and a few demonstrations, without any parameter updates. Despite their surprisingly good zero-shot and few-shot performance on a wide range of tasks such as general QA, reasoning, and text generation, their performance still significantly lags behind fine-tuned models for text classification (Sun et al. 2023b).

Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

On the other hand, these fine-tuned models heavily depend on human annotations, which are not only costly and time-consuming but also sometimes unavailable. Accordingly, leveraging zero-shot or few-shot ICL capabilities of LLMs for text classification has become an important research topic. However, ICL relies on prompt engineering and human expertise in designing demonstration questions, intermediate reasoning steps, and final answers for LLMs to generalize to a variety of unseen queries. Additionally, increasing the number of demonstrations in few-shot settings leads to increased inference costs and may exceed the maximum input length imposed by LLMs.

When humans work on complicated tasks, they usually follow Standard Operating Procedures (SOPs) to ensure that anyone with varying degrees of domain and task-specific knowledge can perform the task with consistently high quality. These SOPs are written by domain experts who have gained expertise by analyzing numerous concrete examples and extracting common principles from them. Inspired by this, we ask: can we mimic the same procedure to generate task-specific principles based on analysis of a handful of demonstrations and then feed them back to LLMs to help mitigate the limitation of lack of task-specific knowledge in ICL?
Previous studies show that adding complex class descriptions as additional inputs to a pre-trained transformer backbone via cross-encoder architecture can significantly boost classification performance under zero-shot and few-shot settings (De Silva et al. 2023). Intuitively, injecting more knowledge-intensive principles should also help improve LLMs' ICL performance.

In this paper, we present PRINCIPLE-BASED PROMPTING for zero-shot text classification. It utilizes a multi-agent collaboration framework to auto-generate principles for each classification task. First, it employs multiple LLM agents to generate candidate principles from demonstrations with or without labels. In the prompts, it explicitly instructs LLMs to extract key principles that can distinguish each class based on analysis of provided demonstrations. Then, all LLM agents send their principle candidates to a central agent for finalization, which selects the best principles for downstream classification tasks. Our approach demonstrates substantial performance gains over other strong baseline ICL approaches, such as Chain-of-Thought (CoT) (Wei et al. 2022) and step-back prompting (Zheng et al. 2023), in zero-shot ICL settings. The performance is also very competitive compared to few-shot ICL. In summary, our contributions are as follows:

• We conduct extensive experiments on three public and two private datasets with two LLMs (flan-t5-xxl and flan-ul2) and show that our approach substantially boosts zero-shot ICL performance on both binary and multi-class classification problems over vanilla prompting as well as strong ICL baselines. We also show on-par or even better classification performance using automatically generated SOPs compared to human-generated SOPs on two private datasets.
• Our approach demonstrates competitive performance even compared to few-shot ICL.
Unlike previous work, our multi-agent approach boosts performance while requiring much shorter input token lengths, resulting in significantly reduced inference costs.
• Our PRINCIPLE-BASED PROMPTING approach significantly outperforms fine-tuned RoBERTa-large under low-resource settings, although the performance of supervised models tends to improve when more labeled data becomes available.
• Through ablation studies, we have identified that label information and the reasoning capabilities of LLMs are key contributors to extracting high-quality principles for downstream classification tasks. We demonstrate the advantages of a multi-agent approach over a single-agent approach. Additionally, we show that selecting more capable LLMs to generate candidate principles and focusing on collaboration rather than competition among LLM agents are important factors when constructing a multi-agent LLM collaboration framework for text classification.

Related work

Demonstration and label relationship
Supervised ML models rely heavily on drawing mappings between representations of training examples and their label information to make predictions on unseen examples. Surprisingly, early research on ICL shows that ground truth in demonstration-label mapping is not as important, as showing demonstrations with random labels only leads to minimal performance drops on a range of classification tasks (Min et al. 2022). However, later research points out the limitations of this study and arrives at a different conclusion: the correct correspondence between examples and labels is essential to ensure ICL performance (Kossen, Rainforth, and Gal 2023). The previous biased conclusion could be attributed to the use of binary (accuracy) instead of probabilistic metrics, relatively weaker LLMs that are mostly under 20B parameters, and focus on only one few-shot setting (16 demos).
Thus, although LLMs predominantly rely on knowledge acquired during pre-training to perform downstream tasks, they can indeed learn new tasks from in-context information. This motivates our search for an alternative to the commonly used demonstration-based approach that provides more effective context information for LLM ICL. In our experiments, we also conduct ablation studies to explore the importance of label information on the quality of principles generated.

Number of demonstrations
Supervised ML algorithms are data-hungry and require a substantial amount of labeled training data to ensure model performance. Under ICL few-shot settings, previous work shows that adding more than one demonstration might not be necessary due to only marginal performance improvements (Chen et al. 2023). As Chen et al. suggest, this indicates that the use of demonstrations is inefficient and the information provided by randomly selected demonstrations is most likely redundant. In some cases, multiple demonstrations can even hurt performance due to misguidance or negative interference among them (Chen et al. 2023). This leads to our research question: under the same input length constraint, can we design more concise but knowledge-intensive contexts as alternatives to few-shot demonstrations to better guide LLMs in performing downstream classification tasks? We also conduct ablation studies to explore the importance of the number of demonstrations on the quality of principles generated.

Single-Agent vs. Multi-Agent LLM Framework
Text classification, as one of the most fundamental NLP tasks, appears to be straightforward in the sense that LLMs only need to output one or more class labels from a predefined label space. However, it can actually be quite complicated and even more challenging than other tasks because the reasoning process is implicit.
Most research on LLM ICL attempts to enhance model performance either by decomposing complex tasks into multiple steps or by providing LLMs with relevant domain- and task-specific data as additional context, such as the Retrieval Augmented Generation (RAG) approach. For instance, Chain-of-Thought (CoT) prompting first prompts the LLM to break problem-solving into multiple steps and then derives the final answer by following a step-by-step thought process (Wei et al. 2022). Focusing on QA tasks, step-back prompting (Zheng et al. 2023) runs inference on the same LLM twice: it first asks the LLM to provide abstract principles or concepts that help resolve the original question, and then asks it to answer the question. To improve LLMs' performance on text classification, Clue And Reasoning Prompting (CARP) (Sun et al. 2023b) includes multiple steps in a single prompt for each data point, asking the same LLM to first find superficial clues (e.g., keywords, tones, semantic relations, references, etc.) on which the final decision is based after reasoning steps. CARP also leverages knowledge acquired through supervised fine-tuning on labeled datasets to search for more effective demonstrations for ICL.

Recently, the multi-agent framework has gained popularity and has been shown to greatly improve LLMs' performance on complicated tasks such as long-context QA, multi-hop QA, math, and reasoning (Shridhar, Stolfo, and Sachan 2022; Wang et al. 2022). For instance, the multi-agent debate framework can improve LLMs' reasoning capability, factuality, and inter-consistency in mathematical and multiple-choice commonsense reasoning tasks, as well as output quality in open-ended generation tasks, in comparison to their single-agent counterparts (Du et al. 2023; Xiong et al. 2023; Chan et al. 2023). In our multi-agent implementation of the principle-based approach, we try competitive and collaborative paradigms and evaluate their effectiveness.
Performance improvements provided by the single- or multi-agent solutions mentioned above, using either self-ensemble (multiple inferences on the same LLM agent) or heterogeneous ensemble (multiple inferences on different LLM agents) approaches, usually come at significantly increased inference costs due to multiple LLM inferences and/or communication costs across different agents. Our research question is: can we achieve the same performance improvement without significantly increasing inference costs for text classification? Unlike other tasks such as long-context QA or text generation, the label space for most text classification tasks is finite and relatively limited. Thus, the search space for an optimal principle should also be bounded. Accordingly, we propose to implement an effective and efficient multi-agent LLM framework to auto-generate a single all-inclusive SOP for each task and reuse it for inference on all data points. We believe that, in addition to improving performance, the shared principle can also help ensure consistency in classification predictions.

Methods

PRINCIPLE-BASED PROMPTING is motivated by the observation that when performing classification tasks, human beings usually start to build their mental models after reviewing a few concrete examples by summarizing common key principles. Humans tend to rely on abstracted principles since we have limited memory capacity to remember overwhelmingly large amounts of detailed data points. The more scenarios these principles cover, the more helpful they should be for performing the same task on unseen data. As we show later, for our two internal datasets (Product Classification 1 and Product Classification 2, abbreviated PC1 and PC2), we have principles manually drafted by domain experts for each task to help ensure annotation quality.
In the experiments section, we also investigate whether text classification via ICL with principles generated by our multi-agent framework can outperform their human-generated counterparts. We implement our PRINCIPLE-BASED PROMPTING strategy via a multi-agent LLM framework. It consists of three major steps, each of which can be completed by one or multiple LLM agents (see Figure 1).

Principle Generation
Before tackling the classification problem, we first ask the multi-agent LLMs to analyze a few randomly sampled demonstrations with or without label information on their own. Then, we ask them to generate principles to distinguish each class based on their analysis. Since principles are generated at the task level, additional inference costs are incurred only once per principle generated, which is almost negligible in comparison to the inference costs for entire datasets.

In this step, we experiment with a diverse set of six different LLMs, ranging from open to closed models in various model sizes: two open-source LLMs from Huggingface, FLAN-T5-XXL (Chung et al. 2024) with 11B parameters and FLAN-UL2 (Tay et al. 2022) with 19.5B parameters, as well as Meta-Llama-3-70B-Instruct (AI@Meta, 2024), Mistral 7B (Jiang et al. 2023), Mixtral 8x7B [1], and Claude 3.5 Sonnet [2]. We directly download the FLAN-T5-XXL and FLAN-UL2 models from Huggingface and run inference on a p4.24xlarge EC2 instance with a batch size of 1. For other models, we run inference by making API calls. All inferences are performed with temperature=0.2 and top_p=0.9. Principles are generated based on a sampling of n=[4, 8, 16] demonstrations with and without label information from the training set for each task. Refer to Appendix 3 for prompt examples that we use to perform this step. Accordingly, for each task, we obtain 3 × 2 × 6 = 36 principle candidates by varying (1) the number of demonstrations: [4, 8, 16], (2) labeled or unlabeled demonstrations, and (3) six different LLM agents.
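The 3 × 2 × 6 grid of principle candidates described above can be enumerated in a few lines. This is only a minimal sketch: the dictionary keys and the `candidate_configs` helper are illustrative names that do not come from the paper, and the actual generation prompts (given in the paper's appendix) are not reproduced here.

```python
from itertools import product

# Settings taken from the text: demonstration counts, label usage,
# and the six LLMs used as principle generators.
DEMO_COUNTS = [4, 8, 16]
LABEL_MODES = ["labeled", "unlabeled"]
GENERATOR_MODELS = [
    "FLAN-T5-XXL",
    "FLAN-UL2",
    "Meta-Llama-3-70B-Instruct",
    "Mistral 7B",
    "Mixtral 8x7B",
    "Claude 3.5 Sonnet",
]


def candidate_configs():
    """Return one configuration per principle candidate (hypothetical helper)."""
    return [
        {"n_demos": n, "labels": mode, "model": model}
        for n, mode, model in product(DEMO_COUNTS, LABEL_MODES, GENERATOR_MODELS)
    ]


configs = candidate_configs()
print(len(configs))  # 3 * 2 * 6 = 36
```

Each configuration would correspond to one generation call per task, which is why the cost of this step does not grow with the size of the dataset being classified.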
Principle Consolidation
After the Principle Generation step, we discard the analysis and extract the principles only. These 36 principle candidates are then sent to a finalizer agent to provide the optimal principle for performing the target classification task. We implement three methods based on the paradigm of how these principle candidates are utilized to derive the final principle:

(1) Listwise ranking by the finalizer agent: We directly ask each LLM to rank the top five principles given the entire list of candidate principles based on their helpfulness for performing the target classification task. Previous research shows that ICL is sensitive to permutation of in-context examples (i.e., selection and ordering) (Wu et al. 2022). Accordingly, we randomize the list of principles presented to LLMs in two different orders, with and without demonstrations (n=2) to illustrate how the target task is defined, yielding 2 × 2 = 4 different prompts for each LLM agent. We aggregate the top five ranked principles from each LLM agent and select the top 1 principle for each dataset based on majority voting. We use all LLMs mentioned above except FLAN-T5-XXL (Chung et al. 2024) and FLAN-UL2 (Tay et al. 2022), because placing all the candidate principles in one single prompt exceeds their input token length limits (512 and 2048, respectively). This requires 4 × 4 = 16 inferences across the multi-agent LLMs. See the Appendix for prompt examples for principle ranking and Table 7 as an example of the final principle selected.

(2) Consolidation by the finalizer agent: The listwise ranking method tries to make agents compete with each other and select the best principle based on their helpfulness to the downstream classification task.
In contrast, the consolidation method acknowledges that a single agent might not be able to provide the optimal principle for the task and instead tries to establish a comprehensive principle by integrating and summarizing key points from all principles while resolving conflicting information. Since this method requires the LLM agent to possess reasoning capabilities, we select Claude 3.5 Sonnet as the finalizer agent based on the overall high quality of principles generated in the previous step. See Appendix for prompt examples for principle consolidation and Table 7 as an example of the final principle selected.

(3) Random selection (control group): This method randomly selects one principle from all the candidates.

[1] https://mistral.ai/news/mixtral-of-experts/
[2] https://www.anthropic.com/news/claude-3-5-sonnet

Text Classification
After selecting the optimal principle for performing the downstream classification task, we append it to the prompt as context and ask LLMs to provide the answer to the classification task based on the provided principles. In this step, we only experiment with two open-source LLMs from Huggingface: FLAN-T5-XXL and FLAN-UL2, due to inference cost concerns. We run the inference with five random seeds on a p4.24xlarge EC2 instance using the same hyperparameters as in the Principle Generation step.

Baselines

We compare our PRINCIPLE-BASED PROMPTING approach to the following baselines. All prompting approaches listed here are considered single-agent approaches which involve one or multiple inferences with one LLM.

Vanilla Prompting
The LLM is provided with a task description containing all classification options, and then directly asked to provide a decision in short answer format. In the zero-shot setting, no demonstrations are provided, while n demonstrations are provided for few-shot settings.
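As a rough illustration of how the Text Classification step differs from zero-shot vanilla prompting, the sketch below assembles both prompts from the same parts and adds the finalized principle as extra context. Everything here is an assumption made for illustration: `build_prompt`, the section labels ("Options:", "Principles:", "Text:", "Answer:"), and the example strings are hypothetical, and the paper's actual prompt templates are the ones in its appendix.

```python
def build_prompt(task_description, labels, text, principle=None):
    """Assemble a zero-shot classification prompt (hypothetical template).

    With principle=None this mimics vanilla prompting; otherwise the
    task-level principle is included as additional context.
    """
    parts = [task_description, "Options: " + ", ".join(labels)]
    if principle is not None:
        # The same principle is reused for every data point of the task.
        parts.append("Principles:\n" + principle)
    parts.append("Text: " + text)
    parts.append("Answer:")
    return "\n\n".join(parts)


# Illustrative inputs only; none of these strings come from the paper.
task = "Decide whether the tweet is ironic."
options = ["ironic", "not ironic"]
tweet = "Great, another Monday."

vanilla_prompt = build_prompt(task, options, tweet)
principle_prompt = build_prompt(
    task, options, tweet,
    principle="Irony often states the opposite of the literal sentiment.",
)
print(principle_prompt)
```

Because the principle is fixed per task, the extra prompt length is constant across data points, unlike few-shot prompting where each added demonstration lengthens every prompt.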
CoT Prompting
The only difference between Vanilla and CoT prompting is that "Let's think step by step" is appended to prompts right before asking the LLM to output the final answer.

Stepback Prompting
In this two-step prompting approach, the LLM is first asked "What are the principles or important features to distinguish..." and then asked to provide the classification decision given the answers from the first step.

Principle Single-Agent
Unlike our multi-agent framework, this approach asks the classifier agent to first provide principles based on its analysis of randomly sampled demonstrations (n=4). Then these principles are appended as context when performing ICL for text classification tasks. We use this baseline to evaluate the contribution of the multi-agent framework to performance gains.

Finetuning
We also finetune a pretrained language model (RoBERTa-large), in full and few-shot settings, on training data with a linear classification layer on top of [CLS] embeddings. For public datasets (Irony2018, Emotion20, and Financial), the training sets range from 1K to 4K samples. We also finetune RoBERTa-large with only 10% of the datasets to evaluate performance in the few-shot settings. In contrast, the two internal datasets PC1 and PC2 have very limited training data (<200 samples), thus automatically falling into the few-shot setting.

Experiments

Datasets

We test our PRINCIPLE-BASED ICL approach and baselines on five text classification datasets: three are public (Irony2018, Emotion20 (Barbieri et al. 2020; Sailunaz and Alhajj 2019), and Financial Phrasebank (Malo et al. 2014)) and two are private datasets: Production Classification 1 (PC1) and Production Classification 2 (PC2). PC1, PC2, and Irony2018 are binary classification tasks, while Emotion20 and Financial Phrasebank are multi-class classification tasks.

Irony2018
We choose the Subtask 3A dataset of the SemEval2018 Irony Detection challenge (Barbieri et al. 2020) (referred to as "Irony2018"). The goal is to determine whether a tweet contains ironic intent. It contains 784 tweets in the test set.

Emotion20
Emotion recognition involves the identification and understanding of emotions expressed in text (Sailunaz and Alhajj 2019). The objective of this dataset is to identify four emotions expressed: anger, joy, optimism, and sadness. We use the dataset provided by the TweetEval benchmark (Barbieri et al. 2020), which we refer to as "Emotion20". It contains 1,421 data points in the test set.

Financial Phrasebank
We choose a dataset of sentences labeled with polar sentiment from financial news. This dataset consists of 4,840 sentences from English-language financial news categorized by sentiment. It is divided by agreement rates of 5-8 annotators, and we select the instances with ≥75% annotator agreement. We refer to this dataset as "Financial". It contains 1,036 financial statements in the test set.

PC1 and PC2
Production Classification 1 and 2 are binary classification datasets consisting of product descriptions from an e-commerce website and their associated classes as labels. They contain 1,788 and 1,749 unique products in the test set, respectively.

Evaluation

We use the macro-averaged F1 score as the evaluation metric, which considers the overall performance across all classes.

Results

Table 1 shows that under zero-shot settings, our PRINCIPLE-BASED PROMPTING approach outperforms not only vanilla prompting but also other strong baselines such as CoT prompting and stepback prompting for both FLAN-T5-XXL and FLAN-UL2 models. The principle single-agent approach achieves on-par or better performance than the more costly stepback prompting approach. Stepback prompting incurs twice the inference costs of vanilla prompting due to its two-step prompting strategy at the instance level (one for eliciting abstracted principles via questions, one for classification decisions).
In contrast, the principle single-agent approach only adds one single inference for generating principles at the task level. The multi-agent LLM framework with consolidation can further boost the performance gains of the principle-based approach over the single-agent implementation by 1.23% on FLAN-T5-XXL and 6.52% on FLAN-UL2 on average across five datasets. FLAN-UL2 with principles finalized by the multi-agent consolidation approach boosts model performance by 10.69% over vanilla prompting averaged across five datasets. FLAN-T5-XXL also achieves 6.92% performance gains averaged across five datasets. In general, the performance gains are more evident and consistent on FLAN-UL2 than FLAN-T5-XXL. This is probably due to FLAN-UL2's stronger reasoning capability with nearly twice as many model parameters as FLAN-T5-XXL, which can better incorporate principles provided to guide the downstream classification task. In comparison, other strong single-agent baselines such as CoT and stepback prompting either do not show consistent performance gains compared to vanilla prompting or are outperformed by the principle-based approach. For instance, FLAN-T5-XXL fails to benefit from CoT in general, while the multi-agent principle-based approach improves on stepback prompting, raising the average gain from 3.05% to 6.92% on FLAN-T5-XXL and from 4.28% to 10.69% on FLAN-UL2.

Figure 1: Pipeline and Multiagent illustrations of PRINCIPLE-BASED PROMPTING

Under the multi-agent framework, the consolidation approach performs better than its ranking and random (control group) counterparts. Interestingly, the ranking approach is sometimes even outperformed by random selection. This is likely because the cooperative mode of the multi-agent framework can better leverage different perspectives from multiple agents and potentially resolve limitations of single-agent approaches.
In contrast, the competitive mode is too risky and more likely to fail, as it heavily relies on the capability of a single champion agent.

Additionally, both LLMs perform better on the PC2 private classification task using principles generated and finalized by the principle-based multi-agent consolidation approach compared to principles created by humans (16.21% vs. 14.89% on FLAN-T5-XXL and 19.37% vs. 13.26% on FLAN-UL2). This demonstrates the effectiveness of our approach. On PC1, the principle-based multi-agent ranking approach achieves comparable or better performance gains (3.71% vs. 3.98% on FLAN-T5-XXL and 1.57% vs. 0.90% on FLAN-UL2, as reported in Table 1) compared to human-created principles.

When comparing with the finetuned RoBERTa-Large model, our PRINCIPLE-BASED PROMPTING approach significantly outperforms the finetuned encoder-only RoBERTa-Large under low-resource settings on three public datasets, using only 10% of the labeled datasets, resulting in training sets ranging from 78 to 174 samples. Since PC1 and PC2 have fewer than 200 training samples, they automatically fall into few-shot settings. The advantages of supervised fine-tuning diminish under few-shot settings, showing negative performance gains compared to the zero-shot vanilla prompting approach across all five datasets. When the number of labeled data increases to the full dataset, which contains thousands of labeled samples, the finetuned RoBERTa-Large model's performance improves due to explicit supervision from these labels and finally outperforms LLM ICL approaches on Emotion20 (15.11% vs. 17.62%) and Financial (14.17% vs. 16.62%) by only small margins. The small performance gap demonstrates that our principle-based multi-agent LLM approach can serve as an effective and cost-friendly alternative to supervised classifiers when labeling resources are constrained.

Table 1: Absolute improvements in the macro-F1 scores over the zero-shot vanilla prompting for various single- and multi-agent approaches under the zero-shot settings. Human-crafted principles are only available for two private datasets. Results are averaged across five inferences with different random seeds.

| Model | Setting | Method | Irony2018 | Emotion20 | Financial | PC1 | PC2 | AVG |
|---|---|---|---|---|---|---|---|---|
| flan-t5-xxl | single agent | CoT | -9.31 | -14.23 | 1.51 | -1.56 | 17.25 | -1.27 |
| flan-t5-xxl | single agent | stepback | -2.03 | 1.68 | -3.31 | 1.36 | 17.56 | 3.05 |
| flan-t5-xxl | single agent | principle | 2.62 | 8.13 | 3.40 | 1.40 | 12.89 | 5.69 |
| flan-t5-xxl | single agent | principle+human | NA | NA | NA | 3.98 | 14.89 | NA |
| flan-t5-xxl | multi agent | principle+random | 0.63 | 9.74 | 6.69 | 2.43 | 14.16 | 6.73 |
| flan-t5-xxl | multi agent | principle+ranking | 1.55 | 9.52 | 4.16 | 3.71 | 13.84 | 6.56 |
| flan-t5-xxl | multi agent | principle+consolidation | 0.45 | 12.13 | 4.38 | 1.43 | 16.21 | 6.92 |
| flan-ul2 | single agent | CoT | -6.87 | 0.41 | 0.96 | -0.58 | 13.46 | 1.48 |
| flan-ul2 | single agent | stepback | 2.72 | 0.47 | 4.18 | 0.02 | 13.99 | 4.28 |
| flan-ul2 | single agent | principle | 4.57 | 0.02 | 3.42 | -0.2 | 13.03 | 4.17 |
| flan-ul2 | single agent | principle+human | NA | NA | NA | 0.90 | 13.26 | NA |
| flan-ul2 | multi agent | principle+random | 5.56 | 12.15 | 11.78 | -0.54 | 19.08 | 9.61 |
| flan-ul2 | multi agent | principle+ranking | 4.96 | 11.14 | 11.05 | 1.57 | 18.69 | 9.48 |
| flan-ul2 | multi agent | principle+consolidation | 4.77 | 15.11 | 14.17 | 0.04 | 19.37 | 10.69 |
| RoBERTa | finetune | full | 0.44 | *17.62 | *16.62 | -5.26 | -7.93 | 4.30 |
| RoBERTa | finetune | 10% | -19.71 | -41.01 | -52.41 | NA | NA | NA |

Principle-based vs. Few-shot ICL

We also compare the performance of the multi-agent principle consolidation approach with the few-shot ICL approach. Results in Table 2 align with findings in previous research that adding more demonstrations tends to improve LLM ICL performance across all datasets with both LLMs (Levy, Bogin, and Berant 2022). However, we also observe that this effect quickly diminishes, and model performance plateaus and even decreases as n increases to 4 or 8. Table 2 shows that the principle-based approach is very competitive even in comparison to the few-shot ICL, which leverages one or more demonstrations as contexts, thus resulting in significantly increased input token length. It outperforms all few-shot ICL (n=[1, 2, 4, 8]) on four (Irony2018, Emotion20, Financial, and PC2) out of five datasets with FLAN-UL2, and also shows comparable performance gains on PC1 (0.59 vs. 0.04). Although the results with FLAN-T5-XXL are slightly mixed, it outperforms all few-shot ICL (n=[1, 2, 4, 8]) on two (Emotion20 and Financial) out of five datasets and shows comparable performance to the best n-shot setting on Irony2018 (0.68 vs. 0.45), PC1 (1.49 vs. 1.43), and PC2 (17.36 vs. 16.21).

Previous research adopts a sliding window approach to tackle the prompt length constraints (Ma et al. 2023; Sun et al. 2023a) imposed by LLMs. We show our principle-based approach can also serve as a good solution to bypass this limit. We compute input token lengths of our multi-agent consolidation approach with both FLAN-UL2 and FLAN-T5-XXL tokenizers on each dataset. Since the numbers are very similar, we only use data from the FLAN-UL2 tokenizer. Figure 2 shows that the length of input tokens increases linearly as the number of demonstrations increases for few-shot prompting.
In contrast, the principle-based approach has much shorter input token lengths compared to most few-shot settings. Its input token length roughly corresponds to 2-shot on Emotion20 and Financial, and 4-shot on Irony2018.

Since PC1 and PC2 are internal datasets with lengthy product titles and descriptions as inputs, increasing the number of demonstrations n beyond four in few-shot ICL is not only costly in terms of inference but also infeasible due to input length limits imposed by LLMs: 512 for FLAN-T5-XXL and 2048 for FLAN-UL2. The PRINCIPLE-BASED PROMPTING approach, however, only needs input token lengths that are even less than the 1-shot setting. Nevertheless, the PRINCIPLE-BASED PROMPTING approach achieves better performance on PC2 with FLAN-UL2 and comparable performance on PC1 with both models, while significantly reducing inference costs.

Ablation Studies

We further investigate how different factors contribute to crafting high-quality principles for downstream classification tasks: (1) the number of demonstrations used for principle generation, (2) whether these demonstrations are labeled, and (3) the use of a single-agent versus multi-agent LLM framework. Specifically, we randomly sample demonstrations (where n=[4, 8, 16]) with and without labels. For the single-agent approach, we use the classifier LLM agent to generate principles based on its analysis of n demonstrations with or without label information. In the multi-agent approach, we employ the consolidation-based multi-agent LLM framework for each number of demonstrations. For both approaches, we use the same open-source models (FLAN-T5-XXL and FLAN-UL2) as classifier agents.

Figure 3 shows that using more demonstrations does not guarantee higher quality principles during the principle generation stage. Including label information during principle generation, however, tends to have a positive impact on classification performance in most cases.
Nevertheless, we observe exceptions on some datasets with different LLMs. For instance, FLAN-T5-XXL achieves better classification performance on Irony2018 and PC1 when using principles generated from unlabeled samples rather than labeled samples.

Table 2: Absolute improvements in the macro-F1 scores over the zero-shot vanilla prompting for the few-shot versus zero-shot principle-based approaches. Results are averaged across five inferences with different random seeds. n indicates the number of demonstrations per class. For PC1 and PC2, experiments were limited to n ≤ 2 due to out-of-memory errors caused by long input token lengths.

| Dataset | Model | n=1 | n=2 | n=4 | n=8 | multi-agent principle consolidation |
|---|---|---|---|---|---|---|
| Irony2018 | flan-t5-xxl | 0.62 | 0.08 | 0.06 | 0.68 | 0.45 |
| Irony2018 | flan-ul2 | 3.63 | 3.08 | 3.64 | 3.66 | 4.77 |
| Emotion20 | flan-t5-xxl | 7.82 | 4.17 | 1.92 | 2.58 | 12.13 |
| Emotion20 | flan-ul2 | 0.94 | 1.28 | 0.32 | 0.92 | 15.11 |
| Financial | flan-t5-xxl | 1.57 | 2.26 | 2.28 | 2.70 | 4.38 |
| Financial | flan-ul2 | 8.22 | 10.42 | 11.49 | 11.32 | 14.17 |
| PC1 | flan-t5-xxl | 0.22 | 1.49 | NA | NA | 1.43 |
| PC1 | flan-ul2 | 0.59 | 0.47 | NA | NA | 0.04 |
| PC2 | flan-t5-xxl | 17.36 | 17.31 | NA | NA | 16.21 |
| PC2 | flan-ul2 | 16.98 | 17.41 | NA | NA | 19.37 |

Figure 2: Comparison of input token lengths between principle-based and few-shot vanilla prompting approaches. Stars on each line indicate where the input token length of the principle-based multi-agent consolidation approach corresponds to different n-shot settings (where n ranges from 1 to 8).

Additionally, as shown in Figure 4, principles generated by the multi-agent LLM framework significantly improve ICL performance across all datasets compared to those generated by the single-agent framework using relatively weaker LLMs (FLAN-T5-XXL and FLAN-UL2). This improvement is consistent across all numbers of demonstrations selected for principle generation, with the exception of 16-shot principle generation on Irony2018. These results demonstrate that our multi-agent consolidation framework is essential for generating high-quality principles for downstream classification.
The framework overcomes the limitations of weaker classifier LLM agents (selected primarily due to inference cost considerations) by first utilizing LLMs with better reasoning capabilities (Claude 3.5 Sonnet and Llama-3-70B-Instruct) as principle generator agents, and then further optimizing principles through consolidation.

Discussion and Conclusion

We introduce PRINCIPLE-BASED PROMPTING, implemented via a multi-agent framework, as a simple yet generic strategy to elicit deep reasoning capabilities of LLMs by providing them with principles to perform downstream classification tasks. We show its superior performance over single-agent frameworks, including vanilla prompting and other strong ICL strategies such as CoT (Wei et al. 2022), CARP (Sun et al. 2023b), and stepback prompting (Zheng et al. 2023). A key difference from previous work that scaffolds LLMs with self-elicited clues, or asks for high-level concepts and principles before tackling the problem, is the level at which knowledge is extracted: instead of prompting LLMs to extract abstract principles or superficial clues to answer a single question, we perform knowledge distillation at the task level by providing multiple demonstrations with or without labels and instructing LLMs to extract common patterns (principles) based on their analysis. Our intuition is that analyzing how to solve the same task under different scenarios can help generate general knowledge that is abstracted away from details and thus easily applicable to unseen data with different distributions. The principles generated this way are knowledge-intensive and task-specific, and thus more efficient than those generated by purely relying on LLMs' general world knowledge obtained during the pretraining stage.
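The task-level flow described above (several generator agents each propose principles from the same demonstrations, a finalizer consolidates them, and a classifier receives only the principles) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` callable type, the function names, the `Answer:` suffix used to attach labels, the `Set i:` numbering, and the abbreviated prompt wording are all hypothetical, and no real LLM API is called.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical agent interface: takes a prompt string, returns the model's text.
Agent = Callable[[str], str]
# A demonstration is (text, label); label is None in the unlabeled setting.
Demo = Tuple[str, Optional[str]]


def build_generation_prompt(task: str, demos: Sequence[Demo]) -> str:
    """Assemble a principle-generation prompt from demonstrations."""
    lines = [
        f"You are given the task to extract principles or important features which {task}.",
        "Here are some examples:",
    ]
    for text, label in demos:
        # The "Answer:" suffix is an assumed way to attach a label.
        suffix = f" Answer: {label}" if label is not None else ""
        lines.append(f"Statement: {text}{suffix}")
    lines.append(
        "Based on your analysis, can you extract principles or important "
        f"features which {task}?"
    )
    return "\n".join(lines)


def generate_and_consolidate(
    generators: Sequence[Agent],
    finalizer: Agent,
    task: str,
    demos: Sequence[Demo],
) -> str:
    """Each generator proposes principles independently; the finalizer merges them."""
    prompt = build_generation_prompt(task, demos)
    candidates: List[str] = [generate(prompt) for generate in generators]
    numbered = "\n\n".join(
        f"Set {i}:\n{candidate}" for i, candidate in enumerate(candidates, start=1)
    )
    return finalizer(
        "You are given a list of principles written by different LLM agents.\n"
        f"Here are the sets of principles:\n{numbered}\n"
        "Please analyze these principles and create a consolidated set that is "
        "clear, non-redundant, and comprehensive."
    )


def classify(classifier: Agent, instruction: str, principles: str, text: str) -> str:
    """Zero-shot classification: the prompt carries principles, not demonstrations."""
    return classifier(f"{instruction}\n{principles}\nStatement: {text}")
```

Because the generator and finalizer calls happen once per task rather than once per input, only the `classify` call contributes to per-example inference cost in this sketch.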
Because principle generation is performed at the task level, we show that by implementing the principle-based approach via a multi-agent consolidation framework, we can achieve significant performance improvement with only minimal additional inference costs for text classification tasks.

The competitive performance of our principle-based approach compared to few-shot ICL settings indicates that naively adding more demonstrations is not an efficient way to teach LLMs the input-label mapping relationship on new tasks. On one hand, sub-optimal sampling of demonstrations might provide a biased perspective for tackling the task, which is then insufficient to perform well on more complex or challenging examples. On the other hand, adding more demonstrations can potentially introduce more noise, as the vast amount of detail contained in demonstrations is not only challenging for LLMs to comprehend but also distracting, since some details might be irrelevant for performing the classification task at hand. Accordingly, performance could be negatively impacted, as we observe in Table 2. In contrast, our PRINCIPLE-BASED approach abstracts away all these irrelevant details based on analysis across multiple demonstrations and presents only the most salient instructions for LLMs to focus on. It can serve as an alternative to the popular few-shot ICL approach for performing classification tasks, especially when inference costs and input token length are constraints imposed by certain LLMs.

Additionally, our multi-agent framework for principle generation is generic and can be applied to any use case that requires synthetic text generation. It can automatically generate highly relevant and knowledge-intensive documents (e.g., standard operating procedures, SOPs) with only a handful of examples, regardless of the availability of labeling resources.
Although traditional Retrieval Augmented Generation (RAG) usually performs retrieval of relevant documents from existing data stores, our approach can automatically generate highly relevant documents or SOPs for any task. The comparable or even better classification performance of LLMs shown in Table 1 using LLM-generated principles in comparison to human-generated counterparts suggests a promising direction for automating SOP generation without compromising the quality of the SOPs generated. As future research, it would also be interesting to see how our PRINCIPLE-BASED approach can be integrated with RAG.

While our principle-based approach provides an effective and efficient ICL solution for text classification under zero-shot settings, we acknowledge several limitations. First, it might not work well for classification problems with many labels, since generating principles that cover all classes might lead to very lengthy content to be included as context. In this case, we could potentially generate principles for each class individually and use a retriever to fetch the corresponding principles for the top-k classes before performing downstream classification. Additionally, we only explore open-source models such as FLAN-T5-XXL and FLAN-UL2 as classifier agents due to inference cost constraints. In future work, we would like to investigate whether the same performance gains can be replicated with black-box LLMs such as GPT-4. Lastly, while we mainly focus on zero-shot settings of our principle-based approach, it would also be interesting to explore whether adding concrete examples that are specifically analyzed and explained based on these principles would further improve model performance. We leave these research questions for future work.

References

Barbieri, F.; Camacho-Collados, J.; Neves, L.; and Espinosa-Anke, L. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.

Chan, C.-M.; Chen, W.; Su, Y.; Yu, J.; Xue, W.; Zhang, S.; Fu, J.; and Liu, Z. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.

Chen, J.; Chen, L.; Zhu, C.; and Zhou, T. 2023. How Many Demonstrations Do You Need for In-context Learning? In Findings of the Association for Computational Linguistics: EMNLP 2023, 11149–11159.

Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70): 1–53.

De Silva, B.; Huang, K.-W.; Lee, G.; Hovsepian, K.; Xu, Y.; and Shen, M. 2023. Semantic matching for text classification with complex class descriptions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7654–7680.

Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J. B.; and Mordatch, I. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.

Jiang, A. Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D. S.; Casas, D. d. l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Kossen, J.; Rainforth, T.; and Gal, Y. 2023. In-context learning in large language models learns label relationships but is not conventional learning. arXiv preprint arXiv:2307.12375.

Levy, I.; Bogin, B.; and Berant, J. 2022. Diverse demonstrations improve in-context compositional generalization. arXiv preprint arXiv:2212.06800.

Ma, X.; Zhang, X.; Pradeep, R.; and Lin, J. 2023. Zero-shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156.

Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; and Takala, P. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4): 782–796.

Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; and Zettlemoyer, L. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.

Sailunaz, K.; and Alhajj, R. 2019. Emotion and sentiment analysis from Twitter text. Journal of computational science, 36: 101003.

Shridhar, K.; Stolfo, A.; and Sachan, M. 2022. Distilling reasoning capabilities into smaller language models. arXiv preprint arXiv:2212.00193.

Sun, W.; Yan, L.; Ma, X.; Ren, P.; Yin, D.; and Ren, Z. 2023a. Is chatgpt good at search? investigating large language models as re-ranking agent. arXiv preprint arXiv:2304.09542.

Sun, X.; Li, X.; Li, J.; Wu, F.; Guo, S.; Zhang, T.; and Wang, G. 2023b. Text classification via large language models. arXiv preprint arXiv:2305.08377.

Tay, Y.; Dehghani, M.; Tran, V. Q.; Garcia, X.; Wei, J.; Wang, X.; Chung, H. W.; Shakeri, S.; Bahri, D.; Schuster, T.; et al. 2022. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.

Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837.

Wu, Z.; Wang, Y.; Ye, J.; and Kong, L. 2022. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. arXiv preprint arXiv:2212.10375.

Xiong, K.; Ding, X.; Cao, Y.; Liu, T.; and Qin, B. 2023. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. arXiv preprint arXiv:2305.11595.

Zheng, H. S.; Mishra, S.; Chen, X.; Cheng, H.-T.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2023. Take a step back: Evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117.

Appendix

Figure 3: Effects of label information in sampled demonstrations on generating high-quality principles for downstream classification.

Figure 4: Effects of single- vs. multi-agent frameworks on generating high-quality principles for the downstream classification task.

Table 3: Irony2018

Label Word Mapping:
{Yes: 1; No: 0}

Principle Generation Prompt:
You are given the task to extract principles or important features which distinguish between statements that contain irony and those that do not.
Here are some examples:
Statement: <sent>
Statement: <sent>
Statement: <sent>
Statement: <sent>
Can you analyze each statement and identify whether it contains irony or not?
Based on your analysis, can you extract principles or important features which distinguish between statements that contain irony and those that do not?

Classification Prompt:
You are given the task to identify the sentiment of the following statement.
Here are important features to distinguish statements that contain irony and those that do not.
{principle}

Listwise Ranking Prompt:
You are given the task to rank a list of principles based on how helpful they are for identifying whether statements contain irony or not.
Here is the list of principles:
{list of principles}
Here are some examples of statements:
{fewshot example}
How would you rank the principles above based on helpfulness for identifying whether statements contain irony or not? Provide your ranking of top 10 principles in the following format: A > B > C...

Consolidation Prompt:
You are given multiple sets of principles for distinguishing emotions in statements. Your task is to analyze these principles and consolidate them into a single, comprehensive set of principles.
Here are the sets of principles:
{sets of principles}
Please analyze these principles and create a consolidated set that captures the most important and effective principles for identifying emotions in statements. Ensure the consolidated set is clear, non-redundant, and comprehensive.

Table 4: Emotion20

Label Word Mapping:
{Anger: 0; Joy: 1; Optimism: 2; Sadness: 3}

Principle Generation Prompt:
You are given the task to extract principles or important features which distinguish statements that express four different emotions: anger, joy, optimism, and sadness.
Here are some examples that express different emotions:
Statement: <sent>
Statement: <sent>
Statement: <sent>
Statement: <sent>
Can you analyze each statement and identify the emotion that it tries to express from these four options: anger, joy, optimism, and sadness?
Based on your analysis, can you extract principles or important features which distinguish between statements that express these four emotions: anger, joy, optimism, and sadness?

Classification Prompt:
You are given the task to identify the emotion of the following statements from four options: anger, joy, optimism, and sadness.
Here are some principles that distinguish statements expressing different emotions:
{principle}

Listwise Ranking Prompt:
You are given the task to rank a list of principles based on how helpful they are for identifying the emotions of statements from four options: anger, joy, optimism, and sadness.
Here is the list of principles:
{list of principles}
Here are some examples of statements:
{fewshot example}
How would you rank the principles above based on helpfulness for identifying emotions of statements? Provide your ranking of top 5 principles in the following format: A > B > C...
Consolidation Prompt:
You are given a list of principles written by different LLM agents to distinguish statements that express four different emotions: anger, joy, optimism, and sadness.
Here are the sets of principles:
{sets of principles}
Please analyze these principles and create a consolidated set that captures the most important and effective principles for identifying irony in statements. Ensure the consolidated set is clear, non-redundant, and comprehensive.

Table 5: Financial

Label Word Mapping:
{Positive: 1; Negative: 0; Neutral: 2}

Principle Generation Prompt:
You are given the task to extract principles or important features which distinguish between financial news that have positive, neutral, or negative sentiments.
Here are some examples:
Statement: <sent>
Statement: <sent>
Statement: <sent>
Statement: <sent>
Can you analyze each financial news below and identify the sentiment from these three options?
Based on your analysis, can you extract principles or important features which distinguish between statements that have positive, neutral, or negative sentiments?

Classification Prompt:
You are given the task to identify the sentiment of the following financial news.
Here are some key principles that distinguish statements with positive, neutral, and negative sentiments.
{principle}

Listwise Ranking Prompt:
You are given the task to rank a list of principles based on how helpful they are for identifying sentiments of financial news from three options: positive, negative, or neutral.
Here is the list of principles:
{list of principles}
Here are some examples of statements:
{fewshot example}
How would you rank the principles above based on helpfulness for identifying sentiments of financial news? Provide your ranking of top 10 principles in the following format: A > B > C...
Consolidation Prompt:
You are given a list of principles written by different LLM agents to distinguish financial news with positive, neutral or negative sentiments.
Here are the sets of principles:
{sets of principles}
Please analyze these principles and create a consolidated set that captures the most important and effective principles for identifying different sentiments in financial news. Ensure the consolidated set is clear, non-redundant, and comprehensive.

Table 6: PC1 and PC2

Label Word Mapping:
{Yes: 1; No: 0}

Principle Generation Prompt:
You are given the task to extract principles or important features which distinguish between products that are classified as A and those that are not.
Here are some examples and their corresponding answers.
Statement: <sent>
Statement: <sent>
Statement: <sent>
Statement: <sent>
Can you analyze each product description below and identify whether it is classified as A or not?
Based on your analysis, can you extract principles or important features which distinguish between products that are classified as A and those that are not?

Classification Prompt:
You are given the task to identify whether the product below is classified as A or not based on the product description.
Here are some key principles that distinguish products that are classified as A and those that are not.
{principle}

Listwise Ranking Prompt:
You are given the task to rank a list of principles based on how helpful they are for identifying whether products below are classified as A or not based on product descriptions.
Here is the list of principles:
{list of principles}
Here are some examples of statements:
{fewshot example}
How would you rank the principles above based on helpfulness for identifying products as A or not? Provide your ranking of top 5 principles in the following format: A > B > C...
Consolidation Prompt:
You are given a list of principles written by different LLM agents to distinguish products that are classified as A or not.
Here are the sets of principles:
{sets of principles}
Please analyze these principles and create a consolidated set that captures the most important and effective principles for identifying products classified as A or not. Ensure the consolidated set is clear, non-redundant, and comprehensive.

Table 7: Example principles finalized by the multi-agent LLM framework

Dataset: Emotion20
Principles finalized:
Here are some principles that distinguish statements expressing different emotions:
• Anger statements tended to express resentment, insults, confrontation, aggression or rage. They often involved critique of others or expressed a desire for revenge.
• Joy statements conveyed a sense of cheerfulness, amusement or pleasure. They referenced positive or fun activities and did not criticize others.
• Optimism statements had an upbeat, hopeful or ambitious tone. They focused on positive goals, beliefs in achievement or maintaining a positive mindset.
• Sadness statements expressed regret, disheartenment, grief, failure or negative outcomes. They had a somber, downbeat tone and referenced disappointment or undesirable situations.
Some key distinguishing features between the emotions included:
• Tone (positive vs. negative, upbeat vs. downbeat)
• Attitude toward others (critical vs. not critical)
• Focus (goals/beliefs vs. regret/failure)
• References to emotion words like rage, disgust, cheerfulness, hope, regret
• Mention of confrontation/aggression vs. pleasure/amusement
• Desire for revenge/payback vs. absence of such sentiments.

Dataset: Irony2018
Principles finalized:
Key principles that distinguish statements that contain irony and those do not:
• Ironic statements often use exaggerated, insincere or inappropriate language that implies the opposite or a hidden meaning when taken literally.
• Ironic statements commonly employ linguistic cues like sarcasm, understatement or rhetorical questions to imply the unstated attitude of the speaker.
• Emoticons, punctuation or contextual cues can indicate a statement is not meant to be taken at face value.
• Non-ironic statements directly and literally state what is meant without implicit, implied or hidden meanings beneath the surface. They do not rely on tone or context.
In summary, ironic statements tend to have layers of implied or intended meaning beyond the surface interpretation, while non-ironic statements clearly and directly state what is meant without implicit meanings or implications. The use of exaggerated language, insincere tones and cues from context/punctuation also distinguishes ironic statements.
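The appendix prompts imply two small post-processing steps: the listwise ranking prompts request output of the form `A > B > C...`, and the label word mappings turn a classifier's answer word into a label id. The sketch below shows one way these steps might be handled. It is an illustration only: the function names, the use of single-letter identifiers as dictionary keys, the handling of unknown identifiers, and the prefix-based answer matching are assumptions, not details reported in the paper.

```python
from typing import Dict, List, Optional


def parse_ranking(response: str) -> List[str]:
    """Split an 'A > B > C...' style ranking into ordered identifiers."""
    # Strip surrounding spaces and dots so a trailing ellipsis is ignored.
    items = [part.strip(" .") for part in response.split(">")]
    return [item for item in items if item]


def select_principles(principles: Dict[str, str], response: str, top_k: int) -> List[str]:
    """Return the text of the top_k ranked principles, skipping unknown identifiers."""
    ranked = [pid for pid in parse_ranking(response) if pid in principles]
    return [principles[pid] for pid in ranked[:top_k]]


def to_label_id(answer: str, label_words: Dict[str, int]) -> Optional[int]:
    """Map a free-text answer to a label id by case-insensitive prefix match."""
    normalized = answer.strip().lower()
    for word, label_id in label_words.items():
        if normalized.startswith(word.lower()):
            return label_id
    return None  # the answer matched no label word


# Label word mapping as listed for Emotion20 in Table 4.
EMOTION_LABELS = {"Anger": 0, "Joy": 1, "Optimism": 2, "Sadness": 3}
```

Returning `None` for unmatched answers keeps malformed outputs visible to the caller instead of silently assigning them to a class; how such cases were scored in the experiments is not stated in the text.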

---