loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2406.15126

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Authors: Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

Published: 2024-06-14

Abstract:

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.

Paper Content: on Alphaxiv
Page 1: On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey Lin Long1, Rui Wang1, Ruixuan Xiao1 Junbo Zhao1, Xiao Ding2, Gang Chen1, Haobo Wang1* 1Zhejiang University, China2Harbin Institute of Technology, China Correspondence :wanghaobo@zju.edu.cn Abstract Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent ad- vent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limita- tions of real-world data with synthetic data gen- eration. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper pro- vides an organization of relevant studies based on a generic workflow of synthetic data gen- eration. By doing so, we highlight the gaps within existing research and outline prospec- tive avenues for future study. This work aims to shepherd the academic and industrial com- munities towards deeper, more methodical in- quiries into the capabilities and applications of LLMs-driven synthetic data generation. 1 Introduction The game-changing emergence of Large Language Models (LLMs) instigated a significant paradigm shift in the field of deep learning (Zhang et al., 2023a; Guo et al., 2023; Bang et al., 2023). Despite these advancements, a large amount of high-quality data remains the foundation for building robust NLP models (Gandhi et al., 2024). To be more specific, here high-quality data typically refers to diverse data that carries rich supervision signals (generally in the form of labels) closely aligned with human intent. However, fulfilling such data reliance with human data can be challenging or even unrealistic sometimes, due to high costs, data scarcity, privacy concerns, etc. (Kurakin et al., 2023). Moreover, several studies (Hosking et al., 2023; Singh et al., 2023; Gilardi et al., 2023) have highlighted that human-generated data, being inher- ently susceptible to biases and errors, may not even be optimal for model training or evaluation. These considerations necessitate a more serious inquiry *Corresponding author. ClinicalTrialsHealthcareAnalyticsTable Analysis SyntheticDataLLM PromptPrompt𝒳SynthesisAnnotationSynthetic DataCuration EvolvementLLM+LM1LM2LMn…Training/Fine-TuningGenerationSyntheticDataGenerationDownstreamApplication 1 DownstreamApplication 2Figure 1: Illustration of the LLMs-based application ecosystem, where synthetic data serves as the flow- ing nutrients for fruiting (training of small LMs or fine-tuning task-specific LLMs) and rooting (training stronger LLMs or self-improvement). into the question: are there other more effective and scalable methods of data collection that can overcome the current limitations? Given the recent advancements in LLMs, which demonstrate the capability to generate fluent text on par with human output (Hartvigsen et al., 2022; Sahu et al., 2022; Ye et al., 2022a; Tang et al., 2023; Gao et al., 2023a), synthetic data produced by LLMs emerges as a viable alternative or supple- ment to human-generated data. Specifically, syn- thetic data is designed to mimic the characteristics and patterns of real-world data (Liu et al., 2024). On the one hand, LLMs, through extensive pretrain- ing, have acquired a vast repository of knowledge and demonstrate exceptional linguistic comprehen- sion (Kim et al., 2022; Ding et al., 2023a), which forms a foundation for generating faithful data. On the other hand, the profound instruction-following capabilities of LLMs allow better controllability and adaptability over the generation process, facili- tating the creation of tailored datasets for specific applications with more flexible process designs (El- dan and Li, 2023). These two advantages make LLMs highly promising synthetic data generators. As a pivotal application of LLMs, synthetic data 1arXiv:2406.15126v1 [cs.CL] 14 Jun 2024 Page 2: generation holds significant importance for the de- velopment of deep learning. As shown in Figure 1, LLMs-driven synthetic data generation (Li et al., 2023c; Wang et al., 2021; Seedat et al., 2023) en- ables the automation of the entire model training and evaluation process with minimal human partic- ipation required in the loop (Huang et al., 2023), which allows the advantages of deep learning mod- els to be applied across a broader range of appli- cations. Beyond providing a scalable supply of training and testing data, LLM-driven synthetic data generation also may pave the way for develop- ing next-generation LLMs. Insights from TinySto- ries (Eldan and Li, 2023) and the Phiseries (Gu- nasekar et al., 2023; Li et al., 2023b) emphasize that data quality is crucial for effective model learning, while LLMs empower us to actively “design” what the models learn through data manipulation, signifi- cantly enhancing the efficacy and controllability of model training. As of June 2024, there are over 300 datasets on Hugging Face1that are tagged as “syn- thetic”, with many mainstream LLMs leveraging high-quality synthetic data for training, including Alpaca (Taori et al., 2023), Vicuna (Zheng et al., 2023), OpenHermes 2.5, and Openchat 3.5 (Wang et al., 2023a). Though seemingly straightforward, generating synthetic datasets that simultaneously have high correctness and sufficient diversity requires careful process designs and involves a lot of tricks (Gandhi et al., 2024), making LLMs-driven synthetic data generation a non-trivial problem. While most existing works generally target data generation for various tasks (e.g., pre-training (Gunasekar et al., 2023; Li et al., 2023b; Eldan and Li, 2023), fine-tuning (Mukherjee et al., 2023; Mitra et al., 2023; Xu et al., 2023a), evaluation (Feng et al., 2023; Wei et al., 2024)) across different domains (e.g., math (Yu et al., 2023a; Luo et al., 2023a), code (Luo et al., 2023b; Wei et al., 2023b), instruc- tion (Honovich et al., 2023a; Wang et al., 2023d)), they share many common ideas. To address the lack of a unified framework in the emerging field of LLM-driven synthetic data generation and develop a general workflow, this survey investigates recent studies and organizes them according to the topics of generation, curation, and evaluation, which are closely related, as shown in Figure 2. Our primary aim is to provide a comprehensive overview of the current state of the field, identify key areas of focus, 1https://huggingface.coand highlight the gaps that remain to be addressed. We hope to bring insights to both the academic and industrial communities and drive further develop- ment in LLM-driven synthetic data generation. 2 Preliminaries 2.1 Problem Definition In this paper, we investigate the challenge of generating high-quality synthetic data using pre- trained LLMs, denoted as M. Rather than creat- ing new datasets from scratch, in more cases, we perform data augmentation with a small number of seed samples or unlabeled inputs, which we denote uniformly as Dsup. Although optional for LLMs-driven synthetic data generation, Dsupcan typically provide valuable supporting information when available. Consequently, the overall genera- tion task can be formulated as: Dgen← M p(T,Dsup), (1) where Dgenrepresents the final generated dataset, andprefers to the prompt used for model inference. Tspecifies the generation task, such as rewrit- ing, question answering, annotation, etc. Notably, data annotation as a specialized paradigm of syn- thetic data generation, has particularly extensive applicability, including RLAIF (Bai et al., 2022) and LLMs-based evaluation (Chen et al., 2023b; Zheng et al., 2023; Kim et al., 2023), which may involve specific challenges and corresponding so- lution techniques. Due to page limitations, further details about data annotation can be found in Ap- pendix A. 2.2 Requirements of Dgen Briefly speaking, our goal is to generate data that closely aligns with evaluation metrics. While the standard of high-quality data may vary across dif- ferent downstream tasks, there are two general re- quirements that are considered challenging in most existing literature: •Faithfulness. To provide valid supervision, the generated data must first be logically and grammatically coherent. However, the inher- ent problems of hallucination fat-tailed knowl- edge distribution of LLMs can introduce sig- nificant noise into the generated results, man- ifesting as factual errors, incorrect labels, or irrelevant content. These issues become more pronounced when generating long, complex, or domain-specific data. 2 Page 3: I II IIIGenerationCurationEvaluationPrompt EngineeringTask SpecificationGeneration ConditionsIn-Context DemonstrationsConditioningScopeConditioningValuesDemonstration AcquirementDemonstration SelectionMulti-Step GenerationSample-Wise DecompositionDataset-Wise Decomposition Sample FilteringLabel EnhancementHeuristic MetricsRe-Weighting StrategiesLearning Dynamics BasedConsistency BasedManual Re-AnnotationAuxiliary Model EnhancementSample Selection Knowledge DistillationLabel RefineryDirect Evaluation Indirect EvaluationData FaithfulnessData DiversityBenchmark EvaluationHuman EvaluationAuxiliary Model EvaluationVocabulary StatisticsSample RelevanceBenchmark EvaluationOpen EvaluationHuman EvaluationModel EvaluationAutomatic Annotation (App. A)Metric BasedModeling Baseda special caseFigure 2: A taxonomy of LLMs-driven synthetic data generation, curation, and evaluation. •Diversity. Diversity captures the variation among the generated data, reflecting differ- ences in text length, topic, or even writing style. It is crucial for generating synthetic samples that mimic the diversified nature of real-world data, thereby preventing overfitting and bias during model training or evaluation. Nevertheless, due to the inherent biases of LLMs, uncontrolled generated content often tends to be monotonous, limiting its applica- bility in downstream tasks. These two requirements are the focal points of most current research efforts. In the subsequent work- flow, we will introduce how different methods ad- dress these issues. 3 Generic Workflow Existing studies on LLMs-driven synthetic data generation generally incorporate three main top- ics: generation, curation, and evaluation. Various approaches are employed within these aspects to collaboratively achieve optimal data generation. 3.1 Data Generation In this section, we systematically summarize some common practices for synthetic data generation with LLMs, which can be roughly divided into prompt engineering and multi-step generation. An overall illustration is provided in Figure 3.3.1.1 Prompt Engineering One of the greatest advantages of LLMs for syn- thetic data generation is their instruction-following capability, which contributes to great controllability (Wang et al., 2023c; Radford et al., 2019). There- fore, many approaches try to guide LLMs with heuristic prompts to enhance the faithfulness and diversity of the synthetic data (Liu et al., 2024). Empirically, an effective prompt generally con- tains three key elements: task specification etask, generation conditions econdition , and in-context demonstrations edemo, which are then collectively wrapped with a template Einto the form of natural instruction: p(T,D)←E(etask, econdition , edemo). (2) As shown above, both the generation task Tand the support dataset Dwill affect the design of p. Next, we will proceed to detail how each part of the prompt should be appropriately designed to accommodate various scenarios. Task Specification. In traditional crowdsourced annotation scenarios, the recruited workers are commonly offered a codebook that specifies the necessary contexts, such as task purpose, data ex- planation, and other background knowledge, so that they can better understand their jobs (Gilardi et al., 2023). Similarly, such task specification is crucial for setting the right context for LLMs- driven data generation, which can also include role- play (Li et al., 2023c), format clarification, knowl- edge augmentation (Xu et al., 2023b; Sudalairaj et al., 2024), etc. Evidence shows that a simple prologue such as “suppose you are a {xxx} ” can significantly improve the LLMs’ performance by setting up a proper scenario for data generation and allowing the LLMs to better take on the roles (Li et al., 2023c). More formally, Yoo et al. (2021) de- fines the task specification with a triplet of text type, label type, and label-token verbalizer. Such a de- scription header is particularly important when ex- tra domain expertise is demanded to address issues like terminology complexities in both context un- derstanding and data generation. Consequently, Xu et al. (2023b) leverages external knowledge graphs and LLMs to obtain domain topics for context- informed prompting, which effectively enhances the faithfulness and complexity of generated data. Conditional Prompting. As mentioned in Sec- tion 2.2, a pivotal challenge in using LLMs for 3 Page 4: LOW-Quality Data Prompt2 ❌ ✅HIGH-Quality DataPlease generate a negative movie review.Example Prompts in Existing Literature Prompt for Data Synthesis (AttrPrompt, Yu et al., 2023)Suppose you are a news writer. Please generate affordable care act news in NYT following requirements below:1. Should focus on role of state governments;2. Should be in length between 30 and 80 words;3. The writing style of the news should be news analysis;4. The location of the news is in Oceania. Prompt1List some words that reflect negative sentiment. ChatGPT p D Here are some words that reflect negative sentiment:1.Angry - Feeling or showing strong annoyance, displeasure, or hostility.2.Sad …3.Disappointed … Prompt for Data Annotation (App. A) -- A Special Case(LLMAAA, Zhang et al., 2023)You are a highly intelligent and accurate news domain named-entity recognition (NER) system. You take passage as input and your task is to recognize and extract specific types of named entities in that given passage and classify into a set of following predefined entity types: [PER,LOC,ORG,MISC] Your output format must be in json form of: [{"span":span,"type”:type},…]Input: … Output: …Input: Soccer – late goals give Japan win over Syria. Output: Y pX Prompts for Sample-Wise Multi-Step Generation (Self-Prompt, Li et al., 2023) LLM ReponsesList some historical events: D-Day, The Battle of Gallipoli, The Black Death, …This is a passage from Wikipedia about the historical even, The battle of Gallipoli:The Battle of Gallipoli … and to open up a supply route to Russia through the Dardanelles and … a million men.Here is a passage: The Battle of Gallipoli was … over half a million men. Extract the named entities in it:April 1915, …, Dardanelles, the Black Sea, …The Battle of Gallipoli was … over half a million men. Dardanelles is the answer to the question:What is the name of the waterway that was the site of the Battle of Gallipoli?The Battle of Gallipoli … million men. Question: What is the name of the waterway that was the site of the Battle of Gallipoli? Answer: Dardanelles. You can refer to the passage and write a short explanation to this Question-Answer pair:The Battle of Gallipoli was fought along the Dardanelles, a waterway that connects the Aegean Sea to the Sea of Marmara.ExamplePassageEntityQuestionExplain (Complete Sample)Example of Integrated Pipeline of Generating High-Quality Data (for Sentiment Analysis) ChatGPT"Black Panther: Wakanda Forever" plunges into a chasm of cinematic disappointment. Its glaring lack of coherence, stitched together with feeble narrative threads, is a disservice to its predecessor's legacy. …… ➕Negative[Label][Text]SampleAssemble 🟰 Suppose you are a movie reviewer. Please generate a movie review to show your <disappointment> about the movie <Black Panther: Wakanda Forever> with detailed explanation, following requirements below:1. Should be in length <between 30 and 80 words>;2. The writing style of the review should be <intense>;3. Suppose the time you write this review is <immediately after watching the movie>.Here are some examples about other movie reviews:<movie review 1><movie review 2><movie review 3>Figure 3: A toy example of effective synthetic data generation. The corresponding fields for task specification , conditions , and in-context demonstrations are highlighted, while < > marks the switchable contents. synthetic data generation is ensuring sufficient di- versity, as directly prompting the LLMs to produce data for certain tasks often results in highly repet- itive outputs, even with a high decoding temper- ature (Gandhi et al., 2024; Liu et al., 2024). Ad- dressing this problem, a widely adopted strategy is conditional prompting, which explicitly and con- cretely communicates to the LLMs the specific type of data desired. The core of conditional prompting involves delineating the targeted data through the formulation of a series of condition-value pairs: econdition ={(c1, v1),(c2, v2),···,(cn, vn)}, (3) which effectively characterizes the desired at- tributes and characteristics of the synthetic data. With different combinations of such attributes, we can automatically achieve a degree of “artificiallydefined” diversity in the generated samples (Gu- nasekar et al., 2023; Li et al., 2023b; Eldan and Li, 2023). Conditional prompting not only allows better control over the diversity and coverage of the generated dataset but also refines the content to a narrower, more focused scope that is more likely to align with our specific expectations and requirements (Li et al., 2023c). Current research on conditional prompting primarily centers on the following two subjects: 1)Conditioning Scope. As the backbone ofecondition , conditioning scope defined by {c1,···, cn}delineates the dimensions that we utilize to characterize our target data. Early studies (Gao et al., 2023a; Ye et al., 2022a,b) employed a basic output-conditional prompt- ing strategy, utilizing the specific label asso- 4 Page 5: ciated with the classification task as the con- ditioning variable. The rationale behind this was primarily to maintain class balance and coverage. However, such a strategy is unsuit- able for data lacking explicit category labels. Subsequent work by Yu et al. (2023b) argues that conditional-prompting with finer-grained attributes (e.g., topics, length, and style (Xu et al., 2023b)), can lead to more diversified generation due to the vast number of possi- ble attribute combinations, being also applica- ble to open-ended data. Additionally, Eldan and Li (2023) also condition each generation on the task of incorporating three randomly chosen words into the generated story. This approach was also proven to significantly en- hance the diversity of the generated data, shift- ing the focus from the heuristic features of the output to a more structured and targeted con- ditioning mechanism by adding “creative ran- domness” to the prompt (Eldan and Li, 2023). 2)Conditioning Values. After defining the con- ditioning scope, we then need to assign con- crete values to each condition. Despite the seemingly straightforward strategy of sam- pling from the known classes or labels (Ye et al., 2022a), there are cases where such an instance pool is unavailable. Addressing this problem, Josifoski et al. (2023) actively re- trieves the conditioning instances from exter- nal knowledge graphs, while Xu et al. (2023b); Ding et al. (2023b) leverage the LLMs to generate diversified instances for conditional prompting. Specifically, Ding et al. (2023b) construct a concept tree to delve into different subtopics, ensuring the coverage of sampled conditioning values, which then contributes to more diverse generated data. Moreover, the prompt template Ecan also be considered a special type of condition. It has been demon- strated that incorporating templates with a cer- tain level of randomness throughout the gen- eration process can enhance the diversity of the generated contents (Meng et al., 2022). In-Context Learning. Due to the inherent bias of LLMs, it remains challenging to elicit favorable responses from the LLMs with merely task speci- fication and conditional prompting. In this case, a straightforward yet effective strategy is to provide several demonstrations, which can serve as a form of implicit human guidance. Research has shownthat, owing to LLMs’ remarkable in-context learn- ing (ICL) capabilities, a few exemplars can provide them with insights into the patterns exhibited in real-world data, thereby significantly improving the faithfulness of generated data (Li et al., 2023c). In the few-shot setting, where labeled samples are available in the support set Dsup, these samples can be directly utilized as demonstrations for ICL. However, in scenarios where no ground truth data is available, approaches like Self-Instruct (Wang et al., 2023e) and Self-Prompting (Li et al., 2022) instead leverage ICL with synthetic demonstrations generated by LLMs. This allows the models to learn from their own predictions or other teacher models, even in the absence of labeled data. However, given the constraint of prompt length and data inconsistency, the quality of in-context samples significantly affects the effectiveness of in-context learning. Sudalairaj et al. (2024) ar- gue that randomly selecting in-context examples from the pool of seed samples, as done in Self- Instruct (Wang et al., 2023e), results in a lack of diversity and quality in the generated data. To ad- dress this issue, Sudalairaj et al. (2024) opt for selecting examples that concentrate on specific as- pects to better stimulate the long tail of knowledge inherent in LLMs. Liu et al. (2022b) and Su et al. (2023) prioritize consistent samples as demonstra- tive examples based on their cosine similarity in the embedding space. Alternatively, Ye et al. (2022b) selects the most informative samples using quanti- fied influence scores to steer the generation process. To enhance the informativeness of in-context ex- amples, He et al. (2023) prompts LLMs to provide an explanation for each sample before integrating it into the prompt. This approach not only offers valuable additional information but also aligns well with the subsequent Chain-of-Thought generation. 3.1.2 Multi-Step Generation In the previous paragraphs, we have introduced some common prompting strategies, which are typically designed for a specific generation task T. However, in most cases, due to the lack of enough reasoning abilities, it is unrealistic to ex- pect the LLMs to generate the entire desired dataset within a single reference, especially when target- ing data with complex structures or semantics (Cui and Wang, 2023). In addressing this problem, a common strategy is multi-step generation, through which the overall generation process is manually decomposed into a chain of simpler sub-tasks T1:k, 5 Page 6: to force the LLMs to produce data in a step-by-step manner as scheduled: Di← Mi pi(Ti,D0:i−1), i= 1,2,···, k, (4) where D0=Dsup. Each intermediate output Di is generated using model Mi, prompted by pi, for a sub-task Ti. These outputs can then po- tentially be used in subsequent generations. By manually scheduling the generation procedure, we implicitly align the reasoning paths of LLMs with human prior knowledge. Specifically, there are two common strategies for task decomposi- tion: sample-wise anddataset-wise decomposition, which mainly aim at enhancing the quality of syn- thetic data at different scales. Sample-Wise Decomposition. A typical use- case of multi-step generation is for addressing the challenges of long-text processing and logical rea- soning when dealing with multi-text data such as dialogues and entity-relation triplets. In such cases, a straightforward approach is to divide the sample into smaller chunks and generate only a portion of each sample at a time (Li et al., 2022; Ye et al., 2023; Wang et al., 2023e). In this way, D1:kcan be considered as different parts of Dgen: Dgen= (D1,D2,···,Dk). (5) Notably, as shown in Eq. 4, each iteration of the generation process can be conditioned on the previ- ously generated contents. For example, Ding et al. (2023b) prompts the LLMs to alternate between acting as the assistant and the user, replying to each other based on the context, ultimately producing a complete conversation transcript. In this way, the coherence among each internal component Di can be pointedly reinforced with separated instruc- tions, thus making it easier for the model to follow the requirements and generate more faithful data. It should be noted that D1:kmay not necessarily form part of the final Dgen, instead, explicitly out- putting some intermediate reasoning steps can also improve the generation of complex data (Bai et al., 2022; He et al., 2023). Chain-of-Thought (CoT) prompting stands out as one of the most popular strategies for improving the faithfulness of LLM- generated content (Wei et al., 2022). Nevertheless, current research on the exploration of such latent metadata is still insufficient, leaving sample-wise task decomposition from a reasoning perspective an open problem for future studies.Dataset-Wise Decomposition. In Section 3.1.1 we have introduced how to generate data with spec- ified properties. However, generating a series of such data that can eventually form a dataset with good diversity and domain coverage requires long- term scheduling. To this end, dataset-wise task decomposition dynamically adjusts the conditions used at each stage of multi-step generation to en- sure the overall dataset grows in the right direction: Dgen=k[ i=1Di. (6) Specifically, S3 (Wang et al., 2023b) targets the most frequently mislabeled categories at each iter- ation, according to the performance of the down- stream model trained on previously generated data. Similarly, Honovich et al. (2023b); Shao et al. (2023) utilize a generate-then-expand paradigm, to enhance the diversity of the overall dataset accord- ingly. Some other methods also leverage specific data structures to model the pathways of data gen- eration. For example, Explore-Instruct (Wan et al., 2023) models the domain space as a tree structure and continually refines the generated data along with tree traversal to promote both the specializa- tion and domain coverage of the generated data. 3.2 Data Curation After the preceding steps, one may excessively gen- erate overflowing and theoretically unlimited data Dgen. However, these datasets often comprise a considerable portion of noisy, worthless, or even toxic samples, which primarily stems from two causes. Firstly, LLMs can inevitably produce cor- rupted samples with incorrect labels due to the hal- lucination problem. Secondly, ineffective prompts containing ambiguous descriptions can trick the model into generating irrelevant or redundant sam- ples. Consequently, directly utilizing these low- quality data without proper processing may have a significant negative impact. To address this, plenty of data curation ap- proaches have been studied, which mainly fall into two dominant groups of high-quality sample filter- ingandlabel enhancement as elaborated below. 3.2.1 High-Quality Sample Filtering Sample filtering aims to weed out undesired low- quality samples and obtain a more helpful sub- setDcurated⊂ D gen. These methods typically de- sign heuristic criteria orre-weighting functions to rerank samples for filtering, as shown in Figure 4. 6 Page 7: Metric CalculatingUntailored SetFiltered SetRerank/Reweightselecteddiscarded Untailored SetRefined SetHuman InterventionAuxiliary Distillation Student ModelLLMsample indexstatistic feedbacklow-quality dataActive SelectionHumanRe-labeling distillHigh-quality Sample FilteringLabel EnhancementDgenDgenDcuratedDcuratedFigure 4: Two dominant approaches of data curation. Heuristic Metrics. For methods based on heuris- tic metrics, the key step is to design appropriate criteria based on the learning dynamics, such as confidence score (Seedat et al., 2023), influence function (Ye et al., 2022b), and generation abil- ity (Meng et al., 2022). SuperGen (Meng et al., 2022) employs the estimated generation probability to identify samples most related to the desired la- bel. Seedat et al. (2023) discard samples with both low confidence and low uncertainty. Some other methods assume that clean samples are prone to hold similar predictions under different conditions and employ cross-condition consistency for filter- ing. Specifically, such consistency can be between LLM and downstream classifier (Yu et al., 2023c), between multiple executions (Ye et al., 2023), or be- tween neighboring data points (Seedat et al., 2023). Chen et al. (2023b) leverage the powerful text un- derstanding capabilities of LLMs to assess the qual- ity of different samples and filter out those with low scores. Results show that Alpagasus (Chen et al., 2023b), trained on a much smaller but cu- rated dataset, surpasses the original Alpaca (Taori et al., 2023) across several benchmarks, underscor- ing the importance of data curation. Sample Re-Weighting. On the other hand, re- weighting methods believe all data are valuable but with varying importance. Thus, they assign larger weights to correctly annotated or influential sam- ples during downstream utilization (Zhang et al., 2023b; Gao et al., 2023a; Meng et al., 2023). For instance, SunGen (Gao et al., 2023a) proposes an adaptive bi-level re-weighting algorithm without human annotations. FewGen (Meng et al., 2023) designs a discriminative meta-learning objective to adjust sample weights and demarcate the nuanced differences between different labels.3.2.2 Label Enhancement Label enhancement methods strive to rectify the potentially erroneous annotations in generated sam- ples. Due to confirmation bias, it is unrealistic for LLMs to identify their own mistakes. To address this, recent works either rely on human interven- tionor incorporate a student model for human-free knowledge distillation . Human Intervention. A straightforward strat- egy for label refinery is to include human efforts to re-annotate the corrupted samples (Chung et al., 2023a; Wang et al., 2021; Pangakis et al., 2023). Wang et al. (2021) proposed to actively select samples with the lowest confidence for human re-labeling. Pangakis et al. (2023) and Liu et al. (2022a) further emphasize the importance of hu- man review and suggest comparing annotations from humans and LLMs guided by the same code- book. Despite the simplicity, these methods can lead to considerable labeling costs and can be unre- alistic in practical deployment. Auxiliary Model. To reduce the labeling cost, a more pragmatic human-free paradigm is developed which involves auxiliary student models for knowl- edge distillation and label refinery (Xiao et al., 2023; Zhao et al., 2023a; Saad-Falcon et al., 2023). These methods rely on the weakly supervised abil- ity of student models and hypothesize that a stu- dent distilled from the LLM teacher can produce superior labels. The seminal work FreeAL (Xiao et al., 2023) proposes a collaborative framework, where a student model is leveraged to distill the high-quality task-related knowledge from the weak annotations and in return feedback LLMs for label refinery. MCKD (Zhao et al., 2023a) designs a multistage distillation pipeline with data-split train- ing and cross-partition labeling to avoid overfitting on noisy labels. With the expanding abilities and availability of LLMs, the incorporation of auxiliary student models will play a more crucial role as a cost-effective alternative to human intervention. 3.3 Data Evaluation Before the employment of generated data, it is im- portant to evaluate the quality and application ef- fectiveness of the data, to ensure its value to down- stream tasks. The current mainstream evaluation methods can be roughly divided into two categories: direct andindirect , which evaluate the quality of Dgenindividually and through its effectiveness on downstream tasks, respectively. 7 Page 8: Benchmark Human Model Data Faithfulness ModelSynthetic Data Vocabulary StatisticsSample RelevanceData Diversity Benchmark EvaluationClassification QAReasoningII + III = IIIII…… Open EvaluationDirect Evaluation Indirect Evaluation Answer 1 > Answer 2Human Evaluation Answer 1 √ Answer 2 ×Model Evaluation Figure 5: Direct and indirect methods of data evaluation. 3.3.1 Direct Evaluation Data Faithfulness. Ideally, automatic evaluation of the LLMs’ generation results can be easily re- alized with ground truths from existing datasets, if available (Zhu et al., 2023). However, for open- ended data, human-based evaluation is necessitated. A straightforward idea is to provide some generated samples to human experts, who will then determine whether they are correct, according to which we can estimate the overall generation quality (Wang et al., 2023e). Theoretically, the larger the sample size, the more accurate the estimation results will be, but the labor it costs will correspondingly get higher. To this end, a reliable auxiliary model can be leveraged for a more comprehensive yet cost- effective evaluation of the generated data in replace of human experts (Chung et al., 2023b). Consider- ing that most models can only process contents of limited length, appropriate information extraction can reduce the burden of the auxiliary model and contribute to a more precise prediction of whether a sample contains factual errors (Lee et al., 2022). Data Diversity. The quantification of data diver- sity primarily employs vocabulary statistics and sample relevance calculations. V ocabulary statis- tics (Yu et al., 2023b), such as vocabulary size and N-gram frequency, provide a straightforward and intuitive approach. However, they struggle to capture the semantic information of a dataset. The calculation of sample relevance compensates for this limitation effectively. The most common measures of sample correlation are based on co- sine similarity (Wang et al., 2023b) and sample distance (Chung et al., 2023b), which can better capture the contextual and semantic diversity of the dataset. Furthermore, these metrics can alsobe leveraged to select in-context demonstrations edemo (Wang et al., 2023e) that are more dissimi- lar with the previously generated samples, thereby leading to more diversified generation results. 3.3.2 Indirect Evaluation Benchmark Evaluation. The performance of downstream models trained on the generated data can also reflect the generation quality to some ex- tent (Yu et al., 2023b; Chung et al., 2023b). Specif- ically, the impact of synthetic data can be evaluated from multiple dimensions except for the special- ized capabilities of the downstream models. For example, TruthfulQA enables the assessment of a model’s ability to identify true claims (Sun et al., 2023); NIV2 is employed to evaluate a model’s language comprehension and reasoning abilities across multiple tasks (Wang et al., 2023e). Open Evaluation. For open-ended benchmarks, evaluation by humans or auxiliary models is ne- cessitated due to the absence of standardized an- swers. To fully leverage the preference outputs of the auxiliary models, multiple evaluation strategies have been designed, such as response ranking (Xu et al., 2023a), four-level rating system (Wang et al., 2023e) and Elo scores (Bai et al., 2022). To further reduce evaluation costs, Sun et al. (2023); Xu et al. (2023a) utilize the automatic evaluation framework based on GPT-4 proposed by Vicuna for evaluation. However, general LLMs may lack enough knowl- edge for domain-specific tasks, which hinders them to provide effective evaluation (Bran et al., 2023). Therefore, collecting human assessment data to fine-tune open-source models for evaluation pur- poses is an important practice in real-world scenar- ios (He et al., 2023). Other techniques like (Peng et al., 2024, 2023) remain to be further explored. 4 Future Directions 4.1 Complex Task Decomposition Current multi-step generation algorithms depend on the model’s understanding of task requirements, requiring it to perform complex logical reason- ing with limited information. However, in real- world complex scenarios, this limited informa- tion may not adequately support effective decision- making. For instance, the generation of mathe- matical problem-solution pairs entails multiple rea- soning steps and may necessitate the utilization of calculator tools for validation. To date, there re- mains a lack of systematic investigation on how 8 Page 9: to activate the reasoning and planning capabilities of LLMs for autonomous synthetic data genera- tion. Inspired by prevalent LLMs-based agents like HuggingGPT (Shen et al., 2023) and MetaGPT (Hong et al., 2023), we believe it would also be quite valuable to develop a data generation agent for industrial applications. 4.2 Knowledge Enhancement Recent research has found that LLMs’ knowledge is long-tailed and biased (Navigli et al., 2023; Fei et al., 2023). Lacking specific domain knowledge, LLMs tend to generate biased, monotonous, and even unfaithful data. Though we have introduced how to mildly guide the data generation with task specification and conditional prompting in the pre- vious sections, such methods still hold strong lim- itations and are not conducive to scalable imple- mentation. Instead, we believe that developing automated condition controls directly on mature domain knowledge bases will significantly improve the efficiency of knowledge enhancement. For ex- ample, we can establish certain links between the LLMs and external knowledge graphs (Ji et al., 2022) or retrieve augmentation from the website (Gao et al., 2023b), which is helpful for the defini- tion, decomposition, and reasoning of data features throughout the entire generation process. Addition- ally, with enhanced domain knowledge, we may also better assess the quality of generated data or even develop automatic evaluation systems. Over- all, we believe that knowledge-driven data genera- tion will be a key focus for future studies. 4.3 Synergy between Large & Small LMs In Section 3.2, we introduced the use of small domain-specific models for data curation. In par- ticular, FreeAL (Xiao et al., 2023) has shown the feasibility of low-cost data curation with integrated collaboration between large and small models. The idea of leveraging real-time feedback provided by automated performance evaluation during the data generation process to guide the corresponding ad- justments in the following generation hints at an im- portant research direction. However, the exploita- tion of small LMs at the current stage is simply based on prediction confidence. In the future, we are looking forward to seeing more diversified col- laboration modes between large and small models to improve the quality of generated data, e.g., us- age of various output information, new design of collaborative architectures, and so on.4.4 Human-Model Collaboration Data, as the source of model intelligence, theo- retically cannot be generated completely without human intervention. Otherwise, wild synthetic data that carries noisy, toxic information can easily “poi- son” a model, even resulting in mode collapse. Due to the inherent bias of LLMs, they can hardly be self-aware of the bias in their generated data and fi- nally deviate from our intentions. Thus, designing a human-friendly interactive system to involves a few necessary human knowledge for annotation and ver- ification is vital and irreplaceable. To date, there is still a lack of a generic framework to standardize and systematize the human-machine collaboration involved in the data production process. We believe that an appropriate design of such a system must be based on a thorough understanding of the strengths and limitations of human interven- tion, and should follow the human-centered prin- ciple. To achieve sustainable and efficient human involvement, we need comprehensive considera- tion of various factors such as feasibility, cost, and even labor psychology. For specific examples: (i)- readability and interpretability of the information provided by the LLMs should be ensured to reduce obstacles to human understanding; (ii)-upstream knowledge enrichment or filtering should be carried out to improve the efficiency of human resource uti- lization and reduce consumption on tasks with low cost-effectiveness; (iii)-incorporating enjoyable in- teractive features can not only mitigate the negative impact of mechanical data processing tasks on hu- mans but also attract a broader audience. 5 Conclusion In this paper, we present a systematic review of ad- vancements in synthetic data generation propelled by Large Language Models (LLMs). We aim to offer guidance to enterprises and organizations on effectively building their domain-specific datasets using LLMs. In the meantime, we endeavor to provide insights into the challenges and opportuni- ties within this field, while also proposing potential directions for future research. We hope that our work can promote the rapid production of large amounts of data in various fields and push the lim- its of data-centric AI. We also envision a fantastic future, where an LLMs community, endowed with human-like abilities such as bionics and communi- cation, may be constructed to generate data for its own self-improvement. 9 Page 10: Limitations In this paper, we survey existing studies on LLMs- driven synthetic data generation, curation, and evaluation, proposing a generic workflow for real- world practice. Synthetic data generation is a broad topic that involves data and models of various modals, including vision and speech. Due to the page limit, we mainly focus on the objective of text data and LLMs-driven approaches, while leav- ing investigations in other fields for future work. We will also keep paying attention to the latest work and add more related approaches with more detailed analysis. Ethics Statement We believe that our proposed workflow of LLMs- driven synthetic data generation, curation, and eval- uation can benefit both researchers who are inter- ested in data-centric AI and industrial producers who are facing data problems. However, the mali- cious use of such synthetic data also raises ethical concerns that should arouse our vigilance. Acknowledgements This work is supported by the Pioneer R&D Pro- gram of Zhejiang (No. 2024C01035), NSFC un- der Grants (No. 62206247), and the Fundamental Research Funds for the Central Universities (No. 226-2024-00049). References Tiago A. Almeida, José María Gómez Hidalgo, and Akebo Yamakami. 2011. Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 2011 ACM Symposium on Docu- ment Engineering, Mountain View, CA, USA, Septem- ber 19-22, 2011 . Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christo- pher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott John- ston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Con- erly, Tom Henighan, Tristan Hume, Samuel R. Bow- man, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, andJared Kaplan. 2022. Constitutional AI: harmlessness from AI feedback. CoRR , abs/2212.08073. Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen- liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V . Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hal- lucination, and interactivity. CoRR , abs/2302.04023. Parikshit Bansal and Amit Sharma. 2023. Large lan- guage models as annotators: Enhancing general- ization of NLP models at minimal cost. CoRR , abs/2306.15766. Max Bartolo, Alastair Roberts, Johannes Welbl, Sebas- tian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for read- ing comprehension. Transactions of the Association for Computational Linguistics , 8:662–678. Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. 2023. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376 . Iñigo Casanueva, Tadas Tem ˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli ´c. 2020. Efficient intent detection with dual sentence encoders. In Pro- ceedings of the 2nd Workshop on Natural Language Processing for Conversational AI , pages 38–45, On- line. Association for Computational Linguistics. Derek Chen, Celine Lee, Yunan Lu, Domenic Rosati, and Zhou Yu. 2023a. Mixture of soft prompts for con- trollable data generation. In EMNLP , pages 14815– 14833. Association for Computational Linguistics. Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srini- vasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023b. Alpagasus: Training A better alpaca with fewer data. CoRR , abs/2307.08701. John Chung, Ece Kamar, and Saleema Amershi. 2023a. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. In Proceedings of the 61st An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 575–593, Toronto, Canada. Association for Computational Lin- guistics. John Joon Young Chung, Ece Kamar, and Saleema Amershi. 2023b. Increasing diversity while main- taining accuracy: Text data generation with large language models and human interventions. In ACL, pages 575–593. Association for Computational Lin- guistics. Wanyun Cui and Qianle Wang. 2023. Ada-instruct: Adapting instruction generators for complex reason- ing. CoRR , abs/2310.04484. 10 Page 11: Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emo- tions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 4040–4054, Online. Association for Computational Linguistics. Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023a. Is GPT-3 a good data annotator? In Proceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 11173–11195, Toronto, Canada. Association for Computational Linguistics. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023b. Enhancing chat language mod- els by scaling high-quality instructional conversa- tions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 3029–3051, Singapore. Association for Com- putational Linguistics. Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english? CoRR , abs/2305.07759. Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learn- ing. In ACL, pages 14014–14031. Association for Computational Linguistics. Shangbin Feng, Vidhisha Balachandran, Yuyang Bai, and Yulia Tsvetkov. 2023. FactKB: Generaliz- able factuality evaluation using language models en- hanced with factual knowledge. In Proceedings of the 2023 Conference on Empirical Methods in Natu- ral Language Processing , pages 933–952, Singapore. Association for Computational Linguistics. Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tong- shuang Wu, and Graham Neubig. 2024. Better syn- thetic data by retrieving and transforming existing datasets. CoRR , abs/2404.14361. Jiahui Gao, Renjie Pi, Yong Lin, Hang Xu, Jiacheng Ye, Zhiyong Wu, Weizhong Zhang, Xiaodan Liang, Zhenguo Li, and Lingpeng Kong. 2023a. Self-guided noise-free data generation for efficient zero-shot learning. In ICLR . OpenReview.net. Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019. FewRel 2.0: Towards more challenging few-shot relation classi- fication. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natu- ral Language Processing (EMNLP-IJCNLP) , pages 6250–6255, Hong Kong, China. Association for Com- putational Linguistics. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo,Meng Wang, and Haofen Wang. 2023b. Retrieval- augmented generation for large language models: A survey. CoRR , abs/2312.10997. Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text- annotation tasks. CoRR , abs/2303.15056. Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks are all you need. CoRR , abs/2306.11644. Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. CoRR , abs/2301.07597. Kelvin Han and Claire Gardent. 2023. Multilingual gen- eration and answering of questions from texts and knowledge graphs. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 13740–13756, Singapore. Association for Computa- tional Linguistics. Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal LLM for chart understand- ing and generation. CoRR , abs/2311.16483. Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 3309–3326, Dublin, Ireland. Association for Computational Linguistics. Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. 2023. Annollm: Making large language models to be better crowdsourced annotators. CoRR , abs/2303.16854. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proc. of NeurIPS . Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. 2023. Metagpt: Meta programming for multi-agent collaborative framework. CoRR , abs/2308.00352. Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023a. Unnatural instructions: Tuning lan- guage models with (almost) no human labor. In 11 Page 12: Proceedings of the 61st Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers) , pages 14409–14428, Toronto, Canada. Association for Computational Linguistics. Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023b. Unnatural instructions: Tuning lan- guage models with (almost) no human labor. In ACL, pages 14409–14428. Association for Computational Linguistics. Tom Hosking, Phil Blunsom, and Max Bartolo. 2023. Human feedback is not gold standard. CoRR , abs/2309.16349. Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee- Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. 2023. LLM-adapters: An adapter fam- ily for parameter-efficient fine-tuning of large lan- guage models. In Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing , pages 5254–5276, Singapore. Association for Computational Linguistics. Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In EMNLP , pages 1051–1068. Association for Computational Linguis- tics. Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Mart- tinen, and Philip S. Yu. 2022. A survey on knowl- edge graphs: Representation, acquisition, and appli- cations. IEEE Trans. Neural Networks Learn. Syst. , 33(2):494–514. Martin Josifoski, Marija Sakota, Maxime Peyrard, and Robert West. 2023. Exploiting asymmetry for syn- thetic training data generation: Synthie and the case of information extraction. In EMNLP , pages 1555– 1574. Association for Computational Linguistics. Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. 2023. Prometheus: Inducing fine- grained evaluation capability in language models. CoRR , abs/2310.08491. Su Young Kim, Hyeon-Jin Park, Kyuyong Shin, and Kyung-Min Kim. 2022. Ask me what you need: Product retrieval using knowledge from GPT-3. CoRR , abs/2207.02516. Jan Kocon, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydlo, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocon, Bartlomiej Koptyra, Wik- toria Mieleszczenko-Kowszewicz, Piotr Milkowski, Marcin Oleksy, Maciej Piasecki, Lukasz Radlinski, Konrad Wojtasik, Stanislaw Wozniak, and Przemys- law Kazienko. 2023. Chatgpt: Jack of all trades, master of none. Inf. Fusion , 99:101861.Anastasia Kritharoula, Maria Lymperaiou, and Giorgos Stamou. 2023. Large language models and multi- modal retrieval for visual word sense disambiguation. InProceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing , pages 13053–13077, Singapore. Association for Computa- tional Linguistics. Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, and Andreas Terzis. 2023. Harnessing large-language models to generate private synthetic text. arXiv preprint arXiv:2306.01684 . Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, and Jason Mars. 2019. An evaluation dataset for intent classification and out-of- scope prediction. In Proc. of EMNLP . Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pas- cale N Fung, Mohammad Shoeybi, and Bryan Catan- zaro. 2022. Factuality enhanced language models for open-ended text generation. NeurIPS . Bryan Li and Chris Callison-Burch. 2023. PAXQA: Generating cross-lingual question answering exam- ples at training scale. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 439–454, Singapore. Association for Computational Linguistics. Junlong Li, Zhuosheng Zhang, and Hai Zhao. 2022. Self-prompting large language models for open- domain QA. CoRR , abs/2212.08635. Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang. 2023a. CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 1487–1505, Singapore. Association for Computational Linguistics. Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023b. Textbooks are all you need II: phi-1.5 technical report. CoRR , abs/2309.05463. Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. 2023c. Synthetic data generation with large language models for text classification: Potential and limita- tions. In EMNLP , pages 10443–10461. Association for Computational Linguistics. Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics. Alisa Liu, Swabha Swayamdipta, Noah A. Smith, and Yejin Choi. 2022a. WANLI: Worker and AI collabo- ration for natural language inference dataset creation. 12 Page 13: InFindings of the Association for Computational Linguistics: EMNLP 2022 , pages 6826–6847, Abu Dhabi, United Arab Emirates. Association for Com- putational Linguistics. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023. Improved baselines with visual instruc- tion tuning. CoRR , abs/2310.03744. Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022b. What makes good in-context examples for gpt-3? In Dee- LIO@ACL , pages 100–114. Association for Compu- tational Linguistics. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. 2024. Best practices and lessons learned on synthetic data for language models. CoRR , abs/2404.07503. Yuzhe Lu, Sungmin Hong, Yash Shah, and Panpan Xu. 2023. Effectively fine-tune to improve large multi- modal models for radiology report generation. CoRR , abs/2312.01504. Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian- guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023a. Wiz- ardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. CoRR , abs/2308.09583. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qing- wei Lin, and Daxin Jiang. 2023b. Wizardcoder: Empowering code large language models with evol- instruct. CoRR , abs/2306.08568. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y . Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , pages 142–150, Portland, Oregon, USA. Association for Computational Lin- guistics. Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Generating training data with language mod- els: Towards zero-shot language understanding. In NeurIPS . Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, and Jiawei Han. 2023. Tun- ing language models as training data generators for augmentation-enhanced few-shot learning. In ICML , pages 24457–24477. PMLR. Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andrés Codas, Clarisse Simões, Sahaj Agrawal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Ag- garwal, Hamid Palangi, Guoqing Zheng, Corby Ros- set, Hamed Khanpour, and Ahmed Awadallah. 2023. Orca 2: Teaching small language models how to rea- son. CoRR , abs/2311.11045.Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawa- har, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of GPT-4. CoRR , abs/2306.02707. Roberto Navigli, Simone Conia, and Björn Ross. 2023. Biases in large language models: Origins, inventory, and discussion. ACM J. Data Inf. Qual. , 15(2):10:1– 10:21. Seokjin Oh, Su Ah Lee, and Woohwan Jung. 2023. Data augmentation for neural machine translation using generative language model. CoRR , abs/2307.16833. Nicholas Pangakis, Samuel Wolken, and Neil Fasching. 2023. Automated annotation with generative AI re- quires validation. CoRR , abs/2306.00176. Ru Peng, Qiuyang Duan, Haobo Wang, Jiachen Ma, Yanbo Jiang, Yongjun Tu, Xiu Jiang, and Junbo Zhao. 2023. Came: Contrastive automated model eval- uation. In Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision , pages 20121– 20132. Ru Peng, Heming Zou, Haobo Wang, Yawen Zeng, Zenan Huang, and Junbo Zhao. 2024. Energy- based automated model evaluation. arXiv preprint arXiv:2401.12689 . Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. CoRR . Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Sultan, and Christopher Potts. 2023. UDAPDR: Unsupervised domain adaptation via LLM prompting and distillation of rerankers. In Pro- ceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing , pages 11265– 11279, Singapore. Association for Computational Linguistics. Gaurav Sahu, Pau Rodriguez, Issam Laradji, Parmida Atighehchian, David Vazquez, and Dzmitry Bah- danau. 2022. Data augmentation for intent classi- fication with off-the-shelf large language models. In Proceedings of the 4th Workshop on NLP for Conver- sational AI , pages 47–57, Dublin, Ireland. Associa- tion for Computational Linguistics. Nabeel Seedat, Nicolas Huynh, Boris van Breugel, and Mihaela van der Schaar. 2023. Curated llm: Synergy of llms and data curation for tabular augmentation in ultra low-data regimes. 13 Page 14: Zhihong Shao, Yeyun Gong, Yelong Shen, Min- lie Huang, Nan Duan, and Weizhu Chen. 2023. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. In ICML , volume 202 of Proceedings of Machine Learning Research , pages 30706–30775. PMLR. Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugging- gpt: Solving AI tasks with chatgpt and its friends in huggingface. CoRR , abs/2303.17580. Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Pe- ter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin F. Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Ke- nealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Con- stant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel. 2023. Beyond human data: Scaling self-training for problem-solving with language models. CoRR , abs/2312.06585. Ryan Smith, Jason A. Fries, Braden Hancock, and Stephen H. Bach. 2022. Language models in the loop: Incorporating prompting into weak supervision. CoRR , abs/2205.02318. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par- rish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Anto- nio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Her- rick, Avia Efrat, Aykut Erdem, Ayla Karakas, and et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR . Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2023. Selective annotation makes language models better few-shot learners. In ICLR . OpenReview.net. Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D. Cox, and Akash Srivas- tava. 2024. LAB: large-scale alignment for chatbots. CoRR , abs/2403.01081.Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self- alignment of language models from scratch with min- imal human supervision. CoRR , abs/2305.03047. Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic data generation of llms help clinical text mining? CoRR , abs/2303.04360. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https:// github.com/tatsu-lab/stanford_alpaca . Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, and Shuming Shi. 2023. Explore-instruct: Enhancing domain-specific instruction coverage through active exploration. In EMNLP , pages 9435– 9454. Association for Computational Linguistics. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for nat- ural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP , pages 353–355, Brussels, Belgium. Association for Com- putational Linguistics. Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2023a. Openchat: Advanc- ing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235 . Ruida Wang, Wangchunshu Zhou, and Mrinmaya Sachan. 2023b. Let’s synthesize step by step: It- erative dataset synthesis with large language mod- els by extrapolating errors from small models. In EMNLP (Findings) , pages 11817–11831. Associa- tion for Computational Linguistics. Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce la- beling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021 , pages 4195–4205, Punta Cana, Dominican Re- public. Association for Computational Linguistics. Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xingxu Xie, Wei Ye, Shi-Bo Zhang, and Yue Zhang. 2023c. Pandalm: An automatic evaluation benchmark for llm instruction tuning opti- mization. ArXiv . Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023d. Self-instruct: Aligning language models with self-generated instructions. In Proceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 13484–13508, Toronto, Canada. Association for Computational Linguistics. 14 Page 15: Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023e. Self-instruct: Aligning language models with self-generated instructions. In ACL, pages 13484–13508. Association for Computational Linguistics. Yizhong Wang, Swaroop Mishra, Pegah Alipoormo- labashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Puro- hit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generaliza- tion via declarative instructions on 1600+ NLP tasks. InProceedings of the 2022 Conference on Empiri- cal Methods in Natural Language Processing , pages 5085–5109, Abu Dhabi, United Arab Emirates. As- sociation for Computational Linguistics. Fusheng Wei, Robert Keeling, Nathaniel Huber-Fliflet, Jianping Zhang, Adam Dabrowski, Jingchao Yang, Qiang Mao, and Han Qin. 2023a. Empirical study of LLM fine-tuning for text classification in legal document review. In IEEE Big Data , pages 2786– 2792. IEEE. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. 2022. Chain-of-thought prompt- ing elicits reasoning in large language models. In NeurIPS . Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V . Le. 2024. Long- form factuality in large language models. CoRR , abs/2403.18802. Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023b. Magicoder: Source code is all you need. CoRR , abs/2312.02120. Le Xiao and Xiaolin Chen. 2023. Enhancing LLM with evolutionary fine tuning for news summary genera- tion. CoRR , abs/2307.02839. Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, and Haobo Wang. 2023. FreeAL: Towards human-free active learning in the era of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 14520–14535, Singa- pore. Association for Computational Linguistics. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023a. Wizardlm: Empowering large lan- guage models to follow complex instructions. CoRR , abs/2304.12244.Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce C. Ho, and Carl J. Yang. 2023b. Knowledge-infused prompting: As- sessing and advancing clinical text data generation with large language models. CoRR , abs/2311.00287. Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiang- tao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022a. ZeroGen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing , pages 11653–11669, Abu Dhabi, United Arab Emirates. Association for Computational Lin- guistics. Jiacheng Ye, Jiahui Gao, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. 2022b. ProGen: Pro- gressive zero-shot dataset generation via in-context feedback. In Findings of the Association for Com- putational Linguistics: EMNLP 2022 , pages 3671– 3683, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Jiacheng Ye, Chengzu Li, Lingpeng Kong, and Tao Yu. 2023. Generating data for symbolic language with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 8418–8443, Singapore. Association for Computational Linguistics. Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woo-Myoung Park. 2021. Gpt3mix: Lever- aging large-scale language models for text augmenta- tion. In EMNLP , pages 2225–2239. Association for Computational Linguistics. Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023a. Meta- math: Bootstrap your own mathematical questions for large language models. CoRR , abs/2309.12284. Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2023b. Large language model as attributed training data generator: A tale of diversity and bias. CoRR , abs/2306.15895. Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, and Chao Zhang. 2023c. ReGen: Zero-shot text classification via training data genera- tion with progressive dense retrieval. In Findings of the Association for Computational Linguistics: ACL 2023 , pages 11782–11805, Toronto, Canada. Associ- ation for Computational Linguistics. Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Ku- mar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Lu- ang Huy, Dong Uk Kim, Sung-Ho Bae, Lik-Hang Lee, Yang Yang, Heng Tao Shen, In So Kweon, and Choong Seon Hong. 2023a. A complete survey on generative AI (AIGC): is chatgpt from GPT-4 to GPT- 5 all you need? CoRR , abs/2303.11717. 15 Page 16: Jieyu Zhang, Bohan Wang, Xiangchen Song, Yujing Wang, Yaming Yang, Jing Bai, and Alexander Rat- ner. 2022. Creating training sets via weak indirect supervision. In ICLR . OpenReview.net. Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, and Lei Zou. 2023b. LLMaAA: Making large lan- guage models as active annotators. In Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 13088–13103, Singapore. Association for Computational Linguistics. Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text clas- sification. In Proc. of NeurIPS . Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Ben- jamin Rozonoyer, Md. Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, and Andrew McCallum. 2023a. Multi- stage collaborative knowledge distillation from large language models. CoRR , abs/2311.08640. Zilong Zhao, Robert Birke, and Lydia Chen. 2023b. Tabula: Harnessing language models for tabular data synthesis. arXiv preprint arXiv:2310.12746 . Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Yiming Zhu, Peixian Zhang, Ehsan ul Haq, Pan Hui, and Gareth Tyson. 2023. Can chatgpt reproduce human- generated labels? A study of social computing tasks. CoRR , abs/2304.10145. A Data Annotation In the main text, we introduced a series of tech- niques for general data synthesis. Though annota- tion can be considered a special type of synthesis with the input of a particular sample as the synthe- sis condition, there are also approaches specifically suitable for data annotation. Among them, selective annotation is one of the most important practices. Selective annotation represents an optimal tradeoff between expensive and precise human annotation and economic but relatively rough LLMs-based annotation(Wang et al., 2021; Kocon et al., 2023). The key to selective annotation is to define a "cost-effective" sample distribution between hu- mans and LLMs. (Zhang et al., 2023b; Bansal and Sharma, 2023) covers some common selec- tion strategies for LLMs-based annotation, includ- ing random selection, maximum entropy selec- tion, least confidence selection and kmeans se- lection for thorough comparisons. Results show that uncertainty-based methods, i.e. maximal en- tropy and least confidence, perform significantlybetter than the random baseline, with faster conver- gence and better performance of the downstream model trained on the annotated data. (Li et al., 2023a) also utilizes uncertainty to estimate LLMs’ annotation capability to effectively allocate the an- notation work among humans and LLMs. (Su et al., 2023) instead proposes a novel unsupervised, graph-based selective annotation method named vote-k, to select diverse and representative exam- ples to annotate. B Tuning Techniques Another large body of research pertains to the tun- ing techniques , such as model fine-tuning (Zhao et al., 2023b; Sun et al., 2023; Meng et al., 2023; Kurakin et al., 2023) and soft prompting (Chen et al., 2023a), which have already been heavily studied in other fields and can be detailedly re- ferred in (Hu et al., 2023; Lu et al., 2023; Wei et al., 2023a; Xiao and Chen, 2023). Despite their effec- tiveness in improving the generation performance, most of the existing approaches are established on the accessibility of the LLMs, while their appli- cation on black-box models remains to be further explored. C Applications LLM-driven synthetic data generation has served as a new alternative to traditional human-dependent data collection and demonstrated great potential in various applications, including general tasks, domain-specific tasks, and multimodal tasks. Generic Tasks. With the exploding capabilities of LLMs, this generation pipeline has been adopted in a wide range of basic NLP studies, including text classification (Ye et al., 2022b; Yu et al., 2023c; Sahu et al., 2022), named entity recognition (Xiao et al., 2023), question answering (Li and Callison- Burch, 2023), relationship extraction (He et al., 2023), and natural language inference (Zhang et al., 2023b). These studies further underpin diverse applications, such as sentiment recognition (Gao et al., 2023a; Ye et al., 2022b), online translation (Oh et al., 2023), stance detection (Li et al., 2023a) and spam identification (Smith et al., 2022). Domain-specific Tasks. Some domain-specific tasks also impose significant demands on this pipeline, where human annotation can be extremely expensive and impractical, such as medical diagno- sis (Tang et al., 2023), drug discovery (Xiao et al., 16 Page 17: 2023), clinical trial extraction (Xu et al., 2023b), industrial advertisement (Zhang et al., 2022) and tabular data analysis (Seedat et al., 2023). Multimodal Tasks. Stemming from the simplic- ity and low cost, this generation paradigm has also exhibited significant promise in multimodal tasks, including text-image retrieval (Kritharoula et al., 2023), chat understanding (Han et al., 2023), visual question answering (Han and Gardent, 2023), and multimodal instruction tuning (Liu et al., 2023). D Benchmark Datasets In Table 1, we summarize representative bench- mark datasets for evaluating models trained through data generation. Among them, ToolBench (Qin et al., 2023) is generated by LLMs and is com- monly employed to evaluate the performance of LLMs in tool usage proficiency. In most classifica- tion task evaluations (Li et al., 2023c; Wang et al., 2023b; Sahu et al., 2022), LLMs are infrequently used as test models; instead, small language models trained on generated data are often used, followed by testing on existing benchmarks. 17 Page 18: Type Benchmark Dataset Subdataset Quantity Partial Subdataset Task Ability Domain/Data Source ClassificationSMS spam (Almeida et al., 2011; Li et al., 2023c) 1 SMS spam Text Classification Spam Detection SMS AG News (Zhang et al., 2015; Li et al., 2023c) 1 AG News Text Classification Topic Classification News IMDb (Maas et al., 2011; Li et al., 2023c; Wang et al., 2023b) 1 IMDb Text Classification Binary Sentiment Classification Review GoEmotions (Demszky et al., 2020; Li et al., 2023c) 1 GoEmotions Text Classification Sentiment Classification Reddit Comments CLINC150 (Larson et al., 2019; Sahu et al., 2022) 1 CLINC150 Text Classification Intent Detection Human Annotation BANKING77 (Casanueva et al., 2020; Sahu et al., 2022) 1 BANKING77 Text Classification Intent Detection Bank FewRel (Gao et al., 2019; Li et al., 2023c) 1 FewRel Text Classification Relation Classification Wikipedia GLUE (Wang et al., 2018, 2023b) 7QNLI Natural Language Inference Recognizing Textual Entailment Wikipedia RTE Natural Language Inference Recognizing Textual Entailment News and Wikipedia QAAdversarialQA (Bartolo et al., 2020; Wang et al., 2023b) 1 AdversarialQA Question Answering Reading Comprehension Wikipedia TruthfulQA (Lin et al., 2022; Sun et al., 2023) 1 TruthfulQA Question Answering Honestness Hard Data ReasoningMATH (Hendrycks et al., 2021; Wan et al., 2023) 1 MATH mathematical reasoning Complex Reasoning Math ToolBench (Qin et al., 2023) 1 ToolBench Trajectory Planning Tool manipulation Tool - NIV2 (Wang et al., 2022, 2023e) 1616 - - Language Understanding & Reasoning Benchmark Collection/Human Annotation - BIG-bench (Srivastava et al., 2022; Sun et al., 2023) 204 - - Language Understanding & Reasoning Human Annotation Table 1: Representative benchmark dataset for assessing models trained with generated data. The dataset generated based on LLM is highlighted in bold. 18

---