loader
Generating audio...

arxiv

Paper 2503.05070

PromptPex: Automatic Test Generation for Language Model Prompts

Authors: Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Benjamin Zorn

Published: 2025-03-07

Abstract:

Large language models (LLMs) are being used in many applications and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by seeing if they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex.

Paper Content:
Page 1: PromptPex: Automatic Test Generation for Language Model Prompts RESHABH K SHARMA∗,University of Washington, USA JONATHAN DE HALLEUX, Microsoft Research, USA SHRADDHA BARKE, Microsoft Research, USA BENJAMIN ZORN, Microsoft Research, USA Large language models (LLMs) are being used in many applications and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex , an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by seeing if they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex. 1 INTRODUCTION 1.1 Motivation Large language models (LLMs) are being used in many applications beyond chatbots and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function [ 31]. They are also often part of complex chain of control flow that combines LLM-driven prompts and regular code supported by popular frameworks such as langchain [7]. Such prompts will become an integral part of many code bases in the future because they leverage the power of AI models and can perform common tasks such as summarization, classification, and evaluation that traditional non-AI software is unable to do. Furthermore, as AI models continue to diversify and become more efficient and effective, software that leverages them will benefit. While prompts are becoming a key element of software code bases, they have both similarities and differences from traditional software. Similarities include taking input, generating output, performing a transformation much like an ordinary function. But differences exist that create major new software engineering challenges. First, the output of a prompt is inherently non-deterministic due to the nature of the underlying AI model and inference engine that interprets it. Significant effort has been invested in ensuring the model output conforms, at least syntactically, to a given ∗Work done while at Microsoft Research. Authors’ addresses: Reshabh K Sharma, University of Washington, Seattle, Washington, USA, reshabh@cs .washington .edu; Jonathan de Halleux, Microsoft Research, Seattle, Washington, USA, jhalleux@microsoft .com; Shraddha Barke, Microsoft Research, Seattle, Washington, USA, sbarke@microsoft .com; Benjamin Zorn, Microsoft Research, Seattle, Washington, USA, Ben .Zorn@microsoft .com.arXiv:2503.05070v1 [cs.SE] 7 Mar 2025 Page 2: 2 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn sentence: Have you seen their new house yet? word: theirIn this task, you will be presented with a sentence and a word contained in that sentence. You have to determine the part of speech for a given word and return just the tag for the word's part of speech. Return only the part of speech tag. If the word cannot be tagged with the listed tags, return Unknown. If you are unable to tag the word, return CantAnswer.gpt-4o-minigemma2:9bqwen2.5:3bllama3.2:1b++++PRP$PRP$their NN* IN:Determiner.* Their InputSystem PromptModelsOutputs Fig. 1. Example illustrating that the use of PromptPex to test a given prompt against different models. For a given prompt (labeled System Prompt) and PromptPex -generated test input (on the left), the resulting output differs significantly depending on what AI model is used to interpret it. PromptPex automatically generates test input based on the prompt and evaluates whether the output is compliant with what the prompt specifies. Because the prompt specifies "Return only the part of speech tag" the lower two models produced non-compliant output. specification [ 37]. Also, while natural language prompts are easy for non-programmers to write, the effect of small changes to a prompt are unpredictable, leading to challenges making robust prompt edits. Second, unlike traditional software, where the behavior of a function depends only on the well- defined specification of the hardware that it is compiled to, the output of a prompt depends on the model that interprets it. As a result, if the AI model changes, the result of the prompt may change, sometimes dramatically [ 29,32]. Application developers have strong motivations to change the underlying AI model their application uses because new models are being developed and released that are more efficient, more capable, able to run locally, available as open source, etc. As a result, it is critical that an application developer can quickly understand the implication of changing the underlying model on the behavior of a specific prompt. (Figure 1 illustrates a concrete example of this). 1.2 Our Solution: PromptPex While many new software engineering practices will need to be adapted to build robust AI software1, in this paper we focus on helping developers test and evaluate individual prompts. Specifically our tool, PromptPex , takes a prompt as input and automatically generates and evaluates test cases that explore whether a given model interprets the prompt as directed. PromptPex was inspired by work on Program Exploration (Pex [51]), now shipping as IntelliTest in Microsoft Visual Studio [35]2. PromptPex uses an LLM to extract explicit specifications from an input prompt (the Prompt Under Test, or PUT ) that capture the user intent in simple and concrete terms. A key element of our approach is to extract a projection of the prompt as a set of independent, concrete, checkable output rules (OR) that are then used to create targeted tests. Many prompts used in commercial applications have natural language statements that express such rules. For example, a prompt might contain phrases like "Ensure that..." or "The output must ..." that translate directly into our output rules. From this extracted specification, we use an LLM to generate test cases focused on exploring whether the PUT with a given model ( the Model Under Test, or MUT ) adheres to the specification. We then run our test cases with different MUTs to generate model-specific outputs for a given test case, as shown in Figure 1. Because the tests were generated with the specification in mind, we can then 1In this paper, we refer to AI software as any software that uses a generative AI model at runtime. 2Pex automatically generates unit tests for .Net applications using dynamic symbolic execution. Page 3: PromptPex: Automatic Test Generation for Language Model Prompts 3 automatically evaluate whether a given output from a MUT complies with the specification. While PromptPex cannot be used to automatically generate test cases that explore the full functionality of a given PUT, it is still valuable in creating test cases the break the requirements of our extracted specification. Details of how we extract the specifications and generate tests are provided in Section 3. Because it is fully automated, PromptPex is both easy to use and provides immediate insights into potential issues that can arise from the language in the PUT, the MUT, or the combination of the two. To evaluate our approach, we collected a suite of eight benchmark PUTs, created tests using both PromptPex and a baseline LLM-based test generator, ran each test with four MUTs, and evaluated the tests using an LLM to determine if the outputs were compliant with requirements specified in the PUT. We consider the generation of tests that are non-compliant as more successful because a test that causes a model to generate a non-compliant output indicates a problem that should be addressed. Our results show that PromptPex consistently generates more non-compliant tests than our baseline test generator and also clearly distinguishes relative model capabilities for the given prompt. A full explanation of our experiments and methods is available in Section 4. 1.3 Contributions Our contributions are as follows: •Systematic and Automated Test Generation for AI Model Prompts: While prior work has focused on optimizing a prompt for a given model (e.g., [ 39]) and generating unit test cases for programs [ 25,36,56,64], our work is the first to focus on the specific problem of automated test generation for prompts and creating tests that allow a developer to understand the behavior of their prompt across multiple models. •Specification Extraction from Prompts for Testing: To generate effective tests, we define a new approach to extract an input specification and output rules that capture targeted properties of prompts. These generated artifacts can be used both in automatically generating tests and in helping the prompt developer test, refine, and migrate their prompts to new models. •Evaluation of Test Generation and Model Compliance with Specifications: We measure the effectiveness of PromptPex using a benchmark suite of eight benchmark prompts running the generated tests on four diverse AI models. We compare with a sophisticated baseline LLM-based test generator and show that our approach extracting a specification from the input prompt results in tests that are more likely to cause non-compliant outputs across all the models tested. 2 MOTIVATING EXAMPLE Chatbots, for example for customer support, are a major application of LLMs where natural language prompts describe the bot’s behavior. These descriptions, or meta/system prompts, specify the role and other properties of the chatbot. They limit the domain of input and restrict the possible outputs that can be generated. Using prompts to configure a chatbot makes customization and modification easy, contributing to their widespread success. These prompts are designed to directly interact with user input, engage in conversations, and provide information to users. The prompts we focus on, which are used as software artifacts, share some similarities but also have distinct features from chatbot prompts. First, prompts used in codebases act like programs, of- ten with well-defined inputs and outputs. This may include both syntactic and semantic restrictions. While LLMs are generally good at handling inputs that do not adhere to format, the output must still be well-formed for other parts of the codebase, which may be traditional, to handle it effectively. Page 4: 4 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn These prompts exhibit constructs similar to various programming constructs, such as performing an early return for particular input, similar to an if-then-return in traditional programming. They can also include features like complex control flow, multiple returns, assertions, and constraints. All these features are described in natural language and can be easily modified or updated. Unlike in chatbot applications, where users can often provide any input and the chatbot needs to handle it, the input domain of a prompt embedded inside a codebase is more narrowly defined. The prompts embedded in software expect well-defined input. The constraint over the input can be explicitly stated within the prompt itself to filter out invalid input or be implicit in the codebase so that invalid input is never sent to the prompt. These prompts do not operate in isolation. They are part of complex logic pipelines where the output from a prompt or a traditional program can be fed into one another. These pipelines can be thoroughly layered, made up of multiple prompts and programs processing data to create a single response. The input to a prompt may be preprocessed to optimize efficiency or to filter out invalid data. Likewise, the output can be validated, and if necessary, the prompt can be executed again with the same input for generating a well-formed response that can be correctly parsed by dependent components [ 14,15]. The main advantage of using prompts embedded in software is their capability to perform tasks on natural language inputs, which is not possible with existing traditional programs. The prompts that are part of codebases are rich in program-like constructs. In Appendix B.6, we describe one such prompt for extracting important entities from the text and returning them in a structured list format. Due to the complexity of that prompt, we use a simpler prompt as our running example as shown in Figure 2. Although this prompt is simple, it includes program-like constructs similar to those found in complex prompts and is representative in highlighting the challenges faced by the prompts being used in traditional software pipelines. 2.1 Prompt Under Test ( PUT ) The prompt in Figure 2 is the part-of-speech classification (POS) prompt3, which classifies a word into a part-of-speech tag. We have truncated the list of speech tags with their description for brevity. The complete prompt is available in Appendix B.1. In this task, you will be presented with asentence andaword contained inthat sentence. You have to determine thepart ofspeech foragiven word and return just the tag for the word’s part of speech. Return only the part of speech tag. Iftheword cannot be tagged with the listed tags, return Unknown. Ifyouare unable to tag the word, return CantAnswer. Here is the alphabetical listofpart-of-speech tags used in this task: CC: Coordinating conjunction, CD: Cardinal number, ... Fig. 2. Part-of-Speech Prompt The POS prompt has the following program-like constructs: •Input: It takes a sentence and a word as input, specifying that the word must be present in the sentence. •Output: It defines the output as the part-of-speech tag, Unknown , orCantAnswer . •Com putation: It describes how the output should be computed, which in this case is simply tagging the word as a part-of-speech tag. 3A modified version of the prompt from [40]. Page 5: PromptPex: Automatic Test Generation for Language Model Prompts 5 •Controlflow: There are multiple if-then constructs in the prompt, similar to those found in code. •Early return: Like in programs where the code terminates early, returning an error code, prompts also have multiple early returns to handle corner cases. •Assertions andconstraints: Like in programs, various assertions and constraints can be defined directly in the prompt. However, they need not be explicitly implemented. For example, the tag must be from the list of tags provided. These prompts function like programs, featuring well-defined input and output specifications. For instance, when given the input quick brown fox jumps over the lazy dog; quick , the output generated is JJ. This input might originate from another prompt or program, and similarly, the output can be passed on to another component. While the part-of-speech prompt may align with what the prompt developer intended, ensuring that the LLM interpreting the prompt can precisely understand the intent and constraints remains a significant challenge. Developing these prompts is demanding because they require more rigorous testing than traditional software due to the vast potential input domain, even with constraints in place. Moreover, the model’s behavior can vary unpredictably based on the input, which can be neither fully understandable nor predictable. The inherent ambiguity of natural language further complicates efforts to manage these challenges effectively. LLMs often struggle to consistently follow provided instructions, but these instructions can be made more precise to accurately match the intended purpose, thereby minimizing unintended consequences. As models improve their ability to follow instructions, the alignment between the prompt developer’s intent and the model’s actual execution behavior will become increasingly critical. As an example of the challenges in writing an effective prompt, the POS prompt above sometimes generates a part-of-speech tag along with a description of the reasoning steps taken to arrive at that decision. This issue sometimes occurs even with gpt-4o , a state-of-the-art (SOTA) model. We observed this behavior due to ambiguity in what is allowed as output. To understand the ambiguity, note that the prompt requires a part-of-speech tag (e.g., NN) and not the word describing the part of speech (e.g., noun). Specifically, the prompt says "Return only the part of speech tag." The ambiguity arises because sometimes a model interprets this rule as applying only to tag (interpreting the prompt as "Return only the part of speech tag and not the word describing the part of speech") and does not explicitly forbid also adding an explanation about the reasoning behind the choice. This demonstrates the intricacies of prompt development and the value of both being more explicit about what is expected as well as testing model behavior more rigorously. The same prompt on gpt-4o most of the times only outputs the tag, but on gpt-3.5-turbo it often prefixes the tag with Output: , illustrating the problem of model portability. A prompt that works correctly on one model may not function as expected on another, highlighting the importance of extensive testing across multiple models. These challenges underscore the necessity for thorough testing of prompts on various models to identify issues early and to gain insights into how prompts behave during execution on different models. This understanding can help prompt developers address any problems and optimize their prompts for improved performance across different platforms. We developed PromptPex , a tool designed to help prompt developers better understand the exe- cution behavior of their prompts. It achieves this by first extracting input and output specifications for the prompt under test ( PUT ), which are assertion-like constraints over the input and output, equivalent to pre- and post-conditions in a program. Page 6: 6 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn PromptPex utilizes gpt-4o to extract these specifications from the prompt. The prompt developer can then examine these extracted specifications to compare their understanding with what a SOTA model perceives as expected input and the constraints on the output. 2.2 Input Specification ( IS) For the POS prompt, Figure 3 shows the extracted input specification ( IS) bygpt-4o . For the PUT , theISclearly captures that the input must be a sentence and a word, and it also lists constraints such as the requirement for the word to be a single word from the sentence. The single word constraint was not explicitly mentioned in the PUT, but during execution, the model must select a behavior, either allowing or disallowing compound words, such as ice cream . The IShighlights this under-specification in the PUT, indicating that the model has assumed it is expecting single words. Compound words might be considered valid input based on the developer’s understanding, but they would result in undefined behavior for the model and are more likely to lead to the output Unknown orCantAnswer . ◦The input consists of a sentence combined with a specific word from that sentence. ◦The sentence must contain natural language text. ◦The word must be a single word from the provided sentence. Fig. 3. Extracted Input Specification for Part-of-Speech Prompt 2.3 Output Specification or Rules ( OR) The output specification, or the rules governing the output ( OR), captures the constraints on the output generated for PUT . Figure 4 shows the extracted output specification ( OR) for the POS prompt. The prompt developer can compare these rules with their understanding. For instance, the listed rules specify that the output must consist solely of the tag, without any additional text or formatting. This can be interpreted to mean that extraneous text, such as descriptions of the tag or formatting details, is not allowed—such as the use of Output: seen with gpt-3.5-turbo but it does not restrict description of the reasoning behind the tag. These rules help the prompt developer understand the potential behavior of the prompt during execution with the state-of-the-art model. This information can be used to tune the PUT to generate specifications that match more accurately the intentions of the prompt developer. ◦The output must return only the part of speech tag without any additional text or formatting. ◦If the given word can be identified with one of the listed part of speech tags, the output must include only the specific tag for that word from the provided alphabetical list. ◦If the given word cannot be tagged with any of the listed part of speech tags, the output should be the word "Unknown". ◦If tagging the given word is not possible for any reason, the output should be the word "CantAnswer". Fig. 4. Extracted Output Rules for Part-of-Speech Prompt After first generating the ISandOR,PromptPex can then generate unit tests for the PUT . The prompt developer also has the option to edit the extracted specification to add implicit rules that are Page 7: PromptPex: Automatic Test Generation for Language Model Prompts 7 part of the pipeline in which the prompt is embedded but are not needed within the prompt itself, as the input is preprocessed. For example, the prompt developer can extend the input specification to provide the format for the input, such as "sentence;word" . 2.4 Test Generation Thus far, the prompt developer is able to align the intent with the input and output specifications extracted by PromptPex for the SOTA model. However, this does not account for how the prompt will actually perform on various inputs, especially when tested on other models. The test inputs generated by PromptPex gives the prompt developer a test suite designed to cover all the output constraints in the prompt and also to challenge the model to violate the constraints set within the prompt. By running these tests, the prompt developer can identify additional failures. Figure 5 shows a few tests generated by PromptPex for the POS prompt. This test suite also serves as a regression test suite for any future modifications to the prompt. With passing tests, the suite now encodes the developer’s intentions as concrete tests which were initially present only as specifications. ◦An aura of mystery surrounded them; aura ◦The researchers documented carefully; carefully ◦This is such a unique perspective; such Fig. 5. Tests Generated for Part-of-Speech Prompt These tests can now be executed across different models to compare and analyze the execution behavior of the prompt on other platforms. PromptPex assists the prompt developer in better understanding the behavior of the prompt by explicitly aligning the developer’s intentions with the extracted specification and implicitly through the generated test suite. This test suite is then used to identify differences between models and determine what modifications to the original prompt are needed to achieve the intended execution across various models as envisioned by the prompt developer. 2.5 Test Evaluation One use case for PromptPex is to generate tests and then allow the user to add those tests to their existing test suite. Note that PromptPex does not generate the correct output for each test case, so this scenario requires the user to add the correct output for the generated tests as an additional step. PromptPex supports an automated approach to test evaluation which checks test output not forcorrectness but for compliance with the prompt using an LLM. We discuss our approach to test evaluation for compliance in Section 3. 3 DESIGN In Figure 6, we present the end-to-end pipeline of PromptPex . It helps the prompt developer to explore prompts and understand their behavior on different models. Below, we discuss each part of the pipeline in detail. 3.1 Input Specification Similar to OR, the input specification ( IS) describes the input and the constraints expected by the PUT . The ISdefines what constitutes valid input. For the POS prompt, the IS, shown in Figure 3, Page 8: 8 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn I N P U T P R O M P TRule Gen (gpt -4o)Rule 1 Rule 1Rule 1 Input SpecTest Gen (gpt -4o)Test 1 Test 2 Test 3 … Test nTest 4gpt-35 gpt-35 gpt-35 gpt-35 gpt-35 gpt-35gpt-35 gpt-35 gpt-35 gpt-35 gpt-35 gpt-35Result 1 Result 2 Result 3 … Result nResult 4Test Eval (gpt -4o)Pass/Fail 1 Pass/Fail 2 Pass/Fail 3 … Pass/Fail nPass/Fail 4 Extract rules Create tests Run tests (on any model)Check results Fig. 6. The end-to-end pipeline of PromptPex , which is implemented using a series of LLM prompts (shown in brown). The user provides their prompt (in green on the left), and PromptPex automatically generates input and output specifications, and tests (in blue), and can run the generated tests on multiple models (shown in purple). describes the input as a sentence and a word, with the constraint that the word must be present in the sentence. Some prompts may also accept input in the form of a file; in such cases, the IS treats file input as string content and describes the content itself as the input rather than the file. PromptPex also generates the ISusing gpt-4o . We frame it as a task to extract the ISto create valid inputs. We restrict any details about the output or how the input will be used in the computation inside the PUT . In creating the IS, we first extract a description of what the inputs are. If they are composed of multiple components, those components should be listed, and their constraints and properties should be described. The complete prompt for extracting the ISis available in Appendix A.2. Sometimes the PUT will attempt to handle corner cases. For example, in the POS prompt, if the word cannot be tagged with the listed tags, it returns Unknown . In this case, the IS extractor might incorrectly assume that words which cannot be tagged are not valid inputs. While this may be true, the prompt does not imply that. We explicitly added this case in the ISextractor prompt to consider such inputs as part of the input domain, even if there is a rule against them. Both the ORand the ISare editable by the prompt developer, allowing them to modify the extracted specifications or augment them with constraints from the environment. 3.2 Output Specification or Rules The output specification or output rules ( OR) describe the constraints on the output generated for PUT. Rules are concrete, checkable, general, input-agnostic, and independent constraints over the output described in natural language and derived from the PUT . They are similar to assertions and post-conditions in a program, detailing what the output must be and must look like, without regard for how it is generated or derived. In our running example of the part-of-speech (POS) prompt, the OR, as shown in Figure 4, includes rules stating that the output should either be a POS tag, CantAnswer , orUnknown . It is important to note that ORdo not capture how the output is generated; instead, it focuses solely on constraints over the generated output. This quality of OR helps in keeping it separate from the input, allowing evaluation of the output based solely on the OR, regardless of input. The groundedness of the rules within the ORin the prompt is a necessary Page 9: PromptPex: Automatic Test Generation for Language Model Prompts 9 condition for the ORitself as it implies that all rules are valid and are present in the prompt, while the exhaustiveness in accurately covering all rules from the prompt is the sufficient condition for theORas no more rules are required. We use gpt-4o for extraction of the ORfrom PUT . The complete prompt is available in Appen- dix A.4. We frame it as a task to extract the rules for output validation such that the inputs are not available. We enforce the following properties during the extraction of the rules: •If an example is present in the prompt, do not generate rules specifically for that example. Generalize them so that they will applicable for other possible inputs. •Rules must be clear, concrete and independent from each other such that they can individu- ally be used to validate the output. •They should not contain any information about how the output depends on the input or how the output is computed.4 3.3 Test Generation Tests for PUT are generated by PromptPex using ORandIS. The ORis utilized to create directed tests that challenge the model to correctly adhere to each rule, while the ISis used to create valid test cases. The input domain specified by the ISis usually extensive, as these inputs are in natural language, which can be presented in multiple forms. It is also non-trivial to determine which part of the prompt a given input covers, leaving us without a notion of coverage in prompts unlike what we have in programs. 3.3.1 Exhaustiveness. Given the vast range of possible inputs, creating an exhaustive set of test cases is challenging. To achieve this, we generate tests for each rule in the OR. We argue that if our ORis exhaustive (completely covers the prompt), the tests generated for the rules in the ORare also exhaustive. PromptPex not only generates tests but also associates each test with a specific rule, providing reasoning for its creation. Beyond exhaustiveness, this approach allows for various analyses of the tests, as they are directly linked to a rule. We use this approach to attempt to develop an exhaustive test suite. Other possible future uses include modular updates to the test suites, allowing the addition of new tests for new rules and the removal of old tests linked to obsolete rules. 3.3.2 Generating challenging tests. Since we generate tests per rule, we can create tests that explicitly challenge these rules. We accomplish this while ensuring our test generator remains unaware of any properties of the rule itself; for a given prompt, our test generator will produce a valid test for any rule. Inverse Rules: We generate inverse rules from the given rules. The inverse of a rule is a semantic inversion that violates the rule by describing its opposite. For example, if a rule in ORstates that the output must always be a tag, it can be inverted into a rule that enforces the output to be the tag along with the actual name of the part of speech. PromptPex uses gpt-4o for generating inverses of the given rules. We ask it to generate inverse rules such that they contradict the given rules. The complete prompt is available in the Appendix A.5. We generate tests for both the rules from the ORand their inverses as shown in Figure 7. This approach helps us generate tests that cover intriguing test cases which might cause a given model to violate the OR. 3.3.3 Valid Tests. A test may not follow the ISand might create output that violates the rules described in the ORandPUT . Such test cases are not the most valuable, as the prompt developer 4We limit our scope to output compliance testing where we only validate the output unlike functional testing where we validate the output for the given input. Page 10: 10 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn The output must be a single part of speech tag that represents the given word's role in the sentence according to the specified list of tags.The output must be multiple parts of speech tags representing different potential roles for the given word according to any possible tags.Inverse RuleExtracted RuleInversed Rulesentence: The quick brown fox jumps over the lazy dog., word: foxsentence: Play the notes., word: PlayGenerate TestTest generated from ruleTest generated from inverse rule Fig. 7. Example illustrating the use of inverse rules for test generation. Because our inverse rule was used to generate the test in the lower half of the figure, the test intentionally focuses on the word "Play" which depending on context can be labeled with multiple parts of speech. knows that some of those inputs can be pre-filtered before they are passed to the prompt. However, these test cases might be useful for testing any input validation pipeline that processes the input or for identifying gaps in the assumed input domain. The most valuable tests are those that follow the IS, as they are directly actionable and require fixing. We use the extracted ISto guide PromptPex to generate tests that strictly follow the IS. Even in scenarios where the prompt-based system is exposed to raw inputs, the same IScan validate inputs and disregard any input that does not meet the input specification. The test generator in PromptPex uses gpt-4o and takes a rule (which can also be an inverse rule), ISandPUT as input. It generates a test along with the reasoning of why the model thinks this particular test during execution will comply with the rule it was generated for. We ask the test generator to start with first understanding: What is an input? What are the different components of a valid input? What are the syntax and semantics related constraints from IS? In the prompt, we direct the model to consider these issues during test generation. Each test must be generated such that the expected output or behavior demonstrates adherence to the given rules for a range of scenarios, including boundary cases, typical cases, and edge cases. The exact prompt used is available in the Appendix A.5. 3.4 Test Evaluation: Compliance versus Correctness The generated test cases can then be executed on different models. Running tests on different models helps the prompt developer understand the behavior of diverse test inputs to the prompt when run on multiple models. As mentioned, because we lack the correct output for each test input, our automated strategy for test evaluation is an LLM-based determination of test compliance. Compliance is determined by ourLLM-as-a-judge [65] which, given the original prompt and the test output, determines whether the output complies with the requirements described in the prompt. As a result, our evaluation explores only a partial understanding of the model output and must be augmented with additional testing. However, as demonstrated in traditional software engineering, testing the assertions over the output can also be very valuable [ 26,28,50]. After the tests are run, the results of our validator can assist the prompt developer in updating the prompt to correctly handle tests with invalid results. To implement LLM-as-a-judge , we use gpt-4o and provide it the output generated by executing a test and the prompt ( PUT) for which the test was executed. To allow a fair comparison with a test generator that does use PromptPex , we ensured that no artifacts specific to PromptPex are used in Page 11: PromptPex: Automatic Test Generation for Language Model Prompts 11 the validation and that the test output validator is generic, applicable to any prompt and test output regardless of the input, the method used to generate the test, or the model used to run the test. This validator is used to evaluate the outputs of tests generated by both the baseline and PromptPex . We enforce the following properties for this output validator: •Keep the evaluation independent of the input, as it will not be provided. •Must not speculate, infer, or make any assumptions during the evaluation. •Always check for compliance, not correctness. The complete prompt can be found in Appendix A.1. The output validator generates a boolean result (compliant/non-compliant) and explanation for the decision. 4 EVALUATION 4.1 Evaluation Goals In this section, we evaluate how effective PromptPex is at generating tests that provide actionable feedback to the developers for improving the prompts. Through our experiments, we answer the following research questions: •RQ1: Does having specifications help in generating better tests? •RQ2: How useful are rules for generating better tests? •RQ3: Does having an input specification and using it to generate tests create more valid tests? •RQ4: Can automatically generated tests help determine if a model is suitable for a prompt? 4.2 Baseline In our evaluation, we used a zero-shot LLM-based test generator as the baseline. We use gpt-4o to generate tests for both the baseline and PromptPex . We instruct the model to develop multiple test cases by inferring the functional specification and input for the given prompt. We ask it to begin with understanding possible inputs and its components, what the syntax and semantics related constraints are, and possible input scenarios. We enforce the following properties while generating the baseline tests: •The test cases must be designed to validate whether the output properly adheres to descrip- tion. •A good test must always be a valid input meeting the requirements mentioned in the description. •The test cases must be diverse and distinct. •Each test case must be crafted to rigorously assess whether the output meets the stipulated behavior based on the provided prompt. •The input scenarios used for creating the tests must be valid, realistic, and fully comply with the given description. •Generate test cases to broadly cover a range of scenarios, including boundary cases, typical cases, and edge cases, to thoroughly evaluate the software’s adherence to the description under various conditions. •Each test case should adhere to principles of good software testing practices, emphasizing coverage, specificity, and independence. •Focus on creating diverse test cases that effectively challenge the prompt’s capabilities by critically assessing potential weaknesses in the handling of inputs by the prompt. The complete prompt is available in Appendix C.1. We chose this as our baseline because LLMs are capable of interpreting the prompt and generating test cases for them. LLMs are already found Page 12: 12 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn useful in generating unit tests for traditional software [ 41,45,49,54,58,62,63]. We ensured that the prompt used to generate the baseline tests is robust and underwent multiple revisions to enhance its ability to generate effective tests. To provide a fair comparison with PromptPex , we explicitly covered all requirements that are implicitly enforced by PromptPex , such as generating valid, challenging, and diverse test cases that comprehensively cover the prompt and help identify any flaws. Although there will always be opportunities to improve the baseline prompt, the same holds for the prompts used in PromptPex . We have refined both approaches sufficiently so that they can be effectively compared. 4.3 Metrics We evaluate PromptPex and the baseline based on their ability to generate more effective tests. 4.3.1 % non-compliance. We define effective tests as those that expose limitations in the prompt, which means that they result in more failures. We consider non-compliance with the prompt as the metric for test quality. This includes any violations of the rules and constraints described in the prompt. This approach is similar to checking all the assertions over the output generated by a traditional program. We used LLM-as-a-judge to evaluate whether the output generated by the prompt violates any rules or constraints within the prompt as described in Section 3.4. We could have also developed a validator that does not use the PUT but instead relies on the OR, as they precisely represent the constraints over the output. However, since the ORis part of PromptPex and was not provided to the baseline test generator, we refrained from using it to maintain a fair and equal comparison. We run the tests generated by PromptPex and the baseline on gpt-4o-mini ,gemma2:9b ,qwen2.5:3b , llama3.2:1b . We compare the non-compliance of the output from the generated tests. In our eval- uation, we consider more test non-compliance as better since it indicates that the tests are more effective at identifying gaps in the prompt’s constraints. 4.3.2 Test Validity. Valid tests follows the input specification defined in the PUT . Valid tests are important as they provide instant feedback about the PUT and needs to be fixed. We also consider the test validity as a metric for test quality of the tests generated by PromptPex . We use the ISwith an LLM-as-a-judge to determine if the generated test is a valid input. The input validation prompt is available in Appendix C.2. We specifically used ISasPromptPex allows editing of generated ISto capture the precise input specification including implicit constraints arising from preprocessed inputs. 4.3.3 Groundedness and Spec Agreement. In addition to measuring test non-compliance and test validity which are direct indicators of the quality of the tests generated, we also evaluate the generated ORas these are directly used to generate tests by PromptPex and have impact on the quality of the generated tests. A rule in the ORis considered grounded if it is present in the PUT. We do not want to generate tests for the rules which are not present in the PUT . We consider a higher groundedness of the rules as an indirect metric for test quality. To determine if the generated rules are grounded in the original PUT, we used LLM-as-a-judge and ask it to confirm rule groundedness. The complete prompt for checking the groundedness of the rules is available in Appendix C.3. Another goal for generating ORis to ensure that all important constraints mentioned in the PUT are captured in the OR. We refer to this as spec agreement. To compute this, we extract a description of how the output must be computed from the PUT, the task specification . The prompt for extracting the task specification is available in Appendix C.4. We append the task specification with our extracted ORto derive a spec prompt which in theory should capture all the same constraints as the Page 13: PromptPex: Automatic Test Generation for Language Model Prompts 13 Prompt Valid Baseline TestsTask Spec. Output Spec. Input Spec.EXECUTESpec AgreementExtract SpecsSpec Prompt Fig. 8. Overview of the spec agreement estimation for the tests generated by PromptPex . The specifications are extracted from the prompt and a spec prompt is created by appending the output specification to the task specification. The valid baseline tests are then executed with the original and synthetic prompt. Spec agreement is considered high when the spec prompt and the original prompt behave similarly. This is estimated by comparing the number of non-compliant test results. PUT. We do not use the ISin the spec prompt because PromptPex generates tests for each rule in the OR. To estimate how closely the extracted specification represents the original PUT , we compare the spec prompt with the original PUT . We feed tests generated from our baseline test generator to both the spec prompt and the PUT to compare their behaviors as shown in Figure 8. We use the baseline tests instead of the PromptPex tests so that they are not influenced by our process of generating them from the OR. The spec prompt and the original PUT must have similar behavior for the generated ORto have high spec agreement. We use cosine similarity of the non-compliance percentage of spec prompt and the original PUT to derive a spec agreement score. 4.4 Benchmarks We list the prompts with sources and their descriptions in Table 1. We selected a diverse set of prompts from publicly available sources, focusing on those that are within the scope of PromptPex . Currently, we support prompts that can accept only a single input. Although this single input can be interpreted by the model to be made up of multiple components, this differentiation is not explicitly present, for example, a single string as input representing a sentence and a word for the speech tag prompt. This limitation makes it unsuitable for prompts requiring multiple embedded inputs. We also only support prompts where the output is independent of the previous outputs, making prompts describing multi-turn conversations for tasks out of scope for PromptPex . 4.5 Evaluation Procedure We accessed gpt-4o and gpt-4o-mini through APIs. We kept the temperature 1.0 across all the requests. We used Ollama [ 10] for the local models, gemma2:9b ,qwen2.5:3b ,llama3.2:1b . We ran all the experiments and hosted the local models on a virtual machine (VM) hosted on Microsoft Azure. Our system runs on an AMD EPYC 7V13 processor with 256 GB of RAM, 2 TB of SSD, and an NVIDIA A100 GPU with 80 GB of dedicated memory. The VM is running Ubuntu 22.04.5 LTS and is allocated 24 out of 64 CPU cores. All the prompts are written and ran using GenAIScript [ 4] and Prompty [13]. We ran each test once per prompt per model. 4.6 How useful are specifications for generating tests? (RQ1) To test our hypothesis that having explicit specifications like ORandIShelps in generating better tests, we generated tests for the different prompts in our benchmark using PromptPex and the Page 14: 14 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn Name of PromptPrompt Description Source speech-tag Determine the part of speech for a given word within a sentence using a predefined set of tags. The task may return tags such as noun, verb, adjective among others, or return "Unknown" or "CantAnswer" if the word cannot be classified.Modified from an ex- ample used by Schn- abel et. al. [40] text-to-p Format a paragraph of text into HTML by splitting it into sentences and wrapping each sentence with paragraph tags, while enhancing key words and phrases with addi- tional HTML tags.format_text prompt from Ghost- Writer [22] shakespeare Assist users in creating text that mimics the Shake- spearean style of writing, including the use of archaic language and stylistic elements typical of the period.Azure AI Studio Prompt Catalog [ 16] sentence Rewrite a sentence to improve its readability and make it more conversational while preserving its original mean- ing. This includes simplifying complex phrases and en- hancing engagement through fluid structure.The Big Prompt Li- brary [17] extract- namesExtract model names from machine learning paper ab- stracts, returning a structured array of identified model names or "NA" if none are found.Information extrac- tion prompt from Prompt Hub [21] elements Extract important entities from a text, including company names, people names, specific topics, and general themes, presented in a structured list format.OpenAI documenta- tion [20] classify Classify a news article into one of several predefined categories such as World, Sports, Business, or Sci/Tech, based on its content and context.Prompt used in a tu- torial [19] art-prompt Create detailed prompts based on user descriptions for generating AI images, focusing on key characteristics, timing, lighting, and the desired emotional impact of the image in a concise single paragraph.The Big Prompt Li- brary [18] Table 1. Description of the prompts used in the evaluation with their sources. baseline. To compare the quality of test generation, we compare the percentage of tests not compliant with the prompt description for PromptPex and the baseline on multiple models. Higher percentage of non-compliance is better as it represents more test failures. Table 2 shows the percentage of test non-compliance for different prompts on each model. PromptPex on average generated 5.5% more tests for different models that were not compliant with description of the PUT from the benchmark. Figure 9 shows the average non-compliance across the benchmarks for PromptPex and baseline for each of the different models. The more capable models (like gpt-4o-mini ) have lower rates of non-compliance while the smaller models (like llama3.2:1b ) have high rates. PromptPex beats baseline in every case but is particularly more effective for gpt-4o-mini , where the baseline model has a much harder time generating any tests that result in non-compliant output. Page 15: PromptPex: Automatic Test Generation for Language Model Prompts 15 Promptsgpt-4o-mini gemma2:9b qwen2.5:3b llama3.2:1b PPex BL PPex BL PPex BL PPex BL speech-tag 0% 0% 5% 2% 2% 7% 90% 98% text-to-p 15% 5% 23% 38% 72% 69% 95% 97% shakespeare 2% 2% 5% 2% 5% 5% 7% 7% sentence 12% 0% 4% 6% 21% 2% 31% 19% extract-names 0% 0% 6% 16% 31% 22% 61% 46% elements 31% 2% 44% 15% 62% 19% 54% 62% classify 4% 0% 0% 0% 12% 0% 25% 25% art-prompt 2% 0% 17% 12% 17% 7% 48% 45% Average 8% 1% 13% 11% 28% 16% 51% 50% Table 2. Test non-compliance results for the tests generated by PromptPex (PPex) vs the baseline (BL) on different models. More test non-compliance (winner shown in blue) represents more challenging tests and hence are better tests. Promptsgpt-4o-mini gemma2:9b qwen2.5:3b llama3.2:1b RL Inv RL Inv RL Inv RL Inv speech-tag 0% 0% 0% 10% 5% 0% 95% 86% text-to-p 14% 17% 14% 33% 62% 83% 90% 100% shakespeare 0% 5% 0% 10% 0% 10% 0% 14% sentence 17% 8% 0% 8% 25% 17% 38% 25% extract-names 0% 0% 0% 11% 12% 28% 56% 67% elements 33% 29% 17% 71% 58% 67% 50% 58% classify 0% 8% 0% 0% 0% 25% 25% 25% art-prompt 0% 5% 5% 29% 10% 24% 29% 67% Average 8% 9% 4% 21% 24% 32% 48% 56% Table 3. Test non-compliance results for tests generated from rules (RL) and tests generated from inverse rules (Inv) by PromptPex on different models. Higher test non-compliance (winner shown in blue) represents better tests. 4.7 How useful are rules for generating better tests? (RQ2) We saw that PromptPex generates better tests using its approach, but we also evaluate how important is the fact that we generate tests for specific rules and inverse rules. As discussed in Section 3.3.2, we extract ORfrom the PUT. The ORis made up of rules which are the constraints over the output. We create inverse rules from these rules and use the generated rules to create tests. This approach allows us to compare the non-compliance of tests generated from the ORand from the inverse OR. Table 3 shows the percentage of their test non-compliance for different prompts on each model. The tests generated from the inverse rules on average generated 8.5% more non-compliant tests as shown in Figure 10 demonstrating the role of the ORand the rules whose inverse created these tests. Notably the tests generated for the inverse rules were able to consistently generate better tests even for a large model like gpt-4o-mini which otherwise is challenging even for the tests directly generated from PromptPex rules. Page 16: 16 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn gpt-4o-mini gemma2:9b qwen2.5:3b llama3.2:1b Model01020304050Percentage Non-CompliancePromptPex Baseline Fig. 9. Average % Test Non-Compliance of tests gen- erated by PromptPex and Baseline. gpt-4o-mini gemma2:9b qwen2.5:3b llama3.2:1b Model01020304050Percentage Non-ComplianceRule Inverse RuleFig. 10. Average % Test Non-Compliance of the tests generated for rules and inverse rules by PromptPex . speech-tagtext-to-p shakespearesentence extract-nameselementsart-promptclassify Benchmark01020304050Number of T ests tests valid tests Fig. 11. Number of valid tests generated by Prompt- Pex. speech-tagtext-to-p shakespearesentence extract-nameselementsart-promptclassify Benchmark012345678Number of Rules rules grounded rulesFig. 12. Number of grounded rules generated by PromptPex . 4.8 Does having an input specification help generate valid tests? (RQ3) PromptPex uses ISto generate tests that meet the constraints on the input as described in PUT. In Figure 11, we show the total number of PromptPex generated tests for each benchmark as well as the number of tests considered valid by our validation analysis. We observe that a significant fraction of all generated tests are deemed valid, with the shakespeare prompt being an exception due to the input validator considering non-shakespearean input as invalid5. We also compare (not shown) the percentage of compliant valid tests with all the tests generated by PromptPex . We observed that 73% of all the non-compliant tests are valid tests. Non-complaint valid tests are valuable as they act as instant feedback and must be fixed. 4.9 Can automatically generated tests help determine if a model is suitable for a prompt? (RQ4) Prompt developers have many AI model options to choose from for a given application. They need to understand how effective a model is for their specific prompt. Because PromptPex automatically 5Writing prompts that process other prompts is challenging. In particular, our input validator in this case was confused and thought the input should be shakespearean text as well. Page 17: PromptPex: Automatic Test Generation for Language Model Prompts 17 generates tests, we seek to understand if the automatically generated tests can appropriately distinguish model effectiveness for a given prompt. Figure 13 shows the PromptPex test % non-compliance for the models under test across the benchmarks. It illustrates that for a given prompt, tests generated by PromptPex strongly differ- entiates the capabilities of the models and can be used to determine model suitability. The more capable models (like gpt-4o-mini ) have lower rates of non-compliance while the smaller models (like llama3.2:1b ) have high rates. At the same time, some of the smaller models are highly capable for a given prompt. For example, both the gemma2:9b andquen2.5:3b models are effective for the speech-tag benchmark. speech-tag text-to-p shakespeare sentence extract-names elements art-prompt classify average Benchmark020406080Percentage Non-Compliancegpt-4o-mini gemma2:9b qwen2.5:3b llama3.2:1b Fig. 13. % Test Non-Compliance of tests generated by PromptPex . 4.10 Groundedness and Spec Agreement It is important that the generated ORare both grounded in the PUT and cover all the important constraints expressed in the PUT. In Figure 12, we show how many rules were generated for each benchmark and whether those rules were grounded, according to our groundedness evaluation. We found that on an average 89% rules from ORwere grounded in the respective prompt from the benchmark with the classify prompt being an exception. The classify prompt takes input as a news article and classifies it into a given category. The non-grounded rules are generated by the model to compensate for the under-specification in the prompt. The classify prompt does not cover the cases when the output can be multiple categories or should the output just be the name of category without any explanation. We observe high scores for spec agreement for all the prompts in our benchmark (96.8%) with the shakespeare prompt being an exception, without it the prompts achieved the spec agreement score of 99.9%. Page 18: 18 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn High groundedness and spec agreement score ensures that the ORgenerated is neither over or under extracted respectively and the tests generated by PromptPex are focused on testing the constraints in the PUT. 5 DISCUSSION In this section, we consider concrete examples from our benchmarks to better understand how PromptPex can help with the development of prompts. PromptPex not only generates tests but also generates specifications that are used for generating tests and have value on their own. 5.1 Input Specification IS, which represents the constraints over the input, can be used to understand what the model expects as input for a particular prompt. This can be extremely useful when the input is complex or not obvious when reading the prompt. The developer can ensure the input spec remains the same with changes to the prompt that do not intend to change the input. It can also help in understanding if the input is under-specified in the prompt. For example, for the elements prompt (Appendix B.6), the ISstates that the input text can be of any length and can be in any language. This means that the model is expecting some implicit understanding of the input, as these constraints were not mentioned in the prompt itself, which only stated that the input is text. In the art prompt (Appendix B.8), the description of the input in the prompt is confusing. The prompt takes user descriptions as input and converts them into prompts for generating art. The prompt talks about the input description and the generated prompt in a similar way, which is even confusing to read. It is not straightforward to separate the specification for the input and output. For example, ensuring each description does not exceed 80 words and is crafted in a single paragraph , the description refers to the input user description, but “crafted” can also mean it is referring to the generated prompt, which must be in a single paragraph. Such ambiguities are also reflected in the generated input specification, where an output constraint about the generated prompt being in English is present in IS. The prompt writer can take a look at the ISand update their prompt to make it clearer and without ambiguities. Similarly, in the classify prompt (Appendix B.7), the prompt redefines the notion of a news article by stating that a news article can be classified into the following categories, which is reflected in theIS. If this was the actual intent, then it is okay; otherwise, seeing the ISprovided the useful feedback to change the way a news article is defined in the prompt to make it more generic. ISnot only helps in generating tests that are valid but also helps the prompt developer understand if the input is properly specified in the prompt and if there is some ambiguity or errors around it. 5.2 Output Specification The same can be said about the OR; it can also be used to understand what the model thinks about the output that can be generated. It can be used to figure out any under-specification or errors in the output constraints defined in the prompt. For example, consider the elements prompt (Appendix B.6), where the values for multiple labels are extracted from the input. The ORstates a rule that the labels must still be present even when there is no value that can be extracted for them. This case is not defined in the prompt, but it is a decision the model needs to take to handle such cases, thus pointing out an under-specification. Similarly, for the art prompt (Appendix B.8), as we saw above while observing the IS, the ORis also getting confused, and there is a rule about only generating a description of length 80 words, which must be applied to the input. A prompt developer, after seeing the ORandIS, will be able to Page 19: PromptPex: Automatic Test Generation for Language Model Prompts 19 clarify the input and output in the prompt such that they generate better and clearer specifications, which also match the intent of the developer. For the classify prompt, the ORadds a rule about not outputting anything other than the name of the category, pointing towards an under-specification, i.e., does it want the name of the category with or without any explanation, etc. From these examples, it is clear that the specifications help in revising the prompt to improve any under-specification over the input and output. They can also be used to easily identify conflicting rules, which we did not see because the prompts we used were well-formed. The specifications can also be used to highlight bugs, like the inconsistencies found in the art prompt. 5.3 Inverse Rules When we create tests from the rules in OR, we generate tests aimed at ensuring they follow the rule. Such tests ensure that the output is able to comply with the rules, and if it fails, they can help expose situations when there are conflicting rules causing the failure or when the model is not capable of generating output compliant with the rules. For a fairly capable model and a well-written prompt, these tests are expected to pass. Since inverse rules contradict the rules themselves, they attempt to push the model to generate output that does not comply with the rules. We have shown an example of a test generated from the inverse rule in Figure 7, where the generated test is created such that it can potentially confuse the model and output multiple tags for the same word. These tests also have limitations as they are bounded by the input specification, which limits how challenging they can be. For example, for the speech tag prompt (Appendix B.1), in cases when the output can be Unknown orCan’t Answer , the test might contain a made-up word, a number, or an empty string. However, all of these are outside the input domain, which forces the test generator to create valid tests that are challenging. 5.4 Comparison with Baseline Tests We examined the differences between the tests generated by PromptPex and those from the baseline to highlight some distinctive properties found in the tests created by PromptPex . 5.4.1 Creativity. Tests generated by PromptPex are more creative than those produced by the baseline while still remaining within the input domain. For instance, in the speech tag prompt (Appendix B.1), PromptPex used non-existent words in tests like sentence: The truth was uncertain, shenative., word: shenative and sentence: The xylophone zxylophone harmonizes., word: zxylophone while the most creative attempt from the baseline was the use of 12as a word, which lies outside the input domain. For the classify prompt (Appendix B.7), tests generated by PromptPex could be classified into multiple categories, like Google announces groundbreaking quantum computing progress , which fits both tech and business categories, unlike the baseline, which did not create ambiguous tests. In the extract name prompt (Appendix B.5), PromptPex produced many more tests with less common or imaginary model names such as ReinforceNet and AdvancedDL in varying contexts than the baseline did. For the sentence rewrite prompt (Appendix B.4), PromptPex addressed complex subject matter, such as the proliferation of digital technologies andintegration of quantum computing , while the baseline focused on routine subjects. Lastly, in the Shakespeare prompt (Appendix B.3), PromptPex explored modern scenarios like Compose a modern dialogue about picking a TV show to watch andWrite an excuse note for not doing my chores whereas the baseline was confined to traditional themes. 5.4.2 Complexity. PromptPex also produced more complex tests compared to the baseline. In the speech tag prompt (Appendix B.1), it used more complex words, for example, sentence: She spoke Page 20: 20 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn eloquently although about nothing in particular., word: although . For the classify prompt (Appendix B.7), PromptPex generated tests with varied descriptions, unlike the consistent tests from the baseline. In the sentence rewrite prompt (Appendix B.4), PromptPex used complex language and demonstrated conceptual depth with tests like juxtaposition of chaos and order in contemporary art , while baseline tests remained more literal. In the Shakespeare prompt (Appendix B.3), PromptPex introduced complexity by creating scenarios involving layered themes and character interactions, examples being Describe a character’s internal conflict in a scene andWrite a dramatic scene about a kingdom’s downfall . 6 LIMITATIONS PromptPex , while promising, faces several key limitations. The system is currently designed to handle single-input prompts as a single string, which can be interpreted as multiple components, but lacks support for structured inputs like embedded prompts. Additionally, PromptPex does not support generating dummy RAG data as input alongside tests, limiting its applicability in prompts which also take RAG data as input. Many prompts used in PromptPex take another prompt as input creating risk of prompt injection attacks. We filter out the malicious inputs using Azure AI Content Safety [ 2]. Despite following best practices and proper prompt delimitation, the potential for injection attacks can affect the specification extraction, test generation, and evaluation. We use LLM-as-a-judge for output val- idation, though the use of LLMs for output validation, while effective, is not 100% reliable [ 53] and may impact the system’s overall accuracy. LLMs also exhibit self-bias when evaluating their own outputs [ 55].PromptPex rely on gpt-4o for both test generation and test output validation risking the overall accuracy due to the propagation of these biases. We consider using code-based validation methods as a future improvement to address some of these issues. Furthermore, PromptPex faces parsing challenges due to occasional formatting issues in generated results, which can lead to failures in processing artifacts. We plan to implement structured output as future work to mitigate these issues. PromptPex consistently generates more concise tests compared to the baseline, which, while not observed to reduce quality, may be less preferable in contexts requiring more detailed descriptions, such as abstract or note generation. These limitations collectively provide clear directions for future work and improvements to PromptPex . 7 RELATED WORK Existing research has broadly highlighted the need for prompt testing [ 34,38]. However, existing work has focused more on using unit test suites to assist model migration for a prompt [ 29] or as a regression test suite for checking future modifications to the prompt [ 42].PromptPex automatically generates tests using the extracted specification from the prompt, and this test suite can be used for model migration or as a regression test suite. 7.1 Prompt Testing A prompt can result in different outputs when executed on different models. Even a single word change in a prompt can lead to drastic changes in the output. This makes prompt testing important, and currently, it is mostly done manually or through the use of benchmarks to evaluate a prompt’s performance on specific tasks like classification [24]. Model migration and regression also expect the user to provide a manually created unit test suite. Several prompt development platforms offer support for writing unit tests for prompts [ 1,3, 6,8,12]. Promptfoo [ 1] allows for writing assertions over generated output and SPADE [ 42] automatically generates similar assertions over the output for checking regression but none of these platforms automatically generate unit tests. Page 21: PromptPex: Automatic Test Generation for Language Model Prompts 21 7.2 Prompt Fuzzing Prompt fuzzing also addresses similar problems as PromptPex by generating tests to highlight flaws in the prompt. The key difference is that fuzzing also checks the prompt’s ability to handle ill-formed inputs, whereas PromptPex focuses on creating valid inputs. Even when it generates invalid tests, prompt fuzzing is closely related to our work as it involves generating tests for prompts. Currently, the primary focus of most prompt fuzzing efforts is the creation of adversarial inputs for red teaming [ 30,57,59–61]. The automatic test generation in promptfoo [ 9] is also focused on generating malicious prompts. While exploring the impact of malicious inputs is valuable, PromptPex has a broader focus on generating inputs that help evaluate the robustness of the combination of the PUT and model under test. 7.3 Prompt Optimization Prompt optimization involves creating prompt variants that perform better on a specific model or are more cost-effective by being smaller. Although prompt optimization is not directly related to our work, most techniques for it require an equivalence checking test suite. Often, this suite consists of a set of manually labeled examples, either from the user or from commonly available benchmarks [ 27,33,39,47]. Some methods use these datasets as seed examples and generate additional examples from them [ 23]. The generation of examples in such cases is more comparable to the baseline we used in our evaluation than to PromptPex . 7.4 Synthetic data generation There is an increasing demand for high-quality, diverse data for different phases of LLMs, which existing datasets cannot meet. Synthetic data not only fulfills this requirement but also fills gaps where specialized data is needed, for example, data with reasoning steps [ 48,52]. Synthetic data generation shares a similar goal with PromptPex in that both generate input data for a specialized task. However, most efforts in synthetic data generation are not done from scratch, and even when they are, the methodology is more similar to the baseline used in our evaluation than to PromptPex . 7.5 Prompt Specification InPromptPex , we extract input and output specifications to serve multiple purposes, from under- standing the prompt to generating input or output validators and unit tests. In parallel, Stoica et al. [46] have also highlighted the importance of specifications in prompt engineering for making it as reliable and robust as traditional software engineering. Although we could not find any work on extracting and using specifications for generating tests, we found research on using specification for input and output validation and output generation. Sharma et al. [ 43] define input specifications for a VLLM prompt using a declarative meta-language, SPML [ 44]. Amazon Bedrock Guardrails [ 11] allows defining policies over the output in a natural language abstraction over formal logic using declarative variables with natural language description. Structured or constrained decoding [ 5] generates output that follows a given domain-specific grammar, which is equivalent to an output structure specification. 8 CONCLUSION AI model prompts are an increasingly common and important part of many software applications yet few tools exist to help software developers write, test, debug, maintain, and migrate such prompts to new models. We have presented PromptPex , the first LLM-based tool that, given an AI model prompt as input, automatically generates an input and output specification for the prompt as well as test cases to Page 22: 22 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn explore the behavior of the prompt. The specifications and test cases alone are valuable artifacts that can help the prompt developer understand their prompt and test it. Tests generated by PromptPex can be used to augment existing unit test cases for prompts if they already exist. Further, PromptPex automatically tests a given prompt and test input against a collection of AI models and evaluates the results for output compliance with the input prompt, allowing users to quickly understand the behavior of their prompt on multiple models. A key element of our approach is that we extract a projection of the prompt as a set of independent, concrete, checkable output rules that are then used to create targeted tests. Many prompts used in commercial applications have natural language statements that express such rules. For example, a prompt might contain phrases like "Ensure that..." or "The output must ..." that translate directly into our output rules. In evaluating PromptPex with a suite of eight prompt benchmarks using four diverse AI models, we show that PromptPex consistently outperforms the baseline LLM-based test generator and clearly identifies which models are most suited to use for given prompt. Future extensions to PromptPex will focus on test generation for more sophisticated inputs, extracting more sophisticated logical constraints expressed in prompts, and integrating our test generation with prompt optimization approaches. REFERENCES [1]Assertions & metrics | promptfoo — promptfoo.dev. https://www .promptfoo .dev/docs/configuration/expected-outputs/. [Accessed 13-12-2024]. [2]Azure AI Content Safety – AI Content Moderation | Microsoft Azure — azure.microsoft.com. https:// azure .microsoft .com/en-us/products/ai-services/ai-content-safety. [Accessed 09-01-2025]. [3]Create Test Sets | Agenta Documentation — docs.agenta.ai. https://docs .agenta .ai/evaluation/create-test-sets. [Accessed 13-12-2024]. [4] Generative AI Scripting — microsoft.github.io. https://microsoft .github .io/genaiscript/. [Accessed 09-01-2025]. [5]GitHub - guidance-ai/guidance: A guidance language for controlling large language models. — github.com. https: //github .com/guidance-ai/guidance. [Accessed 05-02-2025]. [6]GitHub - openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. — github.com. https://github .com/openai/evals. [Accessed 13-12-2024]. [7] LangChain — python.langchain.com. https://python .langchain .com. [Accessed 13-12-2024]. [8]LangChain — python.langchain.com. https://python .langchain .com/docs/concepts/testing/#unit-tests. [Accessed 13-12-2024]. [9]LLM Vulnerability Scanner | promptfoo — promptfoo.dev. https://www .promptfoo .dev/llm-vulnerability-scanner/. [Accessed 05-01-2025]. [10] Ollama — ollama.com. https://ollama .com/. [Accessed 09-01-2025]. [11] Prevent factual errors from LLM hallucinations with mathematically sound Automated Reasoning checks (preview) | Amazon Web Services — aws.amazon.com. https://aws .amazon .com/blogs/aws/prevent-factual-errors-from-llm- hallucinations-with-mathematically-sound-automated-reasoning-checks-preview/. [Accessed 14-01-2025]. [12] PromptLayer — docs.promptlayer.com. https://docs .promptlayer .com/quickstart#evaluations. [Accessed 13-12-2024]. [13] prompty.ai — prompty.ai. https://prompty .ai/. [Accessed 09-01-2025]. [14] TypeChat — microsoft.github.io. https://microsoft .github .io/TypeChat/. [Accessed 05-01-2025]. [15] Validators - Pydantic — docs.pydantic.dev. https://docs .pydantic .dev/latest/concepts/validators/. [Accessed 05-01- 2025]. [16] Azure ai studio prompt catalog. https://ai .azure .com/explore/prompts/shakespeare _writing _assistant/ version/0 .0.1/registry/azureml?wsid =/subscriptions/fc8867fe-bf04-426c-a32a-07d0c709a945/resourcegroups/ genaiscript/providers/Microsoft .MachineLearningServices/workspaces/genaiscript&tid =512451b2-ca3c-4016-b97c- 10bd8c704cfc&promptType =promptSamples&promptSharedInHub =false, 2023. [17] The big prompt library. https://github .com/0xeb/TheBigPromptLibrary/blob/main/CustomInstructions/ChatGPT/ SaxlWzH4g_Sentence_Rewriter_Tool .md, 2023. [18] The big prompt library. https://github .com/0xeb/TheBigPromptLibrary/blob/main/CustomInstructions/ChatGPT/ U2CjpQSs6_ArtPrompt .md, 2023. Page 23: PromptPex: Automatic Test Generation for Language Model Prompts 23 [19] How to use llama2 tutorial for text classification. https://pupuweb .com/how-use-llama-2-text-classification-tasks/, 2023. [20] Openai documentation: Best practices for prompt engineering with the openai api. https://help .openai .com/en/articles/ 6654000-best-practices-for-prompt-engineering-with-the-openai-api, 2023. [21] Prompt examples from the website. https://www .promptingguide .ai/prompts/information-extraction/extract-models, 2023. [22] Ghostwriter: Augmenting collaborative human-ai writing experiences through personalization and agency. https: //arxiv .org/abs/2402 .08855, 2024. [23] Eshaan Agarwal, Vivek Dani, Tanuja Ganu, and Akshay Nambi. Promptwizard: Task-aware agent-driven prompt optimization framework. arXiv preprint arXiv:2405.18369 , 2024. [24] BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research , 2023. [25] Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, and Pankaj Jalote. Unit test generation using generative ai: A comparative performance analysis of autogeneration tools. In Proceedings of the 1st International Workshop on Large Language Models for Code , pages 54–61, 2024. [26] Margaret Burnett, Curtis Cook, Omkar Pendse, Gregg Rothermel, Jay Summet, and Chris Wallace. End-user software engineering with assertions in the spreadsheet paradigm. In 25th International Conference on Software Engineering, 2003. Proceedings. , pages 93–103. IEEE, 2003. [27] Yuyan Chen, Zhihao Wen, Ge Fan, Zhengyu Chen, Wei Wu, Dayiheng Liu, Zhixu Li, Bang Liu, and Yanghua Xiao. Mapo: Boosting large language model performance with model-adaptive prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2023 , page 3279–3304. Association for Computational Linguistics, 2023. [28] Lori A Clarke and David S Rosenblum. A historical perspective on runtime assertion checking in software development. ACM SIGSOFT Software Engineering Notes , 31(3):25–37, 2006. [29] Tanay Dixit, Daniel Lee, Sally Fang, Sai Sree Harsha, Anirudh Sureshan, Akash Maharaj, and Yunyao Li. Retain: Interactive tool for regression testing guided llm migration, 2024. [30] Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, and Kwok-Yan Lam. Effective and evasive fuzz testing-driven jailbreaking attacks against llms, 2024. [31] Tommy Guy, Peli de Halleux, Reshabh K Sharma, and Ben Zorn. Prompts are programs. https://blog .sigplan .org/2024/ 10/22/prompts-are-programs//, 2024. SIGPLAN Perspectives Blog, [Accessed 18-12-2024]. [32] Eaman Jahani, Benjamin S. Manning, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz. As generative models improve, we must adapt our prompts, 2024. [33] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714 , 2023. [34] Wanqin Ma, Chenyang Yang, and Christian Kästner. Why is my prompt getting worse? Rethinking regression testing for evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI , CAIN 2024, page 166–171. ACM, April 2024. [35] Microsoft. Overview of Microsoft IntelliTest. https://learn .microsoft .com/en-us/visualstudio/test/intellitest- manual/?view =vs-2022, 2024. [Accessed 18-12-2024]. [36] Facundo Molina, Alessandra Gorla, and Marcelo d’Amorim. Test oracle automation in the era of llms. ACM Transactions on Software Engineering and Methodology , 2024. [37] OpenAI. OpenAI API guide: Using JSON mode. https://community .openai .com/t/openai-api-guide-using-json- mode/557265, 2024. [Accessed 18-12-2024]. [38] Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, and Austin Z. Henley. Building your own product copilot: Challenges, opportunities, and needs, 2023. [39] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with" gradient descent" and beam search. arXiv preprint arXiv:2305.03495 , 2023. [40] Tobias Schnabel and Jennifer Neville. Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization. In Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 670–686, 2024. [41] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering , 50(1):85–105, 2024. [42] Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, and Eugene Wu. spade: Synthesizing data quality assertions for large language model pipelines. Proceedings of the VLDB Endowment , 17(12):4173–4186, August 2024. Page 24: 24 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn [43] Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. Defending language models against image-based prompt attacks via user-provided specifications. In 2024 IEEE Security and Privacy Workshops (SPW) , pages 112–131. IEEE, 2024. [44] Reshabh K Sharma, Vinayak Gupta, and Dan Grossman. Spml: A dsl for defending language models against prompt attacks. arXiv preprint arXiv:2402.11755 , 2024. [45] Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering , pages 313–322, 2024. [46] Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Hao Zhang, Anastasios Angelopoulos, Shishir G Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q Davis. Specifications: The missing link to making the development of llm systems an engineering discipline. arXiv preprint arXiv:2412.05299 , 2024. [47] Hong Sun, Xue Li, Yinchuan Xu, Youkow Homma, Qi Cao, Min Wu, Jian Jiao, and Denis Charles. Autohint: Automatic prompt optimization with hint generation, 2023. [48] Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. Large language models for data annotation and synthesis: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 930–957, 2024. [49] Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. Chatgpt vs sbst: A comparative assessment of unit test suite generation. IEEE Transactions on Software Engineering , 2024. [50] Masoumeh Taromirad and Per Runeson. A literature survey of assertions in software testing. In International Conference on Engineering of Computer-Based Systems , pages 75–96. Springer, 2024. [51] Nikolai Tillmann and Peli de Halleux. Pex - white box test generation for .net. In Proc. of Tests and Proofs (TAP’08) , volume 4966 of LNCS , pages 134–153. Springer Verlag, April 2008. [52] Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, et al. A survey on data synthesis and augmentation for large language models. arXiv preprint arXiv:2410.12896 , 2024. [53] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926 , 2023. [54] Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. Chatunitest: a chatgpt-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 , 2023. [55] Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, and William Wang. Pride and prejudice: LLM amplifies self-bias in self-refinement. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 15474–15492, Bangkok, Thailand, August 2024. Association for Computational Linguistics. [56] Zhiyi Xue, Liangguo Li, Senyue Tian, Xiaohong Chen, Pingping Li, Liangyu Chen, Tingting Jiang, and Min Zhang. Llm4fin: Fully automating llm-powered test case generation for fintech software acceptance testing. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis , pages 1643–1655, 2024. [57] Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, and Xiangyu Zhang. Parafuzz: An interpretability-driven technique for detecting poisoned samples in nlp. Advances in Neural Information Processing Systems , 36, 2024. [58] Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering , pages 1607–1619, 2024. [59] Dongyu Yao, Jianshu Zhang, Ian G. Harris, and Marcel Carlsson. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 4485–4489, 2024. [60] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts, 2023. [61] Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi, and Xinyu Xing. Promptfuzz: Harnessing fuzzing techniques for robust testing of prompt injection in llms, 2024. [62] Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering , 1(FSE):1703–1726, 2024. [63] Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. No more manual tests? evaluating and improving chatgpt for unit test generation. arXiv preprint arXiv:2305.04207 , 2023. [64] Quanjun Zhang, Weifeng Sun, Chunrong Fang, Bowen Yu, Hongyan Li, Meng Yan, Jianyi Zhou, and Zhenyu Chen. Exploring automated assertion generation via large language models. ACM Transactions on Software Engineering and Methodology , 2024. Page 25: PromptPex: Automatic Test Generation for Language Model Prompts 25 [65] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems , NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc. A PROMPTS USED IN PROMPTPEX A.1 Output Validation using PUT system: Your task is to very carefully and thoroughly evaluate the given output generated by a chatbot to find out if it comply with its description. Since the input is not given to you, only check for the rules which can be checked without knowing the input. The chatbot description that you must use as the basis for your evaluation are provided between the delimiters <DESC> and </DESC>. The description is as follows: <DESC>{{ system }} </DESC> Here are the guidelines to follow for your evaluation process: 1. **Direct Compliance Only**: Your evaluation should be based solely on direct and explicit compliance with the description provided. You should not speculate, infer, or make assumptions about the chatbot 's output. Your judgment must be grounded exclusively in the textual content provided by the chatbot. Do not check for anything which requires knowing the input. 2. **Binary Decision on Compliance**: You are required to make a binary decision based on your evaluation: - Return 'OK'if chatbot output complies with the description (except checks which requires knowing the input). - Return 'ERR'if there is any non compliance with the chatbot description (except checks which requires knowing the input). 3. **Compliance Statement**: Carefully examine the output and determine why the output does not comply with the description (except rules which requires knowing the input ), think of reasons why the output complies or does not compiles with the chatbot description, citing specific elements of the output. 4. **Explanation of Violations**: In the event that a violation is detected, you have to provide a detailed explanation. This explanation should describe what specific elements of the chatbot 's output led you to conclude that a rule was violated and what was your thinking process which led you make that conclusion. Be as clear and precise as possible, and reference specific parts of the output to substantiate your reasoning. 5. **Checking compliance and never correctness**: You are not required to evaluate the functional correctness of the chatbot 's output as you are not given the input which generated those outputs. Your evaluation should focus solely on whether the output complies with the chatbot description, if it requires knowing the input, ignore that part of the description. 6. **Output guidelines**: For the chatbot 's output given to you, first describe your thinking and reasoning that went into coming up with the decision then in the next line output 'OK'or'ERR'based on your decision. Output 'OK', if the chatbot 's output complies with the chatbot description. Output 'ERR', if the chatbot 's output does not comply with the chatbot description. Do not output anything else. Example output: Mention the reason for violation and your thinking went into coming up with it. ERR No violation.OK By adhering to these guidelines, you ensure a consistent and rigorous evaluation process. Be very rational and do not make up information. Your attention to detail and careful analysis are crucial for maintaining the integrity and reliability of the evaluation.user: Chatbot Output: {{ result }} A.2 ISExtraction Page 26: 26 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn system: You are an expert in analyzing chatbot functionalities and identifying the requirements for their inputs. Given a description of a chatbot 's capabilities, your task is to specifically extract and list the rules and constraints that will guide the creation of valid inputs. Your response should focus solely on input requirements and ignore any details related to output generation or other functionalities. Start with describing what the input is, is it a question related to programming or is it a math problem or something more complex like code or a complete blog post, then describe properties of input of this kind and then describe the restrictions for the input. Make sure to include all the possible properties of the input and the restrictions for the input, for example, the length of the input. If the chatbot description handles a corner case, for example if the description says ignore all the greetings, it means that a greeting is a valid input but the chatbot is handling it in a special way which makes it a part of the input domain and there must not be a rules against it. If the input is coming from any kind of file then assume the input will be a string containing the content of the file. Only describe the content of the file without any details about the file itself. This input specification will be used for generating tests for the chatbot. Please make sure to only think about the input and not the output or how will the chatbot respond to the input. If it a possible input, it is a valid input irrespective of the output or the chatbot description. Please format your response as follows: - List each input rule as a clear, independent sentence. - Ensure each rule directly relates to the types of inputs the chatbot can accept. - Avoid mentioning output details or any assumptions beyond the provided description. - Do not add unnecessary details, generate max two rules for each compenent of the input. Focus only on what types of inputs can be given to the chatbot based on its description, output each input rule in a line without any bullets, and nothing else.user: Chatbot description: {{context}} A.3 Inverse Rule system: Given a list of rules provided by the user, generate another list of rules which contradicts the given rules semantically. Generate one inversed rule for each given rule in the given list. Come up with smart edge case scenarios. Please ensure that each generated rule is only in a single line. Output only the generated rules and nothing else.user: Rules: {{rule}} A.4 Extract Rules system: You are an expert in analyzing chatbot description and extracting rules and constrains for output validation. You are given a description for a chatbot. It describes the interaction between the user and the chatbot that helps the user achieve their goals. Sometimes the description will contain examples. DO NOT provide rules that only apply for those examples. Generalize the rules so that they will apply for other possible inputs. Ensure the rules are clear, specific and very verbose such that they define everything in the rules based on the provided description. Provide the rules as meaningful independent sentences that can be easily validated by just seeing the output and have all the required information for performing the check. Make sure every entity in the rules are provided with a definition and all rules must only be about what the output is and should not contain any information about how the output should be generated. {% if num_rules == 0 %} Page 27: PromptPex: Automatic Test Generation for Language Model Prompts 27 Only output all the rules related to the output or response generated by the chatbot based on the given description, one in each line and nothing else without any bullets or numbering. Do not make any assumptions. {% else %} Output at least {{num_rules}} most crucial rules related to the output or response generated by the chatbot based on the given description, one in each line and nothing else without any bullets or numbering. Do not make any assumptions. {% endif %}user: System prompt: {{input_data}} A.5 Generate Tests system: You are tasked with developing multiple test cases for an software, given its functional and input specification and a list of rules as input. For each rule, you must create {{num}} test cases. These test cases must be designed to validate whether the software 's outputs correctly adhere to a particular rule. These tests must be well defined based on the input specifications. Start with first understanding what is a input for the software using the given input specification. Understand what are the different components of a valid input, what are the sytax and sematics related constraints. A good test must always be a valid input meeting the requirements from the given input specification. Use the following input specification to understand valid inputs and generate good tests: {{input_spec}} Use the following functional specification of the software to generate the test cases: {{context}} Guidelines for generating test cases: - Analyze the input specifications to understand the valid input formats, components of the input and scenarios for the software. - If the test case have multiple components, try to use all the components in the test case and tag them with their name, like, component name: value - Develop {{num}} test cases for each rule provided in the list. - Each test case must be crafted to rigorously assess whether the software 's output meets the stipulated rule based on the inputs that conform to the provided input specification. - Use valid and realistic input scenarios that fully comply with the input specifications and are relevant to the rule being tested. - Specify clearly in each test case the input given to the software and the expected output or behavior that demonstrates adherence to the rule. - Broadly cover a range of scenarios, including boundary cases, typical cases, and edge cases, to thoroughly evaluate the software 's adherence to the rule under various conditions. - Never generate similar or redundant test cases Each test case should adhere to principles of good software testing practices, emphasizing coverage, specificity and independence. Critically assess potential weaknesses in the software 's handling of inputs based on the rule and focus on creating diverse test cases that effectively challenge the software 's capabilities. Format your response in a structured CSV format as follows: - "ruleid": Identifier for the rule being tested. - "testid": Sequential identifier for each test case under a rule. - "testinput": Detailed input provided to the software. - "expectedoutput": Output or behavior expected from the software to affirm rule adherence. - "reasoning": Brief explanation of why this test case is relevant and contributes to robust testing of the rule. List the input specification that this test case does not follow. Example CSV layout: ruleid, testid, testinput, expectedoutput, reasoning 1, 1, "input based on rule 1 scenario 1", "expected outcome demonstrating rule adherence", "Explains the relevance and effectiveness of the test and how it follows the input specification" 1, 2, "input based on rule 1 scenario 2, examples", "expected response confirming rule", "Illustrates how inputs challenge the software and ensure compliance and how is a valid test case based on input specification" Page 28: 28 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn Only output the test cases in the specified CSV format and nothing else. Please make sure that the CSV generated is well formed, only have five columns and each value in a these columns must only have commas inside quoted value else they will be counted as a new column. Do not wrap the output in any additional text or formatting like triple backticks or quotes. Since you will be given {{ num_rules }} rules, you are expected to generated {{ num_rules * num }} tests, {{ num }} for each given rule. user: List of Rules: {{rule}} B PROMPTS USED IN THE BENCHMARK B.1 Speech Tag system: In this task, you will be presented with a sentence and a word contained in that sentence. You have to determine the part of speech for a given word and return just the tag for the word 's part of speech. Return only the part of speech tag. If the word cannot be tagged with the listed tags, return Unknown. If you are unable to tag the word, return CantAnswer. Here is the Alphabetical list of part-of-speech tags used in this task: CC: Coordinating conjunction, CD: Cardinal number, DT: Determiner, EX: Existential there, FW: Foreign word, IN: Preposition or subordinating conjunction, JJ: Adjective, JJR: Adjective, comparative, JJS: Adjective, superlative, LS: List item marker, MD: Modal, NN: Noun, singular or mass, NNS: Noun, plural, NNP: Proper noun, singular, NNPS: Proper noun, plural, PDT: Predeterminer, POS: Possessive ending, PRP: Personal pronoun, PRP$: Possessive pronoun, RB: Adverb, RBR: Adverb, comparative, RBS: Adverb, superlative, RP: Particle, SYM: Symbol, TO: to, UH: Interjection, VB: Verb, base form, VBD: Verb, past tense, VBG: Verb, gerund or present participle, VBN: Verb, past participle, VBP: Verb, non-3rd person singular present, VBZ: Verb, 3rd person singular present, WDT: Wh-determiner, WP: Wh-pronoun, WP$: Possessive wh-pronoun, WRB: Wh-adverbuser: {{sentenceword}} B.2 Text to P system: You are a web developer who is formatting a paragraph of text as HTML. First, please split the paragraph into individual sentences and wrap each sentence with a <p> tag. **Your answer should have at least three <p> tags.** Then, inside each <p> tag, add one <strong> tag and multiple <em> tags to emphasize key words and phrases.user:{{text}} B.3 Shakespeare system: You are a Shakespearean writing assistant who speaks in a Shakespearean style. You help people come up with creative ideas and content like stories, poems, and songs that use Shakespearean style of writing style, including words like "thou" and "hath". Here are some example of Shakespeare 's style: - Romeo, Romeo! Wherefore art thou Romeo? - Love looks not with the eyes, but with the mind; and therefore is winged Cupid painted blind. - Shall I compare thee to a summer 's day? Thou art more lovely and more temperate. Example: user: Please write a short text turning down an invitation to dinner. assistant: - Dearest, Regretfully, I must decline thy invitation. Prior engagements call me hence. Apologies. Page 29: PromptPex: Automatic Test Generation for Language Model Prompts 29 user: {{question}} B.4 Sentence system: Rewrite the following sentence to enhance its readability and make it sound more conversational. Ensure that the original meaning and factual accuracy are preserved. Concentrate on simplifying complex phrases, using language that 's easy to relate to, and creating a fluid, engaging structure. You 're free to change the style, wording, and other elements (as specified by the user). Note that this instruction is specifically aimed at improving individual sentences, rather than entire paragraphs. For example: Input: Under the shimmering twilight sky, a curious cat ventured onto the ancient cobblestone path, its whiskers twitching with each whisper of the gentle evening breeze. Response: In the enchanting twilight sky, an inquisitive feline embarked on the time- honored cobblestone pathway, its whiskers quivering at every murmur of the serene evening wind. Input:user:{{text}} B.5 Extract Names system: Your task is to extract model names from machine learning paper abstracts. Your response is an array of the model names in the format [\"model_name\"]. If you don ' t find model names in the abstract or you are not sure, return [\"NA\"]user: Abstract: {{input}} B.6 Elements system: Extract the important entities mentioned in the text below. First extract all company names, then extract all people names, then extract specific topics which fit the content and finally extract general overarching themes Desired format: Company names: <comma_separated_list_of_company_names> People names: -||- Specific topics: -||- General themes: -||-user:Text: {{text}} B.7 Classify system: A news article can be classified as one of the following categories: World, Sports, Business, Sci/Tech. Examples: - World: "UN chief urges action on climate change as report warns of 'catastrophe '" - Sports: "Ronaldo scores twice in Manchester United return" - Business: "Apple delays plan to scan iPhones for child abuse images" - Sci/Tech: "SpaceX launches first all-civilian crew into orbit" ' Based on these categories, classify this news article:user:{{text}} Page 30: 30 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn B.8 Art Prompt system: Your role is to transform user descriptions into detailed prompts for generating AI photos, ensuring each description does not exceed 80 words and is crafted in a single paragraph. Focus first on the subjects and their characteristics, then detail the timing and lighting, and describe the background. Conclude by conveying the feeling the image should evoke. Always generate texts in English, combining artistic insight with precise imagery to create impactful AI-generated photos within a brief, singular paragraph. Input from the user:user:{{text}} C PROMPTS USED IN EVALUATION C.1 Baseline system: You are tasked with developing multiple test cases for an software, use the given description to infer its functional and input specification. You must create {{num }} distinct and diverse test cases. These test cases must be designed to validate whether the software 's outputs correctly adhere to description. These tests must be well defined as per the description. Start with first understanding what is a input for the software. Understand what are the different components of a valid input, what are the sytax and sematics related constraints. A good test must always be a valid input meeting the requirements mentioned in the given description. Use the given description of the software to generate the test cases. Guidelines for generating test cases: - Use the description to understand the valid input formats, components of the input and scenarios for the software. - If the test case have multiple components, try to use all the components in the test case and tag them with their name, like, component name: value - Each test case must be crafted to rigorously assess whether the software 's output meets the stipulated behavior based on the provided software description. - Use valid and realistic input scenarios that fully comply with the given description. - Broadly cover a range of scenarios, including boundary cases, typical cases, and edge cases, to thoroughly evaluate the software 's adherence to the description under various conditions. - Never generate similar or redundant test cases. - Test cases must never have the output or the expected output, it must only contain the input. Each test case should adhere to principles of good software testing practices, emphasizing coverage, specificity and independence. Critically assess potential weaknesses in the software 's handling of inputs and focus on creating diverse test cases that effectively challenge the software 's capabilities. Separate each test case with a new line with "===" as the delimiter. It will help in identifying each test case separately. Do not wrap the output in any additional text like the index of test case or formatting like triple backticks or quotes. Only output the test cases directly separated by "===". Try to generate {{ num }} test cases. user: Description of the software: {{prompt}} C.2 ISValidation system: Your task is to very carefully and thoroughly evaluate the given input to a chatbot to find out if it comply with its input specification, that is, if it is a valid input. Use the following input specification to evaluate the given input: <SPEC> {{input_spec}} </SPEC> Here are the guidelines to follow for your evaluation process: Page 31: PromptPex: Automatic Test Generation for Language Model Prompts 31 1. **Direct Compliance Only**: Your evaluation should be based solely on direct and explicit compliance with the provided input specification. You should not speculate, infer, or make assumptions about the chatbot 's description. Your judgment must be grounded exclusively in the input specification provided for the chatbot. 2. **Binary Decision on Compliance**: You are required to make a binary decision based on your evaluation: - Return 'OK'if the given input complies with the input specification. - Return 'ERR'if there is any non compliance with the input specification. 3. **Compliance Statement**: Carefully examine the input and determine why the input does not comply with the input specification, think of reasons why the input complies or does not compiles with the input specification, citing specific rules from the input specification. 4. **Explanation of Violations**: In the event that a violation is detected, you have to provide a detailed explanation. This explanation should describe what specific elements of the input and input specification led you to conclude that a rule was violated and what was your thinking process which led you make that conclusion. Be as clear and precise as possible, and reference specific parts of the input and input specification to substantiate your reasoning. 5. **Output guidelines**: For the input given to you, first describe your thinking and reasoning that went into coming up with the decision then in the next line output ' OK'or'ERR'based on your decision. Output 'OK', if the input complies with the input specification. Output 'ERR', if the input does not comply with the input specification. Do not output anything else. Examples: Mention the reason for violation and your thinking went into coming up with it. ERR No violation.OK By adhering to these guidelines, you ensure a consistent and rigorous evaluation process. Be very rational and do not make up information. Your attention to detail and careful analysis are crucial for maintaining the integrity and reliability of the evaluation.user:Input: {{test}} C.3 Rule Groundedness system: You are given a rule and a description of a chatbot. Your task is to evaluate the rule to determine if it is grounded in the provided description. A rule is considered grounded if it is supported by the information provided in the description. Use the following description to evaluate the rule: <DESCRIPTION> {{ description }} </DESCRIPTION> Output 'OK'if the rule is grounded in the description. Output 'ERR'if the rule is not grounded in the description. Only output the decision as OK or ERR and nothing else.user: Rule: {{ rule }} C.4 Task Specification Extraction system: You are given a description of a chatbot 's task. Your task is to extract the intent of the chatbot from the given description. The intent is the primary goal or purpose of the chatbot. It is the action that the chatbot is designed to perform based on the task description. In the output, provide the extracted intent of the chatbot. Only output the extracted intent and nothing else. Do not include any additional information in the output.user: Page 32: 32 Reshabh K Sharma, Jonathan de Halleux, Shraddha Barke, and Benjamin Zorn {{ prompt }}

---