Paper 2503.10306

Test Amplification for REST APIs Using "Out-of-the-box" Large Language Models

Authors: Tolgahan Bardakci, Serge Demeyer, Mutlu Beyazit

Published: 2025-03-13

Abstract:

REST APIs are an indispensable building block in today's cloud-native applications, so testing them is critically important. However, writing automated tests for such REST APIs is challenging because one needs strong and readable tests that exercise the boundary values of the protocol embedded in the REST API. In this paper, we report our experience with using "out of the box" large language models (ChatGPT and GitHub's Copilot) to amplify REST API test suites. We compare the resulting tests based on coverage and understandability, and we derive a series of guidelines and lessons learned concerning the prompts that result in the strongest test suite.

Paper Content:
arXiv:2503.10306v1 [cs.SE] 13 Mar 2025

Tolgahan Bardakci*, Serge Demeyer†, Mutlu Beyazıt†
*Universiteit Antwerpen; †Universiteit Antwerpen and Flanders Make

Index Terms: REST APIs; Software Testing; Test Amplification; Artificial Intelligence; Large Language Models; Prompt Engineering

I. INTRODUCTION

The API economy is an expanding trend in modern society, enabling companies and organizations to share data and functionality with other businesses, developers, and customers. Through Application Programming Interfaces (APIs), software engineers can develop reliable applications by seamlessly integrating various components. REST APIs, in particular, are the dominant architectural style. Their stateless nature allows for increased scalability (e.g., auto-scaling). Given the distributed nature of REST APIs, ensuring high quality is crucial. Therefore, these APIs must be tested extensively.

However, testing at the API level is inherently complex. First, there is the technical complexity induced by the sheer number of possible combinations between different protocols and an even greater number of combinations of API calls.
Second, different engineering teams develop different components, adding organizational complexity. Testing boundary values remains crucial, as these will ultimately reveal the underlying defects (a.k.a. "the needle in the haystack").

Test amplification is a likely solution for searching the needle in the haystack, as substantial evidence supports its effectiveness in the context of unit tests [3]. Indeed, test amplifiers automatically transform an existing, manually written test suite into a more comprehensive one with stronger coverage. An amplified test suite exercises a broader range of conditions, including boundary test values that reveal defects.

Unfortunately, the readability of the amplified tests poses a challenge. The current generation of test amplification tools uses generic names for temporary variables (such as t1, t2, t3, ...). Also, the injected code sometimes deviates from accepted coding conventions, which hinders the readability and, ultimately, the understandability of the test cases. Using large language models for test amplification can be beneficial in addressing these issues. Since they have seen numerous test code examples, they will likely generate meaningful names and use proper coding idioms. However, prompt engineering is needed to optimize the test amplification process.

This paper reports on using "out-of-the-box" large language models to amplify REST API tests. We adopt ChatGPT 3.5, ChatGPT 4, and Copilot version 1.5.3.5510. We validate the results against a well-known representative cloud application called PetStore [4], an open-source system with multiple API endpoints providing read, write, update, and delete actions. We compare the results based on coverage and understandability and derive guidelines and lessons learned concerning the prompts that generate strong tests.

II. RELATED WORK

Testing REST APIs is crucial for ensuring the reliability and functionality of web services.
The state of the art in this domain includes a variety of techniques and tools aimed at automating and simplifying the testing process. One key approach involves functional testing, which verifies that the API behaves as expected under various conditions. Modern tools such as Postman, SoapUI, and RestAssured offer robust frameworks for creating and executing API tests [5]. These automated testing frameworks have also been enhanced by continuous integration and continuous deployment (CI/CD) pipelines, allowing for more frequent and reliable testing cycles.

[SUBMITTED TO IEEE Software – Special Issue on Next-generation Software Testing]

To evaluate the strength of an API test suite, a series of API coverage metrics have been proposed by Martin-Lopez et al. [6]. These coverage metrics are defined based on elements of the OpenAPI documentation. They are quantified by the ratio of the number of elements observed via HTTP requests or responses to the number of elements in the documentation. The metrics derived from observed HTTP messages include:

• Path Coverage: The ratio of tested paths to the total documented paths.
• Operation Coverage: The ratio of tested operations to the total documented operations.
• Parameter Coverage: The ratio of tested input parameters to the total documented parameters.
• Request Content-type Coverage: The ratio of tested content-types to the total accepted content-types. This excludes wildcard types (e.g., application/*).
• Status Code Class Coverage: Achieved when both correct (2XX) and erroneous (4XX, 5XX) status codes are triggered.
• Status Code Coverage: The ratio of obtained status codes to the total documented status codes.
• Response Content-type Coverage: The ratio of obtained content-types to the total documented response content-types, also excluding wildcard types.

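As a rough illustration of the ratio-based metrics listed above, the following sketch computes a coverage ratio from a set of documented elements and a set of observed ones, and checks the status code class criterion. It is not the implementation used by any coverage tool; the class name `CoverageSketch`, its method names, and the handling of an empty specification are assumptions made for this example, and the paths in `main` are only a small illustrative subset.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; names and edge-case handling are assumptions.
final class CoverageSketch {

    // Ratio of documented elements (paths, operations, parameters, ...)
    // that were actually observed in HTTP requests or responses.
    static double ratio(Set<String> observed, Set<String> documented) {
        if (documented.isEmpty()) {
            return 0.0; // assumption: nothing documented means nothing to cover
        }
        Set<String> covered = new HashSet<>(observed);
        covered.retainAll(documented); // undocumented elements do not count
        return (double) covered.size() / documented.size();
    }

    // Status code class criterion: satisfied once both a correct (2XX)
    // and an erroneous (4XX or 5XX) status code have been triggered.
    static boolean statusClassCovered(Set<Integer> statusCodes) {
        boolean correct = false;
        boolean erroneous = false;
        for (int code : statusCodes) {
            if (code >= 200 && code < 300) {
                correct = true;
            }
            if (code >= 400 && code < 600) {
                erroneous = true;
            }
        }
        return correct && erroneous;
    }

    public static void main(String[] args) {
        // Hypothetical subset of documented paths and the single path a test exercised.
        Set<String> documented = Set.of(
                "/pet", "/pet/{petId}", "/pet/{petId}/uploadImage", "/store/order");
        Set<String> observed = Set.of("/pet/{petId}/uploadImage");

        System.out.printf("Path coverage: %.0f%%%n", 100 * ratio(observed, documented));
        System.out.println("Status class covered: " + statusClassCovered(Set.of(200, 404)));
    }
}
```

The same `ratio` helper applies to any of the ratio-style metrics; only the sets of documented and observed elements change. The boolean check follows the textual definition for a single operation; aggregating it into a percentage over a whole API would need an extra step that is not shown here.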
Test amplification is an umbrella term for various activities that analyze and operate on existing test suites, including augmentation, optimization, enrichment, and refactoring [7]. Test amplification differs from test generation, as it creates new test cases based on existing ones instead of building them from scratch. This provides a significant advantage, as the amplified test code will better comply with the test architecture.

Artificial intelligence (AI) techniques, specifically large language models (LLMs), have recently garnered a lot of attention, and numerous software engineering techniques are incorporating such AI models in various ways. One such application is using LLMs for unit test amplification. There are at least two recent reports concerning industrial adoption of LLMs for unit test amplification, at Meta [8] and GitHub [9]. Other AI applications to software testing concern web API testing by means of generating realistic test inputs [1], [2], enhancing API specifications [1], and generating valid API calls [2]. The authors highlight the power of LLMs in API testing. However, to the best of our knowledge, no reports exist that combine test amplification with LLMs to create stronger REST API tests.

Recently, AI tools to generate strong and readable test cases have been gaining attention. Combining test amplification with large language models is particularly appealing because amplified test code will better comply with the test architecture. However, amplifying REST API tests remains uncharted territory.

III. COMPARISON SET-UP

Since amplifying REST API tests remains an unexplored area, we set out to investigate how large language models can strengthen a REST API test suite. We restrict ourselves to "out-of-the-box" models to obtain a minimum viable baseline; future extensions (in particular RAG pipelines) could probably go further.
We compare ChatGPT 3.5, ChatGPT 4, and Copilot version 1.5.3.5510 with varying prompts to derive guidelines and lessons learned. Our comparison is driven by an overarching research question.

• How can we use large language models to amplify test code for REST APIs?

We validate the results against a well-known cloud application called PetStore [4]. PetStore is an open-source web application serving as a tutorial for deploying web services. It provides 20 API endpoints with read, write, update, and delete operations; hence, the system is a good vehicle for experimentation.

To evaluate the quality of the amplified test code, we combine quantitative and qualitative criteria, such as structural API coverage, readability, and the amount of post-processing required. The detailed evaluation criteria are listed below. All results are available in our reproduction package (footnote 1).

A. Descriptive Statistics

To put the results in context, we count the absolute number of amplified tests. Specifically, we tally the number of generated, successful, failed, and not applicable tests, as well as the number of exposed bugs.

B. Structural API Coverage

We use the tool Restats, written by Corradini et al. [10], to collect the API coverage. The tool allows us to collect coverage metrics for every executed test case, which we combine afterward in tabular form. Specifically, we use the following metrics, a subset of the ones defined by Martin-Lopez et al. [6].

• Path Coverage
• Operation Coverage
• Status Class Coverage
• Status Coverage
• Response Type Coverage
• Request Type Coverage
• Parameter Coverage

C. Amount of Post-Processing

In git-based software engineering environments, code changes are submitted via pull requests. This implies that some human post-processing is needed before the changes are accepted into the code base. We mimic this by manually reviewing the amplified test code and making slight alterations when needed.
The purpose of these alterations is to bring the test suite into an executable form. As a proxy measure for the amount of work this entails, we count the number of lines edited.

D. Readability

Besides the above quantitative evidence, we adopt one qualitative criterion. We assess the readability of the amplified test code using the following questions. The questions are answered by the first author and reviewed by the two other authors.

• Are the tests understandable from the human perspective?
• Do they include appropriate comments?
• Do they comply with the common coding idioms?
• Are they really useful in a way that few or no edits are required for clarity?

IV. COMPARISON

We start from a happy-path test script for one endpoint (/pet/{petId}/uploadImage), which uploads an image for a given pet ID. The test script is shown in Listing 1.

Footnote 1: https://figshare.com/projects/Test_Amplification/217609

    @Test
    public void uploadImageHappyPath() {
        String formData = "../../data/tolgahanimage";
        int petId = 2;
        Response response = post("/pet" + "/" + petId + "/uploadImage", null, null, formData, null, null);
        Assert.assertEquals(response.getStatusCode(), 200);
    }

Listing 1. Happy-Path Test Script

Starting from this baseline, the actual comparison is driven by increasingly strict prompts executed against the three LLMs under investigation: ChatGPT 3.5, ChatGPT 4, and Copilot.

• Prompt 1. Our first prompt is the simplest thing that could possibly work. We provide the happy-path test script and ask the LLM "Can you perform test amplification?" Sometimes, the LLM does not return any test code but instead provides test scenarios in natural language. In those instances, we follow up with a question similar to "Can you write the test code for these scenarios?"
• Prompt 2. For the second prompt, we provide the OpenAPI documentation as an extra input, expecting tests with better coverage.
• Prompt 3. With the third prompt, we ask the LLMs for the maximum number of test cases they can amplify, in addition to providing the happy-path test script and the OpenAPI documentation. We place this request at the end of the prompt, after the happy-path test and the OpenAPI documentation.

A. Descriptive Statistics

Table I shows how many tests were created by the different prompts. As expected, Prompt 3 (maximize the number of tests) generates the most tests. Prompt 2 (add the API documentation) has the most impact on Copilot: 12 tests instead of 3.

TABLE I
DESCRIPTIVE STATISTICS FOR DIFFERENT PROMPTS

Prompt    Statistic                               GPT 3.5   GPT 4   Copilot
Prompt 1  Generated Tests                               7       8         3
          Successful Tests                              5       4         2
          Failed Tests                                  2       4         1
          N/A (Not Acceptable Tests)                    0       0         0
          Bugs Exposed                                  1       4         1
          Post-Processing (no. of lines edited)         3       1         0
Prompt 2  Generated Tests                               5       8        12
          Successful Tests                              5       4        10
          Failed Tests                                  0       2         2
          N/A (Not Acceptable Tests)                    0       2         0
          Bugs Exposed                                  0       2         2
          Post-Processing (no. of lines edited)         0       7        13
Prompt 3  Generated Tests                              15       9        17
          Successful Tests                             14       7        16
          Failed Tests                                  0       2         0
          N/A (Not Acceptable Tests)                    1       0         1
          Bugs Exposed                                  0       2         0
          Post-Processing (no. of lines edited)        52      22        30

On rare occasions, we encounter "Not Acceptable" test cases. These cases exercise deprecated endpoints; therefore, they are neither "successful" nor "failed." The LLMs created these test cases mainly because the endpoints were documented in the OpenAPI specification, even though in practice they have become deprecated in the cloud.

The row "Bugs Exposed" is quite insightful as well. To illustrate what is happening, we use an example test created by ChatGPT 4 in Listing 2. This test exercises the endpoint with incorrect input, forcing the API under test into a special erroneous state and expecting an error status code. When we run the amplified test (Listing 2), we pass an invalid petId, and we do not expect a 200 status code (line 7). However, the JSON output is shown in Listing 3, and lines 2 and 7 show that the status code is 200.
This is an example of a test case that exposes a bug in the API under test. GPT 4 had the most bug-exposing tests (four with Prompt 1 and two each with Prompts 2 and 3). However, GPT 3.5 and Copilot had a few of these as well.

    1 // 3. Test with Invalid Pet ID
    2 @Test
    3 public void uploadImageInvalidPetId() {
    4     String formData = "../../data/tolgahanImage";
    5     int petId = -1; // Assuming -1 is an invalid ID
    6     Response response = post("/pet" + "/" + petId + "/uploadImage", null, null, formData, null, null);
    7     Assert.assertNotEquals(response.getStatusCode(), 200);
    8 }

Listing 2. Amplified Test Script by GPT 4

    1 {
    2   "code": 200,
    3   "type": "unknown",
    4   "message": "additionalMetadata: null\nFile uploaded to ./null, 24 bytes"
    5 }
    6
    7 java.lang.AssertionError: did not expect [200] but found [200]

Listing 3. Bug Example JSON

B. Structural API Coverage

Table II shows the impact of the successive prompts on the coverage metrics and compares them with the baseline. Prompt 1 has no impact on the various coverage metrics, Status and Status Class being the noteworthy exceptions. The simplest prompt generates tests that make the REST API return correct (2XX) and erroneous (4XX) status codes. GPT 4 was smarter than the other two in the sense that it did not increase the coverage but instead exposed bugs, as discussed previously.

The API specification provided in Prompt 2 had a significantly positive impact. Copilot increases the coverage significantly for all coverage metrics. GPT 4 (and to a lesser extent GPT 3.5) sees changes only for Status, Status Class, and Parameter. Nevertheless, this illustrates quite well how the API specification permitted LLMs to create stronger tests.

Prompt 3 delivers the best results regarding coverage. A notable observation is the significant increase in Path Coverage across all three LLMs. This suggests that the additional tests exercise different endpoints.

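The cumulative structure of the three prompts can be sketched as plain string assembly. This is only a sketch under stated assumptions: the source gives the exact wording for the Prompt 1 question only, so the text in `MAXIMIZE_REQUEST`, the ordering of script and question within Prompts 1 and 2, and all names (`PromptSketch`, `prompt1`, `prompt2`, `prompt3`) are hypothetical. The one ordering taken from the text is that the request to maximize the number of tests comes last, after the happy-path test and the OpenAPI documentation.

```java
// Illustrative sketch only; wording and ordering beyond Prompt 1's question are assumptions.
final class PromptSketch {

    // Question quoted in the description of Prompt 1.
    static final String AMPLIFY_QUESTION = "Can you perform test amplification?";

    // Hypothetical wording; the source does not give the exact text.
    static final String MAXIMIZE_REQUEST =
            "Please amplify this test into the maximum number of test cases you can.";

    // Prompt 1: the happy-path test script plus the amplification question.
    static String prompt1(String happyPathTest) {
        return happyPathTest + "\n\n" + AMPLIFY_QUESTION;
    }

    // Prompt 2: additionally provides the OpenAPI documentation as input.
    static String prompt2(String happyPathTest, String openApiDoc) {
        return happyPathTest + "\n\n" + openApiDoc + "\n\n" + AMPLIFY_QUESTION;
    }

    // Prompt 3: the request for as many tests as possible is placed at the end,
    // after the happy-path test and the OpenAPI documentation.
    static String prompt3(String happyPathTest, String openApiDoc) {
        return happyPathTest + "\n\n" + openApiDoc + "\n\n" + MAXIMIZE_REQUEST;
    }

    public static void main(String[] args) {
        String test = "@Test public void uploadImageHappyPath() { /* ... */ }";
        String spec = "{ \"openapi\": \"...\" }"; // placeholder, not a real specification
        System.out.println(prompt3(test, spec));
    }
}
```

Keeping the three variants as separate functions makes it easy to see that each prompt adds exactly one piece of information to the previous one, which is what allows the coverage differences between prompts to be attributed to that addition.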
Thus, combining an example test script (Prompt 1) with the API specification (Prompt 2) and requesting to maximize the number of test cases (Prompt 3) creates tests for the whole API.

TABLE II
API COVERAGE FOR DIFFERENT PROMPTS

Baseline  Coverage         Value
          Path                7%
          Operation           5%
          Status Class        4%
          Status              3%
          Response Type       3%
          Request Type        9%
          Parameter          11%

Prompt    Coverage         GPT 3.5   GPT 4   Copilot
Prompt 1  Path                  7%      7%        7%
          Operation             5%      5%        5%
          Status Class          7%      4%        7%
          Status                8%      3%        5%
          Response Type         3%      3%        3%
          Request Type          9%      9%        9%
          Parameter            11%     11%       11%
Prompt 2  Path                  7%      7%       64%
          Operation             5%      5%       55%
          Status Class          4%      7%       18%
          Status                3%      5%       13%
          Response Type         3%      3%       29%
          Request Type          9%      9%       45%
          Parameter            11%     22%       33%
Prompt 3  Path                 71%     43%       93%
          Operation            80%     35%       85%
          Status Class         26%     11%       26%
          Status               19%      8%       19%
          Response Type        42%     18%       45%
          Request Type         64%     45%       73%
          Parameter            33%     33%       78%

C. Post-processing

Looking at the Post-Processing rows in Table I, we observe that when only a small amount of test code is generated, minimal post-processing is necessary. However, more editing is required as the number of tests increases (with Prompts 2 and 3).

The increased number of edited code lines is not simply due to the increased number of tests. Since we only provide one test case for one endpoint, the LLMs generate relatively short test cases. Still, they generate tests for other endpoints, especially with Prompt 3. Consequently, they make minor errors that need human attention while producing additional test cases.

For GPT 4 and Copilot, the number of edited lines increases steadily with the more advanced prompts. For GPT 3.5, however, Table I shows a marked increase at Prompt 3 compared to the previous prompts: it requires about 2.4 times the effort of GPT 4 (52 versus 22 edited lines) and about 1.7 times the effort of Copilot (52 versus 30 edited lines).

D. Readability

One of the biggest advantages of using LLMs is that they are built to generate meaningful text.
For all the LLMs (GPT 3.5, GPT 4, and GitHub Copilot), the test method names and the variable names in the produced test code are excellent. In addition, the code lines are clear, and the comments are understandable. The LLMs follow coding idioms and generate meaningful method names such as "deleteOrderInvalidId".

V. FUTURE WORK

The following points are potential explorations for the future.

• Our investigation focused specifically on REST APIs, whereas previous work focused on unit testing. Given the promising results, we should consider other testing contexts, such as web UI and mobile application testing.
• Our prompt strategies were designed to maximize the strength of the test suite via structural coverage metrics. However, other goals might be expressed (e.g., maximizing bug exposure), in which case different prompt strategies would be appropriate.
• Validating these findings through industry partner case studies would be highly beneficial. This real-world aspect would uncover insights into the strengths and weaknesses of this approach.
• We seeded the LLM with one happy-path test case for a single API endpoint. In reality, a test suite for a REST API will consist of several tests, possibly end-to-end tests representing user stories. It remains to be seen whether LLMs can amplify a complete test suite.
• LLMs are rapidly developing; we may soon expect more specialized versions trained for writing test code. One possible approach would be using RAG (Retrieval Augmented Generation) models for better results. Replicating our work with specialized LLMs and using more advanced prompting techniques may reveal interesting insights.

VI. THREATS TO THE VALIDITY

As with all empirical research, some factors may jeopardize the validity of our results. Below, we list those factors and our actions to reduce or alleviate the risks.

1) The outputs of LLMs may vary over time, even when given the same prompt.
This is a known problem with the use of LLMs. In this case, we do not expect the variation to affect the number of test cases generated or the Path or Status Class coverage. For the other criteria, we only expect a minor impact.
2) As we use an open-source cloud application, which has been used in other test generation experiments, the LLMs may have already seen valid test cases. We intend to replicate the investigation with other REST APIs to verify whether this is an issue.
3) The calculation of API coverage was a semi-automatic process; hence, human error is possible. To minimize that risk, we conducted a pilot study using a simple REST API (with only read access at a single entry point).

From a practical standpoint, there are validity issues when considering real-world applications:

1) Due to privacy concerns, some company policies may prohibit using LLMs in software development. Explicitly submitting the OpenAPI documentation to an LLM indeed induces a security risk. We warn prospective users about these potential policy violations and argue that they should check beforehand.
2) There are many REST APIs, and style and complexity may vary greatly. We have attempted to select an appropriate cloud application with all REST API operations, but other cloud applications may yield different results. Investigating other REST APIs should minimize that risk.

VII. CONCLUSION

This comparison illustrates how test amplification combined with LLMs could strengthen REST API test suites. When asked to amplify an existing happy-path test, all LLMs created extra API tests. The extra API tests increased coverage and resulted in readable test cases that require little post-processing to be accepted as pull requests. Some extra tests even exposed bugs, illustrating that such tests can find the proverbial "needle in a haystack."
The provided prompt significantly impacts coverage (see Table II). However, when we exercise only one endpoint and create amplified tests (Prompt 1), models tend to provide detailed coverage of boundary values and test cases. Providing the appropriate additional information in the prompt, such as the OpenAPI specification (Prompt 2) or a request for the maximum number of tests (Prompt 3), significantly improves the results for GPT 4 and Copilot. When LLMs have more information, they not only improve the coverage but, more importantly, also exercise other endpoints.

VIII. ACKNOWLEDGMENTS

This work is supported by the Research Foundation Flanders (FWO) via the BaseCamp Zero Project under Grant number S000323N.

REFERENCES

[1] M. Kim, T. Stennett, D. Shah, S. Sinha, and A. Orso, "Leveraging large language models to improve REST API testing," in Proceedings ICSE-NIER'24 (2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results). New York, NY, USA: Association for Computing Machinery, 2024, pp. 37–41.
[2] J. C. Alonso, "Automated generation of realistic test inputs for web APIs," in Proceedings ESEC/FSE 2021 (29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering). New York, NY, USA: Association for Computing Machinery, 2021, pp. 1666–1668.
[3] E. Schoofs, M. Abdi, and S. Demeyer, "AmPyfier: Test amplification in Python," Journal of Software: Evolution and Process, vol. 34, no. 11, p. e2490, 2022.
[4] Swagger. Petstore application. [Online]. Available: https://petstore.swagger.io
[5] A. Golmohammadi, M. Zhang, and A. Arcuri, "Testing RESTful APIs: A survey," ACM Trans. Softw. Eng. Methodol., vol. 33, no. 1, Nov. 2023. [Online]. Available: https://doi.org/10.1145/3617175
[6] A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, "Test coverage criteria for RESTful web APIs," in Proceedings A-TEST 2019 (10th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation). New York, NY, USA: Association for Computing Machinery, 2019, pp. 15–21.
[7] B. Danglot, O. Vera-Perez, Z. Yu, A. Zaidman, M. Monperrus, and B. Baudry, "A snowballing literature study on test amplification," Journal of Systems and Software, vol. 157, p. 110398, 2019.
[8] N. Alshahwan, J. Chheda, A. Finogenova, B. Gokkaya, M. Harman, I. Harper, A. Marginean, S. Sengupta, and E. Wang, "Automated unit test improvement using large language models at Meta," in Companion Proceedings FSE 2024 (ACM International Conference on the Foundations of Software Engineering). New York, NY, USA: Association for Computing Machinery, 2024, pp. 185–196.
[9] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, "An empirical evaluation of using large language models for automated unit test generation," IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2024.
[10] D. Corradini, A. Zampieri, M. Pasqua, and M. Ceccato, "Restats: A test coverage tool for RESTful APIs," arXiv, 2021. [Online]. Available: https://arxiv.org/abs/2108.08209

---