arXiv:2503.10306v1 [cs.SE] 13 Mar 2025
Test Amplification for REST APIs
Using “Out-of-the-box” Large Language Models
Tolgahan Bardakci∗, Serge Demeyer†, Mutlu Beyazıt†
∗Universiteit Antwerpen
†Universiteit Antwerpen and Flanders Make
Abstract—REST APIs are an indispensable building block in today’s cloud-native applications, so testing them is critically important. However, writing automated tests for such REST APIs is challenging because one needs strong and readable tests that exercise the boundary values of the protocol embedded in the REST API. In this paper, we report our experience with using “out of the box” large language models (ChatGPT and GitHub’s Copilot) to amplify REST API test suites. We compare the resulting tests based on coverage and understandability, and we derive a series of guidelines and lessons learned concerning the prompts that result in the strongest test suite.
Index Terms—REST APIs; Software Testing; Test Amplification; Artificial Intelligence; Large Language Models; Prompt Engineering
I. INTRODUCTION
The API economy is an expanding trend in modern society, enabling companies and organizations to share data and functionality with other businesses, developers, and customers. Through Application Programming Interfaces (APIs), software engineers can develop reliable applications by seamlessly integrating various components. REST APIs, in particular, are the dominant architectural style. Their stateless nature allows for increased scalability (e.g., auto-scaling). Given the distributed nature of REST APIs, ensuring high quality is crucial. Therefore, these APIs must be tested extensively.
However, testing at the API level is inherently complex. First, there is the technical complexity induced by the sheer number of possible combinations between different protocols and an even greater number of combinations of API calls. Second, different engineering teams develop different components, adding organizational complexity. Testing boundary values remains crucial, as these will ultimately reveal the underlying defects (a.k.a. “the needle in the haystack”).
Test amplification is a promising approach to finding the needle in the haystack, as substantial evidence supports its effectiveness in the context of unit tests [3]. Indeed, test amplifiers automatically transform an existing, manually written test suite into a more comprehensive one with stronger coverage. An amplified test suite exercises a broader range of conditions, including boundary test values that reveal defects.
Unfortunately, the readability of the amplified tests poses a challenge. The current generation of test amplification tools uses generic names for temporary variables (such as t1, t2, t3, ...). Also, the injected code sometimes deviates from accepted coding conventions, which hinders the readability and, ultimately, the understandability of the test cases. Using large language models for test amplification can be beneficial in addressing these issues. Since they have seen numerous test code examples, they will likely generate meaningful names and use proper coding idioms. However, prompt engineering is needed to optimize the test amplification process.
This paper reports on using “out-of-the-box” large language models to amplify REST API tests. We adopt ChatGPT 3.5, ChatGPT 4, and Copilot version 1.5.3.5510. We validate the results against a well-known representative cloud application called PetStore [4], an open-source system with multiple API endpoints providing read, write, update, and delete actions. We compare the results based on coverage and understandability and derive guidelines and lessons learned concerning the prompts that generate strong tests.
II. RELATED WORK
Testing REST APIs is crucial for ensuring the reliability
and functionality of web services. The state of the art in
this domain includes a variety of techniques and tools aimed
at automating and simplifying the testing process. One key
approach involves functional testing, which verifies that the
API behaves as expected under various conditions. Modern
tools such as Postman, SoapUI, and RestAssured offer robust
frameworks for creating and executing API tests [5]. These
automated testing frameworks have also been enhanced by
continuous integration and continuous deployment (CI/CD)
pipelines, allowing for more frequent and reliable testing
cycles.
To evaluate the strength of an API test suite, a series of
API coverage metrics have been proposed by Martin-Lopez et
al. [6]. These coverage metrics are defined based on elements
of the OpenAPI documentation. They are quantified by the
ratio of the number of elements observed via HTTP requests
or responses to the number of elements in the documentation.
The metrics derived from observed HTTP messages include:
• Path Coverage: The ratio of tested paths to the total documented paths.
• Operation Coverage: The ratio of tested operations to the total documented operations.
• Parameter Coverage: The ratio of tested input parameters to the total documented parameters.
• Request Content-type Coverage: The ratio of tested content-types to the total accepted content-types. This excludes wildcard types (e.g., application/*).
• Status Code Class Coverage: Achieved when both correct (2XX) and erroneous (4XX, 5XX) status codes are triggered.
SUBMITTED TO IEEE Software – Special Issue on Next-generation Software Testing
• Status Code Coverage: The ratio of obtained status codes to the total documented status codes.
• Response Content-type Coverage: The ratio of obtained content-types to the total documented response content-types, also excluding wildcard types.
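The ratio-based metrics above share one shape: the number of observed elements divided by the number of documented elements. The following Java fragment is a minimal sketch of that calculation, not the implementation used by Restats or by Martin-Lopez et al. [6]. The class name, the method names, and the four example operations are assumptions chosen for illustration only.

```java
import java.util.Set;

/** Illustrative helper with hypothetical names; not part of Restats. */
class ApiCoverageSketch {

    /** Fraction of documented elements that were observed, in the range [0, 1]. */
    static double coverage(Set<String> observed, Set<String> documented) {
        if (documented.isEmpty()) {
            return 0.0; // nothing documented, so there is nothing to cover
        }
        long hits = observed.stream().filter(documented::contains).count();
        return (double) hits / documented.size();
    }

    /** Status code class coverage: a 2XX and a 4XX/5XX code were both observed. */
    static boolean statusClassCovered(Set<Integer> observedCodes) {
        boolean correct = observedCodes.stream().anyMatch(c -> c / 100 == 2);
        boolean erroneous = observedCodes.stream().anyMatch(c -> c / 100 == 4 || c / 100 == 5);
        return correct && erroneous;
    }

    public static void main(String[] args) {
        // A small illustrative subset of operations, not the full PetStore specification.
        Set<String> documented = Set.of(
                "GET /pet/{petId}", "POST /pet",
                "POST /pet/{petId}/uploadImage", "DELETE /pet/{petId}");
        Set<String> observed = Set.of("POST /pet/{petId}/uploadImage");

        System.out.println("operation coverage = " + coverage(observed, documented));
        System.out.println("status class covered = " + statusClassCovered(Set.of(200, 404)));
    }
}
```

Observed elements that are absent from the documentation are ignored here; whether a real coverage tool treats them the same way is not stated in the text.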
Test amplification is an umbrella term for various activities
that analyze and operate on existing test suites, including
augmentation, optimization, enrichment, and refactoring [7].
Test amplification differs from test generation, as it creates new test cases based on existing ones instead of building them from scratch. This provides a significant advantage as the amplified test code will better comply with the test architecture.
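To make the idea of deriving new tests from an existing one concrete, the sketch below takes one integer input of a seed test and proposes boundary candidates around it. This is only an illustration of input amplification under simple assumptions; it is not how the LLMs in this study or existing amplification tools operate, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative input amplification for a single integer parameter (hypothetical names). */
class BoundaryValueSketch {

    /** Returns the seed value plus common integer boundary candidates, without duplicates. */
    static Set<Integer> amplifyIntInput(int seed) {
        Set<Integer> candidates = new LinkedHashSet<>();
        candidates.add(seed);              // keep the original happy-path value
        candidates.add(-1);                // negative identifier
        candidates.add(0);                 // zero
        candidates.add(Integer.MAX_VALUE); // largest 32-bit value
        candidates.add(Integer.MIN_VALUE); // smallest 32-bit value
        return candidates;
    }

    public static void main(String[] args) {
        // Each candidate would be fed into a copy of the seed test.
        System.out.println(amplifyIntInput(2));
    }
}
```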
Artificial intelligence (AI) techniques, specifically large language models (LLMs), have recently garnered a lot of attention, and numerous software engineering techniques are incorporating such AI models in various ways. One such application is using LLMs for unit test amplification. There are at least two recent reports concerning industrial adoption of LLMs for unit test amplification at Meta [8] and GitHub [9]. Other AI applications to software testing concern web API testing by means of generating realistic test inputs [1], [2], enhancing API specifications [1], and generating valid API calls [2]. The authors highlight the power of LLMs in API testing.
However, to the best of our knowledge, no reports exist that combine test amplification with LLMs to create stronger REST API tests.
Recently, AI tools to generate strong and readable test cases have been gaining attention. Combining test amplification with large language models is particularly appealing because amplified test code will better comply with the test architecture. However, amplifying REST API tests remains uncharted territory.
III. COMPARISON SET-UP
Since amplifying REST API tests remains an unexplored area, we set out to investigate how large language models can strengthen a REST API test suite. We restrict ourselves to “out-of-the-box” models to obtain a minimum viable baseline; future extensions (in particular RAG pipelines) could probably go further. We compare ChatGPT 3.5, ChatGPT 4, and Copilot version 1.5.3.5510 with varying prompts to derive guidelines and lessons learned. Our comparison is driven by an overarching research question.
• How can we use large language models to amplify test code for REST APIs?
We validate the results against a well-known cloud application called PetStore [4]. PetStore is an open-source web application serving as a tutorial for deploying web services. It provides 20 API endpoints with read, write, update, and delete operations; hence, the system is a good vehicle for experimentation.
To evaluate the quality of the amplified test code, we combine quantitative and qualitative criteria, such as structural API coverage, readability, and the amount of post-processing required. The detailed evaluation criteria are listed below. All results are available in our reproduction package¹.
A. Descriptive Statistics
To put the results in context, we count the absolute number of amplified tests. Specifically, we tally the number of generated, successful, failed, and not applicable tests, as well as the number of exposed bugs.
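The tally described here can be sketched as a simple count per outcome. The fragment below is a minimal illustration with hypothetical names, assuming every generated test falls into exactly one of the three outcome categories. Exposed bugs are counted separately, because a failed test does not necessarily expose a bug (in Table I, GPT 3.5 with Prompt 1 has two failed tests but one exposed bug).

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative tally of amplified test outcomes (hypothetical names). */
class TestTallySketch {

    enum Outcome { SUCCESSFUL, FAILED, NOT_APPLICABLE }

    /** Counts how often each outcome occurs; outcomes that never occur map to 0. */
    static Map<Outcome, Integer> tally(List<Outcome> outcomes) {
        Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);
        for (Outcome o : Outcome.values()) {
            counts.put(o, 0);
        }
        for (Outcome o : outcomes) {
            counts.merge(o, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Outcome> run = List.of(
                Outcome.SUCCESSFUL, Outcome.SUCCESSFUL, Outcome.FAILED, Outcome.NOT_APPLICABLE);
        Map<Outcome, Integer> counts = tally(run);
        // The number of generated tests is the sum over all outcome categories.
        int generated = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(counts + " generated=" + generated);
    }
}
```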
B. Structural API Coverage
We use the tool Restats, written by Corradini et al. [10], for collecting the API coverage. The tool allows us to collect coverage metrics of every executed test case, which we combine afterward in tabular form. Specifically, we use the following metrics, a subset of the ones defined by Martin-Lopez et al. [6].
• Path Coverage
• Operation Coverage
• Status Class Coverage
• Status Coverage
• Response Type Coverage
• Request Type Coverage
• Parameter Coverage
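One plausible way to combine per-test coverage into a suite-level figure is to take the union of the elements covered by each test and express it as a percentage of the documented elements. The sketch below illustrates this under two assumptions that are not confirmed by the text: that the combination is a set union, and that each of PetStore's 20 endpoints counts as one documented operation. Under the second assumption, one covered operation yields 5%, which agrees with the Operation baseline reported in Table II. All names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative merge of per-test coverage into one suite-level figure (hypothetical names). */
class CoverageMergeSketch {

    /** Union of the elements covered by the individual test cases. */
    static Set<String> mergeCovered(List<Set<String>> coveredPerTest) {
        Set<String> merged = new LinkedHashSet<>();
        for (Set<String> covered : coveredPerTest) {
            merged.addAll(covered);
        }
        return merged;
    }

    /** Coverage as a rounded percentage of the documented elements. */
    static long percent(int covered, int documented) {
        if (documented <= 0) {
            throw new IllegalArgumentException("documented must be positive");
        }
        return Math.round(100.0 * covered / documented);
    }

    public static void main(String[] args) {
        List<Set<String>> perTest = List.of(
                Set.of("POST /pet/{petId}/uploadImage"),
                Set.of("POST /pet/{petId}/uploadImage", "GET /pet/{petId}"));
        Set<String> merged = mergeCovered(perTest);
        // 20 documented operations is an assumption derived from the 20 endpoints.
        System.out.println(merged.size() + " operations covered, "
                + percent(merged.size(), 20) + "% of 20");
    }
}
```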
C. Amount of Post-Processing
In git-based software engineering environments, code
changes are submitted via pull requests. This implies that
some human post-processing is needed before the changes
are accepted into the code base. We mimic this by manually
reviewing the amplified test code and making slight alterations
when needed. The purpose of these alterations is to bring the
test suite into an executable form. As a proxy measure for
the amount of work this entails, we count the number of lines
edited.
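The text does not specify how edited lines are counted. As one possible reading, the sketch below counts the lines of the reviewed version that are not part of the longest common subsequence with the generated version, so changed and added lines count while deleted lines do not. Both this definition and all names are assumptions.

```java
import java.util.List;

/** Illustrative count of edited lines between generated and reviewed test code. */
class EditedLinesSketch {

    /** Length of the longest common subsequence of two line lists. */
    static int lcs(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Lines of the reviewed version that were changed or added (assumed definition). */
    static int editedLines(List<String> generated, List<String> reviewed) {
        return reviewed.size() - lcs(generated, reviewed);
    }

    public static void main(String[] args) {
        List<String> generated = List.of(
                "int petId = -1;",
                "Response response = post(path);",
                "Assert.assertNotEquals(response.getStatusCode(), 200);");
        List<String> reviewed = List.of(
                "int petId = -1;",
                "Response response = post(path, null, null, formData, null, null);",
                "Assert.assertNotEquals(response.getStatusCode(), 200);");
        System.out.println("edited lines: " + editedLines(generated, reviewed));
    }
}
```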
D. Readability
Besides the above quantitative evidence, we adopt one
qualitative criterion. We assess the readability of the amplified
test code using the following questions. The questions are
answered by the first author and reviewed by the two other
authors.
• Are the tests understandable from the human perspective?
• Do they include appropriate comments?
• Do they comply with the common coding idioms?
• Are they really useful in a way that few or no edits are required for clarity?
IV. COMPARISON
We start from a happy-path test script for one endpoint (/pet/{petId}/uploadImage), which uploads an image for a given pet ID. The test script is shown in Listing 1.
¹ https://figshare.com/projects/Test_Amplification/217609
    @Test
    public void uploadImageHappyPath() {
        String formData = "../../data/tolgahanimage";
        int petId = 2;
        Response response = post("/pet" + "/" + petId + "/uploadImage", null, null, formData, null, null);
        Assert.assertEquals(response.getStatusCode(), 200);
    }
Listing 1. Happy-Path Test Script
Starting from this baseline, the actual comparison is driven by increasingly strict prompts executed against the three LLMs under investigation: ChatGPT 3.5, ChatGPT 4, and Copilot.
• Prompt 1. Our first prompt is the simplest thing that could possibly work. We provide the happy-path test script and ask the LLM “Can you perform test amplification?” Sometimes, the LLM does not return any test code but instead provides test scenarios in natural language. In those instances, we follow up with a question similar to “Can you write the test code for these scenarios?”
• Prompt 2. For the second prompt, we provide the OpenAPI documentation as an extra input, expecting tests with better coverage.
• Prompt 3. With the third prompt, we additionally ask the LLMs for the maximum number of test cases they can amplify. We place this request at the end of the prompt, after the happy-path test and the OpenAPI documentation.
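The three prompts build on each other, which can be sketched as simple string assembly. Only the question “Can you perform test amplification?” is quoted in the text. The wording of the Prompt 3 request, the separators, and the order of the test script relative to the question are assumptions, as are all class and method names.

```java
/** Illustrative assembly of the three prompts (hypothetical names and partly assumed wording). */
class PromptSketch {

    // Wording quoted in the paper.
    static final String AMPLIFY_QUESTION = "Can you perform test amplification?";
    // Assumed wording: the paper does not quote the exact Prompt 3 request.
    static final String MAXIMIZE_REQUEST = "Please generate the maximum number of test cases you can.";

    /** Prompt 1: the happy-path test followed by the amplification question. */
    static String prompt1(String happyPathTest) {
        return happyPathTest + "\n\n" + AMPLIFY_QUESTION;
    }

    /** Prompt 2: additionally supplies the OpenAPI documentation. */
    static String prompt2(String happyPathTest, String openApiDoc) {
        return happyPathTest + "\n\n" + openApiDoc + "\n\n" + AMPLIFY_QUESTION;
    }

    /** Prompt 3: appends the request for the maximum number of tests at the very end. */
    static String prompt3(String happyPathTest, String openApiDoc) {
        return prompt2(happyPathTest, openApiDoc) + "\n" + MAXIMIZE_REQUEST;
    }

    public static void main(String[] args) {
        String test = "@Test public void uploadImageHappyPath() { /* ... */ }";
        String doc = "{ \"openapi\": \"...\" }"; // placeholder, not a real specification
        System.out.println(prompt3(test, doc));
    }
}
```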
A. Descriptive Statistics
Table I shows how many tests were created by the different prompts. As expected, Prompt 3 (maximize the number of tests) generates the most tests. Prompt 2 (add the API documentation) has the most impact on Copilot: 12 tests instead of 3.
TABLE I
DESCRIPTIVE STATISTICS FOR DIFFERENT PROMPTS

Prompt    Statistic                                GPT 3.5  GPT 4  Copilot
Prompt 1  Generated Tests                                7      8        3
          Successful Tests                               5      4        2
          Failed Tests                                   2      4        1
          N/A (Not Acceptable Tests)                     0      0        0
          Bugs Exposed                                   1      4        1
          Post-Processing (No. of lines edited)          3      1        0
Prompt 2  Generated Tests                                5      8       12
          Successful Tests                               5      4       10
          Failed Tests                                   0      2        2
          N/A (Not Acceptable Tests)                     0      2        0
          Bugs Exposed                                   0      2        2
          Post-Processing (No. of lines edited)          0      7       13
Prompt 3  Generated Tests                               15      9       17
          Successful Tests                              14      7       16
          Failed Tests                                   0      2        0
          N/A (Not Acceptable Tests)                     1      0        1
          Bugs Exposed                                   0      2        0
          Post-Processing (No. of lines edited)         52     22       30

On rare occasions, we encounter "Not Acceptable" test cases. These cases exercise deprecated endpoints; therefore, they are neither "successful" nor "failed." LLMs created these test cases mainly because the endpoints were documented in the OpenAPI specification. In practice, however, these endpoints have become deprecated in the cloud.
The row “Bugs Exposed” is quite insightful as well. To illustrate what is happening, we use an example test created by ChatGPT 4 in Listing 2. This test exercises the endpoint with incorrect input, forcing the API under test into an erroneous state and expecting an error status code. When we run the amplified test (Listing 2), we supply an invalid petId, and we do not expect a 200 status code (line 7). However, the output shown in Listing 3 reports status code 200 (line 2), so the assertion fails (line 7). This is an example of a test case that exposes a bug in the API under test.
GPT 4 had the most bug-exposing tests (four in Prompt 1 and two each in Prompts 2 and 3). However, GPT 3.5 and Copilot had a few of these as well.
    1  // 3. Test with Invalid Pet ID
    2  @Test
    3  public void uploadImageInvalidPetId() {
    4      String formData = "../../data/tolgahanImage";
    5      int petId = -1; // Assuming -1 is an invalid ID
    6      Response response = post("/pet" + "/" + petId + "/uploadImage", null, null, formData, null, null);
    7      Assert.assertNotEquals(response.getStatusCode(), 200);
    8  }
Listing 2. Amplified Test Script by GPT 4
    1  {
    2    "code": 200,
    3    "type": "unknown",
    4    "message": "additionalMetadata: null\nFile uploaded to ./null, 24 bytes"
    5  }
    6
    7  java.lang.AssertionError: did not expect [200] but found [200]
Listing 3. Bug Example JSON
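The assertion in Listing 2 only requires that the status code differs from 200. A stricter variant could require a client-error class instead. The sketch below shows such a check as a plain function, independent of any HTTP library. That an invalid petId should yield a 4XX response is an assumption about the intended API behavior, and the class and method names are hypothetical.

```java
/** Illustrative status-class check for negative tests (hypothetical names). */
class StatusClassSketch {

    /** Maps an HTTP status code to its class label, for example 404 to "4XX". */
    static String statusClass(int code) {
        if (code < 100 || code > 599) {
            throw new IllegalArgumentException("not an HTTP status code: " + code);
        }
        return (code / 100) + "XX";
    }

    /** True when the response to an invalid request signals a client error (assumed expectation). */
    static boolean rejectsInvalidInput(int code) {
        return statusClass(code).equals("4XX");
    }

    public static void main(String[] args) {
        int observed = 200; // the code reported in Listing 3 for petId = -1
        System.out.println("status class: " + statusClass(observed));
        System.out.println("invalid input rejected: " + rejectsInvalidInput(observed));
    }
}
```

With the response from Listing 3, this check reports that the invalid input was not rejected, which matches the bug the amplified test exposes.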
B. Structural API coverage
Table II shows the respective impact on the coverage metrics for the successive prompts and compares them with the baseline. Prompt 1 has no impact on the various coverage metrics, Status and Status Class being the noteworthy exceptions. The simplest prompt generates tests that make the REST API return correct (2XX) and erroneous (4XX) status codes. GPT 4 was smarter than the other two in the sense that it did not increase the coverage but instead exposed bugs, as discussed previously.
The API specification provided in Prompt 2 had a significantly positive impact. Copilot increases the coverage significantly for all coverage metrics. GPT 4 (and to a lesser extent GPT 3.5) sees only changes for Status, Status Class, and Parameter. But this illustrates quite well how the API specification permitted LLMs to create stronger tests.
Prompt 3 delivers the best results regarding coverage. A
notable observation is the significant increase in Path Cover-
age across all three LLMs. This suggests that additional tests
exercise different endpoints. Thus, combining an example test
script (Prompt 1) with the API specification (Prompt 2) and
requesting to maximize the number of test cases (Prompt 3)
creates tests for the whole API.
TABLE II
API COVERAGE FOR DIFFERENT PROMPTS

Baseline  Metric          Coverage
          Path                  7%
          Operation             5%
          Status Class          4%
          Status                3%
          Response Type         3%
          Request Type          9%
          Parameter            11%

Prompt    Metric          GPT 3.5  GPT 4  Copilot
Prompt 1  Path                 7%     7%       7%
          Operation            5%     5%       5%
          Status Class         7%     4%       7%
          Status               8%     3%       5%
          Response Type        3%     3%       3%
          Request Type         9%     9%       9%
          Parameter           11%    11%      11%
Prompt 2  Path                 7%     7%      64%
          Operation            5%     5%      55%
          Status Class         4%     7%      18%
          Status               3%     5%      13%
          Response Type        3%     3%      29%
          Request Type         9%     9%      45%
          Parameter           11%    22%      33%
Prompt 3  Path                71%    43%      93%
          Operation           80%    35%      85%
          Status Class        26%    11%      26%
          Status              19%     8%      19%
          Response Type       42%    18%      45%
          Request Type        64%    45%      73%
          Parameter           33%    33%      78%
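To read Table II as improvements rather than absolute values, each entry can be compared with the baseline in percentage points. The sketch below does this for the Path Coverage row of Prompt 3, using the figures from Table II. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative computation of coverage gains over the baseline (hypothetical names). */
class CoverageGainSketch {

    /** Percentage-point gain of each model's coverage over the baseline. */
    static Map<String, Integer> gainOverBaseline(int baselinePercent, Map<String, Integer> percentByModel) {
        Map<String, Integer> gains = new LinkedHashMap<>();
        percentByModel.forEach((model, percent) -> gains.put(model, percent - baselinePercent));
        return gains;
    }

    public static void main(String[] args) {
        // Path Coverage for Prompt 3, taken from Table II; the baseline is 7%.
        Map<String, Integer> pathPrompt3 = new LinkedHashMap<>();
        pathPrompt3.put("GPT 3.5", 71);
        pathPrompt3.put("GPT 4", 43);
        pathPrompt3.put("Copilot", 93);
        System.out.println(gainOverBaseline(7, pathPrompt3));
        // prints {GPT 3.5=64, GPT 4=36, Copilot=86}
    }
}
```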
C. Post-processing
Looking at the Post-Processing rows in Table I, we observe that when only a small amount of test code is generated, minimal post-processing is necessary. However, more editing is required as the number of tests increases (with Prompts 2 and 3). The increased number of edited code lines is not simply due to the increased number of tests. Since we only provide one test case for one endpoint, the LLMs generate relatively short test cases. Still, they generate tests for other endpoints, especially with Prompt 3. Consequently, they make minor errors that need human attention while producing additional test cases.
For GPT 4 and Copilot, the number of edited lines increases steadily with the more advanced prompts. For GPT 3.5, however, Prompt 3 in Table I shows a sharp increase compared to the previous prompts: with 52 edited lines, it requires about 2.4 times the effort of GPT 4 (22 lines) and 1.7 times the effort of Copilot (30 lines).
D. Readability
One of the biggest advantages of using LLMs is that they are built to generate meaningful text. For all three LLMs (GPT 3.5, GPT 4, and GitHub Copilot), the test method names and variable names in the produced test code are excellent. In addition, the code lines are clear, and the comments are understandable. The LLMs follow coding idioms and generate meaningful method names such as "deleteOrderInvalidId".
V. FUTURE WORK
The following points are potential explorations for the
future.
• Our investigation focused specifically on REST APIs, whereas previous work focused on unit testing. Given the promising results, we should consider other testing contexts, such as web UI and mobile application testing.
• Our prompt strategies were designed to maximize the strength of the test suite via structural coverage metrics. However, other goals might be expressed (e.g., maximizing bug exposure), in which case different prompt strategies would be appropriate.
• Validating these findings through industry partner case studies would be highly beneficial. This real-world aspect would uncover insights into the strengths and weaknesses of this approach.
• We seeded the LLM with one happy-day test case for a single API endpoint. In reality, a test suite for a REST API will consist of several tests, possibly end-to-end tests representing user stories. It remains to be seen whether LLMs can amplify a complete test suite.
• LLMs are rapidly developing; we may soon expect more specialized versions trained for writing test code. One possible approach would be using RAG (Retrieval-Augmented Generation) for better results. Replicating our work with specialized LLMs and using more advanced prompting techniques may reveal interesting insights.
VI. THREATS TO THE VALIDITY
As with all empirical research, some factors may jeopardize
the validity of our results. Below, we list those factors and our
actions to reduce or alleviate the risks.
1) The outputs of LLMs may vary over time, even when given the same prompt. This is a known problem with the use of LLMs. In this case, we do not expect the variation to affect the number of test cases generated or the Path or Status Class coverage. For the other criteria, we only expect a minor impact.
2) As we use an open-source cloud application, which has
been used in other test generation experiments, the LLMs
may have already seen valid test cases. We intend to
replicate the investigation with other REST APIs to verify
whether this is an issue.
3) The calculation of API coverage was a semi-automatic process; hence, human error is possible. To minimize that risk, we conducted a pilot study using a simple REST API (with only read access at a single entry point).
From a practical standpoint, there are validity issues when
considering real-world applications:
1) Due to privacy concerns, some company policies may
prohibit using LLMs in software development. Explicitly
submitting the OpenAPI documentation to an LLM in-
deed induces a security risk. We warn prospective users
about these potential policy violations and argue that they
should check beforehand.
2) There are many REST APIs, and style and complexity
may vary greatly. We have attempted to select an appro-
priate cloud application with all REST API operations,
but other cloud applications may yield different results.
Investigating other REST APIs should minimize that risk.
VII. CONCLUSION
This comparison illustrates how test amplification combined with LLMs could strengthen REST API test suites. When asked to amplify an existing happy-day test, all LLMs created extra API tests. These extra tests increased coverage and resulted in readable test cases that require little post-processing to be accepted as pull requests. Some extra tests even exposed bugs, illustrating that such tests can find the proverbial "needle in a haystack." The provided prompt significantly impacts coverage (see Table II). However, when we exercise only one endpoint and create amplified tests (Prompt 1), models tend to provide detailed coverage of boundary values and test cases. Providing the appropriate additional information in the prompt, such as the OpenAPI specification (Prompt 2) or a request for the maximum number of tests (Prompt 3), significantly improves the results for GPT 4 and Copilot. When LLMs have more information, they not only improve the coverage but, more importantly, also exercise other endpoints.
VIII. ACKNOWLEDGMENTS
This work is supported by the Research Foundation Flanders (FWO) via the BaseCamp Zero Project under Grant number S000323N.
REFERENCES
[1] M. Kim, T. Stennett, D. Shah, S. Sinha, and A. Orso, “Leveraging large language models to improve REST API testing,” in Proceedings ICSE-NIER’24 (2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results). New York, NY, USA: Association for Computing Machinery, 2024, pp. 37–41.
[2] J. C. Alonso, “Automated generation of realistic test inputs for web APIs,” in Proceedings ESEC/FSE 2021 (29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering). New York, NY, USA: Association for Computing Machinery, 2021, pp. 1666–1668.
[3] E. Schoofs, M. Abdi, and S. Demeyer, “AmPyfier: Test amplification in Python,” Journal of Software: Evolution and Process, vol. 34, no. 11, p. e2490, 2022.
[4] Swagger. Petstore application. [Online]. Available: https://petstore.swagger.io
[5] A. Golmohammadi, M. Zhang, and A. Arcuri, “Testing RESTful APIs: A survey,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 1, Nov. 2023. [Online]. Available: https://doi.org/10.1145/3617175
[6] A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, “Test coverage criteria for RESTful web APIs,” in Proceedings A-TEST 2019 (10th ACM SIGSOFT International Workshop on Automating TEST Case Design, Selection, and Evaluation). New York, NY, USA: Association for Computing Machinery, 2019, pp. 15–21.
[7] B. Danglot, O. Vera-Perez, Z. Yu, A. Zaidman, M. Monperrus, and B. Baudry, “A snowballing literature study on test amplification,” Journal of Systems and Software, vol. 157, p. 110398, 2019.
[8] N. Alshahwan, J. Chheda, A. Finogenova, B. Gokkaya, M. Harman, I. Harper, A. Marginean, S. Sengupta, and E. Wang, “Automated unit test improvement using large language models at Meta,” in Companion Proceedings FSE 2024 (ACM International Conference on the Foundations of Software Engineering). New York, NY, USA: Association for Computing Machinery, 2024, pp. 185–196.
[9] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2024.
[10] D. Corradini, A. Zampieri, M. Pasqua, and M. Ceccato, “Restats: A test coverage tool for RESTful APIs,” arXiv, 2021. [Online]. Available: https://arxiv.org/abs/2108.08209