Automating Code Review: A Systematic Literature Review
ROSALIA TUFANO, SEART @ Software Institute - Università della Svizzera italiana, Switzerland
GABRIELE BAVOTA, SEART @ Software Institute - Università della Svizzera italiana, Switzerland
Code review consists of assessing the code written by teammates with the goal of increasing code quality. Empirical studies documented
the benefits brought by such a practice that, however, has its cost to pay in terms of developers’ time. For this reason, researchers
have proposed techniques and tools to automate code review tasks such as reviewer selection (i.e., identifying suitable reviewers
for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human
reviewer would do). Given the substantial number of papers recently published on the topic, it may be challenging for researchers and
practitioners to get a complete overview of the state of the art.
We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks. We provide:
(i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the
automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for
their evaluation, with a description of the evaluation metrics usually adopted for each task.
The SLR is concluded by a discussion of the current limitations of the state-of-the-art, with insights for future research directions.
CCS Concepts: • Software and its engineering → Software development techniques.
Additional Key Words and Phrases: code review, recommender systems
ACM Reference Format:
Rosalia Tufano and Gabriele Bavota. 2025. Automating Code Review: A Systematic Literature Review. In Woodstock ’18: ACM Symposium
on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 34 pages. https://doi.org/XXXXXXX
1 INTRODUCTION
The idea of inspecting peers’ code looking for bugs and suboptimal implementation choices dates back to the 70s and in
particular to the seminal work by Fagan titled “Design and code inspections to reduce errors in program development” [1].
The formal code inspections envisioned at that time slowly evolved into what is known as modern code review (MCR) [2],
which is tool-based and more informal.
One of the objectives of MCR is to reduce the inherent cost associated with code review. Indeed, while there is ample
evidence about the benefits of code review [2–6], they do not come for free, and may result in developers spending
many hours per week reviewing code [7].
For this reason, researchers proposed techniques and tools to automate specific code review tasks. For example,
several studies focus on the task of recommending reviewers [8–43], namely the automatic selection of
proper reviewers for a given code change. Other researchers target instead the task of classifying reviewers’ comments
[44,45], with the goal of automatically classifying comments posted by reviewers based on the “type of feedback”
they provide to the contributor (e.g., feedback about the code style, functionality, etc.). With the recent adoption of deep
learning (DL) in software engineering, generative tasks have also been subject to automation. For example, DL models
have been trained with the goal of generating natural language comments asking the contributor for code changes as a
human reviewer would do (i.e., simulating a reviewer commenting on the submitted code) [46–56].
©2025 Association for Computing Machinery. Manuscript submitted to ACM. arXiv:2503.09510v1 [cs.SE] 12 Mar 2025
Given the numerous code review tasks in which automation attempts have been made and the large number of studies
targeting this topic, it is important to synthesize the current state-of-the-art to provide researchers and practitioners
with an updated entry point on code review automation.
We present a systematic review of the literature presenting techniques and tools for the automation of code review
tasks. Previous secondary studies on the topic [57,58] only focused on specific tasks (i.e., the recommending reviewers task
and refactoring-aware solutions) or do not have a specific focus on code review automation [59]. As we show, there
are 34 tasks for which researchers proposed automated solutions in 119 articles. As a comparison, the most extensive
literature review to date that also features code review automation techniques only includes 53 of these articles [59]. This
makes our SLR by far the most comprehensive to date on the topic of code review automation. The SLR we present
is the result of selecting 119 relevant studies starting from the 11,165 articles returned by querying popular digital libraries. Our
contributions are: (i) A categorization of the 34 code review tasks for which researchers proposed automated solutions;
(ii) An overview of the techniques used in the literature to automate code review (e.g., exploiting machine learning, DL,
information retrieval, etc.) with a focus on the training strategies used for data-driven techniques; (iii) A collection
of the publicly available techniques (i.e., the tool or the code implementing the technique is publicly available) and
evaluation datasets clustered by “type of automated tasks” (e.g., we list all publicly available tools/techniques and
evaluation datasets for the task of recommending reviewers); (iv) A description of the evaluation frameworks adopted in
the literature to assess the performance of techniques proposed for the different tasks, with a focus on the adopted
metrics, targeted language, and deployment in industry of the automated solution; (v) A list of directions for future work
in the field of code review automation, informed by the findings of our SLR.
1.1 Structure of the Paper
Section 2 reports the related literature, presenting surveys, SLRs, and mapping studies dealing with modern code review.
Section 3 presents the methodology we adopted to conduct the SLR. Section 4 discusses the achieved results, answering
our research questions. Section 5 reports the threats that could affect the validity of our findings. Finally, Section 6
concludes the paper.
2 RELATED WORK
Table 1 lists the previous secondary studies on modern code review in chronological order. For each work we also
include (i) the overall number of papers part of the study (i.e., column “#Papers”) and (ii) the number of papers related to the
automation of code review tasks that are featured in the study (i.e., column “#Papers Automation”).
The focus of the works by Badampudi et al. [60], Wang et al. [62], and Fronza et al. [61] is different as compared
to our SLR. Badampudi et al. [60] aim at classifying the literature on modern code review based on the investigated
research questions. As a result of this analysis they report that 39 of the surveyed papers present tool support for code
review. However, no additional analyses are performed on these works. A similar study has also been presented by
Wang et al. [62]. Also in this case the authors focus on classifying the type of contribution, reporting 37 papers out
of the 112 considered as related to code review automation. Again, there are no specific research questions in the
SLR about code review automation. Fronza et al. [61], instead, explicitly focus on empirical studies rather than papers
presenting techniques and tools for code review automation.
Table 1. Surveys, SLRs and mapping studies dealing with modern code review
Reference | Main Goal | Year | #Papers | #Papers Automation
Hannebauer et al. [57] | Comparing eight techniques to recommend code reviewers | 2016 | 8 | 8
Badampudi et al. [60] | Documenting the research questions addressed in code review literature | 2019 | 177 | 39
Coelho et al. [58] | Mapping refactoring-aware solutions to support modern code review | 2019 | 13 | 9
Fronza et al. [61] | Documenting the research questions addressed in code review literature | 2020 | 75 | 0
Wang et al. [62] | Mapping the type of contributions (e.g., empirical study, automation) in code review papers, study their replicability, document the type of data collected in such studies (e.g., the experience of reviewers, the workload, etc.) | 2021 | 112 | 37
Davila et al. [59] | Mapping the type of contributions (i.e., foundational, proposals, evaluations) in code review papers | 2021 | 139 | 53
Our work | Documenting the code review tasks automated in the literature, the adopted techniques and evaluation datasets/frameworks | 2025 | 119 | 119
Hannebauer et al. [57] and Coelho et al. [58] present secondary studies focusing on the automation of specific code
review tasks. The former compares eight techniques for the recommending reviewers task, while the latter looks at 13
refactoring-aware solutions proposed in the literature. Our SLR has a wider target, looking at works automating any
code review task.
Finally, Davila et al. [59] presented another SLR mapping the type of contribution of the code review papers. Their
SLR features 53 papers presenting tools and techniques for the automation of code review. As compared to the previously
discussed SLRs, Davila et al. provide a detailed description of these works, including the type of task they support.
However, differently from our SLR, the main focus is not on code review automation and, due to the time period in which
papers have been collected (i.e., up to 2019), none of the recent techniques built on top of DL models is documented
(and, consequently, none of the tasks that have been automated for the first time thanks to DL models). Our SLR more
than doubles the number of papers on code review automation covered by Davila et al. [59] (from 53 to 119).
3 RESEARCH METHOD
We describe our research method following the guidelines by Kitchenham and Charters [63] for SLR.
3.1 Research Questions
Our SLR aims at informing researchers and practitioners about the state of the art in automating code review and it is
thus steered by the following research questions (RQs):
• RQ1: What are the code review tasks for which researchers proposed automated solutions? We aim at categorizing
the code review tasks automated in the literature to support (i) researchers, in getting a complete overview of
tackled research directions in the field, thus possibly identifying areas in need of further research; and (ii)
practitioners, in discovering automated solutions which may be employed in their daily workflow.
Once the list of automated tasks has been identified, the following RQs are answered for each task (i.e., by discussing the
findings by task):
• RQ2: What are the under-the-hood solutions behind the techniques and tools proposed for code review automation?
This RQ sheds light on the functioning of the proposed automated solutions. In particular, we present: (i) a high-
level classification of the adopted technical solution (e.g., DL-based, ML-based (Machine Learning), IR-based
(Information Retrieval), etc.); (ii) a description of the training strategies adopted in data-driven solutions; and (iii)
information about the programming language target of the automation (e.g., does the technique only support
Java or is it language-independent?).
• RQ3: How are techniques for the automation of code-related tasks empirically evaluated? We focus on the adopted
evaluation metrics and on additional qualitative/industrial studies present in the papers. RQ3 can help researchers
in getting a quick understanding of the possible evaluation framework to adopt for their techniques.
• RQ4: Which techniques and datasets are publicly available? While RQ1 identifies the automated tasks and, for
each of them, lists the solutions proposed in the literature, not all these techniques are publicly available (i.e.,
their implementation has been released by the authors). A similar observation can be made for the used datasets.
The output of RQ4 is the list of techniques and datasets that, as of the day of writing (January 2025), are publicly
available. Such an outcome can be useful to (i) researchers, to easily identify baselines for comparisons and/or
datasets that can be used for building or evaluating automated solutions; and (ii) practitioners, to easily spot
“ready-to-use” solutions they can consider for adoption.
• RQ5: What are the concerns raised or the limitations observed by researchers when experimenting with the automated
solutions? We inspect the 119 papers to identify limitations and concerns researchers discuss about the proposed
techniques, with the goal of outlining possible future research directions in the field.
3.2 Relevant Study Identification
Fig. 1 depicts the process adopted to identify the relevant primary studies. Such a process is detailed in the following.
Fig. 1. Study selection process. [Flow diagram with four steps. (1) Online Search: ACM +1,006, Elsevier +3,743, IEEE +1,885,
Scopus +2,890, Springer +1,397, Wiley +244 (n=11,165). (2) Automated Filtering: duplicates -1,320, invalid publication venues -3,511,
systematic literature reviews -3,271, book chapters/magazines -1,151, conference reviewers’ lists -306, user/app review studies -209
(n=1,397). (3) Manual Inspection: -1,294 (n=103). (4) Snowballing: +16, leading to 119 selected studies.]
3.2.1 Search Strategy. We queried six digital libraries to search for primary studies: ACM Digital Library [64], Elsevier
ScienceDirect [65], IEEE Xplore Digital Library [66], Scopus [67], Springer Link Online Library [68], and Wiley Online
Library [69]. We did not query Google Scholar due to the limitations documented by Halevi et al. [70] (e.g., lack of
quality control, missing support for data download).
To define the query needed to identify works related to the automation of code review tasks, the two authors followed a
trial-and-error procedure. It soon became clear that searching in the paper titles for keywords
such as “automating”, “recommending”, etc. was not an option, even considering all their possible variations (e.g.,
automating, automated, automate). Indeed, this would have led to the loss of several relevant studies (e.g., “Intelligent
Code Review Assignment for Large Scale Open Source Software Stacks” [30], “A Multi-Step Learning Approach to Assist
Code Review” [71]). For this reason, we opted for a more conservative query which targets the identification of all code
review-related studies, even those not presenting automated solutions:
Title CONTAINS
“revi*” OR (“cod*” AND “edit*”) AND
Publication venue CONTAINS
(“software” OR “program” OR “code”)
The query searches for the term “revi*” (e.g., review, reviewing, revision) or both the terms “cod*” (e.g., code, coding)
and “edit*” in the article title. The latter pair has been included to match works related to the recent trend of automating
the code editing needed to address a reviewer’s comment (see e.g., [47,49,54–56,72–77]). While only searching in the title
might be restrictive, we want to identify automated solutions which have been explicitly proposed for code review
(e.g., we are not interested in articles presenting generic static analysis tools that might be applied in code review to
spot quality issues). Also, we only searched for articles published in venues containing at least one of three keywords:
“software”, “program”, and “code”. Such a filter is based on the authors’ knowledge of software engineering publication
venues. We acknowledge that there might be relevant articles published in related fields (e.g., artificial intelligence) that
our query would exclude. However, as explained later, we adopt a snowballing process to partially address this issue.
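To make the semantics of the query concrete, the following is a minimal Python sketch of it as a local predicate. It is only an illustration: each digital library applies its own matching rules, so reading the wildcards as case-insensitive word prefixes is an assumption, and the names `title_matches`, `venue_matches`, and `query_matches` are hypothetical.

```python
import re

# Hypothetical local re-implementation of the search query.
# Assumption: "revi*", "cod*", and "edit*" match any word starting
# with that prefix, ignoring case.
TITLE_REVI = re.compile(r"\brevi\w*", re.IGNORECASE)
TITLE_COD = re.compile(r"\bcod\w*", re.IGNORECASE)
TITLE_EDIT = re.compile(r"\bedit\w*", re.IGNORECASE)
VENUE_KEYWORDS = ("software", "program", "code")


def title_matches(title: str) -> bool:
    """True if the title contains "revi*" or both "cod*" and "edit*"."""
    has_revi = TITLE_REVI.search(title) is not None
    has_cod_and_edit = (
        TITLE_COD.search(title) is not None
        and TITLE_EDIT.search(title) is not None
    )
    return has_revi or has_cod_and_edit


def venue_matches(venue: str) -> bool:
    """True if the venue name contains at least one of the three keywords."""
    lowered = venue.lower()
    return any(keyword in lowered for keyword in VENUE_KEYWORDS)


def query_matches(title: str, venue: str) -> bool:
    """Both the title condition and the venue condition must hold."""
    return title_matches(title) and venue_matches(venue)
```

Under these assumptions, a title such as “A Multi-Step Learning Approach to Assist Code Review” satisfies the title condition, but the article would only be retrieved if its venue name also contained one of the three keywords, which is why the snowballing step is needed for venues such as the Conference on Artificial Intelligence.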
Among the queried search engines, Elsevier, Scopus, Springer, and Wiley allow specifying a discipline of interest,
which is useful to minimize the number of retrieved false positives. For these libraries, we selected “Computer Science” as
discipline. Springer also allows specifying sub-disciplines, for which we selected “Software Engineering/Programming”.
The link with the query used for each digital library is publicly available in our replication package [78]. The query was
run on 20 December 2024 on all digital libraries.
Table 2. Articles returned by the queried digital libraries
Source Returned Articles
ACM Digital Library 1,006
Elsevier ScienceDirect 3,743
IEEE Xplore Digital Library 1,885
Scopus 2,890
Springer Link Online Library 1,397
Wiley Online Library 244
Total (including duplicates) 11,165
Total (excluding duplicates) 9,845
Table 2 reports the articles returned by each digital library. Once duplicates were removed (i.e., the same article has been
returned by multiple libraries), we collected 9,845 candidate primary studies which have been manually inspected as
described in the following.
3.2.2 Study Selection. Given the high number of articles returned by the formulated query, we started with an automated
check aimed at excluding clear false positives. First, despite the filter on venues we set in the digital libraries, we noticed
that some of the returned results concerned invalid publication venues (i.e., venues not featuring in their name any of
the three keywords “software”, “program”, and “code”). Thus, we implemented a simple script excluding those cases
(-3,511).
Three other filters were implemented. First, given our query, and in particular the retrieval of articles containing
“revi*” in their title, we retrieved several SLRs. Among those, we were only interested in the ones focusing on code
review, since they represent an important source of references for the snowballing phase. Thus, we automatically
removed all articles containing in the title, besides “review”, the term “systematic” and not containing the term
“code” (-3,271). Second, we excluded articles published as book chapters or in magazines, since those are usually not full
research articles (-1,151). Finally, we also excluded “reviewers lists” (-306) and works related to user/app reviews (-209).
At the end of this process, 1,397 candidate primary studies were left.
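The counts reported for the online search and the automated filtering can be cross-checked with a few lines of Python. The sketch below uses only the numbers from Table 2 and from the text above; the dictionary names and the `is_unrelated_slr` helper are hypothetical, and the helper is just one plausible reading of the title-based SLR filter, not the authors’ actual script.

```python
# Articles returned by each digital library (Table 2).
RETURNED = {
    "ACM": 1006,
    "Elsevier": 3743,
    "IEEE": 1885,
    "Scopus": 2890,
    "Springer": 1397,
    "Wiley": 244,
}
UNIQUE_AFTER_DEDUP = 9845  # reported total excluding duplicates

# Articles excluded by each automated filter, as reported in the text.
AUTOMATED_EXCLUSIONS = {
    "invalid publication venue": 3511,
    "systematic review not about code": 3271,
    "book chapter or magazine": 1151,
    "reviewers list": 306,
    "user/app review study": 209,
}


def is_unrelated_slr(title: str) -> bool:
    """Hypothetical title filter: "review" and "systematic" but not "code"."""
    lowered = title.lower()
    return (
        "review" in lowered
        and "systematic" in lowered
        and "code" not in lowered
    )


total_returned = sum(RETURNED.values())
duplicates = total_returned - UNIQUE_AFTER_DEDUP
remaining = UNIQUE_AFTER_DEDUP - sum(AUTOMATED_EXCLUSIONS.values())
print(total_returned, duplicates, remaining)  # 11165 1320 1397
```

The three printed values match the totals in Table 2, the number of duplicates implied by it, and the 1,397 candidate primary studies left after the automated filtering.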
Table 3. Inclusion and exclusion criteria
Inclusion Criteria
IC1 The article must be peer-reviewed, published at conferences, workshops, or journals. In the snowballing phase later described, we ignore all
referenced preprints (e.g., those published on arXiv.org).
IC2 The PDF of the article must be available online. We searched for it on the online libraries featuring it and, if needed, on Google.
IC3 The article must present technique(s) to automate a code review task. It is not enough to present a generic technique that, according to the
reader, might be useful in the context of code review: The authors must explicitly state that the technique has been designed to support code
review.
Exclusion Criteria
EC1 The article is not written in English.
EC2 The article has been published in a conference/workshop and later on extended to a journal. We only keep the journal article to avoid
redundancy.
EC3 The article is not a full research publication (e.g., doctoral symposium articles, posters, ERA track). We exclude all articles having fewer than six
pages with the goal of removing articles that may not have been subject to the same peer-review process typical of full research articles.
EC4 The article replicates a previously published technique for code review automation which has been already included in the SLR.
EC5 The article is a secondary study. In this case, we keep it only as a source of references for the snowballing phase.
EC6 The article has not been published in an international venue, but in a national one (e.g., Brazilian Symposium on Programming Languages).
This set has then been manually inspected by both authors. Inclusion and exclusion criteria are listed in Table 3.
This part of the manual analysis was mainly focused on the inspection of the title and abstract of the article. The authors
agreed to be conservative and include the article in case of doubts, given the planned subsequent reading of the whole
article as described in the following. Conflicts (i.e., cases in which one author considered the article as relevant and one
did not) arose in 25 cases (1.8%) and have been solved through an open discussion. This filtering process left 175 candidate
studies which have been equally split between the two authors. Each author downloaded the corresponding articles and
re-inspected them keeping the inclusion and exclusion criteria in mind (Table 3), either confirming each article as
relevant for the SLR or discarding it. All discarded articles have been double-checked by the other author to ensure no
relevant studies were mistakenly excluded.
This further check confirmed 103 articles as relevant primary studies. Those, together with 19 articles tagged as
“relevant secondary study”, have been subjected to a backward snowballing process.
Backward Snowballing. The 122 articles were equally split between the authors, with each of them in charge of
reading the reference lists and identifying possible relevant papers. At this step, we also retrieved relevant papers published
in venues not containing any of the three keywords “software”, “program”, and “code” (e.g., papers published in the
Conference on Artificial Intelligence, AAAI). Also in this phase, in case of doubts, the authors agreed to include a
referenced article for a further check by the other author. The snowballing resulted in 16 additional primary studies
which, added to the 103 previously collected, lead to the final set of 119 primary studies featured in our SLR.
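The numbers reported for the manual inspection and the snowballing phase can be checked in the same way. This is a small consistency sketch using only figures stated above; the variable names are hypothetical, and computing the conflict rate over the 1,397 manually inspected candidates is an assumption that happens to reproduce the reported 1.8%.

```python
# Figures reported in the study selection description.
candidates_after_automated_filtering = 1397
candidates_after_title_abstract_check = 175
confirmed_primary_studies = 103
relevant_secondary_studies = 19
primary_studies_from_snowballing = 16
conflicts = 25

# Articles whose reference lists were read during backward snowballing.
snowballing_seed = confirmed_primary_studies + relevant_secondary_studies

# Final set of primary studies featured in the SLR.
final_primary_studies = confirmed_primary_studies + primary_studies_from_snowballing

# Articles discarded across the whole manual inspection (see Fig. 1).
discarded_manually = candidates_after_automated_filtering - confirmed_primary_studies

# Assumption: the conflict rate is computed over the 1,397 inspected candidates.
conflict_rate = round(100 * conflicts / candidates_after_automated_filtering, 1)

print(snowballing_seed, final_primary_studies, discarded_manually, conflict_rate)
# 122 119 1294 1.8
```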
Table 4. Data extraction questionnaire
No. Question Focus
Q1 Which code review task has been automated? RQ1
Q2 Does the employed technique rely on machine/deep learning? RQ2
Q3 If yes to Q2, which specific algorithms are used? RQ2
Q4 If no to Q2, summarize the approach functioning. RQ2
Q5 Which dataset has been exploited to build the technique? Collect information related to (i) the subject programming language(s) and (ii) the
type of information featured in the dataset (i.e., what is an “instance” in the dataset?). RQ2
Q6 Which evaluation metrics have been employed? RQ3
Q7 Did the authors perform any sort of qualitative analysis? RQ3
Q8 Was the approach deployed in an industrial setting? RQ3
Q9 Is a link to a replication package available? Is the link still working? RQ4
Q10 Is the implementation of the proposed solution publicly available? RQ4
Q11 Are the datasets used for training and/or evaluating the technique publicly available? RQ4
Q12 Do the researchers raise specific concerns or discuss limitations about the experimented solutions? RQ5
3.3 Data Extraction and Analysis
The 119 primary studies have been inspected one last time with the goal of extracting the information needed to answer
our RQs. The articles have been again equally split among the two authors with each of them in charge of extracting
the needed data guided by the questionnaire in Table 4. The questions are clustered based on the RQ they serve.
Q1 collects the data needed to answer RQ1 (i.e., code review tasks automated in the literature). Q2-Q5 aim at
categorizing the under-the-hood functioning of these techniques, thus answering RQ2. Q6-Q8 shed light on the
empirical evaluation performed to assess the proposed techniques (RQ3), while Q9-Q11 look at the replicability of the
primary studies and list publicly available techniques and tools (RQ4). Finally, Q12 informs our discussion of current
limitations of automated techniques (RQ5).
It is worth noting that some of the considered articles did not explicitly report some of the information we aim at
collecting. Those cases are all documented in the master table reported in our replication package [78].
Once the needed data were collected, we answered our RQs as follows. For RQ1 we report the list of code review tasks
automated in the literature. Given this list, all other RQs are discussed by task. In all RQs in which “categories” must
be defined (e.g., the list of automated tasks in RQ1), these have been obtained via an open-coding inspired procedure
performed together by the two authors on the notes each of them took during the data extraction procedure, going
back to the original paper if needed (i.e., if the notes were not clear/comprehensive enough).
For RQ2 we classify the automated approaches based on the technical solution they are built upon (e.g., DL-based).
Then, we distill findings about the training procedures followed for data-driven techniques and the targeted programming
languages. For RQ3 we focus instead on the evaluation, reporting the metrics usually adopted in the assessment of the
techniques, whether a qualitative analysis was present, and whether the approach has been deployed in industry.
RQ4 lists in a tabular fashion the available replication packages, reporting for each of them whether they provide an
implementation of the proposed technique and/or the datasets used in the study.
Finally, for RQ5 we read the selected papers with a particular focus on the sections describing the approach, those
discussing the results, and the conclusions to identify concerns/limitations about the proposed technique and its
experimentation. We ignored classic limitations which can be found in any paper and which are usually discussed in the
“threats to validity” section (e.g., lack of generalizability beyond the scope of the experiment, limited hyperparameter
tuning), and focused instead on concerns/limitations which are peculiar to the experimented technique (e.g., the lack of an
appropriate metric to assess its effectiveness). Once the relevant parts of the papers were identified, a tag summarizing the
discussed issue was defined. Then, similar tags were merged and the final list of tags was organized in a taxonomy
presented in Table 10. The identified issues and their mapping with the corresponding papers were double-checked by
a second author.
Fig. 2. Publication years
4 RESULTS
Before answering our RQs, we provide an overview of the identified primary studies. Fig. 2 plots the publication year
of the 119 articles showing, with a few exceptions, an overall increasing trend over the years, with 21 papers published in
2024. Fig. 3 shows instead the publication venues for these techniques, with venues such as IST, EMSE, ESEC/FSE, and
ICSE being the most popular ones. Table 11 in the paper appendix indicates what each acronym used for publication
venues stands for.
4.1 RQ 1: What are the code review tasks for which researchers proposed automated solutions?
Table 5 presents the 34 types of code review tasks which have been automated in the literature. Table 5 groups the
tasks into macro categories (e.g., “Code Change Analysis”) and provides a short description of each task with related
references (i.e., the works addressing its automation). We discuss in the following each macro category.
4.1.1 Assessing Review Quality. Works in this area aim at automatically assessing the quality of the review. Such
information is meant to be fed to the reviewer who can take proper actions to improve the review quality, if needed.
Several works classify review comments as useful or not useful for the contributor [79–82]. Rahman
et al. [83] address a similar problem, focusing specifically on comments requiring additional explanations to be
properly understood by the contributor (thus being a subcategory of not-useful comments). Widyasari et al. [84] also
investigate comments requiring additional explanations, and propose the usage of Large Language Models (LLMs) to
generate the additional explanations, when needed.
Fig. 3. Publication venues
Finally, Hijazi et al. [85,86] looked at the code review quality measurement from an orthogonal perspective using
biometrics data. By monitoring the reviewer’s activities (using, e.g., an eye-tracking device) they can provide feedback
to the reviewer about areas of the reviewed code they did not pay enough attention to, thus suggesting a further check.
4.1.2 Code Change Analysis. This category groups techniques aimed at analyzing the code change submitted for review
in order to extract information useful to support the reviewer in their inspection. Several authors [87–89] targeted the
splitting of tangled commits [90] into smaller and cohesive changes which are supposed to be easier to review. Indeed,
having smaller changes can help in achieving quick review turnarounds [2,6], while cohesive changes simplify the
identification of proper reviewers, who are more likely to have comprehensive expertise to review the change (given
its cohesiveness and focus).
Huang et al. [91] propose the automated identification of the “salient-class” in a commit to review. The salient-class is
the one supposed to be the main focus of the changes and which likely triggered changes to other code locations. Such a
class can be used as an entry point for the review process, assuming that this will simplify the code change understanding.
Wang et al. [92] suggest the automated linking of similar contributions, which may help in identifying duplicated
patches and, more in general, in increasing the reviewers’ awareness about changes impacting similar locations, thus
promoting a better code review.
Finally, with the goal of minimizing the number of code review iterations needed to accept a proposed change, Hong
et al. [93] propose a change impact analysis methodology specifically tailored for the code review process and aimed at
identifying functions that must co-change given the proposed contribution, but are not changed.
4.1.3 Code Change Classification. Works in this area classify the whole code change to review, again with the goal of
augmenting the information available to reviewers before starting the code inspection. Predicting whether the code
change will be approved (merged) or needs additional review rounds is the most popular code change classification task
tackled in the literature [47,55,56,94–101]. Works on this topic provide a representation of the code change as input to
the approach (e.g., to a DL model), expecting it to suggest whether the implemented change is acceptable. A variation is
to also provide the technique with information about the specific change the developer was asked to implement (e.g., a
reviewer comment that the contributor had to address). The resulting Boolean prediction can help, for example, to
prioritize the diff hunks part of a pull request, focusing on those likely to require a reviewer’s comment [47].
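As an illustration of how such a prediction could drive prioritization, the following Python sketch orders the diff hunks of a pull request so that those flagged as likely to require a comment come first. It does not reproduce any of the surveyed techniques: the `HunkPrediction` type, its fields, and the optional confidence `score` are hypothetical, and the predictions are assumed to come from an external classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HunkPrediction:
    """Hypothetical output of a classifier for one diff hunk."""
    path: str             # file touched by the hunk
    needs_comment: bool   # Boolean prediction: likely to require a comment?
    score: float          # assumed classifier confidence in [0, 1]


def prioritize(hunks: list[HunkPrediction]) -> list[HunkPrediction]:
    """Flagged hunks first; within each group, higher confidence first."""
    return sorted(hunks, key=lambda h: (not h.needs_comment, -h.score))


predictions = [
    HunkPrediction("src/util.py", needs_comment=False, score=0.90),
    HunkPrediction("src/parser.py", needs_comment=True, score=0.60),
    HunkPrediction("src/api.py", needs_comment=True, score=0.85),
]
review_order = [h.path for h in prioritize(predictions)]
print(review_order)
```

With the example predictions above, the two flagged hunks are listed before the unflagged one, so a reviewer following this order would inspect them first.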
Another line of research aims at identifying code contributions which, due to their nature, will require a large review
effort. Uchôa et al. [104] automatically flag code changes which are likely to impact the software design, thus requiring
extra care in their assessment. Wen et al. [103] propose BLIMP Tracer, a tool to support code review through impact
analysis information, thus helping in identifying changes impacting mission-critical deliverables. Wang et al. [105]
generalize the problem to the automated identification of large-review-effort changes while, at the other end of the
spectrum, Zhao et al. [106] target the identification of quickly reviewable changes, namely contributions that are easy
to merge or reject. Similarly to the work classifying the contributions as likely to be accepted/rejected, all these works
provide code reviewers with information useful for prioritizing the changes to inspect.
4.1.4 Code Change Quality Check. Researchers proposed solutions to (partially) automate the quality check usually in
place when reviewing a code change. Approaches addressing this task substantially vary in their goal and complexity.
Some of them focus on specific code quality aspects, such as predicting whether a submitted patch is likely to introduce
a bug [ 109,110], identifying the presence of missed clone refactoring opportunities [ 108], or checking whether the
implemented change violates existing design patterns [ 107]. Other techniques address the same problem with, however,
a more general view on code quality.
Some authors [71, 111, 112] aim at predicting code elements in a patch which require the reviewer’s attention, since
they are likely in need of changes. These approaches are useful in the context of within-patch review prioritization (i.e., deciding
where to allocate more review effort within a patch). Other works push the boundaries further, targeting the automated
generation of concrete feedback for the contributor, as a human reviewer would do. A first strategy to achieve this
goal consists in merging the output of several static analysis tools [8], providing the contributor with a list of potential
flaws identified in the submitted patch. The most recent trend consists, however, in exploiting DL models to generate
natural language comments for a given patch, with the model imitating a human reviewer [46–56]. These techniques
are trained on thousands of examples of real code reviews (i.e., review comments linked to specific code changes) and
can then be applied to previously unseen changes to generate review comments. Markovtsev et al. [113] focused on
a simplified version of this problem: Their approach “learns” the code formatting style of a given software project,
identifies violations of such a style, and suggests possible fixes as automatically generated reviewer’s comments.
4.1.5 Code Review Sentiment Analysis. The code review process may result in critiques moved by a developer (reviewer)
to one of their peers (contributor). The way in which these critiques are formalized in the reviewer’s comment can play
an important role in the successful outcome of the whole process. For this reason, researchers applied sentiment analysis
Table 5. Code review tasks for which automated solutions have been proposed
Type | Task | Description | Reference
Assessing Review Quality | Assessing Review Quality through Biometrics | Evaluate the quality of code review using biometrics data, warning the reviewer if specific areas of code deserve a further check | [85, 86]
Assessing Review Quality | Classifying the Usefulness of Review Comments | Classify a given code review comment as useful or not-useful for the contributor | [79–82]
Assessing Review Quality | Identifying/Improving Review Comments Needing Further Explanations | Identifies review comments which need further explanations to be properly understood by the contributor | [83, 84]
Code Change Analysis | Decomposing Tangled Commit | Split a composite code change into smaller and cohesive changes | [87–89]
Code Change Analysis | Impact Analysis for Code Review | Recommend functions that must be changed given the submitted contribution | [93]
Code Change Analysis | Linking Similar Contributions | Link similar changes to review that share textual content and modify similar code locations | [92, 102]
Code Change Analysis | Predicting Salient-Class | Identification of the “salient-class” in a commit to review, namely the class causing the other changes in the commit | [91]
Code Change Classification | Identifying Impactful Code Changes | Identify impactful code changes (e.g., impacting the system design) | [103, 104]
Code Change Classification | Identifying Large-review-effort Code Changes | Identify code changes that will require a large reviewing effort | [105]
Code Change Classification | Identifying Quickly Reviewable Changes | Rank changes to be reviewed based on their likelihood of being quickly merged or rejected | [106]
Code Change Classification | Predicting Code Changes Approval, Merge, or Need for review | Predict the likelihood of a change of being accepted, merged, or needing review | [47, 55, 56, 94–101]
Code Change Quality Check | Checking Design Patterns Consistency | Check whether the implemented change violates existing design patterns | [107]
Code Change Quality Check | Generating Review Comments | Generate review comments for a given piece of code | [46–56]
Code Change Quality Check | Identifying Clone Refactoring Opportunities | Detect unrefactored or partially refactored code clones | [108]
Code Change Quality Check | Predicting Code Defectiveness | Predict the defectiveness of a patch before or after being reviewed | [109, 110]
Code Change Quality Check | Predicting Problematic Code Elements | Predict code elements in a given contribution reviewers should pay particular attention to (e.g., lines likely needing changes) | [71, 111, 112]
Code Change Quality Check | Reviewing Code Formatting Violations | Suggest how to fix code formatting violations in a given piece of code | [113]
Code Change Quality Check | Reviewing via Static Analysis | Use multiple static analysis tools to generate a code review | [8]
Code Review Sent. Analysis | Classifying the Sentiment of Review Comments | Classify the sentiment of review comments as neutral, negative, or positive | [114]
Code Review Sent. Analysis | Identifying “Pushback” Feelings in Reviews | Identify feelings of “pushback”, with the reviewer blocking a change request for interpersonal conflicts | [115]
Code Review Sent. Analysis | Identifying Toxic/Uncivil Review Comments | Identify toxic or uncivil comments in code reviews | [116–119]
Code Review Sent. Analysis | Rephrasing Toxic/Uncivil Comments | Rephrase review comments to improve its politeness without changing its semantic | [119]
Retrieval of Similar CR/CC | Augmenting Reviews | Can be used to provide either (i) the contributor with examples of reviews similar to those they are receiving (for better understanding); or (ii) the reviewer with examples of reviews which have been written for code similar to the one they are inspecting | [83, 120–126]
Retrieval of Similar CR/CC | Mining Code Improvement Patterns | Extract source code improvement patterns from existing code review history to recommend how to improve the submitted code | [127]
Revised Code Generation | Implementing the Code Change Requested by a Reviewer | Generate a revised version of a given piece of code by implementing a specific change requested by the reviewer in a natural language comment | [47, 49, 54–56, 72–77]
Revised Code Generation | Predicting the Code Output of the Review Process | Given a code snippet submitted for review, revise it to implement changes which are likely to be required by reviewers | [49, 72, 128, 129]
Time Management | Identifying Blocking Actors in Pull Requests | Identify who among contributor(s) and reviewer(s) is to blame for overdue pull requests | [130]
Time Management | Predicting Pull Request/Code Review Completion Time | Predict the time needed to complete a pull requests/code review | [100, 130–133]
Time Management | Prioritizing Review Requests | Prioritize code review requests based on factors such as age of the change, test verdicts, etc. | [101, 134, 135]
Other | Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment | Classify a review comment as Style, Functionality, Test, Approval, Disagreeing, Questioning, Roadmap, Diversion, Convention, Response or Encouragement | [44, 45, 136]
Other | Configuring Static Code Analysis Tools | Leverage code review comments for recommending static code analysis tools and warning categories to be used in future | [137]
Other | Partitioning Static Analysis Warnings | Cluster the warnings of static analysis tools into categories to simplify their inspection | [138]
Other | Recommending Reviewers | Recommend reviewers that are best suited for the given piece of code | [8–43]
Other | Visualizing Code Changes | Provide visualizations of the change to review to ease code comprehension | [139–142]
techniques to automatically classify the sentiment of reviewers’ comments [ 114]: Flagging comments expressing a
negative sentiment can provide useful information to the reviewer, who can revise those potentially problematic
comments.
Other authors tackled a more specific version of this problem, focusing on the identification of a specific type of
reviewers’ comments expressing negative feelings. In particular, Egelman et al. [115] aim at identifying review comments
suggesting the will of the reviewer to block a change request for interpersonal conflicts rather than for the quality of
the submitted contribution. Sarker et al. [116, 118], instead, focus on the identification of “toxic code reviews”, while
Ferreira et al. [117] and Rahman et al. [119] target “uncivil review comments”. Incivility represents a broader set of
negative comments as compared to toxicity, since the latter entails hate speech and offensive language, while incivility
does not [117]. Note that Rahman et al. [119], besides identifying uncivil comments, also present a model able to propose
alternative civil rephrasings preserving the original comments’ semantics.
4.1.6 Retrieval of Similar Code Reviews/Code Changes. Retrieval techniques have been used to create recommender
systems supporting code review from different perspectives. Given a code fragment to review, some techniques [ 83,120–
126] retrieve from a dataset of past reviews those involving similar code fragments and recommend to the reviewer
comments they can reuse (since used in the past to suggest improvements to similar code). Rahman et al. [83] also
proposed a similar approach, but motivated it as a mechanism to provide the contributor with additional examples of
reviews similar to those they are receiving. This could help in better understanding what the reviewer meant.
Ueda et al. [127] focused instead on mining recurring improvement patterns from code review ( i.e.,changes frequently
suggested by reviewers). Those patterns can then be potentially applied to improve the quality of the code to review
(even before the review process starts).
4.1.7 Revised Code Generation. This line of research aims at supporting the code review process by automatically
generating the code output of the review process. Two variations of this task have been proposed. The first [49, 72, 128, 129]
provides as input to the automated technique a code snippet submitted for review and expects the technique to revise
such code to implement changes which will likely be requested during the code review process. These techniques are
meant to be used by the contributor before even starting the code review process to quickly verify whether improvements
can be made to the code they write.
The second [ 47,49,54–56,72–77] is instead a code refinement task in which the approaches are provided as input
not only a code snippet submitted for review but also a specific reviewer’s comment to address. In this case the goal of
the approach is to automatically revise the submitted code generating a version of it addressing the comment provided
as input. These approaches are meant to be used during the code review process either (i) by the reviewer, to attach to
their comments an example of how they envision the revised code, or (ii) by the contributor, to automatically address
some of the reviewer’s requests.
4.1.8 Time Management. Evidence from the literature suggests that both open source and industrial projects can
undergo hundreds of reviews per month (e.g., ∼500 reviews per month in Linux [143], ∼3k in Microsoft Bing [144]). In
such a context, time management becomes essential, and researchers proposed solutions to help the proper allocation
of reviewers’ time. Unlike the previously discussed techniques, which automate specific code review tasks,
these approaches aim at augmenting the information available to reviewers and/or managers, thus possibly improving
decisions taken during code review.
Some of the proposed solutions can be combined in a sort of pipeline to support the code review: Approaches to
predict the time needed to complete a pull request [ 100,130–133] can be used to inform techniques aimed at prioritizing
review requests [ 101,134,135]. Also, pull requests taking longer than expected can be provided as input to techniques
identifying blocking actor(s) [130], namely the person(s) responsible for the delay. This could help in triggering the
blocking actor or, if possible, replacing them.
4.1.9 Other. The last category groups together tasks which did not fit in the previously presented categories and
features heterogeneous tasks. These include the code review task which has been mostly subject to automation attempts
in the literature: the recommendation of reviewers that are best suited for a given change [ 8–43]. These techniques,
while sharing the same goal, differ in the underlying technical solution adopted (RQ 2 focuses on this aspect) and in
the features used to rank the reviewers given the change. In most cases the features include information extracted
from the history of code changes to favor the recommendation of reviewers who, e.g., already worked in the past on the
code files subject of the change or already reviewed similar patches. The recency of these activities is usually considered
as well.
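The history-based ranking described above can be sketched as follows. This is a minimal illustration and not the approach of any specific surveyed paper: it assumes that each past authoring or reviewing activity on a file touched by the change contributes to a reviewer's score, and that the contribution decays exponentially with age. The function `rank_reviewers`, the `half_life_days` parameter, and the sample history are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rank_reviewers(history, changed_files, today, half_life_days=90.0):
    """Rank candidate reviewers for a change touching `changed_files`.

    `history` is an iterable of (reviewer, file_path, activity_date) tuples,
    one per past authoring or reviewing activity.
    """
    touched = set(changed_files)
    scores = defaultdict(float)
    for reviewer, path, when in history:
        if path not in touched:
            continue
        age_days = (today - when).days
        # Recency weighting: an activity loses half of its weight every half_life_days.
        scores[reviewer] += 0.5 ** (age_days / half_life_days)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))


history = [
    ("alice", "core/parser.py", date(2024, 5, 1)),
    ("bob", "core/parser.py", date(2023, 1, 10)),
    ("bob", "core/lexer.py", date(2023, 2, 1)),
    ("carol", "docs/index.md", date(2024, 5, 20)),
]
ranking = rank_reviewers(
    history, ["core/parser.py", "core/lexer.py"], today=date(2024, 6, 1)
)
print(ranking)
```

Here a single recent activity outweighs two activities that are more than a year old, which is the effect the recency factor is meant to capture.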
Another popular task in the “Other” category features approaches providing visualizations for the code changes to
review in order to simplify the reviewer’s inspection [ 139–142]. Note that we only included in our SLR visualization
techniques specifically aimed at supporting code review. Different works focus the visualization on different types
of information. Brito and Valente [ 141] propose RAID, a tool for refactoring-aware code review which visualizes the
refactoring operations implemented in the change to review. Fadhel and Sekerinski [ 140] target instead visualizations
aimed at improving the reviewer’s awareness of the possible impact that the implemented changes can have on the
system’s architecture. Fregnan et al. [142] provide a more general-purpose graph-based visualization to support code
review: Each node represents a class or a method and the links between them represent dependencies such as method
calls. The goal here is to improve the navigation of the change and its comprehension. Finally, still related to visualization
is the behavioral diff generated by the approach proposed in [ 139]. The idea is to show the behavioral differences (in
terms of test case execution) which can be observed in the system before and after the implementation of the code
change to review. This can support the assessment of code change correctness made by the reviewer.
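As a rough illustration of what a behavioral diff in terms of test case execution may contain, the sketch below compares test outcomes collected before and after a change. It is not the approach proposed in [139]; the function `behavioral_diff`, the outcome labels, and the sample data are assumptions made for this example.

```python
def behavioral_diff(before, after):
    """Compare test outcomes observed before and after a code change.

    `before` and `after` map a test name to its outcome (e.g., "pass" or "fail").
    Returns the tests whose outcome differs, including tests that exist on
    one side only (reported with the outcome "absent" on the other side).
    """
    diff = {}
    for test in sorted(set(before) | set(after)):
        old = before.get(test, "absent")
        new = after.get(test, "absent")
        if old != new:
            diff[test] = (old, new)
    return diff


before = {"test_login": "pass", "test_logout": "pass", "test_legacy": "fail"}
after = {"test_login": "pass", "test_logout": "fail", "test_token": "pass"}
changes = behavioral_diff(before, after)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

A reviewer inspecting this output would see one regression, one removed test, and one newly added test, while unchanged outcomes are omitted.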
Moving to the next task, Li et al. [44] present an approach to automatically classify reviewers’ comments into the
categories reported in Table 5 (e.g., style, functionality, etc.). Their approach is meant to provide a better understanding
and monitoring of the ongoing review process. On top of that, given the proposal of data-driven techniques to automate
tasks such as generating review comments, this approach can be used to clean up the training set of these techniques,
removing for example the comments classified as “Encouragement”, since they are irrelevant for training techniques suggesting
how to improve code snippets. A similar approach has also been presented by Turzo et al. [136], while Fregnan et al.
[45] focus on classifying the code changes implemented as a result of the code review process.
Tukaram et al. [138] propose the idea of partitioning static analysis warnings, with the goal of clustering the similar
ones thus simplifying their interpretation. On a related research thread, Zampetti et al. [137] suggest the automated
analysis of review comments posted in the past to understand which static analysis tools should be used in the continuous
integration pipeline of a given project and how they should be configured. In other words, they aim at understanding
what the relevant “issues” reviewers look for when inspecting a patch and which of those issues can be automatically
identified by static analysis tools.
Table 6. Under-the-hood solutions behind the techniques and tools proposed for code review automation.
Approaches: Deep Learning; Machine Learning; Information Retrieval; Heuristic-Based; Other.
Programming Languages: Java; Multiple Languages; Language Independent; Other.
[The “Approach” and “Language” columns are bar charts and are not reproduced.]
Task (number of papers) | Training: PT NL | Training: PT code | Training: FT | Granularity
Assessing Review Quality
Assessing Review Quality through Biometrics (2) | ✗ | ✗ | ✗ | code regions, code review
Classifying the Usefulness of Review Comments (4) | 1/4 | ✗ | 4/4 | review comment
Identifying/Improving Review Comments Needing Further Explanations (2) | 1/2 | 1/2 | 1/2 | review comment
Code Change Analysis
Decomposing Tangled Commit (3) | ✗ | ✗ | ✗ | commit
Impact Analysis for Code Review (1) | ✗ | ✗ | ✗ | PR
Linking Similar Contributions (2) | ✗ | ✗ | 1/2 | code change, PR
Predicting Salient-Class (1) | ✗ | ✗ | 1/1 | commit
Code Change Classification
Identifying Impactful Code Changes (2) | ✗ | ✗ | 1/2 | commit
Identifying Large-review-effort Code Changes (1) | ✗ | ✗ | 1/1 | commit
Identifying Quickly Reviewable Changes (1) | ✗ | ✗ | 1/1 | PR
Predicting Code Changes Approval, Merge, or Need for review (11) | 3/11 | 4/11 | 11/11 | diff hunk, method, PR
Code Change Quality Check
Checking Design Patterns Consistency (1) | ✗ | ✗ | ✗ | file
Generating Review Comments (11) | 9/11 | 10/11 | 10/11 | code change, diff hunk, method
Identifying Clone Refactoring Opportunities (1) | ✗ | ✗ | ✗ | PR
Predicting Code Defectiveness (2) | ✗ | ✗ | 1/2 | file, PR
Predicting Problematic Code Elements (3) | 2/3 | 2/3 | 3/3 | code line, file, PR
Reviewing Code Formatting Violations (1) | ✗ | ✗ | 1/1 | file
Reviewing via Static Analysis (1) | ✗ | ✗ | ✗ | PR
Code Review Sentiment Analysis
Classifying the Sentiment of Review Comments (1) | ✗ | ✗ | 1/1 | review comment
Identifying “Pushback” Feelings in Reviews (1) | ✗ | ✗ | ✗ | code review
Identifying Toxic/Uncivil Code Review Comments (4) | 4/4 | ✗ | 4/4 | email, review comment, sentence
Rephrasing Toxic/Uncivil Comments (1) | 1/1 | 1/1 | 1/1 | review comment
Retrieval of Similar CR/CC
Augmenting Reviews (8) | ✗ | ✗ | 4/8 | code change, code review, code snippet, diff hunk, review comment
Mining Code Improvement Patterns (1) | ✗ | ✗ | 1/1 | diff hunk
Revised Code Generation
Implementing the Code Change Requested by a Reviewer (11) | 8/11 | 9/11 | 11/11 | code change, diff hunk, method
Predicting the Code Output of the Review Process (4) | 2/3 | 2/3 | 4 | method
Time Management
Identifying Blocking Actors in Pull Requests (1) | ✗ | ✗ | ✗ | PR
Predicting Pull Request/Code Review Completion Time (5) | ✗ | ✗ | 4/5 | code change, commit, PR
Prioritizing Review Requests (3) | ✗ | ✗ | 2/3 | code change, PR
Other
Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment (3) | 1/3 | 1/3 | 3/3 | code change, review comment
Configuring Static Code Analysis Tools (1) | ✗ | ✗ | 1/1 | review comment
Partitioning Static Analysis Warnings (1) | ✗ | ✗ | ✗ | code snippet
Recommending Reviewers (36) | 1/36 | ✗ | 23/36 | commit, patch, PR
Visualizing Code Changes (4) | ✗ | ✗ | ✗ | commit, diff hunk, PR
4.2 RQ 2: What are the under-the-hood solutions behind the techniques and tools proposed for code
review automation?
Table 6 summarizes the under-the-hood solutions behind the techniques proposed in the literature for code review
automation. The “Task” column reports the list of automated tasks, with the number in parenthesis representing the
number of papers (out of the considered 119) presenting an automation solution for such a task. For each task 𝑇𝑖, in
the “Approach” column the bar chart depicts the percentage of DL-based, ML-based, IR-based, and Heuristic-based
techniques out of those automating 𝑇𝑖. Approaches not relying on any of these four techniques are grouped into the
“Other” category (e.g., data-flow analysis [138] or visualization techniques [139]). With heuristic-based techniques we
refer to hand-crafted techniques which are usually composed of multiple steps (e.g., building a traceability graph and
defining a specific metric to identify the best-suited reviewer for a given code change [21]).
For approaches based on DL/ML, the “Training” column shows whether they underwent (i) a pre-training on natural
language corpus (“PT NL”); (ii) a pre-training on a code corpus (“PT code”); and (iii) a fine-tuning (“FT”). While the
pre-training procedures are typical of DL-based techniques, with fine-tuning we also indicate the standard training of
classic ML algorithms ( e.g.,training a classifier to identify design-impactful changes on a labeled dataset [ 104]). For each
of these three training procedures, a ✗indicates that none of the corresponding papers adopts it, otherwise a fraction is
used to report the number of papers employing it.
The “Granularity” column indicates, for a given task, the type of “entities” for which automation solutions have been
proposed. For example, among the 11 techniques aimed at commenting on source code by posting natural language
comments as a human would do ( i.e., generating review comments task), some of them work at code change granularity
(i.e.,they take as input the whole code diff of a PR), others consider a specific diff hunk (i.e.,only a specific part of
the change, possibly spanning multiple functions), and the remaining ones work on a single function impacted by the
change ( i.e.,they comment on one changed function).
Finally, for each task, the “Language” column depicts, using again a bar chart, the percentage of proposed automation
techniques providing support for a specific programming language. Since Java was by far the most popular language,
a specific color has been assigned to it (see Table 6’s caption), while other colors are used to indicate techniques (i)
supporting multiple languages, (ii) being language-independent, or (iii) being specific for a single language which is
not Java. When we report an approach as only supporting a specific or multiple languages, this does not mean that
the approach cannot be adapted to other languages. This is something we did not assess, since it would require a
deep understanding of all technicalities behind each approach, something which is not always easy to grasp from
reading the paper. For example, a DL model trained and tested on Java code to support a specific task is labeled as “Java
only” even though, with reasonable effort, the approach could probably be trained on other languages while keeping similar
performance. Basically, we considered the languages on which the approaches can be used out of the box.
4.2.1 Approach. There is a clear distinction between the underlying solutions adopted by techniques automating
classification vs. generative tasks. For the former (e.g., classifying the usefulness of review comments, predicting salient-class,
identifying impactful code changes), ML-based solutions (red bars in Table 6) are the most popular ones (36%), followed
by heuristic-based (24%) and DL-based (21%) techniques. Other solutions account for the remaining 19% of techniques
automating classification tasks, with none of them relying on IR. The situation is quite different for generative tasks
from which the generation of textual output is expected (e.g., generating review comments, implementing the code change
requested by a reviewer, predicting the code output of the review process). In this case, DL-based solutions are by far the
most employed (78%), followed by IR-based ones (11%) which can identify relevant content in a knowledge base and
use it as output. For example, in the generating review comments task, the approach can take a piece of code to review
𝐶𝑖, find in a knowledge base the code 𝐶𝑗 most similar to 𝐶𝑖, and reuse the reviewers’ comments posted for 𝐶𝑗
when reviewing 𝐶𝑖.
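The retrieval step just described can be sketched in a few lines. The sketch below is only an illustration under simplifying assumptions: code is represented as a set of tokens and similarity is measured with the Jaccard index, whereas the surveyed IR-based techniques may rely on different representations and similarity measures. The names `tokenize`, `jaccard`, `retrieve_comments`, and the toy knowledge base are hypothetical.

```python
import re


def tokenize(code):
    """Split code into a set of lowercase identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_comments(code_to_review, knowledge_base):
    """Return the (code, comments) entry whose code is most similar to the input.

    `knowledge_base` is a non-empty list of (past_code, past_review_comments) pairs.
    """
    query = tokenize(code_to_review)
    return max(knowledge_base, key=lambda entry: jaccard(query, tokenize(entry[0])))


knowledge_base = [
    (
        "for i in range(len(items)): print(items[i])",
        ["Iterate over the items directly instead of indexing."],
    ),
    (
        "f = open(path); data = f.read()",
        ["Use a context manager so the file is always closed."],
    ),
]
matched_code, reused_comments = retrieve_comments(
    "handle = open(name); text = handle.read()", knowledge_base
)
print(reused_comments)
```

The code under review shares the `open` and `read` tokens with the second entry only, so the comments posted for that entry are the ones proposed for reuse.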
For other types of tasks which cannot really be categorized as classification or generative tasks (e.g., checking design
patterns consistency, reviewing via static analysis, visualizing code changes), no clear trend can be observed,
with all types of solutions being explored.
It is also interesting to comment on the strongly increasing adoption of DL-based techniques for code review
automation. If we focus on the last five years considered in our SLR (2020 to 2024), we find that in 2020 and in 2021 DL
models have been exploited in 20% (2/10) and in 17% (2/12), respectively, of the papers in our SLR. From 2022, instead,
we observe a strong increase in the adoption of DL-based solutions, with 42% in 2022 (11/26), 68% in 2023 (13/19), and
63% in 2024 (17/27).
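The yearly shares reported above follow directly from the paper counts given in parentheses, as the short computation below shows. The counts are those stated in the text; the variable names are only illustrative.

```python
# (DL-based papers, total papers in the SLR) per year, as reported in the text.
dl_papers = {
    2020: (2, 10),
    2021: (2, 12),
    2022: (11, 26),
    2023: (13, 19),
    2024: (17, 27),
}

# Share of DL-based papers per year, rounded to the nearest integer percentage.
shares = {year: round(100 * dl / total) for year, (dl, total) in dl_papers.items()}
for year, share in shares.items():
    print(f"{year}: {share}%")
```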
4.2.2 Training procedures. Out of the 46 DL-based solutions, 34 use some form of pre-training. The idea of pre-training
is mostly to teach the DL model the language of interest, by performing a task-agnostic training. For example, a model
meant to automatically generate review comments may be pre-trained on a corpus of natural language and code
instances via the Masked Language Modeling (MLM) pre-training objective, providing the model with a sentence as
input ( e.g.,an English sentence or a Java statement) having 15% of its tokens masked, with the model in charge of
guessing the masked tokens. Of the 34 automated techniques using pre-training, 24 start from an already pre-trained
model ( e.g.,Code Llama [ 145], CodeT5 [ 146], RoBERTa [ 147]), while the remaining ones pre-train their own model. In
both cases, the pre-training usually involves both natural language and code ( i.e.,bi-modal pre-training): This is visible
in Table 6 by comparing the number of papers using a model pre-trained on natural-language (column “PT NL”) with
those exploiting a model pre-trained on code (column “PT code”). For example, out of the 11 papers predicting code
changes approval, merge, or need for review , 4 use a pre-trained model, 3 of which pre-trained on bi-modal data (+1 only
code). Similarly, when looking at the ones implementing the code change requested by a reviewer , 9 of the 11 papers use
a pre-trained model, in 8 cases pre-trained on bi-modal data. Such a choice may have been driven by the empirical
evidence showing that pre-training on natural language helps for code-related tasks as well [ 148]. There is only one
exception to this trend: All four DL-based techniques aimed at identifying toxic/uncivil code review comments exploit
pre-training only on natural language. This is a sensible choice considering that the tackled task does not foresee the
model manipulating code elements.
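The input preparation behind the MLM objective mentioned above can be illustrated with a minimal sketch. It assumes a simplified variant in which 15% of the tokens are always replaced by a mask symbol, and it uses whitespace tokenization for readability; actual pre-training pipelines typically operate on subword tokens and may apply additional replacement rules. The `MASK` symbol and the function `mask_tokens` are hypothetical names.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Mask a fraction of the tokens and return (masked_tokens, targets).

    `targets` maps each masked position to the original token the model
    would be asked to guess during pre-training.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_masked)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


tokens = "public int add ( int a , int b ) { return a + b ; }".split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
```

The same routine applies unchanged to an English sentence, which is what makes the objective suitable for bi-modal pre-training on natural language and code.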
It is also worth mentioning the radical changes observed in the usage of pre-training in recent years. First, before
2022, we found no work on code review automation using a pre-trained DL model. Second, in 2022, out of the eight
automated solutions exploiting pre-training, only one (12.5%) used an already pre-trained model (in the other cases,
the authors of the technique pre-trained their own model). In 2023 and 2024 this trend radically changed: of the 26
proposed techniques relying on a pre-trained model, only 2 [50, 77] exploit a model pre-trained by the authors
themselves. Such a trend is easily explained by the ever-increasing availability of open source pre-trained models on
websites such as HuggingFace [149].
Concerning the fine-tuning ( i.e.,the training of a model aimed at specializing it to the target task), all but three
ML/DL-based techniques exploit it. The first two are those assessing review quality through biometrics , which use
already trained models to interpret in real time the biometric information collected by dedicated devices ( e.g.,heart rate
variability and pupillary response are captured and interpreted as “mental workload”). The third one is the work by
Widyasari et al. [84] exploiting prompt engineering techniques in ChatGPT to improve review comments needing further
explanations. Given the recent rise in the capabilities of general-purpose LLMs and their applicability to software-related
tasks, we expect more and more code review automation techniques to rely on LLMs’ prompt engineering rather than
on fine-tuned models.
4.2.3 Granularity. We only discuss the observed trends for a selection of the tasks, mostly those targeted by several
works. For some tasks, the granularity of the targeted entities is rather homogeneous. For example, when recommending
reviewers , all 36 works take as input a code change to review, which could be a commit, a patch, or a PR. Still, the overall
idea is: given a change to review, suggest the best-suited reviewers. The same observation can be made for the five
works predicting pull request/code review completion time, and for the four visualizing code changes.
When looking at generative tasks, instead, differences can be observed. The most interesting ones are those related
to techniques generating review comments (11) and implementing the code change requested by a reviewer (11). For both
of them, we can see that there are three families of techniques working on (i) entire code changes, (ii) specific diff hunks,
and (iii) a single method/function. Targeting these two tasks at these different granularities entails completely different
levels of difficulty. Let us discuss this point for the approaches supporting the implementation of a code change required
by a reviewer. Approaches working on diff elements (either an entire diff or a diff hunk) [47, 54, 73, 76, 77] must be
able to “understand” the reviewer’s comment in the context of complex diff changes which could
possibly span across different code elements (e.g., multiple functions or even files involved). Thus, addressing these
comments may be challenging, requiring modifications to several code elements. Differently, when isolating single
functions which had been commented on by a reviewer in the context of a larger code change (i.e., the single function
may be only one of the impacted code elements) [49, 55, 72, 74], the approach has a much more limited coding context
on which to operate the required code transformations. Obviously, the applicability and potential usefulness of
these techniques are also different, with the former being more flexible (e.g., several reviewer comments do not even concern
methods, but other code elements). The recent trend is to expand as much as possible the “contextual information”
available to these code review automation techniques. This is mostly possible thanks to the availability of large DL
models able to process large inputs (such as an entire diff). This is an interesting example of how technical constraints
(e.g., DL models only able to process up to 512 tokens as input [49]) pushed researchers to artificially simplify the tackled
problem (i.e., only focusing on changes required to small methods [49]), with the most recent approaches relaxing these
constraints thanks to the advances in AI.
4.2.4 Language. Works automating code-review tasks only requiring the processing of natural language information
(i.e.,the review comments, the pull request description) are, by definition, programming language-independent ( i.e.,
identifying/improving review comments needing further explanations ,configuring static code analysis tools , and all those
related to code review sentiment analysis and to time management — see Table 6). Also, 3/4 of the works classifying the
usefulness of review comments are language independent, while one is focused on comments related to Python code. As
expected, these techniques have been experimented on English artifacts only.
Among the 36 techniques recommending reviewers, 31 are also language-independent. Indeed, many of them mostly
exploit historical information and look at the source code as a bag-of-words, not really requiring parsing or other
language-specific implementations. The remaining 5 either support Java only [ 8,21,43] or a set of multiple languages
[14, 32].
When looking at the remaining 78 approaches, the most targeted programming language is Java (58 works), with
35 focusing exclusively on it and 23 also supporting at least another language. For example, Li et al. [47] addressed
the tasks of implementing the code change requested by a reviewer ,generating review comments , and predicting code
Table 7. Evaluation of the proposed techniques: Top-3 metrics used in the evaluation; whether a qualitative inspection of the results
has been performed; and whether the approaches have been deployed in industry
Task #1 Metric #2 Metric #3 Metric Qualitative Deployed
Assessing Review Quality
Assessing Review Quality through Biometrics (2) Accuracy F1-score Precision&Recall ✗ ✗
Classifying the Usefulness of Review Comments (4) Precision&Recall F1-score Accuracy 2/4 2/3
Identifying/Improving Review Comments Needing Further Explanations(2) Accuracy F1-score Correct Type 1/2 ✗
Code Change Analysis
Decomposing Tangled Commit (3) Accuracy MAP MRR 3/3 ✗
Impact Analysis for Code Review (1) Accuracy Recall MAP 1/1 ✗
Linking Similar Contributions (2) F1-score MRR Precision&Recall ✗ ✗
Predicting Salient-Class (1) Accuracy Precision&Recall - 1/1 ✗
Code Change Classification
Identifying Impactful Code Changes (2) AUC F1-score Precision&Recall 1/2 1/2
Identifying Large-review-effort Code Changes (1) AUC F1-score Precision&Recall 1/1 1/1
Identifying Quickly Reviewable Changes (1) NDCG - - 1/1 ✗
Predicting Code Changes Approval, Merge, or Need for review (11) F1-score Precision&Recall AUC 1/11 ✗
Code Change Quality Check
Checking Design Patterns Consistency (1) - - - 1/1 ✗
Generating Review Comments (11) BLEU Accuracy ROUGE-L 6/11 ✗
Identifying Clone Refactoring Opportunities (1) Accuracy F1-score Precision&Recall 1/1 ✗
Predicting Code Defectiveness (2) F1-score AUC* False alarms* 1/2 1/2
Predicting Problematic Code Elements (3) AUC F1-score Precision&Recall 3/3 ✗
Reviewing Code Formatting Violations (1) F1-score Precision&Recall Prediction Rate ✗ ✗
Reviewing via Static Analysis (1) - - - 1/1 1/1
Code Review Sentiment Analysis
Classifying the Sentiment of Review Comments (1) Accuracy F1-score Precision&Recall ✗ ✗
Identifying “Pushback” Feelings in Reviews (1) Precision&Recall - - ✗ ✗
Identifying Toxic/Uncivil Code Review Comments (4) F1-score Precision&Recall Accuracy ✗ ✗
Rephrasing Toxic/Uncivil Comments (1) Incivility Decrease Length Dissimilarity Semantic Similarity 1/1 ✗
Retrieval of Similar CR/CC
Augmenting Reviews (8) Accuracy Precision&Recall MRR 4/8 1/8
Mining Code Improvement Patterns (1) Accuracy - - 1/1 ✗
Revised Code Generation
Implementing the Code Change Requested by a Reviewer (11) Accuracy BLEU CodeBLEU 5/11 1/11
Predicting the Code Output of the Review Process (4) Accuracy BLEU Lev. Distance 2/4 ✗
Time Management
Identifying Blocking Actors in Pull Requests (1) MAE MMRE - 1/1 1/1
Predicting Pull Request/Code Review Completion Time (5) MAE MRE - 2/5 2/2
Prioritizing Review Requests (3) Accuracy* AUC* MAP* 1/3 2/3
Other
Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment (3) F1-score Precision&Recall MCC 1/3 ✗
Configuring Static Code Analysis Tools (1) MAP MAR Precision&Recall 1/1 ✗
Partitioning Static Analysis Warnings (1) Review Effort - - ✗ ✗
Recommending Reviewers (36) MRR Accuracy Precision&Recall 4/36 3/36
Visualizing Code Changes (4) - - - 3/4 2/4
changes approval, merge, or need for review by training a transformer model on code review instances related to code
written in nine different languages: C, C++, C#, Go, Java, JavaScript, PHP, Python, and Ruby. Finally, 9 of these 78
techniques are language-independent ( i.e.,they can be applied independently from the programming language, without
any adaptation) [44, 83, 98–101, 105, 109, 126].
It is interesting to note the complete lack of support for low-resource languages, namely programming languages for
which little training material is available (e.g., Julia, Lua, R). We will discuss this point further in Section 4.5.
4.3 RQ 3: How are techniques for the automation of code related tasks empirically evaluated?
Table 7 shows, for each code review task 𝑇𝑖 automated in the literature:
• The top-3 metrics used in the empirical evaluation of the techniques automating 𝑇𝑖. For approaches not employing
any quantitative metrics in their evaluation (e.g., those visualizing code changes) a dash is used to fill the
metrics-related columns. Also, some tasks have been automated by very few techniques which, however, have been
evaluated using disjoint sets of metrics. For example, the three techniques prioritizing review requests all
used different evaluation metrics [101, 134, 135], which prevents observing any trend. The same happens for the
predicting code defectiveness task. In these cases, we just report the three metrics that are the most popular when
also considering all other tasks. These cases are indicated in Table 7 with a “*” attached to the respective metrics.
Finally, we decided to group together precision and recall, since they were always used in combination in the set
of inspected papers.
• Whether a manual qualitative inspection of the techniques’ output has been performed. A “✗” indicates that
for none of the techniques automating 𝑇𝑖 a manual qualitative analysis of their output has been performed.
Otherwise, a fraction explicitly shows for how many of them, out of the total, this has been done.
• Whether the proposed technique has been deployed in an industrial setting (“Approach Deployed”). This column
must be read as the previous one. For example, out of the 36 techniques to recommend reviewers, 3 have been
deployed in industry.
Our goal with Table 7 is to provide an overview of the evaluations performed in the literature. The interested reader
can find the complete data ( e.g.,the metrics used in each of the 119 papers) in our online appendix [ 78]. In the following,
we are going to discuss visible trends, especially for tasks for which several automation solutions have been proposed.
We can observe a clear distinction in the metrics used for two clusters of tasks related to classification and generative
problems. For the former ( e.g., classifying the usefulness of review comments ,identifying impactful code changes ,classifying
the sentiment of review comments ,recommending reviewers ) well-known metrics such as precision, recall, F1-score,
accuracy, and Area Under the ROC Curve (AUC) are mostly employed. When coming to generative tasks ( e.g., generating
review comments ,implementing the code change requested by a reviewer ), researchers started borrowing evaluation
metrics from the NLP field. For example, to assess whether a DL model is able to generate meaningful review comments,
its output is compared against comments manually written by human reviewers for the same code under review, with
metrics such as BLEU [ 150] (1st) and ROUGE-L [ 151] (3rd). Both of them are basically textual-similarity metrics which
only work under specific circumstances. For example, if the DL model points to the same quality issue identified by the
human reviewer using, however, a completely different wording, these metrics are unable to reward the model for the
meaningful output. Even more penalizing for the automated technique is the usage of accuracy (2nd), which considers a
generated comment as correct only if it is identical to the human-written one. Similarly, when assessing the correctness
of automatically generating code ( e.g.,to address a review comment) researchers are using accuracy (1st), BLEU (2nd),
and CodeBLEU [ 152] (3rd). Accuracy indicates that the approach addressed the reviewer’s comment exactly as done by
a human developer ( i.e.,all code tokens are identical). The CodeBLEU is a version of the BLEU score meant to also
capture AST-level similarity between two snippets of code (rather than merely textual similarity as done by BLEU).
Also for this task, these evaluation metrics suffer from the same limitations discussed for the case of comment generation.
Indeed, the same reviewer’s comment may be successfully implemented in two different ways by the machine and
by the human, with the result of low evaluation scores even in case of meaningful recommendation. We will further
discuss these concerns in Section 4.5.
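The limitation of similarity-based metrics can be illustrated with a minimal Python sketch. It is not the evaluation code of any surveyed paper: `exact_match` and `unigram_precision` are hypothetical helper names, the two comments are illustrative, and clipped unigram precision is only the 1-gram building block of BLEU (the full metric also combines higher-order n-grams and a brevity penalty).

```python
from collections import Counter


def exact_match(generated: str, reference: str) -> bool:
    """Accuracy-style check: correct only if the token sequences are identical."""
    return generated.split() == reference.split()


def unigram_precision(generated: str, reference: str) -> float:
    """Clipped unigram precision, i.e., the 1-gram component of BLEU."""
    gen_tokens = generated.lower().split()
    if not gen_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(gen_tokens)
    # Each generated token is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return overlap / len(gen_tokens)


# Two illustrative comments that request the same rename with different wording.
generated = "please rename variable h to something more meaningful"
reference = "change h to height"

print(exact_match(generated, reference))        # identical token sequences?
print(unigram_precision(generated, reference))  # share of generated tokens found in the reference
```

Only two of the eight generated tokens ("h" and "to") appear in the reference, so both scores are low even though the two comments point to the same quality issue.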
Looking at Table 7 it is also possible to see that in several cases researchers tried to compensate for the shortcomings of the metrics
employed to assess the effectiveness of techniques for generative problems. Indeed, 6/11 approaches generating review
comments and 5/11 of those implementing the code change requested by a reviewer present some qualitative analysis
in which, e.g.,researchers looked at successful and wrong recommendations with the goal of better understanding
strengths and weaknesses of the proposed approaches. In general, qualitative analysis is quite popular, with 43% of the
code review automation techniques presenting some form of manual inspection. This percentage is negatively affected
by the only 4/36 papers recommending reviewers which present a qualitative analysis.
Finally, only 18 of the 119 techniques (15%) have been deployed in industry. For example, Froemmgen et al. [77]
deployed their approach for implementing the code change requested by a reviewer at Google. While this percentage
may look low at first sight, it is actually notable considering how recent several of the technologies behind these
techniques are.
4.4 RQ 4: Which techniques and datasets are publicly available?
Table 8 reports the list of works from our SLR which either do not provide a link to a replication package ( ✗in column
“Provided”) or, while having such a link ( ✓), it is not accessible ( ✗in column “Accessible”) at the date of writing (January
2025). Some references are present multiple times since the proposed approach supports several tasks.
Overall, 49 of the 119 papers (41%) part of our SLR do not provide a replication package, and 6 more (5%) provide a
link which is not accessible anymore. Considering that all surveyed papers present techniques for automating code
review tasks, this implies substantial challenges for researchers interested in replicating these approaches, for example
to use them as baselines for the proposal of a novel solution. The remaining 64 papers (54%) provide instead a working
replication package, as documented in Table 9. Besides reporting the link to the working replication package, Table 9
also indicates what the authors provide in it in terms of code/tool implementing the proposed approach (column “C”)
and data used in the paper (column “D”). Note that one of the works [36] provides neither code nor data, with the
linked artifact mostly presenting additional tables. The most popular platform used for sharing the replication packages
is by far GitHub (61%), followed by other solutions with a similar usage share (i.e., Zenodo, Figshare, Bitbucket, personal
website).
While the percentage of papers providing a working replication package (54%) seems to suggest major issues in the
replicability of techniques for code review automation, it is important to look at how such a trend is evolving over
time. Fig. 4 shows that the efforts put in place by the software engineering research community for promoting open
science ( e.g.,by default all papers submitted at the International Conference on Software Engineering must disclose
data/artifacts) are improving code/data availability.
Indeed, while up to 2020 the majority of the published papers did not provide a working replication package, such
a trend changed in 2021 (81% provide a replication package) and was confirmed in all subsequent years, with 86% of
published works providing a replication package in 2024.
Encouraging signals also come from the information reported in Table 9: of the 64 papers providing a
replication package, 54 (84%) disclose both the code (i.e., the source code of the proposed approach and/or a tool
implementing it) and the data used in the work.
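The shares reported in this section can be re-derived from the raw counts with a few lines of Python. This is only a consistency check on the numbers stated in the text; the variable and function names are illustrative.

```python
# Counts as reported in Section 4.4.
TOTAL_PAPERS = 119
no_package = 49        # no link to a replication package
broken_link = 6        # link provided but no longer accessible
working_package = 64   # working replication package
code_and_data = 54     # working package disclosing both code/tool and data


def share(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# The three groups partition the surveyed papers.
assert no_package + broken_link + working_package == TOTAL_PAPERS

print(share(no_package, TOTAL_PAPERS))        # papers without a replication package
print(share(broken_link, TOTAL_PAPERS))       # papers with an inaccessible link
print(share(working_package, TOTAL_PAPERS))   # papers with a working package
print(share(code_and_data, working_package))  # working packages with code and data
```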
4.5 RQ 5: What are the concerns raised or the limitations observed by researchers when experimenting the
automated solutions?
Table 10 summarizes the concerns raised and limitations observed by researchers when experimenting with the proposed
automated solutions. Table 10 organizes the identified issues into four parent categories, performance -,evaluation -,
usability -, and deployment -related issues. For each of them, references to the papers in which we found evidence of
such an issue have been reported. In the following, we discuss the main issues identified, focusing on those that are
either very popular ( i.e.,reported in several papers) or that provide interesting insights for future work. We use the
icon “” to highlight lessons learned and directions for future work.
Table 8. Works on code review automation not providing a replication package or having it not accessible as of Jan 2025
Task Reference Provided Accessible
Assessing Review Quality through Biometrics Hijazi et al. [86] ✗ -
Augmenting Reviews Guo et al. [125] ✗ -
Guo et al. [121] ✗ -
Gupta et al. [120] ✗ -
Rahman et al. [83] ✗ -
Checking Design Patterns Consistency He et al. [107] ✗ -
Classifying the Usefulness of Review Comments Pangsakulyanont et al. [79] ✗ -
Rahman et al. [80] ✓ ✗
Decomposing Tangled Commit Barnett et al. [88] ✗ -
Tao et al. [87] ✗ -
Wang et al. [89] ✗ -
Generating Review Comments Nashaat et al. [56] ✗ -
Vijayvergiya et al. [50] ✗ -
Identifying Blocking Actors in Pull Requests Shan et al. [130] ✗ -
Identifying Clone Refactoring Opportunities Chen et al. [108] ✓ ✗
Identifying Impactful Code Changes Wen et al. [103] ✗ -
Identifying Quickly Reviewable Changes Zhao et al. [106] ✗ -
Identifying “Pushback” Feelings in Reviews Egelman et al. [115] ✗ -
Identifying/Improving Review Comments Needing Further Explanations Rahman et al. [83] ✗ -
Implementing the Code Change Requested by a Reviewer Froemmgen et al. [77] ✗ -
Nashaat et al. [56] ✗ -
Linking Similar Contributions Ayinala et al. [102] ✓ ✗
Mining Code Improvement Patterns Ueda et al. [127] ✗ -
Partitioning Static Analysis Warnings Tukaram et al. [138] ✗ -
Predicting Code Changes Approval, Merge, or Need for review Fan et al. [97] ✗ -
Nashaat et al. [56] ✗ -
Shi et al. [96] ✗ -
Predicting Code Defectiveness Sharma et al. [110] ✗ -
Soltanifar et al. [109] ✗ -
Predicting Pull Request/Code Review Completion Time Chen et al. [133] ✗ -
Maddila et al. [131] ✗ -
Shan et al. [130] ✗ -
Predicting Salient-Class Huang et al. [91] ✗ -
Prioritizing Review Requests Saini et al. [134] ✗ -
Recommending Reviewers Al et al. [24] ✗ -
Aryendu et al. [30] ✗ -
Asthana et al. [19] ✗ -
Balachandran et al. [8] ✗ -
Chouchen et al. [26] ✗ -
Jiang et al. [9] ✗ -
Jiang et al. [22] ✗ -
Jiang et al. [17] ✗ -
Kong et al. [42] ✗ -
Liao et al. [20] ✗ -
Ouni et al. [12] ✗ -
Pandya et al. [28] ✓ ✗
Rahman et al. [32] ✓ ✗
Rebai et al. [34] ✗ -
Rong et al. [40] ✗ -
Strand et al. [25] ✗ -
Xia et al. [11] ✗ -
Xia et al. [16] ✗ -
Ye et al. [35] ✗ -
Ying et al. [13] ✗ -
Yu et al. [14] ✗ -
Zanjan et al. [15] ✓ ✗
Zhang et al. [31] ✗ -
Reviewing via Static Analysis Balachandran et al. [8] ✗ -
Visualizing Code Changes Menarini et al. [139] ✗ -
Table 9. Works on code review automation providing a (still accessible at Jan 2025) replication package
Task Reference Link C D
Augmenting Reviews Hirao et al. [126] https://github.com/software-rebels/ReviewLinkageGraph ✓✓
Kartal et al. [124] https://github.com/ykartal/Github-SourceCode-Review ✓✓
Shuvo et al. [123] https://drive.google.com/file/d/15kq7LqvfY-oP1M1UDdK_lLmUfq71daVR/view ✗✓
Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment Fregnan et al. [45] https://zenodo.org/records/5592254 ✓✓
Li et al. [44] https://sites.google.com/view/core2019/ ✗✓
Turzo et al. [136] https://github.com/WSU-SEAL/CR-classification-ESEM23 ✓✓
Classifying the Usefulness of Review Comments Hasan et al. [81] https://github.com/WSU-SEAL/CRA-usefulness-model ✓✓
Yang et al. [82] https://zenodo.org/records/8297481 ✓✓
Generating Review Comments Hong et al. [48] https://github.com/awsm-research/CommentFinder ✓✓
Li et al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓
Li et al. [46] https://gitlab.com/ai-for-se-public-data/auger-fse-2022 ✓✓
Lin et al. [53] https://zenodo.org/records/10572047 ✓✓
Lu et al. [55] https://zenodo.org/records/7991113 ✓✓
Lu et al. [51] https://zenodo.org/records/10964945 ✓✓
Sghaier et al. [54] https://zenodo.org/records/10676741 ✓✓
Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓
Yu et al. [52] https://github.com/aiopsplus/Carllm ✗✓
Identifying Toxic/Uncivil Code Review Comments Ferreira et al. [117] https://doi.org/10.6084/m9.figshare.24603237 ✓✓
Rahman et al. [119] https://github.com/Oyakiolo052/ATUC_Artifacts ✓✓
Sarker et al. [116] https://github.com/WSU-SEAL/ToxiCR ✓✓
Sarker et al. [118] https://github.com/WSU-SEAL/ToxiSpanSE ✓✓
Identifying/Improving Review Comments Needing Further Explanations Widyasari et al. [84] https://figshare.com/s/135201b8f87ab705448b ✓✓
Impact Analysis for Code Review Hong et al. [93] https://figshare.com/s/135201b8f87ab705448b ✓✓
Implementing the Code Change Requested by a Reviewer Huq et al. [73] https://github.com/Review4Repair/Review4Repair ✓✓
Li et al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓
Lu et al. [76] https://github.com/moonmengmeng/EnRefiner ✓✓
Lu et al. [55] https://zenodo.org/records/7991113 ✓✓
Pornprasit et al. [75] https://github.com/awsm-research/LLM-for-code-review-automatiton ✓✓
Sghaier et al. [54] https://zenodo.org/records/10676741 ✓✓
Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓
Zhang et al. [74] https://github.com/EngineeringSoftware/CoditT5 ✓✓
Predicting Code Changes Approval, Merge, or Need for review Chouchen et al. [101] https://github.com/stilab-ets/CostAwareCR ✓✓
Chouchen et al. [99] https://github.com/stilab-ets/multicr ✓✓
Islam et al. [98] https://github.com/khairulislam/Predict-Code-Changes ✓✓
Li et al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓
Lu et al. [55] https://zenodo.org/records/7991113 ✓✓
Wu and Zhang [95] https://github.com/SimAST-GCN/CLMN ✓✓
Wu et al. [94] https://github.com/SimAST-GCN/SimAST-GCN ✓✓
Yang et al. [100] https://figshare.com/s/7930029ea5ec5af2845d ✓✓
Predicting Problematic Code Elements Hong et al. [111] https://github.com/awsm-research/RevSpot-replication-package ✓✓
Olewicki et al. [112] https://zenodo.org/records/10783562 ✓✓
Sghaier et al. [71] https://zenodo.org/records/7533156 ✓✓
Predicting Pull Request/Code Review Completion Time Chouchen et al. [132] https://github.com/stilab-ets/MCRDuration ✓✓
Yang et al. [100] https://zenodo.org/records/7533156 ✓✓
Predicting the Code Output of the Review Process Pornprasit et al. [129] https://github.com/awsm-research/D-ACT-Replication-Package ✓✓
Prioritizing Review Requests Chouchen et al. [101] https://github.com/stilab-ets/CostAwareCR ✓✓
Yang et al. [135] https://figshare.com/s/133f23da558b7b254041?file=46923235 ✓✓
Recommending Reviewers Ahasanuzzaman et al. [43] https://drive.google.com/drive/folders/1bSC9iRtjKjMTRa9hiyECijgABKGfpyT4 ✓✓
Chueshev et al. [33] https://github.com/alexchueshev/icsme2020 ✓✓
Fejzer et al. [18] https://github.com/mfejzer/reviewers_recommendation ✓✓
Hajari et al. [38] https://github.com/rigbypc/SofiaWL/tree/master/ReplicationPackage ✓✓
Li et al. [29] https://zenodo.org/record/7292881 ✓✓
Mirsaeedi et al. [23] https://zenodo.org/record/3678551#.ZFS5EC8RpBw ✗✓
Qiao et al. [37] https://github.com/cufeinfor/MIRRec ✓✓
Rahman et al. [39] https://zenodo.org/records/8190493 ✓✓
Sulun et al. [21] https://figshare.com/s/27a35b4ae70269481a2c ✓✓
Sulun et al. [41] https://github.com/sulunemre/rstrace-replication ✓✓
Tecimer et al. [27] https://figshare.com/s/1b9ea55377d9f2c31a7a ✓✓
Thongtanunam et al. [10] https://github.com/patanamon/revfinder ✗✓
Zhao et al. [36] https://github.com/liuj888/ReviewerRecommendationLtR ✗✗
Table 9 (continued): Works on code review automation providing a (still accessible at Jan 2025) replication package
Task Reference Link C D
Rephrasing Toxic/Uncivil Comments Rahman et al. [119] https://github.com/Oyakiolo052/ATUC_Artifacts ✓✓
Linking Similar Contributions Wang et al. [92] https://github.com/dong-w/Replication-Patch-Linkage ✓✓
Identifying Impactful Code Changes Uchôa et al. [104] https://zenodo.org/record/4563214#.Y0kjiexBwQg ✗✓
Identifying Large-review-effort Code Changes Wang et al. [105] https://bitbucket.org/wangsonging/ist2020_repo/src/master/ ✓✓
Reviewing Code Formatting Violations Markovtsev et al. [113] https://github.com/src-d/style-analyzer ✓✓
Assessing Review Quality through Biometrics Hijazi et al. [85] https://github.com/HaythamHijazi/Supplement ✓✓
Classifying the Sentiment of Review Comments Ahmed et al. [114] https://github.com/senticr/SentiCR/ ✓✓
Retrieving Similar Reviews Siow et al. [122] https://sites.google.com/view/core2019/ ✗✓
Predicting the Code Output of the Review ProcessHuq et al. [73] https://github.com/Review4Repair/Review4Repair ✓✓
Patanamon et al. [128] https://github.com/awsm-research/AutoTransform-Replication ✓✓
Tufano et al. [72] https://github.com/RosaliaTufano/code_review ✓✓
Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓
Visualizing Code Changes Brito and Valente [141] https://github.com/rodrigo-brito/refactoring-aware-diff ✓✗
Fadhel et al. [140] https://github.com/hadii-tech/striff-lib ✓✗
Fregnan et al. [142] https://zenodo.org/record/7047993#.Y2JqNS-B2Uo ✓✓
Configuring Static Code Analysis Tools Zampetti et al. [137] https://github.com/senticr/SentiCR/ ✗✓
Fig. 4. Availability of a working replication package by publication year
4.5.1 Performance-related issues. We use the term “performance” to refer to the ability of the technique to provide a
proper support for the automation of the targeted task (see footnote 2). Several researchers highlight the unsatisfactory recommendations
generated by the experimented techniques, which may make them not ready for developers’ adoption. This is a quite
crosscutting and expected concern, particularly affecting generative tasks requiring the generation of text or code ( e.g.,
implementing the code change requested by a reviewer ). However, even in classification tasks for which automated support
proved quite successful, some researchers raised major concerns about their actual effectiveness: Strand et al. [25]
observed that while their approach for reviewer recommendations performed well when evaluated on historical data, it
did not seem to save developers time once deployed in industry. This stresses the importance of experimenting with
the proposed techniques in realistic scenarios, which can provide feedback about their actual usefulness.
A very popular limitation of the automation techniques is also the limited support they offer in specific scenarios. The
term “scenario” here can have different meanings. For techniques relying on historical data such as those recommending
reviewers based on past assignments, or those retrieving code reviews performed in the past for code similar to the one
Footnote 2: This is unrelated to performance aspects such as memory footprint.
Table 10. Concerns and limitations discussed by researchers
Parent Category Child Category References
Performance Limited support in specific scenarios [41, 48, 54, 56, 72, 82, 87, 88, 91, 93]
Lack of generalizability across different datasets [56, 79, 117]
Unsatisfactory recommendations [8, 42, 44, 49, 72, 73, 84, 92, 115, 117, 128, 131, 137, 153]
Does not save time [25]
Noise in training data [19, 22, 33, 47, 49, 54, 93]
Bias in recommendations [10, 14, 19, 23, 25, 27, 38, 97, 110, 154]
Evaluation Suboptimal metrics [11, 21, 33, 47, 51, 105, 124, 132]
Reliability of oracle [9, 16, 19, 37, 40, 122]
Lack of tradeoffs assessment [12, 38, 73, 87, 130]
Data leakage [74]
Relevance for practitioners not assessed [45]
Usability Responsiveness and scalability [35, 77, 94, 121, 122, 131, 139, 142]
Steep learning curve [39, 77, 100, 139, 140]
Interpretability of LLMs [112]
Information overload [89, 142]
Deployment Difficult to integrate in developers’ workflow [88]
Human factors [115, 119]
Privacy concerns [86]
Too expensive [50, 56, 76, 101, 125]
under review, there are limitations related to their applicability on “previously unseen data”. For example, retrieving
reviews from the past does not allow the approach to generate previously unseen review comments [ 48], something
doable nowadays by training DL models. Thus, while retrieval-based techniques offer specific advantages even in
the context of generative tasks ( e.g.,they are substantially faster as compared to DL-based techniques), relying on
them may be recommendable mostly in quite stable contexts in which the development team, review process, and
the code base are not expected to undergo major and continuous changes (thus keeping the value of what was learned
in the past). One emerging concern is the already mentioned lack of support by DL-based code review automation
techniques for low-resource languages [54]. It is reasonable to expect that the performance of code review automation
techniques substantially drops for these languages, highlighting the importance of investigating the applications
of these tools in “niche usage scenarios”, as recently done in the context of code generation [155–157]. Also, some
researchers highlighted the limited applicability of their technique in specific scenarios which are, however, specific to
the tackled problem and experimented technique. For example, Huang et al. [91] claim that their approach to predict
the salient class of a commit is unable to deal with tangled commits.
Still related to the performance of the proposed automated techniques, several studies presenting learning-based
techniques report the presence of noise in training data as a major concern. Given the amount of data on which these
techniques rely for the learning, it is difficult to guarantee the quality of training data. For example, when looking
at reviewers’ recommenders, noise can come from developers using multiple accounts, which are treated as different
developers by the approach [22], or even from sub-optimal assignments made in the past, i.e., the reviewer assigned to a
pull request was not the most appropriate, but maybe the one who had a lower workload in that specific moment [19].
Similarly, researchers raised concerns about the quality of training data used for DL models aimed at generating review
comments or implementing the code change requested by a reviewer [ 47,49]. These approaches usually learn from
triplets featuring (i) the code submitted for review, (ii) the review comments posted by humans, and (iii) the revised
code implementing the changes requested in the review comments. This data is automatically mined from forges such
as GitHub and, consequently, can feature noisy data. For example, the collected revised code, while being a modified
version of the code submitted for review, may not actually implement the reviewers’ comments, but other unrequested
changes. The mined triplet will thus “teach” something wrong to the DL model. Despite the major cleaning efforts
performed by researchers [47, 49], noisy instances survive in the training data, since it is difficult for a single research
group to have the manpower to manually validate the whole dataset. A joint effort of the research community
working on the automation of code review activities would be needed (at least for specific tasks of interest), similarly to
what has been done in other fields like image recognition [158].
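The triplet structure described above, together with a very conservative automated cleaning step, can be sketched as follows. This is a minimal illustration under stated assumptions: `ReviewTriplet`, `is_obviously_noisy`, and `filter_triplets` are hypothetical names, and the two heuristics (empty comment, revised code identical to the submitted code up to whitespace) are not the cleaning pipeline of any surveyed paper. Such filters only catch trivially noisy instances; they cannot detect revised code that implements unrequested changes, which is why manual validation is still needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewTriplet:
    submitted_code: str   # (i) code submitted for review
    review_comment: str   # (ii) comment posted by the human reviewer
    revised_code: str     # (iii) code after the review


def normalize(code: str) -> str:
    """Collapse all whitespace so that pure formatting differences are ignored."""
    return " ".join(code.split())


def is_obviously_noisy(triplet: ReviewTriplet) -> bool:
    """Flag triplets that cannot teach a model anything about the comment."""
    if not triplet.review_comment.strip():
        return True  # no comment to learn from
    # The revision did not change the code beyond whitespace.
    return normalize(triplet.submitted_code) == normalize(triplet.revised_code)


def filter_triplets(triplets: list[ReviewTriplet]) -> list[ReviewTriplet]:
    return [t for t in triplets if not is_obviously_noisy(t)]


mined = [
    ReviewTriplet("int h = 0;", "rename h to height", "int height = 0;"),
    ReviewTriplet("int h = 0;", "rename h to height", "int  h = 0;"),  # whitespace-only change
    ReviewTriplet("int h = 0;", "   ", "int height = 0;"),             # empty comment
]
print(len(filter_triplets(mined)))  # number of triplets surviving the filter
```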
Finally, researchers working on reviewers’ recommendation and predicting code changes approval/merge report
the presence of bias in recommendations generated by their techniques [ 10,14,19,23,25,27,38,97,110,154]. When it
comes to recommending reviewers, bias manifests in the fact that reviewers who have been employed more in the past
will also be employed more in the future. The bias becomes even more evident if the approach is re-trained over time to
include new data, also featuring pull requests in which the approach has been employed (thus again promoting over and
over the same reviewers). When it comes to predicting code changes approval/merge, researchers reported a negative
bias of the techniques towards pull requests opened by newcomers. These two examples highlight the importance
of considering human factors in the evaluation of the proposed techniques, besides computing performance-related
metrics.
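The self-reinforcing nature of this bias can be illustrated with a toy simulation. The sketch below is deliberately simplistic and is not a model of any surveyed technique: it assumes a hypothetical recommender that always suggests the reviewer with the most past reviews, that every suggestion is accepted, and that the history is updated after each pull request (mimicking re-training on data produced by the tool itself). Reviewer names and counts are invented.

```python
def recommend(history: dict[str, int]) -> str:
    """Suggest the reviewer with the most past reviews (ties broken alphabetically)."""
    return max(sorted(history), key=lambda reviewer: history[reviewer])


def simulate(history: dict[str, int], pull_requests: int) -> dict[str, int]:
    """Assign each new pull request to the recommended reviewer and update the history."""
    counts = dict(history)
    for _ in range(pull_requests):
        counts[recommend(counts)] += 1
    return counts


def top_share(counts: dict[str, int]) -> float:
    """Fraction of all reviews performed by the busiest reviewer."""
    return max(counts.values()) / sum(counts.values())


initial = {"alice": 3, "bob": 2, "carol": 1}
final = simulate(initial, pull_requests=10)

print(final)
print(top_share(initial), top_share(final))
```

In this toy setting the initially most active reviewer receives all ten new assignments, so her share of the review workload grows from 3/6 to 13/16.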
4.5.2 Evaluation-related issues. For this category, we discuss the first three types of concern reported in Table 10, since
the last two (i.e., data leakage and relevance for practitioners not assessed) have only been reported in one paper each.
The usage of suboptimal metrics in the empirical validations is a major concern for the new line of research
tackling generative tasks [47, 51, 124]. For example, in the “generating review comments” task the technique is provided
as input with code to review and it is expected to comment on its quality in natural language as a human would do. The
question is how to automatically assess the quality of the generated comments. As explained in the context of RQ 3, since
the code to review is usually mined from open source projects, researchers usually compute a similarity metric ( e.g.,
the BLEU score [ 150]) between the generated comment and the comments that were posted by human reviewers for
that same code. However, there are many issues with this evaluation procedure. First, two completely different natural
language comments may point to the same quality issue in the code. For example, the deep learning model may output
“please rename variable h to something more meaningful” while the human reviewer may write “change h to height”. A
textual similarity metric between these two comments would point to the low quality of the comment generated by the
deep learning model while, in reality, the technique outputted a meaningful recommendation. Second, the model may
correctly spot a quality issue which, however, has been missed by the human reviewers, thus not having any “similar
human comment” to compare with. This again would result in a correct recommendation considered wrong. On top of
generative tasks, there are other code review automation tasks for which the metrics used for evaluation only represent
a weak proxy for the actual usefulness of the approach. For example, Chueshev et al. [33] stress that evaluation metrics
such as top@k accuracy and MRR which they use in the context of reviewer recommendation might not align with the
practical use of their technique, since they do not focus on the actual value added by new reviewer recommendations.
A strongly related concern possibly affecting the validity of the reported empirical evaluations is the limited reliability
of oracles, which mirrors the previously discussed “noise in training data” but on the “test data”. Future work should
aim at (i) defining metrics better capturing the actual usefulness of code review automation techniques, for example
assessing the relevance of a generated comment for a given code under review, rather than only comparing it with
comments written by humans; and (ii) creating curated benchmarks for code review automation, similarly to what the
research community is doing for code generation [159].
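The first issue above can be illustrated with a minimal sketch. The code below is not the BLEU implementation used in the surveyed papers: it is a simplified, single-reference BLEU-style score (clipped unigram and bigram precision with a brevity penalty, no smoothing), and the function names as well as the third comment ("change w to height") are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams contained in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference (clipped counts)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU-style score: geometric mean of clipped 1..max_n-gram
    precisions, multiplied by a brevity penalty (single reference, no smoothing)."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return geo_mean * brevity


# Comment written by the human reviewer (the oracle).
human = "change h to height".split()
# Generated comment pointing to the same issue with different wording.
generated = "please rename variable h to something more meaningful".split()
# Hypothetical comment about a different variable that merely looks similar.
lookalike = "change w to height".split()

unigram_equivalent = clipped_precision(generated, human, 1)
score_equivalent = simple_bleu(generated, human)
score_lookalike = simple_bleu(lookalike, human)
```

Under these assumptions, the semantically equivalent comment obtains a score of about 0.19, while the comment targeting a different variable obtains 0.50, i.e., the textual similarity ranks the less useful comment higher.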
The third evaluation-related issue we discuss concerns the lack of tradeoff assessment. Concrete examples of this
issue may be: (i) focusing the evaluation of a reviewer recommender only on the correctness of the recommendation,
without considering the workload distribution among reviewers as one of the objectives to meet [ 38]; (ii) not considering
that specific decisions may be influenced by interpersonal relationships rather than by objective factors [ 12]; and
(iii) ignoring in the evaluation the cost of adopting a novel tool the developers are not familiar with [87]. A more
comprehensive view of the tradeoffs that come into play when a new code review automation technique is proposed
would be desirable in the performed empirical evaluations. However, this may not be doable without running case
studies, which are non-trivial to run. At least, a careful discussion of the tradeoffs left unassessed is recommended for
works in the area, especially considering the socio-technical nature of code reviews.
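As a minimal sketch of the first example, the code below computes two correctness metrics commonly reported for reviewer recommenders (top@k accuracy and MRR) next to a simple indicator of workload concentration. The two recommenders, the reviewer names, the data, and the `top1_workload_share` indicator are hypothetical and only serve the illustration; they are not taken from any of the surveyed papers.

```python
from collections import Counter


def top_k_accuracy(recommendations, actual, k):
    """Share of changes for which at least one actual reviewer is in the top k."""
    hits = sum(1 for recs, truth in zip(recommendations, actual)
               if set(recs[:k]) & truth)
    return hits / len(recommendations)


def mean_reciprocal_rank(recommendations, actual):
    """Average of 1/rank of the first correct reviewer (0 if none is recommended)."""
    total = 0.0
    for recs, truth in zip(recommendations, actual):
        for rank, reviewer in enumerate(recs, start=1):
            if reviewer in truth:
                total += 1.0 / rank
                break
    return total / len(recommendations)


def top1_workload_share(recommendations):
    """Share of changes whose first recommendation is the most-recommended reviewer
    (hypothetical indicator: 1.0 means a single person is always ranked first)."""
    counts = Counter(recs[0] for recs in recommendations)
    return max(counts.values()) / len(recommendations)


# Hypothetical oracle: reviewers who actually reviewed each of four changes.
actual = [{"alice"}, {"alice", "carol"}, {"alice"}, {"dave", "alice"}]

# Recommender A always ranks the same expert first.
rec_a = [["alice", "bob", "carol"], ["alice", "carol", "bob"],
         ["alice", "bob", "dave"], ["alice", "dave", "bob"]]
# Recommender B spreads the first position across reviewers.
rec_b = [["alice", "bob", "carol"], ["carol", "alice", "bob"],
         ["bob", "alice", "dave"], ["dave", "alice", "bob"]]

results = {
    name: (top_k_accuracy(recs, actual, 1),
           mean_reciprocal_rank(recs, actual),
           top1_workload_share(recs))
    for name, recs in (("A", rec_a), ("B", rec_b))
}
```

On this toy data, recommender A reaches 1.0 for both top@1 accuracy and MRR but assigns every change to the same reviewer, whereas recommender B scores 0.75 and 0.875 while no reviewer is ranked first for more than a quarter of the changes. An evaluation reporting only the first two numbers would not expose this tradeoff.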
4.5.3 Usability-related issues. Automation solutions proposed in academia are often implemented in the form of
prototypes, with little attention given to non-functional attributes such as responsiveness and scalability [35,77,94,121,
122,131,139,142]. Sometimes these issues are indeed the result of non-optimized code, while in other cases they are intrinsic
limitations of the proposed approaches. For example, retrieval-based techniques may experience an increasing lack of
responsiveness with the growth of the knowledge base from which information is retrieved. Similarly, visualization
techniques may not scale to accommodate overly complex objects or large amounts of information (e.g., a very large code diff
in the context of code review).
The developed solutions may also be characterized by a steep learning curve [39,77,100,139,140], which is however
difficult to assess without human-based studies. Finally, usability concerns may come from information overload [89,142]
and the lack of interpretability of deep learning models [112]. Concerning the former, Fregnan et al. [142] discuss the
risk of information overload when visualizing too intricate merge requests, with the concrete risk of hiding useful
information from the reviewers rather than helping them in the code inspection.
Since the overall goal of code review automation is to save software developers’ time, the usability of the
proposed solutions should be considered as a first-class citizen, both at design and evaluation time. Currently, most
techniques are assessed in in-vitro evaluations, relying on test sets built by mining software repositories. These
evaluations completely neglect the “usability” aspect. For some code review tasks for which automation has only been
recently targeted (e.g., generating review comments) this may be reasonable, considering that the proposed solutions are
still far from generating meaningful recommendations most of the time. However, for other tasks such as recommending
reviewers, dozens of techniques have been proposed, with the most recent ones achieving excellent performance on the
artificial benchmarks. Investigating their usability is now important.
4.5.4 Deployment-related issues. Related to the former concerns are the deployment-related issues discussed by
researchers. We found these issues only discussed in a few papers and mostly pointing to the possibility that the
proposed technique may be too expensive to deploy in practice [ 50,56,76,101,125]. This type of concern is mostly
related to the proposal of AI-based solutions: Deploying an in-house AI assistant may require substantial monetary
investments for training large DL models, making them available on powerful servers, and maintaining them ( e.g.,
retraining them) to keep their usefulness over time. Given the growth of the AI4SE research field, it is important to
step back and also consider the cost-related implications of these techniques, both in terms of money and environmental
impact. Integrating techniques such as quantization [ 160], knowledge distillation [ 161], and parameter-efficient fine-
tuning [ 162] can help in reducing both the memory footprint and the training/inference cost of the proposed solutions.
Still, the impact of these techniques on the performance (quality of recommendations) of code review automation tools
must be carefully assessed (e.g., a quantized model may experience a substantial loss of performance when compared to
the original model).
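To make the tradeoff concrete, the following toy sketch applies symmetric per-tensor 8-bit quantization to a handful of weights and measures both the storage saved and the error introduced. It is only an illustration under simplifying assumptions: the weight values and function names are hypothetical, and the quantization schemes adopted for large models (e.g., those discussed in [160]) are considerably more sophisticated.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical weights of a tiny layer (illustrative values only).
weights = [0.8, -0.31, 0.05, -1.27, 0.002, 0.64]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
error = mean_abs_error(weights, restored)

# Storage: 4 bytes per 32-bit float versus 1 byte per 8-bit integer,
# plus one 32-bit float for the shared scale factor.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4
```

In this example the storage drops from 24 to 10 bytes, but the smallest weight (0.002) is rounded to zero. Whether such losses affect the quality of the recommendations cannot be derived from the weights alone and has to be measured on the target code review task.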
5 THREATS TO VALIDITY
Threats to construct validity concern the relation between theory and observation. We only included papers indexed
in the six queried databases. Also, we only focused on works published in software engineering venues. Thus, there
might be additional studies we missed. The snowballing procedure we applied helps in mitigating this threat, despite
the fact that we only performed one round of snowballing. We believe that most relevant studies were included based
on the expertise the authors have in this domain. Also, the number of papers included in our SLR is large enough to
answer our research questions and the main findings are unlikely to change even assuming a few missing works. Moreover,
as a design decision, we did not apply any quality assessment criteria to exclude studies from our SLR. Indeed, we
felt that the subjectiveness of this judgement was too high and decided to consider peer review as a sort of
“automated quality filter”. We acknowledge that some peer-reviewed studies included in our SLR might feature flaws
or wrong claims which could also potentially affect our findings (e.g., wrong data extracted).
Threats to internal validity concern external factors we did not consider that could affect the variables being
investigated. The search engines we used are continuously updated, both in terms of search features as well as in terms
of papers they index. We cannot ensure replicability of our findings. However, we provide all material we collected in
an online appendix [78].
Threats to external validity concern the generalizability of our findings. We decided to focus our SLR only on the
literature proposing code review automation tools ignoring, for example, the body of knowledge related to empirical
studies on code review. Furthermore, as our paper search was conducted in December 2024, our SLR misses works
that were indexed later in the searched databases.
Threats to conclusion validity concern the relations between the conclusions and our analyzed data. The main
threat here is related to the correctness of the data we extracted from the inspected papers. To minimize errors, the two
authors always double-checked the information each of them collected. However, especially in the context of RQ 5, we
felt that a strong subjectivity component was involved in deciding what should be considered as a limitation/concern
discussed by the paper’s authors. We acknowledge that we may have missed several insights reported in the read works.
Still, we feel the set of limitations/concerns discussed in RQ 5 to be quite representative of those we encountered while
reading papers for this SLR.
6 CONCLUSIONS
We presented a systematic literature review involving 119 papers presenting solutions for the automation of code
review-related tasks. Firstly, we categorized the 34 tasks for which at least one automated approach has been proposed.
We then summarized the under-the-hood solutions behind these approaches, and the metrics used in their empirical
evaluation. We also looked for the presence of replication packages in the 119 papers, checking for the available ones
whether they are still reachable and if they provide access to the presented approach and the used data. In the end,
we highlighted the concerns and limitations researchers discussed when presenting and evaluating the proposed
approaches, using them to highlight possible directions for future work.
We release the raw data summarized in the SLR in our online appendix [78].
ACKNOWLEDGMENTS
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant agreement No. 851720).
REFERENCES
[1] M. E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal , 15(3):182–211, 1976.
[2] Alberto Bacchelli and Christian Bird. Expectations, outcomes, and challenges of modern code review. In 35th IEEE/ACM International Conference on
Software Engineering, ICSE, pages 712–721, 2013.
[3] Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E. Hassan. The impact of code review coverage and code review participation on software
quality: A case study of the qt, vtk, and itk projects. In 11th IEEE/ACM Working Conference on Mining Software Repositories, MSR, pages 192–201, 2014.
[4] Rodrigo Morales, Shane McIntosh, and Foutse Khomh. Do code review practices impact design quality? a case study of the qt, vtk, and itk projects.
In 22nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER, pages 171–180, 2015.
[5] Gabriele Bavota and Barbara Russo. Four eyes are better than two: On the impact of code reviews on software quality. In IEEE International Conference
on Software Maintenance and Evolution, ICSME, pages 81–90, 2015.
[6] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. Modern code review: A case study at google. In 40th
International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP, pages 181–190, 2018.
[7] A. Bosu and J. C. Carver. Impact of peer code review on peer impression formation: A survey. In 7th IEEE/ACM International Symposium on Empirical
Software Engineering and Measurement, ESEM, pages 133–142, 2013.
[8] Vipin Balachandran. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation.
In 2013 35th International Conference on Software Engineering (ICSE), pages 931–940. IEEE, 2013.
[9]Jing Jiang, Jia-Huan He, and Xue-Yuan Chen. Coredevrec: Automatic core member recommendation for contribution evaluation. J. Comput. Sci.
Technol. , 30(5):998–1016, 2015.
[10] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. Who
should review my code? a file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International
Conference on Software Analysis, Evolution, and Reengineering (SANER) , pages 141–150. IEEE, 2015.
[11] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. Who should review this change?: Putting text and file location analyses together for more
accurate recommendations. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pages 261–270, 2015.
[12] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. Search-based peer reviewers recommendation in modern code review. In 2016 IEEE International
Conference on Software Maintenance and Evolution (ICSME) , pages 367–377. IEEE, 2016.
[13] Haochao Ying, Liang Chen, Tingting Liang, and Jian Wu. Earec: Leveraging expertise and authority for pull-request reviewer recommendation in
github. In 2016 IEEE/ACM 3rd International Workshop on CrowdSourcing in Software Engineering (CSI-SE) , pages 29–35, 2016.
[14] Yue Yu, Huaimin Wang, Gang Yin, and Tao Wang. Reviewer recommendation for pull-requests in github: What can we learn from code review and
bug assignment? Information and Software Technology , 74:204–218, 2016.
[15] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. Automatically recommending peer reviewers in modern code review. IEEE
Transactions on Software Engineering , 42(6):530–543, 2016.
[16] Zhenglin Xia, Hailong Sun, Jing Jiang, Xu Wang, and Xudong Liu. A hybrid approach to code reviewer recommendation with collaborative filtering.
In 2017 6th International Workshop on Software Mining (SoftwareMining), pages 24–31. IEEE, 2017.
[17] Jing Jiang, Yun Yang, Jiahuan He, Xavier Blanc, and Li Zhang. Who should comment on this pull request? analyzing attributes for more accurate
commenter recommendation in pull-based development. Information and Software Technology , 84:48–62, 2017.
[18] Mikołaj Fejzer, Piotr Przymus, and Krzysztof Stencel. Profile based recommendation of code reviewers. Journal of Intelligent Information Systems ,
50:597–619, 2018.
[19] Sumit Asthana, Rahul Kumar, Ranjita Bhagwan, Christian Bird, Chetan Bansal, Chandra Maddila, Sonu Mehta, and B. Ashok. Whodo: Automating
reviewer suggestions at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the
Foundations of Software Engineering , ESEC/FSE 2019, page 937–945, New York, NY, USA, 2019. Association for Computing Machinery.
[20] Zhifang Liao, Zexuan Wu, Jinsong Wu, Yan Zhang, Junyi Liu, and Jun Long. Tirr: A code reviewer recommendation algorithm with topic model and
reviewer influence. In 2019 IEEE Global Communications Conference (GLOBECOM) , pages 1–6. IEEE, 2019.
[21] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. Reviewer recommendation using software artifact traceability graphs. In Proceedings of the fifteenth
international conference on predictive models and data analytics in software engineering , pages 66–75, 2019.
[22] Jing Jiang, David Lo, Jiateng Zheng, Xin Xia, Yun Yang, and Li Zhang. Who should make decision on this pull request? analyzing time-decaying
relationships and file similarities for integrator prediction. Journal of Systems and Software , 154:196–210, 2019.
[23] Ehsan Mirsaeedi and Peter C. Rigby. Mitigating turnover with code review recommendation: Balancing expertise, workload, and knowledge
distribution. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , ICSE ’20, page 1183–1195, New York, NY, USA,
2020. Association for Computing Machinery.
[24] Wisam Haitham Abbood Al-Zubaidi, Patanamon Thongtanunam, Hoa Khanh Dam, Chakkrit Tantithamthavorn, and Aditya Ghose. Workload-aware
reviewer recommendation using a multi-objective search-based approach. In Proceedings of the 16th ACM International Conference on Predictive
Models and Data Analytics in Software Engineering , pages 21–30, 2020.
[25] Anton Strand, Markus Gunnarson, Ricardo Britto, and Muhmmad Usman. Using a context-aware approach to recommend code reviewers: findings
from an industrial case study. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice ,
pages 1–10, 2020.
[26] Moataz Chouchen, Ali Ouni, Mohamed Wiem Mkaouer, Raula Gaikovina Kula, and Katsuro Inoue. Whoreview: A multi-objective search-based
approach for code reviewers recommendation in modern code review. Applied Soft Computing , 100:106908, 2021.
[27] K Ayberk Tecimer, Eray Tüzün, Hamdi Dibeklioglu, and Hakan Erdogmus. Detection and elimination of systematic labeling bias in code reviewer
recommendation systems. In Evaluation and Assessment in Software Engineering , pages 181–190. 2021.
[28] Prahar Pandya and Saurabh Tiwari. Corms: A github and gerrit based hybrid code reviewer recommendation approach for modern code review. In
Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ESEC/FSE
2022, page 546–557, New York, NY, USA, 2022. Association for Computing Machinery.
[29] Ruiyin Li, Peng Liang, and Paris Avgeriou. Code reviewer recommendation for architecture violations: An exploratory study. In Proceedings of the
27th International Conference on Evaluation and Assessment in Software Engineering, EASE 2023 , pages 42–51. ACM, 2023.
[30] Ishan Aryendu, Ying Wang, Farah Elkourdi, and Eman Abdullah Alomar. Intelligent code review assignment for large scale open source software
stacks. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE , page to appear, 2023.
[31] Jiyang Zhang, Chandra Maddila, Ram Bairi, Christian Bird, Ujjwal Raizada, Apoorva Agrawal, Yamini Jhawar, Kim Herzig, and Arie van Deursen.
Using large-scale heterogeneous graph representation learning for code review recommendations at microsoft. In 2023 IEEE/ACM 45th International
Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) , pages 162–172, 2023.
[32] Mohammad Masudur Rahman, Chanchal K. Roy, and Jason A. Collins. Correct: Code reviewer recommendation in github based on cross-project
and technology experience. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C) , pages 222–231, 2016.
[33] Aleksandr Chueshev, Julia Lawall, Reda Bendraou, and Tewfik Ziadi. Expanding the number of reviewers in open-source projects by recommending
appropriate developers. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pages 499–510, 2020.
[34] Soumaya Rebai, Abderrahmen Amich, Somayeh Molaei, Marouane Kessentini, and Rick Kazman. Multi-objective code reviewer recommendations:
Balancing expertise, availability and collaborations. Automated Software Engineering , page 301–328, 2020.
[35] Xin Ye. Learning to rank reviewers for pull requests. IEEE Access , pages 85382–85391, 2019.
[36] Guoliang Zhao, Jiawen Liu, Daniel Alencar da Costa, and Ying Zou. Adopting learning-to-rank algorithm for reviewer recommendation. In
Paria Shirani, Iosif-Viorel Onut, and Tinny Ng, editors, Proceedings of the 32nd Annual International Conference on Computer Science and Software
Engineering, CASCON 2022 , pages 22–31. ACM, 2022.
[37] Yu Qiao, Jian Wang, Can Cheng, Wei Tang, Peng Liang, Yuqi Zhao, and Bing Li. Code reviewer recommendation based on a hypergraph with
multiplex relationships. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024 , pages 417–428. IEEE, 2024.
[38] Fahimeh Hajari, Samaneh Malmir, Ehsan Mirsaeedi, and Peter C. Rigby. Factoring expertise, workload, and turnover into code review recommendation.
IEEE Trans. Software Eng. , 50(4):884–899, 2024.
[39] Md Shamimur Rahman, Debajyoti Mondal, Zadia Codabux, and Chanchal K. Roy. Integrating visual aids to enhance the code reviewer selection
process. In IEEE International Conference on Software Maintenance and Evolution, ICSME 2023, Bogotá, Colombia, October 1-6, 2023 , pages 293–305.
IEEE, 2023.
[40] Guoping Rong, Yifan Zhang, Lanxin Yang, Fuli Zhang, Hongyu Kuang, and He Zhang. Modeling review history for reviewer recommendation: a
hypergraph approach. In Proceedings of the 44th International Conference on Software Engineering , ICSE ’22, page 1381–1392, 2022.
[41] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. Rstrace+: Reviewer suggestion using software artifact traceability graphs. Information and Software
Technology , 130:106455, 2021.
[42] Dezhen Kong, Qiuyuan Chen, Lingfeng Bao, Chenxing Sun, Xin Xia, and Shanping Li. Recommending code reviewers for proprietary software
projects: A large scale study. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA,
March 15-18, 2022 , pages 630–640. IEEE, 2022.
[43] Md. Ahasanuzzaman, Gustavo Ansaldi Oliva, and Ahmed E. Hassan. Using knowledge units of programming languages to recommend reviewers
for pull requests: an empirical study. Empir. Softw. Eng. , 29(1):33, 2024.
[44] Zhixing Li, Yue Yu, Gang Yin, Tao Wang, Qiang Fan, and Huaimin Wang. Automatic classification of review comments in pull-based development
model. In SEKE , pages 572–577, 2017.
[45] Enrico Fregnan, Fernando Petrulio, Linda Di Geronimo, and Alberto Bacchelli. What happens in my code reviews? an investigation on automatically
classifying review changes. Empirical Software Engineering , 27(4):89, 2022.
[46] Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo. Auger: automatically generating review comments
with pre-training models. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of
Software Engineering , pages 1009–1021, 2022.
[47] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel
Sundaresan. Automating code review activities by large-scale pre-training. In 30th ACM Joint European Software Engineering Conference and the
ACM/SIGSOFT International Symposium on the Foundations of Software Engineering ESEC-FSE , pages 1035–1047, 2022.
[48] Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti. Commentfinder: a simpler, faster, more accurate code
review comments recommendation. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations
of Software Engineering , pages 507–519, 2022.
[49] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. Using pre-trained models to boost
code review automation. In 44th IEEE/ACM International Conference on Software Engineering, ICSE , pages 2291–2302, 2022.
[50] Manushree Vijayvergiya, Malgorzata Salawa, Ivan Budiselic, Dan Zheng, Pascal Lamblin, Marko Ivankovic, Juanjo Carin, Mateusz Lewko, Jovan
Andonov, Goran Petrovic, Daniel Tarlow, Petros Maniatis, and René Just. Ai-assisted assessment of coding practices in modern code review. In Bram
Adams, Thomas Zimmermann, Ipek Ozkaya, Dayi Lin, and Jie M. Zhang, editors, Proceedings of the 1st ACM International Conference on AI-Powered
Software, AIware 2024 . ACM, 2024.
[51] Junyi Lu, Zhangyi Li, Chenjie Shen, Li Yang, and Chun Zuo. Exploring the impact of code review factors on the code review comment generation.
Autom. Softw. Eng. , 31(2):71, 2024.
[52] Yongda Yu, Guoping Rong, Haifeng Shen, He Zhang, Dong Shao, Min Wang, Zhao Wei, Yong Xu, and Juhong Wang. Fine-tuning large language
models to improve accuracy and comprehensibility of automated code review. ACM Trans. Softw. Eng. Methodol. , 34(1), 2024.
[53] Hong Yi Lin, Patanamon Thongtanunam, Christoph Treude, and Wachiraphan Charoenwet. Improving automated code reviews: Learning from
experience. In Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou, editors, 21st IEEE/ACM International Conference on Mining Software
Repositories, MSR 2024 , pages 278–283. ACM, 2024.
[54] Oussama Ben Sghaier and Houari A. Sahraoui. Improving the learning of code review successive tasks with cross-task knowledge distillation. Proc.
ACM Softw. Eng. , 1(FSE):1086–1106, 2024.
[55] Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun Zuo. Llama-reviewer: Advancing code review automation with large language models through
parameter-efficient fine-tuning. In 34th IEEE International Symposium on Software Reliability Engineering, ISSRE 2023 , pages 647–658. IEEE, 2023.
[56] Mona Nashaat and James Miller. Towards efficient fine-tuning of language models with organizational data for automated software review. IEEE
Trans. Software Eng. , 50(9):2240–2253, 2024.
[57] Christoph Hannebauer, Michael Patalas, Sebastian Stünkel, and Volker Gruhn. Automatically recommending code reviewers based on their expertise:
An empirical comparison. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016 , page 99–110,
2016.
[58] Flavia Coelho, Tiago Massoni, and Everton L.G. Alves. Refactoring-aware code review: A systematic mapping study. In IEEE/ACM 3rd International
Workshop on Refactoring (IWoR’19) , pages 63–66, 2019.
[59] Nicole Davila and Ingrid Nunes. A systematic literature review and taxonomy of modern code review. Journal of Systems and Software , 177:110951,
2021.
[60] Deepika Badampudi, Ricardo Britto, and Michael Unterkalmsteiner. Modern code reviews - preliminary results of a systematic mapping study. In
Proceedings of the Evaluation and Assessment on Software Engineering EASE’19 , page 340–345, 2019.
[61] Ilenia Fronza, Arto Hellas, Petri Ihantola, and Tommi Mikkonen. Code reviews, software inspections, and code walkthroughs: Systematic mapping
study of research topics. In Software Quality: Quality Intelligence in Software and Systems Engineering , pages 121–133, 2020.
[62] Dong Wang, Yuki Ueda, Raula Gaikovina Kula, Takashi Ishio, and Kenichi Matsumoto. Can we benchmark code review studies? a systematic
mapping study of methodology, dataset, and metric. Journal of Systems and Software , 180:111009, 2021.
[63] Barbara Kitchenham and Stuart Charters. Guidelines for performing systematic literature reviews in software engineering. 2007.
[64] Acm digital library. https://dl.acm.org/.
[65] Elsevier sciencedirect. https://www.sciencedirect.com/.
[66] Ieee xplore digital library. https://ieeexplore.ieee.org/.
[67] Scopus. https://www.scopus.com/.
[68] Springer link online library. https://link.springer.com/.
[69] Wiley online library. https://onlinelibrary.wiley.com/.
[70] Gali Halevi, Henk Moed, and Judit Bar-Ilan. Suitability of google scholar as a source of scientific information and as a source of data for scientific
evaluation: Review of the literature. Journal of Informetrics , 11(3):823–834, 2017.
[71] Oussama Ben Sghaier and Houari Sahraoui. A multi-step learning approach to assist code review. In 2023 IEEE International Conference on Software
Analysis, Evolution and Reengineering (SANER) , pages 450–460, 2023.
[72] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. Towards automating code review activities. In 43rd
IEEE/ACM International Conference on Software Engineering, ICSE , pages 163–174, 2021.
[73] Faria Huq, Masum Hasan, Md Mahim Anjum Haque, Sazan Mahbub, Anindya Iqbal, and Toufique Ahmed. Review4repair: Code review aided
automatic program repairing. Information and Software Technology , 143:106765, 2022.
[74] Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. Coditt5: Pretraining for source code and natural language
editing. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022 , pages 22:1–22:12. ACM, 2022.
[75] Chanathip Pornprasit and Chakkrit Tantithamthavorn. Fine-tuning and prompt engineering for large language models-based code review automation.
Information and Software Technology , 175:107523, 2024.
[76] Jiawei Lu, Zhijie Tang, and Zhongxin Liu. Improving code refinement for code review via input reconstruction and ensemble learning. In 30th
Asia-Pacific Software Engineering Conference, APSEC 2023, Seoul, Republic of Korea, December 4-7, 2023 , pages 161–170. IEEE, 2023.
[77] Alexander Froemmgen, Jacob Austin, Peter Choy, Nimesh Ghelani, Lera Kharatyan, Gabriela Surita, Elena Khrapko, Pascal Lamblin, Pierre-Antoine
Manzagol, Marcus Revaj, Maxim Tabachnyk, Daniel Tarlow, Kevin Villela, Daniel Zheng, Satish Chandra, and Petros Maniatis. Resolving code
review comments with machine learning. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice ,
ICSE-SEIP ’24, page 204–215, 2024.
[78] Online appendix. https://github.com/RosaliaTufano/Automating-Code-Review_SLR.
[79] Thai Pangsakulyanont, Patanamon Thongtanunam, Daniel Port, and Hajimu Iida. Assessing mcr discussion usefulness using semantic similarity. In
2014 6th International Workshop on Empirical Software Engineering in Practice , pages 49–54. IEEE, 2014.
[80] Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. Predicting usefulness of code review comments using textual features and
developer experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) , pages 215–226. IEEE, 2017.
[81] Masum Hasan, Anindya Iqbal, Mohammad Rafid Ul Islam, AJM Imtiajur Rahman, and Amiangshu Bosu. Using a balanced scorecard to identify
opportunities to improve code review effectiveness: An industrial experience report. Empirical Software Engineering , 26:1–34, 2021.
[82] Lanxin Yang, Jinwei Xu, Yifan Zhang, He Zhang, and Alberto Bacchelli. Evacrc: Evaluating code review comments. In Satish Chandra, Kelly Blincoe,
and Paolo Tonella, editors, Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software
Engineering, ESEC/FSE 2023 , pages 275–287. ACM, 2023.
[83] Shadikur Rahman, Umme Ayman Koana, and Maleknaz Nayebi. Example driven code review explanation. In Proceedings of the 16th ACM/IEEE
International Symposium on Empirical Software Engineering and Measurement , pages 307–312, 2022.
[84] Ratnadira Widyasari, Ting Zhang, Abir Bouraffa, Walid Maalej, and David Lo. Explaining explanations: An empirical study of explanations in code
reviews. ACM Trans. Softw. Eng. Methodol. , December 2024.
[85] Haytham Hijazi, Joao Duraes, Ricardo Couceiro, Joao Castelhano, Raul Barbosa, Júlio Medeiros, Miguel Castelo-Branco, Paulo De Carvalho, and
Henrique Madeira. Quality evaluation of modern code reviews through intelligent biometric program comprehension. IEEE Transactions on Software
Engineering , (01):1–1, 2022.
[86] Haytham Hijazi, José Cruz, João Castelhano, Ricardo Couceiro, Miguel Castelo-Branco, Paulo de Carvalho, and Henrique Madeira. ireview: an
intelligent code review evaluation tool using biofeedback. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE) ,
pages 476–485. IEEE, 2021.
[87] Yida Tao and Sunghun Kim. Partitioning composite code changes to facilitate code review. In 2015 IEEE/ACM 12th Working Conference on Mining
Software Repositories , pages 180–190. IEEE, 2015.
[88] Mike Barnett, Christian Bird, João Brunet, and Shuvendu K. Lahiri. Helping developers help themselves: Automatic decomposition of code review
changesets. In 37th IEEE/ACM International Conference on Software Engineering, ICSE , pages 134–144, 2015.
[89] Min Wang, Zeqi Lin, Yanzhen Zou, and Bing Xie. Cora: Decomposing and describing tangled code changes for reviewer. In 2019 34th IEEE/ACM
International Conference on Automated Software Engineering (ASE) , pages 1050–1061. IEEE, 2019.
[90] Kim Herzig and Andreas Zeller. The impact of tangled code changes. In Proceedings of the 10th Working Conference on Mining Software Repositories,
MSR ’13 , pages 121–130, 2013.
[91] Yuan Huang, Nan Jia, Xiangping Chen, Kai Hong, and Zibin Zheng. Code review knowledge perception: Fusing multi-features for salient-class
location. IEEE Transactions on Software Engineering , 48(5):1463–1479, 2020.
[92] Dong Wang, Raula Gaikovina Kula, Takashi Ishio, and Kenichi Matsumoto. Automatic patch linkage detection in code review using textual content
and file location features. Information and Software Technology , 139:106637, 2021.
[93] Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti. Don’t forget to change these functions! recommending
co-changed functions in modern code review. Inf. Softw. Technol. , 176:107547, 2024.
[94] Bingting Wu, Bin Liang, and Xiaofang Zhang. Turn tree into graph: Automatic code review via simplified ast driven graph convolutional network.
Knowledge-Based Systems , 252:109450, 2022.
[95] Bingting Wu and Xiaofang Zhang. Contrastive learning for multi-modal automatic code review. arXiv preprint arXiv:2205.14289 , 2022.
[96] Shu-Ting Shi, Ming Li, David Lo, Ferdian Thung, and Xuan Huo. Automatic code review by learning the revision of source code. In Proceedings of
the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI
Symposium on Educational Advances in Artificial Intelligence . AAAI Press, 2019.
[97] Yuanrui Fan, Xin Xia, David Lo, and Shanping Li. Early prediction of merged code changes to prioritize reviewing tasks. Empirical Software
Engineering , 23:3346–3393, 2018.
[98] Khairul Islam, Toufique Ahmed, Rifat Shahriyar, Anindya Iqbal, and Gias Uddin. Early prediction for merged vs abandoned code changes in modern
code reviews. Information and Software Technology , 142:106756, 2022.
[99] Moataz Chouchen, Ali Ouni, and Mohamed Wiem Mkaouer. Multicr: Predicting merged and abandoned code changes in modern code review using
multi-objective search. ACM Trans. Softw. Eng. Methodol. , 33(8), 2024.
[100] Lanxin Yang, He Zhang, Jinwei Xu, Jun Lyu, Xin Zhou, Dong Shao, Shan Gao, and Alberto Bacchelli. A preliminary investigation on using
multi-task learning to predict change performance in code reviews. Empir. Softw. Eng. , 29(6):157, 2024.
[101] Moataz Chouchen and Ali Ouni. A multi-objective effort-aware approach for early code review prediction and prioritization. Empir. Softw. Eng. ,
29(1):29, 2024.
[102] Krishna Teja Ayinala, Kwok Sun Cheng, Kwangsung Oh, Teukseob Song, and Myoungkyu Song. Code inspection support for recurring changes
with deep learning in evolving software. In 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC) , pages 931–942, 2020.
[103] Ruiyin Wen, David Gilbert, Michael G Roche, and Shane McIntosh. Blimp tracer: Integrating build impact analysis with code review. In 2018 IEEE
International Conference on Software Maintenance and Evolution (ICSME) , pages 685–694. IEEE, 2018.
[104] Anderson Uchôa, Caio Barbosa, Daniel Coutinho, Willian Oizumi, Wesley KG Assunçao, Silvia Regina Vergilio, Juliana Alves Pereira, Anderson
Oliveira, and Alessandro Garcia. Predicting design impactful changes in modern code review: A large-scale empirical study. In 2021 IEEE/ACM 18th
International Conference on Mining Software Repositories (MSR) , pages 471–482. IEEE, 2021.
[105] Song Wang, Chetan Bansal, and Nachiappan Nagappan. Large-scale intent analysis for identifying large-review-effort code changes. Information
and Software Technology , 130:106408, 2021.
[106] Guoliang Zhao, Daniel Alencar da Costa, and Ying Zou. Improving the pull requests review process using learning-to-rank algorithms. Empirical
Software Engineering , 24:2140–2170, 2019.
[107] Jiantao He, Linzhang Wang, and Jianhua Zhao. Supporting automatic code review via design. In 2013 IEEE Seventh International Conference on
Software Security and Reliability Companion , pages 211–218. IEEE, 2013.
[108] Zhiyuan Chen, Maneesh Mohanavilasam, Young-Woo Kwon, and Myoungkyu Song. Tool support for managing clone refactorings to facilitate
code review in evolving software. In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) , volume 1, pages 288–297.
IEEE, 2017.
[109] Behjat Soltanifar, Atakan Erdem, and Ayse Bener. Predicting defectiveness of software patches. In Proceedings of the 10th ACM/IEEE International
Symposium on Empirical Software Engineering and Measurement , pages 1–10, 2016.
[110] Shipra Sharma and Balwinder Sodhi. Using stack overflow content to assist in code review. Software: Practice and Experience , 49(8):1255–1277, 2019.
[111] Yang Hong, Chakkrit Kla Tantithamthavorn, and Patanamon Pick Thongtanunam. Where should I look at? recommending lines that reviewers
should pay attention to. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 1034–1045. IEEE,
2022.
[112] Doriane Olewicki, Sarra Habchi, and Bram Adams. An empirical study on code review activity prediction and its impact in practice. Proc. ACM
Softw. Eng. , 1(FSE):2238–2260, 2024.
[113] Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, and Egor Bulychev. Style-analyzer: fixing code style inconsistencies with
interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) , pages 468–478. IEEE,
2019.
[114] Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. Senticr: a customized sentiment analysis tool for code review interactions.
In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 106–111. IEEE, 2017.
[115] Carolyn D Egelman, Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. Predicting
developers’ negative feelings about code review. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages 174–185,
2020.
[116] Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. Automated identification of toxic code reviews using toxicr. ACM Transactions
on Software Engineering and Methodology , 2023.
[117] Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. Incivility detection in open source code review and issue discussions. J. Syst. Softw. , 209:111935,
2024.
[118] Jaydeb Sarker, Sayma Sultana, Steven R. Wilson, and Amiangshu Bosu. Toxispanse: An explainable toxicity detection in code review comments. In
ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2023, New Orleans, LA, USA, October 26-27, 2023 , pages
1–12. IEEE, 2023.
[119] Md Shamimur Rahman, Zadia Codabux, and Chanchal K. Roy. Do words have power? understanding and fostering civility in code review discussion.
Proc. ACM Softw. Eng. , 1(FSE):1632–1655, 2024.
[120] Anshul Gupta and Neel Sundaresan. Intelligent code reviews using deep learning. In Proceedings of the 24th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD’18) Deep Learning Day , 2018.
[121] Chenkai Guo, Hui Yang, Dengrong Huang, Jianwen Zhang, Naipeng Dong, Jing Xu, and Jingwen Zhu. Review sharing via deep semi-supervised
code clone detection. IEEE Access , 8:24948–24965, 2020.
[122] Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, and Yang Liu. Core: Automating review recommendation for code changes. In 2020 IEEE 27th
International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 284–295. IEEE, 2020.
[123] Ohiduzzaman Shuvo, Parvez Mahbub, and Mohammad Masudur Rahman. Recommending code reviews leveraging code changes with structured
information retrieval. In IEEE International Conference on Software Maintenance and Evolution, ICSME 2023, Bogotá, Colombia, October 1-6, 2023 , pages
194–206. IEEE, 2023.
[124] Yusuf Kartal, Kaan Akdeniz, and Kemal Ozkan. Automating modern code review processes with code similarity measurement. Inf. Softw. Technol. ,
173:107490, 2024.
[125] Chenkai Guo, Dengrong Huang, Naipeng Dong, Quanqi Ye, Jing Xu, Yaqing Fan, Hui Yang, and Yifan Xu. Deep review sharing. In Xinyu Wang,
David Lo, and Emad Shihab, editors, 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019 , pages 61–72.
IEEE, 2019.
[126] Toshiki Hirao, Shane McIntosh, Akinori Ihara, and Kenichi Matsumoto. The review linkage graph for code review analytics: a recovery approach
and empirical study. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations
of Software Engineering , pages 578–589, 2019.
[127] Yuki Ueda, Takashi Ishio, Akinori Ihara, and Kenichi Matsumoto. Mining source code improvement patterns from similar code review works. In
2019 IEEE 13th International Workshop on Software Clones (IWSC) , pages 13–19. IEEE, 2019.
[128] Patanamon Thongtanunam, Chanathip Pornprasit, and Chakkrit Tantithamthavorn. Autotransform: Automated code transformation to support
modern code review process. In 44th IEEE/ACM International Conference on Software Engineering, ICSE , pages 237–248, 2022.
[129] Chanathip Pornprasit, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Chunyang Chen. D-ACT: towards diff-aware code transformation
for code review under a time-wise evaluation. In Tao Zhang, Xin Xia, and Nicole Novielli, editors, IEEE International Conference on Software
Analysis, Evolution and Reengineering, SANER 2023 , pages 296–307. IEEE, 2023.
[130] Qianhua Shan, David Sukhdeo, Qianying Huang, Seth Rogers, Lawrence Chen, Elise Paradis, Peter C Rigby, and Nachiappan Nagappan. Using
nudges to accelerate code reviews at scale. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the
Foundations of Software Engineering , pages 472–482, 2022.
[131] Chandra Maddila, Chetan Bansal, and Nachiappan Nagappan. Predicting pull request completion time: a case study on large scale cloud services.
In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering ,
pages 874–882, 2019.
[132] Moataz Chouchen, Ali Ouni, Jefferson Olongo, and Mohamed Wiem Mkaouer. Learning to predict code review completion time in modern code
review. Empirical Software Engineering , 2023.
[133] Lawrence Chen, Peter C. Rigby, and Nachiappan Nagappan. Understanding why we cannot model how long a code review will take: An industrial
case study. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering,
ESEC/FSE , pages 1314–1319, 2022.
[134] Nishrith Saini and Ricardo Britto. Using machine intelligence to prioritise code review requests. In 2021 IEEE/ACM 43rd International Conference on
Software Engineering: Software Engineering in Practice (ICSE-SEIP) , pages 11–20. IEEE, 2021.
[135] Lanxin Yang, Jinwei Xu, He Zhang, Fanghao Wu, Jun Lyu, Yue Li, and Alberto Bacchelli. GPP: A graph-powered prioritizer for code review
requests. In Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, editors, Proceedings of the 39th IEEE/ACM International Conference on Automated
Software Engineering, ASE 2024 , pages 104–116. ACM, 2024.
[136] Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, and Amiangshu Bosu. Towards automated classification of code review
feedback to support analytics. In ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2023, New Orleans,
LA, USA, October 26-27, 2023 , pages 1–12. IEEE, 2023.
[137] Fiorella Zampetti, Saghan Mudbhari, Venera Arnaoudova, Massimiliano Di Penta, Sebastiano Panichella, and Giuliano Antoniol. Using code
reviews to automatically configure static analysis tools. Empirical Software Engineering , 27(1):28, 2022.
[138] Tukaram B. Muske, Ankit Baid, and Tushar Sanas. Review efforts reduction by partitioning of static analysis warnings. In 13th IEEE International
Working Conference on Source Code Analysis and Manipulation, SCAM 2013 , pages 106–115. IEEE Computer Society, 2013.
[139] Massimiliano Menarini, Yan Yan, and William G Griswold. Semantics-assisted code review: An efficient tool chain and a user study. In 2017 32nd
IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 554–565. IEEE, 2017.
[140] Muntazir Fadhel and Emil Sekerinski. Striffs: Architectural component diagrams for code reviews. In 2021 International Conference on Code Quality
(ICCQ) , pages 69–78. IEEE, 2021.
[141] Rodrigo Brito and Marco Tulio Valente. Raid: Tool support for refactoring-aware code reviews. In 2021 IEEE/ACM 29th International Conference on
Program Comprehension (ICPC) , pages 265–275. IEEE, 2021.
[142] Enrico Fregnan, Josua Fröhlich, Davide Spadini, and Alberto Bacchelli. Graph-based visualization of merge requests for code review. Journal of
Systems and Software , 195:111506, 2023.
[143] Peter C. Rigby, Daniel M. Germán, Laura L. E. Cowen, and Margaret-Anne D. Storey. Peer review on open-source software projects: Parameters,
statistical models, and theory. ACM Trans. Softw. Eng. Methodol. , 23(4):35:1–35:33, 2014.
[144] Peter C. Rigby and Christian Bird. Convergent contemporary software peer review practices. In 21st Joint Meeting of the European Software
Engineering Conference and the ACM/SIGSOFT Symposium on the Foundations of Software Engineering, ESEC-FSE , pages 202–212, 2013.
[145] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al.
Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 , 2023.
[146] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code
understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing , pages 8696–8708, November 2021.
[147] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.
Roberta: A robustly optimized BERT pretraining approach. CoRR , abs/1907.11692, 2019.
[148] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, and Neel Sundaresan. Generating accurate assert statements for unit test cases using pretrained
transformers. In 3rd IEEE/ACM International Conference on Automation of Software Test, AST , pages 54–64, 2022.
[149] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan
Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art natural language processing. CoRR , abs/1910.03771, 2019.
[150] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In 40th Annual
Meeting on Association for Computational Linguistics, ACL , pages 311–318, 2002.
[151] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pages 74–81, 2004.
[152] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a
method for automatic evaluation of code synthesis. CoRR , abs/2009.10297, 2020.
[153] Håkan Petersson, Claes Wohlin, Per Runeson, and Martin Höst. Defect content estimation for two reviewers. In 12th International Symposium on
Software Reliability Engineering (ISSRE 2001) , pages 340–345. IEEE Computer Society, 2001.
[154] Yue Yu, Huaimin Wang, Gang Yin, and Charles X Ling. Who should review this pull-request: Reviewer recommendation to expedite crowd
collaboration. In 2014 21st Asia-Pacific Software Engineering Conference , volume 1, pages 335–342. IEEE, 2014.
[155] Fuxiang Chen, Fatemeh Fard, David Lo, and Timofey Bryksin. On the transferability of pre-trained language models for low-resource programming
languages. In 30th IEEE/ACM International Conference on Program Comprehension, ICPC , pages 401–412, 2022.
[156] Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael
Greenberg, Abhinav Jangda, and Arjun Guha. Knowledge transfer from high-resource to low-resource programming languages for code llms.
Proceedings of the ACM on Programming Languages , 8(OOPSLA2):677–708, 2024.
[157] Tim van Dam, Frank van der Heijden, Philippe de Bekker, Berend Nieuwschepen, Marc Otten, and Maliheh Izadi. Investigating the performance of
language models for completing code in functional programming languages: a haskell case study. arXiv preprint arXiv:2403.15185 , 2024.
[158] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: the fine print and the
benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence , AAAI’16, pages 308–314, 2016.
[159] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. Codereval: A benchmark of
pragmatic code generation with generative pre-trained models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering ,
ICSE ’24, 2024.
[160] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient
neural network inference. In Low-Power Computer Vision , pages 291–326. Chapman and Hall/CRC, 2022.
[161] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas
Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301 ,
2023.
[162] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain
Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning , pages 2790–2799. PMLR, 2019.
A APPENDIX
Table 11. Venue names
Acronym        Venue Name
ASE            International Conference on Automated Software Engineering
COMPSAC        Annual Computer Software and Applications Conference
EASE           International Conference on Evaluation and Assessment in Software Engineering
EMSE           Empirical Software Engineering
ESEC/FSE       European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ESEM           International Symposium on Empirical Software Engineering and Measurement
ICSE           International Conference on Software Engineering
ICSE-SEIP      International Conference on Software Engineering: Software Engineering in Practice
ICSME          International Conference on Software Maintenance and Evolution
IEEE Access    IEEE Access
ISSRE          International Symposium on Software Reliability Engineering
IST            Journal of Information and Software Technology
JSS            Journal of Systems and Software
MSR            International Conference on Mining Software Repositories
PACMSE         Proceedings of the ACM on Software Engineering
PROMISE        International Conference on Predictive Models and Data Analytics in Software Engineering
SANER          International Conference on Software Analysis, Evolution and Reengineering
SEKE           International Conference on Software Engineering and Knowledge Engineering
TOSEM          Transactions on Software Engineering and Methodology
TSE            Transactions on Software Engineering