arxiv

Paper 2503.09510

Automating Code Review: A Systematic Literature Review

Authors: Rosalia Tufano, Gabriele Bavota

Published: 2025-03-12

Abstract:

Code Review consists in assessing the code written by teammates with the goal of increasing code quality. Empirical studies documented the benefits brought by such a practice that, however, has its cost to pay in terms of developers' time. For this reason, researchers have proposed techniques and tools to automate code review tasks such as the reviewers selection (i.e., identifying suitable reviewers for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human reviewer would do). Given the substantial amount of papers recently published on the topic, it may be challenging for researchers and practitioners to get a complete overview of the state-of-the-art. We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks. We provide: (i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for their evaluation, with a description of the evaluation metrics usually adopted for each task. The SLR is concluded by a discussion of the current limitations of the state-of-the-art, with insights for future research directions.

Paper Content:

Automating Code Review: A Systematic Literature Review

ROSALIA TUFANO, SEART @ Software Institute - Università della Svizzera italiana, Switzerland
GABRIELE BAVOTA, SEART @ Software Institute - Università della Svizzera italiana, Switzerland

CCS Concepts: • Software and its engineering → Software development techniques.

Additional Key Words and Phrases: code review, recommender systems

ACM Reference Format: Rosalia Tufano and Gabriele Bavota. 2025. Automating Code Review: A Systematic Literature Review. In Woodstock ’18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY.
ACM, New York, NY, USA, 34 pages. https://doi.org/XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2025 Association for Computing Machinery. Manuscript submitted to ACM. arXiv:2503.09510v1 [cs.SE] 12 Mar 2025

1 INTRODUCTION

The idea of inspecting peers’ code looking for bugs and suboptimal implementation choices dates back to the 70s and in particular to the seminal work by Fagan titled “Design and code inspections to reduce errors in program development” [1]. The formal code inspections envisioned at that time slowly evolved into what is known as modern code review (MCR) [2], being tool-based and more informal. One of the objectives of MCR is to reduce the inherent cost associated with code review. Indeed, while there is ample evidence about the benefits of code review [2–6], they do not come for free, and may result in developers spending many hours per week reviewing code [7].

For this reason, researchers proposed techniques and tools to automate specific code review tasks. For example, several studies focus on the task of recommending reviewers [8–43], namely the automatic selection of proper reviewers for a given code change. Other researchers target instead the task of classifying reviewers’ comments [44,45], having the goal of automatically classifying comments posted by reviewers based on the “type of feedback” they provide to the contributor (e.g., feedback about the code style, functionality, etc.). With the recent adoption of deep learning (DL) in software engineering, generative tasks have also been subject of automation. For example, DL models have been trained with the goal of generating natural language comments asking the contributor for code changes as a human reviewer would do (i.e., simulating a reviewer commenting on the submitted code) [46–56].

Given the numerous code review tasks in which automation attempts have been made and the large number of studies targeting this topic, it is important to synthesize the current state-of-the-art to provide researchers and practitioners with an updated entry point on code review automation. We present a systematic review of the literature presenting techniques and tools for the automation of code review tasks. Previous secondary studies on the topic [57,58] only focused on specific tasks (i.e., the recommending reviewers task and refactoring-aware solutions) or do not have a specific focus on code review automation [59]. As we show, there are 34 tasks for which researchers proposed automated solutions in 119 articles. As a comparison, the most extensive literature review to date also featuring code review automation techniques only includes 53 of these articles [59]. This makes our SLR by far the most comprehensive to date on the topic of code review automation.

The SLR we present is the result of selecting 119 relevant studies out of the 11,165 resulting from querying popular digital libraries. Our contributions are:

(i) A categorization of the 34 code review tasks for which researchers proposed automated solutions;
(ii) An overview of the techniques used in the literature to automate code review (e.g., exploiting machine learning, DL, information retrieval, etc.) with a focus on the training strategies used for data-driven techniques;
(iii) A collection of the publicly available techniques (i.e., the tool or the code implementing the technique is publicly available) and evaluation datasets clustered by “type of automated tasks” (e.g., we list all publicly available tools/techniques and evaluation datasets for the task of recommending reviewers);
(iv) A description of the evaluation frameworks adopted in the literature to assess the performance of techniques proposed for the different tasks, with a focus on the adopted metrics, targeted language, and deployment in industry of the automated solution;
(v) Informed by the findings of our SLR, a list of directions for future work in the field of code review automation.

1.1 Structure of the Paper

Section 2 reports the related literature, presenting surveys, SLRs and mapping studies dealing with modern code review. Section 3 presents the methodology we adopted to conduct the SLR. Section 4 discusses the achieved results, answering our research questions. Section 5 reports the threats that could affect the validity of our findings. Finally, Section 6 concludes the paper.

2 RELATED WORK

Table 1 lists the previous secondary studies on modern code review in chronological order. For each work we also include (i) the overall number of papers part of the study (i.e., column “#Papers”) and (ii) the papers related to the automation of code review tasks that are featured in the study.

The focus of the works by Badampudi et al. [60], Wang et al. [62], and Fronza et al. [61] is different as compared to our SLR. Badampudi et al. [60] aim at classifying the literature on modern code review based on the investigated research questions. As a result of this analysis they report that 39 of the surveyed papers present tool support for code review. However, no additional analyses are performed on these works. A similar study has also been presented by Wang et al. [62].
Also in this case the authors focus on classifying the type of contribution, reporting 37 papers out of the 112 considered as related to code review automation. Again, there are no specific research questions in the SLR about code review automation. Fronza et al. [61], instead, explicitly focus on empirical studies rather than papers presenting techniques and tools for code review automation.

Table 1. Surveys, SLRs and mapping studies dealing with modern code review

| Reference | Main Goal | Year | #Papers | #Papers Automation |
|---|---|---|---|---|
| Hannebauer et al. [57] | Comparing eight techniques to recommend code reviewers | 2016 | 8 | 8 |
| Badampudi et al. [60] | Documenting the research questions addressed in code review literature | 2019 | 177 | 39 |
| Coelho et al. [58] | Mapping refactoring-aware solutions to support modern code review | 2019 | 13 | 9 |
| Fronza et al. [61] | Documenting the research questions addressed in code review literature | 2020 | 75 | 0 |
| Wang et al. [62] | Mapping the type of contributions (e.g., empirical study, automation) in code review papers, study their replicability, document the type of data collected in such studies (e.g., the experience of reviewers, the workload, etc.) | 2021 | 112 | 37 |
| Davila et al. [59] | Mapping the type of contributions (i.e., foundational, proposals, evaluations) in code review papers | 2021 | 139 | 53 |
| Our work | Documenting the code review tasks automated in the literature, the adopted techniques and evaluation datasets/frameworks | 2025 | 119 | 119 |

Hannebauer et al. [57] and Coelho et al. [58] present secondary studies focusing on the automation of specific code review tasks. The former compares eight techniques for the recommending reviewers task, while the latter looks at 13 refactoring-aware solutions proposed in the literature. Our SLR has a wider target, looking at works automating any code review task.

Finally, Davila et al. [59] presented another SLR mapping the type of contribution of the code review papers. Their SLR features 53 papers presenting tools and techniques for the automation of code review. As compared to the previously discussed SLRs, Davila et al. provide a detailed description of these works, including the type of task they support. However, differently from our SLR, the main focus is not on code review automation and, due to the time period in which papers have been collected (i.e., up to 2019), none of the recent techniques built on top of DL models is documented (and, consequently, none of the tasks that have been automated for the first time thanks to DL models). Our SLR more than doubles the papers on code review automation present in the work by Davila et al. [59] (from 53 to 119).

3 RESEARCH METHOD

We describe our research method following the guidelines by Kitchenham and Charters [63] for SLRs.

3.1 Research Questions

Our SLR aims at informing researchers and practitioners about the state of the art in automating code review and it is thus steered by the following research questions (RQs):

• RQ1: What are the code review tasks for which researchers proposed automated solutions? We aim at categorizing the code review tasks automated in the literature to support (i) researchers, in getting a complete overview of tackled research directions in the field, thus possibly identifying areas in need of further research; and (ii) practitioners, in discovering automated solutions which may be employed in their daily workflow.

Once the list of automated tasks is identified, the following RQs are answered for each task (i.e., by discussing the findings by task):

• RQ2: What are the under-the-hood solutions behind the techniques and tools proposed for code review automation? This RQ sheds light on the functioning of the proposed automated solutions.
In particular, we present: (i) a high-level classification of the adopted technical solution (e.g., DL-based, ML-based (Machine Learning), IR-based (Information Retrieval), etc.); (ii) a description of the training strategies adopted in data-driven solutions; and (iii) information about the programming language target of the automation (e.g., does the technique only support Java or is it language-independent?).

• RQ3: How are techniques for the automation of code related tasks empirically evaluated? We focus on the adopted evaluation metrics and on additional qualitative/industrial studies present in the papers. RQ3 can help researchers in getting a quick understanding of the possible evaluation framework to adopt for their techniques.

• RQ4: Which techniques and datasets are publicly available? While RQ1 identifies the automated tasks and, for each of them, lists the solutions proposed in the literature, not all these techniques are publicly available (i.e., their implementation has been released by the authors). A similar observation can be made for the used datasets. The output of RQ4 is the list of techniques and datasets that, as of the day of writing (January 2025), are publicly available. Such an outcome can be useful to (i) researchers, to easily identify baselines for comparisons and/or datasets that can be used for building or evaluating automated solutions; and (ii) practitioners, to easily spot “ready-to-use” solutions they can consider for adoption.

[Fig. 1. Study selection process. (1) Online Search: ACM +1,006, Elsevier +3,743, IEEE +1,885, Scopus +2,890, Springer +1,397, Wiley +244 (n=11,165). (2) Automated Filtering: duplicates -1,320, invalid publication venues -3,511, systematic literature reviews -3,271, book chapters/magazines -1,151, conference reviewers’ lists -306, user/app review studies -209 (n=1,397). (3) Manual Inspection: -1,294 (n=103). (4) Snowballing: +16, leading to 119 selected studies.]
• RQ5: What are the concerns raised or the limitations observed by researchers when experimenting with the automated solutions? We inspect the 119 papers to identify limitations and concerns researchers discuss about the proposed techniques, with the goal of outlining possible future research directions in the field.

3.2 Relevant Study Identification

Fig. 1 depicts the process adopted to identify the relevant primary studies. Such a process is detailed in the following.

3.2.1 Search Strategy. We queried six digital libraries to search for primary studies: ACM Digital Library [64], Elsevier ScienceDirect [65], IEEE Xplore Digital Library [66], Scopus [67], Springer Link Online Library [68], and Wiley Online Library [69]. We did not query Google Scholar due to the limitations documented by Halevi et al. [70] (e.g., lack of quality control, missing support for data download).

To define the query needed to identify works related to the automation of code review tasks, a trial-and-error procedure has been performed by the two authors. It soon became clear that searching in the paper titles for keywords such as “automating”, “recommending”, etc. was not an option, even considering all their possible variations (e.g., automating, automated, automate). Indeed, this would have led to the loss of several relevant studies (e.g., “Intelligent Code Review Assignment for Large Scale Open Source Software Stacks” [30], “A Multi-Step Learning Approach to Assist Code Review” [71]).
For this reason, we opted for a more conservative query which targets the identification of all code review-related studies, even those not presenting automated solutions:

    Title CONTAINS “revi*” OR (“cod*” AND “edit*”) AND Publication venue CONTAINS (“software” OR “program” OR “code”)

The query searches for the term “revi*” (e.g., review, reviewing, revision) or both the terms “cod*” (e.g., code, coding) and “edit*” in the article title. The latter have been included to match works related to the recent trend of automating code editing needed to address a reviewer’s comment (see e.g., [47,49,54–56,72–77]). While only searching in the title might be restrictive, we want to identify automated solutions which have been explicitly proposed for code review (e.g., we are not interested in articles presenting generic static analysis tools that might be applied in code review to spot quality issues). Also, we only searched for articles published in venues containing at least one of three keywords: “software”, “program”, and “code”. Such a filter is based on the authors’ knowledge of software engineering publication venues. We acknowledge that there might be relevant articles published in related fields (e.g., artificial intelligence) that our query would exclude. However, as explained later, we adopt a snowballing process to partially address this issue.

Among the queried search engines, Elsevier, Scopus, Springer, and Wiley allow specifying a discipline of interest, which is useful to minimize the retrieved false positive instances. For these libraries, we selected “Computer Science” as discipline. Springer also allows specifying sub-disciplines, for which we selected “Software Engineering/Programming”. The link with the query used for each digital library is publicly available in our replication package [78]. The query was run on 20 December 2024 on all digital libraries.

Table 2. Articles returned by the queried digital libraries

| Source | Returned Articles |
|---|---|
| ACM Digital Library | 1,006 |
| Elsevier ScienceDirect | 3,743 |
| IEEE Xplore Digital Library | 1,885 |
| Scopus | 2,890 |
| Springer Link Online Library | 1,397 |
| Wiley Online Library | 244 |
| Total (including duplicates) | 11,165 |
| Total (excluding duplicates) | 9,845 |

Table 2 reports the articles returned by each digital library. Once duplicates were removed (i.e., the same article returned by multiple libraries), we collected 9,845 candidate primary studies which have been manually inspected as described in the following.

3.2.2 Study Selection. Given the high number of articles returned by the formulated query, we started with an automated check aimed at excluding clear false positives. First, despite the filter on venues we set in the digital libraries, we noticed that some of the returned results concerned invalid publication venues (i.e., venues not featuring in their name any of the three keywords “software”, “program”, and “code”). Thus, we implemented a simple script excluding those cases (-3,511).

Three other filters were implemented. First, given our query, and in particular the retrieval of articles containing “revi*” in their title, we retrieved several SLRs. Among those, we were only interested in the ones focusing on code review, since they represent an important source of references for the snowballing phase. Thus, we automatically removed all articles containing in the title, besides “review”, the term “systematic” and not containing the term “code” (-3,271). Second, we excluded articles published as book chapters or in magazines, since those are usually not full research articles (-1,151). Finally, we also excluded “reviewers lists” (-306) and works related to user/app reviews (-209). At the end of this process, 1,397 candidate primary studies were left.

Table 3. Inclusion and exclusion criteria

| Type | ID | Criterion |
|---|---|---|
| Inclusion | IC1 | The article must be peer-reviewed, published at conferences, workshops, or journals. In the snowballing phase later described, we ignore all referenced preprints (e.g., those published on arXiv.org). |
| Inclusion | IC2 | The PDF of the article must be available online. We searched for it on the online libraries featuring it and, if needed, on Google. |
| Inclusion | IC3 | The article must present technique(s) to automate a code review task. It is not enough to present a generic technique that, according to the reader, might be useful in the context of code review: the authors must explicitly state that the technique has been designed to support code review. |
| Exclusion | EC1 | The article is not written in English. |
| Exclusion | EC2 | The article has been published in a conference/workshop and later on extended to a journal. We only keep the journal article to avoid redundancy. |
| Exclusion | EC3 | The article is not a full research publication (e.g., doctoral symposium articles, posters, ERA track). We exclude all articles having fewer than six pages with the goal of removing articles that may not have been subject to the same peer-review process typical of full research articles. |
| Exclusion | EC4 | The article replicates a previously published technique for code review automation which has been already included in the SLR. |
| Exclusion | EC5 | The article is a secondary study. In this case, we keep it only as a source of references for the snowballing phase. |
| Exclusion | EC6 | The article has not been published in an international venue, but in a national one (e.g., Brazilian Symposium on Programming Languages). |

This set has then been manually inspected by both authors. Inclusion and exclusion criteria are listed in Table 3. This part of the manual analysis was mainly focused on the inspection of the title and abstract of the article. The authors agreed to be conservative and include the article in case of doubts, given the planned subsequent reading of the whole article as described in the following.
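The search query and the automated part of the selection process in Section 3.2 can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' script (the actual per-library queries are in the replication package [78]): it assumes that each wildcard term matches any word starting with that prefix, that the venue check is a case-insensitive substring match, and that the query reads as (title clause) AND (venue clause). The function names are hypothetical. The counts are those reported in Table 2, in Fig. 1, and in the text.

```python
import re

# Assumption: "revi*", "cod*" and "edit*" match any word with that prefix.
REVI = re.compile(r"\brevi\w*", re.IGNORECASE)
COD = re.compile(r"\bcod\w*", re.IGNORECASE)
EDIT = re.compile(r"\bedit\w*", re.IGNORECASE)
VENUE_KEYWORDS = ("software", "program", "code")


def matches_query(title: str, venue: str) -> bool:
    """Title has revi*, or both cod* and edit*; venue has one of the keywords."""
    title_ok = bool(REVI.search(title)) or bool(
        COD.search(title) and EDIT.search(title)
    )
    venue_ok = any(keyword in venue.lower() for keyword in VENUE_KEYWORDS)
    return title_ok and venue_ok


def is_unrelated_slr(title: str) -> bool:
    """Automated filter: title has "review" and "systematic" but not "code"."""
    lowered = title.lower()
    return "review" in lowered and "systematic" in lowered and "code" not in lowered


# Articles returned per digital library (Table 2).
RETURNED = {
    "ACM": 1006, "Elsevier": 3743, "IEEE": 1885,
    "Scopus": 2890, "Springer": 1397, "Wiley": 244,
}
# Articles removed by each automated filter (Section 3.2.2).
AUTOMATED_EXCLUSIONS = {
    "invalid publication venues": 3511,
    "systematic reviews not about code": 3271,
    "book chapters/magazines": 1151,
    "reviewers lists": 306,
    "user/app review studies": 209,
}


def replay_selection() -> dict:
    """Replay the study selection counts step by step."""
    returned = sum(RETURNED.values())
    deduplicated = returned - 1320  # duplicates removed (Fig. 1)
    filtered = deduplicated - sum(AUTOMATED_EXCLUSIONS.values())
    inspected = filtered - 1294  # discarded during manual inspection (Fig. 1)
    final = inspected + 16  # added by snowballing (Fig. 1)
    return {
        "returned": returned,
        "deduplicated": deduplicated,
        "after_automated_filtering": filtered,
        "after_manual_inspection": inspected,
        "final": final,
    }


if __name__ == "__main__":
    print(replay_selection())
```

Running the script prints the counts 11,165, 9,845, 1,397, 103, and 119, which agree with the figures reported in the paper. Under the prefix assumption, `matches_query` accepts a title such as “Intelligent Code Review Assignment for Large Scale Open Source Software Stacks” when the venue name contains “software”, whereas a title-only search for “automat*” would miss it.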
Conflicts (i.e., cases in which one author considered the article as relevant and one not) arose in 25 cases (1.8%) and have been solved through an open discussion. This filtering process left 175 candidate studies which have been equally split between the two authors. Each author downloaded the corresponding articles and re-inspected them keeping the inclusion and exclusion criteria in mind (Table 3), then either confirming each article as relevant for the SLR or discarding it. All those discarded have been double-checked by the other author to ensure no relevant studies were mistakenly excluded. This further check confirmed 103 articles as relevant primary studies. Those, together with 19 articles tagged as “relevant secondary study”, have been subject of a backward snowballing process.

Backward Snowballing. The 122 articles were equally split between the authors, with each of them in charge of reading the reference list and identifying possible relevant papers. At this step, we also retrieved relevant papers published in venues not containing any of the three keywords “software”, “program”, and “code” (e.g., papers published in the Conference on Artificial Intelligence, AAAI). Also in this phase, in case of doubts, the authors agreed to include a referenced article for a further check by the other author. The snowballing resulted in 16 additional primary studies which, added to the 103 previously collected, lead to the final set of 119 primary studies featured in our SLR.

3.3 Data Extraction and Analysis

The 119 primary studies have been inspected one last time with the goal of extracting the information needed to answer our RQs. The articles have been again equally split between the two authors, with each of them in charge of extracting the needed data guided by the questionnaire in Table 4.

Table 4. Data extraction questionnaire

| No. | Question | Focus |
|---|---|---|
| Q1 | Which code review task has been automated? | RQ1 |
| Q2 | Does the employed technique rely on machine/deep learning? | RQ2 |
| Q3 | If yes to Q2, which specific algorithms are used? | RQ2 |
| Q4 | If no to Q2, summarize the approach functioning. | RQ2 |
| Q5 | Which dataset has been exploited to build the technique? Collect information related to (i) the subject programming language(s) and (ii) the type of information featured in the dataset (i.e., what is an “instance” in the dataset?). | RQ2 |
| Q6 | Which evaluation metrics have been employed? | RQ3 |
| Q7 | Did the authors perform any sort of qualitative analysis? | RQ3 |
| Q8 | Was the approach deployed in an industrial setting? | RQ3 |
| Q9 | Is a link to a replication package available? Is the link still working? | RQ4 |
| Q10 | Is the implementation of the proposed solution publicly available? | RQ4 |
| Q11 | Are the datasets used for training and/or evaluating the technique publicly available? | RQ4 |
| Q12 | Do the researchers raise specific concerns or discuss limitations about the experimented solutions? | RQ5 |

The questions are clustered based on the RQ they serve. Q1 collects the data needed to answer RQ1 (i.e., code review tasks automated in the literature). Q2-Q5 aim at categorizing the under-the-hood functioning of these techniques, thus answering RQ2. Q6-Q8 shed light on the empirical evaluation performed to assess the proposed techniques (RQ3), while Q9-Q11 look at the replicability of the primary studies and list publicly available techniques and tools (RQ4). Finally, Q12 informs our discussion of current limitations of automated techniques (RQ5). It is worth noting that some of the considered articles did not explicitly report some of the information we aim at collecting. Those cases are all documented in the master table reported in our replication package [78].

Once the needed data were collected, we answer our RQs as follows.
For RQ1 we report the list of code review tasks automated in the literature. Given this list, all other RQs are discussed by task. In all RQs in which “categories” must be defined (e.g., the list of automated tasks in RQ1), these have been obtained via an open-coding inspired procedure performed together by the two authors on the notes each of them took during the data extraction procedure, going back to the original paper if needed (i.e., if the notes were not clear/comprehensive enough).

For RQ2 we classify the automated approaches based on the technical solution they are built upon (e.g., DL-based). Then, we distill findings about the training procedures followed for data-driven techniques and the targeted programming languages.

For RQ3 we focus instead on the evaluation, reporting the metrics usually adopted in the assessment of the techniques, whether qualitative analysis was present, and if the approach has been deployed in industry.

RQ4 lists in a tabular fashion the available replication packages, reporting for each of them whether they provide an implementation of the proposed technique and/or the datasets used in the study.

[Fig. 2. Publication years]

Finally, for RQ5 we read the selected papers with a particular focus on the sections describing the approach, those discussing the results, and the conclusions to identify concerns/limitations about the proposed technique and its experimentation. We ignored classic limitations which can be found in any paper and which are usually discussed in the “threats to validity” section (e.g., lack of generalizability beyond the scope of the experiment, limited hyperparameters tuning), but focused on concerns/limitations which are peculiar to the experimented technique (e.g., the lack of an appropriate metric to assess its effectiveness). Once the relevant parts of the papers were identified, a tag summarizing the discussed issue was defined.
Then, similar tags were merged and the final list of tags was organized in a taxonomy presented in Table 10. The identified issues and their mapping with the corresponding papers were double-checked by a second author. 4 RESULTS Before answering our RQs, we provide an overview about the identified primary studies. Fig. 2 plots the publication year of the 119 articles showing, with few exceptions, an overall increasing trend over the years with 21 papers published in 2024. Fig. 3 shows instead the publication venues for these techniques, with venues such as IST, EMSE, ESEC/FSE, and ICSE being the most popular ones. Table 11 in the paper appendix indicates what each acronym used for publication venues stands for. 4.1 RQ 1: What are the code review tasks for which researchers proposed automated solutions? Table 5 presents the 34 types of code review tasks which have been automated in the literature. Table 5 groups the tasks into macro categories ( e.g.,“Code Change Analysis”) and provides a short description of each task with related references ( i.e.,the works addressing its automation). We discuss in the following each macro category. 4.1.1 Assessing Review Quality. Works in this area aim at automatically assessing the quality of the review. Such information is meant to be fed to the reviewer who can take proper actions to improve the review quality, if needed. Works in this area aimed at classifying review comments as useful or not-useful for the contributor [ 79–82]. Rahman et al. [83] address a similar problem but by focusing specifically on comments requiring additional explanations to be properly understood by the contributor (thus being a subcategory of not-useful comments). Widyasari et al. [84] also 8 Page 9: Automating Code Review: A Systematic Literature Review Woodstock ’18, June 03–05, 2018, Woodstock, NY Fig. 3. 
Publication venues investigate comments requiring additional explanations, also proposing the usage of Large Language Models (LLMs) to generate the additional explanations, when needed. Finally, Hijazi et al. [85,86] looked at the code review quality measurement from an orthogonal perspective using biometrics data. By monitoring the reviewer’s activities (using e.g.,an eye-tracking device) they can provide feedback to the reviewer about areas of the reviewed code they did not pay enough attention to, thus suggesting a further check. 4.1.2 Code Change Analysis. This category groups techniques aimed at analyzing the code change submitted for review in order to extract information useful to support the reviewer in its inspection. Several authors [ 87–89] targeted the splitting of tangled commits [ 90] into smaller and cohesive changes which are supposed to be easier to review. Indeed, having smaller changes can help in achieving quick review turnarounds [ 2,6] while cohesive changes simplify the identification of proper reviewers, which are more likely to have a comprehensive expertise to review the change (given its cohesiveness and focus). Huang et al. [91] propose the automated identification of the “salient-class” in a commit to review. The salient-class is the one supposed to be the main focus of the changes and which likely triggered changes to other code locations. Such a class can be used as entry point for the review process, assuming that this will simplify the code change understanding. Wang et al. [92] suggest the automated linking of similar contributions which may help in identifying duplicated patches and, more in general, in increasing the reviewers’ awareness about changes impacting similar locations, thus promoting a better code review. 9 Page 10: Woodstock ’18, June 03–05, 2018, Woodstock, NY Tufano and Bavota Finally, with the goal of minimizing the number of code review iterations needed to accept a proposed change, Hong et al. 
[93] propose a change impact analysis methodology specifically tailored for the code review process and aimed at identifying functions that must co-change given the proposed contribution, but are not changed.

4.1.3 Code Change Classification. Works in this area classify the whole code change to review, again with the goal of augmenting the information available to reviewers before starting the code inspection. Predicting whether the code change will be approved (merged) or needs additional review rounds is the most popular code change classification task tackled in the literature [47, 55, 56, 94–101]. Works on this topic provide a representation of the code change as input to the approach (e.g., to a DL model) expecting it to suggest whether the implemented change is acceptable. A variation is to also provide the technique with information about the specific change the developer was asked to implement (e.g., a reviewer comment that the contributor had to address). The outputted boolean prediction can help, for example, to prioritize the diff hunks of a pull request, focusing on those likely to require a reviewer’s comment [47]. Another line of research aims at identifying code contributions which, due to their nature, will require a large review effort. Uchôa et al. [104] automatically flag code changes which are likely to impact the software design, thus requiring extra care in their assessment. Wen et al. [103] propose BLIMP Tracer, a tool to support code review through impact analysis information, thus helping in identifying changes impacting mission-critical deliverables. Wang et al. [105] generalize the problem to the automated identification of large-review-effort changes while, at the other end of the spectrum, Zhao et al. [106] target the identification of quickly reviewable changes, namely contributions that are easy to merge or reject.
Similarly to the works classifying contributions as likely to be accepted/rejected, all these works provide code reviewers with information useful for prioritizing the changes to inspect.

4.1.4 Code Change Quality Check. Researchers proposed solutions to (partially) automate the quality check usually in place when reviewing a code change. Approaches addressing this task substantially vary in their goal and complexity. Some of them focus on specific code quality aspects, such as predicting whether a submitted patch is likely to introduce a bug [109, 110], identifying the presence of missed clone refactoring opportunities [108], or checking whether the implemented change violates existing design patterns [107]. Other techniques address the same problem with, however, a more general view on code quality. Some authors [71, 111, 112] aim at predicting code elements in a patch which require the reviewer’s attention, since they are likely in need of changes. These approaches are useful in the context of within-patch review prioritization (i.e., deciding where to allocate more review effort within a patch). Other works push the boundaries further, targeting the automated generation of concrete feedback for the contributor, as a human reviewer would do. A first strategy to achieve this goal consists in merging the output of several static analysis tools [8], providing the contributor with a list of potential flaws identified in the submitted patch. The most recent trend consists, however, in exploiting DL models to generate natural language comments for a given patch, with the model imitating a human reviewer [46–56]. These techniques are trained on thousands of examples of real code reviews (i.e., review comments linked to specific code changes) and can then be applied to previously unseen changes to generate review comments. Markovtsev et al.
[113] focused on a simplified version of this problem: their approach “learns” the code formatting style of a given software project, identifies violations of such a style, and suggests possible fixes as automatically generated reviewer’s comments.

Table 5. Code review tasks for which automated solutions have been proposed

| Type | Task | Description | Reference |
|---|---|---|---|
| Assessing Review Quality | Assessing Review Quality through Biometrics | Evaluate the quality of code review using biometrics data, warning the reviewer if specific areas of code deserve a further check | [85, 86] |
| Assessing Review Quality | Classifying the Usefulness of Review Comments | Classify a given code review comment as useful or not-useful for the contributor | [79–82] |
| Assessing Review Quality | Identifying/Improving Review Comments Needing Further Explanations | Identify review comments which need further explanations to be properly understood by the contributor | [83, 84] |
| Code Change Analysis | Decomposing Tangled Commit | Split a composite code change into smaller and cohesive changes | [87–89] |
| Code Change Analysis | Impact Analysis for Code Review | Recommend functions that must be changed given the submitted contribution | [93] |
| Code Change Analysis | Linking Similar Contributions | Link similar changes to review that share textual content and modify similar code locations | [92, 102] |
| Code Change Analysis | Predicting Salient-Class | Identification of the “salient-class” in a commit to review, namely the class causing the other changes in the commit | [91] |
| Code Change Classification | Identifying Impactful Code Changes | Identify impactful code changes (e.g., impacting the system design) | [103, 104] |
| Code Change Classification | Identifying Large-review-effort Code Changes | Identify code changes that will require a large reviewing effort | [105] |
| Code Change Classification | Identifying Quickly Reviewable Changes | Rank changes to be reviewed based on their likelihood of being quickly merged or rejected | [106] |
| Code Change Classification | Predicting Code Changes Approval, Merge, or Need for review | Predict the likelihood of a change of being accepted, merged, or needing review | [47, 55, 56, 94–101] |
| Code Change Quality Check | Checking Design Patterns Consistency | Check whether the implemented change violates existing design patterns | [107] |
| Code Change Quality Check | Generating Review Comments | Generate review comments for a given piece of code | [46–56] |
| Code Change Quality Check | Identifying Clone Refactoring Opportunities | Detect unrefactored or partially refactored code clones | [108] |
| Code Change Quality Check | Predicting Code Defectiveness | Predict the defectiveness of a patch before or after being reviewed | [109, 110] |
| Code Change Quality Check | Predicting Problematic Code Elements | Predict code elements in a given contribution reviewers should pay particular attention to (e.g., lines likely needing changes) | [71, 111, 112] |
| Code Change Quality Check | Reviewing Code Formatting Violations | Suggest how to fix code formatting violations in a given piece of code | [113] |
| Code Change Quality Check | Reviewing via Static Analysis | Use multiple static analysis tools to generate a code review | [8] |
| Code Review Sent. Analysis | Classifying the Sentiment of Review Comments | Classify the sentiment of review comments as neutral, negative, or positive | [114] |
| Code Review Sent. Analysis | Identifying “Pushback” Feelings in Reviews | Identify feelings of “pushback”, with the reviewer blocking a change request for interpersonal conflicts | [115] |
| Code Review Sent. Analysis | Identifying Toxic/Uncivil Review Comments | Identify toxic or uncivil comments in code reviews | [116–119] |
| Code Review Sent. Analysis | Rephrasing Toxic/Uncivil Comments | Rephrase review comments to improve their politeness without changing their semantics | [119] |
| Retrieval of Similar CR/CC | Augmenting Reviews | Can be used to provide either (i) the contributor with examples of reviews similar to those they are receiving (for better understanding); or (ii) the reviewer with examples of reviews which have been written for code similar to the one they are inspecting | [83, 120–126] |
| Retrieval of Similar CR/CC | Mining Code Improvement Patterns | Extract source code improvement patterns from existing code review history to recommend how to improve the submitted code | [127] |
| Revised Code Generation | Implementing the Code Change Requested by a Reviewer | Generate a revised version of a given piece of code by implementing a specific change requested by the reviewer in a natural language comment | [47, 49, 54–56, 72–77] |
| Revised Code Generation | Predicting the Code Output of the Review Process | Given a code snippet submitted for review, revise it to implement changes which are likely to be required by reviewers | [49, 72, 128, 129] |
| Time Management | Identifying Blocking Actors in Pull Requests | Identify who among contributor(s) and reviewer(s) is to blame for overdue pull requests | [130] |
| Time Management | Predicting Pull Request/Code Review Completion Time | Predict the time needed to complete a pull request/code review | [100, 130–133] |
| Time Management | Prioritizing Review Requests | Prioritize code review requests based on factors such as age of the change, test verdicts, etc. | [101, 134, 135] |
| Other | Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment | Classify a review comment as Style, Functionality, Test, Approval, Disagreeing, Questioning, Roadmap, Diversion, Convention, Response or Encouragement | [44, 45, 136] |
| Other | Configuring Static Code Analysis Tools | Leverage code review comments for recommending static code analysis tools and warning categories to be used in future | [137] |
| Other | Partitioning Static Analysis Warnings | Cluster the warnings of static analysis tools into categories to simplify their inspection | [138] |
| Other | Recommending Reviewers | Recommend reviewers that are best suited for the given piece of code | [8–43] |
| Other | Visualizing Code Changes | Provide visualizations of the change to review to ease code comprehension | [139–142] |

4.1.5 Code Review Sentiment Analysis. The code review process may result in critiques moved by a developer (reviewer) to one of their peers (contributor). The way in which these critiques are formalized in the reviewer’s comment can play an important role in the successful outcome of the whole process. For this reason, researchers applied sentiment analysis techniques to automatically classify the sentiment of reviewers’ comments [114]: flagging comments expressing a negative sentiment can provide useful information to the reviewer, who can revise those potentially problematic comments. Other authors tackled a more specific version of this problem, focusing on the identification of a specific type of reviewers’ comments expressing negative feelings. In particular, Egelman et al. [115] aim at identifying review comments suggesting the will of the reviewer to block a change request for interpersonal conflicts rather than for the quality of the submitted contribution. Sarker et al. [116, 118], instead, focus on the identification of “toxic code reviews”, while Ferreira et al. [117] and Rahman et al. [119] target “uncivil review comments”. Incivility represents a broader set of negative comments as compared to toxicity, since the latter entails hate speech and offensive language, while incivility does not [117]. Note that Rahman et al. [119], besides identifying uncivil comments, also present a model able to propose alternative civil rephrasing preserving the original comments’ semantics.

4.1.6 Retrieval of Similar Code Reviews/Code Changes. Retrieval techniques have been used to create recommender systems supporting code review from different perspectives.
Given a code fragment to review, some techniques [83, 120–126] retrieve from a dataset of past reviews those involving similar code fragments and recommend to the reviewer comments they can reuse (since used in the past to suggest improvements to similar code). Rahman et al. [83] also proposed a similar approach, but motivated it as a mechanism to provide the contributor with additional examples of reviews similar to those they are receiving. This could help in better understanding what the reviewer meant. Ueda et al. [127] focused instead on mining recurring improvement patterns from code review (i.e., changes frequently suggested by reviewers). Those patterns can then be potentially applied to improve the quality of the code to review (even before the review process starts).

4.1.7 Revised Code Generation. This line of research aims at supporting the code review process by automatically generating the code output of the review process. Two variations of this task have been proposed. The first [49, 72, 128, 129] provides as input to the automated technique a code snippet submitted for review and expects the technique to revise such code to implement changes which will likely be requested during the code review process. These techniques are meant to be used by the contributor before even starting the code review process to quickly verify whether improvements can be made to the code they write. The second [47, 49, 54–56, 72–77] is instead a code refinement task in which the approaches are provided as input not only a code snippet submitted for review but also a specific reviewer’s comment to address. In this case the goal of the approach is to automatically revise the submitted code, generating a version of it addressing the comment provided as input.
These approaches are meant to be used during the code review process either (i) by the reviewer, to attach to their comments an example of how they envision the revised code, or (ii) by the contributor, to automatically address some of the reviewer’s requests.

4.1.8 Time Management. Evidence from the literature suggests that both open source and industrial projects can undergo hundreds of reviews per month (e.g., ∼500 reviews per month in Linux [143], ∼3k in Microsoft Bing [144]). In such a context time management becomes essential and researchers proposed solutions to help the proper allocation of reviewers’ time. Differently from previously discussed techniques which automated specific code review tasks, these approaches aim at augmenting the information available to reviewers and/or managers, thus possibly improving decisions taken during code review. Some of the proposed solutions can be combined in a sort of pipeline to support the code review: approaches to predict the time needed to complete a pull request [100, 130–133] can be used to inform techniques aimed at prioritizing review requests [101, 134, 135]. Also, pull requests taking longer than expected can be provided as input to techniques identifying blocking actor(s) [130], namely the person(s) responsible for the delay. This could help in triggering the blocking actor or, if possible, replacing them.

4.1.9 Other. The last category groups together tasks which did not fit in the previously presented categories and features heterogeneous tasks. These include the code review task which has been mostly subject to automation attempts in the literature: the recommendation of reviewers that are best suited for a given change [8–43].
These techniques, while sharing the same goal, differ for the underlying technical solution adopted (RQ 2 focuses on this aspect) and for the features used to rank the reviewers given the change. In most cases the features include information extracted from the history of code changes to favor the recommendation of reviewers who, e.g., already worked in the past on the code files subject of the change or already reviewed similar patches. The recency of these activities is usually considered as well.

Another popular task in the “Other” category features approaches providing visualizations for the code changes to review in order to simplify the reviewer’s inspection [139–142]. Note that we only included in our SLR visualization techniques specifically aimed at supporting code review. Different works focus the visualization on different types of information. Brito and Valente [141] propose RAID, a tool for refactoring-aware code review which visualizes the refactoring operations implemented in the change to review. Fadhel and Sekerinski [140] target instead visualizations aimed at improving the reviewer’s awareness of the possible impact that the implemented changes can have on the system’s architecture. Fregnan et al. [142] provide a more general-purpose graph-based visualization to support code review: each node represents a class or a method and the links between them represent dependencies such as method calls. The goal here is to improve the navigation of the change and its comprehension. Finally, still related to visualization is the behavioral diff generated by the approach proposed in [139]. The idea is to show the behavioral differences (in terms of test case execution) which can be observed in the system before and after the implementation of the code change to review. This can support the assessment of code change correctness made by the reviewer.

Moving to the next task, Li et al.
[44] present an approach to automatically classify reviewers’ comments into the categories reported in Table 5 (e.g., style, functionality, etc.). Their approach is meant to provide a better understanding and monitoring of the ongoing review process. On top of that, with the proposal of data-driven techniques to automate tasks such as generating review comments, this approach can be used to clean up the training set of these techniques, removing for example the comments classified as “Encouragement”, since they are irrelevant for training techniques suggesting how to improve code snippets. A similar approach has also been presented by Turzo et al. [136], while Fregnan et al. [45] focus on classifying the code changes implemented as result of the code review process. Tukaram et al. [138] propose the idea of partitioning static analysis warnings, with the goal of clustering similar ones, thus simplifying their interpretation. On a related research thread, Zampetti et al. [137] suggest the automated analysis of review comments posted in the past to understand which static analysis tools should be used in the continuous integration pipeline of a given project and how they should be configured. In other words, they aim at understanding which “issues” are relevant to reviewers when inspecting a patch and which of those issues can be automatically identified by static analysis tools.

Table 6. Under-the-hood solutions behind the techniques and tools proposed for code review automation. Approaches: Deep Learning; Machine Learning; Information Retrieval; Heuristic-Based; Other. Programming Languages: Java; Multiple Languages; Language Independent; Other. (The “Approach” and “Language” columns are bar charts and are not reproduced here.)

| Category | Task (# papers) | PT NL | PT code | FT | Granularity |
|---|---|---|---|---|---|
| Assessing Review Quality | Assessing Review Quality through Biometrics (2) | ✗ | ✗ | ✗ | code regions, code review |
| Assessing Review Quality | Classifying the Usefulness of Review Comments (4) | 1/4 | ✗ | 4/4 | review comment |
| Assessing Review Quality | Identifying/Improving Review Comments Needing Further Explanations (2) | 1/2 | 1/2 | 1/2 | review comment |
| Code Change Analysis | Decomposing Tangled Commit (3) | ✗ | ✗ | ✗ | commit |
| Code Change Analysis | Impact Analysis for Code Review (1) | ✗ | ✗ | ✗ | PR |
| Code Change Analysis | Linking Similar Contributions (2) | ✗ | ✗ | 1/2 | code change, PR |
| Code Change Analysis | Predicting Salient-Class (1) | ✗ | ✗ | 1/1 | commit |
| Code Change Classification | Identifying Impactful Code Changes (2) | ✗ | ✗ | 1/2 | commit |
| Code Change Classification | Identifying Large-review-effort Code Changes (1) | ✗ | ✗ | 1/1 | commit |
| Code Change Classification | Identifying Quickly Reviewable Changes (1) | ✗ | ✗ | 1/1 | PR |
| Code Change Classification | Predicting Code Changes Approval, Merge, or Need for review (11) | 3/11 | 4/11 | 11/11 | diff hunk, method, PR |
| Code Change Quality Check | Checking Design Patterns Consistency (1) | ✗ | ✗ | ✗ | file |
| Code Change Quality Check | Generating Review Comments (11) | 9/11 | 10/11 | 10/11 | code change, diff hunk, method |
| Code Change Quality Check | Identifying Clone Refactoring Opportunities (1) | ✗ | ✗ | ✗ | PR |
| Code Change Quality Check | Predicting Code Defectiveness (2) | ✗ | ✗ | 1/2 | file, PR |
| Code Change Quality Check | Predicting Problematic Code Elements (3) | 2/3 | 2/3 | 3/3 | code line, file, PR |
| Code Change Quality Check | Reviewing Code Formatting Violations (1) | ✗ | ✗ | 1/1 | file |
| Code Change Quality Check | Reviewing via Static Analysis (1) | ✗ | ✗ | ✗ | PR |
| Code Review Sentiment Analysis | Classifying the Sentiment of Review Comments (1) | ✗ | ✗ | 1/1 | review comment |
| Code Review Sentiment Analysis | Identifying “Pushback” Feelings in Reviews (1) | ✗ | ✗ | ✗ | code review |
| Code Review Sentiment Analysis | Identifying Toxic/Uncivil Code Review Comments (4) | 4/4 | ✗ | 4/4 | email, review comment, sentence |
| Code Review Sentiment Analysis | Rephrasing Toxic/Uncivil Comments (1) | 1/1 | 1/1 | 1/1 | review comment |
| Retrieval of Similar CR/CC | Augmenting Reviews (8) | ✗ | ✗ | 4/8 | code change, code review, code snippet, diff hunk, review comment |
| Retrieval of Similar CR/CC | Mining Code Improvement Patterns (1) | ✗ | ✗ | 1/1 | diff hunk |
| Revised Code Generation | Implementing the Code Change Requested by a Reviewer (11) | 8/11 | 9/11 | 11/11 | code change, diff hunk, method |
| Revised Code Generation | Predicting the Code Output of the Review Process (4) | 2/3 | 2/3 | 4 | method |
| Time Management | Identifying Blocking Actors in Pull Requests (1) | ✗ | ✗ | ✗ | PR |
| Time Management | Predicting Pull Request/Code Review Completion Time (5) | ✗ | ✗ | 4/5 | code change, commit, PR |
| Time Management | Prioritizing Review Requests (3) | ✗ | ✗ | 2/3 | code change, PR |
| Other | Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment (3) | 1/3 | 1/3 | 3/3 | code change, review comment |
| Other | Configuring Static Code Analysis Tools (1) | ✗ | ✗ | 1/1 | review comment |
| Other | Partitioning Static Analysis Warnings (1) | ✗ | ✗ | ✗ | code snippet |
| Other | Recommending Reviewers (36) | 1/36 | ✗ | 23/36 | commit, patch, PR |
| Other | Visualizing Code Changes (4) | ✗ | ✗ | ✗ | commit, diff hunk, PR |

4.2 RQ 2: What are the under-the-hood solutions behind the techniques and tools proposed for code review automation?

Table 6 summarizes the under-the-hood solutions behind the techniques proposed in the literature for code review automation. The “Task” column reports the list of automated tasks, with the number in parentheses representing the number of papers (out of the considered 119) presenting an automation solution for such a task. For each task 𝑇𝑖, in the “Approach” column the bar chart depicts the percentage of DL-based, ML-based, IR-based, and Heuristic-based techniques out of those automating 𝑇𝑖. Approaches not relying on any of these four techniques are grouped into the “Other” category (e.g., data-flow analysis [138] or visualization techniques [139]). With heuristic-based techniques we refer to hand-crafted techniques which are usually composed of multiple steps (e.g., building a traceability graph and defining a specific metric to identify the best-suited reviewer for a given code change [21]). For approaches based on DL/ML, the “Training” column shows whether they underwent (i) a pre-training on a natural language corpus (“PT NL”); (ii) a pre-training on a code corpus (“PT code”); and (iii) a fine-tuning (“FT”).
While the pre-training procedures are typical of DL-based techniques, with fine-tuning we also indicate the standard training of classic ML algorithms ( e.g.,training a classifier to identify design-impactful changes on a labeled dataset [ 104]). For each of these three training procedures, a ✗indicates that none of the corresponding papers adopts it, otherwise a fraction is used to report the number of papers employing it. The “Granularity” column indicates, for a given task, the type of “entities” for which automation solutions have been proposed. For example, among the 11 techniques aimed at commenting on source code by posting natural language comments as a human would do ( i.e., generating review comments task), some of them work at code change granularity (i.e.,they take as input the whole code diff of a PR), others consider a specific diff hunk (i.e.,only a specific part of the change, possibly spanning multiple functions), and the remaining ones work on a single function impacted by the change ( i.e.,they comment on one changed function). Finally, for each task, the “Language” column depicts, using again a bar chart, the percentage of proposed automation techniques providing support for a specific programming language. Since Java was by far the most popular language, a specific color has been assigned to it (see Table 6’s caption), while other colors are used to indicate techniques (i) supporting multiple languages, (ii) being language-independent, (iii) or being specific for a single language which is not Java. When we report an approach as only supporting a specific or multiple languages, this does not mean that the approach cannot be adapted to other languages. This is something we did not assess, since it would require a deep understanding of all technicalities behind each approach, something which is not always easy to grasp from the paper’s reading. 
For example, a DL model trained and tested on Java code to support a specific task is labeled as “Java only” even though, with reasonable effort, the approach could probably be trained on other languages while keeping similar performance. Basically, we considered the languages on which the approaches can be used out of the box.

4.2.1 Approach. There is a clear distinction between the underlying solutions adopted by techniques automating classification vs. generative tasks. For the former (e.g., classifying the usefulness of review comments, predicting salient-class, identifying impactful code changes), ML-based solutions (red bars in Table 6) are the most popular ones (36%), followed by heuristic-based (24%) and DL-based (21%) techniques. Other solutions account for the remaining 19% of techniques automating classification tasks, with none of them relying on IR. The situation is quite different for generative tasks from which the generation of textual output is expected (e.g., generating review comments, implementing the code change requested by a reviewer, predicting the code output of the review process). In this case, DL-based solutions are by far the most employed (78%), followed by IR-based ones (11%) which can identify relevant content in a knowledge base and use it as output. For example, in the generating review comments task, the approach can take a piece of code to review 𝐶𝑖, find in a knowledge base the code 𝐶𝑗 being most similar to 𝐶𝑖, and reuse the reviewers’ comments posted for 𝐶𝑗 when reviewing 𝐶𝑖. For other types of tasks which cannot really be categorized as classification or generative tasks (e.g., checking design patterns consistency, reviewing via static analysis, visualizing code changes), there is no clear trend which can be observed, with all types of solutions being explored.
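The IR-based strategy described above (find the past code most similar to the code under review and reuse its comments) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ReviewedSnippet` structure, the regex tokenizer, and the choice of Jaccard similarity over token sets are not taken from any of the surveyed approaches, which may rely on different representations and similarity measures.

```python
import re
from dataclasses import dataclass


@dataclass
class ReviewedSnippet:
    """Hypothetical knowledge-base entry: a past code fragment and its review comments."""
    code: str
    comments: list[str]


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier and number tokens (language-agnostic)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_comments(query_code: str, knowledge_base: list[ReviewedSnippet]) -> list[str]:
    """Return the comments posted for the past snippet most similar to query_code."""
    if not knowledge_base:
        return []
    query_tokens = tokenize(query_code)
    best = max(knowledge_base, key=lambda entry: jaccard(query_tokens, tokenize(entry.code)))
    return best.comments


# Toy knowledge base of previously reviewed code (invented examples).
kb = [
    ReviewedSnippet(
        code="for i in range(len(items)): print(items[i])",
        comments=["Iterate over items directly instead of indexing."],
    ),
    ReviewedSnippet(
        code="f = open(path); data = f.read()",
        comments=["Use a with-statement so the file is closed."],
    ),
]

suggested = retrieve_comments("f = open(name); text = f.read()", kb)
```

In this toy setting the query shares the tokens `f`, `open`, and `read` with the second entry and none with the first, so the file-handling comment is the one reused. A practical system would likely also apply a similarity threshold to avoid recommending comments for unrelated code.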
It is also interesting to note the strongly increasing adoption of DL-based techniques for code review automation. If we focus on the last five years considered in our SLR (2020 to 2024), we find that in 2020 and in 2021 DL models have been exploited in 20% (2/10) and in 17% (2/12), respectively, of the papers in our SLR. From 2022, instead, we observe a strong increase in the adoption of DL-based solutions, with 42% in 2022 (11/26), 68% in 2023 (13/19), and 63% in 2024 (17/27).

4.2.2 Training procedures. Out of the 46 DL-based solutions, 34 use some form of pre-training. The idea of pre-training is mostly to teach the DL model the language of interest, by performing a task-agnostic training. For example, a model meant to automatically generate review comments may be pre-trained on a corpus of natural language and code instances via the Masked Language Modeling (MLM) pre-training objective, providing the model with a sentence as input (e.g., an English sentence or a Java statement) having 15% of its tokens masked, with the model in charge of guessing the masked tokens. Of the 34 automated techniques using pre-training, 24 start from an already pre-trained model (e.g., Code Llama [145], CodeT5 [146], RoBERTa [147]), while the remaining ones pre-train their own model. In both cases, the pre-training usually involves both natural language and code (i.e., bi-modal pre-training): this is visible in Table 6 by comparing the number of papers using a model pre-trained on natural language (column “PT NL”) with those exploiting a model pre-trained on code (column “PT code”). For example, out of the 11 papers predicting code changes approval, merge, or need for review, 4 use a pre-trained model, 3 of which pre-trained on bi-modal data (+1 only code). Similarly, when looking at the ones implementing the code change requested by a reviewer, 9 of the 11 papers use a pre-trained model, in 8 cases pre-trained on bi-modal data.
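As an illustration of the MLM objective mentioned in this subsection, the following minimal sketch masks roughly 15% of the tokens of an input sequence and records the original tokens the model would have to guess. It is a simplification under stated assumptions: the `<mask>` placeholder string and the `mask_tokens` helper are hypothetical names, tokens are plain whitespace-separated strings, and refinements used by some models (such as occasionally substituting a random token instead of the mask) are omitted.

```python
import random

MASK = "<mask>"  # hypothetical placeholder; real tokenizers define their own mask token


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Mask about mask_rate of the tokens; return (masked tokens, {position: original})."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one token so that short inputs still yield a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the token the model must predict
        masked[pos] = MASK
    return masked, targets


# A Java-like statement, padded to 20 whitespace-separated tokens for the example.
statement = "if ( user != null && user . isActive ( ) ) { count = count + 1 ; }"
tokens = statement.split()
masked, targets = mask_tokens(tokens, seed=0)

# Putting the recorded targets back reconstructs the original input.
restored = [targets.get(i, tok) for i, tok in enumerate(masked)]
```

During pre-training the model receives `masked` as input and is trained to predict the entries of `targets`; the same procedure applies to English sentences and to code statements, which is what makes bi-modal pre-training straightforward with this objective.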
The choice of bi-modal pre-training may have been driven by the empirical evidence showing that pre-training on natural language helps for code-related tasks as well [148]. There is only one exception to this trend: all four DL-based techniques aimed at identifying toxic/uncivil code review comments exploit pre-training only on natural language. This is a sensible choice considering that the tackled task does not require the model to manipulate code elements.

It is also worth mentioning the radical changes observed in the usage of pre-training in recent years. First, before 2022, we found no work on code review automation using a pre-trained DL model. Second, in 2022, out of the eight automated solutions exploiting pre-training, only one (12.5%) used an already pre-trained model (in the other cases, the authors of the technique pre-trained their own model). In 2023 and 2024 this trend radically changed: of the 26 proposed techniques relying on a pre-trained model, only 2 [50, 77] exploit a model pre-trained by the authors themselves. Such a trend is easily explained by the ever-increasing availability of open source pre-trained models on websites such as HuggingFace [149].

Concerning the fine-tuning (i.e., the training of a model aimed at specializing it to the target task), all but three ML/DL-based techniques exploit it. The first two are those assessing review quality through biometrics, which use already trained models to interpret in real time the biometric information collected by dedicated devices (e.g., heart rate variability and pupillary response are captured and interpreted as “mental workload”). The third one is the work by Widyasari et al. [84] exploiting prompt engineering techniques in ChatGPT to improve review comments needing further explanations.
Given the recent rise in the capabilities of general-purpose LLMs and their applicability to software-related tasks, we expect more and more code review automation techniques to rely on LLMs’ prompt engineering rather than on fine-tuned models.

4.2.3 Granularity. We only discuss the observed trends for a selection of the tasks, mostly those targeted by several works. For some tasks, the granularity of the targeted entities is rather homogeneous. For example, when recommending reviewers, all 36 works take as input a code change to review, which could be a commit, a patch, or a PR. Still, the overall idea is: given a change to review, suggest the best-suited reviewers. The same observation can be made for the five works predicting pull request/code review completion time, and for the four visualizing code changes. When looking at generative tasks, instead, differences can be observed. The most interesting ones are those related to techniques generating review comments (11) and implementing the code change requested by a reviewer (11). For both of them, we can see that there are three families of techniques working on (i) entire code changes, (ii) specific diff hunks, and (iii) a single method/function. Targeting these two tasks at these different granularities entails completely different levels of difficulty. Let us discuss this point for the approaches supporting the implementation of a code change required by a reviewer. Approaches working on diff elements (either an entire diff or a diff hunk) [47, 54, 73, 76, 77] must be able to “understand” the reviewer’s comment in the context of complex diff changes which could possibly span across different code elements (e.g., multiple functions or even files involved). Thus, addressing these comments may be challenging, requiring modifications to several code elements.
Differently, when isolating single functions which had been commented on by a reviewer in the context of a larger code change (i.e., the single function may be only one of the impacted code elements) [49, 55, 72, 74], the approach has a much more limited coding context on which to operate the required code transformations. Obviously, the applicability and potential usefulness of these techniques also differ, with the former being more flexible (e.g., several reviewer comments do not even concern methods, but other code elements). The recent trend is to expand as much as possible the “contextual information” available to these code review automation techniques. This is mostly possible thanks to the availability of large DL models able to process large inputs (such as an entire diff). This is an interesting example of how technical constraints (e.g., DL models only able to process up to 512 tokens as input [49]) pushed researchers to artificially simplify the tackled problem (i.e., only focusing on changes required to small methods [49]), with the most recent approaches relaxing these constraints thanks to the advances in AI.

4.2.4 Language. Works automating code-review tasks only requiring the processing of natural language information (i.e., the review comments, the pull request description) are, by definition, programming language-independent (i.e., identifying/improving review comments needing further explanations, configuring static code analysis tools, and all those related to code review sentiment analysis and to time management; see Table 6). Also, 3/4 of the works classifying the usefulness of review comments are language independent, while one is focused on comments related to Python code. As expected, these techniques have been experimented on English artifacts only. Among the 36 techniques recommending reviewers, 31 are also language-independent.
Indeed, many of them mostly exploit historical information and treat the source code as a bag-of-words, without requiring parsing or other language-specific implementations. The remaining 5 either support Java only [8, 21, 43] or a set of multiple languages [14, 32]. When looking at the remaining 78 approaches, the most targeted programming language is Java (58 works), with 35 focusing exclusively on it and 23 also supporting at least one other language. For example, Li et al. [47] addressed the tasks of implementing the code change requested by a reviewer, generating review comments, and predicting code changes approval, merge, or need for review by training a transformer model on code review instances related to code written in nine different languages: C, C++, C#, Go, Java, JavaScript, PHP, Python, and Ruby.

Table 7. Evaluation of the proposed techniques: Top-3 metrics used in the evaluation; whether a qualitative inspection of the results has been performed; and whether the approaches have been deployed in industry

| Category | Task | #1 Metric | #2 Metric | #3 Metric | Qualitative | Deployed |
|---|---|---|---|---|---|---|
| Assessing Review Quality | Assessing Review Quality through Biometrics (2) | Accuracy | F1-score | Precision&Recall | ✗ | ✗ |
| | Classifying the Usefulness of Review Comments (4) | Precision&Recall | F1-score | Accuracy | 2/4 | 2/3 |
| | Identifying/Improving Review Comments Needing Further Explanations (2) | Accuracy | F1-score | Correct Type | 1/2 | ✗ |
| Code Change Analysis | Decomposing Tangled Commit (3) | Accuracy | MAP | MRR | 3/3 | ✗ |
| | Impact Analysis for Code Review (1) | Accuracy | Recall | MAP | 1/1 | ✗ |
| | Linking Similar Contributions (2) | F1-score | MRR | Precision&Recall | ✗ | ✗ |
| | Predicting Salient-Class (1) | Accuracy | Precision&Recall | - | 1/1 | ✗ |
| Code Change Classification | Identifying Impactful Code Changes (2) | AUC | F1-score | Precision&Recall | 1/2 | 1/2 |
| | Identifying Large-review-effort Code Changes (1) | AUC | F1-score | Precision&Recall | 1/1 | 1/1 |
| | Identifying Quickly Reviewable Changes (1) | NDCG | - | - | 1/1 | ✗ |
| | Predicting Code Changes Approval, Merge, or Need for review (11) | F1-score | Precision&Recall | AUC | 1/11 | ✗ |
| Code Change Quality Check | Checking Design Patterns Consistency (1) | - | - | - | 1/1 | ✗ |
| | Generating Review Comments (11) | BLEU | Accuracy | ROUGE-L | 6/11 | ✗ |
| | Identifying Clone Refactoring Opportunities (1) | Accuracy | F1-score | Precision&Recall | 1/1 | ✗ |
| | Predicting Code Defectiveness (2) | F1-score | AUC* | False alarms* | 1/2 | 1/2 |
| | Predicting Problematic Code Elements (3) | AUC | F1-score | Precision&Recall | 3/3 | ✗ |
| | Reviewing Code Formatting Violations (1) | F1-score | Precision&Recall | Prediction Rate | ✗ | ✗ |
| | Reviewing via Static Analysis (1) | - | - | - | 1/1 | 1/1 |
| Code Review Sentiment Analysis | Classifying the Sentiment of Review Comments (1) | Accuracy | F1-score | Precision&Recall | ✗ | ✗ |
| | Identifying “Pushback” Feelings in Reviews (1) | Precision&Recall | - | - | ✗ | ✗ |
| | Identifying Toxic/Uncivil Code Review Comments (4) | F1-score | Precision&Recall | Accuracy | ✗ | ✗ |
| | Rephrasing Toxic/Uncivil Comments (1) | Incivility Decrease | Length Dissimilarity | Semantic Similarity | 1/1 | ✗ |
| Retrieval of Similar CR/CC | Augmenting Reviews (8) | Accuracy | Precision&Recall | MRR | 4/8 | 1/8 |
| | Mining Code Improvement Patterns (1) | Accuracy | - | - | 1/1 | ✗ |
| Revised Code Generation | Implementing the Code Change Requested by a Reviewer (11) | Accuracy | BLEU | CodeBLEU | 5/11 | 1/11 |
| | Predicting the Code Output of the Review Process (4) | Accuracy | BLEU | Lev. Distance | 2/4 | ✗ |
| Time Management | Identifying Blocking Actors in Pull Requests (1) | MAE | MMRE | - | 1/1 | 1/1 |
| | Predicting Pull Request/Code Review Completion Time (5) | MAE | MRE | - | 2/5 | 2/2 |
| | Prioritizing Review Requests (3) | Accuracy* | AUC* | MAP* | 1/3 | 2/3 |
| Other | Classifying the Goal of a Review Comment or the Type of Change Triggered by a Comment (3) | F1-score | Precision&Recall | MCC | 1/3 | ✗ |
| | Configuring Static Code Analysis Tools (1) | MAP | MAR | Precision&Recall | 1/1 | ✗ |
| | Partitioning Static Analysis Warnings (1) | Review Effort | - | - | ✗ | ✗ |
| | Recommending Reviewers (36) | MRR | Accuracy | Precision&Recall | 4/36 | 3/36 |
| | Visualizing Code Changes (4) | - | - | - | 3/4 | 2/4 |
Finally, 9 of these 78 techniques are language-independent (i.e., they can be applied to any programming language without adaptation) [44, 83, 98–101, 105, 109, 126]. It is interesting to note the complete lack of support for low-resource languages, namely programming languages for which little training material is available (e.g., Julia, Lua, R). We will discuss this point further in Section 4.5.

4.3 RQ 3: How are techniques for the automation of code related tasks empirically evaluated?

Table 7 shows, for each code review task 𝑇𝑖 automated in the literature:

• The top-3 metrics used in the empirical evaluation of the techniques automating 𝑇𝑖. For approaches not employing any quantitative metrics in their evaluation (e.g., those visualizing code changes) a dash is used to fill the metrics-related columns. Also, some tasks have been automated by very few techniques which, however, have been evaluated using disjoint sets of metrics. For example, the three techniques prioritizing review requests all used different evaluation metrics [101, 134, 135], making it impossible to observe any trend. The same happens for the predicting code defectiveness task. In these cases, we just report the three metrics that are the most popular when also considering all other tasks. These cases are indicated in Table 7 with a “*” attached to the respective metrics. Finally, we decided to group together precision and recall, since they were always used in combination in the set of inspected papers.

• Whether a manual qualitative inspection of the techniques’ output has been performed. A “✗” indicates that for none of the techniques automating 𝑇𝑖 a manual qualitative analysis of their output has been performed. Otherwise, a fraction explicitly shows for how many of them, out of the total, this has been done.

• Whether the proposed technique has been deployed in an industrial setting (“Approach Deployed”). This column must be read as the previous one. For example, out of the 36 techniques to recommend reviewers, 3 have been deployed in industry.

Our goal with Table 7 is to provide an overview of the evaluations performed in the literature. The interested reader can find the complete data (e.g., the metrics used in each of the 119 papers) in our online appendix [78]. In the following, we discuss visible trends, especially for tasks for which several automation solutions have been proposed.

We can observe a clear distinction in the metrics used for two clusters of tasks related to classification and generative problems. For the former (e.g., classifying the usefulness of review comments, identifying impactful code changes, classifying the sentiment of review comments, recommending reviewers) well-known metrics such as precision, recall, F1-score, accuracy, and Area Under the ROC Curve (AUC) are mostly employed. When it comes to generative tasks (e.g., generating review comments, implementing the code change requested by a reviewer), researchers started borrowing evaluation metrics from the NLP field. For example, to assess whether a DL model is able to generate meaningful review comments, its output is compared against comments manually written by human reviewers for the same code under review, with metrics such as BLEU [150] (1st) and ROUGE-L [151] (3rd). Both of them are basically textual-similarity metrics which only work under specific circumstances. For example, if the DL model points to the same quality issue identified by the human reviewer using, however, a completely different wording, these metrics are unable to reward the model for the meaningful output. Even more penalizing for the automated technique is the usage of accuracy (2nd), which considers a generated comment as correct only if it is identical to the human-written one.
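To make this limitation concrete, the following minimal sketch scores the two review comments used as an example in Section 4.5.2 with exact-match accuracy and with a clipped n-gram precision. Note that this is only the core building block of BLEU, not the full metric: there is no brevity penalty and no averaging across n-gram orders, and tokenization is a naive lowercase whitespace split (an assumption made for simplicity). The helper names are illustrative and are not taken from any of the surveyed papers.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(candidate, reference):
    """Return 1 if the two token sequences are identical, 0 otherwise."""
    return int(tokenize(candidate) == tokenize(reference))


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU (not full BLEU)."""
    cand = tokenize(candidate)
    ref = tokenize(reference)
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# The two comments from the example in Section 4.5.2.
generated = "please rename variable h to something more meaningful"
human = "change h to height"

print(exact_match(generated, human))           # exact-match "accuracy" for this pair
print(ngram_precision(generated, human, n=1))  # unigram overlap
print(ngram_precision(generated, human, n=2))  # bigram overlap
```

Both comments point to the same issue (the name of variable h), yet exact match is 0, only 2 of the 8 generated unigrams ("h" and "to") appear in the human comment, and only 1 of the 7 generated bigrams ("h to") does. A metric built on such overlaps therefore assigns a low score to a recommendation that a human would consider meaningful.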
Similarly, when assessing the correctness of automatically generated code (e.g., to address a review comment) researchers use accuracy (1st), BLEU (2nd), and CodeBLEU [152] (3rd). Accuracy indicates that the approach addressed the reviewer’s comment exactly as done by a human developer (i.e., all code tokens are identical). CodeBLEU is a variant of the BLEU score meant to also capture AST-level similarity between two snippets of code (rather than merely textual similarity as done by BLEU). Also for this task, these evaluation metrics suffer from the same limitations discussed for the case of comment generation. Indeed, the same reviewer’s comment may be successfully implemented in two different ways by the machine and by the human, resulting in low evaluation scores even in the case of a meaningful recommendation. We will further discuss these concerns in Section 4.5.

Looking at Table 7 it is also possible to see that in several cases researchers tried to compensate for the shortcomings of the metrics employed to assess the effectiveness of techniques for generative problems. Indeed, 6/11 approaches generating review comments and 5/11 of those implementing the code change requested by a reviewer present some qualitative analysis in which, e.g., researchers looked at successful and wrong recommendations with the goal of better understanding strengths and weaknesses of the proposed approaches. In general, qualitative analysis is quite popular, with 43% of the code review automation techniques presenting some form of manual inspection. This percentage is negatively affected by the fact that only 4/36 papers recommending reviewers present a qualitative analysis.

Finally, only 18 of the 119 techniques (15%) have been deployed in industry. For example, Froemmgen et al. [77] deployed their approach for implementing the code change requested by a reviewer at Google.
While this percentage may look low at first sight, it is actually notable considering how recent several of the technologies behind these techniques are.

4.4 RQ 4: Which techniques and datasets are publicly available?

Table 8 reports the list of works from our SLR which either do not provide a link to a replication package (✗ in column “Provided”) or provide a link (✓) that is not accessible (✗ in column “Accessible”) at the date of writing (January 2025). Some references are present multiple times since the proposed approach supports several tasks. Overall, 49 of the 119 papers (41%) in our SLR do not provide a replication package, and 6 more (5%) provide a link which is no longer accessible. Considering that all surveyed papers present techniques for automating code review tasks, this implies substantial challenges for researchers interested in replicating these approaches, for example to use them as baselines for the proposal of a novel solution. The remaining 64 papers (54%) provide instead a working replication package, as documented in Table 9. Besides reporting the link to the working replication package, Table 9 also indicates what the authors provide in it in terms of code/tool implementing the proposed approach (column “C”) and data used in the paper (column “D”). Note that one of the works [36] provides neither code nor data, with the linked artifact mostly presenting additional tables. The most popular platform used for sharing the replication packages is by far GitHub (61%), followed by other solutions with a similar usage share (i.e., Zenodo, Figshare, Bitbucket, personal website).

While the percentage of papers providing a working replication package (54%) seems to suggest major issues in the replicability of techniques for code review automation, it is important to look at how such a trend is evolving over time. Fig. 4 shows that the efforts put in place by the software engineering research community for promoting open science (e.g., by default all papers submitted to the International Conference on Software Engineering must disclose data/artifacts) are improving code/data availability. Indeed, while up to 2020 the majority of the published papers did not provide a working replication package, this trend changed in 2021 (81% provide a replication package) and was confirmed in all subsequent years, with 86% of published works providing a replication package in 2024. Encouraging signals also come from the information reported in Table 9: of the 64 papers providing a replication package, 54 (84%) disclose both the source code of the proposed approach and/or a tool implementing it and the data used in the work.

4.5 RQ 5: What are the concerns raised or the limitations observed by researchers when experimenting the automated solutions?

Table 10 summarizes the concerns raised and limitations observed by researchers when experimenting with the proposed automated solutions. Table 10 organizes the identified issues into four parent categories: performance-, evaluation-, usability-, and deployment-related issues. For each of them, references to the papers in which we found evidence of such an issue have been reported. In the following, we discuss the main issues identified, focusing on those that are either very popular (i.e., reported in several papers) or that provide interesting insights for future work. We use the icon “” to highlight lessons learned and directions for future work.

Table 8. Works on code review automation not providing a replication package or having it not accessible as of Jan 2025

| Task | Reference | Provided | Accessible |
|---|---|---|---|
| Assessing Review Quality through Biometrics | Hijazi et al. [86] | ✗ | - |
| Augmenting Reviews | Guo et al. [125] | ✗ | - |
| | Guo et al. [121] | ✗ | - |
| | Gupta et al. [120] | ✗ | - |
| | Rahman et al. [83] | ✗ | - |
| Checking Design Patterns Consistency | He et al. [107] | ✗ | - |
| Classifying the Usefulness of Review Comments | Pangsakulyanont et al. [79] | ✗ | - |
| | Rahman et al. [80] | ✓ | ✗ |
| Decomposing Tangled Commit | Barnett et al. [88] | ✗ | - |
| | Tao et al. [87] | ✗ | - |
| | Wang et al. [89] | ✗ | - |
| Generating Review Comments | Nashaat et al. [56] | ✗ | - |
| | Vijayvergiya et al. [50] | ✗ | - |
| Identifying Blocking Actors in Pull Requests | Shan et al. [130] | ✗ | - |
| Identifying Clone Refactoring Opportunities | Chen et al. [108] | ✓ | ✗ |
| Identifying Impactful Code Changes | Wen et al. [103] | ✗ | - |
| Identifying Quickly Reviewable Changes | Zhao et al. [106] | ✗ | - |
| Identifying “Pushback” Feelings in Reviews | Egelman et al. [115] | ✗ | - |
| Identifying/Improving Review Comments Needing Further Explanations | Rahman et al. [83] | ✗ | - |
| Implementing the Code Change Requested by a Reviewer | Froemmgen et al. [77] | ✗ | - |
| | Nashaat et al. [56] | ✗ | - |
| Linking Similar Contributions | Ayinala et al. [102] | ✓ | ✗ |
| Mining Code Improvement Patterns | Ueda et al. [127] | ✗ | - |
| Partitioning Static Analysis Warnings | Tukaram et al. [138] | ✗ | - |
| Predicting Code Changes Approval, Merge, or Need for review | Fan et al. [97] | ✗ | - |
| | Nashaat et al. [56] | ✗ | - |
| | Shi et al. [96] | ✗ | - |
| Predicting Code Defectiveness | Sharma et al. [110] | ✗ | - |
| | Soltanifar et al. [109] | ✗ | - |
| Predicting Pull Requests/Code Review Completion Time | Chen et al. [133] | ✗ | - |
| | Maddila et al. [131] | ✗ | - |
| | Shan et al. [130] | ✗ | - |
| Predicting Salient-Class | Huang et al. [91] | ✗ | - |
| Prioritizing Review Requests | Saini et al. [134] | ✗ | - |
| Recommending Reviewers | Al et al. [24] | ✗ | - |
| | Aryendu et al. [30] | ✗ | - |
| | Asthana et al. [19] | ✗ | - |
| | Balachandran et al. [8] | ✗ | - |
| | Chouchen et al. [26] | ✗ | - |
| | Jiang et al. [9] | ✗ | - |
| | Jiang et al. [22] | ✗ | - |
| | Jiang et al. [17] | ✗ | - |
| | Kong et al. [42] | ✗ | - |
| | Liao et al. [20] | ✗ | - |
| | Ouni et al. [12] | ✗ | - |
| | Pandya et al. [28] | ✓ | ✗ |
| | Rahman et al. [32] | ✓ | ✗ |
| | Rebai et al. [34] | ✗ | - |
| | Rong et al. [40] | ✗ | - |
| | Strand et al. [25] | ✗ | - |
| | Xia et al. [11] | ✗ | - |
| | Xia et al. [16] | ✗ | - |
| | Ye et al. [35] | ✗ | - |
| | Ying et al. [13] | ✗ | - |
| | Yu et al. [14] | ✗ | - |
| | Zanjan et al. [15] | ✓ | ✗ |
| | Zhang et al. [31] | ✗ | - |
| Reviewing via Static Analysis | Balachandran et al. [8] | ✗ | - |
| Visualizing Code Changes | Menarini et al. [139] | ✗ | - |

Table 9. Works on code review automation providing a (still accessible at Jan 2025) replication package
Task Reference Link C D Augmenting Reviews Hirao et al. [126] https://github.com/software-rebels/ReviewLinkageGraph ✓✓ Kartal et al. [124] https://github.com/ykartal/Github-SourceCode-Review ✓✓ Shuvo et al. [123] https://drive.google.com/file/d/15kq7LqvfY-oP1M1UDdK_lLmUfq71daVR/view ✗✓ Classifying the Goal of a Review Comment Fregnan et al. [45] https://zenodo.org/records/5592254 ✓✓ or the Type of Change Triggered by a Comment Li et al. [44] https://sites.google.com/view/core2019/ ✗✓ Turzo et al. [136] https://github.com/WSU-SEAL/CR-classification-ESEM23 ✓✓ Classifying the Usefulness of Review Comments Hasan et al. [81] https://github.com/WSU-SEAL/CRA-usefulness-model ✓✓ Yang et al. [82] https://zenodo.org/records/8297481 ✓✓ Generating Review Comments Hong et al. [48] https://github.com/awsm-research/CommentFinder ✓✓ Li et al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓ Li et al. [46] https://gitlab.com/ai-for-se-public-data/auger-fse-2022 ✓✓ Lin et al. [53] https://zenodo.org/records/10572047 ✓✓ Lu et al. [55] https://zenodo.org/records/7991113 ✓✓ Lu et al. [51] https://zenodo.org/records/10964945 ✓✓ Sghaier et al. [54] https://zenodo.org/records/10676741 ✓✓ Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓ Yu et al. [52] https://github.com/aiopsplus/Carllm ✗✓ Identifying Toxic/Uncivil Code Review Comments Ferreira et al. [117] https://doi.org/10.6084/m9.figshare.24603237 ✓✓ Rahman et al. [119] https://github.com/Oyakiolo052/ATUC_Artifacts ✓✓ Sarker et al. [116] https://github.com/WSU-SEAL/ToxiCR ✓✓ Sarker et al.
[118] https://github.com/WSU-SEAL/ToxiSpanSE ✓✓ Identifying/Improve Review Comments Needing Fur- ther ExplanationsWidyasari et al. [84] https://figshare.com/s/135201b8f87ab705448b ✓✓ Impact Analysis for Code Review Hong et al. [93] https://figshare.com/s/135201b8f87ab705448b ✓✓ Huq et al. [73] https://github.com/Review4Repair/Review4Repair ✓✓ Liet al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓ Luet al. [76] https://github.com/moonmengmeng/EnRefiner ✓✓ Implementing the Code Change Requested by Luet al. [55] https://zenodo.org/records/7991113 ✓✓ a Reviewer Pornprasit et al. [75] https://github.com/awsm-research/LLM-for-code-review-automatiton ✓✓ Sghaier et al. [54] https://zenodo.org/records/10676741 ✓✓ Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓ Zhang et al. [74] https://github.com/EngineeringSoftware/CoditT5 ✓✓ Chouchen et al. [101] https://github.com/stilab-ets/CostAwareCR ✓✓ Chouchen et al. [99] https://github.com/stilab-ets/multicr ✓✓ Islam et al. [98] https://github.com/khairulislam/Predict-Code-Changes ✓✓ Predicting Code Changes Approval, Merge, Liet al. [47] https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer ✓✓ or Need for review Luet al. [55] https://zenodo.org/records/7991113 ✓✓ Wu and Zhang [95] https://github.com/SimAST-GCN/CLMN ✓✓ Wuet al. [94] https://github.com/SimAST-GCN/SimAST-GCN ✓✓ Yang et al. [100] https://figshare.com/s/7930029ea5ec5af2845d ✓✓ Predicting Problematic Code ElementsHong et al. [111] https://github.com/awsm-research/RevSpot-replication-package ✓✓ Olewicki et al. [112] https://zenodo.org/records/10783562 ✓✓ Sghaier et al. [71] https://zenodo.org/records/7533156 ✓✓ Predicting Pull Request/Code Review Completion TimeChouchen et al. [132] https://github.com/stilab-ets/MCRDuration ✓✓ Yang et al. [100] https://zenodo.org/records/7533156 ✓✓ Predicting the Code Output of the Review Process Pornprasit et al. 
[129] https://github.com/awsm-research/D-ACT-Replication-Package ✓✓ Prioritizing Review RequestsChouchen et al. [101] https://github.com/stilab-ets/CostAwareCR ✓✓ Yang et al. [135] https://figshare.com/s/133f23da558b7b254041?file=46923235 ✓✓ Recommending ReviewersAhasanuzzaman et al. [43] https://drive.google.com/drive/folders/1bSC9iRtjKjMTRa9hiyECijgABKGfpyT4 ✓✓ Chueshev et al. [33] https://github.com/alexchueshev/icsme2020 ✓✓ Fejzer et al. [18] https://github.com/mfejzer/reviewers_recommendation ✓✓ Hajari et al. [38] https://github.com/rigbypc/SofiaWL/tree/master/ReplicationPackage ✓✓ Liet al. [29] https://zenodo.org/record/7292881 ✓✓ Mirsaeedi et al. [23] https://zenodo.org/record/3678551#.ZFS5EC8RpBw ✗✓ Qiao et al. [37] https://github.com/cufeinfor/MIRRec ✓✓ Rahman et al. [39] https://zenodo.org/records/8190493 ✓✓ Sulun et al. [21] https://figshare.com/s/27a35b4ae70269481a2c ✓✓ Sulun et al. [41] https://github.com/sulunemre/rstrace-replication ✓✓ Tecimer et al. [27] https://figshare.com/s/1b9ea55377d9f2c31a7a ✓✓ Thongtanunam et al. [10] https://github.com/patanamon/revfinder ✗✓ Zhao et al. [36] https://github.com/liuj888/ReviewerRecommendationLtR ✗✗ 22 Page 23: Automating Code Review: A Systematic Literature Review Woodstock ’18, June 03–05, 2018, Woodstock, NY Table 9 (continue): Works on code review automation providing a (still accessible at Jan 2025) replication package Task Reference Link C D Rephrasing Toxic/Uncivil Comments Rahman et al. [119] https://github.com/Oyakiolo052/ATUC_Artifacts ✓✓ Linking Similar Contributions Wang et al. [92] https://github.com/dong-w/Replication-Patch-Linkage ✓✓ Identifying Impactful Code Changes Uchôa et al. [104] https://zenodo.org/record/4563214#.Y0kjiexBwQg ✗✓ Identifying Large-review-effort Code Changes Wang et al. [105] https://bitbucket.org/wangsonging/ist2020_repo/src/master/ ✓✓ Reviewing Code Formatting Violations Markovtsev et al. 
[113] https://github.com/src-d/style-analyzer ✓✓ Assessing Review Quality through Biometrics Hijazi et al. [85] https://github.com/HaythamHijazi/Supplement ✓✓ Classifying the Sentiment of Review Comments Ahmed et al. [114] https://github.com/senticr/SentiCR/ ✓✓ Retrieving Similar Reviews Siow et al. [122] https://sites.google.com/view/core2019/ ✗✓ Predicting the Code Output of the Review Process Huq et al. [73] https://github.com/Review4Repair/Review4Repair ✓✓ Patanamon et al. [128] https://github.com/awsm-research/AutoTransform-Replication ✓✓ Tufano et al. [72] https://github.com/RosaliaTufano/code_review ✓✓ Tufano et al. [49] https://github.com/RosaliaTufano/code_review_automation ✓✓ Visualizing Code Changes Brito and Valente [141] https://github.com/rodrigo-brito/refactoring-aware-diff ✓✗ Fadhel et al. [140] https://github.com/hadii-tech/striff-lib ✓✗ Fregnan et al. [142] https://zenodo.org/record/7047993#.Y2JqNS-B2Uo ✓✓ Configuring Static Code Analysis Tools Zampetti et al. [137] https://github.com/senticr/SentiCR/ ✗✓

Fig. 4. Availability of a working replication package by publication year

Table 10. Concerns and limitations discussed by researchers

| Parent Category | Child Category | References |
|---|---|---|
| Performance | Limited support in specific scenarios | [41, 48, 54, 56, 72, 82, 87, 88, 91, 93] |
| | Lack of generalizability across different datasets | [56, 79, 117] |
| | Unsatisfactory recommendations | [8, 42, 44, 49, 72, 73, 84, 92, 115, 117, 128, 131, 137, 153] |
| | Does not save time | [25] |
| | Noise in training data | [19, 22, 33, 47, 49, 54, 93] |
| | Bias in recommendations | [10, 14, 19, 23, 25, 27, 38, 97, 110, 154] |
| Evaluation | Suboptimal metrics | [11, 21, 33, 47, 51, 105, 124, 132] |
| | Reliability of oracle | [9, 16, 19, 37, 40, 122] |
| | Lack of tradeoffs assessment | [12, 38, 73, 87, 130] |
| | Data leakage | [74] |
| | Relevance for practitioners not assessed | [45] |
| Usability | Responsiveness and scalability | [35, 77, 94, 121, 122, 131, 139, 142] |
| | Steep learning curve | [39, 77, 100, 139, 140] |
| | Interpretability of LLMs | [112] |
| | Information overload | [89, 142] |
| Deployment | Difficult to integrate in developers’ workflow | [88] |
| | Human factors | [115, 119] |
| | Privacy concerns | [86] |
| | Too expensive | [50, 56, 76, 101, 125] |

4.5.1 Performance-related issues. We use the term “performance” to refer to the ability of the technique to provide proper support for the automation of the targeted task (this is unrelated to performance aspects such as memory footprint). Several researchers highlight the unsatisfactory recommendations generated by the experimented techniques, which may make them not ready for adoption by developers. This is a crosscutting and expected concern, particularly affecting generative tasks requiring the generation of text or code (e.g., implementing the code change requested by a reviewer). However, even in classification tasks for which automated support proved quite successful, some researchers raised major concerns about their actual effectiveness: Strand et al. [25] observed that while their approach for reviewer recommendation performed well when evaluated on historical data, it did not seem to save developers time once deployed in industry. This stresses the importance of experimenting the proposed techniques in realistic scenarios, which can provide feedback about their actual usefulness.

A very common limitation of the automation techniques is also the limited support they offer in specific scenarios. The term “scenario” here can have different meanings. For techniques relying on historical data such as those recommending reviewers based on past assignments, or those retrieving code reviews performed in the past for code similar to the one under review, there are limitations related to their applicability on “previously unseen data”.
For example, retrieving reviews from the past does not allow the approach to generate previously unseen review comments [48], something doable nowadays by training DL models. Thus, while retrieval-based techniques offer specific advantages even in the context of generative tasks (e.g., they are substantially faster than DL-based techniques), relying on them is advisable mostly in quite stable contexts in which the development team, the review process, and the code base are not expected to undergo major and continuous changes (thus preserving the value of what was learned in the past). One emerging concern is the already mentioned lack of support by DL-based code review automation techniques for low-resource languages [54]. It is reasonable to expect that the performance of code review automation techniques drops substantially for these languages, highlighting the importance of investigating the applications of these tools in “niche usage scenarios”, as recently done in the context of code generation [155–157]. Also, some researchers highlighted the limited applicability of their technique in scenarios which are, however, specific to the tackled problem and the experimented technique. For example, Huang et al. [91] claim that their approach to predict the salient class of a commit is unable to deal with tangled commits.

Still related to the performance of the proposed automated techniques, several studies presenting learning-based techniques report the presence of noise in training data as a major concern. Given the amount of data on which these techniques rely for learning, it is difficult to guarantee the quality of training data.
For example, when looking at reviewer recommenders, noise can come from developers using multiple accounts, which are treated as different developers by the approach [22], or even from sub-optimal assignments made in the past, i.e., the reviewer assigned to a pull request was not the most appropriate, but maybe the one who had a lower workload at that specific moment [19]. Similarly, researchers raised concerns about the quality of training data used for DL models aimed at generating review comments or implementing the code change requested by a reviewer [47, 49]. These approaches usually learn from triplets featuring (i) the code submitted for review, (ii) the review comments posted by humans, and (iii) the revised code implementing the changes requested in the review comments. This data is automatically mined from forges such as GitHub and, consequently, can be noisy. For example, the collected revised code, while being a modified version of the code submitted for review, may not actually implement the reviewers’ comments, but other unrequested changes. The mined triplet will thus “teach” something wrong to the DL model. Despite the major cleaning efforts performed by researchers [47, 49], noisy instances survive in the training data, since it is difficult for a single research group to have the manpower to manually validate the whole dataset. A joint effort of the research community working on the automation of code review activities would be needed (at least for specific tasks of interest), similarly to what has been done in other fields like image recognition [158].

Finally, researchers working on reviewer recommendation and on predicting code changes approval/merge report the presence of bias in recommendations generated by their techniques [10, 14, 19, 23, 25, 27, 38, 97, 110, 154].
When it comes to recommending reviewers, bias manifests in the fact that reviewers who have been employed more in the past will also be employed more in the future. The bias becomes even more evident if the approach is re-trained over time to include new data, also featuring pull requests in which the approach has been employed (thus promoting the same reviewers over and over). When it comes to predicting code changes approval/merge, researchers reported a negative bias of the techniques towards pull requests opened by newcomers. These two examples highlight the importance of considering human factors in the evaluation of the proposed techniques, besides computing performance-related metrics.

4.5.2 Evaluation-related issues. For this category, we discuss the first three types of concern reported in Table 10, since the last two (i.e., data leakage and relevance for practitioners not assessed) have only been reported in one paper each. The usage of suboptimal metrics in the empirical validations is a major concern for the new line of research tackling generative tasks [47, 51, 124]. For example, in the “generating review comments” task the technique is provided as input with code to review and is expected to comment on its quality in natural language as a human would do. The question is how to automatically assess the quality of the generated comments. As explained in the context of RQ 3, since the code to review is usually mined from open source projects, researchers usually compute a similarity metric (e.g., the BLEU score [150]) between the generated comment and the comments that were posted by human reviewers for that same code. However, there are many issues with this evaluation procedure. First, two completely different natural language comments may point to the same quality issue in the code. For example, the deep learning model may output “please rename variable h to something more meaningful” while the human reviewer may write “change h to height”.
A textual similarity metric between these two comments would point to the low quality of the comment generated by the deep learning model while, in reality, the technique outputted a meaningful recommendation. Second, the model may correctly spot a quality issue which, however, has been missed by the human reviewers, thus not having any “similar human comment” to compare with. This again would result in a correct recommendation considered wrong. On top of generative tasks, there are other code review automation tasks for which the metrics used for evaluation only represent a weak proxy for the actual usefulness of the approach. For example, Chueshev et al. [33] stress that evaluation metrics such as top@k accuracy and MRR which they use in the context of reviewer recommendation might not align with the practical use of their technique, since they do not focus on the actual value added by new reviewer recommendations. A strongly-related concern possibly affecting the validity of the reported empirical evaluations is the limited reliability of oracles , which mirrors the previously discussed “ noise in training data ” but on the “test data”. Future work should aim at (i) defining metrics better capturing the actual usefulness of code review automation techniques, for example assessing the relevance of a generated comment for a given code under review, rather than only comparing it with comments written by humans; and (ii) creating curated benchmarks for code review automation, similarly to what the research community is doing for code generation [159]. The third evaluation-related issue we discuss concerns to the lack of tradeoffs assessment . 
Concrete examples of this issue may be: (i) focusing the evaluation of a reviewer recommender only on the correctness of the recommendation, without considering the workload distribution among reviewers as one of the objectives to meet [38]; (ii) not considering that specific decisions may be influenced by interpersonal relationships rather than by objective factors [12]; and (iii) ignoring in the evaluation the cost of adopting a novel tool the developers are not familiar with [87]. A more comprehensive view of the tradeoffs that come into play when a new code review automation technique is proposed would be desirable in the performed empirical evaluations. However, this may not be doable without running case studies, which are non-trivial to run. At the very least, a careful discussion of the tradeoffs that were not assessed is advisable for works in the area, especially considering the socio-technical nature of code reviews.

4.5.3 Usability-related issues. Automation solutions proposed in academia are often implemented in the form of prototypes, with little attention given to non-functional attributes such as responsiveness and scalability [35, 77, 94, 121, 122, 131, 139, 142]. Sometimes these issues are indeed the result of non-optimized code, while in other cases they are intrinsic limitations of the proposed approaches. For example, retrieval-based techniques may experience an increasing lack of responsiveness with the growth of the knowledge base from which information is retrieved. Similarly, visualization techniques may not scale to accommodate overly complex objects or large amounts of information (e.g., a very large code diff in the context of code review). The developed solutions may also be characterized by a steep learning curve [39, 77, 100, 139, 140], which is, however, difficult to assess without human-based studies.
Finally, usability concerns may come from information overload [89, 142] and the lack of interpretability of deep learning models [112]. Concerning the former, Fregnan et al. [142] discuss the risk of information overload when visualizing overly intricate merge requests, with the concrete risk of obscuring useful information for the reviewers rather than helping them in the code inspection.

Since the overall goal of code review automation is to save software developers time, the usability of the proposed solutions should be considered a first-class citizen, both at design and evaluation time. Currently, most techniques are assessed in in-vitro evaluations, relying on test sets built by mining software repositories. These evaluations completely neglect the “usability” aspect. For some code review tasks for which automation has only been recently targeted (e.g., generating review comments) this may be reasonable, considering that the proposed solutions are still far from generating meaningful recommendations most of the time. However, for other tasks such as recommending reviewers, dozens of techniques have been proposed, with the most recent ones achieving excellent performance on the artificial benchmarks. Investigating their usability now becomes important.

4.5.4 Deployment-related issues. Related to the former concerns are the deployment-related issues discussed by researchers. We found these issues discussed in only a few papers, mostly pointing to the possibility that the proposed technique may be too expensive to deploy in practice [50, 56, 76, 101, 125]. This type of concern is mostly related to the proposal of AI-based solutions: Deploying an in-house AI assistant may require substantial monetary investments for training large DL models, making them available on powerful servers, and maintaining them (e.g., retraining them) to keep their usefulness over time.
Given the growth of the AI4SE research field, it is important to step back and also consider the cost-related implications of these techniques, both in terms of money and environmental impact. Integrating techniques such as quantization [160], knowledge distillation [161], and parameter-efficient fine-tuning [162] can help in reducing both the memory footprint and the training/inference cost of the proposed solutions. Still, the impact of these techniques on the performance (quality of recommendations) of code review automation tools must be carefully assessed (e.g., a quantized model may experience a substantial loss of performance when compared to the original model).

5 THREATS TO VALIDITY

Threats to construct validity concern the relation between theory and observation. We only included papers indexed in the six queried databases. Also, we only focused on works published in software engineering venues. Thus, there might be additional studies we missed. The snowballing procedure we applied helps in mitigating this threat, despite the fact that we only performed one round of snowballing. We believe that most relevant studies were included based on the expertise the authors have in this domain. Also, the number of papers included in our SLR is large enough to answer our research questions and the main findings are unlikely to change even assuming a few missing works. Also, as a design decision, we did not apply any quality assessment criteria to exclude studies from our SLR. Indeed, we felt that the subjectiveness of this judgement was too high and decided to consider peer review as a sort of “automated quality filter”. We acknowledge that some peer-reviewed studies included in our study might feature flaws or wrong claims which could also potentially affect our findings (e.g., wrong data extracted).
Threats to internal validity concern external factors we did not consider that could affect the variables being investigated. The search engines we used are continuously updated, both in terms of search features and in terms of the papers they index. We cannot ensure replicability of our findings. However, we provide all material we collected in an online appendix [78].

Threats to external validity concern the generalizability of our findings. We decided to focus our SLR only on the literature proposing code review automation tools, ignoring, for example, the body of knowledge related to empirical studies on code review. Furthermore, as our paper search was conducted in December 2024, our SLR misses works which have been indexed in the searched databases after that date.

Threats to conclusion validity concern the relations between the conclusions and our analyzed data. The main threat here is related to the correctness of the data we extracted from the inspected papers. To minimize errors, the two authors always double-checked the information each of them collected. However, especially in the context of RQ5, we felt that a strong subjectivity component was involved in deciding what should be considered as a limitation/concern discussed by the paper’s authors. We acknowledge that we may have missed several insights reported in the read works. Still, we feel the set of limitations/concerns discussed in RQ5 to be quite representative of those we encountered while reading papers for this SLR.

6 CONCLUSIONS

We presented a systematic literature review involving 119 papers presenting solutions for the automation of code review-related tasks. Firstly, we categorized the 34 tasks for which at least one automated approach has been proposed. We then summarized the under-the-hood solutions behind these approaches, and the metrics used in their empirical evaluation.
We also looked for the presence of replication packages in the 119 papers, checking for the available ones whether they are still reachable and if they provide access to the presented approach and the used data. In the end, we highlighted the concerns and limitations researchers discussed when presenting and evaluating the proposed approaches, using them to highlight possible directions for future work. We release the raw data summarized in the SLR in our online appendix [78].

ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851720).

REFERENCES

[1] M. E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal , 15(3):182–211, 1976. [2] Alberto Bacchelli and Christian Bird. Expectations, outcomes, and challenges of modern code review. In 35th IEEE/ACM International Conference on Software Engineering, ICSE , pages 712–721, 2013. [3] Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E. Hassan. The impact of code review coverage and code review participation on software quality: A case study of the qt, vtk, and itk projects. In 11th IEEE/ACM Working Conference on Mining Software Repositories, MSR , pages 192–201, 2014. [4] Rodrigo Morales, Shane McIntosh, and Foutse Khomh. Do code review practices impact design quality? a case study of the qt, vtk, and itk projects. In 22nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER , pages 171–180, 2015. [5] Gabriele Bavota and Barbara Russo. Four eyes are better than two: On the impact of code reviews on software quality. In IEEE International Conference on Software Maintenance and Evolution, ICSME , pages 81–90, 2015. [6] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli.
Modern code review: A case study at google. In 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP , pages 181–190, 2018. [7] A. Bosu and J. C. Carver. Impact of peer code review on peer impression formation: A survey. In 7th IEEE/ACM International Symposium on Empirical Software Engineering and Measurement, ESEM , pages 133–142, 2013. [8] Vipin Balachandran. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In 2013 35th International Conference on Software Engineering (ICSE) , pages 931–940. IEEE, 2013. [9] Jing Jiang, Jia-Huan He, and Xue-Yuan Chen. Coredevrec: Automatic core member recommendation for contribution evaluation. J. Comput. Sci. Technol. , 30(5):998–1016, 2015. [10] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. Who should review my code? a file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER) , pages 141–150. IEEE, 2015. [11] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. Who should review this change?: Putting text and file location analyses together for more accurate recommendations. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pages 261–270, 2015. [12] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. Search-based peer reviewers recommendation in modern code review. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pages 367–377. IEEE, 2016. [13] Haochao Ying, Liang Chen, Tingting Liang, and Jian Wu. Earec: Leveraging expertise and authority for pull-request reviewer recommendation in github. In 2016 IEEE/ACM 3rd International Workshop on CrowdSourcing in Software Engineering (CSI-SE) , pages 29–35, 2016.
[14] Yue Yu, Huaimin Wang, Gang Yin, and Tao Wang. Reviewer recommendation for pull-requests in github: What can we learn from code review and bug assignment? Information and Software Technology , 74:204–218, 2016. [15] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering , 42(6):530–543, 2016. [16] Zhenglin Xia, Hailong Sun, Jing Jiang, Xu Wang, and Xudong Liu. A hybrid approach to code reviewer recommendation with collaborative filtering. In 2017 6th International Workshop on Software Mining (SoftwareMining) , pages 24–31. IEEE, 2017. [17] Jing Jiang, Yun Yang, Jiahuan He, Xavier Blanc, and Li Zhang. Who should comment on this pull request? analyzing attributes for more accurate commenter recommendation in pull-based development. Information and Software Technology , 84:48–62, 2017. [18] Mikołaj Fejzer, Piotr Przymus, and Krzysztof Stencel. Profile based recommendation of code reviewers. Journal of Intelligent Information Systems , 50:597–619, 2018. [19] Sumit Asthana, Rahul Kumar, Ranjita Bhagwan, Christian Bird, Chetan Bansal, Chandra Maddila, Sonu Mehta, and B. Ashok. Whodo: Automating reviewer suggestions at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ESEC/FSE 2019, page 937–945, New York, NY, USA, 2019. Association for Computing Machinery. [20] Zhifang Liao, Zexuan Wu, Jinsong Wu, Yan Zhang, Junyi Liu, and Jun Long. Tirr: A code reviewer recommendation algorithm with topic model and reviewer influence. In 2019 IEEE Global Communications Conference (GLOBECOM) , pages 1–6. IEEE, 2019. [21] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. Reviewer recommendation using software artifact traceability graphs.
In Proceedings of the fifteenth international conference on predictive models and data analytics in software engineering , pages 66–75, 2019. [22] Jing Jiang, David Lo, Jiateng Zheng, Xin Xia, Yun Yang, and Li Zhang. Who should make decision on this pull request? analyzing time-decaying relationships and file similarities for integrator prediction. Journal of Systems and Software , 154:196–210, 2019. [23] Ehsan Mirsaeedi and Peter C. Rigby. Mitigating turnover with code review recommendation: Balancing expertise, workload, and knowledge distribution. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , ICSE ’20, page 1183–1195, New York, NY, USA, 2020. Association for Computing Machinery. [24] Wisam Haitham Abbood Al-Zubaidi, Patanamon Thongtanunam, Hoa Khanh Dam, Chakkrit Tantithamthavorn, and Aditya Ghose. Workload-aware reviewer recommendation using a multi-objective search-based approach. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering , pages 21–30, 2020. [25] Anton Strand, Markus Gunnarson, Ricardo Britto, and Muhmmad Usman. Using a context-aware approach to recommend code reviewers: findings from an industrial case study. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice , pages 1–10, 2020. [26] Moataz Chouchen, Ali Ouni, Mohamed Wiem Mkaouer, Raula Gaikovina Kula, and Katsuro Inoue. Whoreview: A multi-objective search-based approach for code reviewers recommendation in modern code review. Applied Soft Computing , 100:106908, 2021. [27] K Ayberk Tecimer, Eray Tüzün, Hamdi Dibeklioglu, and Hakan Erdogmus. Detection and elimination of systematic labeling bias in code reviewer recommendation systems. In Evaluation and Assessment in Software Engineering , pages 181–190. 2021.
[28] Prahar Pandya and Saurabh Tiwari. Corms: A github and gerrit based hybrid code reviewer recommendation approach for modern code review. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ESEC/FSE 2022, page 546–557, New York, NY, USA, 2022. Association for Computing Machinery. [29] Ruiyin Li, Peng Liang, and Paris Avgeriou. Code reviewer recommendation for architecture violations: An exploratory study. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, EASE 2023 , pages 42–51. ACM, 2023. [30] Ishan Aryendu, Ying Wang, Farah Elkourdi, and Eman Abdullah Alomar. Intelligent code review assignment for large scale open source software stacks. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE , page to appear, 2023. [31] Jiyang Zhang, Chandra Maddila, Ram Bairi, Christian Bird, Ujjwal Raizada, Apoorva Agrawal, Yamini Jhawar, Kim Herzig, and Arie van Deursen. Using large-scale heterogeneous graph representation learning for code review recommendations at microsoft. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) , pages 162–172, 2023. [32] Mohammad Masudur Rahman, Chanchal K. Roy, and Jason A. Collins. Correct: Code reviewer recommendation in github based on cross-project and technology experience. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C) , pages 222–231, 2016. [33] Aleksandr Chueshev, Julia Lawall, Reda Bendraou, and Tewfik Ziadi. Expanding the number of reviewers in open-source projects by recommending appropriate developers. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pages 499–510, 2020. [34] Soumaya Rebai, Abderrahmen Amich, Somayeh Molaei, Marouane Kessentini, and Rick Kazman. 
Multi-objective code reviewer recommendations: Balancing expertise, availability and collaborations. Automated Software Engineering , page 301–328, 2020. [35] Xin Ye. Learning to rank reviewers for pull requests. IEEE Access , pages 85382–85391, 2019. [36] Guoliang Zhao, Jiawen Liu, Daniel Alencar da Costa, and Ying Zou. Adopting learning-to-rank algorithm for reviewer recommendation. In Paria Shirani, Iosif-Viorel Onut, and Tinny Ng, editors, Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, CASCON 2022 , pages 22–31. ACM, 2022. [37] Yu Qiao, Jian Wang, Can Cheng, Wei Tang, Peng Liang, Yuqi Zhao, and Bing Li. Code reviewer recommendation based on a hypergraph with multiplex relationships. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2024 , pages 417–428. IEEE, 2024. [38] Fahimeh Hajari, Samaneh Malmir, Ehsan Mirsaeedi, and Peter C. Rigby. Factoring expertise, workload, and turnover into code review recommendation. IEEE Trans. Software Eng. , 50(4):884–899, 2024. [39] Md Shamimur Rahman, Debajyoti Mondal, Zadia Codabux, and Chanchal K. Roy. Integrating visual aids to enhance the code reviewer selection process. In IEEE International Conference on Software Maintenance and Evolution, ICSME 2023, Bogotá, Colombia, October 1-6, 2023 , pages 293–305. IEEE, 2023. [40] Guoping Rong, Yifan Zhang, Lanxin Yang, Fuli Zhang, Hongyu Kuang, and He Zhang. Modeling review history for reviewer recommendation: a hypergraph approach. In Proceedings of the 44th International Conference on Software Engineering , ICSE ’22, page 1381–1392, 2022. [41] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. Rstrace+: Reviewer suggestion using software artifact traceability graphs. Information and Software Technology , 130:106455, 2021. [42] Dezhen Kong, Qiuyuan Chen, Lingfeng Bao, Chenxing Sun, Xin Xia, and Shanping Li. Recommending code reviewers for proprietary software projects: A large scale study. 
In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18, 2022 , pages 630–640. IEEE, 2022. [43] Md. Ahasanuzzaman, Gustavo Ansaldi Oliva, and Ahmed E. Hassan. Using knowledge units of programming languages to recommend reviewers for pull requests: an empirical study. Empir. Softw. Eng. , 29(1):33, 2024. [44] Zhixing Li, Yue Yu, Gang Yin, Tao Wang, Qiang Fan, and Huaimin Wang. Automatic classification of review comments in pull-based development model. In SEKE , pages 572–577, 2017. [45] Enrico Fregnan, Fernando Petrulio, Linda Di Geronimo, and Alberto Bacchelli. What happens in my code reviews? an investigation on automatically classifying review changes. Empirical Software Engineering , 27(4):89, 2022. [46] Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo. Auger: automatically generating review comments with pre-training models. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages 1009–1021, 2022. [47] Li Zhiyu, Lu Shuai, Guo Daya, Duan Nan, Jannu Shailesh, Jenks Grant, Majumder Deep, Green Jared, Svyatkovskiy Alexey, Fu Shengyu, and Neel Sundaresan. Automating code review activities by large-scale pre-training. In 30th ACM Joint European Software Engineering Conference and the ACM/SIGSOFT International Symposium on the Foundations of Software Engineering ESEC-FSE , pages 1035–1047, 2022. [48] Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti. Commentfinder: a simpler, faster, more accurate code review comments recommendation. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages 507–519, 2022. [49] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 
Using pre-trained models to boost code review automation. In 44th IEEE/ACM International Conference on Software Engineering, ICSE , pages 2291–2302, 2022. [50] Manushree Vijayvergiya, Malgorzata Salawa, Ivan Budiselic, Dan Zheng, Pascal Lamblin, Marko Ivankovic, Juanjo Carin, Mateusz Lewko, Jovan Andonov, Goran Petrovic, Daniel Tarlow, Petros Maniatis, and René Just. Ai-assisted assessment of coding practices in modern code review. In Bram Adams, Thomas Zimmermann, Ipek Ozkaya, Dayi Lin, and Jie M. Zhang, editors, Proceedings of the 1st ACM International Conference on AI-Powered Software, AIware 2024 . ACM, 2024. [51] Junyi Lu, Zhangyi Li, Chenjie Shen, Li Yang, and Chun Zuo. Exploring the impact of code review factors on the code review comment generation. Autom. Softw. Eng. , 31(2):71, 2024. [52] Yongda Yu, Guoping Rong, Haifeng Shen, He Zhang, Dong Shao, Min Wang, Zhao Wei, Yong Xu, and Juhong Wang. Fine-tuning large language models to improve accuracy and comprehensibility of automated code review. ACM Trans. Softw. Eng. Methodol. , 34(1), 2024. [53] Hong Yi Lin, Patanamon Thongtanunam, Christoph Treude, and Wachiraphan Charoenwet. Improving automated code reviews: Learning from experience. In Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou, editors, 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024 , pages 278–283. ACM, 2024. [54] Oussama Ben Sghaier and Houari A. Sahraoui. Improving the learning of code review successive tasks with cross-task knowledge distillation. Proc. ACM Softw. Eng. , 1(FSE):1086–1106, 2024. [55] Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun Zuo. Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning. In 34th IEEE International Symposium on Software Reliability Engineering, ISSRE 2023 , pages 647–658. IEEE, 2023. [56] Mona Nashaat and James Miller.
Towards efficient fine-tuning of language models with organizational data for automated software review. IEEE Trans. Software Eng. , 50(9):2240–2253, 2024. [57] Christoph Hannebauer, Michael Patalas, Sebastian Stünkel, and Volker Gruhn. Automatically recommending code reviewers based on their expertise: An empirical comparison. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016 , page 99–110, 2016. [58] Flavia Coelho, Tiago Massoni, and Everton L.G. Alves. Refactoring-aware code review: A systematic mapping study. In IEEE/ACM 3rd International Workshop on Refactoring (IWoR’19) , pages 63–66, 2019. [59] Nicole Davila and Ingrid Nunes. A systematic literature review and taxonomy of modern code review. Journal of Systems and Software , 177:110951, 2021. [60] Deepika Badampudi, Ricardo Britto, and Michael Unterkalmsteiner. Modern code reviews - preliminary results of a systematic mapping study. In Proceedings of the Evaluation and Assessment on Software Engineering EASE’19 , page 340–345, 2019. [61] Ilenia Fronza, Arto Hellas, Petri Ihantola, and Tommi Mikkonen. Code reviews, software inspections, and code walkthroughs: Systematic mapping study of research topics. In Software Quality: Quality Intelligence in Software and Systems Engineering , pages 121–133, 2020. [62] Dong Wang, Yuki Ueda, Raula Gaikovina Kula, Takashi Ishio, and Kenichi Matsumoto. Can we benchmark code review studies? a systematic mapping study of methodology, dataset, and metric. Journal of Systems and Software , 180:111009, 2021. [63] Barbara Kitchenham and Stuart Charters. Guidelines for performing systematic literature reviews in software engineering. 2007. [64] Acm digital library. https://dl.acm.org/. [65] Elsevier sciencedirect. https://www.sciencedirect.com/. [66] Ieee xplore digital library. https://ieeexplore.ieee.org/. [67] Scopus. https://www.scopus.com/. [68] Springer link online library. https://link.springer.com/.
[69] Wiley online library. https://onlinelibrary.wiley.com/. [70] Gali Halevi, Henk Moed, and Judit Bar-Ilan. Suitability of google scholar as a source of scientific information and as a source of data for scientific evaluation: Review of the literature. Journal of Informetrics , 11(3):823–834, 2017. [71] Oussama Ben Sghaier and Houari Sahraoui. A multi-step learning approach to assist code review. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 450–460, 2023. [72] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. Towards automating code review activities. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE , pages 163–174, 2021. [73] Faria Huq, Masum Hasan, Md Mahim Anjum Haque, Sazan Mahbub, Anindya Iqbal, and Toufique Ahmed. Review4repair: Code review aided automatic program repairing. Information and Software Technology , 143:106765, 2022. [74] Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. Coditt5: Pretraining for source code and natural language editing. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022 , pages 22:1–22:12. ACM, 2022. [75] Chanathip Pornprasit and Chakkrit Tantithamthavorn. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology , 175:107523, 2024. [76] Jiawei Lu, Zhijie Tang, and Zhongxin Liu. Improving code refinement for code review via input reconstruction and ensemble learning. In 30th Asia-Pacific Software Engineering Conference, APSEC 2023, Seoul, Republic of Korea, December 4-7, 2023 , pages 161–170. IEEE, 2023. [77] Alexander Froemmgen, Jacob Austin, Peter Choy, Nimesh Ghelani, Lera Kharatyan, Gabriela Surita, Elena Khrapko, Pascal Lamblin, Pierre-Antoine Manzagol, Marcus Revaj, Maxim Tabachnyk, Daniel Tarlow, Kevin Villela, Daniel Zheng, Satish Chandra, and Petros Maniatis. 
Resolving code review comments with machine learning. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice , ICSE-SEIP ’24, page 204–215, 2024. [78] Online appendix. https://github.com/RosaliaTufano/Automating-Code-Review_SLR. [79] Thai Pangsakulyanont, Patanamon Thongtanunam, Daniel Port, and Hajimu Iida. Assessing mcr discussion usefulness using semantic similarity. In 2014 6th International Workshop on Empirical Software Engineering in Practice , pages 49–54. IEEE, 2014. [80] Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. Predicting usefulness of code review comments using textual features and developer experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) , pages 215–226. IEEE, 2017. [81] Masum Hasan, Anindya Iqbal, Mohammad Rafid Ul Islam, AJM Imtiajur Rahman, and Amiangshu Bosu. Using a balanced scorecard to identify opportunities to improve code review effectiveness: An industrial experience report. Empirical Software Engineering , 26:1–34, 2021. [82] Lanxin Yang, Jinwei Xu, Yifan Zhang, He Zhang, and Alberto Bacchelli. Evacrc: Evaluating code review comments. In Satish Chandra, Kelly Blincoe, and Paolo Tonella, editors, Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023 , pages 275–287. ACM, 2023. [83] Shadikur Rahman, Umme Ayman Koana, and Maleknaz Nayebi. Example driven code review explanation. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement , pages 307–312, 2022. [84] Ratnadira Widyasari, Ting Zhang, Abir Bouraffa, Walid Maalej, and David Lo. Explaining explanations: An empirical study of explanations in code reviews. ACM Trans. Softw. Eng. Methodol. , December 2024.
[85] Haytham Hijazi, Joao Duraes, Ricardo Couceiro, Joao Castelhano, Raul Barbosa, Júlio Medeiros, Miguel Castelo-Branco, Paulo De Carvalho, and Henrique Madeira. Quality evaluation of modern code reviews through intelligent biometric program comprehension. IEEE Transactions on Software Engineering , (01):1–1, 2022. [86] Haytham Hijazi, José Cruz, João Castelhano, Ricardo Couceiro, Miguel Castelo-Branco, Paulo de Carvalho, and Henrique Madeira. ireview: an intelligent code review evaluation tool using biofeedback. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE) , pages 476–485. IEEE, 2021. [87] Yida Tao and Sunghun Kim. Partitioning composite code changes to facilitate code review. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories , pages 180–190. IEEE, 2015. [88] Mike Barnett, Christian Bird, João Brunet, and Shuvendu K. Lahiri. Helping developers help themselves: Automatic decomposition of code review changesets. In 37th IEEE/ACM International Conference on Software Engineering, ICSE , pages 134–144, 2015. [89] Min Wang, Zeqi Lin, Yanzhen Zou, and Bing Xie. Cora: Decomposing and describing tangled code changes for reviewer. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 1050–1061. IEEE, 2019. [90] Kim Herzig and Andreas Zeller. The impact of tangled code changes. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13 , pages 121–130, 2013. [91] Yuan Huang, Nan Jia, Xiangping Chen, Kai Hong, and Zibin Zheng. Code review knowledge perception: Fusing multi-features for salient-class location. IEEE Transactions on Software Engineering , 48(5):1463–1479, 2020. [92] Dong Wang, Raula Gaikovina Kula, Takashi Ishio, and Kenichi Matsumoto. Automatic patch linkage detection in code review using textual content and file location features. Information and Software Technology , 139:106637, 2021.
[93] Yang Hong, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Aldeida Aleti. Don’t forget to change these functions! recommending co-changed functions in modern code review. Inf. Softw. Technol. , 176:107547, 2024. [94] Bingting Wu, Bin Liang, and Xiaofang Zhang. Turn tree into graph: Automatic code review via simplified ast driven graph convolutional network. Knowledge-Based Systems , 252:109450, 2022. [95] Bingting Wu and Xiaofang Zhang. Contrastive learning for multi-modal automatic code review. arXiv preprint arXiv:2205.14289 , 2022. [96] Shu-Ting Shi, Ming Li, David Lo, Ferdian Thung, and Xuan Huo. Automatic code review by learning the revision of source code. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence . AAAI Press, 2019. [97] Yuanrui Fan, Xin Xia, David Lo, and Shanping Li. Early prediction of merged code changes to prioritize reviewing tasks. Empirical Software Engineering , 23:3346–3393, 2018. [98] Khairul Islam, Toufique Ahmed, Rifat Shahriyar, Anindya Iqbal, and Gias Uddin. Early prediction for merged vs abandoned code changes in modern code reviews. Information and Software Technology , 142:106756, 2022. [99] Moataz Chouchen, Ali Ouni, and Mohamed Wiem Mkaouer. Multicr: Predicting merged and abandoned code changes in modern code review using multi-objective search. ACM Trans. Softw. Eng. Methodol. , 33(8), 2024. [100] Lanxin Yang, He Zhang, Jinwei Xu, Jun Lyu, Xin Zhou, Dong Shao, Shan Gao, and Alberto Bacchelli. A preliminary investigation on using multi-task learning to predict change performance in code reviews. Empir. Softw. Eng. , 29(6):157, 2024. [101] Moataz Chouchen and Ali Ouni. A multi-objective effort-aware approach for early code review prediction and prioritization. Empir. Softw. Eng. , 29(1):29, 2024. 
[102] Krishna Teja Ayinala, Kwok Sun Cheng, Kwangsung Oh, Teukseob Song, and Myoungkyu Song. Code inspection support for recurring changes with deep learning in evolving software. In 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC) , pages 931–942, 2020. [103] Ruiyin Wen, David Gilbert, Michael G Roche, and Shane McIntosh. Blimp tracer: Integrating build impact analysis with code review. In 2018 IEEE International conference on software maintenance and evolution (ICSME) , pages 685–694. IEEE, 2018. [104] Anderson Uchôa, Caio Barbosa, Daniel Coutinho, Willian Oizumi, Wesley KG Assunçao, Silvia Regina Vergilio, Juliana Alves Pereira, Anderson Oliveira, and Alessandro Garcia. Predicting design impactful changes in modern code review: A large-scale empirical study. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) , pages 471–482. IEEE, 2021. 31 Page 32: Woodstock ’18, June 03–05, 2018, Woodstock, NY Tufano and Bavota [105] Song Wang, Chetan Bansal, and Nachiappan Nagappan. Large-scale intent analysis for identifying large-review-effort code changes. Information and Software Technology , 130:106408, 2021. [106] Guoliang Zhao, Daniel Alencar da Costa, and Ying Zou. Improving the pull requests review process using learning-to-rank algorithms. Empirical Software Engineering , 24:2140–2170, 2019. [107] Jiantao He, Linzhang Wang, and Jianhua Zhao. Supporting automatic code review via design. In 2013 IEEE Seventh International Conference on Software Security and Reliability Companion , pages 211–218. IEEE, 2013. [108] Zhiyuan Chen, Maneesh Mohanavilasam, Young-Woo Kwon, and Myoungkyu Song. Tool support for managing clone refactorings to facilitate code review in evolving software. In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) , volume 1, pages 288–297. IEEE, 2017. [109] Behjat Soltanifar, Atakan Erdem, and Ayse Bener. Predicting defectiveness of software patches. 
In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement , pages 1–10, 2016. [110] Shipra Sharma and Balwinder Sodhi. Using stack overflow content to assist in code review. Software: Practice and Experience , 49(8):1255–1277, 2019. [111] Yang Hong, Chakkrit Kla Tantithamthavorn, and Patanamon Pick Thongtanunam. Where should i look at? recommending lines that reviewers should pay attention to. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 1034–1045. IEEE, 2022. [112] Doriane Olewicki, Sarra Habchi, and Bram Adams. An empirical study on code review activity prediction and its impact in practice. Proc. ACM Softw. Eng. , 1(FSE):2238–2260, 2024. [113] Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, and Egor Bulychev. Style-analyzer: fixing code style inconsistencies with interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) , pages 468–478. IEEE, 2019. [114] Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. Senticr: a customized sentiment analysis tool for code review interactions. In2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 106–111. IEEE, 2017. [115] Carolyn D Egelman, Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. Predicting developers’ negative feelings about code review. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering , pages 174–185, 2020. [116] Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. Automated identification of toxic code reviews using toxicr. ACM Transactions on Software Engineering and Methodology , 2023. [117] Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. Incivility detection in open source code review and issue discussions. J. Syst. Softw. , 209:111935, 2024. 
[118] Jaydeb Sarker, Sayma Sultana, Steven R. Wilson, and Amiangshu Bosu. Toxispanse: An explainable toxicity detection in code review comments. In ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2023, New Orleans, LA, USA, October 26-27, 2023 , pages 1–12. IEEE, 2023. [119] Md Shamimur Rahman, Zadia Codabux, and Chanchal K. Roy. Do words have power? understanding and fostering civility in code review discussion. Proc. ACM Softw. Eng. , 1(FSE):1632–1655, 2024. [120] Anshul Gupta and Neel Sundaresan. Intelligent code reviews using deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18) Deep Learning Day , 2018. [121] Chenkai Guo, Hui Yang, Dengrong Huang, Jianwen Zhang, Naipeng Dong, Jing Xu, and Jingwen Zhu. Review sharing via deep semi-supervised code clone detection. IEEE Access , 8:24948–24965, 2020. [122] Jing Kai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, and Yang Liu. Core: Automating review recommendation for code changes. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 284–295. IEEE, 2020. [123] Ohiduzzaman Shuvo, Parvez Mahbub, and Mohammad Masudur Rahman. Recommending code reviews leveraging code changes with structured information retrieval. In IEEE International Conference on Software Maintenance and Evolution, ICSME 2023, Bogotá, Colombia, October 1-6, 2023 , pages 194–206. IEEE, 2023. [124] Yusuf Kartal, Kaan Akdeniz, and Kemal Ozkan. Automating modern code review processes with code similarity measurement. Inf. Softw. Technol. , 173:107490, 2024. [125] Chenkai Guo, Dengrong Huang, Naipeng Dong, Quanqi Ye, Jing Xu, Yaqing Fan, Hui Yang, and Yifan Xu. Deep review sharing. In Xinyu Wang, David Lo, and Emad Shihab, editors, 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019 , pages 61–72. IEEE, 2019. 
[126] Toshiki Hirao, Shane McIntosh, Akinori Ihara, and Kenichi Matsumoto. The review linkage graph for code review analytics: a recovery approach and empirical study. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages 578–589, 2019. [127] Yuki Ueda, Takashi Ishio, Akinori Ihara, and Kenichi Matsumoto. Mining source code improvement patterns from similar code review works. In 2019 IEEE 13th International Workshop on Software Clones (IWSC) , pages 13–19. IEEE, 2019. [128] Patanamon Thongtanunam, Chanathip Pornprasit, and Chakkrit Tantithamthavorn. Autotransform: Automated code transformation to support modern code review process. In 44th IEEE/ACM International Conference on Software Engineering, ICSE , pages 237–248, 2022. 32 Page 33: Automating Code Review: A Systematic Literature Review Woodstock ’18, June 03–05, 2018, Woodstock, NY [129] Chanathip Pornprasit, Chakkrit Tantithamthavorn, Patanamon Thongtanunam, and Chunyang Chen. D-ACT: towards diff-aware code transforma- tion for code review under a time-wise evaluation. In Tao Zhang, Xin Xia, and Nicole Novielli, editors, IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023 , pages 296–307. IEEE, 2023. [130] Qianhua Shan, David Sukhdeo, Qianying Huang, Seth Rogers, Lawrence Chen, Elise Paradis, Peter C Rigby, and Nachiappan Nagappan. Using nudges to accelerate code reviews at scale. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages 472–482, 2022. [131] Chandra Maddila, Chetan Bansal, and Nachiappan Nagappan. Predicting pull request completion time: a case study on large scale cloud services. InProceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering , pages 874–882, 2019. 
[132] Moataz Chouchen, Ali Ouni, Jefferson Olongo, and Mohamed Wiem Mkaouer. Learning to predict code review completion time in modern code review. Empirical Software Engineering , 2023. [133] Lawrence Chen, Peter C. Rigby, and Nachiappan Nagappan. Understanding why we cannot model how long a code review will take: An industrial case study. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE , page 1314–1319, 2022. [134] Nishrith Saini and Ricardo Britto. Using machine intelligence to prioritise code review requests. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) , pages 11–20. IEEE, 2021. [135] Lanxin Yang, Jinwei Xu, He Zhang, Fanghao Wu, Jun Lyu, Yue Li, and Alberto Bacchelli. GPP: A graph-powered prioritizer for code review requests. In Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, editors, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024 , pages 104–116. ACM, 2024. [136] Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, and Amiangshu Bosu. Towards automated classification of code review feedback to support analytics. In ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2023, New Orleans, LA, USA, October 26-27, 2023 , pages 1–12. IEEE, 2023. [137] Fiorella Zampetti, Saghan Mudbhari, Venera Arnaoudova, Massimiliano Di Penta, Sebastiano Panichella, and Giuliano Antoniol. Using code reviews to automatically configure static analysis tools. Empirical Software Engineering , 27(1):28, 2022. [138] Tukaram B. Muske, Ankit Baid, and Tushar Sanas. Review efforts reduction by partitioning of static analysis warnings. In 13th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2013 , pages 106–115. IEEE Computer Society, 2013. 
[139] Massimiliano Menarini, Yan Yan, and William G Griswold. Semantics-assisted code review: An efficient tool chain and a user study. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) , pages 554–565. IEEE, 2017. [140] Muntazir Fadhel and Emil Sekerinski. Striffs: Architectural component diagrams for code reviews. In 2021 International Conference on Code Quality (ICCQ) , pages 69–78. IEEE, 2021. [141] Rodrigo Brito and Marco Tulio Valente. Raid: Tool support for refactoring-aware code reviews. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC) , pages 265–275. IEEE, 2021. [142] Enrico Fregnan, Josua Fröhlich, Davide Spadini, and Alberto Bacchelli. Graph-based visualization of merge requests for code review. Journal of Systems and Software , 195:111506, 2023. [143] Peter C. Rigby, Daniel M. Germán, Laura L. E. Cowen, and Margaret-Anne D. Storey. Peer review on open-source software projects: Parameters, statistical models, and theory. ACM Trans. Softw. Eng. Methodol. , 23(4):35:1–35:33, 2014. [144] Peter C. Rigby and Christian Bird. Convergent contemporary software peer review practices. In 21st Joint Meeting of the European Software Engineering Conference and the ACM/SIGSOFT Symposium on the Foundations of Software Engineering, ESEC-FSE , pages 202–212, 2013. [145] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 , 2023. [146] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 8696–8708, November 2021. 
[147] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR , abs/1907.11692, 2019. [148] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, and Neel Sundaresan. Generating accurate assert statements for unit test cases using pretrained transformers. In 3rd IEEE/ACM International Conference on Automation of Software Test, AST , pages 54–64, 2022. [149] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art natural language processing. CoRR , abs/1910.03771, 2019. [150] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In 40th Annual Meeting on Association for Computational Linguistics, ACL , pages 311–318, 2002. [151] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out , pages 74–81, 2004. [152] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. CoRR , abs/2009.10297, 2020. [153] Håkan Petersson, Claes Wohlin, Per Runeson, and Martin Höst. Defect content estimation for two reviewers. In 12th International Symposium on Software Reliability Engineering (ISSRE 2001) , pages 340–345. IEEE Computer Society, 2001. 33 Page 34: Woodstock ’18, June 03–05, 2018, Woodstock, NY Tufano and Bavota [154] Yue Yu, Huaimin Wang, Gang Yin, and Charles X Ling. Who should review this pull-request: Reviewer recommendation to expedite crowd collaboration. In 2014 21st Asia-Pacific Software Engineering Conference , volume 1, pages 335–342. IEEE, 2014. [155] Fuxiang Chen, Fatemeh Fard, David Lo, and Timofey Bryksin. 
On the transferability of pre-trained language models for low-resource programming languages. In 30th IEEE/ACM International Conference on Program Comprehension, ICPC , pages 401–412, 2022. [156] Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael Greenberg, Abhinav Jangda, and Arjun Guha. Knowledge transfer from high-resource to low-resource programming languages for code llms. Proceedings of the ACM on Programming Languages , 8(OOPSLA2):677–708, 2024. [157] Tim van Dam, Frank van der Heijden, Philippe de Bekker, Berend Nieuwschepen, Marc Otten, and Maliheh Izadi. Investigating the performance of language models for completing code in functional programming languages: a haskell case study. arXiv preprint arXiv:2403.15185 , 2024. [158] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: the fine print and the benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence , AAAI’16, page 308–314, 2016. [159] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , ICSE ’24, 2024. [160] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision , pages 291–326. Chapman and Hall/CRC, 2022. [161] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301 , 2023. 
A APPENDIX

Table 11. Venue names

Acronym      Venue Name
ASE          International Conference on Automated Software Engineering
COMPSAC      Annual Computer Software and Applications Conference
EASE         International Conference on Evaluation and Assessment in Software Engineering
EMSE         Empirical Software Engineering
ESEC/FSE     European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ESEM         International Symposium on Empirical Software Engineering and Measurement
ICSE         International Conference on Software Engineering
ICSE-SEIP    International Conference on Software Engineering: Software Engineering in Practice
ICSME        International Conference on Software Maintenance and Evolution
IEEE Access  IEEE Access
ISSRE        International Symposium on Software Reliability Engineering
IST          Journal of Information and Software Technology
JSS          Journal of Systems and Software
MSR          International Conference on Mining Software Repositories
PACMSE       Proceedings of the ACM on Software Engineering
PROMISE      International Conference on Predictive Models and Data Analytics in Software Engineering
SANER        International Conference on Software Analysis, Evolution and Reengineering
SEKE         International Conference on Software Engineering and Knowledge Engineering
TOSEM        Transactions on Software Engineering and Methodology
TSE          Transactions on Software Engineering