arxiv

Paper 2412.11637v1

On Crowdsourcing Task Design for Discourse Relation Annotation

Authors: Frances Yung, Vera Demberg

Published: 2024-12-16

Abstract:

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like because or then are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource English implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus -- initially annotated with the free-choice method -- using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators' abilities to interpret and produce discourse relations.

Affiliation: Saarland University, Saarbrücken, Germany ({frances, vera}@lst.uni-saarland.de)

1 Introduction

Disagreement in linguistic annotation is increasingly seen not as noise but as a valuable signal capturing diverse perspectives in language interpretation (Dumitrache et al., 2021; Uma et al., 2021; Frenda et al., 2024). A single gold label, traditionally provided by one or two trained annotators, often fails to capture the full range of interpretations, which may arise from linguistic ambiguity, contextual factors, or annotators' cultural and experiential backgrounds. Crowdsourcing offers a scalable solution for gathering these alternative interpretations.
To guide untrained crowd workers in reliably annotating abstract linguistic phenomena, intuitive and carefully designed workflows are essential. Task design has been identified as one of the factors behind annotation disagreement and bias (Pavlick and Kwiatkowski, 2019; Jiang and de Marneffe, 2022), and it can even affect annotation quality (Shaw et al., 2011; Gadiraju et al., 2017; Gururangan et al., 2018). For example, Pyatkin et al. (2023) investigate the bias introduced by task design when workers annotate implicit discourse relation (IDR) senses, which often have multiple interpretations. They compared two methods: one based on the insertion of discourse connectives (DCs), e.g. "John fell down because he tripped", and the other on paraphrasing discourse arguments as question-answer (QA) pairs, e.g. "Q: Why did John fall down? A: He tripped." While annotations from both methods were found to align closely, subtle biases in annotation preferences were found in both, owing to the limitations of using natural language to annotate specialized linguistic concepts (not all senses can be easily expressed by a connective or by a question).

Building on this line of work, we explore the potential method bias of two IDR annotation tasks for English based on DC insertion. These methods differ solely in whether annotators select from predefined options (Rohde et al., 2016; Yung et al., 2024) or freely type in their choices (Yung et al., 2019). The free-choice method was employed to annotate 6,500 English IDRs in the DiscoGeM 1.0 corpus (Scholman et al., 2022), whereas the DiscoGeM 2.0 corpus, comprising multilingual translations or original texts from DiscoGeM 1.0, was annotated using the forced-choice method. An initial comparison of the statistics of the two corpora revealed characteristics unique to the English annotations, such as a higher proportion of CONJUNCTION relations.
Our findings indicate that the free-choice approach achieves higher agreement among annotators, while the forced-choice approach is more effective at capturing a diverse range of alternative interpretations. Further analysis reveals that the free-choice approach favours frequent, intuitive senses, whereas the options provided in the forced-choice approach serve as prompts that help workers identify rare, fine-grained senses. Moreover, the method bias interacts with individual differences in discourse processing: workers who could identify a wider range of senses in one approach also tended to label more distinct senses in the other. These results highlight the nuanced impact of task design on annotation outcomes.

The re-annotated resource is freely downloadable (footnote 1) alongside the original DiscoGeM 1.0. It provides an interesting dataset for the study of perspectivism and design in annotation, as well as a rich collection of rare IDR examples, helping to ease the major data bottleneck for current IDR recognition models.

2 Related work

Annotation of IDRs requires integrating subtle semantic cues with background knowledge and mapping these to abstract labels, a task that is challenging even for trained annotators (Hoek and Scholman, 2017). Previous attempts to create datasets by crowdsourcing annotations often compromise on label variety or annotation quality (Kawahara et al., 2014; Kishimoto et al., 2018).

Inspired by the Penn Discourse Treebank's (PDTB) lexicalized approach to annotating IDRs (Prasad et al., 2019), prior work has proposed crowdsourcing IDRs via DC insertion. For example, to label the RESULT relation between the arguments "John missed the bus" and "He was late to work", the DC "therefore" could be inserted. In the initial proposal, crowd workers selected a DC from a fixed list, each corresponding to a unique IDR sense (Scholman and Demberg, 2017).
While achieving high agreement with expert annotations, the method was tested on only 6 IDR senses to avoid overwhelming workers with too many options. Choosing a DC is often context-dependent; for example, while "although" and "even though" are nearly interchangeable, the choice between "also" and "furthermore" (both indicating CONJUNCTION) may depend on context. Workers might reject an appropriate sense if a DC feels contextually awkward.

[Footnote 1: https://github.com/merelscholman/DiscoGeM]

To handle a broader range of IDRs, Yung et al. (2019) proposed a two-step approach: first, workers freely type a DC that fits between two arguments; second, they select from a list of unambiguous DCs corresponding to their free choice. For instance, if they type "while" in the first step, they should choose between "at the same time" and "in contrast" in the second step, which are mapped to the relations SYNCHRONOUS and CONTRAST respectively. This method was used to create the DiscoGeM 1.0 corpus, which contains 6,500 English IDRs, each annotated by 10 workers (Scholman et al., 2022).

DiscoGeM 2.0, which extends the annotations to German, French, and Czech (Yung et al., 2024), instead adopted the one-step forced-choice method: workers directly chose from 28 DC choices, which were grouped by semantics and shuffled per worker to facilitate navigation and avoid positional bias. The free- and forced-choice methods were reported to yield similar annotations, but the comparisons were based on a limited subset of items (234 in Yung et al. (2019) and 18 in Yung et al. (2024)), with a restricted range of IDR senses.

Using a different crowd-annotation method, Pyatkin et al. (2020) crowdsourced discourse relations by instructing workers to create QA pairs from the provided text, e.g. "Q: What is the reason John was late? A: He missed the bus." Comparisons of QA-based and free-choice DC insertion methods show that both exhibit biases toward specific sense categories.
In contrast to the common attribution of method artifacts to degraded data quality (Gururangan et al., 2018; Zhu and Rzeszotarski, 2024), training on the complementary data collected by both methods was found to enhance the performance of IDR identification models (Pyatkin et al., 2023).

3 Annotation experiment

We adopt the forced-choice approach to re-annotate the DiscoGeM 1.0 corpus, which was originally annotated using the free-choice approach. For this, an annotation interface was implemented based on the description of DiscoGeM 2.0 (Yung et al., 2024). One representative DC was selected for each of the 28 relations to be annotated. The selection was primarily based on the disambiguating DCs from the second step of the free-choice method (footnote 2), while ensuring they were sufficiently frequent and not highly context-dependent. The complete list is shown in Table 2 in the Appendix.

[Footnote 2: The DC lexicon and per-worker annotations are available together with the corpus.]

Following the procedure of DiscoGeM 1.0, native English-speaking crowd workers were recruited via the Prolific platform. Based on the anonymous Prolific worker IDs, we invited the 199 workers who contributed to DiscoGeM 1.0 to participate in the annotation task again. We assumed that they would not recall the texts they annotated three years ago, and including them allows direct comparison of annotations from the same workers across both methods. Of these, 91 workers took part again, and 73 additional workers were recruited through a selection task.

Out of the 6,505 items in DiscoGeM 1.0, 16 duplicates were identified and removed. The remaining items were divided into batches of 20–25, with each batch assigned to at least 10 workers. The workers were awarded £1.8–£2.2 per batch.

The quality of DiscoGeM 1.0 annotations was primarily controlled by a screening task that selected candidates achieving at least 50% agreement with gold labels.
During the data collection phase, annotation quality was monitored twice to identify and remove poorly performing annotators, while retaining their earlier annotations (Scholman et al., 2022). Similarly, we used an initial screening task to ensure annotation quality. However, to maximize the number of annotators participating in both tasks, we did not screen those who had contributed to DiscoGeM 1.0, nor did we conduct additional screening during the annotation process.

We compare the newly collected data against the original DiscoGeM 1.0. In addition to analyzing label distributions from 10 workers per item, we compared aggregated annotations to highlight the differences. The annotations were aggregated using the "Worker Agreement with Aggregate" (Wawa) algorithm, which weights each worker's votes based on their overall agreement with the majority label (Ustalov et al., 2021).

4 Results

Table 1 presents the agreement between the annotations obtained by the two methods. We computed the average Jensen-Shannon divergence (JSD) between the label distributions of each item, as well as hard and soft agreement rates. Hard agreement measures matches between the single aggregated annotations, while soft agreement counts as a match any overlap between labels that each hold over 20% of an item's distribution (Pyatkin et al., 2023). We also calculated the soft κ score, an inter-annotator agreement metric that accounts for the increased chance agreement in multi-label predictions (Marchal et al., 2022).

Inter-method comparison           free vs forced
JSD (full dist.)                  .527
Hard agreement (single label)     .425
Soft agreement (multi-labels)     .708
Soft κ (multi-labels)             .663

Intra-method comparison           free      forced
Entropy                           0.353     0.460
Agreement (max. label dist)       0.508     0.404
Per-item unique label count       4.309     6.275

Table 1: Annotation agreement

The inter-method agreement between single aggregated annotations is moderate, comparable to the accuracy of state-of-the-art IDR classification models (Costa and Kosseim, 2024; Zeng et al., 2024), but the agreement is substantially higher when multiple annotations are considered. This demonstrates that both methods are capable of annotating the same types of relations, which often co-occur with other relations.

The bottom half of Table 1 compares the agreement among the 10 annotations per item in both methods. The forced-choice method shows higher average entropy in the per-item label distributions, indicating greater annotation uncertainty. In addition, the forced-choice approach yields lower average per-item agreement (i.e., the proportion of the majority label) and a higher average number of unique annotations per item. These results all indicate lower annotator agreement in the forced-choice approach.

Figure 1 illustrates the overall distribution of the unaggregated annotations, computed as the sum of the normalized per-item distributions, since not all items have exactly 10 annotations. The free-choice approach clearly converges on a narrower set of labels, while the forced-choice approach spans a wider range. Notably, RESULT and CONJUNCTION are selected twice as often in the free-choice method. The trend is similar when focusing on the most agreed labels.

[Figure 1: Distribution of the unaggregated annotations. Bar chart of annotation counts (axis 0 to 1400) per level-3 sense label, from synchronous to norel, for free choice and forced choice.]

Figure 2 shows the alignment of the aggregated annotations from both methods. The annotations are grouped at level-2 granularity according to the PDTB sense hierarchy, e.g. ARG1-AS-DETAIL and ARG2-AS-DETAIL are grouped as LEVEL-OF-DETAIL. Even though the darkest diagonal line in the confusion matrix indicates substantial agreement between annotations from both methods, many items labelled with CONJUNCTION, CAUSE, and ARG2-AS-DETAIL in the free-choice approach are now assigned to a range of other relations. While the aggregated annotations from the forced-choice approach cover all level-2 senses defined in the framework, half of these senses never appear in the aggregated annotations from the free-choice method.

Next, we directly compare the annotations of the same workers. In total, we identified 3,223 annotations per method that were produced by the same worker on the same item (spanning 2,542 unique items and 91 workers). The comparison of these annotations shows the same tendency as the re-annotation of the whole corpus, as shown in Figure 5 in the Appendix: common relations like CONJUNCTION and RESULT were annotated as other, rarer relations in the forced-choice approach.

Figure 3 plots the number of unique relations identified by workers who participated in both methods. To ensure comparability, results from workers who annotated fewer than 50 items in either method, or who annotated more than 3 times as many items in one method as in the other, were excluded. This results in 60 workers, who annotated on average 621 and 525 items in the free- and forced-choice methods respectively.
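The per-item measures in Table 1 and the aggregation step described in Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the (item, worker, label) triple format, the use of log base 2, and the alphabetical tie-breaking are all choices made here. The `wawa` function follows the textual description (majority vote, per-worker agreement with that vote, then a skill-weighted re-vote) rather than calling any toolkit. Values computed this way will not necessarily match Table 1, since the normalisation used there is not stated.

```python
import math
from collections import Counter, defaultdict


def distribution(labels):
    """Turn a list of labels into a {label: share of votes} dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def entropy(dist):
    """Shannon entropy in bits (log base 2 is an assumption)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits between two label distributions."""
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}

    def kl(a):
        return sum(v * math.log2(v / mix[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def soft_match(p, q, threshold=0.2):
    """True if some label holds more than `threshold` of the votes in both."""
    return any(v > threshold and q.get(k, 0.0) > threshold for k, v in p.items())


def top_label(weights):
    """Label with the largest weight; ties broken alphabetically (assumption)."""
    return min(weights.items(), key=lambda kv: (-kv[1], kv[0]))[0]


def wawa(votes):
    """Skill-weighted vote over (item, worker, label) triples."""
    by_item = defaultdict(list)
    for item, worker, label in votes:
        by_item[item].append((worker, label))
    # Step 1: plain majority vote per item.
    majority = {i: top_label(Counter(l for _, l in wl)) for i, wl in by_item.items()}
    # Step 2: a worker's skill is the share of their votes matching the majority.
    hits, totals = Counter(), Counter()
    for item, worker, label in votes:
        totals[worker] += 1
        hits[worker] += int(label == majority[item])
    skill = {w: hits[w] / totals[w] for w in totals}
    # Step 3: vote again, weighting each vote by its worker's skill.
    aggregated = {}
    for item, wl in by_item.items():
        weights = defaultdict(float)
        for worker, label in wl:
            weights[label] += skill[worker]
        aggregated[item] = top_label(weights)
    return aggregated


# Made-up votes for one item under each method (not corpus data).
free = distribution(["conjunction"] * 6 + ["result"] * 3 + ["precedence"])
forced = distribution(["precedence"] * 4 + ["conjunction"] * 3
                      + ["result"] * 2 + ["synchronous"])
print("entropy:", entropy(free), entropy(forced))
print("majority share:", max(free.values()), max(forced.values()))
print("unique labels:", len(free), len(forced))
print("JSD:", jsd(free, forced))
print("soft match:", soft_match(free, forced))
```

In this toy item the two majority labels differ (conjunction versus precedence), so it would count as a hard disagreement, while the shared conjunction label sits above the 20% threshold in both distributions and therefore gives a soft match.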
[Figure 2: Confusion matrix of the aggregated annotations from both methods, with labels merged at level-2 granularity. Axes: forced choice (x) and free choice (y), each over the level-2 labels from synchronous to norel.]

Figure 3 shows that all workers identified a broader range of IDR senses using the forced-choice method, as indicated by all data points falling below the diagonal line. Furthermore, workers who could identify more sense types with the free-choice approach also identified more sense types in the forced-choice approach. This suggests individual differences in sensitivity to the subtle contrasts between fine-grained discourse relations, with the forced-choice method further expanding the range of relations these workers could identify by presenting all possible options.

[Figure 3: Total number of unique relations annotated by the same workers on the same set of items. Axes: forced choice label range (x) and free choice label range (y), both from 5 to 25.]

5 Discussion and conclusion

We examined the impact of two similar interfaces used to crowdsource IDR annotations. Using the free-choice approach, workers tend to select common IDR labels with higher inter-annotator agreement, while the forced-choice approach encouraged a larger variety of relations, including rare ones. Notably, both methods produce valid annotations, as evidenced by the high soft agreement.
Frequent senses can often be inferred alongside other senses, such as the CONJUNCTION sense in the examples in Figure 4. In these examples, the English forced-choice annotations align with those of the other languages, whereas the original free-choice annotations of DiscoGeM 1.0 label them as CONJUNCTION.

High inter-annotator agreement is often linked to higher data quality. However, for inherently ambiguous tasks like IDR identification, we showed that higher-agreement annotations that converge on common labels are not always superior. Recognizing the method bias enables tailoring the approach to the annotation goal, whether that is to achieve consensus on a single label or to capture diverse perspectives. Since current IDR classification models often struggle with rare labels, datasets with more label variety may be more valuable. Still, distinguishing genuine perspectives from annotation errors is challenging. Minimal data cleaning, such as removing labels with very few votes, could be applied.

For corpus analysis, data should be collected consistently using the same method. Initial analysis reveals significant differences between the inter-annotator agreement of the English annotations in DiscoGeM 1.0 and that of the multilingual annotations in DiscoGeM 2.0, whereas the re-annotated data in this study aligns more closely with the other languages (e.g. averaged per-item agreement = .508/.404 for EN free-/forced-choice vs. .410–.439 for DE, FR, CS forced-choice), indicating the influence of the method bias. Our next step is to analyze the cross-lingual differences based on annotations collected with the same method.

Acknowledgements

This project is supported by the German Research Foundation (DFG) under Grant SFB 1102 ("Information Density and Linguistic Encoding", Project-ID 232722074).

1) Arg1: It was because of this tiny piece of information that Ford Prefect was not now a whiff of hydrogen, ozone and carbon monoxide. He heard a slight groan.
Arg2: By the light of the match he saw a heavy shape moving slightly on the floor. Quickly he shook the match out, reached in his pocket, found what he was looking for and took it out.
Aggregated annotation = PRECEDENCE (English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION (English free-choice)

2) Arg1: In yesterday's debate in the European Parliament some Members of this Parliament expressed worry that we were interfering in the internal affairs of a Member State.
Arg2: Such a concern is misplaced. The European Parliament has never been slow to comment on developments in Member States with which they disagree.
Aggregated annotation = REASON (English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION (English free-choice)

3) Arg1: With a spring Gollum got up and started shambling off at a great pace. Bilbo hurried after him, still cautiously, though his chief fear now was of tripping on another snag and falling with a noise. His head was in a whirl of hope and wonder.
Arg2: It seemed that the ring he had was a magic ring: it made you invisible!
Aggregated annotation = SYNCHRONOUS (English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION (English free-choice)

Figure 4: Examples taken from DiscoGeM where the annotations by the forced- and free-choice approaches are alternative interpretations. The English forced-choice annotations come from the current study and those from the other languages come from DiscoGeM 2.0. The English free-choice annotations come from DiscoGeM 1.0.

References

Nelson Filipe Costa and Leila Kosseim. 2024. A multi-task and multi-label classification model for implicit discourse relation recognition. arXiv preprint arXiv:2408.08971.

Anca Dumitrache, Oana Inel, Benjamin Timmermans, Carlos Ortiz, Robert-Jan Sips, Lora Aroyo, and Chris Welty. 2021. Empirical methodology for crowdsourcing ground truth. Semantic Web, 12(3):403–421.

Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, and Davide Bernardi. 2024. Perspectivist approaches to natural language processing: a survey. Language Resources and Evaluation, pages 1–28.

Ujwal Gadiraju, Jie Yang, and Alessandro Bozzon. 2017. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 5–14.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Jet Hoek and Merel Scholman. 2017. Evaluating discourse annotation: Some recent insights and new approaches. In Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (isa-13).

Nanjiang Jiang and Marie-Catherine de Marneffe. 2022. Investigating reasons for disagreement in natural language inference. Transactions of the Association for Computational Linguistics, 10:1357–1374.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 269–278, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Yudai Kishimoto, Shinnosuke Sawada, Yugo Murawaki, Daisuke Kawahara, and Sadao Kurohashi. 2018. Improving crowdsourcing-based annotation of Japanese discourse relations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Marian Marchal, Merel Scholman, Frances Yung, and Vera Demberg. 2022. Establishing annotation quality in multi-label annotations. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3659–3668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Rashmi Prasad, Bonnie Webber, Alan Lee, and Aravind Joshi. 2019. Penn Discourse Treebank Version 3.0.

Valentina Pyatkin, Ayal Klein, Reut Tsarfaty, and Ido Dagan. 2020. QADiscourse - Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2804–2819, Online. Association for Computational Linguistics.

Valentina Pyatkin, Frances Yung, Merel C. J. Scholman, Reut Tsarfaty, Ido Dagan, and Vera Demberg. 2023. Design choices for crowdsourcing implicit discourse relations: Revealing the biases introduced by task design. Transactions of the Association for Computational Linguistics, 11:1014–1032.

Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher N. L. Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 49–58, Berlin, Germany. Association for Computational Linguistics.

Merel Scholman and Vera Demberg. 2017. Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task. In Proceedings of the 11th Linguistic Annotation Workshop, pages 24–33, Valencia, Spain. Association for Computational Linguistics.

Merel C. J. Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022. DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC'22), Marseille, France. European Language Resources Association (ELRA).

Aaron D Shaw, John J Horton, and Daniel L Chen. 2011. Designing incentives for inexpert human raters. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, pages 275–284.

Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470.

Dmitry Ustalov, Nikita Pavlichenko, Vladimir Losev, Iulian Giliazev, and Evgeny Tulin. 2021. A general-purpose crowdsourcing computational quality control toolkit for Python. In The Ninth AAAI Conference on Human Computation and Crowdsourcing: Works-in-Progress and Demonstration Track (HCOMP 2021).

Frances Yung, Vera Demberg, and Merel Scholman. 2019. Crowdsourcing discourse relation annotations by a two-step connective insertion task. In Proceedings of the 13th Linguistic Annotation Workshop, pages 16–25, Florence, Italy. Association for Computational Linguistics.

Frances Yung, Merel Scholman, Sarka Zikanova, and Vera Demberg. 2024. DiscoGeM 2.0: A parallel corpus of English, German, French and Czech implicit discourse relations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4940–4956, Torino, Italia. ELRA and ICCL.

Lei Zeng, Ruifang He, Haowen Sun, Jing Xu, Chang Liu, and Bo Wang. 2024. Global and local hierarchical prompt tuning framework for multi-level implicit discourse relation recognition. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7760–7773, Torino, Italia. ELRA and ICCL.

Shengqi Zhu and Jeffrey Rzeszotarski. 2024. "Get their hands dirty, not mine": On researcher-annotator collaboration and the agency of annotators. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8773–8782, Bangkok, Thailand. Association for Computational Linguistics.

A Appendix

Level-2.level-3 IDR sense label          DC

Temporal
SYNCHRONOUS.SYNCHRONOUS                  at the same time
ASYNCHRONOUS.PRECEDENCE                  then
ASYNCHRONOUS.SUCCESSION                  after

Contingency
CAUSE.REASON                             because
CAUSE.RESULT                             as a result
PURPOSE.ARG1-AS-GOAL                     for that purpose
PURPOSE.ARG2-AS-GOAL                     so that
CONDITION.ARG1-AS-COND                   in that case
CONDITION.ARG1-AS-NEGCOND                if not
CONDITION.ARG2-AS-COND                   if
CONDITION.ARG2-AS-NEGCOND                unless

Comparison
CONCESSION.ARG1-AS-DENIER                even though
CONCESSION.ARG2-AS-DENIER                nonetheless
CONTRAST.CONTRAST                        on the other hand
SIMILARITY.SIMILARITY                    similarly

Expansion
EQUIVALENCE.EQUIVALENCE                  in other words
INSTANTIATION.ARG1-AS-INSTANCE           this illustrates that
INSTANTIATION.ARG2-AS-INSTANCE           for example
LEVEL-OF-DETAIL.ARG1-AS-DETAIL           in short
LEVEL-OF-DETAIL.ARG2-AS-DETAIL           in more detail
CONJUNCTION.CONJUNCTION                  also
DISJUNCTION.DISJUNCTION                  or
EXCEPTION.ARG1-AS-EXCPT                  other than that
EXCEPTION.ARG2-AS-EXCPT                  an exception is that
MANNER.ARG1-AS-MANNER                    thereby
MANNER.ARG2-AS-MANNER                    as if
SUBSTITUTION.ARG1-AS-SUBST               rather than
SUBSTITUTION.ARG2-AS-SUBST               instead

NOREL (no direct relation)

Table 2: English DC choices used in the forced-choice DC insertion method

[Figure 5: Comparison between 3233 annotations by the same workers on the same items using both methods. Confusion matrix with axes forced choice (x) and free choice (y), each over the level-3 sense labels from synchronous to norel.]
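Table 2 is in effect a lookup table from sense labels to connectives, and the level-2 grouping used for Figure 2 amounts to dropping the level-3 part of a label. A minimal sketch follows, assuming lowercase `level2.level3` string keys; this encoding and the names `DC_FOR_SENSE`, `SENSE_FOR_DC`, and `level2` are hypothetical choices made here, not the format of the corpus files.

```python
# Sense label -> representative connective, transcribed from Table 2.
# The lowercase "level2.level3" key format is an assumption for illustration.
DC_FOR_SENSE = {
    "synchronous.synchronous": "at the same time",
    "asynchronous.precedence": "then",
    "asynchronous.succession": "after",
    "cause.reason": "because",
    "cause.result": "as a result",
    "purpose.arg1-as-goal": "for that purpose",
    "purpose.arg2-as-goal": "so that",
    "condition.arg1-as-cond": "in that case",
    "condition.arg1-as-negcond": "if not",
    "condition.arg2-as-cond": "if",
    "condition.arg2-as-negcond": "unless",
    "concession.arg1-as-denier": "even though",
    "concession.arg2-as-denier": "nonetheless",
    "contrast.contrast": "on the other hand",
    "similarity.similarity": "similarly",
    "equivalence.equivalence": "in other words",
    "instantiation.arg1-as-instance": "this illustrates that",
    "instantiation.arg2-as-instance": "for example",
    "level-of-detail.arg1-as-detail": "in short",
    "level-of-detail.arg2-as-detail": "in more detail",
    "conjunction.conjunction": "also",
    "disjunction.disjunction": "or",
    "exception.arg1-as-excpt": "other than that",
    "exception.arg2-as-excpt": "an exception is that",
    "manner.arg1-as-manner": "thereby",
    "manner.arg2-as-manner": "as if",
    "substitution.arg1-as-subst": "rather than",
    "substitution.arg2-as-subst": "instead",
}

# Reverse lookup: the connective a worker picked -> the sense it stands for.
SENSE_FOR_DC = {dc: sense for sense, dc in DC_FOR_SENSE.items()}


def level2(sense):
    """Collapse a 'level2.level3' label to its level-2 part."""
    return sense.split(".", 1)[0]


print(SENSE_FOR_DC["because"])
print(level2("level-of-detail.arg1-as-detail"))
print(len(DC_FOR_SENSE), len({level2(s) for s in DC_FOR_SENSE}))
```

Because every connective in the table maps to exactly one sense, the reverse dictionary loses no entries; NOREL has no connective and is left out of the mapping.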

---