On Crowdsourcing Task Design for Discourse Relation Annotation
Frances Yung and Vera Demberg
Saarland University, Saarbrücken, Germany
{frances, vera}@lst.uni-saarland.de
Abstract
Interpreting implicit discourse relations in-
volves complex reasoning, requiring the in-
tegration of semantic cues with background
knowledge, as overt connectives like because
or then are absent. These relations often allow
multiple interpretations, best represented as dis-
tributions. In this study, we compare two estab-
lished methods that crowdsource English im-
plicit discourse relation annotation by connec-
tive insertion: a free-choice approach, which
allows annotators to select any suitable connec-
tive, and a forced-choice approach, which asks
them to select among a set of predefined op-
tions. Specifically, we re-annotate the whole
DiscoGeM 1.0 corpus - initially annotated with
the free-choice method - using the forced-
choice approach. The free-choice approach
allows for flexible and intuitive insertion of var-
ious connectives, which are context-dependent.
Comparison among over 130,000 annotations,
however, shows that the free-choice strategy
produces less diverse annotations, often con-
verging on common labels. Analysis of the
results reveals the interplay between task de-
sign and the annotators’ abilities to interpret
and produce discourse relations.
1 Introduction
Disagreement in linguistic annotation is increas-
ingly seen not as noise but as a valuable signal
capturing diverse perspectives in language interpre-
tation (Dumitrache et al., 2021; Uma et al., 2021;
Frenda et al., 2024). A single gold label, tradi-
tionally provided by one or two trained annotators,
often fails to capture the full range of interpreta-
tions, which may arise from linguistic ambiguity,
contextual factors, or annotators’ cultural and ex-
periential backgrounds. Crowdsourcing offers a
scalable solution for gathering these alternative in-
terpretations.
To guide untrained crowd workers in reliably
annotating abstract linguistic phenomena, intuitive
and carefully designed workflows are essential.
Task design has been identified as one of the factors
behind annotation disagreement and bias (Pavlick
and Kwiatkowski, 2019; Jiang and de Marneffe,
2022), and can even impact annotation quality
(Shaw et al., 2011; Gadiraju et al., 2017; Guru-
rangan et al., 2018). For example, Pyatkin et al.
(2023) investigated the bias that task design
introduces when guiding workers to annotate
implicit discourse relation (IDR) senses, which
often have multiple interpretations. They compared
two methods: one based on the insertion of
discourse connectives (DCs), e.g. John fell down
because he tripped, and the other on paraphrasing
discourse arguments as question-answer (QA)
pairs, e.g. Q: Why did John fall down? A: He
tripped. While annotations from both methods
were found to align closely, subtle biases in
annotation preference were found in both methods
due to the limitations of using natural language
to annotate specialized linguistic concepts (not
all senses can be easily expressed by a connective
or by a question).
Building on this line of work, we explore the po-
tential method bias of two IDR annotation tasks for
English based on DC insertion. These methods dif-
fer solely in whether annotators select from prede-
fined options (Rohde et al., 2016; Yung et al., 2024)
or freely type in their choices (Yung et al., 2019).
The free-choice method was employed to annotate
6,500 English IDRs in the DiscoGeM 1.0 corpus
(Scholman et al., 2022), whereas the DiscoGeM 2.0
corpus, comprising multi-lingual translations or
original texts from DiscoGeM 1.0, was annotated
using the forced-choice method. An initial com-
parison of the statistics of the two corpora revealed
characteristics unique to the English annotations,
such as a higher proportion of CONJUNCTION rela-
tions.
arXiv:2412.11637v1 [cs.CL] 16 Dec 2024
Our findings indicate that the free-choice approach
achieves higher agreement among annotators, while
the forced-choice approach is more effective at
capturing a diverse range of alternative
interpretations. Further analysis reveals that the
free-choice approach favours frequent, intuitive
senses, whereas the provided options in
the forced-choice approach serve as prompts for
the workers to identify rare, fine-grained senses.
Moreover, the method bias interacts with individ-
ual differences in discourse processing: workers
who could identify a wider range of senses in one
approach also tended to label more different senses
in the other approach. These results highlight the
nuanced impact of task design on annotation out-
comes.
The re-annotated resource is freely downloadable¹
alongside the original DiscoGeM 1.0. It provides
an interesting dataset for the study of
perspectivism and design in annotation, as well as
a rich collection of rare IDR examples, helping to
alleviate the major data bottleneck for current
IDR recognition models.
2 Related work
Annotation of IDR requires integrating subtle se-
mantic cues with background knowledge and map-
ping these to abstract labels – a task that is challeng-
ing even for trained annotators (Hoek and Schol-
man, 2017). Previous attempts to create datasets
by crowdsourcing annotations often compromise
on label variety or annotation quality (Kawahara
et al., 2014; Kishimoto et al., 2018).
Inspired by the Penn Discourse Treebank’s
(PDTB) lexicalized approach to annotate IDRs
(Prasad et al., 2019), prior work has proposed
crowdsourcing IDRs via DC insertion. For example,
to label the REASON relation between the
arguments "John missed the bus" and "He was late
to work.", the DC "therefore" could be inserted. In
the initial proposal, crowd workers selected a DC
from a fixed list, each corresponding to a unique
IDR sense (Scholman and Demberg, 2017). While
achieving high agreement with expert annotations,
the method was tested on only 6 IDR senses to
avoid overwhelming workers with too many options.
Choosing a DC is often context-dependent; for
example, while "although" and "even though" are
nearly interchangeable, "also" versus "furthermore"
(both indicating CONJUNCTION) may depend on
context. Workers might reject an appropriate sense
if a DC feels contextually awkward.
¹ https://github.com/merelscholman/DiscoGeM
To handle a broader range of IDRs, Yung et al.
(2019) proposed a two-step approach: first, workers
freely type a DC that fits between two arguments;
second, they select from a list of unambiguous
DCs corresponding to their free choice. For
instance, if they type "while" in the first step, they
should choose between "at the same time" and "in
contrast" in the second step, which are mapped
to the relations SYNCHRONY and CONTRAST
respectively. This method was used to create the
DiscoGeM 1.0 corpus, which contains 6,500 English
IDRs, each annotated by 10 workers (Scholman
et al., 2022). Nonetheless, DiscoGeM 2.0,
which extends the annotations to German, French,
and Czech (Yung et al., 2024), adopted the one-step
forced-choice method: workers directly chose from
28 DC choices, which were grouped by semantics
and shuffled per worker to facilitate navigation and
avoid positional bias. The free- and forced-choice
methods were reported to yield similar annotations,
but the comparisons were based on a limited subset
of items (234 in Yung et al. (2019) and 18 in Yung
et al. (2024)), with a restricted range of IDR senses.
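The two-step free-choice procedure described above can be pictured as a lookup from the typed connective to its disambiguating options. The following is a minimal sketch only: the lexicon holds just the "while" example given in the text, and the names DISAMBIGUATION_LEXICON, step_two_options and resolve_sense are hypothetical, not part of the DiscoGeM tooling.

```python
# Hypothetical lexicon: typed (possibly ambiguous) connective ->
# {unambiguous DC offered in step two: sense label}.
# Only the "while" entry is taken from the text; a real lexicon
# would contain many more connectives.
DISAMBIGUATION_LEXICON = {
    "while": {
        "at the same time": "SYNCHRONY",
        "in contrast": "CONTRAST",
    },
}


def step_two_options(typed_dc):
    """Return the unambiguous DCs shown after a worker types `typed_dc`."""
    entry = DISAMBIGUATION_LEXICON.get(typed_dc.strip().lower())
    return sorted(entry) if entry else []


def resolve_sense(typed_dc, chosen_dc):
    """Map the second-step choice to a sense label, or None if unknown."""
    entry = DISAMBIGUATION_LEXICON.get(typed_dc.strip().lower(), {})
    return entry.get(chosen_dc)
```

In the forced-choice variant the first step disappears: the worker picks directly from one representative DC per sense.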
Using a different crowd-annotation method, Py-
atkin et al. (2020) crowdsourced discourse relations
by instructing workers to create QA pairs from the
provided text, e.g., "Q: What is the reason John
was late? A: He missed the bus." Comparisons of
QA-based and free-choice DC insertion methods
show that both exhibit biases toward specific sense
categories. In contrast to common attribution of
method artifacts to degraded data quality (Guru-
rangan et al., 2018; Zhu and Rzeszotarski, 2024),
it was found that training on the complementary
data collected by both methods enhanced the per-
formance of IDR identification models (Pyatkin
et al., 2023).
3 Annotation experiment
We adopt the forced-choice approach to re-annotate
the DiscoGeM 1.0 corpus, which was originally an-
notated using the free-choice approach. For this, an
annotation interface was implemented based on the
description of DiscoGeM 2.0 (Yung et al., 2024).
One representative DC was selected for each of
the 28 relations to be annotated. The selection
was primarily based on the disambiguating DCs
from the second step of the free-choice method,²
while ensuring they were sufficiently frequent and
not highly context-dependent. The complete list is
shown in Table 2 in the Appendix.
² The DC lexicon and per-worker annotations are
available together with the corpus.
Following the procedure of DiscoGeM 1.0, na-
tive English-speaking crowd workers were re-
cruited via the Prolific platform. Based on the
anonymous Prolific worker IDs, we invited the 199
workers who contributed to DiscoGeM 1.0 to
participate in the annotation task again. We assumed
that they would not recall the texts they annotated
three years ago, and including them allows direct
comparison of annotations from the same workers
across both methods. Of these, 91 workers
took part again, and 73 additional workers were
recruited through a selection task.
Out of the 6,505 items in DiscoGeM 1.0, 16
duplicates were identified and removed. The
remaining items were divided into batches of 20–25,
with each batch assigned to at least 10 workers.
The workers were awarded £1.8–£2.2 per batch.
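As a rough illustration of the batching arithmetic, 6,505 − 16 = 6,489 items remain after de-duplication. The text does not say how items were assigned to batches; the sketch below assumes the batches are filled as evenly as possible while using as few batches as the upper size limit allows. The function name split_into_batches is hypothetical.

```python
import math


def split_into_batches(items, min_size=20, max_size=25):
    """Split `items` into consecutive batches of near-equal size.

    Assumption (not stated in the paper): use the fewest batches that
    respect `max_size`, then spread items as evenly as possible.
    """
    n = len(items)
    if n == 0:
        return []
    n_batches = math.ceil(n / max_size)
    base, extra = divmod(n, n_batches)
    if base < min_size:
        raise ValueError("cannot satisfy the requested batch size range")
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional item.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


batches = split_into_batches(list(range(6505 - 16)))
```

Under this assumption the 6,489 items fall into 260 batches of 24 or 25 items each.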
The quality of DiscoGeM 1.0 annotations was
primarily controlled by a screening task that se-
lected candidates achieving at least 50% agreement
with gold labels. During the data collection phase,
annotation quality was monitored twice to identify
and remove poorly performing annotators, while
retaining their earlier annotations (Scholman et al.,
2022). Similarly, we used an initial screening task
to ensure annotation quality. However, to maxi-
mize the number of annotators participating in both
tasks, we did not screen those who had contributed
to DiscoGeM 1.0, nor did we conduct additional
screening during the annotation process.
We compare the newly collected data against the
original DiscoGeM 1.0. In addition to analyzing la-
bel distributions from 10 workers per item, we com-
pared aggregated annotations to highlight the differ-
ences. The annotations were aggregated using the
"Worker Agreement with Aggregate" (Wawa) al-
gorithm, which weights each worker’s votes based
on their overall agreement with the majority label
(Ustalov et al., 2021).
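The aggregation step can be sketched in plain Python as follows. This is an illustration of the idea as described in the text (a majority vote, per-worker agreement with that vote, then a re-vote weighted by that agreement), not the code of the cited toolkit; the alphabetical tie-breaking and the function names are assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(annotations, weights=None):
    """annotations: list of (item, worker, label) triples.

    Returns {item: label}. With `weights`, each vote counts as the
    worker's weight instead of 1. Ties are broken alphabetically
    (an assumption made here for determinism).
    """
    scores = defaultdict(Counter)
    for item, worker, label in annotations:
        vote = 1.0 if weights is None else weights.get(worker, 0.0)
        scores[item][label] += vote
    return {
        item: min(counts, key=lambda lab: (-counts[lab], lab))
        for item, counts in scores.items()
    }


def wawa(annotations):
    """Two-pass aggregation: unweighted vote, then skill-weighted vote."""
    first_pass = majority_vote(annotations)
    hits, totals = Counter(), Counter()
    for item, worker, label in annotations:
        totals[worker] += 1
        hits[worker] += int(label == first_pass[item])
    # A worker's skill is the share of their labels matching the majority.
    skills = {w: hits[w] / totals[w] for w in totals}
    return majority_vote(annotations, skills), skills


example = [
    ("i1", "a", "RESULT"), ("i1", "b", "RESULT"), ("i1", "c", "CONJUNCTION"),
    ("i2", "a", "REASON"), ("i2", "b", "REASON"), ("i2", "c", "REASON"),
]
labels, skills = wawa(example)
```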
4 Results
Table 1 presents the agreement between the annota-
tions obtained by the two methods. We computed
the average Jensen-Shannon divergence (JSD) between
the label distributions of each item, as well
as hard and soft agreement rates. Hard agreement
measures matches between the single aggregated
annotations, while soft agreement considers any
overlap between labels that each receive over 20%
of the distribution a match (Pyatkin et al., 2023).
We also calculated the soft κ scores, an
inter-annotator agreement metric that accounts for
the increased chance agreement in multi-label
predictions (Marchal et al., 2022).
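The per-item comparison measures can be sketched as below. Several details are assumptions rather than facts from the paper: the logarithm base (2 here, so the JSD lies in [0, 1]), the use of the divergence rather than its square root, and the function names. Soft κ is omitted because its definition is not given in the text.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two label->prob dicts."""
    labels = set(p) | set(q)
    mix = {lab: 0.5 * (p.get(lab, 0.0) + q.get(lab, 0.0)) for lab in labels}

    def kl_to_mix(dist):
        # KL(dist || mix); labels with zero probability contribute nothing.
        return sum(v * math.log2(v / mix[lab]) for lab, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def hard_match(label_a, label_b):
    """Hard agreement: the two single aggregated labels are identical."""
    return label_a == label_b


def soft_match(p, q, threshold=0.2):
    """Soft agreement: some label exceeds `threshold` in both distributions."""
    strong_p = {lab for lab, v in p.items() if v > threshold}
    strong_q = {lab for lab, v in q.items() if v > threshold}
    return bool(strong_p & strong_q)


free = {"RESULT": 0.6, "CONJUNCTION": 0.4}
forced = {"CONJUNCTION": 0.3, "PRECEDENCE": 0.7}
```

For the toy pair above, the aggregated labels differ (no hard match) but CONJUNCTION exceeds 20% in both distributions, so the item counts as a soft match.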
inter-method comparison          free vs. forced
JSD (full dist.)                 .527
Hard agreement (single label)    .425
Soft agreement (multi-labels)    .708
Soft κ (multi-labels)            .663

intra-method comparison          free     forced
Entropy                          0.353    0.460
Agreement (max. label dist.)     0.508    0.404
Per-item unique label count      4.309    6.275

Table 1: Annotation agreement
It can be observed that the inter-method agree-
ment between single aggregated annotations is
moderate, comparable to the accuracy of state-of-
the-art IDR classification models (Costa and Kos-
seim, 2024; Zeng et al., 2024), but the agreement is
substantially higher when multiple annotations are
considered. This demonstrates that both methods
are capable of annotating the same types of rela-
tions, which often co-occur with other relations.
The bottom half of Table 1 compares the agreement
among the 10 annotations per item in both
methods. The forced-choice method shows higher
average entropy in the per-item label distributions,
indicating greater annotation uncertainty. In
addition, the forced-choice approach yields lower
average per-item agreement (i.e., the proportion
of the majority label) and a higher average number
of unique labels per item. These results all
indicate lower annotator agreement in the
forced-choice approach.
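The three intra-method measures can be computed per item and then averaged over items. The sketch below is an illustration under assumptions: the text does not state the logarithm base or whether the entropy is normalized, so normalization by the number of classes is offered only as an option, and the function name item_stats is hypothetical.

```python
import math
from collections import Counter


def item_stats(labels, n_classes=None):
    """Summarize the annotations of one item.

    labels: the labels given by the workers for this item.
    n_classes: if given, entropy is divided by log2(n_classes)
               (an assumed normalization, not confirmed by the paper).
    """
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    if n_classes and n_classes > 1:
        entropy /= math.log2(n_classes)
    return {
        "entropy": entropy,
        "agreement": max(probs),   # share of the majority label
        "unique": len(counts),     # number of distinct labels
    }


stats = item_stats(["CONJUNCTION"] * 5 + ["RESULT"] * 5)
```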
Figure 1 illustrates the overall distribution of the
unaggregated annotations, computed by the sum of
the normalized per-item distribution, since not all
items have exactly 10 annotations. The free-choice
approach clearly converges on a narrower set of
labels, while the forced-choice approach spans a
wider range. Notably, RESULT and CONJUNCTION
are selected twice as often in the free-choice
method as in the forced-choice method.
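Summing normalized per-item distributions gives every item a total weight of 1, regardless of how many annotations it received. A minimal sketch of that computation, with a hypothetical function name and toy data:

```python
from collections import Counter, defaultdict


def overall_distribution(annotations):
    """annotations: iterable of (item_id, label) pairs.

    Each item's labels are normalized to sum to 1 before being added up,
    so items with more annotations do not dominate the totals.
    """
    per_item = defaultdict(Counter)
    for item, label in annotations:
        per_item[item][label] += 1
    totals = Counter()
    for counts in per_item.values():
        n = sum(counts.values())
        for label, c in counts.items():
            totals[label] += c / n
    return dict(totals)


toy = [
    ("i1", "RESULT"), ("i1", "RESULT"),
    ("i1", "CONJUNCTION"), ("i1", "CONJUNCTION"),
    ("i2", "RESULT"),
]
dist = overall_distribution(toy)
```

In the toy data, item i1 has four annotations and item i2 only one, yet each contributes a total mass of 1.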
[Figure 1: Distribution of the unaggregated annotations. Series: free choice and forced choice; categories: the 28 sense labels of Table 2 plus NOREL; value axis from 0 to 1400.]
The trend is similar when focusing on the most
agreed labels. Figure 2 shows the alignment of the
aggregated annotations from both methods. The
annotations are grouped at level-2 granularity
according to the PDTB sense hierarchy, e.g.
ARG1-AS-DETAIL and ARG2-AS-DETAIL are grouped as
LEVEL-OF-DETAIL. Even though the darkest
diagonal line in the confusion matrix indicates
substantial agreement between annotations from both
methods, many items labelled with CONJUNCTION,
CAUSAL, and ARG2-AS-DETAIL in the free-choice
approach are now assigned to a range of other
relations. While the aggregated annotations from the
forced-choice approach cover all level-2 senses
defined in the framework, half of these senses never
appear in the aggregated annotations from the
free-choice method.
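The level-2 grouping and the cross-tabulation of aggregated labels can be sketched as follows. The mapping follows the sense hierarchy listed in Table 2; the lowercase label spelling, the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter

# Suffix of "argN-as-<suffix>" labels -> level-2 sense (per Table 2).
SUFFIX_TO_LEVEL2 = {
    "goal": "purpose", "cond": "condition", "negcond": "condition",
    "denier": "concession", "instance": "instantiation",
    "detail": "level-of-detail", "excpt": "exception",
    "manner": "manner", "subst": "substitution",
}
# Level-3 labels whose level-2 parent has a different name.
DIRECT = {
    "precedence": "asynchronous", "succession": "asynchronous",
    "reason": "cause", "result": "cause",
}


def to_level2(label):
    """Collapse a level-3 sense label to its level-2 parent."""
    label = label.lower()
    if label.startswith(("arg1-as-", "arg2-as-")):
        return SUFFIX_TO_LEVEL2[label.split("-as-", 1)[1]]
    # Labels such as "conjunction" or "norel" are unchanged at level 2.
    return DIRECT.get(label, label)


def confusion(free, forced):
    """free, forced: {item: aggregated label}. Counts (free, forced) pairs."""
    shared = free.keys() & forced.keys()
    return Counter((to_level2(free[i]), to_level2(forced[i])) for i in shared)


free = {"i1": "conjunction", "i2": "arg2-as-detail", "i3": "result"}
forced = {"i1": "precedence", "i2": "arg1-as-detail", "i3": "reason"}
matrix = confusion(free, forced)
```

Items i2 and i3 disagree at level 3 but land on the diagonal once merged to level 2, which is why the merged matrix shows more agreement than the fine-grained one.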
Next, we directly compare the annotations of the
same workers. In total, we identified 3,223
annotations per method that were made by the same
worker on the same item (spanning 2,542 unique
items and 91 workers). The comparison of these
annotations shows a similar tendency to that found
in the re-annotation of the whole corpus, as shown
in Figure 5 in the Appendix: common relations
like CONJUNCTION and RESULT were annotated as
other, rarer relations in the forced-choice approach.
Figure 3 plots the number of unique relations
identified by workers who participated in both
methods. To ensure comparability, results from
workers who annotated fewer than 50 items in
either method, or who annotated 3 times more items
in one method than in the other, were excluded.
This results in 60 workers, who annotated on
average 621 and 525 items in the free- and
forced-choice methods respectively.
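The exclusion rule can be expressed as a simple filter over per-worker item counts. The sketch assumes that "3 times more" means the larger count exceeds three times the smaller one; the function name and the toy counts (apart from the 621/525 averages quoted above) are hypothetical.

```python
def comparable_workers(free_counts, forced_counts, min_items=50, max_ratio=3.0):
    """Return workers whose workloads are comparable across both methods.

    free_counts, forced_counts: {worker: number of items annotated}.
    A worker is dropped if either count is below `min_items` or if the
    larger count exceeds `max_ratio` times the smaller one (assumed reading).
    """
    keep = []
    for worker in free_counts.keys() & forced_counts.keys():
        low, high = sorted((free_counts[worker], forced_counts[worker]))
        if low < min_items:
            continue
        if high > max_ratio * low:
            continue
        keep.append(worker)
    return sorted(keep)


free_counts = {"w1": 621, "w2": 40, "w3": 600, "w4": 100}
forced_counts = {"w1": 525, "w2": 500, "w3": 150, "w4": 300}
kept = comparable_workers(free_counts, forced_counts)
```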
[Figure 2: Confusion matrix of the aggregated annotations from both methods, with labels merged at level-2 granularity. Axes: forced choice and free choice, each over the 16 level-2 senses plus NOREL.]
Figure 3 shows that all workers identified a broader
range of IDR senses using the forced-choice
method, as indicated by all data points falling below
the diagonal line. Furthermore, workers who could
identify more sense types with the free-choice
approach also identified more sense types in the
forced-choice approach. This suggests individual
differences in sensitivity to the subtle contrasts in
fine-grained discourse relations, with the forced-choice
method further expanding the range of relations
these workers could identify by presenting all
possible options.
[Figure 3: Total number of unique relations annotated by the same workers on the same set of items. x-axis: forced choice label range; y-axis: free choice label range; both axes from 5 to 25.]
5 Discussion and conclusion
We examined the impact of two similar interfaces
used to crowdsource IDR annotations. Using
the free-choice approach, workers tend to select
common IDR labels with higher inter-annotator
agreement, while the forced-choice approach
encourages a larger variety of relations, including
rare ones. Notably, both methods produce valid
annotations, as evidenced by the high soft match
agreement. Frequent senses can often be inferred
alongside other senses, such as the CONJUNCTION
sense in the examples in Figure 4. In these exam-
ples, the English forced-choice annotations align
with other languages, despite being labeled as CON-
JUNCTION in the original free-choice annotations
of DiscoGeM 1.0.
High inter-annotator agreement is often linked
to higher data quality. However, for inherently am-
biguous tasks like IDR identification, we showed
that higher-agreement annotations that converge on
common labels are not always superior. Recognizing
the method bias enables tailoring the approach
to the annotation goal, whether that is to achieve
consensus on a single label or to capture diverse
perspectives. Since current IDR classification models often
struggle with rare labels, datasets with more label
variety may be more valuable. Still, distinguishing
genuine perspectives from annotation errors is chal-
lenging. Minimal data cleaning, such as removing
labels with very few votes, could be applied.
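A minimal version of such cleaning is sketched below. The vote threshold of 2 and the fallback of keeping all labels when nothing passes the threshold are assumptions chosen for illustration, not settings from the paper.

```python
from collections import Counter


def prune_rare_labels(labels, min_votes=2):
    """Drop labels with fewer than `min_votes` votes and renormalize.

    labels: the labels given by the workers for one item.
    If no label reaches the threshold, the original distribution is kept
    (an assumed fallback so that no item ends up without a label).
    """
    counts = Counter(labels)
    kept = {lab: c for lab, c in counts.items() if c >= min_votes}
    if not kept:
        kept = dict(counts)
    total = sum(kept.values())
    return {lab: c / total for lab, c in kept.items()}


votes = ["CONJUNCTION"] * 6 + ["PRECEDENCE"] * 3 + ["NOREL"]
cleaned = prune_rare_labels(votes)
```

Here the single NOREL vote is removed while the minority reading PRECEDENCE, supported by three workers, is retained as an alternative interpretation.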
For corpus analysis, data should be collected con-
sistently using the same method. Initial analysis
reveals significant differences between the inter-
annotator agreements of the English annotations
in DiscoGeM 1.0 and the multilingual annotations
in DiscoGeM 2.0, whereas the re-annotated data
in this study aligns more closely with the other
languages (e.g. average per-item agreement =
.508/.404 for EN free-/forced-choice vs. .410–.439
for DE, FR, CS forced-choice), indicating the
influence of the method bias. Our next step is to analyze
the cross-lingual difference based on annotations
collected with the same method.
Acknowledgements
This project is supported by the German Research
Foundation (DFG) under Grant SFB 1102
(“Information Density and Linguistic Encoding”,
Project-ID 232722074).
1)
Arg1: It was because of this tiny piece of information that
Ford Prefect was not now a whiff of hydrogen, ozone and
carbon monoxide. He heard a slight groan.
Arg2: By the light of the match he saw a heavy shape mov-
ing slightly on the floor. Quickly he shook the match out,
reached in his pocket, found what he was looking for and
took it out.
Aggregated annotation = PRECEDENCE
(English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION
(English free-choice)
2)
Arg1: In yesterday’s debate in the European Parliament
some Members of this Parliament expressed worry that we
were interfering in the internal affairs of a Member State.
Arg2: Such a concern is misplaced. The European Parlia-
ment has never been slow to comment on developments in
Member States with which they disagree.
Aggregated annotation = REASON
(English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION
(English free-choice)
3)
Arg1: With a spring Gollum got up and started shambling
off at a great pace. Bilbo hurried after him, still cautiously,
though his chief fear now was of tripping on another snag
and falling with a noise. His head was in a whirl of hope
and wonder.
Arg2: It seemed that the ring he had was a magic ring: it
made you invisible!
Aggregated annotation = SYNCHRONOUS
(English, German, French, Czech forced-choice)
Aggregated annotation = CONJUNCTION
(English free-choice)
Figure 4: Examples taken from DiscoGeM where the
annotations by the forced- and free-choice approaches are
alternative interpretations. The English forced-choice
annotations come from the current study and those from
the other languages come from DiscoGeM 2.0. The
English free-choice annotations come from DiscoGeM 1.0.
References
Nelson Filipe Costa and Leila Kosseim. 2024. A multi-task and multi-label classification model for implicit discourse relation recognition. arXiv preprint arXiv:2408.08971.
Anca Dumitrache, Oana Inel, Benjamin Timmermans, Carlos Ortiz, Robert-Jan Sips, Lora Aroyo, and Chris Welty. 2021. Empirical methodology for crowdsourcing ground truth. Semantic Web, 12(3):403–421.
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, and Davide Bernardi. 2024. Perspectivist approaches to natural language processing: a survey. Language Resources and Evaluation, pages 1–28.
Ujwal Gadiraju, Jie Yang, and Alessandro Bozzon. 2017. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM conference on hypertext and social media, pages 5–14.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
Jet Hoek and Merel Scholman. 2017. Evaluating discourse annotation: Some recent insights and new approaches. In Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (isa-13).
Nanjiang Jiang and Marie-Catherine de Marneffe. 2022. Investigating reasons for disagreement in natural language inference. Transactions of the Association for Computational Linguistics, 10:1357–1374.
Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 269–278, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
Yudai Kishimoto, Shinnosuke Sawada, Yugo Murawaki, Daisuke Kawahara, and Sadao Kurohashi. 2018. Improving crowdsourcing-based annotation of Japanese discourse relations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Marian Marchal, Merel Scholman, Frances Yung, and Vera Demberg. 2022. Establishing annotation quality in multi-label annotations. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3659–3668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
Rashmi Prasad, Bonnie Webber, Alan Lee, and Aravind Joshi. 2019. Penn Discourse Treebank Version 3.0.
Valentina Pyatkin, Ayal Klein, Reut Tsarfaty, and Ido Dagan. 2020. QADiscourse - Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2804–2819, Online. Association for Computational Linguistics.
Valentina Pyatkin, Frances Yung, Merel C. J. Scholman, Reut Tsarfaty, Ido Dagan, and Vera Demberg. 2023. Design choices for crowdsourcing implicit discourse relations: Revealing the biases introduced by task design. Transactions of the Association for Computational Linguistics, 11:1014–1032.
Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher N. L. Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 49–58, Berlin, Germany. Association for Computational Linguistics.
Merel Scholman and Vera Demberg. 2017. Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task. In Proceedings of the 11th Linguistic Annotation Workshop, pages 24–33, Valencia, Spain. Association for Computational Linguistics.
Merel C. J. Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022. DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC’22), Marseille, France. European Language Resources Association (ELRA).
Aaron D Shaw, John J Horton, and Daniel L Chen. 2011. Designing incentives for inexpert human raters. In Proceedings of the ACM 2011 conference on Computer supported cooperative work, pages 275–284.
Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470.
Dmitry Ustalov, Nikita Pavlichenko, Vladimir Losev, Iulian Giliazev, and Evgeny Tulin. 2021. A general-purpose crowdsourcing computational quality control toolkit for python. In The Ninth AAAI Conference on Human Computation and Crowdsourcing: Works-in-Progress and Demonstration Track (HCOMP 2021).
Frances Yung, Vera Demberg, and Merel Scholman. 2019. Crowdsourcing discourse relation annotations by a two-step connective insertion task. In Proceedings of the 13th Linguistic Annotation Workshop, pages 16–25, Florence, Italy. Association for Computational Linguistics.
Frances Yung, Merel Scholman, Sarka Zikanova, and Vera Demberg. 2024. DiscoGeM 2.0: A parallel corpus of English, German, French and Czech implicit discourse relations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4940–4956, Torino, Italia. ELRA and ICCL.
Lei Zeng, Ruifang He, Haowen Sun, Jing Xu, Chang Liu, and Bo Wang. 2024. Global and local hierarchical prompt tuning framework for multi-level implicit discourse relation recognition. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7760–7773, Torino, Italia. ELRA and ICCL.
Shengqi Zhu and Jeffrey Rzeszotarski. 2024. “Get their hands dirty, not mine”: On researcher-annotator collaboration and the agency of annotators. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8773–8782, Bangkok, Thailand. Association for Computational Linguistics.
A Appendix

level-2.level-3 IDR sense label       DC

Temporal
SYNCHRONOUS.SYNCHRONOUS               at the same time
ASYNCHRONOUS.PRECEDENCE               then
ASYNCHRONOUS.SUCCESSION               after

Contingency
CAUSE.REASON                          because
CAUSE.RESULT                          as a result
PURPOSE.ARG1-AS-GOAL                  for that purpose
PURPOSE.ARG2-AS-GOAL                  so that
CONDITION.ARG1-AS-COND                in that case
CONDITION.ARG1-AS-NEGCOND             if not
CONDITION.ARG2-AS-COND                if
CONDITION.ARG2-AS-NEGCOND             unless

Comparison
CONCESSION.ARG1-AS-DENIER             even though
CONCESSION.ARG2-AS-DENIER             nonetheless
CONTRAST.CONTRAST                     on the other hand
COMPARISON.SIMILARITY.SIMILARITY      similarly

Expansion
EQUIVALENCE.EQUIVALENCE               in other words
INSTANTIATION.ARG1-AS-INSTANCE        this illustrates that
INSTANTIATION.ARG2-AS-INSTANCE        for example
LEVEL-OF-DETAIL.ARG1-AS-DETAIL        in short
LEVEL-OF-DETAIL.ARG2-AS-DETAIL        in more detail
CONJUNCTION.CONJUNCTION               also
DISJUNCTION.DISJUNCTION               or
EXCEPTION.ARG1-AS-EXCPT               other than that
EXCEPTION.ARG2-AS-EXCPT               an exception is that
MANNER.ARG1-AS-MANNER                 thereby
MANNER.ARG2-AS-MANNER                 as if
SUBSTITUTION.ARG1-AS-SUBST            rather than
SUBSTITUTION.ARG2-AS-SUBST            instead

NOREL                                 (no direct relation)

Table 2: English DC choices used in the forced-choice
DC insertion method
[Figure 5: Comparison between 3233 annotations by the same workers on the same items using both methods. Matrix over the 28 sense labels of Table 2 plus NOREL; axes: forced choice and free choice.]