
Paper 2502.06648

The 2021 Tokyo Olympics Multilingual News Article Dataset

Authors: Erik Novak, Erik Calcina, Dunja Mladenić, Marko Grobelnik

Published: 2025-02-10

Abstract:

In this paper, we introduce a dataset of multilingual news articles covering the 2021 Tokyo Olympics. A total of 10,940 news articles were gathered from 1,918 different publishers, covering 1,350 sub-events of the 2021 Olympics, and published between July 1, 2021, and August 14, 2021. These articles are written in nine languages from different language families and in different scripts. To create the dataset, the raw news articles were first retrieved via a service that collects and analyzes news articles. Then, the articles were grouped using an online clustering algorithm, with each group containing articles reporting on the same sub-event. Finally, the groups were manually annotated and evaluated. The development of this dataset aims to provide a resource for evaluating the performance of multilingual news clustering algorithms, for which limited datasets are available. It can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics from different perspectives. The dataset is available in CSV format and can be accessed from the CLARIN.SI repository.

Affiliation: Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana, Slovenia. Contact: {name.surname}@ijs.si

arXiv:2502.06648v2 [cs.IR] 13 Feb 2025

Keywords: Dataset · Multilingual · Event Clusters · News articles · Automatic annotation · Manual annotation

Background & Summary

Through news articles, we can learn about global events. Different publishers report the same event from various perspectives, highlighting what they find essential for their audience. Depending on the publisher’s country, articles may be in different languages and may reflect the writer’s biases. Thus, news articles are key for identifying world events, their coverage, and their global significance.

To analyze these aspects, we need effective methods to group multilingual news articles based on their events, which typically involve similar entities (who/what was involved), time (when it happened), and place (where it happened). There is a scarcity of multilingual news datasets for developing methods and models to group articles reporting on the same event. Table 1 lists the datasets suitable for news-related tasks.

Most existing news datasets support research in classifying news articles into topics or domains [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Datasets like 20 Newsgroups [1], AG News [3], BBC News Archive [4], and POLUSA [10] contain articles annotated with a few labels, usually corresponding to the news domain (e.g. Sports, Business, Tech, World). Others, such as Reuters-21578 [2], TDT2 [6], and News Category Dataset [11, 12], have annotated articles suitable for topic detection. Recent datasets, NELA-GT-2018 [7] and NELA-GT-2019 [9], have multiple label groups corresponding to specific news aspects, such as political bias, reliability, and transparency. While they help classify news into topics or domains, these datasets are not necessarily suitable for identifying news events, as the labels are often too broad and invariant to the temporal and geographical information of the article.

Furthermore, most datasets contain primarily English articles, making them unsuitable for multilingual tasks. Those that are multilingual are typically designed for other news-related tasks, such as discrimination (SETimes.HBS [13]) and alignment (Hashemi et al. dataset [14]) between languages, language simplification (SNIML [15]), and news summarization (MassiveSumm [16]). Multilingual news datasets, such as CC News [17] and Babel Briefings [18], are developed to analyze world events, cultural narratives, and more. In addition to datasets, news monitoring systems such
as Event Registry [19] and the GDELT Project [20] track news reported from various international sources and perform data analysis. Both systems provide detailed event records, including the date, the location, and the actors involved. The GDELT Project provides event information with the source article URLs used to detect the event. Event Registry provides the event information with its news article metadata for analysis purposes.

Our research identified only one multilingual news dataset with articles annotated based on the events they cover. This dataset was prepared by Miranda et al. [21] for evaluating news stream clustering algorithms. It was initially gathered from the Event Registry [19] to assess a cross-lingual news similarity and event tracking method [22]. The dataset contains 34k news articles in various languages published between 2013 and 2015. The articles are grouped into 1.5k events covering various topics.

Table 1: Existing news article datasets grouped based on their expected usage. A label group (marked with *) is a set of labels corresponding to a particular news aspect, such as political bias, reliability, or transparency. Both news event tracking systems have an increasing number of articles and clusters, denoted with the star (⋆) symbol.

| Corpus name | News publication time range | Language(s) | No. articles | No. clusters or labels |
|---|---|---|---|---|
| NEWS EVENT CLUSTERING DATASETS | | | | |
| Miranda et al. Dataset [21] | Dec 2013 - Aug 2015 | English | 21k | 808 |
| | | Spanish | 6.7k | 554 |
| | | German | 6.1k | 490 |
| | | Chinese | 450 | 9 |
| | | Slovenian | 37 | 3 |
| | | Croatian | 13 | 2 |
| | | French | 61 | 2 |
| | | Russian | 231 | 1 |
| | | Italian | 88 | 2 |
| NEWS CLASSIFICATION DATASETS | | | | |
| News Category Dataset [11, 12] | Jan 2012 - Sept 2022 | English | 210k | 42 |
| POLUSA dataset [10] | Jan 2017 - Aug 2019 | English | 0.9M | 18 |
| NELA-GT-2019 [9] | Jan 2019 - Dec 2019 | English | 1.12M | 7 label groups* |
| NELA-GT-2018 [7] | Feb 2018 - Nov 2018 | English | 713k | 8 label groups* |
| BBC News Archive [4] | 2004 - 2005 | English | 2k | 5 |
| AG News [3] | 2004 | English | 1M | 4 |
| TDT2 Multilanguage Text Corpus [6] | Jan 1998 - June 1998 | English | 53.6k | 100 |
| | | Chinese | 18.8k | 100 |
| RCV1 [5] | Aug 1996 - Aug 1997 | English | 804k | 53 |
| 20 Newsgroups [1] | Apr 1993 - May 1993 | English | 20k | 20 |
| Reuters-21578 [2] | Feb 1987 - Oct 1987 | English | 21.5k | 135 |
| Pelicon et al. Dataset [8] | - | Croatian | 2k | 3 |
| OTHER NEWS DATASETS | | | | |
| SNIML [15] | 2003 - 2022 | 6 languages | 13.4k | - |
| Babel Briefings [18] | Aug 2020 - Nov 2021 | 30 languages | 4.7M | - |
| CC News [17] | Jan 2017 - Dec 2019 | English | 708k | - |
| Hashemi et al. Dataset [14] | Jan 2002 - Dec 2006 | English & Persian | 245k | - |
| MassiveSumm [16] | - | 92 languages | 28.8M | - |
| SETimes.HBS [13] | - | 3 languages | 9k | - |
| NEWS EVENT TRACKING SYSTEMS | | | | |
| Event Registry [19] | since 2014 | +60 languages | ⋆ | ⋆ |
| GDELT Project [20] | since 1979 | +100 languages | ⋆ | ⋆ |

Because the Miranda et al. dataset spreads a low number of events over a wide time range, it may not be suitable for developing approaches focused on high-frequency events, i.e., events that happen in close temporal proximity and possibly in similar locations. To address this gap, we created a novel multilingual news dataset named OG2021, containing articles reporting on the 2021 Tokyo Olympics. The articles are written in multiple languages and annotated based on the events they report; articles on the same event have the same annotation.
The Olympics, spanning 18 days, presented a dense array of sub-events happening simultaneously, including articles with temporal, geographical, and contextual similarities that may challenge separation.

Figure 1 illustrates the schematic overview of the novel OG2021 dataset preparation, including news article retrieval, annotation, and technical evaluation. The dataset was primarily developed to evaluate online multilingual news clustering algorithms in a high-frequency event setting. However, it can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics, including cultural and linguistic differences in the articles and challenges faced by the organizers and competitors, such as COVID regulations.

[Figure 1 panels: News retrieval (criteria definition: language, time range, wiki concepts; news collection). News annotation (news cleanup; automatic news clustering; manual annotation and evaluation with the actions relocate, split, merge, eliminate, remove). Corpus characteristics and validation (descriptive characteristics; technical validation). Dataset publication (research version; public version).]

Figure 1: The schematic overview of the OG2021 development.

Methods

This section focuses on the approach used to create the OG2021 dataset. It first describes the news retrieval, followed by their annotation process.

News Retrieval

Before we can begin any data processing, we must first retrieve the relevant news articles. This section describes the defined criteria and the approach taken to retrieve these articles.

Criteria definition

To retrieve the appropriate articles, we first defined the criteria that the articles must follow. The requirements consist of three conditions: (1) the languages in which the article must be written, (2) the publication time, which must be within a predefined time range, and (3) the contextual concepts the article must include. Table 2 shows the overview of the defined conditions. Below is a detailed description of each criterion.

Languages. We aim to include articles in languages from diverse language families and scripts to represent a wide range of linguistic features. Following Miranda et al.’s dataset, we chose English, Spanish, German, French, Russian, Chinese, and Slovenian. We also include Portuguese and Arabic to increase linguistic diversity. The selected languages cover different language families (Germanic, Italic, Slavic, Semitic, and Sinitic) and are written in different scripts (Latin, Cyrillic, Arabic, and Chinese).

Table 2: The news retrieval criteria conditions.

| No. | Condition type | Condition values |
|---|---|---|
| 1 | Languages | English, Portuguese, Spanish, French, Russian, German, Slovenian, Arabic, Chinese |
| 2 | Publication time range | July 1, 2021 - August 14, 2021 |
| 3 | Contextual concepts | Olympic Games, Japan and at least one of basketball, sports climbing, swimming, judo, rowing, skateboarding, or table tennis |

Publication time range. To capture relevant articles, we limited our selection to those published between July 1, 2021, and August 14, 2021. This time range includes articles from three weeks before the start of the 2021 Tokyo Olympics (July 23, 2021) to one week after its conclusion (August 8, 2021). This allowed us to gather articles covering the events leading up to the opening as well as post-event coverage.

Contextual concepts. The articles’ contextual concepts must be related to the 2021 Tokyo Olympics.
Because of this, we decided the articles must first be related to the general concepts Olympic Games and Japan, which limits the scope of retrieved articles. To further narrow the focus, we concentrate on seven different sports: basketball, sports climbing, swimming, judo, rowing, skateboarding, and table tennis. The sports were chosen for their diversity in competition organization and duration, as well as their historic presence in previous Olympic Games. Notably, sports climbing and skateboarding debuted in the 2021 Olympic Games.

News collection

With the defined article criteria, we can now retrieve the relevant news articles. To do this, we use Event Registry [19], a system that collects news articles from thousands of publishers, clusters them into news events, and enriches news articles and event clusters by extracting the named entities mentioned. It also links the article’s textual components to corresponding Wikipedia pages through a process called wikification [23, 24]. Due to the structure of Wikipedia, where different-language Wikipedia pages corresponding to the same concept are linked, the wikification process identifies the Wikipedia pages in both the article’s language and in English, if available.

The Event Registry system has a dedicated API (https://www.newsapi.ai/), which enables the retrieval of news articles and event cluster metadata. The retrieval can be done through multiple endpoints, which accept various parameters, including the language of the article, the publication date range, and the Wikipedia concepts the articles must relate to. Although the system provides clustered news articles, we used the API to collect only the news articles, as the clusters created by the system do not have the structure we are targeting.

Using the defined criteria, we developed code to retrieve the relevant articles automatically via the Event Registry API.
We split the time range into days, then retrieved the articles published on a given day that were written in one of the selected languages and contained both the Olympic Games and Japan Wikipedia concepts, as well as at least one of the sports concepts listed in the criteria definition. The API allows using English Wikipedia concepts to retrieve articles in other languages, eliminating the need for prior translation. Through this process, we retrieved 36k articles that matched the defined criteria. In addition, some articles also cover concepts not defined in the criteria, such as additional sports and issues relating to COVID. Therefore, the dataset covers a broader range of concepts. The articles were then sent for annotation.

News Annotation

Once the news articles were retrieved, we proceeded with the annotation process. The goal is to annotate articles based on the events they reported on, i.e., ensuring that articles reporting on the same event received the same annotation. We first describe how we prepare and clean the retrieved news articles. Afterward, we present the annotation process, which comprises automatic news clustering and the subsequent manual annotation and evaluation.

News cleanup

The retrieved articles have various attributes, both original and enriched by the Event Registry. To provide only the original information, we keep the following attributes of each article: its unique ID, its title and body text, the publication datetime, its language, its published URL address, and the name of the source publisher. We then split the articles into seven datasets, each corresponding to one of the sports concepts used during the retrieval process. Each dataset is then processed in the automatic clustering and manual annotation and evaluation, in that order.
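The retrieval criteria and the day-by-day split described above can be sketched in a few lines. This is a minimal illustration only: it does not call the Event Registry API, and the record layout (`lang`, `published`, `concepts`) and all function names are hypothetical. The language codes follow Table 3 and the remaining values follow Table 2.

```python
from datetime import date, timedelta

# Condition values from Table 2 (language codes as listed in Table 3).
LANGUAGES = {"eng", "por", "spa", "fra", "rus", "deu", "slv", "ara", "zho"}
REQUIRED_CONCEPTS = {"Olympic Games", "Japan"}
SPORT_CONCEPTS = {
    "basketball", "sports climbing", "swimming", "judo",
    "rowing", "skateboarding", "table tennis",
}
START, END = date(2021, 7, 1), date(2021, 8, 14)


def daily_windows(start, end):
    """Yield every calendar day in the inclusive range [start, end]."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def matches_criteria(article):
    """Check a (hypothetical) article record against the three conditions."""
    concepts = set(article["concepts"])
    return (
        article["lang"] in LANGUAGES
        and START <= article["published"] <= END
        and REQUIRED_CONCEPTS <= concepts      # both general concepts present
        and bool(SPORT_CONCEPTS & concepts)    # at least one sport concept
    )


def sports_of(article):
    """Sport concepts an article matched, used to route it to a per-sport dataset."""
    return sorted(SPORT_CONCEPTS & set(article["concepts"]))


# Toy record, invented for illustration.
example = {
    "lang": "slv",
    "published": date(2021, 7, 30),
    "concepts": ["Olympic Games", "Japan", "judo", "COVID-19"],
}
days = list(daily_windows(START, END))
```

Here `days` holds one entry per query day. How an article that matches several sports was assigned to the seven per-sport datasets is not stated in the text, so `sports_of` simply returns every matching sport.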

Automatic news clustering

We apply an online multilingual news clustering algorithm [25] to automatically cluster each dataset corresponding to one of the sports concepts. This algorithm, a single-pass clustering approach, processes the collected datasets by representing each article using its content embedding, the set of extracted named entities, and its publication datetime. The content embedding is created using the SBERT [26, 27] language model, trained to generate contextual embeddings appropriate for pairwise sentence similarity, while the named entities were extracted using the WikiNEuRal [28] multilingual named entity extraction model. The representations for event clusters are created as aggregates of content embeddings, entity sets, and publication datetime values of articles within the clusters.

The algorithm measures the content similarity, ratio of overlapping entities, and temporal proximity between the article and all clusters, checking if all three values are above the set thresholds defined at the beginning of the clustering process. The article is placed into the best event cluster that meets these criteria. If no such cluster exists, a new one is created containing the current article.

The algorithm’s hyper-parameters (thresholds) were chosen to optimize precision while maintaining a reasonable recall score. This intentional selection results in more clusters containing only the most similar articles, potentially reducing the inclusion of articles that should not be clustered together. The automatic news clustering resulted in around 16k event clusters containing an average of 2-3 articles. These clusters are then processed through manual annotation in the next step.

Manual annotation and evaluation

Following the automatic news clustering process, the datasets underwent a manual evaluation and annotation process.
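The single-pass procedure from the Automatic news clustering section can be illustrated with a small, self-contained sketch. It is not the implementation of [25]: the aggregation rules (mean embedding, union of entities, mean publication time), the overlap and proximity formulas, the threshold values, and the choice of "best" cluster as the one with the highest content similarity are all assumptions made for illustration. Embeddings are plain lists here, where the paper uses SBERT vectors.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Article:
    id: str
    embedding: list   # content embedding (SBERT in the paper; toy vectors here)
    entities: set     # extracted named entities
    day: float        # publication time, in days since an arbitrary origin


@dataclass
class Cluster:
    members: list = field(default_factory=list)
    centroid: list = field(default_factory=list)
    entities: set = field(default_factory=set)
    day: float = 0.0

    def add(self, article):
        """Update the aggregates (assumed: running means and entity union)."""
        n = len(self.members)
        if n == 0:
            self.centroid = list(article.embedding)
            self.day = article.day
        else:
            self.centroid = [(c * n + x) / (n + 1)
                             for c, x in zip(self.centroid, article.embedding)]
            self.day = (self.day * n + article.day) / (n + 1)
        self.entities |= article.entities
        self.members.append(article.id)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def entity_overlap(article_entities, cluster_entities):
    """Assumed definition: share of the article's entities found in the cluster."""
    if not article_entities:
        return 0.0
    return len(article_entities & cluster_entities) / len(article_entities)


def time_proximity(article_day, cluster_day):
    """Assumed definition: 1 for the same time, decaying with the gap in days."""
    return 1.0 / (1.0 + abs(article_day - cluster_day))


def cluster_stream(articles, min_sim=0.7, min_overlap=0.3, min_prox=0.25):
    """Assign each article, in arrival order, to the best eligible cluster."""
    clusters = []
    for article in articles:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(article.embedding, cluster.centroid)
            eligible = (
                sim >= min_sim
                and entity_overlap(article.entities, cluster.entities) >= min_overlap
                and time_proximity(article.day, cluster.day) >= min_prox
            )
            if eligible and sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:          # no cluster passes all three thresholds
            best = Cluster()
            clusters.append(best)
        best.add(article)
    return clusters


# Toy stream, invented for illustration.
demo = cluster_stream([
    Article("a1", [1.0, 0.0], {"A", "B"}, 0.0),
    Article("a2", [0.9, 0.1], {"A"}, 1.0),
    Article("a3", [0.0, 1.0], {"C"}, 1.0),
])
```

In the toy stream, `a2` joins the cluster opened by `a1` because it passes all three checks, while `a3` opens a new cluster. Raising the thresholds yields more, smaller clusters, which matches the precision-oriented setting the text describes.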
Each dataset related to a sports concept was assessed using the following procedure.

Cluster representation. To represent each automatically generated cluster, a table was created to display the included article values. Each table row contains the current cluster ID, the unique article ID, the article publication datetime, the article language, as well as its title and body. Notably, the article’s URL and source were intentionally omitted from the table, as they do not contain relevant event information.

Manual annotation criteria. The annotation criteria are designed to categorize articles based on their responses to fundamental journalistic questions: who, what, where, when, and how. The first four questions yield objective answers, while the response to how can involve subjectivity. Because of this, alignment with the how question is considered an optional criterion. The annotators are tasked with grouping articles based on the responses to these questions. In addition, articles reflecting the same event from different perspectives (e.g. two articles describing the finals of a sport event, where one focuses on the gold and the other on the silver medalists) should be in the same event cluster, although they might not have the same responses to the above questions. To further improve the quality of the dataset, annotators are provided with additional guidelines:

• Articles should predominantly focus on a singular event, situation, or key information.
• Articles must avoid resembling “click-bait,” meaning they should not include phrases like “watch now” or “click here to see more.” If they do, the article should be excluded from the dataset.
• Articles should not primarily focus on presenting a schedule of events occurring on a specific day. If they do, the article should be excluded from the dataset.

The intentionally ambiguous criterion aims to create diverse clusters of articles in various languages.
Additionally, the criterion allows validation of an article based solely on its title, provided that the title is informative enough to provide the correct annotation.

Manual evaluation process. Given the evaluation criteria, the annotators were tasked with reviewing the clusters and annotating the articles. If an annotator encountered a text that was not comprehensible, they were permitted to use machine translation services such as Google Translate (https://translate.google.com/). Annotators could remove specific articles from the dataset, relocate articles between clusters, and merge, divide, or eliminate entire clusters. Each action taken was recorded and contributed to the creation of the final dataset.

After the manual annotation and evaluation process for all datasets, the annotated datasets underwent a joint evaluation: a final annotation pass that joined clusters reporting on the same event and removed any inappropriate articles that might have been overlooked during the initial annotation pass. Due to the removal of irrelevant news articles and event clusters, corresponding to schedules and “click-bait”-like content, the number of articles was reduced by around 26k articles. This shows how much content of that nature is generated in such a short time span. Furthermore, the number of clusters was also significantly reduced due to the removal and merging of event clusters. The final dataset was then formatted and prepared for analysis and publication.

Data Records

The OG2021 dataset consists of 10,940 news articles reporting on the 2021 Tokyo Olympics. Table 3 shows the variables describing each news article. Each article is described with its unique ID, its title and truncated body, the publication datetime, its language, its published URL address, and the source publisher.
Each news article is also assigned a cluster ID, indicating articles reporting on the same sub-event during the 2021 Olympics. The news articles are written in nine different languages (English, Portuguese, Spanish, French, Russian, German, Slovenian, Arabic and Chinese), with the article’s publication datetime spanning between July 1, 2021 and August 14, 2021.

Table 3: The OG2021 variables used in the released version. Each variable is presented with its name, type, and description. Due to copyright and legal restrictions, the truncated body of the news article (marked with *) is available only in the research version of the dataset.

| No. | Variable name | Variable type | Description |
|---|---|---|---|
| 1 | ID | number | The unique ID of the news article. |
| 2 | TITLE | string | The title of the news article. |
| 3 | BODY* | string | The truncated body of the news article. |
| 4 | LANG | string | The language in which the article is written. It can be one of nine values: eng, por, spa, fra, rus, deu, slv, ara, zho. |
| 5 | SOURCE | string | The news publisher’s name. |
| 6 | PUBLISHED_AT | date | The date and time the article was published. Format: YYYY-mm-DD HH:MM:SS. |
| 7 | URL | string | The URL location of the news article. |
| 8 | CLUSTER_ID | string | The ID of the cluster the article is a member of. Format: cls-xxxxxxxx, where x can be a number or character. |

Table 4 shows the statistics of the dataset, including the average number of words in the article’s title and body and the average cluster size, along with their standard deviations. The most represented language in the dataset is English, with about 4k articles, followed by Portuguese and Spanish. The least represented language is Chinese, with a total of nine articles. Other languages contain between 200 and 1,000 articles. In total, all articles are grouped into 1,350 event clusters.

Table 4: The OG2021 dataset statistics. It shows the number of articles, the average number of words in the title and body, the number of clusters, and the average cluster size. For Chinese (marked with *), we report the average number of characters in the title and body.

| Language | Script | Language family | No. articles (percent of total) | Words in title (mean and std) | Words in body (mean and std) | No. clusters | Cluster size (mean and std) |
|---|---|---|---|---|---|---|---|
| All | - | - | 10,940 | - | - | 1,350 | 8 (20) |
| English | Latin | Germanic | 4,009 (37%) | 11 (3) | 1,231 (1,147) | 729 | 5 (11) |
| Portuguese | Latin | Italic | 2,410 (22%) | 13 (3) | 527 (374) | 368 | 7 (12) |
| Spanish | Latin | Italic | 2,049 (19%) | 13 (4) | 562 (427) | 381 | 5 (9) |
| French | Latin | Italic | 845 (8%) | 13 (4) | 565 (464) | 170 | 5 (8) |
| Russian | Cyrillic | Slavic | 553 (5%) | 10 (3) | 301 (358) | 152 | 4 (6) |
| German | Latin | Germanic | 516 (4%) | 9 (3) | 833 (1,011) | 100 | 5 (6) |
| Slovenian | Latin | Slavic | 331 (3%) | 9 (3) | 450 (370) | 102 | 3 (3) |
| Arabic | Arabic | Semitic | 218 (2%) | 10 (3) | 405 (269) | 71 | 3 (3) |
| Chinese* | Chinese | Sinitic | 9 (0%) | 28 (7) | 3,402 (1,550) | 5 | 2 (1) |

The OG2021 dataset is stored in a single CSV file, where each line corresponds to a single news article, and is published on the CLARIN.SI repository. It is published in two versions:

• The public version [29]. Due to legal restrictions, the public dataset does not contain the body of the articles. However, other article metadata is available, including its title and published URL address, which can be used to fetch the article’s content. The dataset is publicly available and licensed under CC BY-NC-ND 4.0.
• The research version [30]. The research dataset contains all of the article attributes. The dataset is available for academic use and licensed under CLARIN.SI license ACA ID-BY-NC-INF-NORED 1.0, which requires the user to log into the CLARIN.SI repository via their academic institution.

Technical Validation

In addition to the manual evaluation during the news annotation process, we also performed a technical validation to ensure the dataset indeed contains multilingual news articles within a high-frequency event setting.
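Counts of the kind reported in Table 4 can be recomputed from the CSV using only the LANG and CLUSTER_ID columns from Table 3. The sketch below uses Python's standard library and a tiny inline sample with invented rows. The file name in the comment is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant Table 4 reports.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Toy sample with the public-version columns of Table 3 (rows are invented).
SAMPLE = """ID,TITLE,LANG,SOURCE,PUBLISHED_AT,URL,CLUSTER_ID
1,Title one,eng,Publisher A,2021-07-23 12:00:00,https://example.org/1,cls-0000000a
2,Title two,spa,Publisher B,2021-07-23 13:30:00,https://example.org/2,cls-0000000a
3,Title three,eng,Publisher C,2021-07-23 14:00:00,https://example.org/3,cls-0000000a
4,Title four,eng,Publisher A,2021-07-25 09:15:00,https://example.org/4,cls-0000000b
"""


def summarize(rows):
    """Article and cluster counts, overall and per language."""
    rows = list(rows)
    cluster_sizes = Counter(row["CLUSTER_ID"] for row in rows)
    clusters_by_lang = defaultdict(set)
    for row in rows:
        clusters_by_lang[row["LANG"]].add(row["CLUSTER_ID"])
    return {
        "articles": len(rows),
        "clusters": len(cluster_sizes),
        "articles_per_lang": dict(Counter(row["LANG"] for row in rows)),
        # A cluster counts for a language if it holds at least one such article.
        "clusters_per_lang": {lang: len(ids) for lang, ids in clusters_by_lang.items()},
        "cluster_size_mean": mean(cluster_sizes.values()),
        "cluster_size_std": pstdev(cluster_sizes.values()),  # assumed variant
    }


# With the real file, replace the StringIO with e.g.
# open("og2021.csv", newline="", encoding="utf-8")  (hypothetical file name).
stats = summarize(csv.DictReader(io.StringIO(SAMPLE)))
```

On the released dataset, `stats["articles"]` and `stats["clusters"]` should correspond to the "All" row of Table 4 if the column names match Table 3.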

Article distribution over time

Figure 2 illustrates the distribution of articles based on their publication datetime. The first articles were published on July 1, 2021. The peak concentration of articles occurs between July 23, 2021, and August 8, 2021, corresponding to the start and the end of the 2021 Tokyo Olympics event, respectively. The latest articles in the dataset were published on August 14, 2021. Furthermore, the article distributions for each language separately show a similar distribution, with the peak concentration of articles happening during the 2021 Olympics.

Figure 2: The OG2021 article distribution by date. The majority of articles were published between the official start of the Olympic Games (July 23, 2021) and the official end of the Olympic Games (August 8, 2021).

Cluster distribution based on their size

As mentioned before, the dataset comprises 1,350 distinct clusters, with an average size of eight articles per cluster. The distribution of clusters based on their size is shown in Figure 3. Looking at the whole dataset, around 95% of clusters contain 25 articles or fewer. The highest number of clusters consists of only two articles, while the largest cluster, which corresponds to the 2021 Tokyo Olympics opening ceremony, contains 499 articles. Furthermore, isolating each language separately, the distribution varies across languages, showing that except for English and German, the highest number of clusters in each language contains a single article.

Figure 3: The OG2021 article distribution by size. Globally, about 95% of clusters contain 25 or fewer articles.

Cluster distribution per language

Next, we counted the number of languages that appear in each cluster. Figure 4 shows the distribution of clusters based on the number of present languages. The majority of clusters, approximately 72%, consist of articles written in a single language.
The figure also shows the distribution of monolingual clusters, where the overall distribution per language roughly corresponds to the language’s presence in the dataset. English has the highest number of monolingual clusters, followed by Portuguese and Spanish. French, Russian, German, Slovenian, and Arabic have approximately the same number of monolingual clusters. Chinese has no monolingual clusters, so its articles are always present with those from other languages. Nonetheless, approximately 28% of clusters include articles written in two or more languages. The highest number of languages found in a single cluster is nine, containing all the languages present in the dataset.

Language co-occurrence in clusters

To further highlight the multilingual nature of the dataset, we computed the language co-occurrence across all clusters, as depicted in Figure 5. The diagonal of the co-occurrence graph showcases the percentage of clusters in which the language is present. According to the calculations, any two languages co-appear in at least one cluster, equivalent to 0.1% of the total number of clusters. The English, Portuguese, and Spanish language pairs co-occur in about 10% of the clusters, while the remaining pairs exhibit co-occurrence ranging between 1-6%. The exception is Chinese, which has a low presence in the dataset and consists of only nine articles.

Figure 4: The OG2021 cluster distribution per language. Almost 28% of the clusters contain two or more languages. The lower graph shows the distribution of monolingual clusters across languages.
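The language co-occurrence described in the section above amounts to counting, for every pair of languages, the share of clusters that contain both, with the diagonal giving the share of clusters that contain a language at all. A minimal sketch follows, assuming the input is a sequence of (CLUSTER_ID, LANG) pairs read from the CSV. The function name and the toy data are illustrative.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def language_cooccurrence(pairs):
    """Return (share, multilingual_share) for (cluster_id, lang) pairs.

    share[(a, b)] with a <= b is the fraction of clusters containing both
    languages; share[(a, a)] is the fraction of clusters containing a.
    """
    langs_in_cluster = defaultdict(set)
    for cluster_id, lang in pairs:
        langs_in_cluster[cluster_id].add(lang)

    total = len(langs_in_cluster)
    counts = defaultdict(int)
    for langs in langs_in_cluster.values():
        # Sorted input keeps every key in (a, b) order with a <= b.
        for a, b in combinations_with_replacement(sorted(langs), 2):
            counts[(a, b)] += 1

    share = {pair: n / total for pair, n in counts.items()}
    multilingual = sum(len(langs) > 1 for langs in langs_in_cluster.values()) / total
    return share, multilingual


# Toy input, invented for illustration.
toy_pairs = [
    ("c1", "eng"), ("c1", "spa"),
    ("c2", "eng"),
    ("c3", "por"),
    ("c4", "eng"), ("c4", "spa"), ("c4", "eng"),
]
share, multilingual = language_cooccurrence(toy_pairs)
```

In the toy data, English appears in three of four clusters, English and Spanish share two clusters, and half of the clusters are multilingual. Pairs that never co-occur are simply absent from `share`.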
[Figure 5 content: language co-occurrence matrix, as the percentage of clusters in which both languages appear.]

| | English | Portuguese | Spanish | French | Russian | German | Slovenian | Arabic | Chinese |
|---|---|---|---|---|---|---|---|---|---|
| English | 54.0% | 10.1% | 13.6% | 6.4% | 6.4% | 2.8% | 2.2% | 3.0% | 0.2% |
| Portuguese | 10.1% | 27.3% | 10.2% | 5.0% | 5.0% | 2.5% | 2.4% | 2.4% | 0.2% |
| Spanish | 13.6% | 10.2% | 28.2% | 5.1% | 5.0% | 2.1% | 2.4% | 2.5% | 0.3% |
| French | 6.4% | 5.0% | 5.1% | 12.6% | 3.6% | 2.0% | 1.5% | 1.9% | 0.2% |
| Russian | 6.4% | 5.0% | 5.0% | 3.6% | 11.3% | 1.9% | 1.8% | 1.5% | 0.2% |
| German | 2.8% | 2.5% | 2.1% | 2.0% | 1.9% | 7.4% | 0.8% | 0.8% | 0.2% |
| Slovenian | 2.2% | 2.4% | 2.4% | 1.5% | 1.8% | 0.8% | 7.6% | 0.7% | 0.1% |
| Arabic | 3.0% | 2.4% | 2.5% | 1.9% | 1.5% | 0.8% | 0.7% | 5.3% | 0.1% |
| Chinese | 0.2% | 0.2% | 0.3% | 0.2% | 0.2% | 0.2% | 0.1% | 0.1% | 0.4% |

Figure 5: The OG2021 language co-occurrence in clusters. All language pairs appear together in at least one cluster.

Limitations and future work

The raw news articles were collected using a list of concepts related to the 2021 Olympics and specific sports. Because of this, the dataset includes news articles that focus on the chosen sports rather than all sports at the event. For instance, athletics and gymnastics are less covered due to the defined retrieval conditions. Furthermore, the removal of “click-bait” and scheduling news articles significantly reduced the dataset size. These articles can introduce noise, which can be useful for developing methods to effectively group news articles or identify such types.

Considering these potential extensions, we plan to develop similar news article datasets focusing on future Olympic Games, starting with the 2024 Olympics in Paris. We would expand the retrieval conditions to include more languages and a broader list of sports. Additionally, we would include “click-bait” and scheduling articles, annotating them to reflect their content type.

Usage Notes

To use the OG2021 dataset, the user must first download it from the CLARIN.SI repository [29, 30]. Since the dataset is in CSV format, it can be opened using various programs and programming libraries.
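As a usage sketch, once the CSV is loaded, each article's CLUSTER_ID can serve as the gold label when scoring the output of a clustering algorithm. The example below computes BCubed precision, recall, and F1, one of the metric families suggested for this dataset. The function name and the toy labels are illustrative, and the formulation is the standard per-item BCubed definition rather than code from the authors.

```python
from collections import Counter


def bcubed(gold, pred):
    """BCubed precision, recall, and F1.

    gold and pred map each article ID to a cluster label (e.g. the dataset's
    CLUSTER_ID and an algorithm's assignment). Both must cover the same IDs.
    """
    ids = list(gold)
    gold_size = Counter(gold[i] for i in ids)
    pred_size = Counter(pred[i] for i in ids)
    # Number of articles sharing both the gold and the predicted cluster.
    joint = Counter((gold[i], pred[i]) for i in ids)

    precision = sum(joint[(gold[i], pred[i])] / pred_size[pred[i]] for i in ids) / len(ids)
    recall = sum(joint[(gold[i], pred[i])] / gold_size[gold[i]] for i in ids) / len(ids)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration: one gold event is split in two.
gold = {"a": "cls-x", "b": "cls-x", "c": "cls-x", "d": "cls-y"}
pred = {"a": 1, "b": 1, "c": 2, "d": 2}
precision, recall, f1 = bcubed(gold, pred)
```

A perfect clustering gives 1.0 for all three scores. In the toy case the predicted split lowers recall, and mixing article "c" with "d" lowers precision.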
The OG2021 dataset can be used to evaluate (online) multilingual news clustering algorithms, which is the main reason for creating it. The algorithm would process and group the news articles into event clusters. The created event clusters would then be compared with the dataset’s cluster IDs to measure the algorithm’s performance, which might include the standard and BCubed [ 31] variants of the F1, precision, and recall scores, cluster purity, and normalized mutual information. The performance metrics would then show how the algorithm fares in the high-frequency event setting presented within the dataset. The dataset can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics, including cultural differences and perspectives in reporting based on news publishers and the language used. Furthermore, it allows for analysis of the 2021 Olympics timeline, viewing the challenges faced by organizers and competitors and the solutions introduced. Code availability The code version used to generate the OG2021 dataset is available on both Zenodo [32]. Acknowledgements This work was supported by the Slovenian Research Agency and the European Union’s Horizon 2020 project Humane AI Net [Grant No. 952026]. Furthermore, we would like to thank Anton Križnar and Matevž Matjašec for their contributions regarding the preliminary analysis of news articles, based on which we defined and executed the creation of the OG2021 dataset. References [1]Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995 , pages 331–339. Morgan Kaufmann, 1995. [2] David Lewis. Reuters-21578 text categorization collection, 1997. [3]Antonio Gulli. AG news. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html , 2004. Accessed: 2024-08-05. [4]Derek Greene and Pádraig Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. 
In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 377–384. ACM Press, 2006.
[5] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[6] Charles Wayne, George R Doddington, Jonathan G Fiscus, Mark Liberman, Jennifer Alabiso, David Graff, and Christopher Cieri. TDT2 multilanguage text version 4.0, 2001.
[7] Jeppe Nørregaard, Benjamin D Horne, and Sibel Adalı. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 630–638. Association for the Advancement of Artificial Intelligence (AAAI), 2019.
[8] Andraž Pelicon, Marko Pranjić, Dragana Miljković, Blaž Škrlj, and Senja Pollak. Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences (Basel, Switzerland), 10:5993, 2020.
[9] Maurício Gruppi, Benjamin D Horne, and Sibel Adalı. NELA-GT-2019: A large multi-labelled news dataset for the study of misinformation in news articles. Preprint at https://arxiv.org/abs/2003.08444, 2020.
[10] Lukas Gebhard and Felix Hamborg. The POLUSA dataset: 0.9M political news articles balanced by time and outlet popularity. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pages 467–468. ACM, 2020.
[11] Rishabh Misra and Jigyasa Grover. Sculpting Data for ML: The first act of Machine Learning. Independently Published, 2021.
[12] Rishabh Misra. News category dataset. Preprint at https://arxiv.org/abs/2209.11429, 2022.
[13] Jörg Tiedemann and Nikola Ljubešić. Efficient discrimination between closely related languages. In Proceedings of COLING 2012, pages 2619–2634, 2012.
[14] Homa Baradaran Hashemi, Azadeh Shakery, and Heshaam Faili.
Creating a Persian-English comparable corpus. In Multilingual and Multimodal Information Access Evaluation, Lecture Notes in Computer Science, pages 27–39. Springer Berlin Heidelberg, 2010.
[15] Renate Hauser, Jannis Vamvas, Sarah Ebling, and Martin Volk. A multilingual simplified language news corpus. In Proceedings of the 2nd Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) within the 13th Language Resources and Evaluation Conference, pages 25–30, 2022.
[16] Daniel Varab and Natalie Schluter. MassiveSumm: a very large-scale, very multilingual, news summarisation dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10150–10161. Association for Computational Linguistics, 2021.
[17] Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R Trippas, J Shane Culpepper, and Alistair Moffat. CC-News-En: A large English news corpus. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM, 2020.
[18] Felix Leeb and Bernhard Schölkopf. A diverse multilingual news headlines dataset from around the world. Preprint at https://arxiv.org/abs/2403.19352, 2024.
[19] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.
[20] K Leetaru and P A Schrodt. GDELT: Global data on events, location, and tone, 1979–2012. ISA Annual Convention, 2013.
[21] Sebastião Miranda, Artūrs Znotiņš, Shay B Cohen, and Guntis Barzdins. Multilingual clustering of streaming news. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4535–4544. Association for Computational Linguistics, 2018.
[22] Jan Rupnik, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. News across languages - cross-lingual document similarity and event tracking.
Journal of Artificial Intelligence Research, 55:283–316, 2016.
[23] Janez Brank, Gregor Leban, and Marko Grobelnik. Annotating documents with relevant Wikipedia concepts. In Proceedings of Slovenian KDD Conference on Data Mining and Data Warehouses (SiKDD), 2017.
[24] Janez Brank, Gregor Leban, and Marko Grobelnik. Semantic annotation of documents based on Wikipedia concepts. Informatica, 42, 2018.
[25] Erik Novak. News stream clustering using multilingual language models. In The Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD), 2021.
[26] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
[27] Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2020.
[28] Simone Tedeschi, Valentino Maiorca, Niccolò Campolungo, Francesco Cecconi, and Roberto Navigli. WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2521–2533. Association for Computational Linguistics, 2021.
[29] Erik Novak, Erik Calcina, Dunja Mladenić, and Marko Grobelnik. The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (public), 2024. Slovenian language resource repository CLARIN.SI.
[30] Erik Novak, Erik Calcina, Dunja Mladenić, and Marko Grobelnik.
The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (research), 2024. Slovenian language resource repository CLARIN.SI.
[31] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12:461–486, 2009.
[32] Erik Novak, Matevž Matjašec, and Erik Calcina. The code for creating the OG2021 dataset, 2024.
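
The comparison between an algorithm's event clusters and the dataset's cluster IDs, described in the Usage Notes, can be illustrated with the BCubed scores [31]. The sketch below is an assumption-laden illustration and not the authors' evaluation code: the function name `bcubed`, the dictionary-based inputs, and the toy labels are hypothetical. For each article, precision is the fraction of its predicted cluster that shares its reference cluster, and recall is the fraction of its reference cluster that shares its predicted cluster; both are averaged over all articles, and F1 is taken here as the harmonic mean of the two averages, which is one common convention.

```python
from collections import Counter


def bcubed(gold, pred):
    """Return BCubed (precision, recall, F1) for two labelings.

    gold and pred map each article ID to a cluster label and must
    cover exactly the same article IDs.
    """
    if gold.keys() != pred.keys():
        raise ValueError("gold and pred must cover the same articles")
    if not gold:
        raise ValueError("no articles to evaluate")

    gold_sizes = Counter(gold.values())
    pred_sizes = Counter(pred.values())
    # Count articles that share both a reference and a predicted cluster.
    overlap = Counter((gold[a], pred[a]) for a in gold)

    precision = recall = 0.0
    for a in gold:
        shared = overlap[(gold[a], pred[a])]
        precision += shared / pred_sizes[pred[a]]
        recall += shared / gold_sizes[gold[a]]

    n = len(gold)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four articles, two reference events, two predicted clusters.
gold = {"a1": "e1", "a2": "e1", "a3": "e2", "a4": "e2"}
pred = {"a1": "x", "a2": "x", "a3": "x", "a4": "y"}
p, r, f1 = bcubed(gold, pred)
print(f"BCubed precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In the toy example, the predicted cluster `x` merges articles from two reference events, which lowers precision, while splitting event `e2` across two predicted clusters lowers recall.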

---