TLS-Covid19 dataset

What

TLS-Covid19 is a multilingual, multi-document Timeline Summarization (TLS) annotated dataset built to foster the development and evaluation of new algorithms and, at the same time, to enable the study of news coverage of the Covid-19 pandemic. It consists of a number of curated topics related to the Covid-19 outbreak, with associated news articles from Portuguese and English news outlets and their respective reference timelines as gold standard. The following figure shows the format and the structure of the dataset.

[Figure: Dataset structure]

Why

The rise of social media and the explosion of digital news on the web have created new challenges for extracting knowledge and making sense of published information. Automated timeline generation appears in this context as a promising way to help users deal with this information overload. Formally, Timeline Summarization (TLS) can be defined as a subtask of Multi-Document Summarization (MDS) conceived to highlight the most important information during the development of a story over time by summarizing long-lasting events in chronological order. As opposed to traditional MDS, however, TLS has a limited number of publicly available datasets. This lack of datasets is even more noticeable for low-resource languages, including Portuguese, which, despite being the sixth most spoken language in the world [Ethnologue (2019, 22nd edition)], lacks a specific TLS dataset.

Following the worldwide coverage of the coronavirus pandemic, we propose the TLS-Covid19 dataset, a novel corpus for the Portuguese and English languages.

How

To create this dataset, we take advantage of liveblogs, webpages where news media outlets offer daily live coverage of an ongoing event. Each liveblog (usually with a different URL) consists of a set of news stories and a set of key moments. The key moments are manually selected by journalists from the whole set of news stories, thus giving rise to the ground-truth timeline.
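The relationship between news stories, key moments and the ground-truth timeline can be sketched as follows. This is a minimal illustration only: the `NewsItem` class, its field names and the toy entries are assumptions made for this example, not the dataset's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsItem:
    """One liveblog entry (hypothetical schema, not the dataset's real one)."""
    published: date
    text: str
    is_key_moment: bool  # True if journalists flagged the story as a key moment


def reference_timeline(items):
    """Group key-moment texts by publication date, in chronological order."""
    by_date = defaultdict(list)
    for item in items:
        if item.is_key_moment:
            by_date[item.published].append(item.text)
    return dict(sorted(by_date.items()))


# Toy liveblog with placeholder texts (not real dataset content).
liveblog = [
    NewsItem(date(2020, 3, 17), "key moment B", True),
    NewsItem(date(2020, 3, 16), "key moment A", True),
    NewsItem(date(2020, 3, 16), "regular story", False),
]
timeline = reference_timeline(liveblog)
```

In this sketch the full list of items plays the role of the input documents, while `timeline` keeps only the journalist-selected entries, one list of sentences per date.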

Data Sources

We consider two Portuguese news sources, Público and Observador, and two English news sources, CNN and The Guardian.

As a rule of thumb, we take the beginning of each liveblog's coverage as the start of the time period. For instance, the Público liveblog has been tracked since March 16, 2020; Observador since January 30, 2020; CNN since January 22, 2020; and The Guardian since January 24, 2020. Our aim is to continue expanding the dataset with further articles and possibly new topics until the end of the outbreak and/or the end of the liveblogs' coverage. We anticipate that as the pandemic evolves, the amount of data collected will grow significantly.

The source code to reproduce the dataset is available in a Google Colab notebook.

Statistics

As of December 31, 2020

The following tables report detailed statistics about the dataset. As of December 31, 2020, we have collected 143 common topics for Público and Observador, and 35 common topics for CNN and The Guardian.

By news source:

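| Source | #Topics | Lang | #Docs | Input avg #sents | Input avg #dates | Input avg sents/dates | Ground-truth avg #sents | Ground-truth avg #dates | Ground-truth avg sents/dates | Compression sents (%) | Compression dates (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Público | 143 | PT | 28,327 | 1092.15 | 99.93 | 10.93 | 62.82 | 40.05 | 1.57 | 5.75 | 40.08 |
| Observador | 143 | PT | 40,181 | 1653.22 | 120.52 | 13.72 | 114.90 | 57.77 | 1.99 | 6.95 | 47.93 |
| CNN | 35 | EN | 26,043 | 6178.54 | 189.71 | 32.57 | 30.11 | 20.97 | 1.44 | 0.49 | 11.05 |
| Guardian | 35 | EN | 5,848 | 1118.86 | 80.69 | 13.87 | 25.26 | 21.97 | 1.15 | 2.26 | 27.23 |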
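The README does not define the Compression columns, but the figures are consistent with the ground-truth average divided by the input average, expressed as a percentage. The sketch below checks that reading against the values in the table above; the function name and dictionary keys are chosen for this example only.

```python
def compression_pct(ground_truth_avg, input_avg):
    """Ground-truth size as a percentage of input size (inferred definition)."""
    return round(100.0 * ground_truth_avg / input_avg, 2)


# Averages copied from the per-source table.
rows = {
    "Público":    {"in_sents": 1092.15, "gt_sents": 62.82,  "in_dates": 99.93,  "gt_dates": 40.05},
    "Observador": {"in_sents": 1653.22, "gt_sents": 114.90, "in_dates": 120.52, "gt_dates": 57.77},
    "CNN":        {"in_sents": 6178.54, "gt_sents": 30.11,  "in_dates": 189.71, "gt_dates": 20.97},
    "Guardian":   {"in_sents": 1118.86, "gt_sents": 25.26,  "in_dates": 80.69,  "gt_dates": 21.97},
}

compression = {
    source: (
        compression_pct(r["gt_sents"], r["in_sents"]),
        compression_pct(r["gt_dates"], r["in_dates"]),
    )
    for source, r in rows.items()
}

for source, (sents, dates) in compression.items():
    print(f"{source}: sents {sents}%, dates {dates}%")
```

Under this reading, a lower percentage means a more aggressive summary: the CNN timelines keep well under 1% of the input sentences, while the Portuguese sources keep between 5% and 7%.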

By news source language:

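| Lang | #Topics | #Docs | Input avg #sents | Input avg #dates | Input avg sents/dates | Ground-truth avg #sents | Ground-truth avg #dates | Ground-truth avg sents/dates | Compression sents (%) | Compression dates (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| PT | 143 | 68,508 | 1372.69 | 110.23 | 12.45 | 88.86 | 48.91 | 1.82 | 6.47 | 44.37 |
| EN | 35 | 31,891 | 3648.70 | 135.20 | 26.99 | 27.69 | 21.47 | 1.29 | 0.76 | 15.89 |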
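The per-language rows appear to be derived from the per-source rows: document counts are summed, and the averages match the mean of the two sources' averages to within rounding (both sources of a language cover the same number of topics). The sketch below illustrates that assumed relationship; it is not taken from the authors' notebook, and the dictionary keys are hypothetical.

```python
from statistics import mean

# Input-side figures copied from the per-source table.
per_source = {
    "PT": [
        {"docs": 28327, "avg_sents": 1092.15, "avg_dates": 99.93},   # Público
        {"docs": 40181, "avg_sents": 1653.22, "avg_dates": 120.52},  # Observador
    ],
    "EN": [
        {"docs": 26043, "avg_sents": 6178.54, "avg_dates": 189.71},  # CNN
        {"docs": 5848, "avg_sents": 1118.86, "avg_dates": 80.69},    # Guardian
    ],
}


def aggregate(sources):
    """Sum document counts and average the per-source averages."""
    avg_sents = mean(s["avg_sents"] for s in sources)
    avg_dates = mean(s["avg_dates"] for s in sources)
    return {
        "docs": sum(s["docs"] for s in sources),
        "avg_sents": avg_sents,
        "avg_dates": avg_dates,
        "sents_per_date": avg_sents / avg_dates,
    }


by_lang = {lang: aggregate(sources) for lang, sources in per_source.items()}
```

Small differences in the last decimal place are expected, since the published per-source averages are themselves rounded.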

Distribution of topics by type:

| Type | PT | EN |
|---|---|---|
| PER | 17 | 3 |
| ORG | 33 | 6 |
| LOC | 82 | 25 |
| KW | 11 | 1 |

Word clouds (EN/PT):

[Figure: Word clouds]

Use Cases

The TLS-Covid19 dataset allows one to follow the evolution of a topic over time and to compare what different news outlets are saying about it.

One can also look at keywords, part-of-speech tags, entities or events to see how things have changed over time.

As is common with most datasets of this kind, one can also look at collocates. A few examples: keywords that were common in the same time period; words that appear near "covid-19" in different time periods; and entities, events, nouns or verbs that were more common at the beginning of the pandemic than in December 2020.
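As a rough illustration of the collocate use case, the sketch below counts the words that occur within a small window around "covid-19" and groups the counts by month. The input format (a list of `(date, text)` pairs), the function names and the toy texts are assumptions for this example; the tokenizer is deliberately simple.

```python
import re
from collections import Counter, defaultdict
from datetime import date


def collocates(text, node="covid-19", window=3):
    """Count tokens within `window` positions of each occurrence of `node`."""
    tokens = re.findall(r"[\w-]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


def collocates_by_month(docs, node="covid-19", window=3):
    """Aggregate collocate counts per 'YYYY-MM' publication month."""
    per_month = defaultdict(Counter)
    for published, text in docs:
        per_month[published.strftime("%Y-%m")].update(collocates(text, node, window))
    return dict(per_month)


# Toy documents with placeholder texts (not real dataset content).
docs = [
    (date(2020, 3, 16), "New covid-19 cases rise"),
    (date(2020, 12, 1), "First covid-19 vaccine approved"),
]
monthly = collocates_by_month(docs, window=1)
```

Comparing `monthly["2020-03"]` with `monthly["2020-12"]` then gives a first impression of how the vocabulary around the term shifted between the two periods.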

Finally, one can also create a sub-set of the dataset based on the publication date, the source, the country, etc.
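A subset of this kind could be built along the following lines. The record layout (dictionaries with `source`, `date` and `text` keys) is hypothetical and would need to be adapted to the files actually distributed with the dataset.

```python
from datetime import date


def subset(docs, source=None, start=None, end=None):
    """Keep documents matching the given source and inclusive date range."""
    selected = []
    for doc in docs:
        if source is not None and doc["source"] != source:
            continue
        if start is not None and doc["date"] < start:
            continue
        if end is not None and doc["date"] > end:
            continue
        selected.append(doc)
    return selected


# Toy records with placeholder texts (not real dataset content).
docs = [
    {"source": "CNN", "date": date(2020, 1, 22), "text": "story 1"},
    {"source": "CNN", "date": date(2020, 6, 1), "text": "story 2"},
    {"source": "Público", "date": date(2020, 3, 16), "text": "story 3"},
]

cnn_first_quarter = subset(docs, source="CNN", end=date(2020, 3, 31))
march_onwards = subset(docs, start=date(2020, 3, 1))
```

Each filter is optional, so the same helper covers selection by source only, by date range only, or by both.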

Publication

Pasquali, A., Campos, R., Ribeiro, A., Santana, B., Jorge, A., and Jatowt, A. (2021). TLS-Covid19: A New Annotated Corpus for Timeline Summarization. In: Hiemstra D., Moens M-F., Mothe J., Perego R., Potthast M., Sebastiani F. (eds), Advances in Information Retrieval. ECIR'21 (Lucca, Italy, March 28 - April 1). Lecture Notes in Computer Science, vol. 12656, pp. 497-512. ECIR21 presentation

Contact

For further information related to the TLS-Covid19 dataset please contact Alexandre Ribeiro ([email protected]), Arian Pasquali ([email protected]), or Ricardo Campos ([email protected]).