This dataset consists of 3576 documents in Sinhala, drawn from Sri Lankan news websites and fact-checking operations, each annotated as CREDIBLE, FALSE, PARTIAL or UNCERTAIN. The dataset has fields for the content of the document, the classification, the web domain from which each document was retrieved, and the date on which the document was published.
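As a quick sanity check after downloading, the label distribution can be tallied with a few lines of Python. This is a minimal sketch, not part of the original release: it assumes the corpus is available as a CSV file, and the column names (`content`, `label`, `domain`, `date`) are hypothetical stand-ins for the four fields described above. The sample rows are placeholders for illustration, not data from the corpus.

```python
import csv
import io
from collections import Counter

# The four annotation labels named in the dataset description.
LABELS = {"CREDIBLE", "FALSE", "PARTIAL", "UNCERTAIN"}


def count_labels(csv_file, label_column="label"):
    """Count documents per label; raise if a label is not one of LABELS.

    `label_column` is an assumed column name; adjust it to match the real file.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = row[label_column].strip().upper()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


# Placeholder rows for illustration only; not taken from the corpus.
sample = io.StringIO(
    "content,label,domain,date\n"
    "placeholder text 1,CREDIBLE,example.lk,2020-01-01\n"
    "placeholder text 2,FALSE,example.lk,2020-01-02\n"
    "placeholder text 3,CREDIBLE,example.org,2020-01-03\n"
)
counts = count_labels(sample)
print(dict(counts))
```

To run this against a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to `count_labels`. UTF-8 is an assumption here; check the encoding of the restored files before relying on it.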
Paper (covering methodology and results of machine learning classification): https://lirneasia.net/2021/07/a-corpus-and-machine-learning-models-for-fake-news-classification-in-sinhala/
Update as of Nov 2022: please note that some parts of the original corpus were corrupted, for reasons unknown to us. This repo restores the files.
This dataset is released under a CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. For more information, see https://creativecommons.org/licenses/by/4.0/
@misc{jayawickrama2021sinhala,
title={A corpus and machine learning models for fake news classification in {S}inhala},
author={Vihanga Jayawickrama and Asanka Ranasinghe and Dimuthu C. Attanayake and Yudhanjaya Wijeratne},
year={2021},
primaryClass={cs.CL}
}