This folder contains the different datasets that we collected.
The goal is to acquire as many URLs as possible, each with a label that tells whether the linked content is true or fake.
For this purpose we use a binary label (true or fake).
We select from the datasets only items that are (almost) completely true or fake, discarding the gradations in the middle.
Goal: have a list of URLs labelled as fake / true
source url: https://www.datacommons.org/factcheck/download
This is a collection of ClaimReviews. The problem is that they contain fewer attributes than the ClaimReviews published on the fact-checking websites themselves. For this reason the fact-checker websites are scraped to obtain the full ClaimReview.
source url: https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json
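A minimal sketch of this two-step collection, assuming the feed is a schema.org DataFeed (items under `dataFeedElement[].item[]`) and that fact-checkers embed the full ClaimReview as JSON-LD in their pages:

```python
# Sketch: download the Datacommons feed, then re-scrape each fact-checker
# page for the full ClaimReview. Feed field names are assumptions.
import json
import requests
from bs4 import BeautifulSoup

FEED_URL = 'https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json'

def full_claim_reviews():
    feed = requests.get(FEED_URL).json()
    for element in feed.get('dataFeedElement', []):
        for partial in element.get('item', []):
            url = partial.get('url')  # URL of the fact-checking article
            if not url:
                continue
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.text, 'html.parser')
            # the full ClaimReview is usually embedded as JSON-LD
            for script in soup.find_all('script', type='application/ld+json'):
                try:
                    data = json.loads(script.string or '')
                except json.JSONDecodeError:
                    continue
                for obj in data if isinstance(data, list) else [data]:
                    if isinstance(obj, dict) and obj.get('@type') == 'ClaimReview':
                        yield obj
```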
labels:
- majority: ratingValue between worstRating and bestRating
- factcheckni: alternateName text
  - false: "False.", "Misleading", "This claim is false", "Mostly false."
  - true: "True", "Accurate", "The claim is accurate", "The claim is true"
  - other: "Unproven", "Inaccurate.", "Correct with consideration.", "Partly accurate", "Broadly accurate", "Uncertain"
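As a sketch, the two rating styles above could be reduced to the binary label like this (the string sets mirror the lists above; field names follow the schema.org Rating vocabulary, and keeping only the two extremes of the numeric scale is an assumption consistent with the binary policy):

```python
FALSE_NAMES = {'false.', 'misleading', 'this claim is false', 'mostly false.'}
TRUE_NAMES = {'true', 'accurate', 'the claim is accurate', 'the claim is true'}

def binary_label(review):
    """Map one ClaimReview to 'true', 'fake' or None (None = discard)."""
    rating = review.get('reviewRating', {})
    # majority case: a numeric ratingValue on a worstRating..bestRating scale;
    # only the two extremes are kept, everything in between is discarded
    try:
        value = float(rating['ratingValue'])
        if value == float(rating.get('bestRating', 5)):
            return 'true'
        if value == float(rating.get('worstRating', 1)):
            return 'fake'
        return None
    except (KeyError, TypeError, ValueError):
        pass
    # factcheckni case: only the textual alternateName is available
    name = (rating.get('alternateName') or '').strip().lower()
    if name in TRUE_NAMES:
        return 'true'
    if name in FALSE_NAMES:
        return 'fake'
    return None  # 'other' ratings ('Unproven', 'Uncertain', ...) are dropped
```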
problem: the URL points to the fact-checker, not to the source of the claim
conclusion: not used
source url: https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
labels are ok
source urls: not present in the dataset, but there are links to PolitiFact
success!
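Since LIAR's six-way labels have to be collapsed to the binary scheme, here is a minimal sketch; the tsv column layout (label in column 1, statement in column 2) and the choice of which labels to keep are assumptions based on the published dataset:

```python
# Sketch: collapse LIAR's six-way labels to the binary scheme, keeping only
# the clear-cut items. Column positions are an assumption based on the
# published tsv layout (0=id, 1=label, 2=statement).
import csv

KEEP = {'true': 'true', 'false': 'fake', 'pants-fire': 'fake'}
# 'mostly-true', 'half-true' and 'barely-true' sit in the middle: dropped

def liar_claims(path='train.tsv'):
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            if row[1] in KEEP:
                yield row[2], KEEP[row[1]]
```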
No URLs, just claims as text
the URLs are to Facebook; to recover the shared link (see the sketch after this list):
- filter rows with type='link' in the tsv
- go to the Facebook URL and parse the HTML
- find the `a` element with tabindex="-1" and target="_blank"
- take its href and select the query parameter 'u', then unescape it
- this is the link
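A sketch of these steps, assuming the shared link is rendered as an `l.facebook.com/l.php?u=...` redirect (the selector comes straight from the list above):

```python
# Sketch of the redirect-link extraction described above; the selector and
# the 'u' query parameter follow the steps in this list.
from urllib.parse import parse_qs, urlparse

import requests
from bs4 import BeautifulSoup

def resolve_facebook_link(post_url):
    """Return the external URL shared in a Facebook post, or None."""
    html = requests.get(post_url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    anchor = soup.find('a', attrs={'tabindex': '-1', 'target': '_blank'})
    if anchor is None or not anchor.get('href'):
        return None
    # the href looks like https://l.facebook.com/l.php?u=<escaped link>&...
    query = parse_qs(urlparse(anchor['href']).query)
    # parse_qs already percent-decodes ('unescapes') the value of 'u'
    return query.get('u', [None])[0]
```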
success!
source: https://github.com/several27/FakeNewsCorpus --> http://researchably-fake-news-recognition.s3.amazonaws.com/public_corpus/news_cleaned_2018_02_13.csv.zip
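FakeNewsCorpus tags each article with a `type` column (fake, satire, bias, conspiracy, ..., reliable). A sketch of streaming the large CSV and reducing it to binary labels; the column names follow the corpus README, and keeping only 'fake' and 'reliable' is an assumption consistent with the binary policy above:

```python
# Sketch: stream the multi-GB CSV in chunks and keep only the rows whose
# 'type' is clearly fake or reliable; everything in between is dropped.
import pandas as pd

LABEL_MAP = {'fake': 'fake', 'reliable': 'true'}

def fake_news_corpus_urls(path='news_cleaned_2018_02_13.csv'):
    for chunk in pd.read_csv(path, usecols=['type', 'url'], chunksize=100_000):
        kept = chunk[chunk['type'].isin(LABEL_MAP)]
        for row in kept.itertuples(index=False):
            yield row.url, LABEL_MAP[row.type]
```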