This folder contains the different datasets that we collected.
The goal is to acquire as many URLs as possible, each with a label that tells whether the linked content is true or fake.
For this purpose we use a binary label (true or fake).
We select from the datasets only items that are (almost) completely true or fake, discarding the gradations in the middle.
Goal: have a list of URLs labelled as fake / true
source url: https://www.datacommons.org/factcheck/download
This is a collection of ClaimReviews. The problem is that they contain fewer attributes than the ClaimReviews published on the fact-checking websites themselves. For this reason the fact-checker websites are scraped to obtain the full ClaimReview.
source url: https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json
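A minimal sketch of this two-step collection, assuming the feed is a schema.org DataFeed (items under `dataFeedElement[].item[]`) and that fact-checkers embed the full ClaimReview as JSON-LD in their pages:

```python
# Sketch: download the Datacommons feed, then re-scrape each fact-checker
# page for the full ClaimReview. Feed field names are assumptions.
import json
import requests
from bs4 import BeautifulSoup

FEED_URL = 'https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json'

def full_claim_reviews():
    feed = requests.get(FEED_URL).json()
    for element in feed.get('dataFeedElement', []):
        for partial in element.get('item', []):
            url = partial.get('url')  # URL of the fact-checking article
            if not url:
                continue
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.text, 'html.parser')
            # the full ClaimReview is usually embedded as JSON-LD
            for script in soup.find_all('script', type='application/ld+json'):
                try:
                    data = json.loads(script.string or '')
                except json.JSONDecodeError:
                    continue
                for obj in data if isinstance(data, list) else [data]:
                    if isinstance(obj, dict) and obj.get('@type') == 'ClaimReview':
                        yield obj
```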
labels:
- majority: ratingValue between worstRating and bestRating
- factcheckni: alternateName text
  - false: "False.", "Misleading", "This claim is false", "Mostly false."
  - true: "True", "Accurate", "The claim is accurate", "The claim is true"
  - other: "Unproven", "Inaccurate.", "Correct with consideration.", "Partly accurate", "Broadly accurate", "Uncertain"
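As a sketch, the two rating styles above could be reduced to the binary label like this (the string sets mirror the lists above; field names follow the schema.org Rating vocabulary, and keeping only the two extremes of the numeric scale is an assumption consistent with the binary policy):

```python
FALSE_NAMES = {'false.', 'misleading', 'this claim is false', 'mostly false.'}
TRUE_NAMES = {'true', 'accurate', 'the claim is accurate', 'the claim is true'}

def binary_label(review):
    """Map one ClaimReview to 'true', 'fake' or None (None = discard)."""
    rating = review.get('reviewRating', {})
    # majority case: a numeric ratingValue on a worstRating..bestRating scale;
    # only the two extremes are kept, everything in between is discarded
    try:
        value = float(rating['ratingValue'])
        if value == float(rating.get('bestRating', 5)):
            return 'true'
        if value == float(rating.get('worstRating', 1)):
            return 'fake'
        return None
    except (KeyError, TypeError, ValueError):
        pass
    # factcheckni case: only the textual alternateName is available
    name = (rating.get('alternateName') or '').strip().lower()
    if name in TRUE_NAMES:
        return 'true'
    if name in FALSE_NAMES:
        return 'fake'
    return None  # 'other' ratings ('Unproven', 'Uncertain', ...) are dropped
```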
problem: the URL points to the fact-checker, not to the source of the claim
conclusion: not used
source url: https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
labels are ok
source urls: not present in the dataset, but there are links to PolitiFact
success!
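Since LIAR's six-way labels have to be collapsed to the binary scheme, here is a minimal sketch; the tsv column layout (label in column 1, statement in column 2) and the choice of which labels to keep are assumptions based on the published dataset:

```python
# Sketch: collapse LIAR's six-way labels to the binary scheme, keeping only
# the clear-cut items. Column positions are an assumption based on the
# published tsv layout (0=id, 1=label, 2=statement).
import csv

KEEP = {'true': 'true', 'false': 'fake', 'pants-fire': 'fake'}
# 'mostly-true', 'half-true' and 'barely-true' sit in the middle: dropped

def liar_claims(path='train.tsv'):
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            if row[1] in KEEP:
                yield row[2], KEEP[row[1]]
```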
No URLs, just claims as text
the URLs are to Facebook; to recover the shared link (see the sketch after this list):
- filter rows with type='link' in the tsv
- go to the Facebook URL and parse the HTML
- find the `a` element with tabindex="-1" and target="_blank"
- take its href and select the query parameter 'u', then unescape it
- this is the link
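A sketch of these steps, assuming the shared link is rendered as an `l.facebook.com/l.php?u=...` redirect (the selector comes straight from the list above):

```python
# Sketch of the redirect-link extraction described above; the selector and
# the 'u' query parameter follow the steps in this list.
from urllib.parse import parse_qs, urlparse

import requests
from bs4 import BeautifulSoup

def resolve_facebook_link(post_url):
    """Return the external URL shared in a Facebook post, or None."""
    html = requests.get(post_url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    anchor = soup.find('a', attrs={'tabindex': '-1', 'target': '_blank'})
    if anchor is None or not anchor.get('href'):
        return None
    # the href looks like https://l.facebook.com/l.php?u=<escaped link>&...
    query = parse_qs(urlparse(anchor['href']).query)
    # parse_qs already percent-decodes ('unescapes') the value of 'u'
    return query.get('u', [None])[0]
```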
success!
source: https://github.com/several27/FakeNewsCorpus --> http://researchably-fake-news-recognition.s3.amazonaws.com/public_corpus/news_cleaned_2018_02_13.csv.zip
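FakeNewsCorpus tags each article with a `type` column (fake, satire, bias, conspiracy, ..., reliable). A sketch of streaming the large CSV and reducing it to binary labels; the column names follow the corpus README, and keeping only 'fake' and 'reliable' is an assumption consistent with the binary policy above:

```python
# Sketch: stream the multi-GB CSV in chunks and keep only the rows whose
# 'type' is clearly fake or reliable; everything in between is dropped.
import pandas as pd

LABEL_MAP = {'fake': 'fake', 'reliable': 'true'}

def fake_news_corpus_urls(path='news_cleaned_2018_02_13.csv'):
    for chunk in pd.read_csv(path, usecols=['type', 'url'], chunksize=100_000):
        kept = chunk[chunk['type'].isin(LABEL_MAP)]
        for row in kept.itertuples(index=False):
            yield row.url, LABEL_MAP[row.type]
```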