The increased use of the Internet and social media to share news has allowed information to travel at record speed. However, it has also led to the rise of fake news stories, a recent phenomenon that relies on an article's ability to go "viral" without being vetted by an editorial team, as in traditional news sources. This project seeks to identify fake or highly biased news articles to help prevent the spread of false information. More specifically, we implemented a program that examines the existence of authors, word and punctuation usage in titles, and article bodies, and used machine learning algorithms to identify news from unreliable sources.
Our final dataset (balanced_data.csv in the data directory) contains 1473 articles from reliable sources and 1473 from unreliable sources. Articles from reliable sources have an authenticity attribute of 0, while articles from unreliable sources have an authenticity of 1.
Our data collection scripts:
- Step1_CollectData_Hyesoo.ipynb (Hyesoo's script)
- Step1_CollectData_Jinmei_FakeNews.ipynb and Step1_CollectData_Jinmei_RealNews.ipynb (Jinmei's scripts)
used the Python library Newspaper, which collects article metadata and text from a wide variety of news sources.
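The collection step can be sketched as follows, assuming the newspaper3k package (`pip install newspaper3k`); the function name and returned field names are illustrative, not the project's actual code.

```python
def scrape_source(source_url, label):
    """Collect articles from one news site and tag each with an
    authenticity label (0 = reliable, 1 = unreliable)."""
    import newspaper  # imported lazily so the sketch stays self-contained

    site = newspaper.build(source_url, memoize_articles=False)
    rows = []
    for article in site.articles:
        try:
            article.download()
            article.parse()
        except Exception:
            continue  # skip articles that fail to download or parse
        rows.append({
            "title": article.title,
            "authors": article.authors,
            "text": article.text,
            "authenticity": label,
        })
    return rows
```

In practice one such call per source URL, with label 0 for the reliable list and 1 for the unreliable list, yields the raw rows that the later cleaning and merging steps operate on.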
We have collected news from 12 reliable news sources and 42 unreliable news websites.
- The reliable sources include msnbc, nbcnews, politico, foxnews, nytimes, reuters, abc, bbc, cnn, newyorker, cbsnews, and npr.
- The unreliable news sources we used are 24wpn, beforeitsnews, readconservatives, newsbbc, now8news, americanfreepress, nephef, nationonenews, infostormer, Conservativedailypost, donaldtrumppotus45, ladylibertysnews, interestingdailynews, president45donaldtrump, openmagazines, krbcnews, bizstandardnews, bipartisanreport, local31news, nbcnews, CivicTribune, politicono, redcountry, AmericanFlavor, ddsnewstrend, Clashdaily, realnewsrightnow, wordpress, reagancoalition, lastdeplorables, Americannews, aurora-news, thedcgazette, politicalo, newswithviews, pamelageller, Bighairynews, ABCnews, sputniknews, prntly, Americanoverlook, and majorthoughts.
Data collected with Step1_CollectData_Hyesoo.ipynb are cleaned with the script Step2_CleanData_Hyesoo.ipynb, which removes short articles and errors.
Data collected by Hyesoo and Jinmei are merged with the script Step3_MergeData.ipynb, which also includes the cleaning procedure for data collected by Jinmei.
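The merge-and-balance step can be sketched with pandas as below; the column names and the downsampling strategy are assumptions for illustration, not taken from the project's notebook.

```python
import pandas as pd

def merge_and_balance(real_df, fake_df, seed=0):
    """Concatenate the two labelled frames, downsampling the larger
    class so the merged dataset has equal class counts."""
    n = min(len(real_df), len(fake_df))
    return pd.concat(
        [real_df.sample(n=n, random_state=seed),
         fake_df.sample(n=n, random_state=seed)],
        ignore_index=True,
    )
```

Balancing the classes this way keeps accuracy meaningful as an evaluation metric, since a trivial majority-class predictor scores only 50% on a balanced dataset.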
Features include
- existence of authors
- exaggerated punctuation (e.g., "!" and "?") used in titles
- rate of uppercase letters used in titles
- TF-IDF values generated from the article body
The first three features were generated with the script Step4_GenerateExtraFeatures.ipynb and then rescaled together with the TF-IDF features.
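The three hand-crafted features can be computed along these lines; the function name and the choice of "!" and "?" as exaggerated punctuation are illustrative assumptions.

```python
def title_features(title, authors):
    """Return [has_author, exaggerated-punctuation count, uppercase rate]
    for one article; all three names are illustrative."""
    has_author = int(bool(authors))          # 1 if any author is listed
    exclam = title.count("!") + title.count("?")
    letters = [c for c in title if c.isalpha()]
    upper_rate = (sum(c.isupper() for c in letters) / len(letters)
                  if letters else 0.0)       # fraction of uppercase letters
    return [has_author, exclam, upper_rate]
```

For example, a title like "BREAKING NEWS!!!" with no listed author yields a high uppercase rate and several exaggerated punctuation marks, both of which are more common in unreliable sources.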
Models used include
- Naive Bayes
- Logistic Regression
- Neural Network (MLPClassifier)
- Random Forest
- Support Vector Machine
See script Step5_ExtractFeatures_Predict_w_MachineLearning.ipynb for details.
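The classification step can be sketched with scikit-learn, using Logistic Regression as one representative of the models listed above; the toy texts below are illustrative, not drawn from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = unreliable, 0 = reliable (illustrative only).
texts = [
    "SHOCKING truth the media will not tell you!!!",
    "You will NOT believe what happened next!!!",
    "Senate committee approves budget resolution",
    "Central bank holds interest rates steady",
]
labels = [1, 1, 0, 0]

# TF-IDF features feed directly into the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

prediction = model.predict(["Miracle cure DISCOVERED!!!"])
```

Swapping `LogisticRegression()` for `MultinomialNB()`, `MLPClassifier()`, `RandomForestClassifier()`, or `SVC()` reproduces the rest of the model list with the same pipeline structure.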
We were able to obtain >92% prediction accuracy within our dataset.
We have developed a simple user interface that predicts the authenticity of an article of interest. Running the file 'run_final.py' in the flask_api folder generates a local HTTP address, where the user can submit the URL of an article and the model predicts the article's authenticity.
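The shape of such an endpoint can be sketched with Flask as below; the route name and the `predict_url` placeholder are assumptions for illustration, not the project's actual run_final.py.

```python
from flask import Flask, request

app = Flask(__name__)

def predict_url(url):
    # Placeholder: in the real app this would download the article,
    # extract the features described above, and run the trained model.
    return 0

@app.route("/predict", methods=["POST"])
def predict():
    """Accept an article URL from a form and return the predicted label."""
    url = request.form.get("url", "")
    return {"url": url, "authenticity": predict_url(url)}
```

Starting the app with `app.run()` serves it on a local address such as http://127.0.0.1:5000, matching the workflow described above.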
Lists of fake news websites: