The increased use of the Internet and social media to share news has allowed information to travel at record speed. However, it has also led to the rise of fake news stories, a recent phenomenon that relies on an article's ability to go "viral" without being vetted by an editorial team, as in traditional news sources. This project seeks to identify fake or highly biased news articles to help prevent the spread of false information. More specifically, we implemented a program that examines the existence of authors, word and punctuation usage in titles, and article bodies, and used machine learning algorithms to identify news from unreliable sources.
Our final dataset (balanced_data.csv in the data directory) contains 1473 articles from reliable sources and 1473 from unreliable sources. Articles from reliable sources have an authenticity attribute of 0, while articles from unreliable sources have an authenticity of 1.
Our data collection scripts:
- Step1_CollectData_Hyesoo.ipynb (Hyesoo's script)
- Step1_CollectData_Jinmei_FakeNews.ipynb and Step1_CollectData_Jinmei_RealNews.ipynb (Jinmei's scripts)
used the Python library Newspaper, which collects article metadata and text from a wide variety of news sources.
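The collection step can be sketched as follows, assuming the newspaper3k package (`pip install newspaper3k`); the function name and returned field names are illustrative, not the project's actual code.

```python
def scrape_source(source_url, label):
    """Collect articles from one news site and tag each with an
    authenticity label (0 = reliable, 1 = unreliable)."""
    import newspaper  # imported lazily so the sketch stays self-contained

    site = newspaper.build(source_url, memoize_articles=False)
    rows = []
    for article in site.articles:
        try:
            article.download()
            article.parse()
        except Exception:
            continue  # skip articles that fail to download or parse
        rows.append({
            "title": article.title,
            "authors": article.authors,
            "text": article.text,
            "authenticity": label,
        })
    return rows
```

In practice one such call per source URL, with label 0 for the reliable list and 1 for the unreliable list, yields the raw rows that the later cleaning and merging steps operate on.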
We have collected news from 12 reliable news sources and 42 unreliable news websites.
- The reliable sources include msnbc, nbcnews, politico, foxnews, nytimes, reuters, abc, bbc, cnn, newyorker, cbsnews, and npr.
- The unreliable news sources we used are 24wpn, beforeitsnews, readconservatives, newsbbc, now8news, americanfreepress, nephef, nationonenews, infostormer, Conservativedailypost, donaldtrumppotus45, ladylibertysnews, interestingdailynews, president45donaldtrump, openmagazines, krbcnews, bizstandardnews, bipartisanreport, local31news, nbcnews, CivicTribune, politicono, redcountry, AmericanFlavor, ddsnewstrend, Clashdaily, realnewsrightnow, wordpress, reagancoalition, lastdeplorables, Americannews, aurora-news, thedcgazette, politicalo, newswithviews, pamelageller, Bighairynews, ABCnews, sputniknews, prntly, Americanoverlook, and majorthoughts.
Data collected with Step1_CollectData_Hyesoo.ipynb are cleaned with the script Step2_CleanData_Hyesoo.ipynb, which removes short articles and errors.
Data collected by Hyesoo and Jinmei are merged with the script Step3_MergeData.ipynb, which also includes the cleaning procedure for data collected by Jinmei.
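The merge-and-balance step can be sketched with pandas as below; the column names and the downsampling strategy are assumptions for illustration, not taken from the project's notebook.

```python
import pandas as pd

def merge_and_balance(real_df, fake_df, seed=0):
    """Concatenate the two labelled frames, downsampling the larger
    class so the merged dataset has equal class counts."""
    n = min(len(real_df), len(fake_df))
    return pd.concat(
        [real_df.sample(n=n, random_state=seed),
         fake_df.sample(n=n, random_state=seed)],
        ignore_index=True,
    )
```

Balancing the classes this way keeps accuracy meaningful as an evaluation metric, since a trivial majority-class predictor scores only 50% on a balanced dataset.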
Features include
- existence of authors
- exaggerated punctuation (e.g., "!" and "?") used in titles
- rate of uppercase letters used in titles
- TF-IDF values generated from the article body
The first three features were generated with the script Step4_GenerateExtraFeatures.ipynb and then rescaled together with the TF-IDF features.
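The three hand-crafted features can be computed along these lines; the function name and the choice of "!" and "?" as exaggerated punctuation are illustrative assumptions.

```python
def title_features(title, authors):
    """Return [has_author, exaggerated-punctuation count, uppercase rate]
    for one article; all three names are illustrative."""
    has_author = int(bool(authors))          # 1 if any author is listed
    exclam = title.count("!") + title.count("?")
    letters = [c for c in title if c.isalpha()]
    upper_rate = (sum(c.isupper() for c in letters) / len(letters)
                  if letters else 0.0)       # fraction of uppercase letters
    return [has_author, exclam, upper_rate]
```

For example, a title like "BREAKING NEWS!!!" with no listed author yields a high uppercase rate and several exaggerated punctuation marks, both of which are more common in unreliable sources.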
Models used include
- Naive Bayes
- Logistic Regression
- Neural Network (MLPClassifier)
- Random Forest
- Support Vector Machine
See script Step5_ExtractFeatures_Predict_w_MachineLearning.ipynb for details.
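The classification step can be sketched with scikit-learn, using Logistic Regression as one representative of the models listed above; the toy texts below are illustrative, not drawn from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = unreliable, 0 = reliable (illustrative only).
texts = [
    "SHOCKING truth the media will not tell you!!!",
    "You will NOT believe what happened next!!!",
    "Senate committee approves budget resolution",
    "Central bank holds interest rates steady",
]
labels = [1, 1, 0, 0]

# TF-IDF features feed directly into the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

prediction = model.predict(["Miracle cure DISCOVERED!!!"])
```

Swapping `LogisticRegression()` for `MultinomialNB()`, `MLPClassifier()`, `RandomForestClassifier()`, or `SVC()` reproduces the rest of the model list with the same pipeline structure.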
We were able to obtain >92% prediction accuracy within our dataset.
We have developed a simple user interface that predicts the authenticity of an article of interest. Running the file 'run_final.py' in the flask_api folder generates a local HTTP address, where the user can submit the URL of an article and the model predicts the article's authenticity.
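The shape of such an endpoint can be sketched with Flask as below; the route name and the `predict_url` placeholder are assumptions for illustration, not the project's actual run_final.py.

```python
from flask import Flask, request

app = Flask(__name__)

def predict_url(url):
    # Placeholder: in the real app this would download the article,
    # extract the features described above, and run the trained model.
    return 0

@app.route("/predict", methods=["POST"])
def predict():
    """Accept an article URL from a form and return the predicted label."""
    url = request.form.get("url", "")
    return {"url": url, "authenticity": predict_url(url)}
```

Starting the app with `app.run()` serves it on a local address such as http://127.0.0.1:5000, matching the workflow described above.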
Lists of fake news websites: