- Collecting the 500 latest news articles from https://www.reuters.com/breakingviews
- The goal is to break text documents down into topics and to explore how different approaches model those topics. A “topic” here is a collection of words that tend to appear in similar documents
- Two popular libraries for LDA/LSA are scikit-learn and gensim. I chose gensim for this project.
- Retrieve the 500 latest breaking news articles from https://www.reuters.com/breakingviews
- Clean the data with BeautifulSoup and save it to a CSV file (data/breakingnews.csv) for analysis and model building
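The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names `html_to_text` and `save_news` and the CSV column names are hypothetical, and fetching the pages (for example with requests) is left out.

```python
import csv

from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip tags, scripts and styles from an HTML fragment and collapse whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements entirely
    return " ".join(soup.get_text(separator=" ").split())


def save_news(rows, path):
    """Write (title, html_body) pairs to a CSV file with the body cleaned to plain text."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "content"])  # hypothetical column names
        for title, html in rows:
            writer.writerow([title, html_to_text(html)])
```

Calling `save_news(rows, "data/breakingnews.csv")` with the scraped rows would then produce the file the notebooks read.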
- get_historical_news.py: pulls historical news from https://www.reuters.com/breakingviews
- accessory_function.py: a collection of helper functions imported by the notebooks
- clean raw data
- sort returned values
- write pickle file
- read pickle file
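As an illustration of what such helpers might look like, here is a small sketch using only the standard library. The names `write_pickle`, `read_pickle` and `sort_by_value` are hypothetical and may not match the functions in accessory_function.py.

```python
import pickle
from pathlib import Path


def write_pickle(obj, path):
    """Serialize obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def read_pickle(path):
    """Load and return the object stored at path."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def sort_by_value(mapping, reverse=True):
    """Return (key, value) pairs sorted by value, largest first by default."""
    return sorted(mapping.items(), key=lambda kv: kv[1], reverse=reverse)
```

Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.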
- model_preparation.ipynb:
- Read breakingnews.csv and remove special characters
- Visualize the most popular words by WordCloud
- Create a dictionary from the processed data and save it as dictionary.pkl (/data/dictionary.pkl)
- Create a corpus from the processed data and save it in /data/processed_data.pkl
- Create a bag of words (BOW) and save it in /data/bow.pkl
- Create a TF-IDF representation and save it in /data/tfidf.pkl
- Topic Modeling-LDA.ipynb:
- Build an LDA model with bag of words from processed_data.pkl, bow.pkl and dictionary.pkl
- Build an LDA model with TF-IDF from processed_data.pkl, tfidf.pkl and dictionary.pkl
- Print top 5 topics of each model and interpret the results
- Visualize the topics and their words with pyLDAvis
- Calculate the perplexity and topic coherence of the two models
- Topic Modeling-LSA.ipynb:
- Build an LSA model with bag of words from processed_data.pkl, bow.pkl and dictionary.pkl
- Build an LSA model with TF-IDF from processed_data.pkl, tfidf.pkl and dictionary.pkl
- Print top 5 topics of each model and interpret the results
- model_preparation.ipynb: https://colab.research.google.com/drive/1VLf69UIoJ79TuMq2fh4BOnd8qeTzYUBI?authuser=1#scrollTo=aiERuDAhef71
- Topic Modeling-LDA.ipynb: https://colab.research.google.com/drive/1RhSUArIbix4oF3ZbfSC94lHlTjhtBLUr?authuser=1#scrollTo=f28PG4o7x4SB
- Topic Modeling-LSA.ipynb: https://colab.research.google.com/drive/1ZLiw8up2og9UVa2D6A8Wqa_P3YbJgWX7?authuser=1#scrollTo=rfjSPpiY299P
- Cleaning the dataset & Lemmatization
- Create a dictionary from the processed data
- Create a corpus and an LDA/LSA model with bag of words
- Create a corpus and an LDA/LSA model with TF-IDF
- Calculate the perplexity and topic coherence of the two models
- Visualize topics with the help of pyLDAvis
- Python >= 3.7
- Jupyter Notebook
- pandas
- matplotlib
- seaborn
- pyLDAvis
- scikit-learn
- numpy
- gensim
- Scipy
- nltk
- string
- beautifulsoup
- WordCloud
- requests
- Check out the project: git clone https://github.com/diem-ai/topic-modeling.git
- Install the latest versions of the libraries listed under requirements and dependencies
- Run get_historical_news.py to collect 500 latest news : python get_historical_news.py
- Comment out Colab Setup and change the data path in the notebooks
- Run model_preparation.ipynb to produce the data
- Run Topic Modeling-LDA.ipynb for LDA topic modeling
- Run Topic Modeling-LSA.ipynb for LSA topic modeling