This repository contains a Python script main.py
that performs topic modelling on textual data using the BERTopic
library. The script takes as input a dataset of text entries and outputs, a copy of the CSV
file with a topic assigned to each data entry, as well as a visualization of a distribution of topics across the data.
The sample dataset used for this project is the A Million News Headlines dataset from Kaggle. The dataset contains over a million news headlines published over a period of 15 years. The data is stored in a single CSV file, with each row containing the headline text and the publish date. The dataset can be downloaded from the following link: https://www.kaggle.com/therohk/million-headlines. The code, however, can be used with any dataset of text entries in a CSV file.
The topic modeling in this project is performed using BERTopic
, an extension of the BERT
language model. BERTopic leverages pre-trained language embeddings to identify topics in textual data. For topic modeling, the all-MiniLM-L6-v2
embedding model is utilized. The model is a sentence-transformers model: it maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search. Visit https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 for more information about the model.
To use the code you need to adopt the following steps.
NOTE: Please note that the instructions provided here have been tested on a Mac machine running macOS Ventura 13.1, using Visual Studio Code version 1.76.0 (Universal) and a Unix-based bash terminal. While they should also be compatible with other Unix-based systems like Linux, slight variations may exist depending on the terminal and operating system you are using. To ensure a smooth installation process and avoid potential package conflicts, it is recommended to use the provided setup.sh
bash file, which includes the necessary steps to create a virtual environment for the project. However, if you encounter any issues or have questions regarding compatibility on other platforms, please don't hesitate to reach out for assistance.
- Clone repository
- Run
setup.sh
in the terminal - Activate the virtual environment
- Run
main.py
in the terminal - Deactivate the virtual environment
Clone repository using the following lines in the your terminal:
git clone https://github.com/sashapustota/topic-modelling-with-BERT
cd topic-modelling-with-BERT
The setup.sh
script is used to automate the installation of project dependencies and configuration of the environment. By running this script, you ensure consistent setup across different environments and simplify the process of getting the project up and running.
The script performs the following steps:
- Creates a virtual environment for the project
- Activates the virtual environment
- Installs the required packages
- Deactivates the virtual environment
To run the script, run the following line in the terminal:
bash setup.sh
To activate the newly created virtual environment, run the following line in the terminal:
source ./topic_modelling_with_BERT_venv/bin/activate
The main.py
script performs the following steps:
- Loads the data from the CSV file
- Preprocesses the data
- Performs topic modeling
- Prints the generated topics in the rerminal, as well as saves them in .txt file
- Saves a new CSV file with topics assigned to each data entry
- Plots the results and saves the plot as a .png file
To use the script with the sample data, run the following line in the terminal:
python3 src/main.py
If you are using your own data, the script allows for the following optional arguments:
main.py [-h] [--data DATA] [--column COLUMN] [--n_topics LABEL]
options:
--data DATA Name of the CSV file to use. (default: abcnews-date-text.csv)
--column COLUMN Name of the column with text data in the CSV file. (default: headline_text)
--n_topics N_TOPICS Number of topics to extract. (default: 10)
For example, if you have a CSV file named my_data.csv
with text data in the column text
, and want to extract 12 topics, you can run the following line in the terminal:
python3 src/main.py --data my_data.csv --column text --n_topics 12
When you are done using the script, you can deactivate the virtual environment by running the following line in the terminal:
deactivate
This repository has the following structure:
│ .gitignore
│ README.md
│ requirements.txt
│ setup.sh
│
├───data
│ .gitkeep
│
├───out
│ sample_topic_distribution.png
│
└───src
main.py
The following results were obtained using the first 10,000 rows of the headline dataset and using the default arguments for the main.py
script.
Topics generated:
Topic Count Name CustomName
0 -1 7973 -1_us_govt_iraq_police us - govt - iraq
1 0 307 0_iraq_us_turkey_iraqi iraq - us - turkey
2 1 251 1_man_charged_stabbing_court man - charged - stabbing
3 2 218 2_crash_car_helicopter_dies crash - car - helicopter
4 3 204 3_anti_war_protesters_protest anti - war - protesters
5 4 183 4_baghdad_basra_us_forces baghdad - basra - us
6 5 177 5_water_rain_restrictions_river water - rain - restrictions
7 6 176 6_fire_bushfire_inquiry_firefighters fire - bushfire - inquiry
8 7 174 7_arsenal_ajax_real_leeds arsenal - ajax - real
9 8 172 8_howard_australia_war_troops howard - australia - war
10 9 165 9_hospital_health_doctors_medical hospital - health - doctors
Topic distribution: