Please report any bugs you find.
git clone https://github.com/Shadow977/preprocessing.git
pip install -r requirements.txt
After this, move the preprocessing.py file into the dataset folder, Tweet-dataset.
The following commands clean the CSV file for a given user:
from preprocessing import Cleaner
cleaner = Cleaner()
cleaner.clean_data('username.csv') # Example: cleaner.clean_data('1capplegate.csv')
This preprocesses the given CSV file and writes the cleaned data to tweets/cleaned_data/username.csv
It removes or normalizes the following (an illustrative example follows the list):
- Hashtags
- User mentions
- Stop words
- Punctuation
- Emojis and other junk characters
- URLs
- Lemmatizes all words (you may also try stemming)
- Replaces numbers with <NUMBER>
- Replaces measurements like 15k, 12cm, 100Km, etc. with <UNIT>
- Expands contractions like can't, won't, I'll, etc.
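For example (illustrative only; the exact output depends on the stop-word list and lemmatizer used), a raw tweet such as
@someuser I can't believe it!! Running 15km today #fitness https://t.co/abc123
might come out roughly as
believe run <UNIT> today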
If you do not want to remove stop words, run the following:
cleaner.clean_data('username.csv', remove_stopwords=False)
Similarly,
cleaner.clean_data('username.csv', remove_hashtags=False) # Keep hashtags
cleaner.clean_data('username.csv', remove_mentions=False) # Keep @ mentions
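Since these are independent keyword arguments, they can presumably be combined in a single call (an assumption; check the clean_data signature in preprocessing.py):
cleaner.clean_data('username.csv', remove_stopwords=False, remove_hashtags=False, remove_mentions=False) # Keep stop words, hashtags, and mentions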
You may extend this script with additional cleaning rules. You can also call the clean_data function in a loop to process all users, as in the sketch below.
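A minimal sketch, assuming preprocessing.py is run from inside the Tweet-dataset folder and each user's raw file is named username.csv:
from pathlib import Path
from preprocessing import Cleaner

cleaner = Cleaner()
# Clean every per-user CSV in the current folder
for csv_path in Path('.').glob('*.csv'):
    cleaner.clean_data(csv_path.name)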
For per-user analytics, the following commands may be used:
from preprocessing import Analytics
analyser = Analytics()
max_word_counts, average = analyser.analyse_user('username.csv')
# Make sure you have cleaned the data first for the specific user
# Plot the data
analyser.plot_data('username.csv')
# This will save a png of the bar graph to tweets/users/username.png
# and will not display the plot on screen
If you just want to view the plot without saving it, use:
analyser.plot_data('username.csv', save=False)
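A minimal sketch for analysing and plotting every cleaned user in one pass (the tweets/cleaned_data location comes from the cleaning step above; the '*.csv' pattern is an assumption about the file names):
from pathlib import Path
from preprocessing import Analytics

analyser = Analytics()
# Analyse each user whose data has already been cleaned
for csv_path in Path('tweets/cleaned_data').glob('*.csv'):
    max_word_counts, average = analyser.analyse_user(csv_path.name)
    print(csv_path.stem, average)  # 'average' as returned by analyse_user
    analyser.plot_data(csv_path.name)  # saves tweets/users/username.png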
You can now extract feature keywords from each user's document using this script:
Dependency: scikit-learn. If it is not installed, run the following command again:
pip install -r requirements.txt
from preprocessing import Analytics
analyser = Analytics()
top_keywords = analyser.extract_keywords('username.csv')
# By default, the top 5 keywords are returned. For the top n keywords, run:
top_n_keywords = analyser.extract_keywords('username.csv', max_features=n) # Where n is an integer
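A minimal sketch that collects the top keywords for every cleaned user into a single dict (the tweets/cleaned_data location and '*.csv' pattern are assumptions, as above):
from pathlib import Path
from preprocessing import Analytics

analyser = Analytics()
# Map each username to its top-10 keywords
keywords_by_user = {
    csv_path.stem: analyser.extract_keywords(csv_path.name, max_features=10)
    for csv_path in Path('tweets/cleaned_data').glob('*.csv')
}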
To clean and analyse all users at once, run the following:
from preprocessing import Cleaner, Analytics
cleaner = Cleaner()
cleaner.clean() # To clean all data
analyser = Analytics()
analyser.analyse() # For overall analysis