# Preprocessing script for the collected tweets

Follow these steps to use the script:

Please report any bugs you find.

## 1) Installation

```bash
git clone https://github.com/Shadow977/preprocessing.git
cd preprocessing
pip install -r requirements.txt
```

After this, move the `preprocessing.py` file into the dataset folder `Tweet-dataset`.

## 2) Clean the dataset

The following commands clean the CSV file for the specified user:

```python
from preprocessing import Cleaner

cleaner = Cleaner()
cleaner.clean_data('username.csv')  # Example: cleaner.clean_data('1capplegate.csv')
```

This will preprocess that CSV file and write the cleaned data to `tweets/cleaned_data/username.csv`. The cleaner removes or normalizes the following:

* Hashtags
* User mentions
* Stop words
* Lemmatizes all words (you may also try stemming)
* Punctuation
* Emojis and other junk characters
* URLs
* Replaces numbers with `<NUMBER>`
* Replaces words like 15k, 12cm, 100Km, etc. with `<UNIT>`
* Expands contractions such as can't, won't, I'll, etc.

If you do not want to remove stop words, run the following:

```python
cleaner.clean_data('username.csv', remove_stopwords=False)
```

Similarly:

```python
cleaner.clean_data('username.csv', remove_hashtags=False)  # Keep hashtags
cleaner.clean_data('username.csv', remove_mentions=False)  # Keep @ mentions
```

You may extend this script with more options.

You may call the `clean_data` function in a loop to process all users, as sketched below.
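For example, a minimal sketch of such a loop, assuming the per-user CSV files sit in the current working directory alongside `preprocessing.py`:

```python
from pathlib import Path

from preprocessing import Cleaner

cleaner = Cleaner()
# Clean every per-user CSV in the dataset folder.
for csv_file in Path('.').glob('*.csv'):
    cleaner.clean_data(csv_file.name)
```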

## 3) Analyse data and plot

The following commands may be used:

```python
from preprocessing import Analytics

analyser = Analytics()
# Make sure you have cleaned the data for this user first.
max_word_counts, average = analyser.analyse_user('username.csv')

# Plot the data. This saves a PNG of the bar graph to
# tweets/users/username.png instead of displaying the plot.
analyser.plot_data('username.csv')
```

If you just want to view the plot without saving it, use:

```python
analyser.plot_data('username.csv', save=False)
```
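Putting the calls together, a usage sketch might look like this; the exact shape of the returned values is not documented here, so the `print` below is an assumption, and `someotheruser` is a made-up placeholder:

```python
from preprocessing import Analytics

analyser = Analytics()
for user in ['1capplegate', 'someotheruser']:  # second username is hypothetical
    max_word_counts, average = analyser.analyse_user(f'{user}.csv')
    print(user, max_word_counts, average)
    analyser.plot_data(f'{user}.csv')  # saves tweets/users/<user>.png
```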

## 4) Extract feature keywords

You can now extract feature keywords from each user's document using this script. This step depends on scikit-learn; if it is not installed, run the following command again:

```bash
pip install -r requirements.txt
```
```python
from preprocessing import Analytics

analyser = Analytics()

# By default, the top 5 keywords are returned.
top_keywords = analyser.extract_keywords('username.csv')

# For the top n keywords, pass max_features (where n is an integer):
top_n_keywords = analyser.extract_keywords('username.csv', max_features=n)
```
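For reference, keyword extraction with scikit-learn is commonly implemented along these lines. This is only a rough sketch of the general technique, not necessarily what `extract_keywords` does internally; the `tweet` column name and the use of pandas are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_top_keywords(csv_path, max_features=5):
    """Return the top-weighted terms from one user's cleaned tweets."""
    # 'tweet' as the text column is an assumption about the cleaned CSV layout.
    tweets = pd.read_csv(csv_path)['tweet'].dropna().astype(str)
    # TfidfVectorizer keeps only the max_features terms with the highest
    # frequency across the corpus.
    vectorizer = TfidfVectorizer(max_features=max_features)
    vectorizer.fit(tweets)
    return list(vectorizer.get_feature_names_out())
```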

## 5) Overall cleaning and analysis

Run the following to clean and analyse all users at once:

```python
from preprocessing import Cleaner, Analytics

cleaner = Cleaner()
cleaner.clean()     # Clean all data
analyser = Analytics()
analyser.analyse()  # Run the overall analysis
```

You are welcome to contribute more to this script.

Please open an issue if you find any bugs.