Tweetclean #17

Open

wants to merge 51 commits into base: main

Commits (51)
b530c66
- adds main function to most scripts
TobiObeck Oct 6, 2021
7b953d6
adds multiple evaluation metrics for classifier
TobiObeck Oct 6, 2021
de3084f
Merge branch 'main' into add-evaluation-metrics
TobiObeck Oct 6, 2021
86a9eb1
separates loading of data sets into setup.sh script
TobiObeck Oct 6, 2021
929db48
Merge branch 'main' of https://github.com/lbechberger/MLinPractice
TobiObeck Oct 7, 2021
ae3c345
Update util.py
pariyashu Oct 7, 2021
0daf32a
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 7, 2021
a7a61aa
Merge commit 'a7c7fdb9a3ff5a5af9677b5808279c5f8b018662'
TobiObeck Oct 7, 2021
ee9f4c0
adds parsing of tokenized tweets example
TobiObeck Oct 7, 2021
fde0818
adds func to document tests
TobiObeck Oct 7, 2021
a9ce4de
moves tests into respective code folder
TobiObeck Oct 7, 2021
f37740f
adds documentation how to run tests
TobiObeck Oct 7, 2021
303ba92
test connection
pariyashu Oct 9, 2021
7c873ac
remove unnecessary comment
pariyashu Oct 9, 2021
d0cb537
adds counting of mentions & removal of orig. column
TobiObeck Oct 10, 2021
39b1bdc
minor cleanup
TobiObeck Oct 10, 2021
6483c10
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 10, 2021
8988ef6
adds MentionsCounter Preprocessor
TobiObeck Oct 10, 2021
8e92246
minor cleanup
TobiObeck Oct 10, 2021
cf0c340
test connection
pariyashu Oct 9, 2021
ca148df
remove unnecessary comment
pariyashu Oct 9, 2021
5e03e47
Merge pull request #1 from TobiObeck/mentions-count-col
pariyashu Oct 10, 2021
c45a142
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice into…
pariyashu Oct 10, 2021
8777d37
filter language
pariyashu Oct 10, 2021
240dc9a
drop columns inc eng
pariyashu Oct 10, 2021
9ea4c4e
implements column remover as proper preprocessor
TobiObeck Oct 12, 2021
bebf224
Merge branch 'main' of https://github.com/lbechberger/MLinPractice in…
TobiObeck Oct 12, 2021
62b0de2
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
6e19a6d
gets rid of warning by specifying dtypes while reading csv
TobiObeck Oct 13, 2021
fba4d6b
Merge branch 'main' of https://github.com/TobiObeck/MLinPractice
TobiObeck Oct 13, 2021
6fdf847
renames folder code -> src
TobiObeck Oct 19, 2021
eaf0117
adds test for counting feature
TobiObeck Oct 19, 2021
33c42bc
cleanup
TobiObeck Oct 19, 2021
d584460
minor changes
TobiObeck Oct 19, 2021
a821f94
renames folder code -> src
TobiObeck Oct 19, 2021
00c91be
stores mlflow and pickle data
TobiObeck Oct 19, 2021
39b3875
Merge branch 'temp-pull-grid-from-bech'
TobiObeck Oct 19, 2021
a07f531
separates examples into corresponding files
TobiObeck Oct 19, 2021
6c860e8
adds randomforest classifier
TobiObeck Oct 22, 2021
2227e83
disables dimensionality reduction
TobiObeck Oct 22, 2021
46d98b4
adds a classification run for all classifiers
TobiObeck Oct 22, 2021
411ba86
tweet clean remove @user,#hashtag,https and emojis
pariyashu Oct 23, 2021
71108c0
append tweetClean class
pariyashu Oct 23, 2021
5d90eec
adds test for tweet cleaner
TobiObeck Oct 24, 2021
df0f208
makes tweet_cleaner runnable, but still WIP
TobiObeck Oct 24, 2021
774784a
text removel after link solved
pariyashu Oct 26, 2021
b214a25
Update tweet_clean.py
pariyashu Oct 26, 2021
9728f8c
white spece problem solve 1 remaining
pariyashu Oct 26, 2021
4d37f95
fixes tweet clean test
TobiObeck Oct 26, 2021
8883475
strips white spaces text start & end
TobiObeck Oct 26, 2021
90c6f70
renames leftover filepaths from `code` to `src`
TobiObeck Oct 29, 2021
40 changes: 40 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,40 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python: Attach",
"type": "python",
"request": "attach",
"connect": {
"host": "localhost",
"port": 5678
}
},
{
"name": "Python: Module",
"type": "python",
"request": "launch",
"module": "code",
"cwd": "${workspaceFolder}",
},
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"cwd": "${workspaceFolder}",
// "pythonArgs": [
// "-m",
// "src.feature_extraction.test.feature_extraction_test",
// "E:\\MyPC\\code\\git\\myforkMLiP\\MLinPractice\\src\\feature_extraction\\test\\feature_extraction_test.py"
// ],
// "env": {
// "PYTHONPATH": "${workspaceFolder}/code"
// }
}
]
}
92 changes: 75 additions & 17 deletions README.md
@@ -27,24 +27,39 @@ In order to save some space on your local machine, you can run `conda clean -y -

The installed libraries are used for machine learning (`scikit-learn`), visualizations (`matplotlib`), NLP (`nltk`), word embeddings (`gensim`), the IDE (`spyder`), and data handling (`pandas`).

## Overall Pipeline
## Setup & Overall Pipeline & Tests

### Setup

The shell script `code/setup.sh` needs to be run once before `code/pipeline.sh` or any other shell script can be executed. It downloads the necessary data by running the scripts `code/load_data.sh` and `code/load_nltk_data.sh`.
- The former script `code/load_data.sh` downloads the Data Science Tweets as raw csv files containing the tweets and their metadata. They are stored in the folder `data/raw/` (which will be created if it does not yet exist).
- The latter script `code/load_nltk_data.sh` downloads the necessary NLTK data sets, corpora, and models (see https://www.nltk.org/data.html for details).
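
A typical first run then looks like this (a sketch; it assumes a bash-compatible shell, run from the repository root, with the paths as given in this README):

```shell
bash code/setup.sh      # one-time: downloads the tweet csv files and the NLTK data
bash code/pipeline.sh   # afterwards: runs the full pipeline
```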

### Pipeline

The overall pipeline can be executed with the script `code/pipeline.sh`, which executes all of the following shell scripts (but not `code/setup.sh`, which must have been run once beforehand):
- The script `code/preprocessing.sh` executes all necessary preprocessing steps, including the creation of labels and the splitting of the data set.
- The script `code/feature_extraction.sh` takes care of feature extraction.
- The script `code/dimensionality_reduction.sh` takes care of dimensionality reduction.
- The script `code/classification.sh` takes care of training and evaluating a classifier.
- The script `code/application.sh` launches the application example.
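
Conceptually, `code/pipeline.sh` simply chains the scripts listed above; a minimal sketch (assuming no arguments are passed between the steps):

```shell
#!/bin/bash
# sketch only; the actual script may wire the output of one step into the next
bash code/preprocessing.sh
bash code/feature_extraction.sh
bash code/dimensionality_reduction.sh
bash code/classification.sh
bash code/application.sh
```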

### Tests

To run the unit tests, use the following command:

```shell
python -m unittest discover -s src -p '*_test.py'
```

## Preprocessing

All python scripts and classes for the preprocessing of the input data can be found in `code/preprocessing/`.

### Creating Labels

The script `create_labels.py` assigns labels to the raw data points based on a threshold on a linear combination of the number of likes and retweets. It is executed as follows:
```python -m src.preprocessing.create_labels path/to/input_dir path/to/output.csv```
Here, `input_dir` is the directory containing the original raw csv files, while `output.csv` is the single csv file where the output will be written.
The script takes the following optional parameters:
- `-l` or `--likes_weight` determines the relative weight of the number of likes a tweet has received. Defaults to 1.
@@ -54,7 +69,7 @@ The script takes the following optional parameters:
### Classical Preprocessing

The script `run_preprocessing.py` is used to run various preprocessing steps on the raw data, producing additional columns in the csv file. It is executed as follows:
```python -m src.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
Here, `input.csv` is a csv file (ideally the output of `create_labels.py`), while `output.csv` is the csv file where the output will be written.
The preprocessing steps to take can be configured with the following flags:
- `-p` or `--punctuation`: A new column "tweet_no_punctuation" is created, where all punctuation is removed from the original tweet. (See `code/preprocessing/punctuation_remover.py` for more details)
@@ -66,7 +81,7 @@ Moreover, the script accepts the following optional parameters:
### Splitting the Data Set

The script `split_data.py` splits the overall preprocessed data into training, validation, and test set. It can be invoked as follows:
```python -m src.preprocessing.split_data path/to/input.csv path/to/output_dir```
Here, `input.csv` is the input csv file to split (containing a column "label" with the label information, i.e., `create_labels.py` needs to be run beforehand) and `output_dir` is the directory where three individual csv files `training.csv`, `validation.csv`, and `test.csv` will be stored.
The script takes the following optional parameters:
- `-t` or `--test_size` determines the relative size of the test set and defaults to 0.2 (i.e., 20 % of the data).
@@ -79,7 +94,7 @@ The script takes the following optional parameters:
All python scripts and classes for feature extraction can be found in `code/feature_extraction/`.

The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
```python -m src.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
Here, `input.csv` is the respective training, validation, or test set file created by `split_data.py`. The file `output.pickle` will be used to store the results of the feature extraction process, namely a dictionary with the following entries:
- `"features"`: a numpy array with the raw feature values (rows are training examples, colums are features)
- `"feature_names"`: a list of feature names for the columns of the numpy array
@@ -98,7 +113,7 @@ All python scripts and classes for dimensionality reduction can be found in `cod

The script `reduce_dimensionality.py` takes care of the overall dimensionality reduction procedure and can be invoked as follows:

```python -m src.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
Here, `input.pickle` is the respective training, validation, or test set file created by `extract_features.py`.
The file `output.pickle` will be used to store the results of the dimensionality reduction process, containing `"features"` (which are the selected/projected ones) and `"labels"` (same as in the input file).

@@ -118,19 +133,28 @@ All python scripts and classes for classification can be found in `code/classifi
### Train and Evaluate a Single Classifier

The script `run_classifier.py` can be used to train and/or evaluate a given classifier. It can be executed as follows:
```python -m src.classification.run_classifier path/to/input.pickle```
Here, `input.pickle` is a pickle file of the respective data subset, produced by either `extract_features.py` or `reduce_dimensionality.py`.

By default, this data is used to train a **classifier**, which is specified by one of the following optional arguments:
- `-c` or `--classifier` followed by either `most_frequent` or `stratified`
- `most_frequent` is a [_DummyClassifier_](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) which always predicts the most frequently occurring label in the training set.
- `stratified` is a [_DummyClassifier_](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) that makes predictions based on the label frequency in the training data (respects the training set’s class distribution).

**Evaluation metrics** are then computed for the classifier. The metrics to use are specified with the following optional argument:

- `-m` or `--metrics` followed by one of the following options (default is `none`): `none`, `all`, or one of the individual metrics below:
- `accuracy`: Classification accuracy (i.e., percentage of correctly classified examples).
- `kappa`: Cohen's kappa (i.e., adjusting accuracy for probability of random agreement).
- `precision`
- `recall`
- `f1`
- `jaccard`

For more details on the metrics used, see: https://scikit-learn.org/stable/modules/classes.html#classification-metrics
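
For example, a full training-and-evaluation run could look as follows (a sketch; the input path is a placeholder for a pickle file produced by your own pipeline):

```shell
# placeholder path; trains a stratified dummy classifier and reports all metrics
python -m src.classification.run_classifier data/feature_extraction/training.pickle -c stratified -m all
```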

Moreover, the script supports **importing and exporting trained classifiers** with the following optional arguments:
- `-i` or `--import_file`: Load a trained classifier from the given pickle file. All parameters that configure the classifier are then ignored and the classifier is not retrained.
- `-e` or `--export_file`: Export the trained classifier into the given pickle file.
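
Combined, these options allow training once and evaluating later, as in this sketch (all paths are placeholders):

```shell
# train on the training set and export the fitted classifier
python -m src.classification.run_classifier data/feature_extraction/training.pickle -c most_frequent -e data/classification/classifier.pickle

# re-load the exported classifier and evaluate it on the validation set
python -m src.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -m all
```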

@@ -143,5 +167,39 @@ All python code for the application demo can be found in `code/application/`.

The script `application.py` provides a simple command line interface, where the user is asked to type in their prospective tweet, which is then analyzed using the trained ML pipeline.
The script can be invoked as follows:
```python -m src.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.

## Debugging in Visual Studio Code

1. Run the file in debug mode, configured to wait for a client to attach, because otherwise it would just finish too quickly:

```
python -m debugpy --wait-for-client --listen 5678 .\src\feature_extraction\test\feature_extraction_test.py
```

2. Use the following `launch.json` configuration to attach the editor to the already started debug process:

```json
...
"configurations": [
{
"name": "Python: Attach",
"type": "python",
"request": "attach",
"connect": {
"host": "localhost",
"port": 5678
}
},
]
...
```

3. Start the attach debug configuration via the VS Code UI (the [F5] key or the `Run`/`Run and Debug` menu).

## Running MLflow

```
mlflow ui --backend-store-uri data/classification/mlflow
```
5 changes: 0 additions & 5 deletions code/application.sh

This file was deleted.

60 changes: 0 additions & 60 deletions code/application/application.py

This file was deleted.

14 changes: 0 additions & 14 deletions code/classification.sh

This file was deleted.

99 changes: 0 additions & 99 deletions code/classification/run_classifier.py

This file was deleted.
