Added Feature replies count #14

Open
wants to merge 44 commits into base: main
Changes from all commits
44 commits
1e3dbff
added SVM classifier
MagMueller Oct 9, 2021
efd3421
added to classification.sh
MagMueller Oct 9, 2021
004eb57
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
637faec
add knn
MagMueller Oct 9, 2021
97c7e69
add svm
MagMueller Oct 9, 2021
e5a4e9b
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
7755e5d
Merge branch 'main' of https://github.com/avocardio/MLinPractice into…
MagMueller Oct 9, 2021
0525e22
Merge branch 'SupportVectorMachineClassifier' into main
MagMueller Oct 9, 2021
8f31962
spelling error
MagMueller Oct 9, 2021
58fbed5
use not all data
MagMueller Oct 9, 2021
c09128c
safer limit
MagMueller Oct 9, 2021
d18496a
Merge branch 'lbechberger:main' into main
MagMueller Oct 11, 2021
f078050
implementet hash feature
MagMueller Oct 11, 2021
78b3fc0
Merge branch 'lbechberger:main' into main
avocardio Oct 11, 2021
3fa2beb
new docu. file
avocardio Oct 11, 2021
cd4810c
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
521132a
Update Documentation
avocardio Oct 11, 2021
6c8c4e3
added hash vecor, but cohens kappa still 0.0
MagMueller Oct 11, 2021
b36b6e7
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
b5b598d
Wrong Docu
avocardio Oct 11, 2021
d0d66c8
Documentation update
avocardio Oct 11, 2021
4e1faf8
test for hash vector
MagMueller Oct 12, 2021
cf8f5a4
Merge remote-tracking branch 'origin/main' into hash_feature
MagMueller Oct 12, 2021
645d2bf
updated readme and add first try to documentation.md
MagMueller Oct 12, 2021
70f3928
spelling mistaks
MagMueller Oct 12, 2021
a0c9f2b
Merge pull request #1 from avocardio/hash_feature
MagMueller Oct 12, 2021
dd87c7b
filter out all languages except from english, maybe later: Translate
MagMueller Oct 12, 2021
f881a57
preprocess start
avocardio Oct 13, 2021
35bffee
preproccesing works now
MagMueller Oct 13, 2021
92c7321
now the outputfile looks correct
MagMueller Oct 13, 2021
9891b2e
Merge branch 'main' into preprocessing/english_tweets
MagMueller Oct 13, 2021
95efc6c
edit documentation
MagMueller Oct 13, 2021
b2bf9bb
edit other files for test run
MagMueller Oct 13, 2021
c879c9d
deleted file
avocardio Oct 13, 2021
7e806da
fix
avocardio Oct 14, 2021
e729203
small changes / fixes
avocardio Oct 14, 2021
9b5d213
commented out small dataset
avocardio Oct 14, 2021
3a13c43
renamed file, added emoji / link remover
avocardio Oct 15, 2021
3730089
small mistake
avocardio Oct 15, 2021
cc5d908
preprocessing done, edit string remover it works now!!!
MagMueller Oct 15, 2021
2e36ac8
prettier
MagMueller Oct 15, 2021
1a9a399
Added Photo Feature
shagemann2021 Oct 16, 2021
2b8377d
Added replies count feature; added 'help' description
shagemann2021 Oct 17, 2021
b74f3e9
Appends int(row) instead of row, same as video_bool
avocardio Oct 18, 2021
71 changes: 51 additions & 20 deletions Documentation.md
@@ -1,59 +1,89 @@
# Documentation Example

Some introductory sentence(s). Data set and task are relatively fixed, so
This is the forked repository of Magnus Müller, Maximilian Kalcher, and Samuel Hagemann.

Our task involved building and documenting a real-life application of machine learning.
We were given a dataset of N tweets from the years X to Y and had to build a classifier that detects whether a tweet will go viral.
A tweet counts as viral when the sum of its likes and retweets is greater than 50.

The dataset was very varied and offered many features to work with, which gave us the freedom to choose and experiment with them freely.

In the end, our classifier is wrapped in an 'application', callable from the terminal, which reports how likely an input tweet is to go viral, using the dataset for training.
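For reference, the labelling rule boils down to a one-line check. The sketch below is only illustrative (the actual labels are created by `code/preprocessing/create_labels.py`, and the column names here are assumptions):

```python
import pandas as pd

def viral_label(df: pd.DataFrame) -> pd.Series:
    # assumed column names; the threshold of 50 follows the task description
    return (df["likes_count"] + df["retweets_count"]) > 50
```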

//Some introductory sentence(s). Data set and task are relatively fixed, so
probably you don't have much to say about them (unless you modified them).
If you haven't changed the application much, there's also not much to say about
that.
The following structure thus only covers preprocessing, feature extraction,
dimensionality reduction, classification, and evaluation.

## Evaluation
## Preprocessing

Before using the data, it is important to preprocess it so that our chosen features can be extracted smoothly.
Many tweets contained different kinds of punctuation, ..., emojis, and some were even written in languages other than English.

### Design Decisions

Which evaluation metrics did you use and why?
Which baselines did you use and why?
After looking at the dataset closely, we chose to keep only the core words of each sentence, ... Specifically, we:
- remove stopwords like 'a' or 'is'
- remove punctuation
- use only English tweets
- tokenize the text (see the sketch below)
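A minimal sketch of this chain (illustrative only; the actual implementation lives in the preprocessors under `code/preprocessing/` and assumes NLTK data and `langdetect` are installed):

```python
import string

from langdetect import detect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def preprocess_tweet(text):
    # keep only English tweets
    if detect(text) != "en":
        return None
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenize and drop stopwords like 'a' or 'is'
    return [token for token in word_tokenize(text.lower()) if token not in STOPWORDS]

preprocess_tweet("This is a tweet about Machine Learning!")
# -> ['tweet', 'machine', 'learning']
```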

### Results

How do the baselines perform with respect to the evaluation metrics?

Maybe show a short example of what your preprocessing does.
Language summary:
({'en': 282035, 'it': 4116, 'es': 3272, 'fr': 2781, 'de': 714, 'id': 523, 'nl': 480, 'pt': 364, 'ca': 275, 'ru': 204, 'th': 157, 'ar': 126, 'tl': 108, 'tr': 84, 'hr': 68, 'da': 66, 'ro': 60, 'ja': 58, 'sv': 42, 'et': 29, 'pl': 25, 'bg': 24, 'af': 23, 'no': 21, 'fi': 20, 'so': 16, 'ta': 16, 'hi': 11, 'mk': 11, 'he': 9, 'sw': 9, 'lt': 7, 'uk': 6, 'sl': 6, 'te': 5, 'zh-cn': 5, 'lv': 5, 'ko': 5, 'bn': 4, 'el': 4, 'fa': 3, 'vi': 2, 'mr': 2, 'ml': 2, 'hu': 2, 'kn': 1, 'cs': 1, 'gu': 1, 'sk': 1, 'ur': 1, 'sq': 1})
Total: 295811
English tweets make up about 95% of the data, so we can drop (or maybe later translate) the remaining 5% of disrupting data.
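The summary above can be reproduced with a simple counting loop (a sketch, assuming `langdetect` is used for the detection, as in our preprocessing code):

```python
from collections import Counter

from langdetect import detect

def language_summary(tweets):
    counts = Counter()
    for text in tweets:
        try:
            counts[detect(text)] += 1
        except Exception:  # langdetect raises on empty or undecidable texts
            counts["unknown"] += 1
    return counts

# On our data, counts["en"] / sum(counts.values()) is roughly 0.95.
```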

Length of all tweets:
- before preprocessing: 52686072
- after preprocessing (English only, punctuation and stopwords removed): 39666607

39666607 / 52686072 ≈ 0.75
### Interpretation

Is there anything we can learn from these results?

## Preprocessing
Probably, no real interpretation possible, so feel free to leave this section out.

I'm following the "Design Decisions - Results - Interpretation" structure here,
but you can also just use one subheading per preprocessing step to organize
things (depending on what you do, that may be better structured).
## Evaluation

### Design Decisions

Which kind of preprocessing steps did you implement? Why are they necessary
and/or useful down the road?
Which evaluation metrics did you use and why?
Which baselines did you use and why?

### Results

Maybe show a short example of what your preprocessing does.
How do the baselines perform with respect to the evaluation metrics?

### Interpretation

Probably, no real interpretation possible, so feel free to leave this section out.
Is there anything we can learn from these results?

## Feature Extraction

Again, either structure among decision-result-interpretation or based on feature,
up to you.
Again, structuring this either by decisions-results-interpretation or by feature
is up to you.



### Design Decisions

Which features did you implement? What's their motivation and how are they computed?

We wanted to try something we had not heard about in the lecture. Therefore, we used the HashingVectorizer from sklearn to create an individual hash vector for each tweet. For a sentence like 'I love Machine Learning', the output can look like [0.4, 0.3, 0.9, 0, 0.21], where the length n is the number of features. It is not very intuitive to humans why this works, but after a long time of version conflicts and other problems, we enjoyed the simplicity of using sklearn.

Usage: `--hash_vec`. To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.
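A minimal sketch of what the vectorizer does (with a toy `n_features=5` to mirror the example above; the real extractor reads `HASH_VECTOR_N_FEATURES` from `code/util.py` and additionally uses bigrams and English stopword removal):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless, so transform() works without fitting
vectorizer = HashingVectorizer(n_features=5)
vector = vectorizer.transform(["I love Machine Learning"])
print(vector.toarray())  # a dense vector of length 5 whose entries depend on the hashed tokens
```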
### Results

Can you say something about how the feature values are distributed? Maybe show some plots?

When we finally ran it successfully with 25 features, we first tried the SVM classifier, but that took far too much time (nearly endless). We therefore used KNN with k = 4 on a subset of 20000 samples, and for the first time our Cohen's kappa rose from 0.0 to 0.1, and after some tuning (using more data) to 0.3.
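The setup described above corresponds roughly to this sketch (illustrative; `X` and `y` stand for the subsampled hash-vector features and viral labels, and the score is computed on the same data used for fitting, just like `run_classifier.py` does on the training split):

```python
from sklearn.metrics import cohen_kappa_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def knn_kappa(X, y, k=4):
    # standardize the features, then fit a k nearest neighbor classifier
    classifier = make_pipeline(StandardScaler(), KNeighborsClassifier(k))
    classifier.fit(X, y)
    return cohen_kappa_score(y, classifier.predict(X))
```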


### Interpretation

Can we already guess which features may be more useful than others?
@@ -78,12 +108,13 @@ Can we somehow make sense of the dimensionality reduction results?
Which features are the most important ones and why may that be the case?

## Classification

First of all, we added a new argument, `--small 1000`, which uses only the first 1000 tweets.
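The corresponding logic in `run_classifier.py` (see the diff further down) boils down to trimming every entry of the loaded data dictionary:

```python
def take_subset(data, small):
    # sketch of the --small flag: keep only the first `small` examples of every entry
    if small is None:
        return data
    limit = min(small, len(data["features"]))
    return {key: value[:limit] for key, value in data.items()}
```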
### Design Decisions

Which classifier(s) did you use? Which hyperparameter(s) (with their respective
candidate values) did you look at? What were your reasons for this?

- SVM
### Results

The big finale begins: What are the evaluation results you obtained with your
@@ -94,4 +125,4 @@ selected setup: How well does it generalize to the test set?

Which hyperparameter settings are how important for the results?
How good are we? Can this be used in practice or are we still too bad?
Anything else we may have learned?
Anything else we may have learned?
6 changes: 5 additions & 1 deletion README.md
@@ -19,6 +19,8 @@ conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
conda install -y -q -c conda-forge mlflow=1.20.2
conda install -y -q -c conda-forge spacy
conda install -c conda-forge langdetect
```

You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -91,6 +93,8 @@ The features to be extracted can be configured with the following optional param
Moreover, the script support importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract.
- `-e` or `--export_file`: Export the configured and fitted feature extraction into the given pickle file.
- `--hash_vec`: use the HashingVectorizer from sklearn to compute a hash vector of the tweet. To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.

## Dimensionality Reduction

@@ -128,7 +132,7 @@ By default, this data is used to train a classifier, which is specified by one o
The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments:
- `-a` or `--accuracy`: Classification accuracy (i.e., percentage of correctly classified examples).
- `-k` or `--kappa`: Cohen's kappa (i.e., adjusting accuracy for the probability of random agreement; see the short example below).

- `--small 1000`: use only the first 1000 tweets (instead of the full data set).
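A tiny, repository-independent illustration of why kappa is reported in addition to accuracy:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# On imbalanced data, always predicting "not viral" looks accurate,
# but kappa is zero because all of the agreement is expected by chance.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy_score(y_true, y_pred))     # 0.9
print(cohen_kappa_score(y_true, y_pred))  # 0.0
```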

Moreover, the script support importing and exporting trained classifiers with the following optional arguments:
- `-i` or `--import_file`: Load a trained classifier from the given pickle file. Ignore all parameters that configure the classifier to use and don't retrain the classifier.
5 changes: 2 additions & 3 deletions code/classification.sh
@@ -5,10 +5,9 @@ mkdir -p data/classification/

# run feature extraction on training set (may need to fit extractors)
echo " training set"
python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --knn 5 -s 42 --accuracy --kappa

python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm --knn 4 --accuracy --kappa
# run feature extraction on validation set (with pre-fit extractors)
echo " validation set"
python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa

# don't touch the test set, yet, because that would ruin the final generalization experiment!
# don't touch the test set, yet, because that would ruin the final generalization experiment!
34 changes: 26 additions & 8 deletions code/classification/run_classifier.py
@@ -11,9 +11,11 @@
import argparse, pickle
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


# setting up CLI
parser = argparse.ArgumentParser(description = "Classifier")
@@ -23,11 +25,14 @@
parser.add_argument("-i", "--import_file", help = "import a trained classifier from the given location", default = None)
parser.add_argument("-m", "--majority", action = "store_true", help = "majority class classifier")
parser.add_argument("-f", "--frequency", action = "store_true", help = "label frequency classifier")
parser.add_argument("-v", "--svm", action = "store_true", help = "SVM classifier")
parser.add_argument("--knn", type = int, help = "k nearest neighbor classifier with the specified value of k", default = None)
parser.add_argument("-a", "--accuracy", action = "store_true", help = "evaluate using accuracy")
parser.add_argument("-k", "--kappa", action = "store_true", help = "evaluate using Cohen's kappa")
args = parser.parse_args()
parser.add_argument("--small", type = int, help = "not use all data but just subset", default = None)

args = parser.parse_args()
#args, unk = parser.parse_known_args()
# load data
with open(args.input_file, 'rb') as f_in:
data = pickle.load(f_in)
@@ -43,24 +48,37 @@
# majority vote classifier
print(" majority vote classifier")
classifier = DummyClassifier(strategy = "most_frequent", random_state = args.seed)

elif args.frequency:
# label frequency classifier
print(" label frequency classifier")
classifier = DummyClassifier(strategy = "stratified", random_state = args.seed)


elif args.svm:
print(" SVM classifier")
classifier = make_pipeline(StandardScaler(), SVC(probability=True))
elif args.knn is not None:
print(" {0} nearest neighbor classifier".format(args.knn))
standardizer = StandardScaler()
knn_classifier = KNeighborsClassifier(args.knn)
classifier = make_pipeline(standardizer, knn_classifier)

classifier.fit(data["features"], data["labels"].ravel())




if args.small is not None:
# if limit is given
max_length = len(data['features'])
limit = min(args.small, max_length)
# go through data and limit it
for key, value in data.items():
data[key] = value[:limit]


classifier.fit(data["features"], data["labels"].ravel())
# now classify the given data
prediction = classifier.predict(data["features"])



# collect all evaluation metrics
evaluation_metrics = []
if args.accuracy:
@@ -75,4 +93,4 @@
# export the trained classifier if the user wants us to do so
if args.export_file is not None:
with open(args.export_file, 'wb') as f_out:
pickle.dump(classifier, f_out)
pickle.dump(classifier, f_out)
2 changes: 2 additions & 0 deletions code/dimensionality_reduction/reduce_dimensionality.py
@@ -40,6 +40,7 @@
if args.mutual_information is not None:
# select K best based on Mutual Information
dim_red = SelectKBest(mutual_info_classif, k = args.mutual_information)

dim_red.fit(features, labels.ravel())

# resulting feature names based on support given by SelectKBest
@@ -64,6 +65,7 @@ def get_feature_names(kbest, names):
# store the results
output_data = {"features": reduced_features,
"labels": labels}

with open(args.output_file, 'wb') as f_out:
pickle.dump(output_data, f_out)

12 changes: 10 additions & 2 deletions code/feature_extraction/extract_features.py
@@ -12,8 +12,9 @@
import pandas as pd
import numpy as np
from code.feature_extraction.character_length import CharacterLength
from code.feature_extraction.hash_vector import HashVector
from code.feature_extraction.feature_collector import FeatureCollector
from code.util import COLUMN_TWEET, COLUMN_LABEL
from code.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_PREPROCESS


# setting up CLI
@@ -23,6 +24,7 @@
parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None)
parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None)
parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet")
parser.add_argument("--hash_vec", action = "store_true", help = "compute the hash vector of the tweet")
args = parser.parse_args()

# load data
@@ -40,13 +42,18 @@
if args.char_length:
# character length of original tweet (without any changes)
features.append(CharacterLength(COLUMN_TWEET))

if args.hash_vec:
# hash vector of original tweet (without any changes)
features.append(HashVector(COLUMN_TWEET))


# create overall FeatureCollector
feature_collector = FeatureCollector(features)

# fit it on the given data set (assumed to be training data)
feature_collector.fit(df)



# apply the given FeatureCollector on the current data set
# maps the pandas DataFrame to an numpy array
@@ -59,6 +66,7 @@
# store the results
results = {"features": feature_array, "labels": label_array,
"feature_names": feature_collector.get_feature_names()}

with open(args.output_file, 'wb') as f_out:
pickle.dump(results, f_out)

37 changes: 37 additions & 0 deletions code/feature_extraction/hash_vector.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Feature that maps the tweet in the given column to a fixed-length hash vector using sklearn's HashingVectorizer.

Created on Wed Sep 29 12:29:25 2021

@author: lbechberger
"""

import numpy as np
from code.feature_extraction.feature_extractor import FeatureExtractor
from sklearn.feature_extraction.text import HashingVectorizer

from code.util import HASH_VECTOR_N_FEATURES

# class for extracting a fixed-length hash vector as a feature


class HashVector(FeatureExtractor):

# constructor
def __init__(self, input_column):
super().__init__([input_column], "{0}_hashvector".format(input_column))

# don't need to fit, so don't overwrite _set_variables()

# compute the hash vector based on the inputs
def _get_values(self, inputs):
# inputs is a list of input columns; inputs[0] holds the text documents
# create the vectorizer (HashingVectorizer is stateless, so no fitting is needed)
vectorizer = HashingVectorizer(n_features=HASH_VECTOR_N_FEATURES,
strip_accents='ascii', stop_words='english', ngram_range=(2, 2))
# encode document
vector = vectorizer.fit_transform(inputs[0])
return vector.toarray()
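A quick way to sanity-check the new extractor in isolation (a sketch; it calls the internal `_get_values` directly, which the feature-extraction pipeline normally does for us):

```python
import pandas as pd
from code.feature_extraction.hash_vector import HashVector

df = pd.DataFrame({"tweet": ["I love Machine Learning", "hash vectors are simple"]})

extractor = HashVector("tweet")
# _get_values expects a list of input columns; here only the tweet column
vectors = extractor._get_values([df["tweet"]])
print(vectors.shape)  # (2, HASH_VECTOR_N_FEATURES)
```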
12 changes: 6 additions & 6 deletions code/preprocessing.sh
@@ -1,19 +1,19 @@
#!/bin/bash

# create directory if not yet existing
mkdir -p data/preprocessing/split/
#mkdir -p data/preprocessing/split/

# install all NLTK models
python -m nltk.downloader all
#python -m nltk.downloader all

# add labels
echo " creating labels"
echo -e "\n -> creating labels\n"
python -m code.preprocessing.create_labels data/raw/ data/preprocessing/labeled.csv

# other preprocessing (removing punctuation etc.)
echo " general preprocessing"
python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --tokenize -e data/preprocessing/pipeline.pickle
echo -e "\n -> general preprocessing\n"
python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --strings --tokenize --language en -e data/preprocessing/pipeline.pickle

# split the data set
echo " splitting the data set"
echo -e "\n -> splitting the data set\n"
python -m code.preprocessing.split_data data/preprocessing/preprocessed.csv data/preprocessing/split/ -s 42
2 changes: 1 addition & 1 deletion code/preprocessing/create_labels.py
@@ -28,7 +28,7 @@
# load all csv files
dfs = []
for file_path in file_paths:
dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n"))
dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False))

# join all data into a single DataFrame
df = pd.concat(dfs)