Added Feature replies count #14

Open
wants to merge 44 commits into base: main
Changes from all commits
44 commits
1e3dbff
added SVM classifier
MagMueller Oct 9, 2021
efd3421
added to classification.sh
MagMueller Oct 9, 2021
004eb57
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
637faec
add knn
MagMueller Oct 9, 2021
97c7e69
add svm
MagMueller Oct 9, 2021
e5a4e9b
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
7755e5d
Merge branch 'main' of https://github.com/avocardio/MLinPractice into…
MagMueller Oct 9, 2021
0525e22
Merge branch 'SupportVectorMachineClassifier' into main
MagMueller Oct 9, 2021
8f31962
spelling error
MagMueller Oct 9, 2021
58fbed5
use not all data
MagMueller Oct 9, 2021
c09128c
safer limit
MagMueller Oct 9, 2021
d18496a
Merge branch 'lbechberger:main' into main
MagMueller Oct 11, 2021
f078050
implementet hash feature
MagMueller Oct 11, 2021
78b3fc0
Merge branch 'lbechberger:main' into main
avocardio Oct 11, 2021
3fa2beb
new docu. file
avocardio Oct 11, 2021
cd4810c
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
521132a
Update Documentation
avocardio Oct 11, 2021
6c8c4e3
added hash vecor, but cohens kappa still 0.0
MagMueller Oct 11, 2021
b36b6e7
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
b5b598d
Wrong Docu
avocardio Oct 11, 2021
d0d66c8
Documentation update
avocardio Oct 11, 2021
4e1faf8
test for hash vector
MagMueller Oct 12, 2021
cf8f5a4
Merge remote-tracking branch 'origin/main' into hash_feature
MagMueller Oct 12, 2021
645d2bf
updated readme and add first try to documentation.md
MagMueller Oct 12, 2021
70f3928
spelling mistaks
MagMueller Oct 12, 2021
a0c9f2b
Merge pull request #1 from avocardio/hash_feature
MagMueller Oct 12, 2021
dd87c7b
filter out all languages except from english, maybe later: Translate
MagMueller Oct 12, 2021
f881a57
preprocess start
avocardio Oct 13, 2021
35bffee
preproccesing works now
MagMueller Oct 13, 2021
92c7321
now the outputfile looks correct
MagMueller Oct 13, 2021
9891b2e
Merge branch 'main' into preprocessing/english_tweets
MagMueller Oct 13, 2021
95efc6c
edit documentation
MagMueller Oct 13, 2021
b2bf9bb
edit other files for test run
MagMueller Oct 13, 2021
c879c9d
deleted file
avocardio Oct 13, 2021
7e806da
fix
avocardio Oct 14, 2021
e729203
small changes / fixes
avocardio Oct 14, 2021
9b5d213
commented out small dataset
avocardio Oct 14, 2021
3a13c43
renamed file, added emoji / link remover
avocardio Oct 15, 2021
3730089
small mistake
avocardio Oct 15, 2021
cc5d908
preprocessing done, edit string remover it works now!!!
MagMueller Oct 15, 2021
2e36ac8
prettier
MagMueller Oct 15, 2021
1a9a399
Added Photo Feature
shagemann2021 Oct 16, 2021
2b8377d
Added replies count feature; added 'help' description
shagemann2021 Oct 17, 2021
b74f3e9
Appends int(row) instead of row, same as video_bool
avocardio Oct 18, 2021
71 changes: 51 additions & 20 deletions Documentation.md
@@ -1,59 +1,89 @@
# Documentation Example

Some introductory sentence(s). Data set and task are relatively fixed, so
This is the forked repository of Magnus Müller, Maximilian Kalcher, and Samuel Hagemann.

Our task involved building and documenting a real-life application of machine learning.
We were given a dataset of N tweets from the years X to Y and had to build a classifier that detects whether a tweet will go viral.
A tweet counts as viral when the sum of its likes and retweets is greater than 50.

The dataset was very varied and offered many features to work with, which gave us the freedom to choose and experiment with them freely.

In the end, our classifier is wrapped in an 'application', callable from the terminal, which reports how likely an input tweet is to go viral, using the dataset for training.
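For reference, the labelling rule boils down to a one-line check. The sketch below is only illustrative (the actual labels are created by `code/preprocessing/create_labels.py`, and the column names here are assumptions):

```python
import pandas as pd

def viral_label(df: pd.DataFrame) -> pd.Series:
    # assumed column names; the threshold of 50 follows the task description
    return (df["likes_count"] + df["retweets_count"]) > 50
```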

//Some introductory sentence(s). Data set and task are relatively fixed, so
probably you don't have much to say about them (unless you modified them).
If you haven't changed the application much, there's also not much to say about
that.
The following structure thus only covers preprocessing, feature extraction,
dimensionality reduction, classification, and evaluation.

## Evaluation
## Preprocessing

Before using the data, it is important to preprocess it so that our chosen features can be extracted smoothly.
Many tweets contained different kinds of punctuation, ..., emojis, and some were even written in languages other than English.

### Design Decisions

Which evaluation metrics did you use and why?
Which baselines did you use and why?
After looking at the dataset closely, we chose to keep only the core words of each sentence, ... Specifically, we:
- remove stopwords like 'a' or 'is'
- remove punctuation
- use only English tweets
- tokenize the text (see the sketch below)
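A minimal sketch of this chain (illustrative only; the actual implementation lives in the preprocessors under `code/preprocessing/` and assumes NLTK data and `langdetect` are installed):

```python
import string

from langdetect import detect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def preprocess_tweet(text):
    # keep only English tweets
    if detect(text) != "en":
        return None
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenize and drop stopwords like 'a' or 'is'
    return [token for token in word_tokenize(text.lower()) if token not in STOPWORDS]

preprocess_tweet("This is a tweet about Machine Learning!")
# -> ['tweet', 'machine', 'learning']
```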

### Results

How do the baselines perform with respect to the evaluation metrics?

Maybe show a short example of what your preprocessing does.
Language summary:
({'en': 282035, 'it': 4116, 'es': 3272, 'fr': 2781, 'de': 714, 'id': 523, 'nl': 480, 'pt': 364, 'ca': 275, 'ru': 204, 'th': 157, 'ar': 126, 'tl': 108, 'tr': 84, 'hr': 68, 'da': 66, 'ro': 60, 'ja': 58, 'sv': 42, 'et': 29, 'pl': 25, 'bg': 24, 'af': 23, 'no': 21, 'fi': 20, 'so': 16, 'ta': 16, 'hi': 11, 'mk': 11, 'he': 9, 'sw': 9, 'lt': 7, 'uk': 6, 'sl': 6, 'te': 5, 'zh-cn': 5, 'lv': 5, 'ko': 5, 'bn': 4, 'el': 4, 'fa': 3, 'vi': 2, 'mr': 2, 'ml': 2, 'hu': 2, 'kn': 1, 'cs': 1, 'gu': 1, 'sk': 1, 'ur': 1, 'sq': 1})
Total: 295811
English tweets make up about 95% of the data, so we can drop (or maybe later translate) the remaining 5% of disrupting data.
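The summary above can be reproduced with a simple counting loop (a sketch, assuming `langdetect` is used for the detection, as in our preprocessing code):

```python
from collections import Counter

from langdetect import detect

def language_summary(tweets):
    counts = Counter()
    for text in tweets:
        try:
            counts[detect(text)] += 1
        except Exception:  # langdetect raises on empty or undecidable texts
            counts["unknown"] += 1
    return counts

# On our data, counts["en"] / sum(counts.values()) is roughly 0.95.
```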

Length of all tweets:
- before preprocessing: 52686072
- after preprocessing (English only, punctuation and stopwords removed): 39666607

39666607 / 52686072 ≈ 0.75
### Interpretation

Is there anything we can learn from these results?

## Preprocessing
Probably, no real interpretation possible, so feel free to leave this section out.

I'm following the "Design Decisions - Results - Interpretation" structure here,
but you can also just use one subheading per preprocessing step to organize
things (depending on what you do, that may be better structured).
## Evaluation

### Design Decisions

Which kind of preprocessing steps did you implement? Why are they necessary
and/or useful down the road?
Which evaluation metrics did you use and why?
Which baselines did you use and why?

### Results

Maybe show a short example of what your preprocessing does.
How do the baselines perform with respect to the evaluation metrics?

### Interpretation

Probably, no real interpretation possible, so feel free to leave this section out.
Is there anything we can learn from these results?

## Feature Extraction

Again, either structure among decision-result-interpretation or based on feature,
up to you.
Again, structuring this either by decisions-results-interpretation or by feature
is up to you.



### Design Decisions

Which features did you implement? What's their motivation and how are they computed?

We wanted to try something we had not heard about in the lecture. Therefore, we used the HashingVectorizer from sklearn to create an individual hash vector for each tweet. For a sentence like 'I love Machine Learning', the output can look like [0.4, 0.3, 0.9, 0, 0.21], where the length n is the number of features. It is not very intuitive to humans why this works, but after a long time of version conflicts and other problems, we enjoyed the simplicity of using sklearn.

Usage: `--hash_vec`. To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.
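A minimal sketch of what the vectorizer does (with a toy `n_features=5` to mirror the example above; the real extractor reads `HASH_VECTOR_N_FEATURES` from `code/util.py` and additionally uses bigrams and English stopword removal):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless, so transform() works without fitting
vectorizer = HashingVectorizer(n_features=5)
vector = vectorizer.transform(["I love Machine Learning"])
print(vector.toarray())  # a dense vector of length 5 whose entries depend on the hashed tokens
```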
### Results

Can you say something about how the feature values are distributed? Maybe show some plots?

When we finally ran it successfully with 25 features, we first tried the SVM classifier, but that took far too much time (nearly endless). We therefore used KNN with k = 4 on a subset of 20000 samples, and for the first time our Cohen's kappa rose from 0.0 to 0.1, and after some tuning (using more data) to 0.3.
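The setup described above corresponds roughly to this sketch (illustrative; `X` and `y` stand for the subsampled hash-vector features and viral labels, and the score is computed on the same data used for fitting, just like `run_classifier.py` does on the training split):

```python
from sklearn.metrics import cohen_kappa_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def knn_kappa(X, y, k=4):
    # standardize the features, then fit a k nearest neighbor classifier
    classifier = make_pipeline(StandardScaler(), KNeighborsClassifier(k))
    classifier.fit(X, y)
    return cohen_kappa_score(y, classifier.predict(X))
```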


### Interpretation

Can we already guess which features may be more useful than others?
@@ -78,12 +108,13 @@ Can we somehow make sense of the dimensionality reduction results?
Which features are the most important ones and why may that be the case?

## Classification

First of all, we added a new argument, `--small 1000`, which uses only the first 1000 tweets.
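The corresponding logic in `run_classifier.py` (see the diff further down) boils down to trimming every entry of the loaded data dictionary:

```python
def take_subset(data, small):
    # sketch of the --small flag: keep only the first `small` examples of every entry
    if small is None:
        return data
    limit = min(small, len(data["features"]))
    return {key: value[:limit] for key, value in data.items()}
```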
### Design Decisions

Which classifier(s) did you use? Which hyperparameter(s) (with their respective
candidate values) did you look at? What were your reasons for this?

- SVM
### Results

The big finale begins: What are the evaluation results you obtained with your
@@ -94,4 +125,4 @@ selected setup: How well does it generalize to the test set?

Which hyperparameter settings are how important for the results?
How good are we? Can this be used in practice or are we still too bad?
Anything else we may have learned?
Anything else we may have learned?
6 changes: 5 additions & 1 deletion README.md
@@ -19,6 +19,8 @@ conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
conda install -y -q -c conda-forge mlflow=1.20.2
conda install -y -q -c conda-forge spacy
conda install -c conda-forge langdetect
```

You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -91,6 +93,8 @@ The features to be extracted can be configured with the following optional param
Moreover, the script support importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract.
- `-e` or `--export_file`: Export the configured and fitted feature extraction into the given pickle file.
- `--hash_vec`: use the HashingVectorizer from sklearn to compute a hash vector of the tweet. To change the number of features of the hash vector, edit `HASH_VECTOR_N_FEATURES` in `util.py`.

## Dimensionality Reduction

@@ -128,7 +132,7 @@ By default, this data is used to train a classifier, which is specified by one o
The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments:
- `-a` or `--accuracy`: Classification accuracy (i.e., percentage of correctly classified examples).
- `-k` or `--kappa`: Cohen's kappa (i.e., adjusting accuracy for the probability of random agreement; see the short example below).

- `--small 1000`: use only the first 1000 tweets (instead of the full data set).
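A tiny, repository-independent illustration of why kappa is reported in addition to accuracy:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# On imbalanced data, always predicting "not viral" looks accurate,
# but kappa is zero because all of the agreement is expected by chance.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy_score(y_true, y_pred))     # 0.9
print(cohen_kappa_score(y_true, y_pred))  # 0.0
```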

Moreover, the script support importing and exporting trained classifiers with the following optional arguments:
- `-i` or `--import_file`: Load a trained classifier from the given pickle file. Ignore all parameters that configure the classifier to use and don't retrain the classifier.
5 changes: 2 additions & 3 deletions code/classification.sh
@@ -5,10 +5,9 @@ mkdir -p data/classification/

# run feature extraction on training set (may need to fit extractors)
echo " training set"
python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --knn 5 -s 42 --accuracy --kappa

python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm --knn 4 --accuracy --kappa
# run feature extraction on validation set (with pre-fit extractors)
echo " validation set"
python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa

# don't touch the test set, yet, because that would ruin the final generalization experiment!
# don't touch the test set, yet, because that would ruin the final generalization experiment!
34 changes: 26 additions & 8 deletions code/classification/run_classifier.py
@@ -11,9 +11,11 @@
import argparse, pickle
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


# setting up CLI
parser = argparse.ArgumentParser(description = "Classifier")
@@ -23,11 +25,14 @@
parser.add_argument("-i", "--import_file", help = "import a trained classifier from the given location", default = None)
parser.add_argument("-m", "--majority", action = "store_true", help = "majority class classifier")
parser.add_argument("-f", "--frequency", action = "store_true", help = "label frequency classifier")
parser.add_argument("-v", "--svm", action = "store_true", help = "SVM classifier")
parser.add_argument("--knn", type = int, help = "k nearest neighbor classifier with the specified value of k", default = None)
parser.add_argument("-a", "--accuracy", action = "store_true", help = "evaluate using accuracy")
parser.add_argument("-k", "--kappa", action = "store_true", help = "evaluate using Cohen's kappa")
args = parser.parse_args()
parser.add_argument("--small", type = int, help = "not use all data but just subset", default = None)

args = parser.parse_args()
#args, unk = parser.parse_known_args()
# load data
with open(args.input_file, 'rb') as f_in:
data = pickle.load(f_in)
@@ -43,24 +48,37 @@
# majority vote classifier
print(" majority vote classifier")
classifier = DummyClassifier(strategy = "most_frequent", random_state = args.seed)

elif args.frequency:
# label frequency classifier
print(" label frequency classifier")
classifier = DummyClassifier(strategy = "stratified", random_state = args.seed)


elif args.svm:
print(" SVM classifier")
classifier = make_pipeline(StandardScaler(), SVC(probability=True))
elif args.knn is not None:
print(" {0} nearest neighbor classifier".format(args.knn))
standardizer = StandardScaler()
knn_classifier = KNeighborsClassifier(args.knn)
classifier = make_pipeline(standardizer, knn_classifier)

classifier.fit(data["features"], data["labels"].ravel())




if args.small is not None:
# if limit is given
max_length = len(data['features'])
limit = min(args.small, max_length)
# go through data and limit it
for key, value in data.items():
data[key] = value[:limit]


classifier.fit(data["features"], data["labels"].ravel())
# now classify the given data
prediction = classifier.predict(data["features"])



# collect all evaluation metrics
evaluation_metrics = []
if args.accuracy:
@@ -75,4 +93,4 @@
# export the trained classifier if the user wants us to do so
if args.export_file is not None:
with open(args.export_file, 'wb') as f_out:
pickle.dump(classifier, f_out)
pickle.dump(classifier, f_out)
2 changes: 2 additions & 0 deletions code/dimensionality_reduction/reduce_dimensionality.py
@@ -40,6 +40,7 @@
if args.mutual_information is not None:
# select K best based on Mutual Information
dim_red = SelectKBest(mutual_info_classif, k = args.mutual_information)

dim_red.fit(features, labels.ravel())

# resulting feature names based on support given by SelectKBest
@@ -64,6 +65,7 @@ def get_feature_names(kbest, names):
# store the results
output_data = {"features": reduced_features,
"labels": labels}

with open(args.output_file, 'wb') as f_out:
pickle.dump(output_data, f_out)

12 changes: 10 additions & 2 deletions code/feature_extraction/extract_features.py
@@ -12,8 +12,9 @@
import pandas as pd
import numpy as np
from code.feature_extraction.character_length import CharacterLength
from code.feature_extraction.hash_vector import HashVector
from code.feature_extraction.feature_collector import FeatureCollector
from code.util import COLUMN_TWEET, COLUMN_LABEL
from code.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_PREPROCESS


# setting up CLI
@@ -23,6 +24,7 @@
parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None)
parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None)
parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet")
parser.add_argument("--hash_vec", action = "store_true", help = "compute the hash vector of the tweet")
args = parser.parse_args()

# load data
@@ -40,13 +42,18 @@
if args.char_length:
# character length of original tweet (without any changes)
features.append(CharacterLength(COLUMN_TWEET))

if args.hash_vec:
# hash vector of original tweet (without any changes)
features.append(HashVector(COLUMN_TWEET))


# create overall FeatureCollector
feature_collector = FeatureCollector(features)

# fit it on the given data set (assumed to be training data)
feature_collector.fit(df)



# apply the given FeatureCollector on the current data set
# maps the pandas DataFrame to an numpy array
@@ -59,6 +66,7 @@
# store the results
results = {"features": feature_array, "labels": label_array,
"feature_names": feature_collector.get_feature_names()}

with open(args.output_file, 'wb') as f_out:
pickle.dump(results, f_out)

37 changes: 37 additions & 0 deletions code/feature_extraction/hash_vector.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Feature that maps the tweet in the given column to a fixed-length hash vector using sklearn's HashingVectorizer.

Created on Wed Sep 29 12:29:25 2021

@author: lbechberger
"""

import numpy as np
from code.feature_extraction.feature_extractor import FeatureExtractor
from sklearn.feature_extraction.text import HashingVectorizer

from code.util import HASH_VECTOR_N_FEATURES

# class for extracting a fixed-length hash vector as a feature


class HashVector(FeatureExtractor):

# constructor
def __init__(self, input_column):
super().__init__([input_column], "{0}_hashvector".format(input_column))

# don't need to fit, so don't overwrite _set_variables()

# compute the hash vector based on the inputs
def _get_values(self, inputs):
# inputs is a list of input columns; inputs[0] holds the text documents
# create the vectorizer (HashingVectorizer is stateless, so no fitting is needed)
vectorizer = HashingVectorizer(n_features=HASH_VECTOR_N_FEATURES,
strip_accents='ascii', stop_words='english', ngram_range=(2, 2))
# encode document
vector = vectorizer.fit_transform(inputs[0])
return vector.toarray()
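A quick way to sanity-check the new extractor in isolation (a sketch; it calls the internal `_get_values` directly, which the feature-extraction pipeline normally does for us):

```python
import pandas as pd
from code.feature_extraction.hash_vector import HashVector

df = pd.DataFrame({"tweet": ["I love Machine Learning", "hash vectors are simple"]})

extractor = HashVector("tweet")
# _get_values expects a list of input columns; here only the tweet column
vectors = extractor._get_values([df["tweet"]])
print(vectors.shape)  # (2, HASH_VECTOR_N_FEATURES)
```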
12 changes: 6 additions & 6 deletions code/preprocessing.sh
@@ -1,19 +1,19 @@
#!/bin/bash

# create directory if not yet existing
mkdir -p data/preprocessing/split/
#mkdir -p data/preprocessing/split/

# install all NLTK models
python -m nltk.downloader all
#python -m nltk.downloader all

# add labels
echo " creating labels"
echo -e "\n -> creating labels\n"
python -m code.preprocessing.create_labels data/raw/ data/preprocessing/labeled.csv

# other preprocessing (removing punctuation etc.)
echo " general preprocessing"
python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --tokenize -e data/preprocessing/pipeline.pickle
echo -e "\n -> general preprocessing\n"
python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --strings --tokenize --language en -e data/preprocessing/pipeline.pickle

# split the data set
echo " splitting the data set"
echo -e "\n -> splitting the data set\n"
python -m code.preprocessing.split_data data/preprocessing/preprocessed.csv data/preprocessing/split/ -s 42
2 changes: 1 addition & 1 deletion code/preprocessing/create_labels.py
@@ -28,7 +28,7 @@
# load all csv files
dfs = []
for file_path in file_paths:
dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n"))
dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False))

# join all data into a single DataFrame
df = pd.concat(dfs)