The current repository holds the code for our exam project for the Natural Language Processing (NLP) exam at the Cognitive Science MSc 2023, Aarhus University. Be aware that this repository does not hold any data, as we do not own it. However, the exact data we used can be obtained by following the steps in the scraping pipeline, and in this way the entire analysis can be replicated.
The project was developed by Laura Paaby and Emma Olsen.
To run the scripts below, a virtual environment containing the required packages must be created and activated.
source setup.sh
URLs to all concert reviews from Ekstra Bladet were collected using webscraper.io. The scraped articles themselves are not shared in this repository, but all code used to fetch the articles is shared for transparency. The fetched URLs can be seen in the csv file scrape/webscraper_eb_urls.csv.
To scrape the EB articles, run the following script:
python scrape/scraping_eb.py
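For orientation, here is a minimal sketch of what such a fetch could look like, assuming the URL file has a column named `url` and that the article body sits in ordinary paragraph tags (both are assumptions; the actual logic is in scrape/scraping_eb.py):

```python
# Minimal sketch, not the actual scraping_eb.py: fetch article text for each URL in the CSV.
# The column name "url", the HTML selector, and the output path are assumptions.
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = pd.read_csv("scrape/webscraper_eb_urls.csv")["url"]  # assumed column name

articles = []
for url in urls:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect all paragraph text as a stand-in for the article body
    text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    articles.append({"url": url, "text": text})

pd.DataFrame(articles).to_csv("scrape/eb_articles.csv", index=False)
```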
To replace the artist's name and pronouns in each article with the gender-neutral "artist"/"artist's", run the following script:
python analysis/mask_name_gender.py
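As an illustration of the masking idea (the real implementation is in analysis/mask_name_gender.py; the input file, column names, and pronoun list below are assumptions):

```python
# Sketch of the masking step: replace the artist's name and gendered pronouns
# with the neutral tokens "artist"/"artist's". File and column names are assumptions.
import re
import pandas as pd

# Assumed mapping of Danish gendered pronouns to neutral tokens
PRONOUNS = {"han": "artist", "hun": "artist", "ham": "artist", "hende": "artist",
            "hans": "artist's", "hendes": "artist's"}

def mask_article(text: str, artist_name: str) -> str:
    # Replace the artist's name (and genitive form) with the neutral token
    text = re.sub(rf"{re.escape(artist_name)}'?s", "artist's", text, flags=re.IGNORECASE)
    text = re.sub(re.escape(artist_name), "artist", text, flags=re.IGNORECASE)
    # Replace gendered pronouns, matching whole words only
    for pronoun, neutral in PRONOUNS.items():
        text = re.sub(rf"\b{pronoun}\b", neutral, text, flags=re.IGNORECASE)
    return text

df = pd.read_csv("scrape/eb_articles.csv")  # assumed input file
df["masked_text"] = [mask_article(t, n) for t, n in zip(df["text"], df["artist_name"])]
df.to_csv("analysis/eb_articles_masked.csv", index=False)
```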
To prepare the dataframe for the modelling steps, run:
python BERT/prep_df_for_distmBERT.py
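A minimal sketch of what such a preparation step could involve, assuming the data is split into train/validation/test sets with a binary gender label (file paths, column names, label encoding, and split sizes are all assumptions):

```python
# Sketch of a typical preparation step: encode gender labels and create stratified
# train/validation/test splits. Every file path and column name here is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("analysis/eb_articles_masked.csv")    # assumed input
df["label"] = (df["gender"] == "female").astype(int)   # assumed label encoding

train, temp = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=42)

train.to_csv("BERT/train.csv", index=False)
val.to_csv("BERT/val.csv", index=False)
test.to_csv("BERT/test.csv", index=False)
```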
From here on, the remaining steps are notebooks rather than scripts, as they were executed in Google Colab.
We hyperparameter tune the nb-BERT-large model on the validation data prior to fine-tuning, using Optuna. This step yields the optimal hyperparameters.
To execute this step, run the nb_BERT_large/nbl_OPTUNA_param_tune.ipynb notebook.
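A sketch of how such a search can be set up with the Hugging Face Trainer and its Optuna backend (requires optuna to be installed); the search space, number of trials, and dataset variables are assumptions, and the notebook holds the actual configuration:

```python
# Sketch of Optuna-based hyperparameter search with the Hugging Face Trainer.
# Search space, trial count, and the tokenized datasets are assumptions.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

MODEL_NAME = "NbAiLab/nb-bert-large"  # public Hub id for nb-BERT-large

def model_init():
    # A fresh model for every trial
    return AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def hp_space(trial):
    # Assumed search space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
    }

training_args = TrainingArguments(output_dir="optuna_runs", evaluation_strategy="epoch")

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_train,  # assumed to exist
    eval_dataset=tokenized_val,     # assumed to exist
)

best_run = trainer.hyperparameter_search(direction="minimize", backend="optuna",
                                         hp_space=hp_space, n_trials=10)
print(best_run.hyperparameters)
```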
We fine-tune the nb-BERT-large model with the hyperparameters found on the validation data above. The fine-tuned model yielding the lowest loss is found at step 174, epoch 3. We apply this model to the classification task on the test data.
To execute this step, run the nb_BERT_large/nbL_Finetune_opt_param.ipynb notebook.
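A sketch of the fine-tuning setup, keeping the checkpoint with the lowest evaluation loss; the concrete hyperparameter values, output directory, and dataset variables are placeholders, and the notebook holds the real configuration:

```python
# Sketch of fine-tuning with the best hyperparameters found above.
# Hyperparameter values, output directory, and datasets are placeholders/assumptions.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("NbAiLab/nb-bert-large", num_labels=2)

training_args = TrainingArguments(
    output_dir="nbl_finetuned",
    learning_rate=2e-5,              # placeholder for the Optuna-selected value
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # assumed to exist
    eval_dataset=tokenized_val,     # assumed to exist
)
trainer.train()
# Checkpoints such as nbl_finetuned/checkpoint-174 can then be reloaded for evaluation.
```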
We apply the model prior to fine-tuning to the classification task on the test data. This model has never seen any of the data before.
To execute this step, run the nb_BERT_large/nbL_pretrained_class.ipynb notebook.
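A sketch of this baseline evaluation, assuming the test set is already tokenised; note that the classification head on top of the pretrained encoder is freshly initialised:

```python
# Sketch of running the pretrained (not fine-tuned) model on the test data as a baseline.
# The tokenized test set is assumed to exist; see nb_BERT_large/nbL_pretrained_class.ipynb.
import numpy as np
from transformers import AutoModelForSequenceClassification, Trainer

pretrained = AutoModelForSequenceClassification.from_pretrained("NbAiLab/nb-bert-large", num_labels=2)
trainer = Trainer(model=pretrained)

predictions = trainer.predict(tokenized_test)  # assumed to exist
predicted_labels = np.argmax(predictions.predictions, axis=-1)
```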
Now the classification performance of the model prior to and post fine-tuning can be compared.
We extract the Integrated Gradients from the nb-BERT-large model prior to fine-tuning.
To execute this step, run the IG/IG_pretrained.ipynb notebook.
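A sketch of how token-level Integrated Gradients can be extracted with Captum's LayerIntegratedGradients over the BERT embedding layer; the example sentence, baseline choice, and target class are assumptions, and the full procedure is in the notebook:

```python
# Sketch of Integrated Gradients extraction with Captum over the embedding layer.
# Baseline (all [PAD] tokens), target class, and the example sentence are assumptions.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-large")
model = AutoModelForSequenceClassification.from_pretrained("NbAiLab/nb-bert-large", num_labels=2)
model.eval()

def forward_func(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

text = "Artist leverede en fremragende koncert."  # hypothetical example sentence
encoding = tokenizer(text, return_tensors="pt")
baseline = torch.full_like(encoding["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_func, model.bert.embeddings)
attributions = lig.attribute(
    inputs=encoding["input_ids"],
    baselines=baseline,
    additional_forward_args=(encoding["attention_mask"],),
    target=1,   # assumed class index
    n_steps=50,
)
# Sum over the embedding dimension to get one attribution score per token
token_scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze(0))
```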
We extract the Integrated Gradients from the nb-BERT-large model post fine-tuning. Thus, the model at checkpoint-174 is loaded and used for the extraction.
To execute this step, run the IG/IG_nbl_FT.ipynb notebook.
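The extraction follows the same pattern as the sketch above; only the loaded model differs. The checkpoint directory below is an assumption carried over from the fine-tuning sketch:

```python
# Load the fine-tuned checkpoint instead of the pretrained weights before extraction;
# the directory name "nbl_finetuned" is an assumption, checkpoint-174 is from the text above.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("nbl_finetuned/checkpoint-174")
model.eval()
```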
We find the Integrated Gradients Differentials by subtracting the absolute values found by the model prior to fine-tuning from those found post fine-tuning. This is done for male and female predictions respectively and visualised.
To execute this step, run the IG/IG_DIFFERENTIALS.ipynb notebook.
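A sketch of the differential computation and a simple visualisation, assuming token-level attributions from both models have been saved to csv files (file names, column names, and the per-token aggregation are assumptions):

```python
# Sketch: subtract absolute pretrained attributions from absolute fine-tuned attributions
# per token. File names, column names, and the merge-on-token aggregation are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

pre = pd.read_csv("IG/ig_pretrained.csv")   # assumed columns: token, attribution
post = pd.read_csv("IG/ig_finetuned.csv")

merged = pre.merge(post, on="token", suffixes=("_pre", "_post"))
merged["ig_differential"] = merged["attribution_post"].abs() - merged["attribution_pre"].abs()

# Visualise the tokens whose attribution grew most after fine-tuning
top = merged.sort_values("ig_differential", ascending=False).head(15)
plt.barh(top["token"], top["ig_differential"])
plt.xlabel("IG differential (|post| - |pre|)")
plt.tight_layout()
plt.show()
```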