Welcome! This repo contains scripts for classifying the issues discussed in political ads. The classifiers are trained on hand-coded 2018 and 2020 advertising data.
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project whose goal is to provide the public with analysis tools for greater transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency, we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step of that pipeline.
The issue classifier, trained on 2018 and 2020 ads - both TV and Facebook - is designed to be applied to uncoded 2022 ads. It is based on issues as coded by the WMP. In total, we code ads into 65 issue categories according to the issue(s) an ad focuses on.
To decide which issues to classify, we looked at which issues occurred at least 100 times in the TV data and excluded two problematic ones (Issues 116 and 209), leaving 65 issues. For a list of the issues of interest, see this spreadsheet.
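For illustration, this selection rule amounts to a simple frequency filter. Below is a minimal sketch in Python, assuming a hypothetical long-format coding file with one row per (ad, issue) pair and an `issue_code` column; the actual WMP coding files may be organized differently.

```python
import pandas as pd

# Hypothetical long-format coding file: one row per (ad, issue) pair.
coded = pd.read_csv("tv_issue_codes.csv")

# Count how often each issue occurs in the TV data.
counts = coded["issue_code"].value_counts()

# Keep issues occurring at least 100 times, then drop the two
# problematic codes (116 and 209), leaving the 65 issues of interest.
issues_of_interest = [
    code for code in counts[counts >= 100].index
    if code not in (116, 209)
]
```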
NOTE: Some parts of the data in the datasets repo include TV data. In particular, the scripts in this repo numbered 01 through 21 use TV data. For contractual reasons, users must apply directly to receive raw TV data. Fill out the online request form to request access to TV datasets.
- First, make sure you have R installed. While R can be run from the terminal, many people find it easier to use RStudio along with R. Here is a tutorial for setting up R and RStudio. The scripts are tested on R 4.0, 4.1, 4.2, 4.3, and 4.4.
- Next, make sure you have the following packages installed in R. The exact version we used of each package is listed in the requirements_r.txt file. These are the versions we tested our scripts on; scripts may also work with other versions, but we cannot ensure this. You can install the specific versions by calling:

  ```R
  install.packages("remotes")
  library(remotes)
  install_version("data.table", version="1.15.4")
  install_version("stringr", version="1.5.1")
  install_version("stringi", version="1.7.12")
  install_version("haven", version="2.5.4")
  install_version("dplyr", version="1.1.4")
  install_version("tidyr", version="1.3.1")
  ```

  Or you can install the most recent versions of each package by running:

  ```R
  install.packages("data.table")
  install.packages("stringr")
  install.packages("stringi")
  install.packages("haven")
  install.packages("dplyr")
  install.packages("tidyr")
  ```
- In order to successfully run each R script, you must first set your working directory. The working directory is the location on your computer that R will use for reading and writing files. You can set it by adding the line

  ```R
  setwd("your/working/directory")
  ```

  to the top of the R scripts, replacing `"your/working/directory"` with your actual working directory. Make sure that your working directory agrees with the paths where any input files exist and where output files will be created. For instance, in the script `01_prepare_fbel.R` the input and output paths are defined as follows:

  ```R
  # Input data
  path_input_data <- "data/fbel_w_train.csv"
  # Output data
  path_output_data <- "data/fbel_prepared.csv"
  ```

  If you do not wish to change either of these paths, your working directory should be set as follows:

  ```R
  setwd("/local/path/to/ad_goal_classifier/")
  ```

  where `/local/path/to/` represents the location at which the ad_goal_classifier folder resides on your computer.
- To execute an R script, run the following command from your terminal from within the directory of the script, replacing `file.R` with the file name of the script you want to run:

  ```bash
  Rscript file.R
  ```
- First, make sure you have Python installed. The scripts are tested on Python 3.9 and 3.10.
- In addition, make sure you have the following packages installed in Python. The exact version we use for each package is listed in the requirements_py.txt file. These are the versions we tested our scripts on; scripts may also work with other versions, but we cannot ensure this. You can install them by running the following commands in your terminal:

  ```bash
  pip3 install pandas==2.2.2
  pip3 install scikit-learn==1.0.2
  pip3 install numpy==1.26.4
  pip3 install joblib==1.4.2
  pip3 install torch==2.3.1
  pip3 install tqdm==4.66.4
  pip3 install transformers==4.41.2
  pip3 install datasets==2.20.0
  ```
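  Alternatively, if requirements_py.txt follows the standard pip requirements format (an assumption; check the file), you can install all pinned versions in one step:

  ```bash
  pip3 install -r requirements_py.txt
  ```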
- We recommend creating and activating a Python virtual environment before running the .py and .ipynb scripts. We create it using Python 3.10, since the scripts have been tested with this version:

  ```bash
  python3.10 -m venv venv
  source venv/bin/activate
  ```

  If you want to stop the virtual environment at some point, you can deactivate it:

  ```bash
  deactivate
  ```
- To execute a Python script, run the following command from your terminal from within the directory of the script, replacing `file.py` with the file name of the script you want to run:

  ```bash
  python3 file.py
  ```
- Some of our scripts use Jupyter Notebook as the Python user interface, so you need to install it to run any script with the `.ipynb` extension. You can install Jupyter Notebook using the following command in your terminal:

  ```bash
  pip install jupyter
  ```
- Start Jupyter Notebook using the following command in your terminal:

  ```bash
  jupyter notebook
  ```
NOTE: If you do not want to train a model from scratch, you can use the trained multi-label classification model we provide. Because the model is too large to host in full on GitHub, only part of it is included in this repo here; the other part must be downloaded from Figshare. Please note that to access the data stored on Figshare, you will need to fill out a brief form, after which you will immediately get data access. Make sure to download it to the same directory as the other part of the model. Keep in mind that we provide a trained model only for multi-label classification.
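Once both parts are in the same directory, the model can be loaded with the standard Hugging Face API. This is a minimal sketch, assuming the combined files form a regular `transformers` checkpoint; the directory name below is hypothetical.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical local directory holding the combined model files
model_dir = "models/issue_multilabel"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(
    model_dir,
    problem_type="multi_label_classification",  # sigmoid head, one output per issue
)
```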
Given that an ad can have multiple issues - or none - there are two basic approaches. One is to train a binary classifier for each issue separately. The other is to train a multi-label classifier that handles all issues together. Binary classifiers tend to have higher precision but lower recall, while multi-label classifiers tend to have lower precision and higher recall; multi-label classifiers generally achieve higher F1 scores. All measures reported here are computed only for the positive instances (1s) of each class, since scoring the negative instances (0s) would yield 97-98% simply because negatives overwhelmingly dominate the data.
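To make that evaluation convention concrete, here is a small scikit-learn sketch of positive-class-only scoring for a single issue (the arrays are illustrative, not our actual predictions):

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative labels for one issue: negatives dominate, as in the real data.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# average="binary" with pos_label=1 scores only the positive class,
# so the many correctly predicted 0s cannot inflate the numbers.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary"
)
print(precision, recall, f1)  # 0.5 0.5 0.5
```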
Although we provide scripts for both binary and multi-label classification, we recommend the multi-label classifier because of its better overall performance (higher F1 scores). For the final model used for inference, we use a transformer-based multi-label model, largely based on code from a recent Political Analysis article. We used only the training scripts from that article (e.g., trainer setup, performance reporting) for our own purposes, not its training data, since that paper trains a text classifier on German-language data. In addition, we use the DistilBERT model from Huggingface instead of the model used in that article (German ELECTRA) to avoid issues with domain knowledge and language.
To train the models, run the scripts in our main directory, numbered from 01 to 31.
- 01_tv_merge_2018_2020_asr_with_wmp_issues.R: This script merges ASR text data for TV ads with the issues of interest. It requires television data that we are contractually unable to share through GitHub; you can request this data by following the instructions at https://mediaproject.wesleyan.edu/dataaccess/. Alternatively, you can skip this step, since we provide the script's output file here.
- 02_tv_impute_18_from_20.py: The 2018 WMP coding is missing a few of the issues that were coded in 2020; in other words, some issues we coded for in 2020 were not coded in 2018. Furthermore, some ads are missing individual issues due to coding errors or inconsistencies. Because of both problems, we impute the missing values by training a binary classifier for each issue with missing data. Binary classifiers are better suited to this task: we want to be cautious and err on the side of more negative instances, and we want to use as much data for imputation as possible. A binary classifier for one issue can use every ad where that issue is coded, whereas the multi-label model can only use ads with no missing issues at all, even if those other issues do not matter for imputing a single issue.
- 11_fb18.R: This script prepares the training data with Facebook 2018 ads and merges them with the transcripts.
- 12_fb20.R: This script prepares the training data with Facebook 2020 ads and merges them with the transcripts.
- 13_combine_18_20_fb.R: This script merges the training data for Facebook 2018 and 2020.
- 14_fb_impute_18_from_20.py: This is the same imputation task, this time for the Facebook 2018 data. Imputation is done separately by source: models trained on Facebook data impute the missing Facebook data, and models trained on TV data impute the missing TV data.
- 21_merge_tv_with_fb.R: This script merges TV training data with Facebook training data.
- 31_train_binary_rf.py: This script trains a separate binary classification model for each issue. Thus, if you run it, you will have 65 models, one for each issue. We use random forest classification.
- 31_train_multilabel_trf_v1.ipynb: This script trains a multilabel classification model. We use a DistilBERT model from Huggingface for training. With 65 categories and a large amount of text data, training could take days on a CPU; for context, it took over three hours on an NVIDIA Tesla P100 GPU with 16 GB of memory. A minimal sketch of the core setup appears after this list.
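The following is a minimal sketch of that core setup, not the full notebook: the two placeholder ads, the label vectors, and the training arguments are illustrative stand-ins.

```python
import numpy as np
from datasets import Dataset, Features, Sequence, Value
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_ISSUES = 65

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=NUM_ISSUES,
    problem_type="multi_label_classification",  # per-issue sigmoid + BCE loss
)

# Placeholder training data: ad text plus a 65-dim multi-hot label vector.
features = Features({
    "text": Value("string"),
    "labels": Sequence(Value("float32"), length=NUM_ISSUES),
})
train_ds = Dataset.from_dict(
    {
        "text": ["placeholder ad text one", "placeholder ad text two"],
        "labels": [np.zeros(NUM_ISSUES).tolist(), np.ones(NUM_ISSUES).tolist()],
    },
    features=features,
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="multilabel_out", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```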
The model performances for both binary and multilabel models are located here.
Once you have the trained model, you can run the inference for Facebook and Google 2022 data. Both Facebook and Google have their own folders for their inference scripts. Scripts are numbered in the order that they should be run.
For Facebook 2022:
- 01_f2022_prep.R: This prepares the inference data. You will need the `fb_2022_adid_text.csv.gz` file, which can be downloaded from Figshare. Please note that to access the data stored on Figshare, you will need to fill out a brief form, after which you will immediately get data access.
- 02_fb2022_inf_multilabel.ipynb & 02_fb2022_inf_binary.ipynb: These scripts carry out the classification task. The first classifies the data using the multilabel model, while the second classifies it using the binary models. (A rough sketch of the multilabel inference step appears after this list.)
- 03_fb2022_post_process.ipynb: This script processes the output data from the previous step. In particular, it combines all detected issues for a given ad into one column.
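For orientation, the multilabel inference step boils down to tokenizing the ad text, running the model, and thresholding the per-issue sigmoid probabilities. A rough sketch follows; the model path, input file name, `text` column, and 0.5 threshold are assumptions, not the exact contents of `02_fb2022_inf_multilabel.ipynb`, and a real run would loop over batches rather than tokenize everything at once.

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/issue_multilabel"  # hypothetical local model path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

# Hypothetical output of the prep script: one row per ad with a text column.
ads = pd.read_csv("data/fb2022_prepared.csv")

inputs = tokenizer(
    ads["text"].tolist(), truncation=True, padding=True, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# One sigmoid probability per issue; tag an ad with every issue above 0.5.
preds = (torch.sigmoid(logits) > 0.5).int().numpy()
```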
For Google 2022:
- 01_g2022_prep.R: This prepares the inference data. You will need the `g2022_adid_01062021_11082022_text.csv.gz` file, which can be downloaded from Figshare. Please note that to access the data stored on Figshare, you will need to fill out a brief form, after which you will immediately get data access.
- 02_g2022_inf_multilabel.ipynb & 02_g2022_inf_binary.ipynb: These scripts carry out the classification task. The first classifies the data using the multilabel model, while the second classifies it using the binary models.
- 03_g2022_post_process.ipynb: This script processes the output data from the previous step. In particular, it combines all detected issues for a given ad into one column (see the sketch after this list).
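Both post-processing scripts implement the same idea: collapse the per-issue indicator columns into a single column listing every detected issue. A minimal pandas sketch of that idea (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical wide-format output: one 0/1 indicator column per issue code.
df = pd.DataFrame(
    {"ad_id": ["a1", "a2"], "ISSUE101": [1, 0], "ISSUE103": [1, 1]}
)
issue_cols = [c for c in df.columns if c.startswith("ISSUE")]

# Combine all detected issues for each ad into one comma-separated column.
df["issues"] = df[issue_cols].apply(
    lambda row: ",".join(c for c in issue_cols if row[c] == 1), axis=1
)
print(df[["ad_id", "issues"]])
#   ad_id             issues
# 0    a1  ISSUE101,ISSUE103
# 1    a2           ISSUE103
```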
The data created by the scripts in this repo is in `csv` format and located here.
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.