Colleges love to send email advertisements, so much so that they become inbox clutter. This project analyzes that spam and looks at some interesting trends in the college-related emails I've received over the past year.
A Svelte frontend of statistics hosted on Netlify.
A number of Node.js scripts to parse emails and get college data. These are designed to be used through GitHub Actions, but can also be run locally.
Run all scripts, including downloading emails and generating statistics.
A set of utilities used to download and parse the data found in data.
To create the same type of visualization locally for your own emails, follow these steps.
- Clone this repository (git clone https://github.com/louismeunier/college-emails.git).
- Delete client/src/data.json, client/src/dates.json, and client/src/updated.json.
- While in the directory containing the repo, run cd client && yarn && cd ../scripts && yarn to install dependencies.
- To access your emails, you'll need to authenticate with the Gmail API.
- Follow these steps to create the project.
- Enable the Gmail API with the scope 'https://www.googleapis.com/auth/gmail.readonly'.
- IMPORTANT: Make sure you add your email address as a tester for your application. Because your project is unverified, it will not work otherwise.
- Download your credentials as JSON, and save them to scripts/credentials.json.
- Run node scripts. If your setup was done correctly, it should prompt you to visit a URL and authenticate. This should save a file, scripts/token.json.
- Run node scripts a second time. It should now actually run the program, and regularly print output to the screen indicating progress.
- Note: it can take quite a while for the scripts to run, around 1.5 minutes per 1000 emails.
- When completed, the scripts should print some tables of output, as well as some statistics of how well the run went.
- It should also have created 3 new files: client/src/dev_data.json, client/src/dev_dates.json, and client/src/dev_updated.json. Remove the dev_ prefix from each file name.
- Run cd client && yarn dev.
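The renaming step in the list above (dropping the dev_ prefix from the generated files) can also be scripted. The following is a minimal sketch, not part of the repository: the helper names stripDevPrefix and promoteDevFiles are hypothetical, and it assumes the generated files sit directly in client/src and end in .json.

```javascript
// Hypothetical helper: not part of the repository's scripts.
const fs = require("fs");
const path = require("path");

// Strip a leading "dev_" from a file name; other names are returned unchanged.
function stripDevPrefix(fileName) {
  return fileName.startsWith("dev_") ? fileName.slice("dev_".length) : fileName;
}

// Rename every dev_*.json file in `dir` and return the list of new names.
function promoteDevFiles(dir) {
  const renamed = [];
  for (const name of fs.readdirSync(dir)) {
    if (name.startsWith("dev_") && name.endsWith(".json")) {
      const target = stripDevPrefix(name);
      fs.renameSync(path.join(dir, name), path.join(dir, target));
      renamed.push(target);
    }
  }
  return renamed;
}

// Example (assumes it is run from the repository root):
// promoteDevFiles(path.join("client", "src"));
```

Renaming by hand works just as well; a script like this is only convenient if you rerun the pipeline often.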
The dataset of college websites, names, locations, etc. was found here.
Because emails are linked to their respective colleges via the domain name of the sender, some emails cannot be linked to a college and are therefore not included in the final statistics. These, however, account for only ~2% of all the emails parsed per run.
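To illustrate how domain-based linking can leave some emails unmatched, here is a rough sketch. It is not the repository's actual code: the function names, the Map of domains to college names, and the fallback to parent domains are all assumptions made for the example.

```javascript
// Hypothetical sketch: names and data shapes are assumptions, not the project's code.

// Pull the domain out of a From header such as "Admissions <info@mail.example.edu>".
function senderDomain(fromHeader) {
  const match = /@([A-Za-z0-9.-]+)/.exec(fromHeader);
  return match ? match[1].toLowerCase() : null;
}

// Try the full sender domain first, then each shorter parent domain
// (mail.example.edu, then example.edu), stopping before the bare TLD.
function matchCollege(fromHeader, collegesByDomain) {
  const domain = senderDomain(fromHeader);
  if (domain === null) return null;
  const labels = domain.split(".");
  for (let i = 0; i + 1 < labels.length; i++) {
    const candidate = labels.slice(i).join(".");
    if (collegesByDomain.has(candidate)) return collegesByDomain.get(candidate);
  }
  return null; // unmatched emails are left out of the statistics
}

// Made-up sample data for illustration only.
const colleges = new Map([["example.edu", "Example University"]]);
console.log(matchCollege("Admissions <info@mail.example.edu>", colleges)); // Example University
console.log(matchCollege("Deals <promo@marketing-vendor.com>", colleges)); // null
```

Emails sent through a third-party marketing domain, as in the second call, have no college domain to match against, which is one plausible source of the unlinked emails.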