Human Rights First (HRF) is a US-based nonprofit, nonpartisan organization concerned with international human rights. At its forefront are American ideals and universal values. For nearly 40 years HRF has challenged the status quo by highlighting the global struggle for human rights and stepping in to demand reform, accountability, and justice. The goal of this project is to create a fully functioning web application capable of visually demonstrating valid and current incidents of police use of force within the United States. The information will help users, such as journalists and casual visitors, formulate their perspectives on current matters. The user interface immediately captures attention with clusters of incidents shown via geotagging.
This project has been worked on by many BloomTech labs teams over the past 10 months. In the final month of development, Labs Cohort 36 was tasked with finalizing our codebase and architecture to deploy a production-ready app. This included: automating our collection of Twitter data, deploying to AWS Elastic Beanstalk, adapting our database architecture to the backend team's schema, labeling 5,000 tweets to retrain our BERT model, creating performance metrics for our model, cleaning our codebase, and updating the documentation.
Front End Dashboard | Data Science API
- Automated through the FastAPI framework in `main.py` to run every four hours.
- Every time it runs, it randomly selects a search query from a set of phrases (police, police brutality, police abuse, police violence) to use in the Twitter API search.
- Relevant functions for the scraper feature can be found in `scraper.py`; a sketch of the general idea follows this list.
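The query rotation itself is simple. A minimal sketch of the idea, assuming tweepy v4 is used for the Twitter API call (the function and variable names here are illustrative, not the exact ones in `scraper.py`):

```python
import random

import tweepy

# Illustrative sketch only -- the deployed logic lives in scraper.py / main.py.
SEARCH_PHRASES = ["police", "police brutality", "police abuse", "police violence"]

def fetch_candidate_tweets(api: tweepy.API, count: int = 100):
    """Pick one of the rotating phrases at random and pull recent tweets for it."""
    query = random.choice(SEARCH_PHRASES)
    # tweepy v4 exposes the standard search endpoint as API.search_tweets
    return api.search_tweets(q=query, lang="en", count=count)
```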
- Invoked through `main.py.form_out`.
- Requires `main.py.advance_all` to run in order to advance each conversation one step (one of the pull requests automates this to run every four minutes). `main.py.advance_all` runs every hour automatically, and a distributed lock ensures only one worker runs at a time.
- Code fragments are left in place so that the Twitter conversational bot can be updated later.
- The checks-made count is updated on each check; check frequency should eventually use exponential backoff (a minimal sketch follows this list).
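Exponential backoff simply means waiting longer between successive checks of a conversation that has gone unanswered. A minimal sketch of the idea, assuming a per-conversation `checks_made` counter (the names and the cap are illustrative, not the current implementation):

```python
BASE_DELAY_MINUTES = 4        # matches the current advance_all cadence
MAX_DELAY_MINUTES = 60 * 24   # stop backing off after a day

def next_check_delay(checks_made: int) -> int:
    """Double the wait after every unanswered check, up to a cap."""
    delay = BASE_DELAY_MINUTES * (2 ** checks_made)
    return min(delay, MAX_DELAY_MINUTES)

# checks_made = 0, 1, 2, 3, ... -> wait 4, 8, 16, 32, ... minutes
```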
- Allows developers to manage migrations safely.
- Connected to `models.py` through the `declarative_base` import.
- Connected to the production DB through the `.env` file (which, for obvious reasons, is not in the repo).
- In the CLI, after generating a virtual environment from `requirements.txt`:
  - To generate a revision file, run `alembic revision --autogenerate`, then spot-check the revision file for errors.
  - To apply that revision, run `alembic upgrade head`.
  - To undo a revision, run `alembic downgrade -1`.
- Bear in mind that revisions do not preserve data if you drop a row, so keep a `pg_dump` file on hand to recreate the database with `psql`.
BERT is an open-source, pre-trained natural language processing (NLP) model from Google. The role of BERT in our project is to take the tweets collected by our Twitter scraper and predict whether each tweet discusses police use of force and, if so, what type of force was used. The model uses a 6-rank classification system as follows:
- Rank 0: No police presence.
- Rank 1: Police are present, but no force detected.
- Rank 2: Open-hand: Officers use bodily force to gain control of a situation. Officers may use grabs, holds, and joint locks to restrain an individual.
- Rank 3: Blunt Force: Officers use less-lethal technologies to gain control of a situation. For example, a baton or projectile may be used to immobilize a combative person.
- Rank 4: Chemical & Electric: Officers use less-lethal technologies to gain control of a situation, such as chemical sprays, projectiles embedded with chemicals, or tasers to restrain an individual.
- Rank 5: Lethal Force: Officers use lethal weapons (guns, explosives) to gain control of a situation.
The BERT model does not currently live in the GitHub repository due to its large file size. When running the app locally, it is best to manually store the `saved_model` file in the `app` directory.
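Once the saved model is in place, loading it and scoring a tweet looks roughly like the sketch below. The path and the assumption that the tokenizer was saved alongside the weights are ours; the production code in `app` may differ.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

MODEL_DIR = "app/saved_model"  # assumed location; adjust to wherever you placed the weights

# Assumes the tokenizer files were saved with the model; otherwise load "bert-base-uncased".
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def predict_force_rank(text: str) -> int:
    """Return the predicted force rank (0-5) for a single tweet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=1))
```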
Taking a deeper dive, we can turn our eyes to the black box of our model. For this task we will use LIME, an acronym for local interpretable model-agnostic explanations. Local refers to local fidelity, meaning we want the explanation to really reflect the behavior of the classifier "around" the instance being predicted. Interpretable refers to making sense of these explanations. Lastly, model-agnostic refers to giving explanations without needing to 'peek' inside the model.
How does LIME work? For our problem we will use the LIME TextExplainer. The TextExplainer generates many texts similar to the document (by removing some words), then trains a white-box classifier to predict the output of the black-box classifier. This process can be broken down into three simple steps: first, generate texts; second, predict probabilities for these generated texts; third, train another classifier to predict the output of the black-box classifier. While black boxes are hard to approximate globally, this algorithm works by approximating them with a white-box classifier in a small neighborhood around the given text. Finally, let's look at some visualizations! Below, LimeTextExplainer shows the weight of each word in an incident report.
In the picture above, the model is predicting class 5 with 100% probability. Within the incident report, the word “shot” has the highest weight for class 5, at 0.22. This means that if we removed the word “shot” from the incident report, we would expect the model to predict class 5 with probability 100% - 22% = 78%. Conversely, the words “handgun” and “was” have small negative weights.
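A hedged sketch of how such an explanation can be produced with the `lime` library. Here `predict_proba` is any function that maps a list of texts to an array of class probabilities; we wrap the BERT model as above, and the model path is again an assumption:

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import BertForSequenceClassification, BertTokenizer

MODEL_DIR = "app/saved_model"  # assumed location of the saved model
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def predict_proba(texts):
    """LIME's expected interface: list of texts -> (n_texts, 6) array of probabilities."""
    inputs = tokenizer(list(texts), return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=1).numpy()

explainer = LimeTextExplainer(class_names=[f"Rank {i}" for i in range(6)])
explanation = explainer.explain_instance(
    "The officer drew his handgun and the man was shot.",  # example incident report
    predict_proba,
    num_features=10,   # top 10 words by weight
    labels=(5,),       # explain the lethal-force class
)
print(explanation.as_list(label=5))  # [(word, weight), ...]
```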
There are two notebooks pertaining to the model:
- `FrankenBERT_Training.ipynb`: trains a BERT instance on the data from the `training` table in our `postgres` AWS EB database and on our generated tweets.
- `FrankenBERT_Performance.ipynb`: used for statistical analysis and to calculate model performance metrics (e.g., binary and multi-class confusion matrices, accuracy); a sketch of this kind of calculation follows this list.
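A hedged sketch of the kind of metrics the performance notebook computes, using scikit-learn; `y_true` and `y_pred` stand in for the held-out labels and FrankenBERT's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder data: true force ranks and the ranks the model predicted.
y_true = [0, 5, 2, 1, 5, 3]
y_pred = [0, 5, 2, 2, 5, 3]

print(accuracy_score(y_true, y_pred))                           # overall accuracy
print(confusion_matrix(y_true, y_pred, labels=list(range(6))))  # 6x6 multi-class matrix

# Binary view: "force used" (rank >= 2) vs. "no force"
print(confusion_matrix([y >= 2 for y in y_true], [y >= 2 for y in y_pred]))
```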
There is a supplementary notebook for generating synthetic tweets with GPT-2:
- `Training_GPT_2_w_GPU.ipynb`: trains GPT-2 on the force-rank classes using data from our `postgres` AWS database, then generates batches of synthetic tweets.
These notebooks can be run from your virtual environment once all dependencies are installed in it. Two additional libraries, Transformers and psycopg2-binary, are installed by the first cell of each notebook.
Old and currently undeployed code is stored in the `archive` folder of the repo. Some files are kept to show the evolution of the code from previous BloomTech Labs cohorts to the currently deployed code. Others are starter code that could provide inspiration for features that were deprioritized for the initial release (e.g., the conversational Twitter bot). A more in-depth description of each file is in a markdown file in the `archive` directory.
In the `test` folder there is a FastAPI TestClient script that exercises all API endpoints. The TestClient lets developers check that application endpoints are working as expected, helps keep junk data out of the database, and makes debugging easier with custom pytest reports. To run the tests, first install pytest (see https://www.guru99.com/pytest-tutorial.html), then run `pytest` from the root directory; a minimal example of the pattern follows.
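The tests follow the standard FastAPI `TestClient` pattern. A minimal sketch (the route shown is illustrative, not necessarily an exact endpoint of the API):

```python
from fastapi.testclient import TestClient

from app.main import app  # the FastAPI instance the tests exercise

client = TestClient(app)

def test_root_returns_ok():
    """Call an endpoint in-process and assert on the response."""
    response = client.get("/")  # illustrative route
    assert response.status_code == 200
```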
For those interested in improving upon the data science codebase, here are some recommendations:
- Explore the efficacy of separating the AWS 'postgres' database into two different databases. The first database would be the primary database for the Twitter scraper outputs and DS would redesign the schema to fit their needs. The second database would be the primary database for backend and they could extract data from the DS database and fit the schema to their needs. Currently, the primary AWS data table 'force_ranks' is accessible in both the data science and backend codebases.
- Develop an evidence-based strategy to maximize the effectiveness of our Twitter queries in the scraper feature. Currently, the Twitter API has a 500 tweet limit per scraping. This would include developing metrics to compare querying methods. Metrics would allow us to determine which methods return a greater percentage of tweets describing police use-of-force in the United States.
- The stakeholder would like us to filter out incidents based on location before an incident is put into the database. This means we would have to try to gather location from the initial tweet, and the scraper function may need to be reworked slightly to accommodate this; a rough sketch follows this list.
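As a rough illustration of the location-filtering idea only: tweepy `Status` objects expose a `place` attribute when a tweet is geotagged, but most tweets are not, so a real implementation would also need a text-based fallback.

```python
def is_probably_in_us(status) -> bool:
    """Sketch: keep a tweet only if Twitter's own geotag says it is in the US.

    `status` is a tweepy Status object; `place` is None for the large majority
    of tweets, so this check alone would discard many relevant incidents.
    """
    place = getattr(status, "place", None)
    return place is not None and place.country_code == "US"
```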
Philip Feiran Lee | Michael Carrier | Christopher Chilton |
---|---|---|
Technical Project Manager | Machine Learning Ops | Outside-Consultant |
Christopher Chilton | Ian Knight | Gabriel Nosek | Michael Carrier |
---|---|---|---|
DS Project Manager | Data Engineer | Machine Learning Engineer | Machine Learning Ops |
Ryan Fikejs | Imani Kirika | Joshua Elamin | Rowen Witt |
---|---|---|---|
Technical Project Manager | Technical Project Manager | Technical Project Manager | Data Engineer |
Brody Osterbuhr | Rhia George | Andrew Haney | Murat Benbanaste |
---|---|---|---|
Data Scientist: ML Ops | Machine Learning Engineer | Data Scientist: ML Ops | Machine Learning Engineer |
Hillary Khan | Marcos Morales | Eric Park |
---|---|---|
Data Scientist | Data Scientist | Data Scientist |
In order for the app to function correctly, the user must set up their own environment variables. There should be a .env file containing the following (a minimal loading sketch follows the list):
1. Twitter API connection (through tweepy): use the HRF Twitter developer account.
a. CONSUMER_KEY=
b. CONSUMER_SECRET=
c. ACCESS_KEY=
d. ACCESS_SECRET=
2. Postgres database connection
a. DB_URL= <Currently pointing at production Database>
3. Map API credentials
a. MAP_API= <Credentials for the Google Maps API>
4. Bot variables
a. BOT_NAME= <This can be anything. Currently being stored in env but can move locations>
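A minimal sketch of how the app can read these values, assuming `python-dotenv` is the loader (the variable names are the ones listed above; whether the project uses python-dotenv or another mechanism, the idea is the same):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read the .env file in the project root into the process environment

CONSUMER_KEY = os.getenv("CONSUMER_KEY")
CONSUMER_SECRET = os.getenv("CONSUMER_SECRET")
ACCESS_KEY = os.getenv("ACCESS_KEY")
ACCESS_SECRET = os.getenv("ACCESS_SECRET")
DB_URL = os.getenv("DB_URL")
MAP_API = os.getenv("MAP_API")
BOT_NAME = os.getenv("BOT_NAME")
```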
For AWS deployment we used requirements.txt to store our dependencies. Here are the steps to create a virtual environment and install dependencies from requirements.txt to run the app locally. Alternative instructions for creating a Pipfile with pipenv follow.
- clone the repo
- cd into repo
- create virtual environment:
$ python3 -m venv name_for_env
- activate virtual environment:
$ source name_for_env/bin/activate
- check activation:
$ which python
# should return:
# name_for_env/bin/python
- install all dependencies with requirements.txt:
$ python3 -m pip install -r requirements.txt
- run the API locally on your machine
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker
or:
$ uvicorn app.main:app --reload
- close the app with control+c in terminal
- deactivate environment:
$ deactivate
If you prefer to use pipenv and create a pipfile from our requirements.txt:
- clone the repo
- cd into repo
- install the pip environment (this will create a Pipfile for you):
$ pipenv install
- activate the environment:
$ pipenv shell
- run the API locally on your machine
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker
or:
$ uvicorn app.main:app --reload
- close the app with control+c in terminal
- deactivate environment:
$ exit
- clone the repo
- cd into repo
- create virtual environment:
$ py -m venv env
- activate virtual environment:
$ .\env\Scripts\activate
- check activation:
$ where python
# should return a path ending in:
# env\Scripts\python.exe
- install all dependencies with requirements.txt:
$ py -m pip install -r requirements.txt
- run the API locally on your machine
$ uvicorn app.main:app --reload
- close the app with control+c in terminal
- deactivate environment:
$ deactivate
If you prefer to use pipenv and create a pipfile from our requirements.txt:
- clone the repo
- cd into repo
- install the pip environment (this will create a Pipfile for you):
$ pipenv install
- activate the environment:
$ pipenv shell
- run the API locally on your machine
$ uvicorn app.main:app --reload
- close the app with control+c in terminal
- deactivate environment:
$ exit