
Overview

The Human Rights First Organization is a US-based nonprofit, nonpartisan organization concerned with international human rights. At its forefront are American ideals and universal values. For nearly 40 years HRF has challenged the status quo by highlighting the global struggle for human rights and stepping in to demand reform, accountability, and justice. The goal of this project is to create a fully functioning web application capable of visually demonstrating valid and current incidents of police use of force within the United States. The information helps users, such as journalists and passersby, form their own perspectives on current events. The user interface immediately captures attention with geotagged clusters of incidents.

This project has been worked on by many BloomTech Labs teams over the past 10 months. In the final month of development, Labs Cohort 36 was tasked with finalizing the codebase and architecture to deploy a production-ready app. This included: automating our collection of Twitter data, deploying to AWS Elastic Beanstalk, adapting our database architecture to the backend team's schema, labeling 5,000 tweets to retrain our BERT model, creating performance metrics for our model, cleaning our codebase, and updating the documentation.


Features

Deployed Product

Front End Dashboard | Data Science API


Twitter Scraper

  • Automated through the FastAPI framework in main.py to run every four hours (see the scheduling sketch below)
  • Every time it runs, it randomly selects a search query from a set of phrases (police, police brutality, police abuse, police violence) to use in the Twitter API search
  • Relevant functions for the scraper feature can be found in scraper.py
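
The four-hour schedule described above can be reproduced with the `repeat_every` helper from fastapi-utils (which is in the dependency list). This is a minimal sketch rather than the repo's actual main.py; `run_scraper` is a hypothetical stand-in for the functions in scraper.py.

```python
# Scheduling sketch only -- assumes scraper.py exposes search/ingest logic
# that can be wrapped in a single callable; names here are illustrative.
import random

from fastapi import FastAPI
from fastapi_utils.tasks import repeat_every

app = FastAPI()

SEARCH_PHRASES = ["police", "police brutality", "police abuse", "police violence"]


def run_scraper(query: str) -> None:
    """Placeholder for the search-and-store logic implemented in scraper.py."""
    ...


@app.on_event("startup")
@repeat_every(seconds=60 * 60 * 4)  # every four hours, as described above
def scheduled_scrape() -> None:
    run_scraper(random.choice(SEARCH_PHRASES))
```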

Twitter Bot

  • Invoked through main.py.form_out
  • main.py.advance_all advances each conversation one step; it runs automatically every hour (one pull request automates it at a four-minute interval), and a distributed lock ensures only one worker runs it at a time
  • Code fragments are left in place so the Twitter conversational bot can be updated later
  • A checks-made counter is updated on each check; exponential backoff on check frequency still needs to be implemented (see the sketch below)
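
Exponential backoff simply doubles the wait between successive checks of a conversation, up to a cap. Below is a minimal sketch of how it could be applied to the check frequency mentioned above; the field and function names are hypothetical, not taken from the codebase.

```python
# Hypothetical backoff helper: the delay doubles with each check already made,
# capped so a conversation is still revisited at least once a day.
BASE_DELAY_MINUTES = 4
MAX_DELAY_MINUTES = 60 * 24


def next_check_delay(checks_made: int) -> int:
    """Return the number of minutes to wait before re-checking a conversation."""
    return min(BASE_DELAY_MINUTES * (2 ** checks_made), MAX_DELAY_MINUTES)


# Example schedule: 4, 8, 16, 32, 64, ... minutes, capped at 1440.
```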

Alembic

  • Allows developers to manage database migrations safely
  • Connected to models.py through the declarative_base import
  • Connected to the production DB through the .env file (not committed to the repo); a sketch of this wiring follows the list
  • In the CLI, after creating a virtual environment from requirements.txt:
  • to generate a revision file, run `alembic revision --autogenerate`, then spot-check the revision file for errors
  • to apply that revision, run `alembic upgrade head`
  • to undo a revision, run `alembic downgrade -1`
  • bear in mind that revisions do not preserve data if you drop a column or table, so keep a pg_dump file on hand to recreate the database with psql
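
The wiring between Alembic, models.py, and the .env file typically lives in alembic/env.py. Below is a hedged sketch of that wiring, assuming models.py exposes a Base created with declarative_base and the .env file defines DB_URL as described under Environment Variables; the exact import path is an assumption about this repo's layout.

```python
# alembic/env.py excerpt (sketch) -- only meant to show how the pieces connect.
import os

from dotenv import load_dotenv
from alembic import context

from app.models import Base  # hypothetical import path for the declarative_base() instance

load_dotenv()  # pulls DB_URL from the .env file described below
config = context.config
config.set_main_option("sqlalchemy.url", os.environ["DB_URL"])

# --autogenerate compares this metadata against the live database
target_metadata = Base.metadata
```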

BERT Model

BERT is an open-source, pre-trained natural language processing (NLP) model from Google. The role of BERT in our project is to take the tweets collected by our Twitter scraper and predict whether or not each tweet discusses police use of force and, if so, what type of force was used. The model uses a six-rank classification system as follows:

  • Rank 0: No police presence.
  • Rank 1: Police are present, but no force detected.
  • Rank 2: Open-hand: Officers use bodily force to gain control of a situation. Officers may use grabs, holds, and joint locks to restrain an individual.
  • Rank 3: Blunt Force: Officers use less-lethal technologies to gain control of a situation. Baton or projectile may be used to immobilize a combative person for example.
  • Rank 4: Chemical & Electric: Officers use less-lethal technologies to gain control of a situation, such as chemical sprays, projectiles embedded with chemicals, or tasers to restrain an individual.
  • Rank 5: Lethal Force: Officers use lethal weapons (guns, explosives) to gain control of a situation.

The BERT model does not currently live in the GitHub repository due to its large file size. When running the app locally, it is best to manually store the saved_model file in the app directory.
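
For reference, inference with a model saved this way usually looks like the sketch below. It assumes the model was written out with save_pretrained() to app/saved_model and that the base checkpoint was bert-base-uncased; adjust both to match the actual training setup.

```python
# Hedged inference sketch -- paths and checkpoint name are assumptions.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("app/saved_model", num_labels=6)
model.eval()

tweet = "Officers fired tear gas at protesters downtown."
inputs = tokenizer(tweet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

force_rank = int(logits.argmax(dim=1))  # 0-5, per the ranking above
print(force_rank)
```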

BERT Rankings

Taking a deeper dive, we can turn our eyes to the black box of our model. For this task we will use LIME, an acronym for local interpretable model-agnostic explanations. Local refers to local fidelity, meaning we want the explanation to really reflect the behaviour of the classifier "around" the instance being predicted. Interpretable refers to explanations a human can make sense of. Lastly, model-agnostic refers to giving explanations without needing to 'peek' inside the model.

How does LIME work? For our problem we use the LIME TextExplainer. The TextExplainer generates many texts similar to the document (by removing some words), then trains a white-box classifier to predict the output of the black-box classifier. The process breaks down into three simple steps: first, generate perturbed texts; second, predict probabilities for these generated texts with the black-box classifier; third, train another, simpler classifier to approximate the black-box classifier in the small neighborhood around the given text. Finally, let's look at some visualizations! Below, LimeTextExplainer shows the weight of each word in an incident report.

[Screenshot: LimeTextExplainer word weights for an incident report]

In the picture above, the model predicts class 5 with 100% probability. Within the incident report, the word “shot” has the highest weight for class 5, at 0.22. This means that if we removed the word “shot” from the incident report, we would expect the model to predict class 5 with probability 100% - 22% = 78%. Conversely, the words “handgun” and “was” have small negative weights.
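
A hedged sketch of the LIME workflow described above: the BERT classifier is wrapped in a predict_proba function (an assumed helper, not repo code) so that LimeTextExplainer can query it on perturbed texts.

```python
# LIME sketch -- model path, checkpoint, and example text are illustrative.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("app/saved_model", num_labels=6)
model.eval()


def predict_proba(texts):
    """Return an (n_samples, 6) array of class probabilities for LIME."""
    inputs = tokenizer(list(texts), return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=1).numpy()


explainer = LimeTextExplainer(class_names=[f"Rank {i}" for i in range(6)])
explanation = explainer.explain_instance(
    "The officer shot the suspect with his handgun.",  # illustrative report text
    predict_proba,
    labels=[5],
    num_features=10,
)
print(explanation.as_list(label=5))  # (word, weight) pairs like those in the screenshot
```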

Notebooks

There are two notebooks pertaining to the model:

  • FrankenBERT_Training.ipynb: trains a BERT instance on the data from the training table in our postgres AWS Elastic Beanstalk database and our generated tweets
  • FrankenBERT_Performance.ipynb: used for statistical analysis and to calculate model performance metrics (e.g., binary and multi-class confusion matrices, accuracy)

There is a supplementary notebook for generating synthetic tweets with GPT-2:

  • Training_GPT_2_w_GPU.ipynb: trains GPT-2 on force-rank classes based on the data from our postgres AWS database before generating batches of synthetic tweets

These notebooks can be accessed from your virtual environment once all dependencies are installed within it. Two additional libraries, Transformers and psycopg2-binary, are installed by the first cell in each notebook.
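
For orientation, generating text from a fine-tuned GPT-2 checkpoint with the Transformers pipeline API generally looks like the sketch below; the checkpoint, prompt, and generation settings are placeholders rather than what the notebook actually uses.

```python
# GPT-2 generation sketch -- swap "gpt2" for the fine-tuned checkpoint directory.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
samples = generator(
    "Rank 3:",  # hypothetical class-conditioning prompt
    max_length=60,
    num_return_sequences=5,
    do_sample=True,
)
for sample in samples:
    print(sample["generated_text"])
```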


DS Architecture

[Diagram: data science architecture]


Old Codebase

Old and currently undeployed code is stored in the archive folder of the repo. Some files are kept to show the evolution of the code from previous BloomTech Labs cohorts to the current deployed code. Others are starter code that could inspire features deprioritized for the initial release (e.g. the conversational Twitter bot). A more in-depth description of each file is stored in a markdown file in the archive directory.


FastAPI Test Client

In the test folder there is a FastAPI TestClient script that exercises all API endpoints. The FastAPI TestClient lets developers check that application endpoints work as expected, helps keep junk data out of the database, and makes debugging easier with custom pytest reports. To run the tests, you must first have pytest installed (https://www.guru99.com/pytest-tutorial.html); then, from the root directory, run $ pytest.
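
A test in that style is just a few lines with FastAPI's TestClient; the sketch below assumes the app object lives at app.main:app, and the route being hit is illustrative rather than taken from the repo.

```python
# test/test_endpoints.py style sketch -- route and expected status are assumptions.
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_root_returns_ok():
    response = client.get("/")
    assert response.status_code == 200
```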


Next Steps

For those interested in improving upon the data science codebase, here are some recommendations:

  • Explore the efficacy of separating the AWS 'postgres' database into two different databases. The first database would be the primary database for the Twitter scraper outputs and DS would redesign the schema to fit their needs. The second database would be the primary database for backend and they could extract data from the DS database and fit the schema to their needs. Currently, the primary AWS data table 'force_ranks' is accessible in both the data science and backend codebases.
  • Develop an evidence-based strategy to maximize the effectiveness of our Twitter queries in the scraper feature. Currently, the Twitter API has a 500-tweet limit per scrape. This would include developing metrics to compare querying methods; such metrics would allow us to determine which methods return a greater percentage of tweets describing police use of force in the United States.
  • The stakeholder would like us to filter out incidents by location before an incident is put into the database. This means we would have to try to gather the location from the initial tweet, and the scraper function may need to be reworked slightly to accommodate this.



Labs 39 Contributors

| Philip Feiran Lee | Michael Carrier | Christopher Chilton |
| --- | --- | --- |
| Technical Project Manager | Machine Learning Ops | Outside-Consultant |

Labs 38 Contributors

| Christopher Chilton | Ian Knight | Gabriel Nosek | Michael Carrier |
| --- | --- | --- | --- |
| DS Project Manager | Data Engineer | Machine Learning Engineer | Machine Learning Ops |

Labs 37 Contributors

| Ryan Fikejs | Imani Kirika | Joshua Elamin | Rowen Witt |
| --- | --- | --- | --- |
| Technical Project Manager | Technical Project Manager | Technical Project Manager | Data Engineer |
| Brody Osterbuhr | Rhia George | Andrew Haney | Murat Benbanaste |
| Data Scientist: ML Ops | Machine Learning Engineer | Data Scientist: ML Ops | Machine Learning Engineer |

Labs 36 Contributors

| Hillary Khan | Marcos Morales | Eric Park |
| --- | --- | --- |
| Data Scientist | Data Scientist | Data Scientist |








Getting Started

Dependencies

pandas numpy scikit-learn torch transformers spacy plotly tweepy beautifulsoup4 SQLAlchemy dataset python-dotenv uvicorn fastapi fastapi-utils


Environment Variables

In order for the app to function correctly, the user must set up their own environment variables. There should be a .env file containing the following:

1. Twitter API connection (through tweepy; use the HRF Twitter developer account)
	a. CONSUMER_KEY=
	b. CONSUMER_SECRET=
	c. ACCESS_KEY=
	d. ACCESS_SECRET=
2. Postgres database connection
	a. DB_URL= <currently pointing at the production database>
3. Map API credentials
	a. MAP_API= <credentials for the Google Maps API>
4. Bot variables
	a. BOT_NAME= <this can be anything; currently stored in .env but can move locations>
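
These variables are typically consumed with python-dotenv and tweepy, both of which are in the dependency list. The sketch below shows one plausible way the app might load them; actual usage in the codebase may differ.

```python
# Environment-loading sketch -- variable names match the list above; the rest
# is an assumption about how they are consumed.
import os

import tweepy
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

auth = tweepy.OAuthHandler(os.getenv("CONSUMER_KEY"), os.getenv("CONSUMER_SECRET"))
auth.set_access_token(os.getenv("ACCESS_KEY"), os.getenv("ACCESS_SECRET"))
twitter_api = tweepy.API(auth)

DB_URL = os.getenv("DB_URL")    # SQLAlchemy/psycopg2 connection string
MAP_API = os.getenv("MAP_API")  # Google Maps credentials
BOT_NAME = os.getenv("BOT_NAME")
```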

Installation Instructions and running API locally

For AWS deployment we used requirements.txt to store our dependencies. Here are the steps to create a virtual environment and install dependencies from requirements.txt to run the app locally. Alternative instructions for creating a Pipfile with pipenv follow.

MacOS:

  1. clone the repo
  2. cd into repo
  3. create virtual environment:
$ python3 -m venv name_for_env
  4. activate virtual environment:
$ source name_for_env/bin/activate
  5. check activation:
$ which python
# should return:
#   name_for_env/bin/python
  6. install all dependencies with requirements.txt:
$ python3 -m pip install -r requirements.txt
  7. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  8. close the app with control+c in terminal
  9. deactivate environment:
$ deactivate

If you prefer to use pipenv and create a pipfile from our requirements.txt:

  1. clone the repo
  2. cd into repo
  3. install pip environment (this will create a Pipfile for you):
$ pipenv install
  4. activate the environment:
$ pipenv shell
  5. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  6. close the app with control+c in terminal
  7. deactivate environment:
$ exit

Windows:

  1. clone the repo
  2. cd into repo
  3. create virtual environment:
$ py -m venv env
  4. activate virtual environment:
$ .\env\Scripts\activate
  5. check activation:
$ where python
# should return a path ending in:
#   env\Scripts\python.exe
  6. install all dependencies with requirements.txt:
$ py -m pip install -r requirements.txt
  7. run the API locally on your machine:
$ uvicorn app.main:app --reload
  8. close the app with control+c in terminal
  9. deactivate environment:
$ deactivate

If you prefer to use pipenv and create a pipfile from our requirements.txt:

  1. clone the repo
  2. cd into repo
  3. install pip environment (this will create a Pipfile for you):
$ pipenv install
  4. activate the environment:
$ pipenv shell
  5. run the API locally on your machine:
$ uvicorn app.main:app --reload
  6. close the app with control+c in terminal
  7. deactivate environment:
$ exit

How to access DB from browser

[Screenshots: database credentials and map view]