Zacharia-Schmitz/Project_LA_Crime

Evaluating Inglewood Area Robberies


Zacharia Schmitz

3 October 2023

Heatmap of Robberies in Cluster 4 (Inglewood area)


Project Plan:

(Jump To)

Project Overview

Data Acquisition

Preparation

Exploration

Models

Conclusion




From the City of Los Angeles website about the data:
  • This dataset reflects incidents of crime in the City of Los Angeles from 2010 - 2019.

  • This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data.

  • Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy.

  • This data is as accurate as the data in the LA database.





Overview

Project Goals

  • Initially: Explore areas with high rates of gun violence in the Los Angeles area and what people can do to protect themselves.

  • Upon further evaluation: After clustering identified it, explore the drivers of the high (almost double) robbery rate in the Inglewood area. Identify features and develop a model to encourage public safety and prevent crime.





Project Description

  • I initially started with 2 datasets from the city of Los Angeles.

  • After combining them, there were close to 3 million rows (2,943,476).

  • Each row was a unique incident of crime in the Los Angeles area.

  • There were 29 different features for each crime. These covered only the initial report, not the final closed-case details.

  • The target variable was a column created from the crime description, 'is_robbery'.

  • The comparison was ultimately Cluster 4 (Inglewood Area) versus all of the other areas.





Initial Hypotheses

  • Time of day is probably a large driver of robberies. Most likely more at night.

  • There are probably not very many robberies during the day.

  • Gender and age are most likely drivers of robberies, with offenders probably preying on females and on the very young and the old.

  • Most robberies are most likely occurring at gunpoint, or assumed to be at gunpoint.





Acquire

  • 2010 to 2019 Data (crime_data_2010_2019).csv

https://data.lacity.org/Public-Safety/Crime-Data-from-2010-to-2019/63jg-8b9z

  • 2020 to 25 Sep 23 (crime_data_2020_2023).csv

https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8

Both dataframes shared the same features, and were merged together.

  • Merged CSV Format (crime_data.csv)

https://drive.google.com/file/d/14FBSb-iADac0jENqeEfceTcmRi2Btaoh/view?usp=sharing
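
As a rough illustration of the merge described above, the sketch below stacks two frames that share the same columns. The helper name combine_crime_frames and the column names report_id and crime_description are placeholders I made up for the example, and the tiny frames stand in for the two CSVs (which would be loaded with pd.read_csv).

```python
import pandas as pd


def combine_crime_frames(older: pd.DataFrame, newer: pd.DataFrame) -> pd.DataFrame:
    """Stack two frames that share the same columns and renumber the index."""
    if list(older.columns) != list(newer.columns):
        raise ValueError("column sets differ; align them before combining")
    return pd.concat([older, newer], ignore_index=True)


# Stand-in frames with hypothetical column names.
older = pd.DataFrame({"report_id": [1, 2], "crime_description": ["ROBBERY", "BURGLARY"]})
newer = pd.DataFrame({"report_id": [3], "crime_description": ["ATTEMPTED ROBBERY"]})

combined = combine_crime_frames(older, newer)
print(len(combined))
```

Checking the columns first is a cheap guard: pd.concat would otherwise silently introduce NaN-filled columns if the two files ever drift apart.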





Pre-Modeling Data Dictionary:

Column            Definition
is_robbery        Engineered from the crime description: true for robbery and attempted robbery
is_street         Engineered from the premise description: true if the crime occurred on the street
victim_sex        Encoded column for the victim's sex: F = Female, M = Male, X = Non-binary
victim_descent    The victim's ethnicity





Preparation Steps

Logically Rename Columns

  • Rename columns to make sense for exploration.

Choose columns to work with

  • Based on what we're looking for, we'll start with only looking at some of the features.

Fix Time Occurred

  • Change time occurred to 4-digit 24-hour time. We won't move it to the index for now, because we may not be doing any time series analysis.
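
One way this step could look, assuming the raw time arrives as an integer in HHMM form that has lost its leading zeros (the sample values are made up):

```python
import pandas as pd

# Hypothetical raw values: 1 means 00:01, 930 means 09:30.
raw_time = pd.Series([1, 45, 930, 2359])

# Left-pad with zeros so every value is a 4-character HHMM string.
time_occurred = raw_time.astype(str).str.zfill(4)
print(time_occurred.tolist())
```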

Fix Victim Age

  • Looks like we have negative values in the victim_age. We will have to drop those, since no fair assumptions can be made.

Fix Victim Descent

  • We'll map the victim_descent based on abbreviations and assign nulls 'unknown' for exploration before potentially dropping nulls.

Fix Victim Sex

  • We'll have to map victim_sex. We can assume F = Female, M = Male, X = Non-binary. Due to the large number of other values, we will mark them 'unknown' rather than imputing any assumptions.
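
A minimal sketch of that mapping, using made-up sample codes. Anything outside F, M, and X, including nulls, falls through to 'unknown'.

```python
import pandas as pd

SEX_LABELS = {"F": "Female", "M": "Male", "X": "Non-binary"}

# Hypothetical raw codes, including stray values and a null.
raw_sex = pd.Series(["F", "M", "X", "H", None, "-"])

# map() yields NaN for codes missing from the dictionary; fillna labels those.
victim_sex = raw_sex.map(SEX_LABELS).fillna("unknown")
print(victim_sex.tolist())
```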

Nulls in premise_description

  • Very small amount of nulls in premise_description (0.00023). We'll assign them 'unknown'.

Weapon_description

  • 66% Nulls in weapon_description, we'll fill them with 'No Weapon'.

Weapon_category

  • Make a weapon bin out of all of the weapon_descriptions. No Weapon, Firearm, Melee Object, Threats, Vehicle, Other
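
The two weapon steps above could be sketched as follows. The keyword rules and sample descriptions are assumptions for illustration only; the actual descriptions in the data and the rules used in the notebook may differ.

```python
import pandas as pd

# Hypothetical keyword rules; the first matching category wins.
WEAPON_RULES = [
    ("Firearm", ("GUN", "PISTOL", "REVOLVER", "RIFLE", "FIREARM")),
    ("Melee Object", ("KNIFE", "BAT", "CLUB", "PIPE", "MACHETE")),
    ("Threats", ("THREAT",)),
    ("Vehicle", ("VEHICLE",)),
]


def weapon_category(description: str) -> str:
    """Bin a weapon description into one of the six categories."""
    text = description.upper()
    if text == "NO WEAPON":
        return "No Weapon"
    for category, keywords in WEAPON_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"


raw_weapon = pd.Series(["HAND GUN", None, "KITCHEN KNIFE", "VERBAL THREAT", "VEHICLE", "ROCK"])

# Fill nulls first, then bin each description.
weapon_description = raw_weapon.fillna("No Weapon")
weapon_category_col = weapon_description.map(weapon_category)
print(weapon_category_col.tolist())
```

Substring matching is crude (a keyword can appear inside an unrelated word), so the rules would need checking against the real list of descriptions.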

Create report_delay column

  • Take the date of reporting, subtract the crime_occurred.
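
As a sketch, with assumed column names date_reported and date_occurred and made-up dates, the delay in whole days could be computed like this:

```python
import pandas as pd

df = pd.DataFrame({
    "date_reported": ["2020-01-05", "2020-03-01"],
    "date_occurred": ["2020-01-01", "2020-03-01"],
})

# Parse both columns as datetimes before subtracting.
for col in ("date_reported", "date_occurred"):
    df[col] = pd.to_datetime(df[col])

# The difference is a timedelta; keep only the whole days.
df["report_delay"] = (df["date_reported"] - df["date_occurred"]).dt.days
print(df["report_delay"].tolist())
```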

Fix Part 1 & 2

  • Part 1 covers the more heinous crimes and Part 2 the less heinous ones.

Fix latitude and longitude

  • A total of 3321 values of 0 in both lat/long. We'll sort by the area name, and forward fill so they stay in the correct area.
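
A small sketch of the sort-and-forward-fill idea, using made-up coordinates and assumed column names (area_name, lat, lon):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "area_name": ["Central", "Southwest", "Central", "Southwest"],
    "lat": [34.05, 34.01, 0.0, 0.0],
    "lon": [-118.25, -118.30, 0.0, 0.0],
})

# Treat (0, 0) as missing rather than as a real coordinate.
missing = (df["lat"] == 0) & (df["lon"] == 0)
df.loc[missing, ["lat", "lon"]] = np.nan

# Sort by area so each gap is filled from a neighbouring row in the same area,
# then restore the original row order.
df = df.sort_values("area_name", kind="stable")
df[["lat", "lon"]] = df[["lat", "lon"]].ffill()
df = df.sort_index()
print(df)
```

One caveat with this approach: if the first row of an area is missing, the fill carries over a coordinate from the previous area. Grouping by area before filling would avoid that, at the cost of leaving such leading gaps empty.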

Filter down DF to guns only

  • Due to the scope of the data and the time available for the project, we will only focus on firearm-related crimes.

Bin Robbery and Attempted Robbery

  • We made a column named is_robbery and if the crime was robbery or attempted robbery, we marked it true.
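
A minimal sketch of that flag. The exact description strings ("ROBBERY", "ATTEMPTED ROBBERY") are assumed here and should be checked against the values in the data.

```python
import pandas as pd

# Hypothetical crime descriptions.
crime_description = pd.Series(["ROBBERY", "ATTEMPTED ROBBERY", "BURGLARY", "VEHICLE - STOLEN"])

# True when the description is one of the two robbery labels.
is_robbery = crime_description.isin(["ROBBERY", "ATTEMPTED ROBBERY"])
print(is_robbery.tolist())
```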

Bin Time of Day

  • We binned the time of the crime into four times of day: morning, afternoon, evening, and night.
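
The bin edges are not stated here, so the sketch below assumes four six-hour bins (night 00:00 to 05:59, morning 06:00 to 11:59, afternoon 12:00 to 17:59, evening 18:00 to 23:59) applied to 4-digit time strings.

```python
import pandas as pd

# Hypothetical 4-digit 24-hour times.
time_occurred = pd.Series(["0030", "0815", "1200", "1759", "2359"])

# The first two characters are the hour.
hour = time_occurred.str[:2].astype(int)

# right=False makes each bin include its lower edge: [0, 6), [6, 12), ...
time_of_day = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)
print(time_of_day.astype(str).tolist())
```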

Make a column for crimes that happened on the street.

  • It was identified that a majority of robberies happened on the street. We'll make a binary column for it named is_street.




Explore

  • Does age affect the rate of robbery in cluster 4?

  • Does time of day affect robbery rate in cluster 4?

  • Does the victim's gender affect robbery rate in cluster 4?

  • Where are you most likely to be robbed in cluster 4?

  • Is the weapon used in cluster 4 different than the rest of Los Angeles?





Modeling

  • Use drivers in explore to build predictive models of different types

  • Evaluate models on train and validate data

  • Select the best model based on accuracy

  • Evaluate the test data

Best Model on Test

Logistic Regression

C=10
max_iter=1000
penalty="l1"
solver="saga"
random_state=321

Baseline: 53%

Test Set: 60%
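
For orientation, here is a sketch of how a model with these hyperparameters could be fit and compared against a majority-class baseline. The data is synthetic and randomly generated, so the printed scores are not the project's results; the feature encoding (one-hot with the first level dropped) and the single train/test split are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(321)
n = 1000

# Synthetic stand-in for the three features named in the takeaways.
X_raw = pd.DataFrame({
    "is_street": rng.integers(0, 2, n),
    "victim_sex": rng.choice(["Female", "Male", "unknown"], n),
    "victim_descent": rng.choice(["A", "B", "C"], n),
})
# Synthetic target: follows is_street, flipped at random about 20% of the time.
y = ((X_raw["is_street"] == 1) ^ (rng.random(n) < 0.2)).astype(int)

# One-hot encode the categorical columns as floats.
X = pd.get_dummies(X_raw, drop_first=True, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=321, stratify=y
)

model = LogisticRegression(
    C=10, max_iter=1000, penalty="l1", solver="saga", random_state=321
)
model.fit(X_train, y_train)

# Baseline: always predict the most common class seen in training.
baseline = (y_test == y_train.mode()[0]).mean()
accuracy = model.score(X_test, y_test)
print(f"baseline={baseline:.2f} accuracy={accuracy:.2f}")
```

The saga solver is one of the scikit-learn solvers that supports the l1 penalty, which is why the two appear together in the hyperparameters.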





How to Reproduce:

REQUIRED LIBRARIES:

  • numpy

  • pandas

  • seaborn

  • matplotlib

  • scipy

  • sklearn (installed as scikit-learn)

  • folium

    • pip install folium
  • yellowbrick

    • pip install yellowbrick
  1. Clone this repo.

  2. Acquire data by:

    • Source the two LA Crime Data files through the links

    OR

    • Download my combined CSV (750MB) of LA crime data from the Google Drive link in the Acquire section

  3. Save CSVs in same folder as Jupyter Notebook with the correct naming convention

  4. Run the notebook

    • Do the required pip installs as they come up




Conclusions

Key Findings / Questions Answered:

1. Does age affect the rate of robbery in cluster 4?

  • Age does affect the rate of robbery in cluster 4.

2. Does time of day affect robbery rate in cluster 4?

  • People are much more likely to be robbed in the evening.

3. Does the victim's gender affect robbery rate in cluster 4?

  • Gender did not strongly affect the robbery rate.

4. Where are you most likely to be robbed in cluster 4?

  • Robberies are most likely in outdoor areas, with the street/sidewalk being the number one location.

5. Is the weapon used in cluster 4 different than the rest of Los Angeles?

  • We found that the weapon used in cluster 4 is not different from the rest of LA.




Takeaways:

  • With only 3 identified features, I developed a model with 60% accuracy that outperformed the baseline accuracy of 53%.

    • is_street

    • victim_sex

    • victim_descent

  • The best model among SVC, KNN, LR and DTC was LogisticRegression, with a train set accuracy of 58%, a validation set accuracy of 62%, and a test set accuracy of 60%.

C=10, max_iter=1000, penalty="l1", solver="saga", random_state=321

Given that this model scored 7 percentage points above baseline on our test set (60% versus 53%), we would expect it to perform similarly on unseen data.





Recommendations

For Modeling:

  • Continue to run feature engineering and potentially test other models with other hyperparameters

  • Possibly create models for the other identified clusters

  • Potentially include some form of time series analysis to obtain trends

For Population:

  • Stay out of Inglewood at night

Next Steps

  • When dealing with crime rates and stereotypes, model performance is extremely important. Making policing assumptions from a model 7 points better than baseline would not be appropriate.

  • Given more time and access, suspect information could drastically improve the model.

  • Also, with more access, the closed-case details could cut down on incorrect and, most importantly, biased initial reporting.

  • Combine with weather data for more insights

  • Combine with poverty data for more insights

  • Combine with house values for more insights
