Updating 50-org to conform to the Updated Data schemas #29

Merged
DMalone87 merged 10 commits into main from dmalone/update_schemas
Nov 17, 2024

Conversation

DMalone87

Updating the 50-a Scraper to use the new schemas outlined in the NPDI API Documentation.

Additionally:

  • Adds a testing mode (scrapes only 10 random items rather than the entire site)
  • Adds user agent data to identify the scraper as coming from the National Police Data Coalition.
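
A minimal sketch of how these two additions could look, not the code from this PR. The names `select_units`, `TEST_SAMPLE_SIZE`, and `CUSTOM_SETTINGS`, along with the user agent string itself, are hypothetical. The only real Scrapy detail assumed here is the `USER_AGENT` setting, which a spider can supply through its `custom_settings` dict.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical user agent string; the value actually used by the scraper may differ.
# In Scrapy this dict could be assigned to a spider's `custom_settings` attribute.
CUSTOM_SETTINGS = {
    "USER_AGENT": "NPDC 50-a scraper (hypothetical contact: https://example.org)",
}

# Number of units scraped in testing mode, per the PR description.
TEST_SAMPLE_SIZE = 10


def select_units(
    unit_urls: Sequence[str],
    test_mode: bool,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return every unit URL, or a random sample of them in testing mode."""
    if not test_mode:
        return list(unit_urls)
    if rng is None:
        rng = random.Random()
    # Guard against sites with fewer units than the sample size.
    sample_size = min(TEST_SAMPLE_SIZE, len(unit_urls))
    return rng.sample(list(unit_urls), sample_size)
```

Passing a seeded `random.Random` makes a test run reproducible, which helps when comparing output between schema changes.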

Add user agent information to 50-a scraper
- Add testing mode (Selects 10 random units to scrape)
- Output conforms to schema
- leverage the data classes in the common package
- Collect employment data for officers
- Include source data in data items
@DMalone87 DMalone87 requested review from aliavni and zganger October 20, 2024 03:03
@DMalone87

Much of this will be covered by the CSV files that 50-a provides, but I'm adding this PR for 3 reasons:

  1. As a reference for using the Pydantic models to create the JSONL data
  2. Just for myself to have a better understanding of Scrapy
  3. Some data doesn't get included in the CSV files. For example, CSV only includes an officer's current unit assignment while the page has their previous unit assignments as well.
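
For the first point, a rough sketch of the Pydantic-to-JSONL pattern, assuming Pydantic v2 (`model_dump_json`). The `Officer` and `Employment` models and all of their fields are hypothetical stand-ins, not the data classes from the common package.

```python
from typing import Iterable, List, Optional

from pydantic import BaseModel, Field


class Employment(BaseModel):
    # Hypothetical fields; the real schema lives in the common package.
    unit: str
    current: bool = False


class Officer(BaseModel):
    # Hypothetical fields, for illustration only.
    first_name: str
    last_name: str
    ethnicity: Optional[str] = None
    # Previous and current unit assignments, which the CSV files do not fully cover.
    employment_history: List[Employment] = Field(default_factory=list)
    # URL of the page the record was scraped from.
    source: str


def to_jsonl(items: Iterable[BaseModel]) -> str:
    """Serialize each validated model as one JSON object per line."""
    return "".join(item.model_dump_json() + "\n" for item in items)


officer = Officer(
    first_name="Jane",
    last_name="Doe",
    employment_history=[
        Employment(unit="Unit A", current=True),
        Employment(unit="Unit B"),
    ],
    source="https://example.org/officer/1",
)
jsonl_text = to_jsonl([officer])
```

Validation happens when each model is constructed, so a malformed record fails before it reaches the output file.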

Remove "Unknown" option from ethnicity
- Returns None if a matching ENUM isn't found
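
One way this lookup could behave, as a sketch. The `Ethnicity` members and the `parse_ethnicity` helper are hypothetical and do not reflect the actual enum in the schema.

```python
from enum import Enum
from typing import Optional


class Ethnicity(str, Enum):
    # Hypothetical members; the real enum is defined by the data schema.
    BLACK = "Black"
    WHITE = "White"
    ASIAN = "Asian"
    HISPANIC = "Hispanic"


def parse_ethnicity(raw: Optional[str]) -> Optional[Ethnicity]:
    """Map scraped text to an Ethnicity member, or None when nothing matches."""
    if not raw:
        return None
    normalized = raw.strip().lower()
    for member in Ethnicity:
        if member.value.lower() == normalized:
            return member
    # No "Unknown" fallback: an unmatched value is reported as missing.
    return None
```

Returning `None` keeps unmatched values out of the enum, so downstream consumers can tell "not recorded" apart from a real category.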
@@ -1,2 +1,3 @@
requests==2.32.3
Scrapy==2.11.2
pydantic

suggestion: Pin the package version

@aliavni aliavni left a comment

LGTM

@DMalone87 DMalone87 merged commit b8d3d15 into main Nov 17, 2024
5 checks passed
@DMalone87 DMalone87 deleted the dmalone/update_schemas branch November 17, 2024 05:30