PA municipal scraper #240

Open
4 tasks
josh-chamberlain opened this issue Feb 6, 2024 · 18 comments
Labels: help wanted (Extra attention is needed), scraper-request

josh-chamberlain (Contributor) commented Feb 6, 2024

Context

Related to data source request 102

Pennsylvania publishes municipal, county, and state budgets. It's possible to find individual municipal budgets, which include police budgets, but it's cumbersome to get many at once. Let's make a scraper which can be run to iterate through the interface and collect them all. Each municipality has its own police force.

Source 1. Municipal and police budget: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm

(screenshot of the report selection form for Source 1)

Source 2. Police details: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=MuniPolice_Excel

(screenshot of the report selection form for Source 2)

Source 3. Municipal demographics: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=CountyMuniDemo_Excel

(screenshot of the report selection form for Source 3)

Requirements

  • create a scraper which can be run on-demand to generate a CSV for every municipality in Pennsylvania
    • our current primary users are in Allegheny County; for simplicity and speed of testing, you can focus on that county
    • there are effectively 3 sources; we should probably treat each of them individually, maybe with shared utilities since they're on the same website, adding them to the spreadsheet
  • the CSV should have the following columns:
| column | source |
| --- | --- |
| Municipality | 1, 2, 3 (use this one to link them together) |
| MUNI_ID | 2, 3 |
| county | 2 |
| Population | 3 |
| Police Service Type | 2 |
| Full Time Officers | 2 |
| Part Time Officers | 2 |
| Contract to Others? | 2 |
| Contracted to Municipalities | 2 |
| Contracted from Municipality | 2 |
| Regional Name | 2 |
| Regional Municipalities | 2 |
| 2019 Police Expenditures | 1 |
| 2019 Total Expenditures | 1 |
| 2020 Police Expenditures | 1 |
| 2020 Total Expenditures | 1 |
| 2021 Police Expenditures | 1 |
| 2021 Total Expenditures | 1 |
| 2022 Police Expenditures | 1 |
| 2022 Total Expenditures | 1 |
| 2023 Police Expenditures | 1 |
| 2023 Total Expenditures | 1 |

Example

Here's a sample from a manually generated document from ~2020:


| Municipality | MUNI_ID | Population | Police Service Type | Full Time Officers | Part Time Officers | Contract to Others? | Contracted to Municipalities | Contracted from Municipality | Regional Name | Regional Municipalities | 2019 Police Expenditures | 2019 Total Expenditures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ALEPPO TWP | 20034 | 1825 | Police Service contracted from another municipality | 0 | 0 | | | OHIO TWP | | | $104,266.00 | $2,253,234.00 |
| ASPINWALL BORO | 20063 | 2916 | Own Municipal Police Force | 7 | 2 | NO | | | | | $624,961.00 | $3,932,034.00 |
| AVALON BORO | 20093 | 4762 | Own Municipal Police Force | 6 | 7 | NO | | | | | $1,223,255.00 | $5,713,865.00 |
| BALDWIN BORO | 20123 | 21510 | Own Municipal Police Force | 24 | 0 | NO | | | | | $3,918,035.00 | $17,874,484.00 |
| BALDWIN TWP | 20154 | 1985 | Own Municipal Police Force | 6 | 0 | NO | | | | | $775,007.00 | $2,009,493.00 |
| BELL ACRES BORO | 20183 | 1505 | Own Municipal Police Force | 5 | 4 | NO | | | | | $491,571.00 | $2,284,663.00 |
josh-chamberlain added the "help wanted" (Extra attention is needed) and "scraper-request" labels on Feb 6, 2024
maxachis (Contributor) commented Feb 6, 2024

Looking at the web sources, the problem can be broken down into three parts:

  1. Iterate through all possible selection options for each page. Where possible, use a "select all" option.
  2. Traverse the generated tables (which are often paginated) and extract the relevant data.
  3. Put the relevant data into CSV format.

Of these, 3 is fairly straightforward; 1 and 2 will carry most of the complexity. Selenium IDE would likely be one option for components requiring a user interface, but there might be easier ways to do it if UI interaction isn't necessary (perhaps by automating the form submissions?). I'd need to take a look at the most up-to-date libraries for scraping.

Additionally, if this information is provided through alternative means, such as an RSS feed, that's worth noting.
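
As a rough sketch of step 3 (the column names come from the requirements above; the Row type and quoting helper are assumptions, not the final schema):

```typescript
// Sketch: write collected rows to CSV with Node's fs module.
import { writeFileSync } from "node:fs";

type Row = Record<string, string | number | null>;

const COLUMNS = [
  "Municipality", "MUNI_ID", "county", "Population",
  "Police Service Type", "Full Time Officers", "Part Time Officers",
  "2021 Police Expenditures", "2021 Total Expenditures",
];

// Quote a field if it contains a comma, quote, or newline.
function csvField(value: string | number | null): string {
  const s = value === null ? "" : String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

export function writeCsv(rows: Row[], path: string): void {
  const lines = [
    COLUMNS.join(","),
    ...rows.map((row) => COLUMNS.map((c) => csvField(row[c] ?? null)).join(",")),
  ];
  writeFileSync(path, lines.join("\n"), "utf8");
}
```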

maxachis (Contributor) commented Feb 6, 2024

Gonna try approaching it with Puppeteer.
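
A minimal sketch of what the Puppeteer approach could look like for step 1 (the dropdown selector ID below is a placeholder, not the real field ID on munstats.pa.gov; the actual form would need to be inspected):

```typescript
// Sketch: enumerate dropdown options on the report form with Puppeteer.
import puppeteer from "puppeteer";

const REPORT_URL =
  "https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm";

async function listOptions(selector: string): Promise<{ value: string; label: string }[]> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(REPORT_URL, { waitUntil: "networkidle2" });

  // Read every <option> in the given <select> element.
  const options = await page.$$eval(`${selector} option`, (els) =>
    els.map((el) => ({
      value: (el as HTMLOptionElement).value,
      label: el.textContent?.trim() ?? "",
    }))
  );

  await browser.close();
  return options;
}

// Hypothetical county dropdown id -- replace with the real one.
listOptions("#ddlCounty").then((opts) => console.log(opts));
```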

josh-chamberlain (Contributor, Author) commented:

@maxachis did you find the line for budgets? Can't tell if you deleted your comment or if it's just not showing.

  1. head to https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
  2. i picked allegheny county, braddock hills, and 2021
  3. page 3 of the report, revenues & expenditures, there's a public safety section—police $533,540
  4. for wilkinsburg, 2021, same thing $3,179,327

maxachis (Contributor) commented:

> @maxachis did you find the line for budgets? Can't tell if you deleted your comment or if it's just not showing.
>
>   1. head to https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
>   2. i picked allegheny county, braddock hills, and 2021
>   3. page 3 of the report, revenues & expenditures, there's a public safety section—police $533,540
>   4. for wilkinsburg, 2021, same thing $3,179,327

I was able to find it. I was looking at the wrong URL 🙃. Deleted the message because it was just me making an error.

I've got code that, with a few more tweaks, should let me start pulling the data. I'll start off by just pulling the download URLs for everything, rather than downloading the files directly. After that, I'll work on creating another script that will pull the data and process the relevant information.

The primary question for me is where all the data should ultimately go. I should be able to produce the CSVs without too much trouble, but where they'll ultimately be placed is another question.

josh-chamberlain (Contributor, Author) commented:

@maxachis ok, great! Thank you for working on this. For now, let's just have them go in the same directory as the scraper, in a "data" folder or something. We can use GitHub Actions to run them, if we ever need to, or just manually update and make a PR.

maxachis (Contributor) commented:

Additionally, I note that the Municipal Finance data includes a wealth of information beyond expenditures, including police-related sources of income such as "Fines and Forfeits" and "Public Safety Charges for Service". That's outside the scope of this task, but it seems like it could be valuable intel, and it would be possible to extend or modify this scraper to gather that information.

josh-chamberlain (Contributor, Author) commented:

@maxachis yeah, agreed—ton of good stuff there. if you make an issue with your ideas to extend this scraper, someone could do the enhancement!

how's this going, btw? need anything?

maxachis (Contributor) commented:

Progress so far: I've developed alpha-level code that can iterate through and download the options for Source 1, a script for finding the relevant data in the downloaded Excel files, and a SQLite database for caching attempts (so progress is retained if the scraper halts mid-process) and storing some of the relevant information. Still need to sort out kinks, and eventually expand to the other Sources. Helpfully, those other Sources don't seem as demanding as Source 1.

You can track the current status of my scraper here: https://github.com/maxachis/pa_municipal_scraper
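
For illustration, a minimal sketch of the attempt-caching idea (the better-sqlite3 library choice and this schema are assumptions; the actual implementation lives in the linked repo):

```typescript
// Sketch: cache scrape attempts in SQLite so a halted run can resume.
import Database from "better-sqlite3";

const db = new Database("scrape_cache.db");

db.prepare(`
  CREATE TABLE IF NOT EXISTS attempts (
    county TEXT NOT NULL,
    municipality TEXT NOT NULL,
    year INTEGER NOT NULL,
    status TEXT NOT NULL,          -- 'done', 'failed', or 'missing'
    police_expenditures REAL,
    total_expenditures REAL,
    PRIMARY KEY (county, municipality, year)
  )
`).run();

// Skip entries that already succeeded on a previous run.
export function alreadyScraped(county: string, municipality: string, year: number): boolean {
  const row = db
    .prepare(
      "SELECT 1 FROM attempts WHERE county = ? AND municipality = ? AND year = ? AND status = 'done'"
    )
    .get(county, municipality, year);
  return row !== undefined;
}

export function recordAttempt(
  county: string, municipality: string, year: number,
  status: string, police?: number, total?: number
): void {
  db.prepare(`
    INSERT OR REPLACE INTO attempts
      (county, municipality, year, status, police_expenditures, total_expenditures)
    VALUES (?, ?, ?, ?, ?, ?)
  `).run(county, municipality, year, status, police ?? null, total ?? null);
}
```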

maxachis (Contributor) commented:

Scraper for Source 1 is progressing apace. I now have it running continuously.

The primary bottleneck is the network speed of the Municipal website, which is slow to respond. I'm able to work around this by running multiple scrapers concurrently in Node.js (currently 10). In theory, I could increase the number of scrapers, but each one costs memory, and I only have so much computer. Plus, and perhaps I'm being overly cautious, I'm not sure how many concurrent requests this government website can handle. In theory, even 100 of my scrapers shouldn't pose a problem for it, but I don't know how brittle the backend is.
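
For reference, a rough sketch of that kind of bounded concurrency (the worker-pool shape and the scrapeOne task in the usage comment are assumptions; the real implementation is in the repo linked above):

```typescript
// Sketch: run scrape tasks with a fixed concurrency limit (e.g. 10 workers).
type Task<T> = () => Promise<T>;

async function runWithLimit<T>(tasks: Task<T>[], limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each "worker" keeps pulling the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  await Promise.all(Array.from({ length: limit }, () => worker()));
  return results;
}

// Usage: wrap each municipality/year scrape in a task and cap at 10 at once.
// const rows = await runWithLimit(municipalities.map((m) => () => scrapeOne(m)), 10);
```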

Still, I can currently process around 50 entries per minute, and that's probably a conservative estimate. At that rate, assuming no interruptions (which is an assumption), processing would be done within 14 hours of continuous operation, AKA I could easily have it done by the next Friday meeting, and likely much sooner. I've currently got a little over 5,800 of approximately 41,000 possible entries scraped, including the majority of Allegheny County (I say majority only because some entries are not available).

I've done spot checking to validate I'm pulling the correct data, but it's possible there are other errors I'll only discover later, which would of course necessitate rerunning some or all of the code.

One note on data integrity: some of these financial reports do not indicate any police expenditures. By my count, approximately 26% of what I've processed overall (and happily only 3% of Allegheny County) do not report any police expenditures. Why this is, I don't know.

Sources 2 and 3 should be considerably easier. I should be able to pull both through a single request, and after that it's just a matter of parsing the Excel files.
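
Parsing the downloaded workbooks could look roughly like this (using the SheetJS xlsx package as an assumption; the real files may need a specific sheet name and column mapping):

```typescript
// Sketch: read a downloaded report workbook and pull rows out as objects.
import * as XLSX from "xlsx";

export function readReport(path: string): Record<string, unknown>[] {
  const workbook = XLSX.readFile(path);
  // Take the first sheet; the real files may require a specific sheet name.
  const sheetName = workbook.SheetNames[0];
  const sheet = workbook.Sheets[sheetName];
  // Convert to an array of row objects keyed by the header row.
  return XLSX.utils.sheet_to_json<Record<string, unknown>>(sheet, { defval: null });
}

// const rows = readReport("data/MuniPolice.xlsx");
```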

maxachis (Contributor) commented:

I ran into an instance where the rate at which I was able to download data slowed considerably between Saturday and Sunday. While I'm not sure how plausible it is that they detected bots snorkeling up their data, it did raise the question of how to ethically scrape the website. From what I can tell, a decent rule of thumb is to only pull data around as fast as a human user could. Unfortunately, that does mean that pulling the additional data will take longer than I had planned.
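
A minimal sketch of that kind of human-speed throttle (the delay range is an assumption, tuned to roughly 1-2 requests per minute):

```typescript
// Sketch: space requests out to roughly human speed (here ~30-60 seconds apart).
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

export async function politeFetchAll<T>(
  tasks: (() => Promise<T>)[],
  minDelayMs = 30_000,
  maxDelayMs = 60_000
): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
    // Randomize the pause a little so the traffic looks less mechanical.
    const delay = minDelayMs + Math.random() * (maxDelayMs - minDelayMs);
    await sleep(delay);
  }
  return results;
}
```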

Fortunately, I do have the majority of the Allegheny County information, and can move forward with that, while still running the scraper continuously to trickle in the remaining data. But this does bring up a few additional questions:

  1. What do we consider an acceptable rate at which to pull data from these websites? Currently, at my rate-limited, uh, rate, I pull in about 1-2 entries per minute. Obviously, I'd like to go faster, but don't want to upset anyone managing the Muni website, nor degrade their service.
  2. Might it be prudent to email the webmasters, announce our goals and interests, and request either a data dump or an API to more efficiently scrape the data?
  3. Would we want to pre-emptively pull more data than what we're pulling now, in anticipation of potential future requests?

maxachis (Contributor) commented:

Created a draft of this information. Best results are for Allegheny County, with substantial gaps in the Municipal Finance data for a number of other counties.
results.csv

josh-chamberlain (Contributor, Author) commented:

@maxachis thanks for sharing your thought process here.

  1. Most important is not taking the site down or disrupting access; defaulting to human-speed is sensible. Since this data is updated yearly, starting from Allegheny County and getting the rest at a slow trickle seems OK to me. I can anticipate practical limitations to doing this on a personal computer, but we should be well within the free tier of GitHub Actions if we want to set it up that way.
  2. The original requestor did this, and I have heard rumors of a past data dump. This time they have not been responsive.
  3. If you can see anything that you want to use, or that seems obviously useful, you could try that. In general, getting extra data in anticipation of future requests has not been worth the effort.

The results look great! My suggestion would be to submit the results with the code, so that the scraper's README contains:

  • a reference to the source material
  • the results from when it was last run
  • brief instructions for generating one's own results, which could be submitted back to the repo as a PR
  • an explanation of any anomalies / issues with the source data that might make the scraped data look funky

maxachis (Contributor) commented:

@josh-chamberlain Where should the results of this data be stored? HuggingFace? Airtable? Somewhere else?

josh-chamberlain (Contributor, Author) commented:

@maxachis unless the files are too big, I think we should just keep them in this repository. Self-contained, fewer moving parts. Thoughts?

maxachis (Contributor) commented:

> @maxachis unless the files are too big, I think we should just keep them in this repository. Self-contained, fewer moving parts. Thoughts?

This is doable. Note that the recommended GitHub repository size is less than 5 gigabytes. I'm not sure what the repository size is currently, but regardless, we should be able to store the data here without adding too much. Assuming we were able to get data for every single municipality for all 15-ish years (which is a substantial if), the total would come to around 40,000 rows. Probably not wise to download to your iPod Nano, but it should be doable for the repo.

josh-chamberlain (Contributor, Author) commented:

@maxachis I remember your scraper working well; want to submit it and call this closed?

maxachis (Contributor) commented:

@josh-chamberlain Can do! May take me a second while I work through other parts, though, unless you want me to put this at the head of the queue.

josh-chamberlain (Contributor, Author) commented:

@maxachis great! This isn't urgent, but it is a nice utility for anyone in the state. Even in the state you last ran it in, which might be "incomplete", it could be committed to the scrapers repo and used in the future; there's certainly worse/broken code there.
