🕷️ Fix spider: Northeast Ohio Areawide Coordinating Agency #116

Merged
merged 1 commit into main from fix/ne-coord on Aug 13, 2024

Conversation

@SimmonsRitchie (Contributor) commented Aug 13, 2024

What's this PR do?

Fixes our Northeast Ohio Areawide Coordinating Agency spider (a.k.a. `cuya_northeast_ohio_coordinating`).

Why are we doing this?

The spider broke due to what appear to be security enhancements on the agency's servers. The webpage didn't change, but requests were returning 403 errors. Using a headless browser and a different user agent appears to circumvent whatever bot-detection system the agency is using.
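
For context, here's a minimal sketch of the general approach (not this PR's actual diff): route requests through a headless browser via scrapy-playwright and present a browser-like user agent. The class name, user-agent string, and start URL below are illustrative.

```python
import scrapy


class NoacaSpider(scrapy.Spider):
    # Hypothetical class; the real spider lives in this repo under
    # cuya_northeast_ohio_coordinating.
    name = "cuya_northeast_ohio_coordinating"
    custom_settings = {
        # Route HTTP(S) downloads through Playwright's headless browser
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio-based Twisted reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # Present a common browser user agent instead of Scrapy's default
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
        ),
    }

    def start_requests(self):
        # meta={"playwright": True} opts this request into the browser handler
        yield scrapy.Request(
            "https://www.noaca.org/",  # illustrative, not necessarily the real start URL
            meta={"playwright": True},
        )

    def parse(self, response):
        ...
```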

Steps to manually test

After installing the project using pipenv:

  1. Activate the virtual environment:
     `pipenv shell`
  2. Run the spider:
     `scrapy crawl cuya_northeast_ohio_coordinating -O test_output.csv`
  3. Monitor stdout and ensure that the crawl proceeds without raising any errors. Pay attention to the final status report from Scrapy.
  4. Inspect test_output.csv to ensure the data looks valid. I suggest opening a few of the URLs under the source column of test_output.csv and comparing each row's data with what you see on the page (a quick spot-check script follows this list).
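
If it helps, here's a small helper for step 4 that prints a few rows for spot-checking. The `source` column is mentioned above; `title` is an assumed column name from the CSV output.

```python
import csv

# Print the first few rows' titles and source URLs for manual comparison
with open("test_output.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        if i >= 5:
            break
        print(row.get("title"), "->", row.get("source"))
```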

Are there any smells or added technical debt to note?

  • A rotating user agent would probably provide a more robust long-term solution, but we'll see how this works for now (see the sketch after this list).
  • Adding a headless browser means the workflows need to install the browser each time they run, so their execution time will be a bit longer from this point forward. We consider that an acceptable tradeoff to ensure this spider works properly.
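
As a sketch of the rotating user agent idea (hypothetical names and strings, not code from this repo), a small Scrapy downloader middleware could pick a random agent per request:

```python
import random

# Illustrative pool; a real implementation might load a larger, maintained list
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]


class RotateUserAgentMiddleware:
    """Assign a random user agent to each outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(UA_POOL)
        return None  # continue normal downloader processing
```

Enabling it would just be a matter of registering the class in the project's `DOWNLOADER_MIDDLEWARES` setting.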

Use playwright and different user-agent to avoid 403 responses on agency's website that are likely caused by bot-detection software.
SimmonsRitchie marked this pull request as ready for review August 13, 2024 18:54
SimmonsRitchie merged commit 1f16c81 into main Aug 13, 2024
2 checks passed
SimmonsRitchie deleted the fix/ne-coord branch August 13, 2024 18:58