Add Connecticut scraper #3
base: master
This is an initial stab at a scraper for Connecticut. The website is a bit awkward, if not insane. This currently only handles the first 20 results, as it needs subsequent requests to get more.
Sorry this commit blows. A little too much spike work going on, and I didn't clean up the patch well. Basically, move the chain logic into functions, save results to an array higher in the scope, and pull the total logic out of the processHTML function.
Will use the total to batch up page requests and wait for them to finish.
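A rough sketch of what batching on the total could look like, not the code in this PR. The page size of 20 comes from the description above; `remainingPages`, `chunk`, `fetchAllPages`, the `fetchPage` callback, and the batch size of 5 are all hypothetical names and values chosen for illustration.

```typescript
// Page size taken from the PR description ("the first 20").
const PAGE_SIZE = 20;

/** Return the 1-based page numbers still to fetch after page 1. */
function remainingPages(total: number, pageSize: number = PAGE_SIZE): number[] {
  const pageCount = Math.ceil(total / pageSize);
  const pages: number[] = [];
  for (let page = 2; page <= pageCount; page++) {
    pages.push(page);
  }
  return pages;
}

/** Split a list into consecutive batches of at most `size` items. */
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/**
 * Fetch every remaining page one batch at a time, waiting for each batch
 * to finish before starting the next. `fetchPage` is a stand-in for whatever
 * request + processHTML step the scraper actually uses (an assumption).
 */
async function fetchAllPages<R>(
  total: number,
  fetchPage: (page: number) => Promise<R[]>,
  batchSize: number = 5
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(remainingPages(total), batchSize)) {
    // Promise.all keeps the results in the same order as the page numbers.
    const pages = await Promise.all(batch.map((page) => fetchPage(page)));
    for (const rows of pages) {
      results.push(...rows);
    }
  }
  return results;
}
```

Batching rather than firing every request at once keeps the load on the site modest, which seems prudent given how fragile its session handling appears to be.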
Seems that the site uses some whacked session management that I haven't been able to crack yet. Disable the feature until it can be understood. This is a problem with properly crafting the form-data, not with the scraper logic.
Awesome!! 🌠 You want me to just leave this PR open for you to add more commits?
Sure, unless you know how to fix it. Sigh, I gave up last night.
K -- I'll take a look, but it could take me a while, since I'm back in day-job mode here at DOBT. That postdata looks absolutely ridiculous. The other option is to use a headless browser instead of crafting the requests manually -- maybe that's worth a shot if we can't get this method to paginate.
This is a bit of a WIP. It will scrape the first 20. I can't seem to craft the form-data for the following page requests correctly, and each page after the first is returning the wrong page.
P.S. The site / markup is scary as all hell 😱
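One thing that might be worth ruling out for the wrong-page problem, purely as an assumption since the thread does not identify the cause: some sites keep pagination state in hidden form inputs and expect each request to echo back the values from the previous response. The sketch below copies those hidden fields into the next POST body. `extractHiddenFields`, `encodeForm`, `nextPageBody`, and the `__STATE` and `page` field names in the usage note are hypothetical, not taken from the Connecticut site.

```typescript
/**
 * Collect name/value pairs from <input type="hidden"> tags in an HTML string.
 * Regex-based for brevity; a real HTML parser would be more robust.
 */
function extractHiddenFields(html: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const inputTags = html.match(/<input\b[^>]*>/gi) || [];
  for (const tag of inputTags) {
    if (!/\stype\s*=\s*["']?hidden\b/i.test(tag)) continue;
    const name = /\sname\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /\svalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : "";
  }
  return fields;
}

/** Percent-encode fields as a form body (spaces become %20 rather than +). */
function encodeForm(fields: Record<string, string>): string {
  return Object.keys(fields)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(fields[key])}`)
    .join("&");
}

/**
 * Build the body for the next page request: hidden state from the previous
 * response first, then the fields we set ourselves (which win on conflict).
 */
function nextPageBody(previousHtml: string, overrides: Record<string, string>): string {
  return encodeForm({ ...extractHiddenFields(previousHtml), ...overrides });
}
```

For example, given a previous response containing `<input type="hidden" name="__STATE" value="abc==">`, calling `nextPageBody(html, { page: "2" })` yields `__STATE=abc%3D%3D&page=2`. If the site also ties state to a session cookie, that cookie would need to be sent along as well, which this sketch does not cover.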