Add Connecticut scraper #3

Open
wants to merge 7 commits into
base: master

sukima commented Feb 24, 2014

This is a bit of a WIP. It scrapes the first 20 results. I can't seem to craft the form-data for the subsequent page requests correctly, and every request after the first returns the wrong page.

  • Scrape site contents (first page)
  • Scrape further pages

P.S. The site / markup is scary as all hell 😱

This is an initial stab at a scraper for Connecticut. The website is a
bit awkward if not insane.

This currently only handles the first 20 as it needs subsequent requests
to get more.

Sorry this commit blows. A little too much spike work going on and
didn't clean up the patch well.

Basically moves the chain logic into functions. Saves results to an array
higher in the scope. Pulls the total logic out of the processHTML function.
Will use the total to batch up page requests and wait for them to finish.

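
The batching described above (use the total to work out how many pages remain, request them, wait for all of them to finish) could look roughly like the following. This is a minimal sketch, not the code in this PR: `remainingPages`, `scrapeRemaining`, the `fetchPage` callback, and the page size of 20 are all assumed names and values, and it uses modern promises rather than whatever flow control the scraper actually uses.

```javascript
// Hypothetical sketch: every name here and the page size are assumptions.
const PAGE_SIZE = 20;

// Page numbers still to fetch once page 1 has already been scraped.
function remainingPages(total, pageSize = PAGE_SIZE) {
  const pageCount = Math.ceil(total / pageSize);
  const pages = [];
  for (let page = 2; page <= pageCount; page++) {
    pages.push(page);
  }
  return pages;
}

// Start one request per remaining page and wait for all of them.
// `fetchPage(page)` is assumed to return a promise of an array of results.
async function scrapeRemaining(total, firstPageResults, fetchPage) {
  const batches = await Promise.all(
    remainingPages(total).map((page) => fetchPage(page))
  );
  // Promise.all preserves input order, so results stay in page order.
  return firstPageResults.concat(...batches);
}
```

The sketch returns the merged array instead of pushing into one held higher in the scope; either works, but returning keeps the function easy to test with a stubbed `fetchPage`.
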
Seems that the site uses some whacked session management that I haven't
been able to crack yet. Disables pagination until it can be understood.

This is a problem with properly crafting the form-data, not with the
scraper logic.
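
One common cause of this kind of pagination failure is a site that embeds session state in hidden form fields and expects them to be echoed back on the next POST. Whether the Connecticut site works this way is not established in this thread, so treat the following as a sketch under that assumption. The function names and the `page` field name are hypothetical.

```javascript
// Collect name/value pairs of hidden inputs from the previous response.
// A regex is enough for a sketch; a real scraper would use an HTML parser
// and would also decode HTML entities in the values.
function hiddenFields(html) {
  const fields = {};
  const inputs = html.match(/<input\b[^>]*>/gi) || [];
  for (const tag of inputs) {
    if (!/\stype\s*=\s*["']hidden["']/i.test(tag)) continue;
    const name = /\sname\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /\svalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : '';
  }
  return fields;
}

// Build the URL-encoded body for the next page request.
// The `page` field name is an assumption, not the site's real parameter.
function nextPageBody(previousHtml, page) {
  const params = new URLSearchParams(hiddenFields(previousHtml));
  params.set('page', String(page));
  return params.toString();
}
```

If the hidden values change with every response, each request depends on the one before it, so the pages would have to be fetched one after another rather than all at once.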
Contributor

ajb commented Feb 24, 2014

Awesome!! 🌠

You want me to just leave this PR open for you to add more commits?

Author

sukima commented Feb 24, 2014

Sure, unless you know how to fix it. Sigh, I gave up last night.

Contributor

ajb commented Feb 24, 2014

K -- I'll take a look but it could take me a while, since I'm back in day-job mode here at DOBT.

That postdata looks absolutely ridiculous. The other option is to use a headless browser instead of crafting the requests manually -- maybe that's worth a shot if we can't get this method to paginate.

ajb added the wip label Mar 4, 2014