Add Connecticut scraper #3

sukima · 2014-02-24T03:41:18Z

This is a bit of a WIP. It will scrape the first 20 results. I can't seem to craft the form-data for the following page requests correctly, and every page after the first returns the wrong page.

- Scrape site contents (first page)
- Scrape further pages

P.S. The site / markup is scary as all hell 😱

ajb · 2014-02-24T03:53:45Z

Awesome!! 🌠

You want me to just leave this PR open for you to add more commits?

sukima · 2014-02-24T15:14:00Z

Sure, unless you know how to fix it. Sigh, I gave up last night.

ajb · 2014-02-24T15:18:20Z

K -- I'll take a look but it could take me a while, since I'm back in day-job mode here at DOBT.

That postdata looks absolutely ridiculous. The other option is to use a headless browser instead of crafting the requests manually -- maybe that's worth a shot if we can't get this method to paginate.

sukima added 7 commits February 22, 2014 15:39

Add Q to package dependencies

216e384

Add CT scraper

70b6397

This is an initial stab at a scraper for Connecticut. The website is a bit awkward, if not insane. This currently only handles the first 20 results, as it needs subsequent requests to get more.

Clean up code, whitespace, varnames

3a9f48a

Modularize and cleanup promise chain

7f77bac

Sorry, this commit blows. There was a little too much spike work going on and I didn't clean up the patch well. Basically, this moves the chain logic into functions, saves results to an array higher in the scope, and pulls the total logic out of the processHTML function.
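
For readers who want a picture of that shape, here is a minimal sketch of a promise chain split into named functions. It is not the code from this commit. Only `processHTML` is named in the message; `results`, `extractTotal`, `fetchFirstPage`, `scrape`, the sample markup, and both regexes are assumptions made up for illustration, and native promises stand in for Q.

```javascript
// Shared array that lives above the chain steps, as the commit message describes.
const results = [];

// Parse one page of markup and push its rows into the shared array.
// The regex is a stand-in; the real parsing is not shown in this PR.
function processHTML(html) {
  const rowPattern = /<td class="name">([^<]*)<\/td>/g;
  let match;
  while ((match = rowPattern.exec(html)) !== null) {
    results.push({ name: match[1].trim() });
  }
  return html; // pass the markup along so the next step can read it too
}

// Kept separate from processHTML: read the overall record count from the page.
// The "of N records" wording is assumed, not taken from the real site.
function extractTotal(html) {
  const match = /of\s+(\d+)\s+records/i.exec(html);
  return match ? parseInt(match[1], 10) : 0;
}

// Hypothetical stand-in for the real HTTP request for page 1.
function fetchFirstPage() {
  return Promise.resolve(
    '<table><tr><td class="name"> Acme LLC </td></tr>' +
    '<tr><td class="name">Beta Inc</td></tr></table>' +
    '<span>Showing 1 - 2 of 45 records</span>'
  );
}

// The chain itself is now just a list of named steps.
function scrape() {
  return fetchFirstPage()
    .then(processHTML)
    .then(extractTotal)
    .then(function (total) {
      return { total: total, results: results };
    });
}
```

Keeping the total extraction in its own step means the same page can feed both the row parser and whatever later decides how many more pages to request.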

Add ability to request subsequent pages

108b902

Will use the total to batch up page requests and wait for them to finish.
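
A rough sketch of that batching idea, assuming 20 results per page (the figure mentioned in this PR) and a hypothetical `fetchPage(page)` function that returns a promise for one page's rows. `remainingPages` and `fetchRemaining` are illustrative names, not functions from the commit. Native `Promise.all` is used so the snippet runs without dependencies; `Q.all` plays the same role in Q.

```javascript
const PAGE_SIZE = 20; // assumed page size, based on "the first 20" in this PR

// Page numbers still to fetch once page 1 has reported the total.
function remainingPages(total, pageSize) {
  const pageCount = Math.ceil(total / pageSize);
  const pages = [];
  for (let page = 2; page <= pageCount; page++) {
    pages.push(page);
  }
  return pages;
}

// fetchPage is a hypothetical function: page number in, promise of rows out.
function fetchRemaining(total, fetchPage) {
  const requests = remainingPages(total, PAGE_SIZE).map(function (page) {
    return fetchPage(page);
  });
  // Resolves once every request has finished; rejects if any one of them fails.
  return Promise.all(requests).then(function (pages) {
    return [].concat.apply([], pages); // flatten into one list of rows
  });
}
```

Note that firing every request at once only works if the server does not tie pagination to per-session state; if it does, the requests may need to run one after another instead.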

Disable the async requests

22ff170

It seems the site uses some whacked session management that I haven't been able to crack yet. Disable the feature until it can be understood. This is a problem with crafting the form-data properly, not with the scraper logic.
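
One common cause of this kind of symptom is a site that keeps its paging state in hidden form fields that must be echoed back on every request. Nothing in this PR confirms that is what the Connecticut site does, so treat the following as a sketch under that assumption. `hiddenFields` and `buildFormData` are hypothetical helpers, and the field names in the usage note are invented.

```javascript
// Collect name/value pairs from hidden <input> fields in a page.
// Assumes the attributes appear in the order type, name, value and that values
// contain no HTML entities; a real HTML parser would be more robust.
function hiddenFields(html) {
  const fields = {};
  const pattern = /<input[^>]*type="hidden"[^>]*name="([^"]*)"[^>]*value="([^"]*)"/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    fields[match[1]] = match[2];
  }
  return fields;
}

// Build a URL-encoded POST body: start from the hidden fields of the previous
// response, then apply overrides such as the page to request next.
function buildFormData(previousHtml, overrides) {
  const fields = Object.assign({}, hiddenFields(previousHtml), overrides);
  return Object.keys(fields).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
  }).join('&');
}
```

For example, `buildFormData(firstPageHtml, { page: '2' })` would carry every hidden field from the first response into the request for the second page. If the site also relies on a session cookie, that cookie would need to be sent back with each request as well.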

Remove commented code 💀

8a1dca1

ajb added the wip label Mar 4, 2014
