Overhaul #12

Merged
merged 3 commits into from
Sep 7, 2014
Conversation

@blahah (Member) commented Sep 7, 2014

This is the first and most significant step in a complete overhaul of thresher.
Its purpose is to support the current and near-future needs of
scraperJSON, based on a revisited design and a lot of
user feedback.

Major changes:

  • all scraping functionality has been moved to the Scraper class
  • the Thresher class now only handles selecting a scraper by URL, and running it
  • ScraperBox class holds a collection of scrapers and can match them to URLs
  • all logging has been removed and the entire module now operates using events
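
To make the division of responsibilities above concrete, here is a minimal sketch of a scraper collection that matches definitions to URLs and reports through events rather than log calls. It is not thresher's actual code: the method names (`addScraper`, `getScraper`), the event names (`scraperAdded`, `match`, `noMatch`) and the example definition are all assumptions made for illustration. Only Node's built-in `events` module is used.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a collection of scraper definitions.
// Method and event names are assumptions, not thresher's real API.
class ScraperBox extends EventEmitter {
  constructor() {
    super();
    this.scrapers = [];
  }

  // Register a definition whose `url` field is a regular-expression string.
  addScraper(definition) {
    this.scrapers.push({ pattern: new RegExp(definition.url), definition });
    this.emit('scraperAdded', definition.name);
  }

  // Return the first definition whose pattern matches the URL, or null.
  // The outcome is reported as an event instead of being logged.
  getScraper(url) {
    const hit = this.scrapers.find((s) => s.pattern.test(url));
    if (!hit) {
      this.emit('noMatch', url);
      return null;
    }
    this.emit('match', url, hit.definition.name);
    return hit.definition;
  }
}

// Usage: the caller decides what to do with events (log, count, ignore).
const box = new ScraperBox();
const events = [];
box.on('match', (url, name) => events.push(`match:${name}`));
box.on('noMatch', (url) => events.push(`noMatch:${url}`));

box.addScraper({ name: 'plos', url: 'plos.*\\.org', elements: {} });
const found = box.getScraper('http://www.plosone.org/article/x');
const missing = box.getScraper('http://example.com/');
```

Emitting events keeps the module silent by default and leaves output decisions to whatever program embeds it, which is presumably the motivation for dropping built-in logging.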

Richard Smith and others added 3 commits August 23, 2014 12:34
scraperJSON features implemented:

- elements can be nested (fixes #2 and ContentMine/scraperJSON#3)
- elements can depend on 'following' the captured URLs from other elements (fixes #6)
- URLs are resolved (and all redirects followed) before scraping (fixes #10)
- headless pre-rendering is no longer the default (for a massive speed/efficiency increase)
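
As a rough illustration of the nesting and 'follow' features listed above, the sketch below walks a scraper definition and records each element's path along with the element it follows, if any. The definition is hypothetical: the field names (`selector`, `attribute`, `elements`, `follow`), the selectors and the `flattenElements` helper are assumptions for illustration, not a confirmed scraperJSON schema or thresher function.

```javascript
// Hypothetical definition. Field names are assumed from the description
// above, not taken from a confirmed scraperJSON schema.
const definition = {
  url: 'example\\.org',
  elements: {
    article_link: { selector: "//a[@class='article']", attribute: 'href' },
    authors: {
      selector: "//div[@class='author']",
      // Nested elements: evaluated relative to each matched author node.
      elements: {
        name: { selector: ".//span[@class='name']" },
      },
    },
    // 'follow': scraped from the page that article_link points to.
    abstract: { selector: "//div[@id='abstract']", follow: 'article_link' },
  },
};

// Walk the element tree depth-first, returning one record per element
// with its dotted path and the name of the element it follows (or null).
function flattenElements(elements, prefix = '') {
  const out = [];
  for (const [key, element] of Object.entries(elements)) {
    const path = prefix ? `${prefix}.${key}` : key;
    out.push({ path, follow: element.follow || null });
    if (element.elements) {
      out.push(...flattenElements(element.elements, path));
    }
  }
  return out;
}

const flat = flattenElements(definition.elements);
```

A flat list like this makes it easy to see which elements can be scraped from the starting page and which must wait until a followed URL has been captured.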
blahah added a commit that referenced this pull request Sep 7, 2014
@blahah blahah merged commit 5b17e4f into master Sep 7, 2014
@coveralls

Coverage Status

Coverage increased (+6.24%) when pulling f4406c6 on follow-on into 145c0e0 on master.

@blahah blahah removed the in progress label Sep 7, 2014
2 participants