feature: ability to nest elements #3

Open
blahah opened this issue Jul 11, 2014 · 0 comments
blahah commented Jul 11, 2014

See ContentMine/thresher#2

blahah changed the title from "feature: ability to next elements" to "feature: ability to nest elements" on Jul 11, 2014
blahah added a commit to ContentMine/thresher that referenced this issue on Sep 7, 2014:
This is the first and most significant step in a complete overhaul of thresher.
Its purpose is to support the current and near-future needs of
scraperJSON, based on revisiting the design and incorporating a lot of
user feedback.

Major changes:

- all scraping functionality has been moved to the Scraper class
- the Thresher class now only handles selecting a scraper by URL, and running it
- ScraperBox class holds a collection of scrapers and can match them to URLs
- all logging has been removed and the entire module now operates using events
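
The division of responsibilities listed above can be sketched roughly as follows. This is an illustrative Python sketch only, not thresher's actual code (thresher is a separate project with its own API): the class names `Thresher` and `ScraperBox` come from the commit message, while `Emitter`, the method names (`add`, `match`, `scrape`, `on`, `emit`), and the event names (`scraperSelected`, `error`) are assumptions made for the example.

```python
import re
from collections import defaultdict


class Emitter:
    """Minimal event emitter: callers subscribe instead of reading logs."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event, callback):
        self._listeners[event].append(callback)

    def emit(self, event, *args):
        for callback in self._listeners[event]:
            callback(*args)


class ScraperBox:
    """Holds a collection of scrapers and matches them to URLs."""

    def __init__(self):
        self.scrapers = []  # list of (name, compiled URL pattern)

    def add(self, name, url_pattern):
        # Hypothetical registration method; each scraper declares a URL regex.
        self.scrapers.append((name, re.compile(url_pattern)))

    def match(self, url):
        # Return the first scraper whose pattern matches, else None.
        for name, pattern in self.scrapers:
            if pattern.search(url):
                return name
        return None


class Thresher(Emitter):
    """Only selects a scraper by URL and reports what happened via events."""

    def __init__(self, box):
        super().__init__()
        self.box = box

    def scrape(self, url):
        name = self.box.match(url)
        if name is None:
            self.emit("error", url, "no matching scraper")
            return None
        self.emit("scraperSelected", url, name)
        return name  # a real implementation would run the scraper here
```

A caller would attach listeners with `on(...)` before calling `scrape(url)`, so that reporting stays outside the module, in line with the move from logging to events.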

scraperJSON features implemented:

- elements can be nested (fixes #2 and ContentMine/scraperJSON#3)
- elements can depend on 'following' the captured URLs from other elements (fixes #6)
- URLs are resolved (and all redirects followed) before scraping (fixes #10)
- headless pre-rendering is no longer the default (for a massive speed/efficiency increase)
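
To illustrate what nesting elements means in practice, here is a minimal Python sketch, assuming a definition format in which an element may carry its own `elements` map whose selectors are evaluated relative to each match of the parent. The key names (`selector`, `elements`), the `scrape` function, and the sample document are assumptions for illustration and are not taken from the scraperJSON specification or from thresher.

```python
import xml.etree.ElementTree as ET


def scrape(node, elements):
    """Recursively capture elements; children are scoped to each parent match."""
    results = {}
    for name, spec in elements.items():
        captured = []
        for match in node.findall(spec["selector"]):
            if "elements" in spec:
                # Nested definition: scrape the children relative to this match.
                captured.append(scrape(match, spec["elements"]))
            else:
                captured.append((match.text or "").strip())
        results[name] = captured
    return results


# Hypothetical sample document, invented for this example.
doc = ET.fromstring("""
<article>
  <author><name>A. Smith</name><affil>Univ X</affil></author>
  <author><name>B. Jones</name><affil>Univ Y</affil></author>
</article>
""")

# Hypothetical nested definition: each author groups its own name and affiliation.
definition = {
    "author": {
        "selector": ".//author",
        "elements": {
            "name": {"selector": "./name"},
            "affiliation": {"selector": "./affil"},
        },
    }
}

result = scrape(doc, definition)
```

Without nesting, names and affiliations would come back as two flat lists with no record of which affiliation belongs to which author; scoping the child selectors to each parent match keeps them grouped.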