Future Directions Brainstorming
Some ideas for future crawler changes.
The current grouping of processors by role (prefetch, fetch, extract, etc.) is a bit constraining, especially when a processor could be used in multiple places or when multiple processors need to work together.
Could the processor chain be replaced with hooks and callbacks? There could be a series of established callback points running from start to end: earlyPrereqs, latePrereqs, earlyFetch, lateFetch, earlyAnalysis, middleAnalysis, lateAnalysis, earlyFinish, lateFinish. One module could hook itself in at multiple places, so we wouldn't have to use many small Processors for simple functionality. Modules would by default hook themselves in at reasonable places, but expert operators could override those defaults.
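To make the hook idea concrete, here is a minimal sketch of what such a registry might look like. Nothing in it is existing Heritrix API: `HookPoint`, `HookRegistry`, and the generic item type `T` (standing in for whatever object flows through the chain) are hypothetical names chosen for illustration, and the enum constants simply mirror the callback points listed above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical callback points, in the order they would fire. */
enum HookPoint {
    EARLY_PREREQS, LATE_PREREQS,
    EARLY_FETCH, LATE_FETCH,
    EARLY_ANALYSIS, MIDDLE_ANALYSIS, LATE_ANALYSIS,
    EARLY_FINISH, LATE_FINISH
}

/** Hypothetical registry; T stands in for the item being processed. */
class HookRegistry<T> {
    private final Map<HookPoint, List<Consumer<T>>> hooks =
            new EnumMap<>(HookPoint.class);

    /** A module may call this several times to hook in at several points. */
    void register(HookPoint point, Consumer<T> callback) {
        hooks.computeIfAbsent(point, p -> new ArrayList<>()).add(callback);
    }

    /** Fires every registered callback, walking the points in enum order. */
    void process(T item) {
        for (HookPoint point : HookPoint.values()) {
            List<Consumer<T>> callbacks =
                    hooks.getOrDefault(point, Collections.emptyList());
            for (Consumer<T> callback : callbacks) {
                callback.accept(item);
            }
        }
    }
}
```

In this sketch a single module registers the same callback (or different ones) at `EARLY_FETCH` and `LATE_ANALYSIS`, which is the case the many-small-Processors approach handles awkwardly. An operator override would amount to calling `register` with a different `HookPoint` than the module's default.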
Alternatively, the processor chain could be merged into one chain, but perhaps have advisory orderings – either integers suggestive of relative position, or a series of recommended preconditions/postconditions, like "shouldn't go after any Fetch processors".
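The integer variant of the advisory-ordering idea could be sketched roughly as follows. The classes `AdvisedStep` and `MergedChain` are hypothetical, and the step names and position values used with them are placeholders rather than real configuration; the only point is that a single chain can be ordered by a suggested position, with ties falling back to the order in which steps were added.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical step carrying an advisory (suggested) position. */
class AdvisedStep {
    final String name;
    final int advisoryPosition;

    AdvisedStep(String name, int advisoryPosition) {
        this.name = name;
        this.advisoryPosition = advisoryPosition;
    }
}

/** Hypothetical single merged chain ordered by advisory position. */
class MergedChain {
    private final List<AdvisedStep> steps = new ArrayList<>();

    void add(AdvisedStep step) {
        steps.add(step);
    }

    /** Returns step names sorted by position; equal positions keep insertion order. */
    List<String> orderedNames() {
        List<AdvisedStep> sorted = new ArrayList<>(steps);
        // List.sort is stable, so ties preserve the order of add() calls.
        sorted.sort(Comparator.comparingInt(s -> s.advisoryPosition));
        List<String> names = new ArrayList<>();
        for (AdvisedStep step : sorted) {
            names.add(step.name);
        }
        return names;
    }
}
```

Because the positions are only advisory, an operator could override one by adding the step with a different integer. The precondition/postcondition variant would need a constraint check or topological sort instead of a plain sort, which this sketch does not attempt.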
Can the testing done by scope and alreadyIncluded be merged into the same process? We currently scope first and then test for alreadyIncluded, on the theory that scoping is cheaper (it never requires IO). But some crawls might benefit from the reverse order, and certain efficient alreadyIncluded mechanisms could make rejecting a commonly encountered URI cheaper than scoping it.
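One way to picture the merged test is a single admission step whose check order is configurable. The sketch below is an assumption-laden illustration, not Heritrix code: `AdmissionFilter` is a hypothetical class, the scope rule is an arbitrary `Predicate<String>`, and the alreadyIncluded structure is a plain in-memory `HashSet` rather than any real implementation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical merged scope + alreadyIncluded test with a configurable order. */
class AdmissionFilter {
    private final Predicate<String> inScope;          // stand-in for the scope rules
    private final Set<String> seen = new HashSet<>(); // stand-in for alreadyIncluded
    private final boolean checkSeenFirst;

    AdmissionFilter(Predicate<String> inScope, boolean checkSeenFirst) {
        this.inScope = inScope;
        this.checkSeenFirst = checkSeenFirst;
    }

    /** Returns true only for an in-scope URI that has not been admitted before. */
    boolean admit(String uri) {
        if (checkSeenFirst) {
            // Reject repeats before paying for scoping.
            if (seen.contains(uri) || !inScope.test(uri)) {
                return false;
            }
        } else {
            // Current behavior as described above: scope first, then the seen test.
            if (!inScope.test(uri) || seen.contains(uri)) {
                return false;
            }
        }
        seen.add(uri);
        return true;
    }
}
```

Both orders admit exactly the same URIs; only the cost differs. With `checkSeenFirst` set, a URI that is encountered repeatedly is scoped once and rejected by the set lookup thereafter, which is the case where the reverse order might pay off.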
Is there a place for a 'prescheduling' chain that all discovered URIs get fed through?