unexpected offsite content
If you use one of the default setups, Heritrix assumes you want related resources required to render the source page/site – such as offsite images, frames, javascript, and so on. Heritrix also by default errs on the side of inclusiveness, so sometimes it goes a couple hops off a target site when following links whose necessity is unclear.
When you use a DecidingScope (or another DecideRules chain) as your scope, removing or adjusting the TransclusionDecideRule controls whether, and how far, Heritrix goes off-site. TooManyHopsDecideRule can be adjusted to control how many outlink hops are followed. The default (crawler-beans) settings are intended to be moderately inclusive: they follow many outlinks and allow the crawler to discover and follow transcluded and speculative content. Transcluded content includes that which is necessary to render a page. Speculative content is content that appears to be discoverable by the crawler during its configured discovery processes, i.e. anything that "looks" like a link.
As an alternate example, if you wanted to limit crawling strictly to http://www.example.com/ starting from the default (crawler-beans) settings, then specifying (reject) TooManyHopsDecideRule maxHops = 0 and (accept) TransclusionDecideRule maxTransHops = 0 and maxSpeculativeHops = 0 would prevent the crawler from downloading (or discovering links from) pages more than zero hops from http://www.example.com/. In other words, it would trivially download a single page. You can expand your crawl by not rejecting outlink hops (maxHops > 0), by allowing more transclusion or speculative discovery (maxTransHops, maxSpeculativeHops > 0), and by including additional helpful hosts in your list of seeds or SURT prefixes. Examining the output of short, initial test crawls is helpful in determining how to configure the crawler's scope for the desired results.
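The strict settings described above might look roughly like the following fragment of the scope's rule list in crawler-beans.cxml. This is a sketch, not a complete scope: it assumes both rules live in the org.archive.modules.deciderules package and already appear inside the scope bean's list of rules, as in the default profile. Check the class names and surrounding structure against your own configuration before relying on it.

```xml
<!-- Sketch only: a fragment of the scope's rule list, not a complete scope. -->
<!-- Assumption: these package/class names match your Heritrix 3 crawler-beans.cxml. -->

<!-- REJECT any URI more than maxHops link hops away from a seed. -->
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
  <property name="maxHops" value="0" />
</bean>

<!-- ACCEPT transcluded or speculative URIs within these hop limits. -->
<!-- With both limits at 0, no off-site content is accepted by this rule. -->
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <property name="maxTransHops" value="0" />
  <property name="maxSpeculativeHops" value="0" />
</bean>
```

Raising any of the three values widens the crawl again; a short test crawl after each change shows how much additional content a given setting lets in.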
Legacy scopes (SurtPrefixScope, DomainScope, HostScope, PageScope) have a similar hops setting, but they are less configurable and less efficient, and are no longer recommended.