-
Notifications
You must be signed in to change notification settings - Fork 760
Internet Archive Crawler Requirements Analysis
Internet Archive Crawler Requirements Analysis
2003-03-07
Michele Kimpton ([email protected])
[Raymie Stata ([email protected])
Gordon Mohr ([email protected])]
- 1Introduction
- 2Use-case analysis
- 3Requirements
The Internet Archive in collaboration with several National Libraries is seeking to build an open source Crawler that can be used primarily for Web Archiving purposes meeting the requirements defined by the National Libraries Web Archive Consortium.
This document describes in part the anticipated requirements for such crawling software.The first subsection discusses the "use cases" we wish to support (that is, it describes, at a high-level, our anticipated patterns of use).The following section provides a statement of requirements derived from those use cases.
The Archive wants to support the following use cases:
Broad crawls are large, high-bandwidth crawls in which the number of sites and the number of interesting, individual pages touched is as important as the completeness with which any one site is covered.The Web is uncataloged; broad crawling is important for discovering unknown pages.Also, one cannot know a priori what parts of the Web will be important in the future; broad crawling attempts to ensure sufficient coverage for these unknown uses.
Narrow crawls are small- to medium-sized crawls in which the quality criterion is complete coverage of some selected sites or topics.Narrow crawls are necessary for capturing sites that are known a priori to be important (either intrinsically or in relation to some collection being put together).
Traditionally, crawlers have taken pains to download only pages they have not previously downloaded.Such crawlers are typically known as batch _or _snapshot crawlers because they are typically run for a fixed amount of time to take a snapshot of the Web, and then are restarted to do a full crawl all over again.
Continuous crawling continuously downloads previously fetched pages (looking for changes) as well as (possibly) discovering and fetching new pages.Continuous crawling is useful for capturing pages that change quickly.Also, when the download rates of pages are adjusted to reflect measured change frequencies, continuous crawling can be more efficient than batch crawling.
The Archive is interested in continuous crawling, both for its broad and its narrow crawls.
The Archive is interested in crawling pages available through protocols other than HTTP.RSS is of particular interest.Additional protocols include HTTPS, FTP, and streaming protocols such as MMS, RTSP, and PNA.
The Archive wants to support experiments in crawling techniques and technology performed by external organizations.Crawling for the National Digital Science Library done at Cornell University is a good example of what we want to support.In this project, researchers wanted to build a topic-focused crawler to find science-related material on the Web.They were able to write "analyzers" that used term-vector techniques to identify material of interest and were able to modify the policies of the crawler to use this information to direct the crawl.
We want to be able to support projects such as the NSDL project.To do so, the crawler needs to be easy to extend and easy to use, and it cannot be over-encumbered with licensing constraints.
The Archive does not have a large operations staff and thus requires a high level of automation in its crawlers.In particular, we imagine that a single, broad crawl can be administered by the part-time efforts of a single person (assuming reliable hardware and network connectivity).Also, the Archive cannot provide much technical support to external organizations using its crawling software, so this software must be easy to operate without much support.
We believe that the following requirements are necessary to support the above use cases:
Extensibility is the ability to change the behavior of the crawler: its selection policy, its scheduling policy, what it stores to disk, and so forth.Extensibility is our most important requirement.It is needed to support almost all of the use cases above (broad and narrow crawling, continuous crawling, multi-protocol crawling, ease of operation, and, as mentioned above, experimental crawling).
As enumerated below, extensibility applies to many aspects of the crawler.At a high level, what we really want is a framework for building crawlers, not a particular crawler embodying specific policies or capabilities.A crawl designer should be able to put together an instance of this framework to meet the needs of a specific crawling project.At the same time, we do not want a traditional object-oriented framework that requires programming to put together useful systems.As we discuss below, while the ability to plug new, custom code into the framework is important, it is also important that the framework support extensibility through parameterization as well.
Many different aspects of the crawler need to be extensible.In this subsection, we list what needs to be extensible in the crawler; in the following subsection, we list how this extensibility is to be achieved.
It should be easy to add new downloading protocols to the framework.
The selection policy determines what the crawler will download.Selection is typically the product of two mechanisms: a discovery mechanism by which new URLs are identified, and a filtering mechanism that decides which of the discovered URLs is to be downloaded.In continuous crawls in which old URLs are recrawled, there also needs to be a filtering process to eliminate URLs that are no longer of interest.
Selection policies are of vital importance to both broad and narrow crawlers.It is an area in which we expect there to be significant, on-going research and experimentation.Thus, our crawler framework must support the widest-possible extensibility when it comes to discovery and filtering mechanisms.
The discovery process can sometimes happen outside of the crawler itself.For example, some organizations have access to Web proxies or other usage information that can be good sources of URLs to crawl.To support external discovery and filtering mechanisms, it must be possible to both add and remove large numbers of URLs from the crawler while a crawl is in process.
The scheduling policy (sometimes called ranking policy) determines the order in which pending URLs are downloaded.The scheduling and selection policies together define the contents of the collection obtained by a crawl, and thus are of enormous interest to crawl designers.Significant flexibility is required with respect to both.However, flexibility with respect to scheduling is particularly difficult to provide because the scheduling policy is determined by the crawler's frontier of pending URLs.
A crawler's frontier is the collection of URLs that the crawler intends to download in the future.In a batch crawl, the frontier contains only new URLs (URLs that have not been downloaded by this crawl).In a continuous crawl, the frontier can contain both new URLs and old URLs that are due to be recrawled.The scheduling policy determines the order in which elements are removed from this frontier.For broad crawling the frontier can become very large, containing as many as 10's of billions of URLS, and thus, in a scalable crawler, the frontier must be a disk-based data structure.Disk-based data structures can have high latency, but the frontier is on the critical path of the crawler, where latency cannot be tolerated.Thus, a good frontier requires significant programming skill to implement and tune.
As a result, we cannot expect new frontiers to be written quickly and easily, and thus flexibility for scheduling policies cannot come primarily from writing new frontier implementations.Rather, it is important that the crawler framework come with a variety of frontier implementations supporting a variety of scheduling policies.At the same time, to support experimentation and further development, there must be a good API for defining new frontier implementations.
High-capacity crawlers can easily overwhelm a low-end Web server and can even put stress on high-end Web servers.Thus, it is very important that crawlers be "polite," limiting the load that they put onto a Web server.
Unfortunately, the need to be polite sometimes conflicts with the scheduling policy of a crawl.For example, a simple politeness policy is to stagger, over time, the pages downloaded from a given site.However, as discussed above, it is sometimes desirable to download images and other objects associated with a page immediately after the page itself is downloaded.If this latter requirement is particularly important to the crawl, then a more sophisticated politeness policy is needed to accommodate it.At the same time, when such freshness issues are not important to a crawl, it is generally considered neighborly to space downloads from a site as far apart as possible.
The crawler must be flexible regarding the politeness policies it implements.Again, there really is no such thing as a "one size fits all" crawler; what is needed is a framework that accommodates a wide variety of policies.
Once a document is downloaded, it is processed for a variety of reasons.Features might be extracted for the purpose of further crawling (e.g., links for most crawls, plus link-text for topic-directed crawls).Features might be extracted for the sake of collecting statistics, and the page itself might be transformed into some appropriate output format.
Thus, the crawler must support a framework for pluggable "analyzers" that process documents.Further, this framework must support effective interoperability among analyzers, ensuring that the good work of one group can be built-upon by others.
Analyzers should have access to as much metadata as possible.This includes the URL used to fetch the object, download timestamps, DNS resolution information, and request and response headers used during the transaction (plus any other material used to fetch the object, such as passwords or SSL keys).
Analyzers should have access to the results of all requests, not just successful ones.This is necessary, for example, for detecting change in the Web, or for detecting the edges of the "deep Web."This means recording, for example, 30x (moved/redirect) and 40x (invalid/not found/unauthorized) responses.It also means recording requests that are prevented by robot exclusion rules.Finally, automatically fetched robot-exclusion files should also be subject to regular analysis (rather than be handled by a separate path).
Link extraction is, of course, a very important form of document processing.Extensibility in this regard means that it should be easy to add modules to handle Javascript, Flash, and other media-types that have embedded links.
Another form of document processing is duplicate elimination.It should be possible for the crawler to recognize, during the crawl, that a duplicate has been encountered so that special (more efficient) steps can be taken. Any special duplicate handling should be noted in the crawler output and amenable to analysis by standard tools. It should be possible to vary the definition of "duplicate" used (from very strict to very loose).
Output is another important form of document processing.The output of the crawler should be highly configurable in both the selection of data to be output and the format in which it is output.While one typically saves the pages from a crawl, crawls are often run simply to collect features or statistics.In these cases, the pages themselves are not stored, rather the features and/or statistics are extracted in the crawler and stored directly to disk, reducing disk-space requirements significantly.The crawler must support this type of output.
File formats for output vary.In the case of pages, for example, people use ZIP files, tar files, or the Internet Archive's "ARC" file.We need to support ARC files, whose format we will document, expand where necessary, and promote as a general standard. Our framework must accommodate third-party programmatic extension to other formats when necessary.
The Web is becoming increasingly interactive: to receive Web pages, surfers must often fill out forms, and browsers must support authentication and cookie mechanisms.
To keep pace with this ever-more interactive Web, crawlers must support these interaction mechanisms.At the very least, they must support cookies and simple password schemes.And there is an increasing desire for even more support, thus there needs to be extensibility mechanisms to allow crawl-designers to extend the repertoire of interactions supported by the crawl.
There are two well-known mechanisms for making software extensible: parameterization and code extensions.Parameterization is easy to use, but is restrictive in what it can support.Code extensions requires programming, but supports a wider range of flexibility.As discussed below, the crawler framework should support both forms of extensibility.
In addition to supporting both forms of extensibility, the crawler should support dynamic extension; that is, changing of the crawl while a crawl is in process.We discuss this issue below as well.
Extensibility through parameterization is the ability to customize the behavior of the crawl through the use of code modules whose behavior can be changed by supplying appropriate parameters.For example, a frontier module might have a politeness parameter controlling how much time to wait before revisiting a given server, or a scheduling parameter controlling whether image objects from a page are put at the front or the end of the crawling queue.As another example, a filter module might take a set of regular expressions that control what newly-discovered URLs are to be allowed into the frontier.
Another form of parameterization is the ability to control, from a simple, run-time configuration file, which code modules are used during the crawl.For example, one implementation of the frontier might be based on a breadth-first crawling policy; a different implementation might be based on a page-rank-ordered policy.Through simple directives in the crawler's configuration file, the crawl designer should be able to select from among such frontier implementations.(Of course, these modules themselves might be further parameterized; as discussed above, for example, both of these modules might have a flag indicating that images from a page are to be treated as exceptions to the general scheduling policy, ensuring that such images get downloaded contemporaneously with the pages referring to them.)
At a minimum, the crawler framework should have parameters controlling the following aspects of the crawl:
-
*Filtering.*Simple URL-based filtering in which patterns specify hostnames and/or paths of interest.Also, it must be possible to filter based on hop counts, i.e., based on the number of hops from apage allowed by the URL-based tests.
-
*Scheduling.*As mentioned above, scheduling is determined in large part by the types of frontiers that are allowed by the crawler.We believe the following types of frontiers are necessary:
-
*Multiple-priority, breadth-first frontier.*These frontiers allow efficient, disk-based implementations that can scale to very large crawls.Multiple priorities means there are multiple queues with different priorities, allowing one to schedule "urgent" downloads above the basic queue.Such urgent downloads would support, for example, the download of an image contemporaneously with the page containing it.These priorities can also be used in continuous crawls, wherein frequently-changing pages can be given a higher priority than slowly-changing ones.
-
*Site-first frontier.*These frontiers are closer to depth-first crawlers, attempting to complete sites as quickly as possible before starting on new sites.Like breadth-first frontiers, these are very scalable and can support a priority scheme.
-
*Dynamically ranked frontier.*The previous frontiers assign an unchanging priority to URLs when they are (re)inserted into the frontier.A dynamically ranked frontier allows the crawl to update URL priorities after insertion.Such frontiers can be used, for example, to download pages in close-to-Pagerank order and also allow ranks to be modified according to linguistic or other considerations.These frontiers typically have an in-memory component that limits their scalability according to the amount of memory on the crawler machines, and thus are less scalable than the statically ranked frontiers.
All of these frontiers must support distribution across multiple machines.They should support continuous crawling and flexible politeness policies.
-
*HTTP.*The crawler must come with built-in support for HTTP (v. 1.1).This HTTP implementation should support robot exclusions, but it should also be possible to ignore robot exclusions.This module should support parameterization of outgoing headers (e.g., allowing the crawl designer to set User-Agent and other header fields).This module should support use of HTTP's if-modified-since header (and also support not using it).
-
*Link extraction.*The crawler should be able to extract links from HTML and from Flash objects.On HTML objects, it should make a "best effort" at extracting links from Javascript and other scripting languages.
-
*Output.*As mentioned above, output to ARC files, TAR files, and ZIP files needs to be supported.There must be a mechanism for controlling what objects get saved, based both on media-type and also on URL-based tests.
Crawl-designers will not have to write new pieces of code to achieve the above-listed changes.
Extensibility through code extensions is the ability to change the behavior of the crawl by inserting into the crawling software customize modules that replace or augment standardized modules.This allows the crawl design to develop (or have developed) modules that meet needs that cannot be met through parameterization of existing modules.
Support for code extensions is difficult.It is one thing for the crawler's original authors write two or three modules from which the crawl operator can choose.It is quite another to design and document appropriate internal APIs that would allow third parties to write their own modules.Such an extensible design cannot be added to a crawler after it has been written; it influences all aspects of the crawler's internal composition and must be contemplated from the very beginning.
An extensible crawling framework must support code extensions in its downloading modules (which impacts extensibility of protocols and interactivity), its frontier (impacts scheduling and politeness), frontier insertion (selection), and document processing (feature extraction, output, discovery, filtering).There must also be a good internal infrastructure for the various modules to communicate their results to one another; this is particularly the case for communicating the results of document processing throughout the crawler.
Crawls often extend over days and even weeks; continuous crawls can run for months.It is not enough that a long-running crawl be extensible when it is first started, it must be possible to change the crawl during the crawl itself.To change policies, it may be necessary to temporarily halt the crawler, but the restart process must be reasonably fast and must not lose track of the state of the crawl.
A single (adequately-configured) Pentium-based machine should be able to download 10-20M documents per day (which translates roughly to 15-30 Mbps bandwidth averaged over the entire day).Further, it should be possible to distribute a crawl over 10-20 such machines to achieve 10-20x improvements in crawling bandwidth.
Bandwidth of the crawler should not stop dropping until at least 2.5B documents have been downloaded (if at all).(This assumes no interference from politeness issues, e.g., an adequate pool of sites to pick from.)
The crawler should be usable for small as well as large crawls.While it should be easy to use for all crawls, it should be brain-dead simple for small crawls, i.e., the operator of a small crawl should not be forced to suffer unneeded complexity just because we also want to support large crawls.We believe the best way to support this requirement is a default configuration of the "toolkit" that does an adequate job for simple crawling problems.
The crawler must run very well on Linux. Where practical, the crawler should also be capable of small and test runs on Windows 2K/XP.This means it is written in languages and with middleware and libraries that are portable.
As suggested in 3.4, ease of operation is a somewhat relative term.A simple, single-machine, multi-hour crawl should be attainable by operators with relatively primitive OS administrative skills and no programming skills.A complex, multi-machine, multi-week crawl will require more advanced skills.But in either case, through a combination of simplicity and good documentation, the crawler should be operable by anybody with the requisite OS and programming skills; appeals to a secret "oral tradition" (unfortunately common when it comes to crawlers) should absolutely not be necessary.
In addition to being simple to operate, the crawler should require little attention or effort while it is underway.While exceptional conditions requiring operator intervention will always exist, the crawler should be as robust to networking and other problems as possible.The operator should not be required to supply an ongoing list of URLs to download or otherwise feed the crawler to keep it going (although it should be possible for an operator to insert URLs if they want to).
Long (multi-day crawls) can be temporarily halted for a number of reasons: the WAN connection becomes unavailable, a bug in the software is tickled, or the crawling policy needs to be changed.To support such interruptions, it must be possible to restart the crawl from data stored on disk.Ideally, such restarts should be "clean" in that data is not lost; e.g., if a document is being downloaded but does not quite finish when the crawler is shut down, then the crawler should attempt to download that document again when it restarts.
Good logging (including control over log output) is also necessary for monitoring the crawler's behavior and diagnosing any problems. Logging should also support long-term confidence in the output of crawling activity, assisting in the determination of data provenance and authenticity. However, such logs should never become necessary to usefully analyze crawl results: standalone ARC files should remain sufficient and complete as a self-contained crawl record.
Some kind of "control panel" is needed to allow the operator perform basic operations such as starting a crawl, monitoring the crawl's progress, initiating a temporary halt to the crawl or changing the logging parameters. A web-based interface for performing these tasks, from the local machine or remotely, should be offered.
Finally, "ease of operation" extends to software required by the crawler as well as the crawler itself.For example, although object-request brokers (ORBs) per se are acceptable, many of them are very difficult to administer, and the crawl must not depend on such a one.
While not a technical requirement per se, we believe open licensing is necessary to achieve our goals, especially the goal of external collaboration.The Archive has extensive experience with using proprietary crawlers from other organizations.We have found that the activities of the Archive have a low priority at these organizations, and thus it can take a very long time to get permission to do anything (if we can get it at all).Also, to support experiments to be done at Universities and other such institutions, we have wanted to give third parties access to crawling technology and have been hindered in doing so.
Thus, we want a crawling framework in which we do not have to get permission to do most of the things we want to do.Ideally, we want a completely open-source crawler.At the very least, we want a crawler that anyone (e.g., the Archive, Universities, and even corporate research labs) can use for non-commercial purposes without seeking permission from the copyright holder.
We expect the "Lesser" (or "Library") Gnu Public License (LGPL) to meet our licensing needs, and we plan to release our work at a minimum under this license. (Other license options are not precluded.)
On a related topic, the crawler must not depend on middleware or other software that comes with prohibitive licensing terms.
Structured Guides:
User Guide
- Introduction
- New Features in 3.0 and 3.1
- Your First Crawl
- Checkpointing
- Main Console Page
- Profiles
- Heritrix Output
- Common Heritrix Use Cases
- Jobs
- Configuring Jobs and Profiles
- Processing Chains
- Credentials
- Creating Jobs and Profiles
- Outside the User Interface
- A Quick Guide to Creating a Profile
- Job Page
- Frontier
- Spring Framework
- Multiple Machine Crawling
- Heritrix3 on Mac OS X
- Heritrix3 on Windows
- Responsible Crawling
- Politeness parameters
- BeanShell Script For Downloading Video
- crawl manifest
- JVM Options
- Frontier queue budgets
- BeanShell User Notes
- Facebook and Twitter Scroll-down
- Deduping (Duplication Reduction)
- Force speculative embed URIs into single queue.
- Heritrix3 Useful Scripts
- How-To Feed URLs in bulk to a crawler
- MatchesListRegexDecideRule vs NotMatchesListRegexDecideRule
- WARC (Web ARChive)
- When taking a snapshot Heritrix renames crawl.log
- YouTube
- H3 Dev Notes for Crawl Operators
- Development Notes
- Spring Crawl Configuration
- Potential Cleanup-Refactorings
- Future Directions Brainstorming
- Documentation Wishlist
- Web Spam Detection for Heritrix
- Style Guide
- HOWTO Ship a Heritrix Release
- Heritrix in Eclipse