All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Support `waitFor` for crawler.queue()'s options.
- Fix a bug that did not allow setting the `timeout` option per request.
- Fix a bug of crawling twice if one URL has a trailing slash on the root folder and the other does not.
- Support `browserCache` for crawler.queue()'s options.
- Support the `depthPriority` option again.
- Drop `depthPriority` for crawler.queue()'s options.
- Emit the `newpage` event.
- Support `deniedDomains` and `depthPriority` for crawler.queue()'s options.
- Allow the `allowedDomains` option to accept a list of regular expressions.
- Support `followSitemapXml` for crawler.queue()'s options.
- Fix a bug of not showing console messages properly.
- Fix a bug of listing response properties as methods.
- Fix a bug of not obeying robots.txt.
- Add HCCrawler.defaultArgs() method.
- Emit the `requestretried` event.
- Use the `cache` option not only for remembering already requested URLs but also as a request queue for distributed environments.
- Move the `onSuccess`, `onError` and `maxDepth` options from HCCrawler.connect() and HCCrawler.launch() to crawler.queue().
- Support `obeyRobotsTxt` for crawler.queue()'s options.
- Support `persist` for RedisCache's constructing options.
- Make the `cache` option required for HCCrawler.connect() and HCCrawler.launch().
- Provide `skipDuplicates` to remember and skip duplicate URLs, instead of passing `null` to the `cache` option.
- Modify the `BaseCache` interface.
- Support CSV and JSON Lines formats for exporting results.
- Emit `requeststarted`, `requestskipped`, `requestfinished`, `requestfailed`, `maxdepthreached`, `maxrequestreached` and `disconnected` events.
- Improve debug logs by tracing public APIs and events.
- Allow the `onSuccess` and `evaluatePage` options to be `null`.
- Change `crawler.isPaused`, `crawler.queueSize`, `crawler.pendingQueueSize` and `crawler.requestedCount` from read-only properties to methods.
- Fix a bug of ignoring the `maxDepth` option.
- Refactor by changing the style of requiring the cache directory.
- Fix a bug of starting more crawlers than `maxConcurrency` when requests fail.
- Automatically collect and follow links found in the requested page.
- Support `maxDepth` for crawler.queue()'s options.
- Support `screenshot` for crawler.queue()'s options.
- Rename `ensureCacheClear` to `persistCache` for HCCrawler.connect() and HCCrawler.launch()'s options.
- Support `maxRequest` for HCCrawler.connect() and HCCrawler.launch()'s options.
- Support `allowedDomains` and `userAgent` for crawler.queue()'s options.
- Support pluggable caches such as SessionCache and RedisCache, and the `BaseCache` interface for customizing caches.
- Add crawler.setMaxRequest(), crawler.pause() and crawler.resume() methods.
- Add crawler.pendingQueueSize and crawler.requestedCount read-only properties.
- Add CHANGELOG.md based on Keep a Changelog.
- Add unit tests.
- Automatically dismiss dialogs.
- Improve performance by setting up pages in parallel.
- Support `extraHeaders` for crawler.queue()'s options.
- Add comments in JSDoc style.
- The public API to launch a browser has changed. Now you can launch a browser with HCCrawler.launch().
- Rename `shouldRequest` to `preRequest` for crawler.queue()'s options.
- Refactor by separating the `HCCrawler` and `Crawler` classes.
- Refactor handlers for options.
- Add test with mocha and power-assert.
- Add coverage with istanbul.
- Add setting for CircleCI.
- Add .editorconfig.
- Add debug log.
- Migrate from NPM to Yarn.
- Refactor helper to class static method style.