⚠️ 🔥 A pull request will be merged soon that adds concurrency support. 🎉
gcrawler is a simple (non-concurrent) web crawler written in Java.
Given a starting URL, it visits every reachable page under that domain, without crossing subdomains.
gcrawler writes a JSON document to stdout listing every visited page and every static asset (images, JS, CSS) found on that page.
Example output:
```json
[
  {
    "url": "http://www.debian.org",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org/",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org#content",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debhome.css",
      "http://www.debian.org/debian-en.css",
      "http://www.debian.org/favicon.ico"
    ]
  },
  {
    "url": "http://www.debian.org/intro/about",
    "assets": [
      "http://www.debian.org/Pics/openlogo-50.png",
      "http://www.debian.org/Pics/identica.png",
      "http://www.debian.org/Pics/planet.png",
      "http://www.debian.org/debian.css",
      "http://www.debian.org/debian-en.css"
    ]
  },
  ...
]
```
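The same-domain rule described above can be implemented by comparing the host of each candidate link against the host of the starting URL. The snippet below is a minimal sketch of that check using `java.net.URI`; the class and method names are illustrative, not gcrawler's actual code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch: keep only links whose host matches the starting URL's
// host exactly, so subdomains (and other domains) are not crawled.
final class DomainFilter {
    private final String allowedHost;

    DomainFilter(String startUrl) throws URISyntaxException {
        this.allowedHost = new URI(startUrl).getHost();
    }

    boolean inScope(String candidateUrl) {
        try {
            String host = new URI(candidateUrl).getHost();
            return host != null && host.equalsIgnoreCase(allowedHost);
        } catch (URISyntaxException e) {
            return false; // malformed links are simply not followed
        }
    }
}
```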
gcrawler uses Gradle as its build system. You don't need to install Gradle, as the Gradle wrapper is shipped with the sources.
The following command will create an executable uber-jar under `build/libs/gcrawler-all.jar`:

```
$ ./gradlew clean shadowJar
```
Run the gcrawler executable jar, passing the starting URL as a parameter:

```
$ java -jar build/libs/gcrawler-all.jar http://www.debian.org
```
gcrawler supports some command-line options; see below.
gcrawler is written in Java and uses the fantastic JSoup library to fetch and parse HTML documents, plus the Jackson JSON serializer.
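As a rough illustration of how a JSoup-based crawler collects assets, the sketch below fetches a page and gathers the URLs of images, scripts, and linked resources (stylesheets, icons). It uses the public JSoup API, but the method name and selectors are illustrative assumptions, not gcrawler's actual code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: fetch a page with JSoup and collect the static assets
// (images, scripts, stylesheets/icons) it references.
final class PageFetcher {
    static List<String> fetchAssets(String url, int timeoutMillis) throws IOException {
        Document doc = Jsoup.connect(url)
                .timeout(timeoutMillis) // socket timeout for connect and read
                .get();

        List<String> assets = new ArrayList<>();
        for (Element e : doc.select("img[src]")) {
            assets.add(e.attr("abs:src")); // "abs:" resolves relative URLs
        }
        for (Element e : doc.select("script[src]")) {
            assets.add(e.attr("abs:src"));
        }
        for (Element e : doc.select("link[href]")) {
            assets.add(e.attr("abs:href"));
        }
        return assets;
    }
}
```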
The implementation is single-threaded on purpose, but it could easily be parallelized with minor changes.
CPU usage is very low because the main CPU-intensive activity (DOM parsing) happens only in the main thread.
Memory usage is generally low, because documents are not retained after parsing and JSON results are emitted as soon as possible. In fact, gcrawler uses the Jackson streaming API to flush output as soon as documents are parsed.
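As a sketch of how results can be streamed with Jackson instead of buffered, the snippet below writes one JSON object per visited page and flushes after each one. It uses Jackson's `JsonGenerator`, but the class structure and names are illustrative, not gcrawler's actual implementation.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

import java.io.IOException;
import java.util.List;

// Illustrative sketch: stream results as a JSON array, flushing after each
// page so output appears as soon as a document has been parsed.
final class JsonResultWriter implements AutoCloseable {
    private final JsonGenerator gen;

    JsonResultWriter() throws IOException {
        gen = new JsonFactory().createGenerator(System.out);
        gen.useDefaultPrettyPrinter();
        gen.writeStartArray();
    }

    void writePage(String url, List<String> assets) throws IOException {
        gen.writeStartObject();
        gen.writeStringField("url", url);
        gen.writeArrayFieldStart("assets");
        for (String asset : assets) {
            gen.writeString(asset);
        }
        gen.writeEndArray();
        gen.writeEndObject();
        gen.flush(); // emit this page's result immediately
    }

    @Override
    public void close() throws IOException {
        gen.writeEndArray(); // always terminate the array so output stays valid JSON
        gen.close();
    }
}
```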
gcrawler will try hard to emit valid JSON to stdout.
Upon task cancellation (e.g. when hitting ^C) the crawler will stop as soon as possible (i.e. at the end of the currently running task) and will emit valid JSON for the pages successfully visited.
Again, gcrawler will try hard to emit valid JSON to stdout and, in general, it will avoid intermingling stack traces with JSON output.
Errors are printed to stderr instead.
Errors happening before any JSON character has been emitted (e.g. wrong command-line parameters) will be printed to stderr and will cause the application to exit early.
Errors happening while the crawler is running and JSON is being printed are muted by default. To enable printing crawling errors to stderr, use the `--print-errors true` flag.
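A common way to get the ^C behaviour described above is a JVM shutdown hook that asks the crawl loop to stop and waits for the current task to finish before the process exits. The sketch below shows that pattern with placeholder methods; it is illustrative only, not gcrawler's actual implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: on JVM shutdown (e.g. after ^C), ask the crawl loop to
// stop at the end of the current task, wait for it, then close the JSON output
// so stdout still holds a valid document.
final class CancellableCrawlLoop {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    void run() {
        final Thread mainThread = Thread.currentThread();
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            stopRequested.set(true);
            try {
                mainThread.join(); // let the loop finish its current page
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        while (!stopRequested.get() && hasPendingUrls()) {
            crawlNextUrl();    // visit one page and emit its JSON object
        }
        closeJsonOutput();     // write the closing ']' of the JSON array
    }

    // Placeholders standing in for the real crawler logic.
    private boolean hasPendingUrls() { return false; }
    private void crawlNextUrl() { }
    private void closeJsonOutput() { }
}
```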
gcrawler requires an odd number of command-line parameters, e.g.:

```
$ java -jar build/libs/gcrawler-all.jar [param1 value1 param2 value2 ...] <url>
```

Each parameter must be followed by its value. The starting URL must be passed as the last parameter.
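A straightforward way to handle this layout is to read the arguments as flag/value pairs and take the last argument as the starting URL. The sketch below assumes exactly that shape; the class name and the `"url"` key are conventions of this example, not gcrawler's actual argument parser.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: parse "[flag value]... <url>" style arguments.
final class ArgsParser {
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 == 0) { // flag/value pairs plus the URL means an odd count
            throw new IllegalArgumentException("expected an odd number of arguments");
        }
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i < args.length - 1; i += 2) {
            options.put(args[i], args[i + 1]); // e.g. "--timeout-millis" -> "10000"
        }
        options.put("url", args[args.length - 1]); // the starting URL is the last argument
        return options;
    }
}
```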
flag | values | default | description |
---|---|---|---|
`--print-errors` | boolean | false | Print any exception to stderr |
`--halt-on-error` | boolean | false | Stop crawling after any error occurs |
`--normalize-urls` | boolean | true | Normalize URLs before crawling (remove trailing `#` and `?`) |
`--timeout-millis` | integer | 5000 | Socket timeout in milliseconds for connect and read operations |
For example, the following:

```
java -jar build/libs/gcrawler-all.jar --print-errors true --timeout-millis 10000 www.debian.org 2>errors.txt >output.txt
```

will apply a connection timeout of 10 seconds, print any skipped URL to the file `errors.txt`, and save the JSON output to `output.txt`.
- In case the starting URL returns any error, the empty JSON list `[]` is printed.
- In case a page does not import any static asset, the empty list `"assets": []` is printed.
- gcrawler does not apply any URL normalization, except for removing the trailing fragment `#` and question mark `?`, in order to avoid too many duplicate URLs. This can be configured with the `--normalize-urls` parameter (see the sketch after this list).
- At the moment, it is possible to configure only the JSoup connection timeout. Anyway, JSoup provides sensible defaults: it will follow redirects, will not ignore HTTP errors, will validate certificates, and will handle response body sizes up to 1 MB.
- gcrawler will cross between http and https without problems, but will refuse to cross between domains, therefore crawling `debian.org` is different from crawling `www.debian.org`.
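The normalization described above amounts to dropping the fragment and the query string from a URL before visiting it. The snippet below is a minimal sketch of that step, assuming `java.net.URI` is used; it is illustrative and not gcrawler's actual code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch: strip the fragment ("#...") and query ("?...") parts
// so that variants of the same page map to a single URL.
final class UrlNormalizer {
    static String normalize(String url) {
        try {
            URI u = new URI(url);
            // Rebuild the URI without its query and fragment components.
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            return url; // leave unparsable URLs untouched
        }
    }
}
```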
The gcrawler source files are distributed under a BSD-style license.