1.1.0

hynky1999 released this 13 Nov 00:44


Code

Default throttling for downloaders set to max 300 requests per second.
Downloader now takes a client for downloading. There are currently two clients:

s3 -> Directly queries the Common Crawl buckets
api -> Queries the CommonCrawl API Gateway

Retry system has been updated to leverage tenacity. Additionally, we now use random exponential backoff instead of linear random backoff.
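
To illustrate the difference the backoff change makes, here is a minimal standard-library sketch of random exponential backoff. It is not the code shipped in this release, which delegates retries to tenacity. The names retry_with_backoff, base and cap are hypothetical, and the defaults are chosen only for illustration.

```python
import random
import time


def retry_with_backoff(func, attempts=5, base=1.0, cap=60.0,
                       sleep=time.sleep, rng=random):
    """Call func until it succeeds or the attempts run out.

    Hypothetical helper for illustration only; not part of this project.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt == attempts - 1:
                break
            # The upper bound doubles on every failed attempt, up to cap.
            # A linear scheme would grow it as base * (attempt + 1) instead.
            upper = min(cap, base * 2 ** attempt)
            # Pick a random point below the bound so that parallel
            # downloaders do not all retry at the same instant.
            sleep(rng.uniform(0, upper))
    raise last_error
```

With base=1.0 the waits are drawn from [0, 1], [0, 2], [0, 4] seconds and so on, so repeated failures back off much faster than under a linear schedule while the random draw spreads the retries out.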

CLI

New global parameter --aws_profile for setting the AWS profile to use
New parameter --download_method, which can be set for:

extract...records --download_method
download...html --download_method

In both cases the argument can be set to either s3 or api, which defines how Common Crawl is accessed when downloading WARC files.
