1.1.0

@hynky1999 hynky1999 released this 13 Nov 00:44
· 71 commits to main since this release

Code

  • Default throttling for downloaders is now capped at 300 requests per second.
  • The downloader now takes a client to use for downloading. Two clients currently exist:
      • s3 -> Directly queries the Common Crawl buckets
      • api -> Queries the CommonCrawl API Gateway
  • The retry system has been updated to use tenacity. Additionally, it now uses random exponential backoff instead of linear random backoff.
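
To illustrate the difference between the two backoff strategies, here is a minimal standard-library sketch. It is not the library's implementation and does not use tenacity; the function names, the `step`, `multiplier`, and `max_delay` parameters, and their default values are all assumptions chosen for illustration.

```python
import random
import time


def linear_random_backoff(attempt, step=1.0):
    """Random delay whose upper bound grows linearly with the attempt number."""
    return random.uniform(0, step * attempt)


def random_exponential_backoff(attempt, multiplier=1.0, max_delay=60.0):
    """Random delay whose upper bound doubles with each attempt, capped at max_delay."""
    upper = min(max_delay, multiplier * 2 ** attempt)
    return random.uniform(0, upper)


def retry(func, max_attempts=5, wait=random_exponential_backoff, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(wait(attempt))  # wait before the next attempt
```

With exponential growth the retry window widens quickly, so repeated failures back off from a struggling endpoint much faster than with a linear window, while the randomness keeps concurrent workers from retrying in lockstep.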

CLI

  • New global parameter --aws_profile for setting the AWS profile to use
  • New parameter --download_method, which can be set for:
      • extract...records --download_method
      • download...html --download_method

In both cases the argument can be set to either s3 or api, which defines how Common Crawl is accessed when downloading WARC files.
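
As a rough sketch of how a global option and a two-valued per-command option like these can be wired up, the following uses Python's argparse. This is not the project's actual CLI code: the program name `example-cli`, the `download` subcommand, and the default of `api` are hypothetical and only the option names and the s3/api choices come from the notes above.

```python
import argparse


def build_parser():
    # Hypothetical program name; only the option names mirror the release notes.
    parser = argparse.ArgumentParser(prog="example-cli")
    # Global option: given before the subcommand.
    parser.add_argument("--aws_profile", default=None,
                        help="AWS profile to use")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # Hypothetical subcommand standing in for the real download command.
    download = subparsers.add_parser("download")
    download.add_argument("--download_method", choices=["s3", "api"],
                          default="api",  # assumed default, not confirmed
                          help="how Common Crawl is accessed")
    return parser


args = build_parser().parse_args(
    ["--aws_profile", "dev", "download", "--download_method", "s3"]
)
```

Restricting the option with `choices` means any value other than s3 or api is rejected at parse time rather than failing later during a download.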