Releases: hynky1999/CmonCrawl

1.1.8

07 Apr 23:02

What's Changed

New Contributors

Full Changelog: 1.1.7...1.1.8

1.1.7

14 Feb 00:16
bc54438

What's Changed

New Contributors

Full Changelog: 1.1.6...1.1.7

1.1.6

14 Jan 14:36
  • SDK users can now set max_requests_per_second
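
A cap on requests per second can be pictured as spacing request starts evenly. The sketch below is a minimal stdlib illustration of that idea, not CmonCrawl's implementation: the `RequestThrottle` class, its `wait` method, and the injectable `clock`/`sleep` arguments are hypothetical names introduced here. Only the parameter name `max_requests_per_second` comes from the release note.

```python
import time
from typing import Callable


class RequestThrottle:
    """Hypothetical helper: spaces request starts 1 / max_requests_per_second apart."""

    def __init__(
        self,
        max_requests_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_requests_per_second <= 0:
            raise ValueError("max_requests_per_second must be positive")
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock
        self._sleep = sleep
        # The first request is allowed immediately.
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next request may start; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the next slot one interval after this request's start time.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Calling `throttle.wait()` before each request keeps the start rate at or below the configured cap; an asynchronous downloader would use `asyncio.sleep` in place of `time.sleep`.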

1.1.5

08 Jan 00:27
eafa97d
  • Readme
  • dotenv requirement
  • !TYPES!

1.1.4

07 Dec 00:18
eb4dc39

What's Changed

Full Changelog: 1.1.3...1.1.4

1.1.3

07 Dec 00:23
3b8c2d1

What's Changed

Full Changelog: 1.1.2...1.1.3

1.1.2

20 Nov 00:05
02d6c05

What's Changed

Full Changelog: 1.1.0...1.1.2

1.1.0

13 Nov 00:44

Code

  • Default throttling for downloaders is set to a maximum of 300 requests per second.
  • The downloader now takes a client for downloading. There are currently two clients:
      • s3 -> queries the Common Crawl buckets directly
      • api -> queries the CommonCrawl API Gateway
  • The retry system now uses tenacity. Additionally, it uses random exponential backoff instead of linear random backoff.
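
To illustrate the difference between the two backoff styles named above, here is a stdlib-only sketch. It does not use tenacity and is not CmonCrawl's retry code: the function names, the `step`, `multiplier`, and `max_wait` parameters, and the exact formulas are assumptions chosen for illustration. tenacity ships a `wait_random_exponential` strategy that follows the same general idea (a random wait drawn from an exponentially widening window); the release note does not say which tenacity settings are used.

```python
import random
from typing import Optional


def linear_random_wait(
    attempt: int, step: float = 1.0, rng: Optional[random.Random] = None
) -> float:
    """Assumed form of linear random backoff: the window grows by `step` per attempt."""
    source = rng if rng is not None else random
    return source.uniform(0, step * attempt)


def random_exponential_wait(
    attempt: int,
    multiplier: float = 1.0,
    max_wait: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Random exponential backoff: the window doubles per attempt, capped at max_wait."""
    source = rng if rng is not None else random
    ceiling = min(max_wait, multiplier * 2 ** (attempt - 1))
    return source.uniform(0, ceiling)


if __name__ == "__main__":
    rng = random.Random(0)
    for attempt in range(1, 8):
        # Compare the upper bound of each window as attempts accumulate.
        print(
            attempt,
            round(linear_random_wait(attempt, rng=rng), 2),
            round(random_exponential_wait(attempt, rng=rng), 2),
        )
```

The exponential window backs off much faster after repeated failures, while the random draw spreads retries from many concurrent workers so they do not hit the server in lockstep.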

CLI

  • New global parameter --aws_profile for selecting the AWS profile to use
  • New parameter --download_method, which can be set for:
      • extract...records --download_method
      • download...html --download_method

In both cases the argument can be set to either s3 or api, which defines how Common Crawl is accessed when downloading WARC files.
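
The shape of these options (one global flag plus a per-command flag restricted to two values) can be sketched with `argparse`. This is a hypothetical parser, not the real CmonCrawl CLI: the program name `example-cli`, the `build_parser` function, and the bare `extract` and `download` subcommand names are placeholders, and no default download method is assumed. Only the flag names and the `s3`/`api` values come from the release note.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser with a global flag and a per-command choice flag."""
    parser = argparse.ArgumentParser(prog="example-cli")
    # Global option: given before the subcommand name.
    parser.add_argument("--aws_profile", default=None, help="AWS profile to use")

    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("extract", "download"):  # placeholder subcommand names
        command = subcommands.add_parser(name)
        # Restrict the value to the two documented access methods.
        command.add_argument("--download_method", choices=("s3", "api"), default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--aws_profile", "dev", "download", "--download_method", "s3"]
    )
    print(args.aws_profile, args.command, args.download_method)
```

Using `choices` makes the parser reject any value other than `s3` or `api` with a usage error before any download starts.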

1.0.5

25 Oct 08:31
f4dbb56

What's Changed

Full Changelog: 1.0.4...1.0.5

1.0.4

09 Sep 20:02

What's Changed

Full Changelog: 1.0.3...1.0.4