
Rate limiting – HTTP 429, Too Many Requests #32

Open
buren opened this issue Oct 22, 2019 · 11 comments

@buren
Owner

buren commented Oct 22, 2019

The Internet Archive has started to rate limit requests more aggressively; we now get blocked after just a dozen or so requests (with the default concurrency setting of 5).

After some testing, we get rate limited even with concurrency set to 1.

To fix this we have to implement a way to throttle requests in order to successfully submit all URLs.

🔗 Related to #22.
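The throttling mentioned above could look something like the sketch below: a minimal fixed-interval throttle that guarantees a minimum gap between requests. This is illustrative only, not the gem's actual implementation; the `Throttle` class and `submit` call are assumed names.

```ruby
# Minimal fixed-interval throttle (illustrative sketch, not wayback_archiver code).
# Guarantees at least `interval` seconds between consecutive requests.
class Throttle
  def initialize(interval)
    @interval = interval
    @last_request_at = nil
  end

  # Blocks until enough time has passed since the previous call, then yields.
  def throttle
    if @last_request_at
      elapsed = Time.now - @last_request_at
      sleep(@interval - elapsed) if elapsed < @interval
    end
    @last_request_at = Time.now
    yield
  end
end

throttle = Throttle.new(12) # 5 requests/minute => one request every 12 seconds
# urls.each { |url| throttle.throttle { submit(url) } }
```

With this shape the delay lives in one place, instead of `sleep` calls scattered through the crawler.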

@bartman081523
Contributor

bartman081523 commented Nov 6, 2019

Can you review my last 3 commits?
https://github.com/chlorophyll-zz/wayback_archiver

I lowered concurrency to 1 and put a sleep(5) in url_collector.

I don't know whether url_collector is the right place, but the other method, --url, passes only a single URL, where no rate limiting is required.

Maybe this works with a higher concurrency than 1 as well.

@snobjorn

I tried your version of wayback_archiver, @chlorophyll-zz, but it still operates with the default concurrency of 5 and does not sleep. So it still returns a 429 after about 20 submissions.

@bartman081523
Contributor

bartman081523 commented Jan 23, 2020

@snobjorn I have now increased the sleep time to 5 seconds, which should fix your specific problem.
Yes, before that I raised the concurrency to 5 and lowered the wait time to 2, because concurrency 5 with a 2-second wait ran without 429s.

You said the requests did not wait in between; are you sure you are using my fork?

Here are the instructions to build and run my fork:

git clone https://github.com/chlorophyll-zz/wayback_archiver
cd wayback_archiver
gem build wayback_archiver.gemspec
gem install wayback_archiver-1.3.0.gem

then run with
~/.gem/ruby/2.7.0/bin/wayback_archiver
or
~/.gem/ruby/2.6.0/bin/wayback_archiver
or simply wayback_archiver, if the Ruby user bin directory (~/.gem/ruby/2.6.0/bin/) is in your PATH and gem installed the executable there.

@snobjorn

I started over and tried exactly what you wrote, @chlorophyll-zz, but it still pushed 5 links at a time and did not wait in between.

@bartman081523
Contributor

bartman081523 commented Jan 24, 2020 via email

@buren
Owner Author

buren commented Jan 28, 2020

Seems like they've introduced rate limiting in two steps:

  • Mid 2019, 20 req/min
  • October 2019, 5 req/min

5 requests a minute is, to say the least, not great (see this wiki).

Will try to look at some mitigation options (updating the default concurrency, perhaps adding a sleep call, etc.).

Mitigation options

  • Use exponential backoff (example)
  • Limit outbound requests to the Wayback Machine to 5 req/min (queue excess requests)
  • ...
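The first mitigation option, exponential backoff, could be sketched roughly as below. The method name, the `base_delay`/`max_attempts` parameters, and the response object are illustrative assumptions, not the gem's actual API; the only thing taken from this thread is the HTTP 429 status being the retry trigger.

```ruby
# Illustrative sketch of exponential backoff on HTTP 429.
# Retries with delays of base_delay, 2*base_delay, 4*base_delay, ...
def request_with_backoff(url, base_delay: 1, max_attempts: 5)
  attempts = 0
  loop do
    attempts += 1
    response = yield(url) # the actual HTTP call, returning an object with #code
    return response unless response.code == 429
    raise "Gave up after #{max_attempts} attempts" if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1)))
  end
end
```

Combined with a queue that caps outbound requests at the observed limit, this would let a long crawl recover from occasional 429s instead of dropping URLs.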

UPDATE:

Difference between 200 and 429:

HTTP 200, OK headers:

{
  "server": "nginx/1.15.8",
  "date": "Tue, 28 Jan 2020 15:00:46 GMT",
  "content-type": "text/html;charset=utf-8",
  "transfer-encoding": "chunked",
  "connection": "close",
  "content-location": "/web/20200128150045/https://www.example.com/notsosecret/",
  "set-cookie": "JSESSIONID=3AFB1D7EE70F9ED7BB7E02BEC3AA325C; Path=/; HttpOnly",
  "x-archive-orig-link": "<https://www.example.com/wp-json/>; rel=\"https://api.w.org/\", <https://www.example.com/?p=1579>; rel=shortlink",
  "x-archive-orig-strict-transport-security": "max-age=31536000; includeSubdomains;",
  "x-archive-orig-vary": "User-Agent,Accept-Encoding",
  "x-archive-guessed-charset": "UTF-8",
  "x-archive-orig-server": "Apache",
  "x-archive-orig-connection": "close",
  "x-archive-orig-content-type": "text/html; charset=UTF-8",
  "x-archive-orig-x-powered-by": "PleskLin",
  "x-archive-orig-cache-control": "max-age=0, no-store",
  "x-archive-orig-date": "Tue, 28 Jan 2020 15:00:46 GMT",
  "x-app-server": "wwwb-app0",
  "x-ts": "200",
  "x-cache-key": "httpsweb.archive.org/save/https://www.example.com/global-medicinteknik/SE",
  "x-page-cache": "MISS",
  "x-location": "save-get"
}

HTTP 429, Too Many Requests headers:

{
  "server": "nginx/1.15.8",
  "date": "Tue, 28 Jan 2020 15:00:48 GMT",
  "content-type": "text/html",
  "content-length": "487",
  "connection": "close",
  "etag": "\"5db9ab48-1e7\""
}
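Given the header captures above, a successful save can be told apart from a throttled one by the status code, and on success the snapshot path can be read from the "content-location" header (which the 429 response lacks). A minimal sketch, assuming a plain status integer and a headers hash shaped like the dumps above:

```ruby
# Illustrative helper: returns the archived snapshot path on HTTP 200,
# or nil when the request was throttled (HTTP 429) or otherwise failed.
# Header names match the captures above; everything else is an assumption.
def archived_path(status, headers)
  return nil unless status == 200
  headers["content-location"] # e.g. "/web/20200128150045/https://www.example.com/..."
end
```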

@bartman081523
Contributor

bartman081523 commented Jan 29, 2020 via email

@delucis

delucis commented Sep 24, 2020

I had some luck using the block executed for each URL to sleep between requests:

require 'wayback_archiver'

WaybackArchiver.concurrency = 1
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
  sleep(5) # sleep 5 seconds after each request
end

@buren
Owner Author

buren commented Sep 28, 2020

🔗 Here's how another similar-ish tool handles HTTP 429 – Too Many Requests.

Wouldn't be that tricky to implement something similar.

@danshearer

> 5 requests a minute is, to say the least, not great (see this wiki).
>
> Will try to look at some mitigation options (updating default concurrency, perhaps add a sleep call etc).

5 requests a minute is probably acceptable for many sites: that's 300 URLs an hour. If someone has fewer than a few thousand URLs that do not change on a daily basis, why is this a major problem? It can run in a cron job overnight.

I have experimented with sleep(13), so as to be sure the rate stays safely under 5 requests per minute. This revealed a separate issue I will report, but wayback_archiver did get considerably further.

I put the sleep() in archive.rb's self.post, inside the pool.post do loop. I suspect other people inserting sleep() in this GitHub issue may have been adding it in a less useful place.

Dan
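The placement Dan describes, sleeping inside the worker that performs each request rather than around the whole batch, can be sketched as below. This is a simplified model, not the gem's actual archive.rb: the thread pool is reduced to a single worker thread over a queue, and the Save Page Now request is stood in for by a block.

```ruby
# Simplified sketch of sleeping inside the per-request worker loop
# (modeled with a plain worker thread; the real code uses a thread pool).
def post_urls(urls, delay: 13)
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close # pop returns nil once the queue is closed and drained

  worker = Thread.new do
    while (url = queue.pop)
      yield(url)   # the actual Save Page Now request would go here
      sleep(delay) # throttle after each request, inside the worker
    end
  end
  worker.join
end

# post_urls(urls) { |url| submit(url) }
```

Sleeping here, rather than in the URL collector, throttles every submission regardless of how the URL list was produced.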

This was referenced Jun 24, 2022
@buren buren added this to the v2.0 milestone Jun 24, 2022
@dbader13

It appears the rate limit is 15 requests/minute, with a 5-minute block for any IP address exceeding it:
https://archive.org/details/toomanyrequests_20191110

Feature request: Is there a way to add a CLI parameter for the user to set the rate (number of pages submitted per minute) ?
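The requested flag could be parsed with Ruby's standard OptionParser, converting pages-per-minute into a per-request delay. The flag name `--rate` and the default of 5/min are assumptions for illustration; this is not the gem's current option set.

```ruby
require 'optparse'

# Illustrative sketch: parse a --rate flag (pages per minute) and return
# the number of seconds to sleep between submissions.
def rate_limit_delay(argv)
  options = { rate: 5 } # assumed default: 5 submissions per minute
  OptionParser.new do |opts|
    opts.on('--rate N', Integer, 'Max pages submitted per minute') do |n|
      options[:rate] = n
    end
  end.parse!(argv)
  60.0 / options[:rate]
end

# e.g. `wayback_archiver example.com --rate 15` => sleep 4 seconds per request
```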
