Rate limiting – HTTP 429, Too Many Requests #32
Can you review my last 3 commits? I lowered concurrency to 1 and put a sleep(5) in url_collector. I don't know whether url_collector is the right place. Maybe this also works with a higher concurrency than 1. |
I tried your version of wayback_archiver, @chlorophyll-zz, but it still operates with the default concurrency of 5 and does not "sleep". So it still gives a 429 after about 20 submits. |
@snobjorn I have now increased the sleep time to 5 seconds to fix your specific problem. You said the requests did not wait in between; are you sure that you are using my fork? Here are the instructions to build and run my fork:
then run with |
I started over and tried exactly what you wrote, @chlorophyll-zz, but it still pushed 5 links at a time and did not wait in between. |
I changed the default concurrency back to 1 and increased the sleep time to 5 seconds.
When you can, could you also give me a log of the build? Thanks in advance.
|
Seems like they've introduced rate limiting in two steps.
5 requests a minute is, to say the least, not great (see this wiki). Will try to look at some mitigation options (updating the default concurrency, perhaps adding a way to throttle requests).
|
I have had a good experience with a request every 5 seconds without a 429; that was less than a week ago.
I measured, and I was also able to make 5 concurrent requests every 5 seconds.
For users with a server infrastructure, it is no big deal to set up a daemon to scrape a list of pages (see the sketch below this comment).
And for private users, Save Page Now has a similar feature called archive outlinks.
|
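Below is a minimal sketch of the daemon approach described in the previous comment. It assumes the public Save Page Now endpoint (https://web.archive.org/save/<url>) and a local urls.txt with one URL per line; neither is part of wayback_archiver itself.

```ruby
require 'net/http'
require 'uri'

# Submit each URL to Save Page Now, one request every 5 seconds.
File.readlines('urls.txt', chomp: true).each do |url|
  response = Net::HTTP.get_response(URI("https://web.archive.org/save/#{url}"))
  puts "#{response.code} #{url}"
  sleep(5) # the one-request-per-5-seconds pace reported above
end
```

Run from cron or a systemd timer, a loop like this stays at the pace the commenter reports works without 429s.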
I had some luck using the block executed for each URL to sleep between requests:

```ruby
require 'wayback_archiver'

WaybackArchiver.concurrency = 1
WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
  sleep(5) # sleep 5 seconds after each request
end
```
|
🔗 Here's how another similar-ish tool handles rate limiting. Wouldn't be that tricky to implement something similar. |
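As a rough sketch of what "something similar" could look like, here is a simple fixed-interval throttle (an assumption of one possible design, not taken from wayback_archiver or the tool linked above): remember when the last request was made and sleep off the remainder of the interval before the next one.

```ruby
# Fixed-interval throttle: allows at most one call per min_interval seconds.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval # seconds between requests
    @last_request_at = nil
  end

  # Runs the given block, sleeping first if the previous call was too recent.
  def call
    if @last_request_at
      elapsed = Time.now - @last_request_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_request_at = Time.now
    yield
  end
end

throttle = Throttle.new(5.0) # at most one request every 5 seconds
urls.each { |url| throttle.call { submit(url) } } # `urls`/`submit` are placeholders
```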
5 requests a minute is probably acceptable for many sites: that's 300 URLs an hour. If someone has fewer than a few thousand URLs which do not change on a daily basis, then why is this a major problem? It can run in a cron job overnight.

I have experimented with sleep(13), so as to be sure that the rate is certainly less than 5 per minute. This revealed a separate issue I will report, but wayback_archiver did get considerably further. I put the sleep() in archive.rb:self.post, in the pool.post do loop; a sketch of that placement follows below. I suspect other people inserting sleep() as discussed in this GitHub issue may have been adding it in a less useful place.

Dan |
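The placement Dan describes has roughly the following shape. This is a guess at the surrounding code, assuming pool is a concurrent-ruby thread pool (his mention of pool.post suggests as much), with submit standing in for the gem's actual per-URL call.

```ruby
require 'concurrent'

pool = Concurrent::FixedThreadPool.new(1) # concurrency 1
urls.each do |url|                        # `urls` is a placeholder
  pool.post do
    submit(url) # placeholder for the gem's real per-URL submission call
    sleep(13)   # inside the worker block: ~4.6 requests/minute at pool size 1
  end
end
pool.shutdown
pool.wait_for_termination
```

Note that with a pool size above 1 the sleep only throttles each worker, not the pool as a whole, which is why the concurrency-1 placement keeps the overall rate under the limit.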
It appears the rate limiting is 15 requests/minute, with a 5-minute block for any IP address exceeding this.

Feature request: Is there a way to add a CLI parameter for the user to set the rate (number of pages submitted per minute)? A sketch of how this could map onto the existing block API follows below.
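No such parameter exists today; as a sketch, the requested pages-per-minute setting could map onto the block API shown earlier in this thread (the rate value and variable names are illustrative, not an existing flag):

```ruby
require 'wayback_archiver'

rate_per_minute = 10               # the value a hypothetical CLI flag would carry
interval = 60.0 / rate_per_minute  # seconds to wait after each submission

WaybackArchiver.concurrency = 1
WaybackArchiver.archive('example.com', strategy: :auto) do |_result|
  sleep(interval) # throttles to roughly rate_per_minute submissions per minute
end
```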
The Internet Archive has started to more aggressively rate limit requests: after just a dozen or so requests (with the default concurrency setting of 5) we start receiving HTTP 429 responses. After some testing we even get rate limited with concurrency set to 1. To fix this we have to implement a way to throttle requests in order to successfully submit all URLs.
🔗 Related to #22.