Minutely diffs occasionally return errors #992
Since the move to AWS, I've now seen twice that pyosmium reports that a minutely diff is broken:
I presume that means the file was only partially downloaded. If I restart the process, the download is fine. pyosmium assumes that the global state.txt file is written only after the actual diffs are published. Is this still guaranteed with the AWS files?

Comments
It should be, yes - we publish that last, after the actual diff has been published.
AWS S3 object copies are atomic operations up to 5 GB, with read-after-write consistency. I am currently unsure where the issue might be.
I will download the S3 event logs and investigate.
The issue was caused by AWS S3 returning a 500 InternalError.
Over a few-hour sample period, 0.00182767% of requests received 5xx errors from AWS S3. AWS has some documentation: https://repost.aws/knowledge-center/http-5xx-errors-s3. I will likely need to open a support ticket with AWS to investigate.
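For reference, a rough sketch of how such a measurement can be reproduced from S3 server access logs, assuming the logs have already been synced to a local directory (the directory name is hypothetical, and quoting edge cases in the log format are glossed over):

```python
# Tally 5xx responses in Amazon S3 server access logs.
# Field positions follow the documented server access log format; after
# shlex-splitting, the bracketed time field occupies two tokens, so the
# HTTP status code ends up at token index 10.
import glob
import shlex

total = 0
errors_5xx = 0

for path in glob.glob("access-logs/*"):          # hypothetical local copy of the logs
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            try:
                parts = shlex.split(line)
            except ValueError:
                continue                          # skip lines with unbalanced quoting
            if len(parts) < 11 or not parts[10].isdigit():
                continue                          # skip malformed or unexpected lines
            total += 1
            if parts[10].startswith("5"):
                errors_5xx += 1

if total:
    print(f"{errors_5xx}/{total} requests ({100 * errors_5xx / total:.6f}%) returned 5xx")
```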
I seriously doubt you'll get anything useful from support anyway - that all reads very much like "yeah, random noise errors happen occasionally, just retry".
I have opened a support request with AWS.
"It's a best practice to build retry logic into applications that make requests to Amazon S3." - AWS tech support. |
Although elevated 500 error rates are an issue, they are right that retries are a best practice. Do we have metrics on the 5xx errors?
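For what it's worth, the retry pattern AWS describes is only a few lines in Python. A minimal sketch using requests/urllib3 (the replication URL is just an illustrative endpoint; `allowed_methods` needs urllib3 >= 1.26):

```python
# Retry GETs that fail with 5xx responses, with exponential backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=3,                                # up to 3 retries after the first attempt
    backoff_factor=0.5,                     # 0.5 s, 1 s, 2 s between attempts
    status_forcelist=[500, 502, 503, 504],  # retry on these status codes
    allowed_methods=["GET", "HEAD"],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.get("https://planet.openstreetmap.org/replication/minute/state.txt",
                   timeout=30)
resp.raise_for_status()
print(resp.text)
```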
I've published a new release of pyosmium that fixes the handling of non-200 responses and adds a transparent retry for 500 and 503. That should fix things for me. Feel free to close.
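For anyone following along, a sketch of what a minutely update loop looks like with that release, assuming pyosmium's osmium.replication.server.ReplicationServer API (the handler and the start sequence here are purely illustrative); the retries on 500/503 now happen inside the library:

```python
# Fetch the current replication state and apply recent minutely diffs.
import osmium
from osmium.replication.server import ReplicationServer

class CountingHandler(osmium.SimpleHandler):
    """Toy handler that just counts changed nodes."""
    def __init__(self):
        super().__init__()
        self.nodes = 0

    def node(self, n):
        self.nodes += 1

srv = ReplicationServer("https://planet.openstreetmap.org/replication/minute")
state = srv.get_state_info()              # reads the global state.txt
handler = CountingHandler()
# Illustrative: apply the last five minutely diffs up to the current sequence.
last = srv.apply_diffs(handler, state.sequence - 5)
print(f"applied diffs up to sequence {last}; {handler.nodes} nodes changed")
```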
Over a 1 h averaging period, the highest 5xx error rate was 0.014%. The typical rate is 0%, and it rises to 0.002% fairly often. AWS's SLA for S3 Intelligent-Tiering is a 99.0% success rate over the month, calculated in 5-minute intervals. We're at a 0.00017% error rate, which is well within the SLA; a 99.9998% success rate is as good as we can reasonably expect. This means a typical server fetching one state file and one diff every minute will see 1-2 errors per year. S3 Standard's SLA is 99.9%, but I doubt it matters here, since we're at almost six nines.

Note: the query for checking the SLA is
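As a back-of-the-envelope check of the "1-2 errors per year" figure, using the 0.00017% rate quoted above:

```python
# One state file plus one diff per minute, at a 0.00017% 5xx rate.
requests_per_year = 2 * 60 * 24 * 365   # = 1,051,200 requests
error_rate = 0.00017 / 100              # 0.00017% as a fraction
print(requests_per_year * error_rate)   # ~1.8 expected errors per year
```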
pyosmium 3.7 is now available in Debian testing. Can we port it to our ppa? |
The initial attempt failed, as it needs a newer libosmium, so I'm backporting that now. A bigger problem is that it needs
This is only a test dependency. I don't suppose you would want to disable tests? |
I thought it was part of a larger pytest package, but I realised it's actually completely separate, so I'm trying a backport from 22.10 to see if that works.
Backport is now complete. |