Minutely diffs occasionally return errors #992

Closed
lonvia opened this issue Nov 16, 2023 · 16 comments

@lonvia

lonvia commented Nov 16, 2023

Since the move to AWS, I've now seen twice that pyosmium reports that a minutely diff is broken:

2023-11-16 10:52:29: Using project directory: /srv/nominatim.openstreetmap.org/planet-project
Traceback (most recent call last):
  File "/usr/local/bin/nominatim", line 12, in <module>
    exit(cli.nominatim(module_dir='/usr/local/lib/nominatim/module',
  File "/usr/local/lib/nominatim/lib-python/nominatim/cli.py", line 226, in nominatim
    return get_set_parser().run(**kwargs)
  File "/usr/local/lib/nominatim/lib-python/nominatim/cli.py", line 121, in run
    return args.command.run(args)
  File "/usr/local/lib/nominatim/lib-python/nominatim/clicmd/replication.py", line 199, in run
    self._update(args)
  File "/usr/local/lib/nominatim/lib-python/nominatim/clicmd/replication.py", line 154, in _update
    state = replication.update(dsn, params, socket_timeout=args.socket_timeout)
  File "/usr/local/lib/nominatim/lib-python/nominatim/tools/replication.py", line 128, in update
    endseq = repl.apply_diffs(outhandler, startseq + 1,
  File "/usr/lib/python3/dist-packages/osmium/replication/server.py", line 171, in apply_diffs
    diffs = self.collect_diffs(start_id, max_size)
  File "/usr/lib/python3/dist-packages/osmium/replication/server.py", line 137, in collect_diffs
    left_size -= rd.add_buffer(diffdata, self.diff_type)
RuntimeError: gzip error: inflate failed: incorrect header check

I presume that means the file was only partially downloaded. When the process is restarted, the download is fine.

pyosmium assumes that the global state.txt file is written only after the actual diffs are published. Is this still guaranteed with the AWS files?
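
For context, the ordering matters because pyosmium first reads the advertised sequence from state.txt and then downloads diffs up to that number. A minimal sketch of that flow, using the public ReplicationServer API (the planet URL here is only an example endpoint):

```python
from osmium.replication.server import ReplicationServer

repl = ReplicationServer("https://planet.openstreetmap.org/replication/minute")

# state.txt advertises the newest published sequence; pyosmium only fetches
# diffs up to this number, so the diffs it names must already be downloadable.
state = repl.get_state_info()
print(state.sequence, state.timestamp)
```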

@tomhughes
Member

It should be, yes - we publish the state file last, after the actual diff has been published.

@Firefishy
Member

AWS S3 object copies are atomic operations up to 5GB with read-after-write consistency.

I am currently unsure where the issue might be.

@Firefishy
Member

I will download the S3 event logs and investigate.

@Firefishy
Member

The issue was caused by AWS S3 returning a 500 InternalError.

0f474499af5eca52a75b973ff124d6c5de371cb82e51ed089ac52d08b49a14a2 osm-planet-eu-central-1 [16/Nov/2023:10:52:35 +0000] 2605:bc80:3010:700::8cd3:a764 - ZH16QHVG5WA39ZSW REST.GET.OBJECT planet/replication/minute/005/836/144.osc.gz "GET /planet/replication/minute/005/836/144.osc.gz HTTP/1.1" 500 InternalError 282 - 2090 - "-" "Nominatim (pyosmium/3.5.0)" - ftbR6xrHyxr/l7nUEAOJuhB7HC3zyR/TQVOPNupr/SOrSr+kxb5YW1xkrCqbYa3AXfUit796pD8= - ECDHE-RSA-AES128-GCM-SHA256 - osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com TLSv1.2 - -

0.00182767% of requests got 5xx errors from AWS S3 over a sample period of a few hours. AWS has some documentation: https://repost.aws/knowledge-center/http-5xx-errors-s3

It is likely I will need to open a support ticket with AWS to investigate the issue.

@tomhughes
Member

I seriously doubt you'll get anything useful from support anyway - that all reads very much like "yeah, random noise errors happen occasionally, just retry".

@Firefishy
Member

I have opened a support request with AWS.

@Firefishy
Member

"It's a best practice to build retry logic into applications that make requests to Amazon S3." - AWS tech support.

@pnorman
Collaborator

pnorman commented Nov 17, 2023

Although elevated 500 error rates are an issue, they are right that retries are a best practice.

Do we have metrics on the 5xx errors?

@Firefishy
Member

> Although elevated 500 error rates are an issue, they are right that retries are a best practice.
>
> Do we have metrics on the 5xx errors?

Yes: https://prometheus.openstreetmap.org/d/c6fba6cd-e8f0-4c3e-8a62-7eafbf92cd71/amazon-s3?orgId=1&refresh=5m&var-name=arn:aws:s3:::osm-planet-eu-central-1&viewPanel=2&from=now-30d&to=now
A few a day.

lonvia changed the title from "Minutely diffs are occasionally cut off" to "Minutely diffs occasionally return errors" on Nov 18, 2023
@lonvia
Author

lonvia commented Nov 19, 2023

I've published a new release of pyosmium that fixes the handling of non-200 responses and adds a transparent retry for 500 and 503. That should fix things for me. Feel free to close.
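
For anyone stuck on an older pyosmium in the meantime, a retry can also be added at the application level around the call that failed in the traceback above. This is a rough sketch only, using the public `ReplicationServer.apply_diffs()` API; the no-op handler, retry count and sleep interval are placeholders, and this is not how the new pyosmium release implements its transparent retry internally:

```python
import time

import osmium
from osmium.replication.server import ReplicationServer

class NullHandler(osmium.SimpleHandler):
    """Placeholder handler; a real consumer would process the changes."""

def apply_with_retry(url, start_seq, retries=3):
    repl = ReplicationServer(url)
    handler = NullHandler()
    for attempt in range(retries + 1):
        try:
            # Returns the sequence of the last diff applied, or None.
            return repl.apply_diffs(handler, start_seq, max_size=50 * 1024)
        except RuntimeError:
            # e.g. "gzip error: inflate failed" after a failed download
            if attempt == retries:
                raise
            time.sleep(60)

# Sequence 5836144 corresponds to the 005/836/144.osc.gz diff from the S3 log.
apply_with_retry("https://planet.openstreetmap.org/replication/minute", 5836144)
```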

@pnorman
Collaborator

pnorman commented Nov 19, 2023

On a 1h averaging period, the highest 5xx error rate was 0.014%. Typical is 0%, and it goes up to 0.002% fairly often. AWS's SLA for S3 Intelligent-Tiering is a 99.0% success rate over the month, calculated in 5-minute intervals. We're at a 0.00017% error rate, which is well within the SLA. 99.9998% is as good as we can reasonably expect. This means a typical server fetching one state file and one diff every minute will see 1-2 errors per year (see the back-of-the-envelope check below).

S3 Standard is 99.9%, but I doubt it matters here, since we're at almost 6 nines.

Note: the query for checking the SLA is `avg_over_time(sum((aws_s3_5xx_errors_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"} / aws_s3_get_requests_sum{name=~"arn:aws:s3:::osm-planet-eu-central-1"})) by (name) [$__range:])`, run as an instant query.
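
The "1-2 errors per year" figure follows directly from the measured rate; a back-of-the-envelope check, assuming two GET requests per minute (one state file, one diff) as described above:

```python
requests_per_year = 2 * 60 * 24 * 365   # two requests per minute, about 1.05 million/year
error_rate = 0.00017 / 100              # 0.00017% expressed as a fraction
print(requests_per_year * error_rate)   # about 1.8 expected errors per year
```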

pnorman closed this as completed Nov 20, 2023
@lonvia
Author

lonvia commented Nov 27, 2023

pyosmium 3.7 is now available in Debian testing. Can we port it to our PPA?

@tomhughes
Member

Initial attempt failed as it needs a newer libosmium, so I'm backporting that now.

A bigger problem is that it needs python3-pytest-httpserver which isn't available in Ubuntu until 22.10 and I don't really want to be trying to backport that...

@lonvia
Author

lonvia commented Nov 27, 2023

This is only a test dependency. I don't suppose you would want to disable tests?

@tomhughes
Member

I thought it was part of a larger pytest package, but I realised it's actually completely separate, so I'm trying a backport from 22.10 to see if that works.

@tomhughes
Member

Backport is now complete.
