OSError: {'We encountered an internal error. Please try again.'} during finalize_target #406
Comments
CC: @rabernat
I just met with an engineer at Google Cloud. Apparently, bulk deletion of a large number of files can cause problems in GCS! To perform this operation correctly, retry logic (with exponential backoff) is needed to handle these kinds of failures. See these docs for example. I'm going to try an experiment where I separate the delete step from the write step.
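For context, here is a minimal sketch of the kind of retry-with-backoff wrapper described above. The bucket path, function name, and attempt counts are placeholders, not actual Pangeo-Forge code:

```python
# Illustrative only: retry a bulk GCS delete with exponential backoff, as the
# GCS docs recommend for transient internal errors. Paths are placeholders.
import time

import gcsfs

fs = gcsfs.GCSFileSystem()

def delete_with_backoff(path, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            fs.rm(path, recursive=True)  # bulk delete of everything under `path`
            return
        except OSError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

delete_with_backoff("gs://my-bucket/tmp-target.zarr")
```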
I'm happy to implement this in Pangeo-Forge for now, but I see that this could be a better fit for `gcsfs`.
I've encountered a case where large bulk deletions throw an `OSError`. The GCS documentation points out that this is a general problem (https://cloud.google.com/storage/docs/deleting-objects#delete-objects-in-bulk). I did not expect, however, that this would throw an `OSError` instead of a 500 error. In any case, this PR extends `gcsfs`'s retry logic to include internal errors that throw `OSError`s. My hope is that this fixes an associated issue in Pangeo Forge Recipes (pangeo-forge/pangeo-forge-recipes#406).
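This is not the actual gcsfs patch, but a sketch of the kind of retry-predicate change being described: treating an `OSError` whose message mentions an internal error as retriable.

```python
# Hypothetical predicate, not gcsfs's real implementation: decide whether an
# exception raised by a GCS call should be retried.
def is_retriable(exc: BaseException) -> bool:
    if isinstance(exc, OSError) and "internal error" in str(exc).lower():
        return True  # GCS bulk deletes can surface 500s as plain OSErrors
    return False
```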
Yes, this seems clearly like a gcsfs-level issue to me. At the Pangeo Forge level, this would be a perfect use case for zarr-developers/zarr-specs#154. With an iceberg-style data lake, you never actually have to delete anything, so the problem goes away.
I think there is a subtle bug somewhere in gcsfs. There is a separate API in GCS for bulk deletes, and gcsfs supports it. However, from the trace above, that code path isn't used: instead of deleting the directory, it deletes each file individually. @martindurant @rabernat Any idea why that may be the case? On the Zarr storage level, it looks like the directory-level delete is only used if the store implements `rmdir`. Either way, I'm happy to try to fix this issue on both sides: by making bulk file removes robust (via retries) and by using the better API.
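To make the two code paths concrete, here is a sketch (not the code from the trace above; the prefix is a placeholder) of the bulk removal versus the per-object removal being compared:

```python
# Sketch of the two deletion styles; paths are placeholders.
import gcsfs

fs = gcsfs.GCSFileSystem()
prefix = "gs://my-bucket/target.zarr/time"

# Bulk path: one recursive call, so gcsfs can use its bulk-delete support.
fs.rm(prefix, recursive=True)

# Per-object path (what the trace shows): delete each object individually.
for key in fs.find(prefix):
    fs.rm_file(key)
```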
Fixing both sounds good. Your PR is probably ready to go in for gcsfs.
Ok, I know why the inefficient remove operation is chosen here. Zarr only uses the bulk `rmdir` path when the store object provides it; pangeo-forge-recipes passes an fsspec mapper (`FSMap`), which doesn't, so Zarr falls back to deleting keys one at a time. I think the simplest place to fix this is in pangeo-forge-recipes. Does that sound right to you two?
Zarr has `FSStore`, which is a thin layer over fsspec and has the `rmdir` implementation. That's what should be passed. If you pass a URL to Zarr rather than an instantiated mapper, you will get an `FSStore`.
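A sketch of the distinction Martin describes, assuming zarr-python 2.x and a placeholder bucket path:

```python
import fsspec
import zarr
from zarr.storage import FSStore

path = "gs://my-bucket/target.zarr"  # placeholder

# An FSMap is a plain mapping with no rmdir, so Zarr removes keys one by one.
mapper = fsspec.get_mapper(path)
group_via_mapper = zarr.open_group(mapper, mode="a")

# Passing the URL (or an FSStore) gives Zarr a store with rmdir, so directory
# removal can go through the filesystem's bulk delete.
group_via_store = zarr.open_group(FSStore(path), mode="a")
group_via_url = zarr.open_group(path, mode="a")
```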
That's really useful to know! Thanks, Martin! I just noticed something, though: the ...
It actually calls ...
Ah, before, I didn't understand what ...
The FSMap vs FSStore problem is really confusing. As a first step towards resolving it, I made zarr-developers/zarr-python#911, which allows you to create an FSStore from an already-instantiated filesystem. There is a potential follow-up PR which could be made to zarr-python which automatically promotes an FSMap to a proper FSStore. That feature would eliminate Alex's problem. We discussed this a bit (see zarr-developers/zarr-python#911 (review)), but I didn't have the patience to figure it out. Alex, maybe I can nerd-snipe you into giving it a shot? 🙃
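A hypothetical sketch of what that FSMap-to-FSStore promotion might look like, assuming the `fs=` argument added in zarr-developers/zarr-python#911; this is not code from either library:

```python
# Hypothetical promotion helper: rebuild an FSMap as an FSStore so that
# rmdir (and thus bulk deletion) is available to Zarr.
import fsspec.mapping
from zarr.storage import FSStore

def promote_to_fsstore(store):
    """If given an FSMap, rebuild it as an FSStore; otherwise return unchanged."""
    if isinstance(store, fsspec.mapping.FSMap):
        # Reuse the mapper's filesystem and root path (assumed attributes .fs and .root).
        return FSStore(store.root, fs=store.fs)
    return store
```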
I'm happy to be nerd-sniped, in general. However, #409 provides a short-term fix that unblocks us. Do you think this follow-up is the best use of time, given how PGF intends to change in the near term?
Hey @rabernat: Can we tag a release with this fix (...)?
Yes! Once you accept the invite to join the pangeo-forge org (already expired once; just resent 😉), you should be able to follow the very simple release procedure. I would recommend cleaning up the release notes a bit first.
Oh! Sorry, my GitHub notifications are a bit messed up. Thank you! I'll be happy to follow the process next week :)
A pretty significant error consistently occurs for datasets of a certain scale (>1 TiB) during the `finalize_target` step of the `XarrayZarrRecipe`. In the finalize step, during dimension consolidation, we try to overwrite a Zarr dimension. This triggers the underlying filesystem implementation, here `gcsfs`, to remove all the files in that operation. During the removal of the directory, we hit an internal OS error.

So far, I've investigated a few possible causes. I thought this could be a Beam-specific race condition (I tried to fix it here). However, when I ran the pure-Python `finalize_target` step on a single worker (a single VM), I was still able to reproduce the issue.

I'll add more context to this issue later today, but this is a good-enough summary of what I'm experiencing.
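For concreteness, here is a hypothetical minimal version of the failing operation (placeholder bucket, array name, and size; not the actual recipe code):

```python
# Overwriting an existing array through an fsspec mapper: per the discussion in
# the comments above, Zarr then deletes the old chunk objects one at a time
# before rewriting, and at >1 TiB scale those deletions can surface internal
# GCS errors as OSError.
import fsspec
import numpy as np
import zarr

mapper = fsspec.get_mapper("gs://my-bucket/target.zarr")  # placeholder path
group = zarr.open_group(mapper, mode="a")
# Re-creating the array with overwrite=True forces deletion of any existing chunks.
group.create_dataset("time", data=np.arange(1_000_000), overwrite=True)
```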
Full trace