Provide an alternative to the YamlReader/FileHandler scheme for better synergy in multi-file datasets #2605
Comments
I think this sounds good, reasonable, and not "that hard" to do given how modular we made the reader infrastructure already. Along these lines it'd be great to not need the YAML configuration at all, but maybe your request isn't the right time for that.

Other thoughts/ideas this makes me think of: I've always hated that to create a reader you need to call two or three classmethods on the reader class. I think it's: match files, create file handlers, create the reader instance, then finally get_dataset or whatever it is called (see the sketch below). ...and I forgot the other idea I had as I was typing the above.

I think our YAML-based and very modular per-file readers are just a nice "mold" that technically fits every situation we've run into, in the sense that it says "you have files, break the problem down into these parts, go". But as you've pointed out, that doesn't make it the best design. We're at the point where we need faster, smarter readers, and fitting that into this mold (YAML-based per-file file handlers) requires some really ugly coding.

👍 from me on experimenting with this.
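For illustration, here is roughly what that multi-step dance looks like when building a reader by hand. The method names follow satpy's FileYAMLReader as I recall them; treat the exact import paths and call signatures as assumptions rather than verified API:

```python
# Sketch of manually constructing a YAML-based reader today.
# Names follow satpy's FileYAMLReader from memory; exact import
# paths and call signatures are assumptions, not verified API.
from satpy._config import config_search_paths
from satpy.readers.yaml_reader import FileYAMLReader

filenames = ["chunk_0001.nc", "chunk_0002.nc"]  # placeholder file names

config_files = config_search_paths("readers/fci_l1c_nc.yaml")
reader = FileYAMLReader(config_files)                    # create the reader instance
matched = reader.select_files_from_pathnames(filenames)  # match files
reader.create_filehandlers(matched)                      # create file handlers
datasets = reader.load(["ir_105"])                       # finally, load data
```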
I've been trying to speed up remote FCI reading by doing some caching in …
I was thinking about FCI recently and thought it might be easy to move file handler creation in the existing YAML reader to use dask delayed objects or maybe futures. There is no reason I can think of that file reading can't be done in parallel, and for S3 access this would be really helpful.
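A minimal sketch of that idea, using a stand-in handler class; the only real API here is dask.delayed/dask.compute:

```python
import dask

class StubFileHandler:
    """Stand-in for a per-file handler; construction is the slow, I/O-bound part."""

    def __init__(self, filename):
        self.filename = filename  # a real handler would open/parse the file here

def make_handler(filename):
    return StubFileHandler(filename)

filenames = [f"chunk_{i:04d}.nc" for i in range(4)]  # placeholder names
delayed_handlers = [dask.delayed(make_handler)(fn) for fn in filenames]
# One compute call materializes every handler, letting dask run the
# constructors concurrently instead of one after another.
handlers = dask.compute(*delayed_handlers)
```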
Forgot to link my more recent test with parallel file handler creation for FCI, so for future reference: https://pytroll.slack.com/archives/C011NR3LE20/p1699870659317759. There's also a link there to the branch I used. So the parallelization works, but doesn't give any benefit in Scene creation time for local files.
I wonder if that would work better in a distributed/multiprocess environment (the dask distributed scheduler), where the current threaded environment is limited by the GIL... but that would also suggest that we aren't I/O bound, or that something is doing I/O while holding the GIL. Oh, or the NetCDF library, I suppose, and its internal locking.
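One way to probe that hypothesis would be to time the same delayed graph under the threaded and process-based schedulers. The scheduler keyword of dask.compute is real API; slow_open is a stand-in for the actual file opening:

```python
import time
import dask

@dask.delayed
def slow_open(filename):
    # Stand-in for opening/parsing a file; swap in real (GIL-holding)
    # work to see a meaningful difference between the two schedulers.
    time.sleep(0.1)
    return filename

tasks = [slow_open(f"chunk_{i:04d}.nc") for i in range(16)]
for scheduler in ("threads", "processes"):
    start = time.perf_counter()
    dask.compute(*tasks, scheduler=scheduler)
    print(f"{scheduler}: {time.perf_counter() - start:.2f}s")
```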
Feature Request
Is your feature request related to a problem? Please describe.
The current way to load data in Satpy is inherently per-file. While this works well in most cases, it becomes difficult and hacky when a dataset is spread across multiple files that depend on each other, or when the files share the same base metadata that we would rather not load multiple times.
Describe the solution you'd like
I would like to have an alternative to the YamlReader class that can take in multiple files at once. To compare with the current architecture, this could mean something like a MultipleFileHandler that handles multiple files together, enabling better performance or smarter DataArray and metadata extraction; see the sketch below.
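To make that concrete, here is a purely hypothetical sketch of what such an interface could look like; neither the class nor its methods exist in satpy today:

```python
import xarray as xr

class MultipleFileHandler:
    """Hypothetical handler that receives all matched files at once."""

    def __init__(self, filenames, filetype_info):
        self.filenames = filenames
        self.filetype_info = filetype_info
        # Shared metadata could be read once here, instead of once per file.

    def get_dataset(self, dataset_id, ds_info) -> xr.DataArray:
        # Free to combine or concatenate across files in whatever way
        # is fastest for this format.
        raise NotImplementedError
```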
Examples of readers that would benefit from this architecture: readers such as FCI L1c, where a single repeat cycle is split across many chunk files that share metadata.
Describe any changes to existing user workflow
The user workflow would not change. However, developers would then have two ways of creating a reader, which could be confusing if we don't document the difference clearly.
Additional context
One added benefit would be that this reader architecture could natively return an xarray DataTree.
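For example, a handler that holds all the files could assemble its per-file groups into a single tree in one step. A minimal sketch, assuming xarray >= 2024.10 where DataTree ships with xarray:

```python
import numpy as np
import xarray as xr

# Stand-in per-file datasets; a real handler would build these from files.
per_file = [
    xr.Dataset({"radiance": ("y", np.zeros(4))}),
    xr.Dataset({"radiance": ("y", np.ones(4))}),
]
tree = xr.DataTree.from_dict(
    {f"/chunk_{i}": ds for i, ds in enumerate(per_file)}
)
print(tree)
```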