Icechunk stores design doc #1
Conversation
When opening this store and reading back data, we observed two important performance bottlenecks:
- Calling `group.members()` is very slow, causing `xr.open_dataset` to be slow. According to @dcherian,
  > yes this is known, `list_dir` is inefficient because it is `list_prefix` with post-filtering.

  We will be fixing this soon.
- Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.
This is effectively a summary of where IC needs work to support this use case better. The relevant IC issues are:
- `list_dir` relies on `list_prefix`, which traverses all chunks (icechunk#321)
- https://github.com/earth-mover/icechunk/issues?q=is%3Aissue+is%3Aopen+manifest+label%3A%22manifests+%3Acrystal_ball%3A%22

I'm not going to work further on this until we have made progress on those issues.
Alternatively, we could move forward with demos using smaller-scale virtual datasets.
This doc is great @rabernat thank you! We should definitely reuse it for future datasets. Looking forward to discussing next steps.
Official Name: **GPM IMERG Final Precipitation L3 Half Hourly 0.1 degree x 0.1 degree V07 (GPM_3IMERGHH) at GES DISC**

Official NASA Website: https://data.nasa.gov/dataset/GPM-IMERG-Final-Precipitation-L3-Half-Hourly-0-1-d/hqn4-tpfu/about_data
There is also https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGHHE_07/summary, which I prefer as it has more information and links to the official documentation (apologies if you knew this already). Specifically, the linked technical documentation describes the data variables and explains that the `Intermediate` group was introduced to "minimize misinterpretation of variable names and reflect changes in the algorithm".
```python
cid = ic_repo.commit(f"Appended {year}")
```

We were able to create about 10 years of data this way.
🙌🏽
> Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.

I would love to understand this better - specifically, how the mapping from chunk indices to byte ranges + file names is stored in icechunk and read by zarr. Are all chunk references stored together? Is it possible to load just the chunk references that are required for a specific query?
I see these questions are somewhat answered below - so it is my understanding now that all chunk references are stored together and that option 3 (manifest sharding) would be one solution which enables loading select chunk references.
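To make the "all references stored together" problem concrete, here is a minimal toy model (not Icechunk's actual on-disk format; all names and sizes are hypothetical) of a flat chunk manifest. Resolving a single chunk is cheap once the manifest is in memory, but the entire mapping must be downloaded and parsed first, which is why even small reads are slow and memory intensive:

```python
def build_manifest(n_time_chunks):
    """Hypothetical flat manifest: chunk coords -> (file url, offset, length).
    Every chunk reference for the whole dataset lives in one structure."""
    return {
        ("precip", t, 0, 0): (f"s3://bucket/gpm/{t:05d}.nc4", 4096, 250_000)
        for t in range(n_time_chunks)
    }


def lookup(manifest, key):
    """Resolving one chunk is a single dict hit - but only after the full
    manifest (potentially millions of entries) has been loaded."""
    return manifest[key]


# Roughly 10 years of half-hourly time steps.
manifest = build_manifest(175_000)
url, offset, length = lookup(manifest, ("precip", 12345, 0, 0))
```

The sketch shows why a query touching one chunk still pays the full manifest cost: there is no way to load a subset of the mapping.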
There are several strategies we could explore to mitigate these issues:
- **Better compression of manifest data.** Our current msgpack format does not use any compression whatsoever. Compressing the manifests will make them faster to download.
- **Concurrent downloading of manifests.** For a 3 GB manifest, splitting the download over many threads will speed it up a lot. (This optimization applies to any file in Icechunk, including chunks.)
- **Manifest sharding.** We can't allow manifests to grow without bound. The Icechunk Spec allows multiple manifests. The question is how we split them up. This question merits a design doc all of its own. But here are a couple of ideas:
I'm no Icechunk expert but this seems like the best option - let's discuss further.
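To illustrate what manifest sharding could buy us, here is a hedged sketch (a toy scheme, not the Icechunk Spec's actual mechanism; the keying and shard size are assumptions) that partitions a flat manifest by ranges of the time-chunk index. A query then loads only the one shard covering its range instead of every reference:

```python
SHARD_SIZE = 4096  # hypothetical: time-chunk indices per shard


def shard_manifest(manifest, shard_size=SHARD_SIZE):
    """Split a flat manifest {chunk coords: ref} into shards keyed by
    ranges of the time-chunk index (key[1])."""
    shards = {}
    for key, ref in manifest.items():
        shard_id = key[1] // shard_size
        shards.setdefault(shard_id, {})[key] = ref
    return shards


def lookup_sharded(shards, key, shard_size=SHARD_SIZE):
    # Only one shard - a small fraction of all references - must be loaded.
    return shards[key[1] // shard_size][key]


# Toy flat manifest: chunk coords -> (file url, offset, length).
flat = {
    ("precip", t, 0, 0): (f"s3://bucket/gpm/{t:05d}.nc4", 0, 250_000)
    for t in range(20_000)
}
shards = shard_manifest(flat)
```

Sharding along the append dimension has the nice side effect that appends only rewrite the newest shard, while historical shards stay immutable.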
Co-authored-by: Aimee Barciauskas <[email protected]>
```python
import dask.bag as db
import xarray as xr


def reduce_via_concat(dsets):
    return xr.concat(dsets, dim="time", coords="minimal", join="override")


b = db.from_sequence(all_times, partition_size=48)
all_urls = db.map(make_url, b)
vdsets = db.map(open_virtual, all_urls)
concatted = vdsets.reduction(reduce_via_concat, reduce_via_concat)
```
FYI my approach to this is to try to use lithops to parallelize the `open_virtual_dataset` call across serverless workers, then do the reduction on the client (because the vds objects themselves should be small).

See zarr-developers/VirtualiZarr#349, and I also have a notebook using this that I need to publish.
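The map-on-workers / reduce-on-client shape described above can be sketched with stdlib threads standing in for lithops' serverless executor (in the real workflow, `open_virtual_dataset` from VirtualiZarr would replace the stub below; `open_virtual_stub` and its return value are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor


def open_virtual_stub(url):
    """Stand-in for opening one file virtually on a worker: returns a small,
    reference-only summary rather than any chunk data."""
    return {"url": url, "nbytes_refs": 1_000}


def reduce_on_client(results):
    # The reduction runs locally on the client, which is cheap because the
    # per-file virtual objects are tiny compared to the data they reference.
    return sorted(results, key=lambda r: r["url"])


urls = [f"s3://bucket/gpm/{i:05d}.nc4" for i in range(100)]
with ThreadPoolExecutor(max_workers=16) as pool:
    virtual = list(pool.map(open_virtual_stub, urls))
combined = reduce_on_client(virtual)
```

With lithops, the `ThreadPoolExecutor` map would become a `FunctionExecutor` map over serverless workers, but the data flow (fan-out opens, small results, single client-side reduce) is the same.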
This document describes the technical approach and progress towards creating Icechunk stores for GPM IMERG.
Looking for feedback from @abarciauskas-bgse