Icechunk stores design doc #1

Merged (2 commits, Jan 9, 2025)

Conversation

rabernat (Contributor):

This document describes the technical approach and progress towards creating Icechunk stores for GPM IMERG.

Looking for feedback from @abarciauskas-bgse

Comment on lines +176 to +181
When opening this store and reading back data, we observed two important performance bottlenecks:
- Calling `group.members()` is very slow, causing `xr.open_dataset` to be slow. According to @dcherian,
> yes this is known, `list_dir` is inefficient because it is `list_prefix` with post-filtering.

We will be fixing this soon.
- Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.
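For context, here is a minimal sketch of the read path being described, assuming a recent Icechunk Python API (the API at the time of this PR differed, and the bucket, prefix, and variable names below are placeholders, not the actual store):

```python
import icechunk
import xarray as xr

# Placeholder bucket/prefix; object-store credentials come from the environment.
storage = icechunk.s3_storage(bucket="example-bucket", prefix="gpm-imerg", from_env=True)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# Both steps below hit the bottlenecks described above: opening the dataset
# lists group members, and the first read pulls down the full chunk manifest.
ds = xr.open_zarr(session.store, consolidated=False)
ds["precipitation"].isel(time=0).load()  # "precipitation" is an assumed variable name
```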
rabernat (Contributor, Author):

This is effectively a summary of where IC needs work to better support this use case. The relevant IC issues are:

I'm not going to work further on this until we have made progress on those issues.

rabernat (Contributor, Author):

Alternatively, we could move forward with demos using smaller-scale virtual datasets.

abarciauskas-bgse (Collaborator) left a comment:

This doc is great @rabernat thank you! We should definitely reuse it for future datasets. Looking forward to discussing next steps.

design-docs/icechunk-stores.md (review comment; outdated, resolved)

Official Name: **GPM IMERG Final Precipitation L3 Half Hourly 0.1 degree x 0.1 degree V07 (GPM_3IMERGHH) at GES DISC**

Official NASA Website: https://data.nasa.gov/dataset/GPM-IMERG-Final-Precipitation-L3-Half-Hourly-0-1-d/hqn4-tpfu/about_data
Collaborator:

There is also https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGHHE_07/summary which I prefer as it has more information and links to the official documentation (apologies if you knew this already). Specifically, the linked technical documentation describes the data variables and that the introduction of the Intermediate group was to "minimize misinterpretation of variable names and reflect changes in the algorithm".

design-docs/icechunk-stores.md (three review comments; outdated, resolved)
cid = ic_repo.commit(f"Appended {year}")
```

We were able to create about 10 years of data this way.
Collaborator:

🙌🏽
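For reference, a hedged sketch of what the append-per-year loop quoted above might have looked like, using a VirtualiZarr-style workflow. The `make_urls_for_year` helper and the year range are hypothetical, `open_virtual_dataset(..., indexes={})` follows the VirtualiZarr 1.x signature, the `.virtualize` accessor name varies across VirtualiZarr versions, and the session-based commit shown here differs from the older `ic_repo.commit(...)` call in the snippet above:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

for year in range(2000, 2010):  # placeholder year range
    urls = make_urls_for_year(year)  # hypothetical helper returning file URLs for one year
    vdsets = [open_virtual_dataset(u, indexes={}) for u in urls]
    vds = xr.concat(vdsets, dim="time", coords="minimal", join="override")

    session = ic_repo.writable_session(branch="main")  # assumes the session-based Icechunk API
    # The very first write would omit append_dim; subsequent years append along time.
    vds.virtualize.to_icechunk(session.store, append_dim="time")
    cid = session.commit(f"Appended {year}")
```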

> yes this is known, `list_dir` is inefficient because it is `list_prefix` with post-filtering.

We will be fixing this soon.
- Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.
Collaborator:

> Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.

I would love to understand this better - specifically, how the mapping from chunk indices to byte ranges and file names is stored in icechunk and read by zarr. Are all chunk references stored together? Is it possible to load just the chunk references that are required for a specific query?

Collaborator:

I see these questions are somewhat answered below - so my understanding now is that all chunk references are stored together, and that option 3 (manifest sharding) would be one solution that enables loading only select chunk references.
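To make the question concrete, here is a purely conceptual sketch (not Icechunk's actual on-disk format) of what a single chunk reference in a manifest contains, which is why loading every reference for every array up front gets expensive:

```python
from dataclasses import dataclass

@dataclass
class VirtualChunkRef:
    """Conceptual only: one manifest entry mapping a chunk to its source bytes."""
    array_path: str     # e.g. "precipitation" (hypothetical array name)
    chunk_index: tuple  # position of the chunk in the array's chunk grid, e.g. (0, 3, 7)
    location: str       # URL of the original file holding the bytes
    offset: int         # byte offset of the chunk within that file
    length: int         # number of bytes to read

# A reader that only needs chunk (0, 3, 7) still has to download the manifest
# holding all such entries; manifest sharding would let it fetch only the shard
# covering the requested indices.
```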

design-docs/icechunk-stores.md (review comment; outdated, resolved)
There are several strategies we could explore to mitigate these issues:
- **Better compression of manifest data.** Our current msgpack format does not use any compression whatsoever. Compressing the manifests will make them faster to download.
- **Concurrent downloading of manifests.** For a 3 GB manifest, splitting the download over many threads will speed it up a lot. (This optimization applies to any file in Icechunk, including chunks.)
- **Manifest sharding.** We can't allow manifests to grow without bound. The Icechunk Spec allows multiple manifests. The question is how to split them up. This question merits a design doc all of its own. But here are a couple of ideas:
Collaborator:

I'm no Icechunk expert but this seems like the best option - let's discuss further.
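As a rough illustration of the concurrent-download idea quoted above (a generic ranged-GET sketch using fsspec, not Icechunk's implementation), splitting a single large object such as a multi-GB manifest into byte ranges and fetching them with a thread pool could look like this:

```python
import concurrent.futures
import fsspec

def fetch_concurrently(url: str, n_parts: int = 16) -> bytes:
    """Download one large object as parallel byte ranges and reassemble it."""
    fs, path = fsspec.url_to_fs(url)
    size = fs.size(path)
    step = -(-size // n_parts)  # ceiling division
    bounds = [(start, min(start + step, size)) for start in range(0, size, step)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Each task fetches one byte range via an HTTP range request.
        parts = list(pool.map(lambda b: fs.cat_file(path, start=b[0], end=b[1]), bounds))
    return b"".join(parts)
```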

Co-authored-by: Aimee Barciauskas <[email protected]>
Comment on lines +142 to +153
```python
import dask.bag as db
import xarray as xr

def reduce_via_concat(dsets):
return xr.concat(dsets, dim="time", coords="minimal", join="override")

b = db.from_sequence(all_times, partition_size=48)
all_urls = db.map(make_url, b)
vdsets = db.map(open_virtual, all_urls)
concatted = vdsets.reduction(reduce_via_concat, reduce_via_concat)
```


FYI, my approach to this is to try using lithops to parallelize the open_virtual_dataset call across serverless workers, then do the reduction on the client (because the vds objects themselves should be small).

See zarr-developers/VirtualiZarr#349, and I also have a notebook using this that I need to publish.
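For comparison with the dask.bag snippet above, here is a hedged sketch of that lithops pattern, assuming `all_urls` is the same list of file URLs as in that snippet and using the VirtualiZarr 1.x `open_virtual_dataset` signature:

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

def open_one(url):
    # Runs on a serverless worker; returns a small virtual dataset of references.
    return open_virtual_dataset(url, indexes={})

fexec = lithops.FunctionExecutor()
futures = fexec.map(open_one, all_urls)
vdsets = fexec.get_result(futures)  # virtual datasets are small enough to gather on the client
combined = xr.concat(vdsets, dim="time", coords="minimal", join="override")
```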
