Icechunk stores design doc #1

Merged (2 commits, Jan 9, 2025)

Conversation

rabernat (Contributor):

This document describes the technical approach and progress towards creating Icechunk stores for GPM IMERG.

Looking for feedback from @abarciauskas-bgse

Comment on lines +176 to +181
When opening this store and reading back data, we observed two important performance bottlenecks:
- Calling `group.members()` is very slow, causing `xr.open_dataset` to be slow. According to @dcherian,
> yes this is known, `list_dir` is inefficient because it is `list_prefix` with post-filtering.

We will be fixing this soon.
- Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.
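For context, here is a minimal sketch of the read path being described, assuming a recent Icechunk Python API (the API at the time of this PR differed, and the bucket, prefix, and variable names below are placeholders, not the actual store):

```python
import icechunk
import xarray as xr

# Placeholder bucket/prefix; object-store credentials come from the environment.
storage = icechunk.s3_storage(bucket="example-bucket", prefix="gpm-imerg", from_env=True)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# Both steps below hit the bottlenecks described above: opening the dataset
# lists group members, and the first read pulls down the full chunk manifest.
ds = xr.open_zarr(session.store, consolidated=False)
ds["precipitation"].isel(time=0).load()  # "precipitation" is an assumed variable name
```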
rabernat (Contributor, Author):

This is effectively a summary of where IC needs work to better support this use case. The relevant IC issues are:

I'm not going to work further on this until we have made progress on those issues.

rabernat (Contributor, Author):

Alternatively, we could move forward with demos using smaller-scale virtual datasets.

abarciauskas-bgse (Collaborator) left a comment:

This doc is great @rabernat thank you! We should definitely reuse it for future datasets. Looking forward to discussing next steps.

design-docs/icechunk-stores.md (review comment; outdated, resolved)

Official Name: **GPM IMERG Final Precipitation L3 Half Hourly 0.1 degree x 0.1 degree V07 (GPM_3IMERGHH) at GES DISC**

Official NASA Website: https://data.nasa.gov/dataset/GPM-IMERG-Final-Precipitation-L3-Half-Hourly-0-1-d/hqn4-tpfu/about_data
Collaborator:

There is also https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGHHE_07/summary which I prefer as it has more information and links to the official documentation (apologies if you knew this already). Specifically, the linked technical documentation describes the data variables and that the introduction of the Intermediate group was to "minimize misinterpretation of variable names and reflect changes in the algorithm".

design-docs/icechunk-stores.md (three review comments; outdated, resolved)
cid = ic_repo.commit(f"Appended {year}")
```

We were able to create about 10 years of data this way.
Collaborator:

🙌🏽
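For reference, a hedged sketch of what the append-per-year loop quoted above might have looked like, using a VirtualiZarr-style workflow. The `make_urls_for_year` helper and the year range are hypothetical, `open_virtual_dataset(..., indexes={})` follows the VirtualiZarr 1.x signature, the `.virtualize` accessor name varies across VirtualiZarr versions, and the session-based commit shown here differs from the older `ic_repo.commit(...)` call in the snippet above:

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

for year in range(2000, 2010):  # placeholder year range
    urls = make_urls_for_year(year)  # hypothetical helper returning file URLs for one year
    vdsets = [open_virtual_dataset(u, indexes={}) for u in urls]
    vds = xr.concat(vdsets, dim="time", coords="minimal", join="override")

    session = ic_repo.writable_session(branch="main")  # assumes the session-based Icechunk API
    # The very first write would omit append_dim; subsequent years append along time.
    vds.virtualize.to_icechunk(session.store, append_dim="time")
    cid = session.commit(f"Appended {year}")
```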

> yes this is known, `list_dir` is inefficient because it is `list_prefix` with post-filtering.

We will be fixing this soon.
- Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.
Collaborator:

> Reading any data from arrays (even small ones) is very slow and memory intensive. This is because it requires downloading and loading the entire chunk manifest for the entire dataset.

I would love to understand this better - specifically, how the mapping from chunk indices to byte ranges and file names is stored in icechunk and read by zarr. Are all chunk references stored together? Is it possible to load just the chunk references that are required for a specific query?

Collaborator:

I see these questions are somewhat answered below - so my understanding now is that all chunk references are stored together, and that option 3 (manifest sharding) would be one solution that enables loading only select chunk references.
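To make the question concrete, here is a purely conceptual sketch (not Icechunk's actual on-disk format) of what a single chunk reference in a manifest contains, which is why loading every reference for every array up front gets expensive:

```python
from dataclasses import dataclass

@dataclass
class VirtualChunkRef:
    """Conceptual only: one manifest entry mapping a chunk to its source bytes."""
    array_path: str     # e.g. "precipitation" (hypothetical array name)
    chunk_index: tuple  # position of the chunk in the array's chunk grid, e.g. (0, 3, 7)
    location: str       # URL of the original file holding the bytes
    offset: int         # byte offset of the chunk within that file
    length: int         # number of bytes to read

# A reader that only needs chunk (0, 3, 7) still has to download the manifest
# holding all such entries; manifest sharding would let it fetch only the shard
# covering the requested indices.
```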

design-docs/icechunk-stores.md (review comment; outdated, resolved)
There are several strategies we could explore to mitigate these issues:
- **Better compression of manifest data.** Our current msgpack format does not use any compression whatsoever. Compressing the manifests will make them faster to download.
- **Concurrent downloading of manifests.** For a 3 GB manifest, splitting the download over many threads will speed it up a lot. (This optimization applies to any file in Icechunk, including chunks.)
- **Manifest sharding.** We can't allow manifests to grow without bound. The Icechunk Spec allows multiple manifests. The question is how to split them up. This question merits a design doc all of its own. But here are a couple of ideas:
Collaborator:

I'm no Icechunk expert but this seems like the best option - let's discuss further.
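As a rough illustration of the concurrent-download idea quoted above (a generic ranged-GET sketch using fsspec, not Icechunk's implementation), splitting a single large object such as a multi-GB manifest into byte ranges and fetching them with a thread pool could look like this:

```python
import concurrent.futures
import fsspec

def fetch_concurrently(url: str, n_parts: int = 16) -> bytes:
    """Download one large object as parallel byte ranges and reassemble it."""
    fs, path = fsspec.url_to_fs(url)
    size = fs.size(path)
    step = -(-size // n_parts)  # ceiling division
    bounds = [(start, min(start + step, size)) for start in range(0, size, step)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
        # Each task fetches one byte range via an HTTP range request.
        parts = list(pool.map(lambda b: fs.cat_file(path, start=b[0], end=b[1]), bounds))
    return b"".join(parts)
```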

Co-authored-by: Aimee Barciauskas <[email protected]>
Comment on lines +142 to +153
```python
import dask.bag as db
import xarray as xr

def reduce_via_concat(dsets):
return xr.concat(dsets, dim="time", coords="minimal", join="override")

b = db.from_sequence(all_times, partition_size=48)
all_urls = db.map(make_url, b)
vdsets = db.map(open_virtual, all_urls)
concatted = vdsets.reduction(reduce_via_concat, reduce_via_concat)
```


FYI, my approach to this is to try using lithops to parallelize the open_virtual_dataset call across serverless workers, then do the reduction on the client (because the vds objects themselves should be small).

See zarr-developers/VirtualiZarr#349, and I also have a notebook using this that I need to publish.
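For comparison with the dask.bag snippet above, here is a hedged sketch of that lithops pattern, assuming `all_urls` is the same list of file URLs as in that snippet and using the VirtualiZarr 1.x `open_virtual_dataset` signature:

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

def open_one(url):
    # Runs on a serverless worker; returns a small virtual dataset of references.
    return open_virtual_dataset(url, indexes={})

fexec = lithops.FunctionExecutor()
futures = fexec.map(open_one, all_urls)
vdsets = fexec.get_result(futures)  # virtual datasets are small enough to gather on the client
combined = xr.concat(vdsets, dim="time", coords="minimal", join="override")
```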
