-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance roadmap #104
Comments
To demonstrate my point about (2), it looks like storing a million references in-memory can be done using numpy taking up only 24MB. In [1]: import numpy as np
# Using numpy 2.0 for the variable-length string dtype
In [2]: np.__version__
Out[2]: '2.0.0rc1'
# The number of chunks in this zarr array
In [3]: N_ENTRIES = 1000000
In [4]: SHAPE = (100, 100, 100)
# Taken from Julius' example in issue #93
In [5]: url_prefix = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-'
# Notice these strings will have slightly different lengths
In [6]: paths = [f"{url_prefix}{i}.nc" for i in range(N_ENTRIES)]
# Here's where we need numpy 2
In [7]: paths_arr = np.array(paths, dtype=np.dtypes.StringDType).reshape(SHAPE)
# These are the dimensions of our chunk grid
In [8]: paths_arr.shape
Out[8]: (100, 100, 100)
In [9]: paths_arr.nbytes
Out[9]: 16000000
In [10]: offsets_arr = np.arange(N_ENTRIES, dtype=np.dtype('int32')).reshape(SHAPE)
In [11]: offsets_arr.nbytes
Out[11]: 4000000
In [12]: lengths_arr = np.repeat(100, repeats=N_ENTRIES).astype('int32').reshape(SHAPE)
In [13]: lengths_arr.nbytes
Out[14]: 4000000
In [14]: (paths_arr.nbytes + offsets_arr.nbytes + lengths_arr.nbytes) / 1e6
Out[14]: 24.0 i.e. only 24MB to store a million references (for one This is all we need for (2) I think. It would be nice to put all 3 fields into one structured array instead of 3 separate numpy arrays, but apparently that isn't possible yet (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)). But that's just an implementation detail as this would all be hidden within the 24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 2.4GB in memory - i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store. This is smaller than worker memory, implying that I don't think we need dask to perform the concatenation (so we shouldn't need dask for (3)?). Next thing we should look at is how much space these references would take up on-disk as JSON/parquet/special zarr arrays. |
See Ryan's comment (#33 (comment)) suggesting specific compressor codecs to use for storing references in zarr arrays. |
I tried this out, see earth-mover/icechunk#401 |
This write-up of trying VirtualiZarr + Icechunk on a large dataset (IMERG) is also illuminating (see earth-mover/icechunk-nasa#1 for context) tl;dr: icechunk needs to reduce the size / split up its manifests |
We want to be able to create, combine, serialize, and use manifests that point to very large numbers of files. The largest Zarr stores we already see have
O(1e6)
chunks per array, and 10's or 100's of arrays (e.g. the National Water Model dataset).To be able to virtualize this many references at once, there are multiple places that could become performance bottlenecks.
Reference generation
open_virtual_dataset
should createManifestArray
objects from URLs with a minimum memory overhead. It is unlikely that the current implementation (i.e. kerchunk'sSingleHdf5ToZarr
to find generate references for a netCDF file) we are using is as efficient as possible because it first creates a python dict rather than a more efficient in-memory representation.In-memory representation
It's crucial that our in-memory representation of the manifest be as memory-efficient as possible. What's the smallest that the in-memory representation of e.g. a single chunk manifest containing a million references could be? I think it's only ~100MB, if we use a numpy array (or 3 numpy arrays) and avoid object dtypes. We should be able to test this easily ourselves by creating a manifest with some dummy references data. See In-memory representation of chunks: array instead of a dict? #33.
Combining
If we can fit all the
ManifestArray
objects we need into memory on one worker, this part is easy. Again using numpy arrays is good becausenp.concatenate/stack
should be memory-efficient.If the combined manifest is too big to fit into a single worker's memory then we might want to use dask to create a task graph, with the root tasks generating the references, concatenation via a tree of tasks (not technically a reduction because the output is as big as the input), then writing out chunkwise to some form of storage. Overall the result would be similar to how kerchunk uses
auto_dask
.Serialization
Writing out the manifest to JSON as text would create an on-disk representation that is larger than necessary. Writing out to a compressed / chunked format such as parquet could ameliorate this. This could be kerchunk's parquet format or we could use a compressed format in the Zarr manifest storage transformer ZEP (see discussion in Manifest storage transformer zarr-specs#287)
We should be trying to work out what the performance of each of these steps is, and they are separate so we can look at them individually.
cc @sharkinsspatial @norlandrhagen @keewis
The text was updated successfully, but these errors were encountered: