Performance roadmap #104

Open
TomNicholas opened this issue May 9, 2024 · 4 comments

Comments

@TomNicholas
Member

TomNicholas commented May 9, 2024

We want to be able to create, combine, serialize, and use manifests that point to very large numbers of files. The largest Zarr stores we already see have O(1e6) chunks per array, and tens or hundreds of arrays (e.g. the National Water Model dataset).

To be able to virtualize this many references at once, there are multiple places that could become performance bottlenecks.

  1. Reference generation
    open_virtual_dataset should create ManifestArray objects from URLs with minimal memory overhead. The current implementation (i.e. kerchunk's SingleHdf5ToZarr, which generates references for a netCDF file) is unlikely to be as efficient as possible, because it first creates a Python dict rather than a more efficient in-memory representation.

  2. In-memory representation
    It's crucial that our in-memory representation of the manifest be as memory-efficient as possible. What's the smallest that the in-memory representation of e.g. a single chunk manifest containing a million references could be? I think it's only ~100MB, if we use a numpy array (or 3 numpy arrays) and avoid object dtypes. We should be able to test this easily ourselves by creating a manifest with some dummy references data. See In-memory representation of chunks: array instead of a dict? #33.

  3. Combining
    If we can fit all the ManifestArray objects we need into memory on one worker, this part is easy. Again using numpy arrays is good because np.concatenate/stack should be memory-efficient.

    If the combined manifest is too big to fit into a single worker's memory then we might want to use dask to create a task graph, with the root tasks generating the references, concatenation via a tree of tasks (not technically a reduction because the output is as big as the input), then writing out chunkwise to some form of storage. Overall the result would be similar to how kerchunk uses auto_dask.

  4. Serialization
    Writing out the manifest to JSON as text would create an on-disk representation that is larger than necessary. Writing out to a compressed / chunked format such as parquet could ameliorate this. This could be kerchunk's parquet format, or we could use a compressed format in the Zarr manifest storage transformer ZEP (see discussion in Manifest storage transformer zarr-specs#287).
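
As a rough illustration of the serialization point in (4), here is a minimal sketch comparing plain JSON text against a compressed copy of the same bytes. The manifest layout, the bucket name, and the chunk count are all made up for illustration, and gzip is only a stand-in for "some compressed format": it is not kerchunk's parquet format or anything proposed in the ZEP.

```python
import gzip
import json

# Hypothetical dummy manifest: chunk key -> (path, offset, length).
# Kept small so the sketch runs quickly; a real manifest could have ~1e6 entries.
N_CHUNKS = 10_000
manifest = {
    f"{i // 100}.{i % 100}": {
        "path": f"s3://example-bucket/data/file_{i}.nc",  # made-up URL
        "offset": 4096 + i,
        "length": 100,
    }
    for i in range(N_CHUNKS)
}

# Plain-text JSON representation
json_bytes = json.dumps(manifest).encode("utf-8")

# The same bytes after generic compression (stand-in for a compressed format)
compressed = gzip.compress(json_bytes)

print(f"JSON text: {len(json_bytes) / 1e6:.2f} MB")
print(f"gzipped:   {len(compressed) / 1e6:.2f} MB")
```

Because the paths share a long common prefix, the text compresses well, which is the intuition behind preferring a compressed / chunked format over raw JSON.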

We should be trying to work out what the performance of each of these steps is, and they are separate so we can look at them individually.

cc @sharkinsspatial @norlandrhagen @keewis

@TomNicholas
Member Author

TomNicholas commented May 9, 2024

To demonstrate my point about (2), it looks like storing a million references in-memory can be done using numpy taking up only 24MB.

In [1]: import numpy as np

# Using numpy 2.0 for the variable-length string dtype
In [2]: np.__version__
Out[2]: '2.0.0rc1'

# The number of chunks in this zarr array
In [3]: N_ENTRIES = 1000000

In [4]: SHAPE = (100, 100, 100)

# Taken from Julius' example in issue #93 
In [5]: url_prefix = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-'

# Notice these strings will have slightly different lengths
In [6]: paths = [f"{url_prefix}{i}.nc" for i in range(N_ENTRIES)]

# Here's where we need numpy 2
In [7]: paths_arr = np.array(paths, dtype=np.dtypes.StringDType).reshape(SHAPE)

# These are the dimensions of our chunk grid
In [8]: paths_arr.shape
Out[8]: (100, 100, 100)

In [9]: paths_arr.nbytes
Out[9]: 16000000

In [10]: offsets_arr = np.arange(N_ENTRIES, dtype=np.dtype('int32')).reshape(SHAPE)

In [11]: offsets_arr.nbytes
Out[11]: 4000000

In [12]: lengths_arr = np.repeat(100, repeats=N_ENTRIES).astype('int32').reshape(SHAPE)

In [13]: lengths_arr.nbytes
Out[13]: 4000000

In [14]: (paths_arr.nbytes + offsets_arr.nbytes + lengths_arr.nbytes) / 1e6
Out[14]: 24.0

i.e. only 24MB to store a million references (for one ManifestArray) in memory.
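
One thing that may be worth double-checking here: as far as I understand numpy's StringDType, nbytes only counts the fixed-size per-element entries (16 bytes each, hence 16000000 above), and strings too long to be stored inline are kept in a separate heap buffer that nbytes does not report. A rough sketch of how large that string payload would be for the same paths, using only plain Python (the variable names other than url_prefix and N_ENTRIES are mine):

```python
# Same prefix and entry count as in the session above
url_prefix = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-'
N_ENTRIES = 1000000

# Total UTF-8 bytes of the path strings themselves, ignoring any
# per-string bookkeeping overhead numpy may add on top.
payload_bytes = sum(len(f"{url_prefix}{i}.nc".encode("utf-8")) for i in range(N_ENTRIES))

# Fixed-size part reported by nbytes in the session above
reported_bytes = 16 * N_ENTRIES

print(f"string payload: {payload_bytes / 1e6:.0f} MB")
print(f"reported by nbytes: {reported_bytes / 1e6:.0f} MB")
```

If that understanding is right, the true footprint of paths_arr is the reported figure plus roughly this payload, so the figure for the paths is better treated as a lower bound. The offsets and lengths arrays are plain integers, so their nbytes are exact.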

This is all we need for (2) I think. It would be nice to put all 3 fields into one structured array instead of 3 separate numpy arrays, but apparently that isn't possible yet (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)). But that's just an implementation detail as this would all be hidden within the ManifestArray class anyway. So it looks like to complete (2) I just have to finish #39 but using 3 numpy arrays instead of one structured array.

24MB per array means that even a really big store with 100 variables, each with a million chunks, would still only take up 2.4GB in memory, i.e. your xarray "virtual" dataset would be ~2.4GB to represent the entire store. That is smaller than typical worker memory, which suggests we don't need dask to perform the concatenation (so we shouldn't need dask for (3)?).
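
A minimal sketch of what the in-memory combining in (3) could look like with three plain numpy arrays per manifest. make_dummy_manifest and the bucket URL are hypothetical, this is not VirtualiZarr's actual ManifestArray code, and it uses numpy's fixed-width unicode dtype for the paths (rather than StringDType) so that it also runs on numpy 1.x.

```python
import numpy as np

def make_dummy_manifest(n_chunks, start):
    """Hypothetical helper: (paths, offsets, lengths) arrays for one manifest."""
    paths = np.array(
        [f"s3://example-bucket/file_{start + i}.nc" for i in range(n_chunks)]
    )
    offsets = np.arange(n_chunks, dtype="int32")
    lengths = np.full(n_chunks, 100, dtype="int32")
    return paths, offsets, lengths

# Two manifests to be combined along the (single) chunk-grid axis
a = make_dummy_manifest(1000, start=0)
b = make_dummy_manifest(1000, start=1000)

# Concatenate field by field: paths with paths, offsets with offsets, etc.
combined_paths, combined_offsets, combined_lengths = (
    np.concatenate([x, y]) for x, y in zip(a, b)
)

# Back-of-envelope figure from the text: 24 MB per array, 100 variables
N_VARIABLES = 100
estimated_gb = 24e6 * N_VARIABLES / 1e9

print(combined_paths.shape, combined_offsets.shape, combined_lengths.shape)
print(f"estimated store size: {estimated_gb} GB")
```

The output of the concatenation is as large as its inputs combined, so the question is simply whether all of the arrays fit in one worker's memory at once.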

Next thing we should look at is how much space these references would take up on-disk as JSON/parquet/special zarr arrays.

@TomNicholas
Member Author

> Next thing we should look at is how much space these references would take up on-disk as JSON/parquet/special zarr arrays.

See Ryan's comment (#33 (comment)) suggesting specific compressor codecs to use for storing references in zarr arrays.

@TomNicholas
Member Author

TomNicholas commented Nov 20, 2024

> even a really big store with 100 variables, each with a million chunks

I tried this out, see earth-mover/icechunk#401

@TomNicholas
Member Author

TomNicholas commented Jan 9, 2025

This write-up of trying VirtualiZarr + Icechunk on a large dataset (IMERG) is also illuminating (see earth-mover/icechunk-nasa#1 for context).

tl;dr: icechunk needs to reduce the size / split up its manifests
