Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write manifests to zarr store #45

Merged
merged 18 commits into from
May 1, 2024
Merged

Write manifests to zarr store #45

merged 18 commits into from
May 1, 2024

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Mar 22, 2024

Closes #6

This shows how we could write an xarray Dataset containing ManifestArray objects to disk as a new zarr store, where each array's chunk data is written as byte range references in the form of a manifest.json file.

This therefore creates an example of the type of zarr store described in zarr-developers/zarr-specs#287 by @jhamman.

(Currently this writes using the V2 spec, which I guess is not correct, but you get the idea.)

@TomNicholas TomNicholas added zarr-python Relevant to zarr-python upstream zarr-specs Requires adoption of a new ZEP labels Mar 22, 2024
Copy link

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 91.33858% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 90.72%. Comparing base (f226093) to head (98a259e).

❗ Current head 98a259e differs from pull request most recent head 6ba41de. Consider uploading reports for the commit 6ba41de to get more accurate results

Files Patch % Lines
virtualizarr/vendor/zarr/utils.py 58.33% 5 Missing ⚠️
virtualizarr/xarray.py 90.90% 3 Missing ⚠️
virtualizarr/zarr.py 95.16% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   90.18%   90.72%   +0.54%     
==========================================
  Files          14       16       +2     
  Lines         998     1067      +69     
==========================================
+ Hits          900      968      +68     
- Misses         98       99       +1     
Flag Coverage Δ
unittests 90.72% <91.33%> (+0.54%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool @TomNicholas -- This will be very helpful in getting a target for the zarr-python work to come.

xref: zarr-developers/zarr-python#1718

virtualizarr/zarr.py Outdated Show resolved Hide resolved
virtualizarr/zarr.py Outdated Show resolved Hide resolved
virtualizarr/zarr.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member Author

TomNicholas commented Mar 25, 2024

I realized that I could also add a reader for this type of store, which would create a ManifestArray-backed dataset from a chunk-manifest-ZEP-compliant store. You could then use VirtualiZarr to combine the chunks from multiple such stores.

@TomNicholas
Copy link
Member Author

This PR now also adds the ability to open a zarr v3 store with all arrays as manifest.json files via open_virtual_dataset(storepath, filetype="zarr_v3"). I did this partly so that then I can test via roundtripping.

@TomNicholas TomNicholas requested a review from jhamman March 31, 2024 02:45
return json.JSONEncoder.default(self, o)


def json_dumps(o: Any) -> bytes:
Copy link
Member Author

@TomNicholas TomNicholas Mar 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to vendor this because I didn't want to import internals of the zarr-python library while it's in flux, and also this helps make it clear exactly which parts of this package even need zarr-python at all.

Comment on lines +60 to +61
if filetype == "zarr_v3":
# TODO is there a neat way of auto-detecting this?
Copy link
Member Author

@TomNicholas TomNicholas Mar 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ugly - I want to automatically distinguish between non-zarr, zarr v2 (both to be read using kerchunk) and zarr v3 (to be read using this code). I guess I will just have to search for .zgroup/zarr.json files explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@norlandrhagen do you have any thoughts on a neat way to handle this?


# TODO recursive glob to create a datatree
vars = {}
for array_dir in _storepath.glob("*/"):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow this is going awry in the CI, but working as intended locally

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that when run locally (on MacOS), a pathlib.Path.glob("*/") call only returns directories (as the pathlib docs say it will), but for some reason when run in this CI the glob will include files too. I've hacked around this by excluding any paths for which .is_file() is True.

@TomNicholas
Copy link
Member Author

I'm just going to merge this as we can always change it later.

@TomNicholas TomNicholas merged commit e014bb7 into main May 1, 2024
5 checks passed
@TomNicholas TomNicholas deleted the to_zarr branch May 1, 2024 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
zarr-python Relevant to zarr-python upstream zarr-specs Requires adoption of a new ZEP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Serialization to a Zarr chunk manifest
3 participants