
Support for non-zip archive Stores? #209

Open
mike-lawrence opened this issue Feb 18, 2023 · 7 comments

Comments

@mike-lawrence

When data is initially collected as a DirectoryStore and then compressed using `7z a -tzip ...` as suggested in the docs, the resulting zip file is larger (~4x) than the original .zarr directory, and substantially larger (~40x) than if compressed without the `-tzip` flag (presumably thanks to zip's well-known inefficiency with large numbers of files?).

Is it fundamentally not possible to support non-zip formatted archives (like 7zip's native format, or xz, or ..., etc)?

@jbms
Contributor

jbms commented Feb 19, 2023

Regarding the zarr spec: The zarr v2 spec does not mention stores at all --- and in practice the supported stores vary greatly between implementations.

In zarr v3 there may be some mention of stores but that does not preclude an implementation from supporting additional ones.

I believe you can already use 7z archives with zarr-python via fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.libarchive.LibArchiveFileSystem

However, I have not actually tried that myself.
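
For reference, here is a minimal sketch of what that could look like (untested, as noted above; it assumes a hypothetical archive named data.7z containing a zarr hierarchy at its root, and requires the libarchive-c bindings that fsspec's libarchive implementation depends on):

```python
import fsspec
import zarr

# Open the 7z archive through fsspec's libarchive filesystem.
# This implementation is read-only.
fs = fsspec.filesystem("libarchive", fo="data.7z")

# Expose the archive contents as a key/value mapping that zarr
# can treat as a store, then open whatever lives at the root.
store = fs.get_mapper("")
z = zarr.open(store, mode="r")
print(z.info)
```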

Regarding the size increase, I'm rather surprised that the size increases significantly --- I would expect only a minimal size increase, since as far as I am aware, the per-file metadata in a zip file does not take up much space. Only if your chunks are extremely small would I expect it to have a significant impact.

In general, for choosing an archive format, since the chunks can already be compressed by zarr, I would not expect the compression options the archive format supports to matter much --- you can just use no compression. I would expect the compression provided by the archive to be particularly useful only if you are storing a lot of JSON metadata rather than chunk data.
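
As a concrete illustration with zarr-python's built-in ZipStore (a minimal sketch; the array shape, chunking, and filename are placeholders): because each chunk is already Blosc-compressed by zarr, the zip container itself can use ZIP_STORED, i.e. no compression, without giving anything up.

```python
import zipfile

import numpy as np
import zarr

# zarr compresses each chunk itself (Blosc by default), so the zip
# container can store members uncompressed (ZIP_STORED) at no cost.
store = zarr.ZipStore("data.zip", mode="w", compression=zipfile.ZIP_STORED)
z = zarr.array(np.random.standard_normal((2_000, 2_000)),
               chunks=(500, 500), store=store)
store.close()  # ZipStore must be closed to flush the central directory
```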

The main requirement for any archive format is the ability to read individual files efficiently. For example, tar is a poor choice because it only supports sequential access.

@mike-lawrence
Author

mike-lawrence commented Feb 20, 2023 via email

@rabernat
Contributor

@mike-lawrence - this is exactly one of the use cases that sharding (#134, #152) is designed to address.

@jbms
Contributor

jbms commented Feb 21, 2023

I took a look at your zip file --- the issue is that your chunks are way too small for efficient access or storage. Some of your chunks contain just a single 8-byte value. Zarr compresses each chunk individually, and no compression is possible for only 8 bytes. Blosc adds a 16 byte header, such that each chunk in that case is a 24 byte file (already tripling the size). But that ignores the per-file overhead required by the filesystem or archive. On most filesystems, files always consume a multiple of the block size, typically 4KB. So when using a local filesystem each of your 8 bytes of data is actually consuming 4KB. In a zip archive the file size won't be padded but there is still per-file overhead to store the filename, etc.
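
The compression-side overhead is easy to see directly in zarr-python (a sketch using the default in-memory store, so it shows only the Blosc overhead, not filesystem block padding; exact stored sizes will vary slightly across zarr/numcodecs versions):

```python
import numpy as np
import zarr

data = np.arange(10_000, dtype="f8")  # 80,000 bytes of raw data

# One 8-byte value per chunk: each chunk ends up ~24 bytes
# (16-byte Blosc header + 8 incompressible payload bytes).
tiny = zarr.array(data, chunks=1)
print(tiny.nbytes, tiny.nbytes_stored)

# One chunk for the whole array: compression can actually do its job.
big = zarr.array(data, chunks=10_000)
print(big.nbytes, big.nbytes_stored)
```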

Even with sharding I would still recommend a much larger chunk size, as most zarr implementations will have poor performance with such small chunks.

@jakirkham
Member

Should we move this issue to zarr-python? It doesn't seem like a spec issue.

@mike-lawrence
Author

> Should we move this issue to zarr-python? It doesn't seem like a spec issue.

Sure, the only reason I posted here is that the zarr-python issue page recommends putting feature requests here rather than there.

@mike-lawrence
Author

> Even with sharding I would still recommend a much larger chunk size, as most zarr implementations will have poor performance with such small chunks.

Ah, silly me. I'd forgotten that I'd made all the arrays store in that one-sample-per-chunk mode, when only one was intended to be stored that way (and I should experiment to check whether increasing the chunk size on that one even affects performance in my real-time use case; I can't remember if I already tried that).
