Add Sharding Support #877

jstriebel · 2021-11-18T13:35:10Z

Update 2023-11-19:

Sharding is now formalized as a codec for Zarr v3 via the Zarr Enhancement Proposal (ZEP) 2.

A new efficient implementation of sharding is being discussed as part of #1569.

TLDR: We want to add sharding to zarr. There are some PRs as starting points for further discussions, as well as issue zarr-developers/zarr-specs#127 to update the spec.

Sharding Prototype I: implementation as translating Store #876,
a PR in the clean room implementation zarrita in the spirit of the previous PR, but simplified as much as possible
Sharding Prototype alimanfoo/zarrita#40, and
another prototype where the translation logic happens purely in the Array class
Sharding Prototype II: implementation on Array #947

Please see #877 (comment) for a comparison of the implementation approaches.

Currently, zarr maps one array chunk to one storage key, e.g. one file for the DirectoryStore. It would be great to decouple the concept of chunks (e.g. one compressible unit) and storage keys, since the storage might be optimized for larger data per entry and might have an upper limit for the number of entries, such as the file block size and maximum inode number for disk-storage. This does not necessarily fit the access patterns of the data, so chunks might need to be smaller than one storage key.

For those reasons, we would like to add sharding support to zarr. One shard corresponds to one storage key, but can contain multiple chunks:

We, this is scalable minds, will provide a PR for initial sharding support in zarr-python. This work is funded by the CZI through the EOSS program.

We see the following requirements to implement sharding support:

Add shard abstraction (multiple chunks in a single file/storage object)
Chunks should continue to be compressible (blosc, lz4, zstd, …)
Chunks should be stored in a continuous byte stream within the shard file
Shard abstraction should be transparent when reading/writing
Support arbitrary array sizes (not necessarily chunk/shard aligned)
Support arbitrary number of dimensions
Possibly store chunks in Morton-order within a shard file

With this issue and an accompanying prototype PR we want to start a discussion about possible implementation approaches. Currently, we see four approaches, which have different pros and cons:

Implement shard abstraction as a storage handler wrapper
- Pro: Can combine logical chunks into single shard keys, which are mapped to chunks of an underlying storage handler. Translation of chunks to shard only needs to happen within the minimal API of the storage handler.
- Pro: Partial reads and writes to the underlying store are encapsulated in a single place.
- Con: Sharding should be configured and persisted per array rather than per store.
Implement shard abstraction as a compressor wrapper (current chunks = shards which contain subchunks).
- Pro: The current notion of chunks as a storage-key stays unchanged in the storage layer. They just correspond to what we normally call "shards" then.
- Con: Partial reads and writes per storage-key need to be passed through all intermediate layers in the array implementation.
- Con: Addressing a sub-chunk is not intended. This would need to change the concept that a chunk is usually read or written as a whole. This is already partially the case for blosc compression, but still asserts that compression handles data of the size of a chunk, even if not all data is needed. This approach would break this assertion and needs significant changes in the data retrieval of the array implementation.
Implement shard abstraction via a translation layer on the array
- Pro: Fits the conceptual level where sharding is configured and persisted per array.
- Con: Adding sharding logic to the array make the array implementation even more complex. Chunk/Shard-translation and partial reads/writes would need to be added at multiple points.
Implement blosc2 into Zarr (special case of 2)
- Pro: Sharding is already implemented in the compression.
- Con: Still, partial read and writes need to happen efficiently throughout the array implementation, see cons of approach 2.
- Con: One core feature of zarr is the decoupling of the manipulation API, storage layer and compression. This approach tightly couples a single compression with sharding, which rather concerns storage handling. It is not possible to use sharding with other compressions then.

Based on this assessment, we currently favor a combination of approach 1 and 3. This keeps the pros of implementing sharding as a wrapper to the storage handler, which is a nice abstraction level. However, sharding should be configured and persisted per array rather than with the store config, so we propose to use a ShardingStore only internally, whereas user-facing configuration happens on the array, e.g. similar to the shape.

To make further discussions more concrete and tangible, I added an initial prototype for this implementation approach: #876. This prototype still lacks a number of features which are noted in the PR itself, but contains clear paths to tackle those.

We invite all interested parties to discuss our initial proposals and assessment, as well as the initial draft PR. Based on the discussion and design we will implement the sharding support, as well as add automated tests and documentation for the new feature.

The text was updated successfully, but these errors were encountered:

d-v-b · 2021-11-18T19:39:36Z

This is great! Can you give (or direct me to) a high-level description of how the scalable minds sharded format differs from the neuroglancer precomputed format? My lay understanding is that the scalable minds format uses a cartesian grid of shards each containing morton-coded chunks, whereas the neuroglancer scheme entails morton coding the chunks first, then partitioning the morton curve-traversed-chunks into shard files. Is this accurate? What are the pros / cons of these two approaches?

cc @jbms

jbms · 2021-11-18T20:02:36Z

The neuroglancer precomputed format is rather hacky.

It is desirable to be able to choose the shard size to be any multiple of the base chunk size, but the neuroglancer precomputed format does not allow that due to the way shards are determined.
In general if your underlying storage system supports key range queries (e.g. S3, GCS, but not a regular filesystem) then it is desirable for your keys to be ordered in a way that matches access patterns. That often means morton or similar order, though it would ideally be configurable to accommodate different intended access patterns. I think in principle the actual format of the storage keys can be considered out-of-scope for the zarr spec itself, as users can define their own mapping implementation that translates the key. But in practice, as we saw with dimension_separator, it would be better for the zarr metadata itself to fully specify how the keys are formatted.

jakirkham · 2021-11-18T20:11:28Z

cc @FrancescAlted @joshmoore

jbms · 2021-11-18T20:53:48Z

cc @laramiel

shoyer · 2021-11-19T05:44:10Z

I think sharding is could make a lot of sense, but it definitely needs to be exposed in the array API. Otherwise, one might inadvertently attempt to simultaneously write different chunks in the same shard which could lead to data corruption errors.

A related concept, which I believe Neuroglancer also supports, is using hashes to group together shards into different directories:

Sharding array chunks across hashed sub-directories zarr-specs#115

Together with both of these tricks I think we could achieve good scalability even to PB scale arrays.

FrancescAlted · 2021-11-19T11:37:53Z

We see the following requirements to implement sharding support:

Add shard abstraction (multiple chunks in a single file/storage object)

Chunks should continue to be compressible (blosc, lz4, zstd, …)

Chunks should be stored in a continuous byte stream within the shard file

Shard abstraction should be transparent when reading/writing

Support arbitrary array sizes (not necessarily chunk/shard aligned)

Support arbitrary number of dimensions

Possibly store chunks in Morton-order within a shard file

Good to see that you are tackling this. It is also a shame that you don't find the adoption of blosc2/caterva appealing because IMO it currently fulfills most of the requirements above. But I understand that you have your own roadmap and it is fine if you prefer going with a different solution :-)

joshmoore · 2021-11-19T12:04:10Z

Hi @FrancescAlted. We're not nearly far enough along to say that blosc2/caterva aren't appealing! 😄 (see approach #4: "Implement blosc2 into Zarr") What we realized after chatting with you was that we really didn't know enough to be able to make any of the critical decisions. The goal here is to get a prototype (or straw man if you will) in place so that as a community we can openly discuss the trade-offs, and critically find the right abstraction level. (Is it a Store? a Backend? an Array? Some new layer or mechanism that doesn't exist yet? ... Personally I agree with @jbms that we're going to want a metadata representation for cross-language handling, etc.)

Kudos to @jstriebel for kicking this off and very glad to see people jumping in both here and on #876. Please do consider this the start of a conversation though. Happy to hear more!

~Josh

rabernat · 2021-11-19T15:21:21Z

It is also a shame that you don't find the adoption of blosc2/caterva appealing because IMO it currently fulfills most of the requirements above

This is absolutely not true Francesc. Speaking as a Zarr core dev, my preferred solution would be to implement sharding with Blosc2. (Not sure that Caterva is needed.) I have been consistent about this position for a while (see #713). I explicitly requested we include evaluation of Blosc2 / Caterva as part of this work.

FrancescAlted · 2021-11-19T16:13:51Z

This is absolutely not true Francesc. Speaking as a Zarr core dev, my preferred solution would be to implement sharding with Blosc2. (Not sure that Caterva is needed.) I have been consistent about this position for a while (see #713). I explicitly requested we include evaluation of Blosc2 / Caterva as part of this work.

Ok, happy to hear that. It is just that after having a cursory look at the proposal, it seemed to me that "we currently favor a combination of approach 1 and 3" was leaving Blosc2/Caterva a little bit out of the game. But seriously, I don't want to mess up with your decisions; feel free to take any reasonable approach you find better for your requirements.

jbms · 2021-11-19T16:22:14Z

If blosc2/caterva were used, is there a way to make it compatible with the existing numcodecs, so that for example you could use imagecodecs_jpeg2k to encode individual chunks within the shard?

As far as using caterva as the shard format, it sounds like there would then be 3 levels of chunking: shard, chunk, block. But writing would have to happen at the granularity of shards, and reading could happen at the granularity of blocks, so it is not clear to me what purpose the intermediate "chunk" serves.

Another question regarding blosc2/caterva: what sort of I/O interface does it provide for the encoded data? In general caterva would need to be able to request particular byte ranges from the underlying store ---- if you have to supply it with the entire shard then that defeats the purpose.

FrancescAlted · 2021-11-19T18:45:08Z

It has been a long week and you are asking a lot of questions, but let me try to do my best in answering (at the risk of providing some nebulous thoughts).

If blosc2/caterva were used, is there a way to make it compatible with the existing numcodecs, so that for example you could use imagecodecs_jpeg2k to encode individual chunks within the shard?

Sorry, but I don't have experience with numcodecs (in fact, I was not aware that jpeg2k was among the supported codecs). I can only say that Blosc2 supports the concept of plugins for codecs and filters; you can have a better sense of how this work at my recent talk at PyData Global (recording here) and our blog post on this topic. I suppose jpeg2k could be integrated as a plugin for Blosc2, but whether or not numcodecs would be able to retrieve existing imagecodecs_jpeg2k is beyond my knowledge.

As far as using caterva as the shard format, it sounds like there would then be 3 levels of chunking: shard, chunk, block. But writing would have to happen at the granularity of shards, and reading could happen at the granularity of blocks, so it is not clear to me what purpose the intermediate "chunk" serves.

That's a valid point. Actually, Blosc2 is already using chunks as you are planning to use shards, whereas our blocks is what you are intending to use as chunks, so currently Blosc2 only has two levels of chunking. I was thinking more along the lines that Zarr could implement shards as Blosc2 frames (see my talk for a more visual description of what a frame is). Indeed you would end with 3 partition levels, but perhaps this could be interesting in some situations (e.g. for getting good slice performance in different dimensions simultaneously even inside single shards).

Another question regarding blosc2/caterva: what sort of I/O interface does it provide for the encoded data? In general caterva would need to be able to request particular byte ranges from the underlying store ---- if you have to supply it with the entire shard then that defeats the purpose.

Blosc2 support plugins for I/O too. Admittedly, this is still a bit preliminary and not as well documented as the plugins for filters and codecs but still, there is a test that may shed some light on how to use them: https://github.com/Blosc/c-blosc2/blob/main/tests/test_udio.c. You may want to experiment with this machinery, but as we did not have resources enough for finishing this properly there might be some loose ends here (patches are welcome indeed).

jstriebel · 2021-11-23T16:40:52Z

Thanks everyone for lively discussion! I'd like to emphasize what Josh wrote, please do consider this the start of a conversation. We at scalable minds are interested in implementing sharding support in zarr, not in a specific strategy. The initial description simply is our own and surely incomplete assessment so far.

I'll try to answer the different threads about the general approach in this issue:

@d-v-b

Can you give (or direct me to) a high-level description of how the scalable minds sharded format differs from the neuroglancer precomputed format?

So far, we mostly use webknossos-wrap (wkw), which uses compressible chunks (called blocks in wkw). As you said, a (cartesian) cube of chunks is grouped into one file, where the chunks are concatenated in Morton-order. Each file starts with some metadata: General information such as the chunk- and shard-sizes (similar to .zarray), and a jumptable to the chunk-wise start-positions iff chunks are compressed. Our access patterns mostly are shard-wise writes and chunk-wise reads.

@shoyer Thanks for the hint about zarr-developers/zarr-specs#115, this definitely looks interesting. As I understand it this would be an orthogonal concept, right? The main benefit in sharding as proposed here is that it leverages the underlying grid (e.g. extending a dataset somewhere only touches the surface-shards). One could still design the hashing-function to effectively be cube-based sharding, but I think that this rather "hacks" the idea of hashing and might be more useful as a separate concept. Also the order of chunks within a shard/hash-location can be optimized based on the geometrical grid when using sharding, for hashing this again would be rather hacky. Does this make sense?

@FrancescAlted @rabernat blosc & caterva
As far as I understand blosc2, it does itself provide chunked access within a frame, which represents one file. Those chunks can be compressed and filtered. I see two possible concepts, how blosc2 could be integrated:

a As another compression, which is highly configurable. There might be optimized access for subparts of a file, similar to the current PartialReadBuffer, but this does not replace sharding itself, which is handled on another level, similar to caterva. It might also be possible to only use a blosc2-chunk instead of a frame as a zarr chunk, which seems like the correct abstraction level for a compression.
b As a major part of zarr, where the compression and filters of blosc2 are used instead of the current compression and filter mechanisms, and blosc2-chunks would serve as chunks in a shard. The current array representation and access must completely be changed.

To me one major appeal of zarr is the separation of storage, array access (e.g. slicing) and compression. Using blosc2 as a sharding strategy would directly couple array access and compression, since slicing must know about the details of blosc2 and use optimized read and write routines. This already is the case for blosc with the PartialReadBuffer and would need to go much further, which I think degrades readability and extensibility of the code. Also, it allows sharding only with blosc2 and compatible compressions & filters. Any custom compressions/filters would first need to be integrated with blosc2 before they could be sharded, which directly couples storage considerations (sharding) with the compression. Implementing sharding within zarr itself would allow that those parts stay decoupled as they are and seems a quite doable addition.

Based on those arguments, I personally would rather tend to integrate blosc2 as another compression (a). This wouldn't be part of this issue here, but definitely a useful addition.

Regarding caterva, I understand that it handles chunking (caterva blocks) and sharding (caterva chunks), but stores this information in a binary header, and is coupled to blosc2:

To me this seems rather like an alternative to zarr, but surely could also be used as another compression (which seemed to be the initial strategy in #713). As already mentioned by @jbms, both caterva chunks and blosc2 frames both allow chunk-wise reads in a sub-part of a file and serve a similar use-case. This seems doubled, but helps to separate the APIs, where caterva cares about data access and blosc2 about compression. This is similar to the question in a, if a zarr-chunk should correspond to a blosc2-frame or a blosc2-chunk.

As you can see, we definitely do consider using blosc2 & caterva for sharding. Atm it simply does not seem to be the most effective way to me. I hope those arguments make sense, please correct me where I'm wrong, and I'm very happy to hear more arguments about using blosc2 or caterva for sharding.

PS: Some discussion about our initial implementation draft is also going on in the PR #876. Before investing much more time into the PR, I'd like to reach consensus about the general approach here.

FrancescAlted · 2021-11-24T13:07:44Z

@jstriebel Yeah, as said, it is up to you to decide whether Blosc2/Caterva can bring advantages to you or not. It is just that I was thinking that this could have been a good opportunity to use Blosc2 frames as a way to interchange data among different packages with no copies. Frames are very capable beasts (63-bit capacity, metalayers, can be single-file or multi-file, can live either on-disk or in-memory), and I was kind of envisioning them as a way to easily share (multidim) data among parties.

shoyer · 2021-11-24T17:58:50Z

@shoyer Thanks for the hint about zarr-developers/zarr-specs#115, this definitely looks interesting. As I understand it this would be an orthogonal concept, right?

Correct, this is an orthogonal optimization

d70-t · 2021-12-15T15:18:45Z

I'm a bit late in the discussion, but hope I may still join in. Generally, I really like the idea of sharding, and I believe that any of the possible ways to implement it will help us a lot 👍

TLDR: Based on this proposal, we'd now get chunks and shards, which in total are 2 levels. What makes the number 2 special? And do we want to go from 1 to n in stead of 1 to 2?

One thing I am wondering about sharding and sub-chunks is what really defines at which level of granularity one would like to introduce a chunk or shard size. I think there's definitely one level which is the smallest unit to which one could feasibly apply compression (that would be the chunk in this approach). This should be as small as feasible in order to efficiently access slices of the data. Storing anything smaller would significantly increase storage size, so this would be the "atomic" size, and it's a limit given by the algorithms.

Above that (mostly shard, but maybe also array and group) there are optimal size ranges which are influenced by the storage and network technologies used. Those ranges may vary over time (and in particular within timescales in which a dataset may still be used, if we think e.g. about 10 years or more for archival). A dataset might for example live on S3 or the like during its more active phase, but it (or parts of it) might be migrated to tape archive for longer term storage. I'd expect that different shard sizes might be required for the transport between

user <-> S3 (probably 10-100 MB ?)
S3 <-> Tape Archive (probably 1 - 10 GB ?)

Of course there might be other storage (spinning disks / SSDs / NVMes / Optanes) and network (ethernet / infiniband etc...) technologies involved, so it's hard to find good numbers. Thus, I'm wondering if one could go more in the direction of a Cache oblivious algorithm which respects the presence of those boundaries but doesn't rely on specific sizes.

In order to support sharding at various size levels, it might be an option to change the naming scheme of the chunks of an entire array, based on Morton or Hilbert curves or a similar approach. For example by encoding the chunk index by a Morton curve (for the entire array):

chunk index	chunk index binary	morton binary	morton path
0.0	000.000	00 00 00	0/0/0
0.1	000.001	00 00 01	0/0/1
1.0	001.000	00 00 10	0/0/2
0.2	000.010	00 01 00	0/1/0
1.2	001.010	00 01 10	0/1/2
7.2	111.010	10 11 10	2/3/2

one could rearange the chunks in a way such that spatially consistent chunks will end up within a common "folder" (as given by the morton path). Up to now, if we think of object names, this would just be a renaming of all the chunks and wouldn't change anything performance-wise. It might improve access times for folder-based filesystems a bit due to probably better locality and caching of folder-metadata (but that's likely not a whole lot).

However, if we'd now would have a possibility of dynamically selecting the unit of transfer based on the depth of the folder structure (or the length of the object name prefix), we could arrive at a sharding system which could work across various storage and transfer technologies. E.g. I could fetch the entire 0/ from tape but store 0/0/ ... 0/3/ as individual objects on S3, which then would again be groups of four chunks each of which I could fetch individual chunks using range-requests. All of this data migration would be possible while knowing relatively little about what zarr actually is.

If the underlying storage technology requires to combine multiple chunks into a single object, the shard format as suggested in the prototype PR might be just the way to go. But other options, like generic folder-packing tools (e.g. tar for a tape archive) might work as well in some cases.

How to encode morton binary into numers / letters and where to potentially place folder-like separators would of course need do be discussed or probably be a parameter.

jstriebel · 2021-12-15T16:11:25Z

@d70-t Thanks for joining in. I'm not sure if I understand your motivation correctly, which seems to be about transport (e.g. cloud provider to user, archiving on tape, etc). This issue is mostly about storage, decoupling chunk-wise reads (and possibly writes) from the number of files.

If the underlying storage technology requires to combine multiple chunks into a single object, the shard format as suggested in the prototype PR might be just the way to go.

That's exactly what this issue is about. I think that aggregating data during transport (network requests etc) is a different topic (and maybe more related to fs-spec?)

normanrz · 2021-12-15T16:28:30Z

What makes the number 2 special?

@d70-t Your proposal sounds interesting. I think the special thing about 2 is that we have one level for the unit of reading (chunk) and one as a unit of writing (shard). On-the-fly re-sharding as part of an archival process should be easily implementable with "format 2".

I wonder if the morton code you proposed would limit the sharding to cube-shaped shards (e.g. 16-16-16) instead of cuboid-shaped shards (e.g. 16-8-1). The latter could be desirable for parallelized data conversion processes (e.g. converting a large tiff stack into zarr).

d70-t · 2021-12-15T16:32:49Z

@jstriebel thanks for getting back. I think my thought maybe better described as a little bit of reframing which probably results in making it more cross-topic. But in the end, that might result in a very similar implementation you've already done.

My main though was: why should we have exactly two levels of chunking? If the data ends up being stored in a larger storage hierarchy, we might need more levels. Thus I was wondering if we can rephrase the information you've currently stored in .zarray.shards such that it would become possible to have multiple levels of shards.

I've also seen that you mention morton ordering, but only for within the shards, but in principle, external morton ordering would also allow to pack shards within larger shards, resulting in just a bigger shards which is again morton-orderd. This lead me to the idea of implementing multi-level sharding through renaming the chunk-indices in stead of defining multiple .zarray.shards and use the prefix-length(s) as a definition for multiple levels of sharding.

If chunk indices would be morton-ordered, then one could write out all the data to chunk-level (if so desired) and repack those chunks according your indexed shard format, basically by truncating the object key to some prefix length. If it should become necessary to migrate data from one place to another, one might to repack (and change the prefix length) or might want to pack recursively.

So my comment maybe only touches the required .zarrray-metadata and not the packing itself, but maybe it's still worth a thought?

jbms · 2021-12-15T16:49:34Z

It is true that there is a benefit to be gained from locality at additional levels, both by choosing the order of chunks within a shard and by choosing the shard keys. However, the added benefit over the two levels, one for reading and one for writing, as @normanz said, is likely to be marginal. Also, this sharding proposal does not preclude addressing ordering in the future. Defining an order for chunks, and a key encoding for shards, may be a bit complicated: the desired order depends on the access pattern, which may be non-trivial to specify. In some cases Morton order may be desired, but in other cases not. I think it would take some work to formulate a general, flexible way to specify chunk order/key encoding.

d70-t · 2021-12-15T16:56:23Z

@normanrz if we'd use only plain morton codes, yes, we would only get cube shaped shards. I'd expect that in many cases, this is what one would actually want to have (the chunks themselves could be cuboid and I'm wondering about which cases might require different aspect ratios between shards and chunks). But if differently shaped shards would be desired, it would be possible to make the bit-order of the (non-) morton code a parameter. E.g. instead of doing

x3 x2 x1 . y3 y2 y1 -> y3 x3 y2 x2 y1 x1

for the binary index transformation, one could opt for

x3 x2 x1 . y3 y2 y1 -> y3 y2 x3 y1 x2 x1

or the like. In that case, one would get a (4 x 2) sharding if the last 3 bits of chunk index would be packed.

The difficulty I see in re-sharding the current approach is that it might be non-trivial to come up with good shard shapes during archiving and loading, because people archiving the data may think differently about the datasets than users of the data. If there would be a relatively simple way of changing the shard size, that would make the implementation of storage hierarchies a lot easier. But of course, as @jbms says, coming up with a proper bit-ordering on dataset creation might as well be more difficult than specifying a sharding as proposed.

jakirkham · 2021-12-17T21:22:19Z

As a small note, it might be worth looking at how Parquet solved this for comparison. Certainly our use case is a bit different from there's, but there are probably still things that can be learned.

Thanks @rjzamora for pointing this out to me earlier 😄

d70-t · 2022-02-04T14:02:41Z

I'm trying to wrap my head around what the pros and cons of the Store and Array based prototypes are and how they might interact with other topics (like checksums, #392 and content addressable storage transformer, zarr-developers/zarr-specs#82). What I figured out so far would be:

stores are meant for handling lower level things like byte-layout, networking, packing / unpacking (e.g. like zip does)
arrays are meant for translation between indices and storage keys
sharding is somewhere in between: key-translation (as in __keys_to_shard_groups__ in both prototypes) as well as grouping and packing of chunks

Within sharding, the two tasks (key-translation and packing) seem to be relatively independent and there may be reasons to swap them out independently of each other: depending on some more high-level thoughts, different groupings may be preferred, but depending on the underlying storage, different packing strategies may be preferred.

Implementing sharding in the array may have additional benefits as loads and especially stores should be scheduled such that chunks which end up in one shard would be acted on in one bulk operation, so the grouping must be known to higher level functions.
Implementing sharding in the store may have additional benefits as swapping out different byte layout mechanisms would be easier and stores would still be able to observe individual chunks. In particular if some form of checksums based on merkle trees (#392) are to be added at some point, those checksums may have to be computed on chunk level and may need to be included into whatever sharding layout will be there (see this comment for some more thoughts on that).

I'm wondering if this tension could be solved if sharding would be spread between array and store: if we keep the translation in the array and put the packing into a store (translator), we may be able to have the best of two worlds?

Let's say we have shards=(2,3), the array would translate:

chunk_key -> shard_key/part_key
0.0 -> 0.0/0.0
1.2 -> 0.0/1.2
2.1 -> 1.0/0.1
3.3 -> 1.1/1.0
...

and use the translated keys for passing chunks down to the store. An existing store would be able to operate on those keys directly. That would of course be without sharding, but may be beneficial on some filesystems (see / as separator). But it would also be possible to plug in any kind of store translator, which would just pack up everything behind the / into a shard and send this shard further down to an existing store (using only the part before the / as a key).

jstriebel · 2022-02-07T13:34:02Z

@d70-t At the moment the difference between the implementations should would only have a small effect for other methods. Atm both implementations (#876 #947) implement sharding as a translation layer between the actual store and the array. One time the implementation is separated into a "Translation Store", the other time it is moved directly into the Array class. In both scenarios one can iterate over chunks or shards, depending on the actual use-case. I tried to make the distinction a bit clearer in those charts:

Current Flow without Sharding

Sharding via a Translation Store

Sharding in the Array

Please note that in both variations, the Array class must allow to bundle chunk-requests (read/write) and either pass them on the the Translation Store or do some logic on the bundle itself. The translation store does not change any of the properties of the underlying store, except that "chunks" in the underlying store are actually "shards", which are translated back and forth when accessing them via the array. So any other logic, e.g. for check-sums, can either be implemented to work on fine-grained chunks (using chunks from the translation store) or the shards (using chunks from the underlying store).

Also, there's another prototype PR in the zarrita repo alimanfoo/zarrita#40, using the translation store strategy (leaving out the bundling atm). This might be helpful to compare, since I tried to keep the changes as simple as possible.

d70-t · 2022-02-09T16:00:05Z

Thanks @jstriebel for these awesome drawings!

So the issue with checksums is, I believe it's not possible to do partial checks. Thus, if someone would try to do a partial_getitem(some_offset) towards a checksum-verifying store, the checksum-verifying store would have to fetch the entire item in order to do the verification, which would largely defeat the purpose of sharding. Therefore, I don't think it will be possible to implement a useful checksum checking layer between Array and Store in proposal II (except if that layer does it's own sub-chunking or assumes that partial_getitem will only be called on specific size increments, both don't seem to be good choices). Implementing checksumming in front or within Array doesn't look right.

Checksumming on proposal I may work as a layer between Array and ShardTranslationStore (as it would on the current implementation between Array and Store).

The catch here might be, that it may be useful to apply sharding to checksums as well, because at some point (many chunks) it will be better to have a tree of checksums in stead of just a list. One option would be to have a separate checksum key per shard key (which should also work with proposal I). Another option would be to be able to pack the per shard checksums within the shard.

jstriebel · 2022-02-10T10:19:03Z

@d70-t Is my understanding correct to do the checksumming chunk (or shard) wise? And independently of the underlying store? That's at least my understanding of . If checksumming should happen on shard- or chunk-level could be transparently configured if there is a translation layer abstraction between the sharding, e.g.

Checksumming per chunk: Array --> Checksumming --> Sharding --> Store
Checksumming per shard: Array --> Sharding --> Checksumming --> Store

Not sure if this thought is useful. Further discussion should probably be part of zarr-developers/zarr-specs#82.

The catch here might be, that it may be useful to apply sharding to checksums as well, because at some point (many chunks) it will be better to have a tree of checksums in stead of just a list.

I guess that's important if checksumming should happen per shard. If I understand the proposal in zarr-developers/zarr-specs#82 correctly, the index into the hashed data is still stored under the original key. Using the approach Array --> Sharding --> Checksumming --> Store, those indices would be combined per shard. So under the shard-key there would be all index-jsons for the current shard, pointing to the hashed locations with the actual value.

d70-t · 2022-02-10T15:52:06Z

Yes, checksums could in theory be implemented either on shard or on chunk level and independent of the store. But the choice has practical implications (independent of concrete checksumming or content addressable implementations):

If checksums are computed per shard (Array --> Sharding --> Checksumming --> Store) this defeats the purpose of sharding (or at least the possibility of less-than-shard-size reads as the checksumming layer would read the entire shard anyways).

If checksums are computed per chunk (Array --> Checksumming --> Sharding --> Store) chunk-sized reads are possible, but there must be an interface between Array and Sharding to put Checksumming into.

somewhat optional:
If Checksumming turns out to be in between Array and Sharding, the checksumming part may benefit from knowing which chunks will end up in which shards, because that information could be used to apply the same grouping to checksum lists as it is applied to the chunks. That's why there would be a benefit of having keys like shard_key/part_key in stead of chunk_key flowing across the Checksumming layer. This would also allow to create per-shard checksum lists and put them into a key like shard_key/.zchecksum or the like.

Sure, further details should go into either #392 or zarr-developers/zarr-specs#82.

joshmoore · 2022-05-18T10:34:21Z

Although the discussion has moved for the moment to the storage transformers, there are perhaps interesting use cases in https://forum.image.sc/t/ome-zarr-chunking-questions/66794 regarding the ability to stream data data directly from sensors into a single file that might be worth considering.

jstriebel · 2022-08-22T13:48:08Z

Please note that a sharding proposal is now formalized as a Zarr Enhancement Proposal, ZEP 2. The basic sharding concept and the binary representation is the same as proposed in #876, however it is formalized as a storage transformer for Zarr v3.

Further feedback on this is very welcome! Either

on the abstract specification as comments on PR Review of the ZEP2 spec - Sharding storage transformer zarr-specs#152, or
on the Python implementation PR Sharding storage transformer for v3 #1111.

I'll leave this PR open for now since it contains ideas that are not covered by ZEP 2, but please consider commenting directly on the spec or implementation if it makes sense.

jstriebel · 2023-11-19T10:47:54Z

ZEP2 proposing sharding as a codec is approved 🎉

Thank you all for your thoughtful contributions!

An efficient implementation of it is being discussed as part of #1569. Closing this issue in favor of it.

jstriebel mentioned this issue Nov 18, 2021

Sharding Prototype I: implementation as translating Store #876

Closed

14 tasks

oeway mentioned this issue Nov 19, 2021

BioImage.IO Meeting Minutes bioimage-io/bioimage.io#28

Open

rabernat mentioned this issue Dec 16, 2021

zarr validation and consistency checking #912

Closed

This was referenced Jan 19, 2022

Sharding Prototype II: implementation on Array #947

Closed

sharding prototype scalableminds/zarrita#1

Closed

cgohlke mentioned this issue Jan 24, 2022

imagej_metadata failed with ValueError cgohlke/tifffile#111

Closed

jstriebel mentioned this issue Jan 25, 2022

Sharding Prototype alimanfoo/zarrita#40

Closed

joshmoore mentioned this issue Jan 26, 2022

Community feedback process (e.g. ZEP) zarr-developers/governance#14

Closed

d70-t mentioned this issue Feb 2, 2022

car fsspec reference file system output d70-t/ipldstore#2

Open

jstriebel mentioned this issue Feb 7, 2022

Allowing to add / Adding a sharding spec zarr-developers/zarr-specs#127

Closed

jstriebel mentioned this issue Feb 24, 2022

Add Sharding v1.0 Spec & Storage Transformers to Zarr v3.0 zarr-developers/zarr-specs#134

Merged

rabernat mentioned this issue Mar 7, 2022

Poor performance when slicing with steps #843

Closed

This was referenced Mar 15, 2022

Enable sharding for Zarr datasets scalableminds/webknossos-libs#653

Closed

Add Ome-Zarr datasets scalableminds/webknossos-libs#607

Closed

d70-t mentioned this issue Mar 30, 2022

concatenating files fsspec/kerchunk#134

Open

cgohlke mentioned this issue Apr 5, 2022

Creating a virtual zarr datacube from COG s3 objects cgohlke/tifffile#125

Closed

d70-t mentioned this issue May 16, 2022

renaming chunks ? thewtex/shardedstore#14

Open

d70-t mentioned this issue May 20, 2022

split up metadata d70-t/ipldstore#7

Open

jluethi mentioned this issue Jun 9, 2022

Optimal I/O and on-disk storage for zarr arrays (chunking scheme + Zarr store) fractal-analytics-platform/fractal-tasks-core#54

Closed

rabernat mentioned this issue Sep 6, 2022

ZFP Compression zarr-developers/numcodecs#117

Open

normanrz mentioned this issue Jul 27, 2023

ZEP0002 Review zarr-developers/zarr-specs#254

Closed

jstriebel closed this as completed Nov 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Sharding Support #877

Add Sharding Support #877

jstriebel commented Nov 18, 2021 •

edited

Loading

d-v-b commented Nov 18, 2021

jbms commented Nov 18, 2021

jakirkham commented Nov 18, 2021

jbms commented Nov 18, 2021

shoyer commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

joshmoore commented Nov 19, 2021

rabernat commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

jbms commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

jstriebel commented Nov 23, 2021

FrancescAlted commented Nov 24, 2021

shoyer commented Nov 24, 2021

d70-t commented Dec 15, 2021

jstriebel commented Dec 15, 2021

normanrz commented Dec 15, 2021

d70-t commented Dec 15, 2021

jbms commented Dec 15, 2021

d70-t commented Dec 15, 2021

jakirkham commented Dec 17, 2021

d70-t commented Feb 4, 2022

jstriebel commented Feb 7, 2022

d70-t commented Feb 9, 2022

jstriebel commented Feb 10, 2022

d70-t commented Feb 10, 2022

joshmoore commented May 18, 2022

jstriebel commented Aug 22, 2022 •

edited

Loading

jstriebel commented Nov 19, 2023 •

edited

Loading

Add Sharding Support #877

Add Sharding Support #877

Comments

jstriebel commented Nov 18, 2021 • edited Loading

d-v-b commented Nov 18, 2021

jbms commented Nov 18, 2021

jakirkham commented Nov 18, 2021

jbms commented Nov 18, 2021

shoyer commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

joshmoore commented Nov 19, 2021

rabernat commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

jbms commented Nov 19, 2021

FrancescAlted commented Nov 19, 2021

jstriebel commented Nov 23, 2021

FrancescAlted commented Nov 24, 2021

shoyer commented Nov 24, 2021

d70-t commented Dec 15, 2021

jstriebel commented Dec 15, 2021

normanrz commented Dec 15, 2021

d70-t commented Dec 15, 2021

jbms commented Dec 15, 2021

d70-t commented Dec 15, 2021

jakirkham commented Dec 17, 2021

d70-t commented Feb 4, 2022

jstriebel commented Feb 7, 2022

Current Flow without Sharding

Sharding via a Translation Store

Sharding in the Array

d70-t commented Feb 9, 2022

jstriebel commented Feb 10, 2022

d70-t commented Feb 10, 2022

joshmoore commented May 18, 2022

jstriebel commented Aug 22, 2022 • edited Loading

jstriebel commented Nov 19, 2023 • edited Loading

jstriebel commented Nov 18, 2021 •

edited

Loading

jstriebel commented Aug 22, 2022 •

edited

Loading

jstriebel commented Nov 19, 2023 •

edited

Loading