Add Sharding Support #877
Comments
This is great! Can you give (or direct me to) a high-level description of how the scalable minds sharded format differs from the neuroglancer precomputed format? My lay understanding is that the scalable minds format uses a Cartesian grid of shards, each containing Morton-coded chunks, whereas the neuroglancer scheme entails Morton-coding the chunks first and then partitioning the Morton-curve-traversed chunks into shard files. Is this accurate? What are the pros / cons of these two approaches? cc @jbms
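Not speaking for either format, but to make the comparison concrete, here is a rough sketch of the two mappings as described above (the names, shard sizes, and 3-D assumption are mine, not taken from either spec):

```python
# Illustration only (not code from either format): two ways a 3-D chunk index
# could be assigned to a shard, assuming cubic shards and indices that fit in
# 21 bits per component.

def interleave_bits(x: int, y: int, z: int, bits: int = 21) -> int:
    """Morton-encode three non-negative integers by interleaving their bits."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def grid_sharding(chunk_idx, shard_size):
    """'Cartesian grid of shards': the shard is picked by integer division of
    the chunk index; chunks are Morton-ordered only *within* the shard."""
    shard = tuple(c // shard_size for c in chunk_idx)
    local = tuple(c % shard_size for c in chunk_idx)
    return shard, interleave_bits(*local)

def partitioned_morton_sharding(chunk_idx, chunks_per_shard):
    """'Partitioned Morton curve': all chunks are Morton-ordered globally,
    then consecutive runs of the curve are grouped into shards."""
    code = interleave_bits(*chunk_idx)
    return code // chunks_per_shard, code % chunks_per_shard

print(grid_sharding((5, 3, 2), shard_size=4))                        # ((1, 0, 0), 51)
print(partitioned_morton_sharding((5, 3, 2), chunks_per_shard=64))   # (1, 51)
```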
The neuroglancer precomputed format is rather hacky.
cc @laramiel
I think sharding could make a lot of sense, but it definitely needs to be exposed in the array API. Otherwise, one might inadvertently attempt to simultaneously write different chunks in the same shard, which could lead to data corruption errors. A related concept, which I believe Neuroglancer also supports, is using hashes to group shards into different directories. Together with both of these tricks I think we could achieve good scalability even to PB-scale arrays.
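A minimal sketch of the directory-hashing idea (my own illustration; the hash choice and layout are assumptions, not an existing Zarr feature):

```python
import hashlib

def hashed_key(key: str, levels: int = 2, width: int = 2) -> str:
    """Prefix a shard/chunk key with a few hex digits of its hash, so that
    keys spread evenly over a bounded number of directories per level."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    prefix = "/".join(digest[i * width:(i + 1) * width] for i in range(levels))
    return f"{prefix}/{key}"

print(hashed_key("0.3.17"))  # e.g. "ab/cd/0.3.17" (prefix depends on the hash)
```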
Good to see that you are tackling this. It is also a shame that you don't find the adoption of blosc2/caterva appealing, because IMO it currently fulfills most of the requirements above. But I understand that you have your own roadmap, and it is fine if you prefer going with a different solution :-)
Hi @FrancescAlted. We're not nearly far enough along to say that blosc2/caterva aren't appealing! 😄 Kudos to @jstriebel for kicking this off, and very glad to see people jumping in both here and on #876. Please do consider this the start of a conversation, though. Happy to hear more! ~Josh
This is absolutely not true, Francesc. Speaking as a Zarr core dev, my preferred solution would be to implement sharding with Blosc2. (Not sure that Caterva is needed.) I have been consistent about this position for a while (see #713). I explicitly requested that we include evaluation of Blosc2 / Caterva as part of this work.
Ok, happy to hear that. It is just that, after having a cursory look at the proposal, it seemed to me that "we currently favor a combination of approach 1 and 3" was leaving Blosc2/Caterva a little bit out of the game. But seriously, I don't want to mess with your decisions; feel free to take any reasonable approach you find better for your requirements.
If blosc2/caterva were used, is there a way to make it compatible with the existing numcodecs, so that, for example, you could use imagecodecs_jpeg2k to encode individual chunks within the shard?

As far as using caterva as the shard format: it sounds like there would then be 3 levels of chunking: shard, chunk, block. But writing would have to happen at the granularity of shards, and reading could happen at the granularity of blocks, so it is not clear to me what purpose the intermediate "chunk" serves.

Another question regarding blosc2/caterva: what sort of I/O interface does it provide for the encoded data? In general, caterva would need to be able to request particular byte ranges from the underlying store -- if you have to supply it with the entire shard, then that defeats the purpose.
It has been a long week and you are asking a lot of questions, but let me try my best at answering (at the risk of providing some nebulous thoughts).
Sorry, but I don't have experience with numcodecs (in fact, I was not aware that jpeg2k was among the supported codecs). I can only say that Blosc2 supports the concept of plugins for codecs and filters; you can get a better sense of how this works from my recent talk at PyData Global (recording here) and our blog post on this topic. I suppose jpeg2k could be integrated as a plugin for Blosc2, but whether or not numcodecs would be able to reuse the existing imagecodecs_jpeg2k is beyond my knowledge.
That's a valid point. Actually, Blosc2 already uses chunks the way you are planning to use shards, whereas our blocks are what you intend to use as chunks, so currently Blosc2 only has two levels of chunking. I was thinking more along the lines that Zarr could implement shards as Blosc2 frames (see my talk for a more visual description of what a frame is). Indeed, you would end up with 3 partition levels, but perhaps this could be interesting in some situations (e.g. for getting good slice performance in different dimensions simultaneously, even inside single shards).
Blosc2 supports plugins for I/O too. Admittedly, this is still a bit preliminary and not as well documented as the plugins for filters and codecs, but there is a test that may shed some light on how to use them: https://github.com/Blosc/c-blosc2/blob/main/tests/test_udio.c. You may want to experiment with this machinery, but as we did not have enough resources to finish this properly, there might be some loose ends here (patches are welcome indeed).
Thanks everyone for the lively discussion! I'd like to emphasize what Josh wrote. I'll try to answer the different threads about the general approach in this issue:
So far, we mostly use webknossos-wrap (wkw), which uses compressible chunks (called blocks in wkw). As you said, a (Cartesian) cube of chunks is grouped into one file, where the chunks are concatenated in Morton order. Each file starts with some metadata, such as the chunk and shard sizes.

@shoyer Thanks for the hint about zarr-developers/zarr-specs#115, this definitely looks interesting. As I understand it, this would be an orthogonal concept, right? The main benefit of sharding as proposed here is that it leverages the underlying grid (e.g. extending a dataset somewhere only touches the surface shards). One could still design the hashing function to effectively be cube-based sharding, but I think that rather "hacks" the idea of hashing and might be more useful as a separate concept. Also, the order of chunks within a shard/hash-location can be optimized based on the geometrical grid when using sharding; for hashing this again would be rather hacky. Does this make sense?

@FrancescAlted @rabernat Regarding blosc & caterva:
To me, one major appeal of zarr is the separation of storage, array access (e.g. slicing) and compression. Using blosc2 as a sharding strategy would directly couple array access and compression, since slicing must know about the details of blosc2 and use optimized read and write routines. This is already the case for blosc with the PartialReadBuffer and would need to go much further, which I think degrades readability and extensibility of the code. Also, it allows sharding only with blosc2 and compatible compressions & filters. Any custom compressions/filters would first need to be integrated with blosc2 before they could be sharded, which directly couples storage considerations (sharding) with the compression. Implementing sharding within zarr itself would allow those parts to stay decoupled as they are, and seems a quite doable addition. Based on those arguments, I personally would rather tend to integrate blosc2 as another compression (a). This wouldn't be part of this issue here, but definitely a useful addition.

Regarding caterva: I understand that it handles chunking (caterva blocks) and sharding (caterva chunks), but stores this information in a binary header and is coupled to blosc2. To me this seems rather like an alternative to zarr, but surely it could also be used as another compression (which seemed to be the initial strategy in #713). As already mentioned by @jbms, caterva chunks and blosc2 frames both allow chunk-wise reads in a sub-part of a file and serve a similar use-case. This seems doubled, but helps to separate the APIs, where caterva cares about data access and blosc2 about compression. This is similar to the question in (a), whether a zarr chunk should correspond to a blosc2 frame or a blosc2 chunk.

As you can see, we definitely do consider using blosc2 & caterva for sharding. Atm it simply does not seem to be the most effective way to me. I hope those arguments make sense, please correct me where I'm wrong, and I'm very happy to hear more arguments about using blosc2 or caterva for sharding.

PS: Some discussion about our initial implementation draft is also going on in the PR #876. Before investing much more time into the PR, I'd like to reach consensus about the general approach here.
@jstriebel Yeah, as said, it is up to you to decide whether Blosc2/Caterva can bring advantages to you or not. It is just that I was thinking that this could have been a good opportunity to use Blosc2 frames as a way to interchange data among different packages with no copies. Frames are very capable beasts (63-bit capacity, metalayers, can be single-file or multi-file, can live either on-disk or in-memory), and I was kind of envisioning them as a way to easily share (multidim) data among parties.
Correct, this is an orthogonal optimization.
I'm a bit late to the discussion, but I hope I may still join in. Generally, I really like the idea of sharding, and I believe that any of the possible ways to implement it will help us a lot 👍

TLDR: Based on this proposal, we'd now get chunks and shards, which in total are 2 levels. What makes the number 2 special? And do we want to go from 1 to n instead of from 1 to 2?

One thing I am wondering about sharding and sub-chunks is what really defines at which level of granularity one would like to introduce a chunk or shard size. I think there's definitely one level which is the smallest unit to which one could feasibly apply compression (that would be the chunk). Above that (mostly depending on the storage and network technology in use), there are preferred sizes for reading, writing, and transferring data.
Of course, different storage (spinning disks / SSDs / NVMes / Optanes) and network (ethernet / infiniband etc.) technologies might be involved, so it's hard to find good numbers. Thus, I'm wondering if one could go more in the direction of a cache-oblivious algorithm, which respects the presence of those boundaries but doesn't rely on specific sizes. In order to support sharding at various size levels, it might be an option to change the naming scheme of the chunks of an entire array, based on Morton or Hilbert curves or a similar approach, for example by encoding the chunk index of the entire array with a Morton curve, as in the sketch below.
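A sketch of what such a Morton-based naming could look like (my own illustration in 2-D; the separator placement and bit counts would be parameters):

```python
def morton_key(chunk_idx, bits_per_dim, group_bits=2):
    """Interleave the bits of an n-D chunk index (most significant first) and
    insert '/' every `group_bits` bits, so that a key prefix addresses a
    spatially contiguous group of chunks."""
    ndim = len(chunk_idx)
    interleaved = "".join(
        str((chunk_idx[d] >> b) & 1)
        for b in reversed(range(bits_per_dim))
        for d in range(ndim)
    )
    parts = [interleaved[i:i + group_bits] for i in range(0, len(interleaved), group_bits)]
    return "/".join(parts)

# On a 4x4 chunk grid, the chunks (0,0), (0,1), (1,0), (1,1) share the same
# top-level "folder" because they differ only in their lowest bits:
for idx in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]:
    print(idx, "->", morton_key(idx, bits_per_dim=2))
# (0, 0) -> 00/00   (0, 1) -> 00/01   (1, 0) -> 00/10   (1, 1) -> 00/11
# (2, 2) -> 11/00
```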
This way, one could rearrange the chunks such that spatially close chunks end up within a common "folder" (as given by the Morton path). Up to now, if we think of object names, this would just be a renaming of all the chunks and wouldn't change anything performance-wise. It might improve access times for folder-based filesystems a bit due to probably better locality and caching of folder metadata (but that's likely not a whole lot). However, if we now had a possibility of dynamically selecting the unit of transfer based on the depth of the folder structure (or the length of the object-name prefix), we could arrive at a sharding system which could work across various storage and transfer technologies. E.g. I could fetch everything below a given prefix (a whole group of chunks) in one go. If the underlying storage technology requires combining multiple chunks into a single object, the shard format as suggested in the prototype PR might be just the way to go. But other options, like generic folder-packing tools (e.g. tar for a tape archive), might work as well in some cases. How to encode the Morton binary code into numbers / letters and where to potentially place folder-like separators would of course need to be discussed, or probably be a parameter.
@d70-t Thanks for joining in. I'm not sure if I understand your motivation correctly, which seems to be about transport (e.g. cloud provider to user, archiving on tape, etc). This issue is mostly about storage, decoupling chunk-wise reads (and possibly writes) from the number of files.
That's exactly what this issue is about. I think that aggregating data during transport (network requests etc.) is a different topic (and maybe more related to fsspec?)
@d70-t Your proposal sounds interesting. I think the special thing about 2 is that we have one level as the unit of reading (chunk) and one as the unit of writing (shard). On-the-fly re-sharding as part of an archival process should be easily implementable with "format 2". I wonder if the Morton code you proposed would limit the sharding to cube-shaped shards (e.g. 16-16-16) instead of cuboid-shaped shards (e.g. 16-8-1). The latter could be desirable for parallelized data conversion processes (e.g. converting a large tiff stack into zarr).
@jstriebel thanks for getting back. I think my thought is maybe better described as a little bit of reframing, which probably makes it more cross-topic. But in the end, that might result in a very similar implementation to what you've already done. My main thought was: why should we have exactly two levels of chunking? If the data ends up being stored in a larger storage hierarchy, we might need more levels. Thus I was wondering if we can rephrase the information you currently store in the sharding configuration. I've also seen that you mention Morton ordering, but only within the shards; in principle, external Morton ordering would also allow packing shards within larger shards, resulting in just a bigger shard which is again Morton-ordered. This led me to the idea of implementing multi-level sharding through renaming the chunk indices, instead of defining multiple explicit sharding levels. If chunk indices were Morton-ordered, then one could write out all the data to chunk level (if so desired) and repack those chunks according to your indexed shard format, basically by truncating the object key to some prefix length. If it should become necessary to migrate data from one place to another, one might want to repack (and change the prefix length) or might want to pack recursively. So my comment maybe only touches the required key-translation part.
It is true that there is a benefit to be gained from locality at additional levels, both by choosing the order of chunks within a shard and by choosing the shard keys. However, the added benefit over the two levels, one for reading and one for writing, as @normanrz said, is likely to be marginal. Also, this sharding proposal does not preclude addressing ordering in the future. Defining an order for chunks, and a key encoding for shards, may be a bit complicated: the desired order depends on the access pattern, which may be non-trivial to specify. In some cases Morton order may be desired, but in other cases not. I think it would take some work to formulate a general, flexible way to specify chunk order / key encoding.
@normanrz if we'd use only plain Morton codes, then yes, we would only get cube-shaped shards. I'd expect that in many cases this is what one would actually want to have (the chunks themselves could be cuboid, and I'm wondering which cases might require different aspect ratios between shards and chunks). But if differently shaped shards were desired, it would be possible to make the bit order of the (non-)Morton code a parameter: instead of interleaving the index bits of all axes uniformly in the binary index transformation, one could pack more bits of one axis at the end, or the like. In that case, one would get a (4 x 2) sharding if the last 3 bits of the chunk index were packed (see the sketch below). The difficulty I see in re-sharding the current approach is that it might be non-trivial to come up with good shard shapes during archiving and loading, because people archiving the data may think differently about the datasets than users of the data. If there were a relatively simple way of changing the shard size, that would make the implementation of storage hierarchies a lot easier. But of course, as @jbms says, coming up with a proper bit ordering on dataset creation might as well be more difficult than specifying a sharding as proposed.
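A sketch of such a bit-order parameterization in 2-D (my own notation: the pattern string lists which axis contributes each bit, most significant bits first; "xyxyxy" is plain Morton order for 3 bits per axis):

```python
def encode_index(chunk_idx, pattern, axes="xy"):
    """Build a bit-string key for a chunk index by consuming each component's
    bits (most significant first) in the order given by `pattern`."""
    counts = {a: pattern.count(a) for a in axes}
    bits = {
        a: [(value >> b) & 1 for b in reversed(range(counts[a]))]
        for a, value in zip(axes, chunk_idx)
    }
    return "".join(str(bits[a].pop(0)) for a in pattern)

# With "xyyyxx", the two lowest x-bits and the lowest y-bit come last, so all
# chunks with x in 0..3 and y in 0..1 share the prefix "000" -- truncating
# keys after 3 bits then yields a (4 x 2) sharding, as in the comment above.
for idx in [(0, 0), (3, 1), (0, 2)]:
    print(idx, encode_index(idx, "xyxyxy"), encode_index(idx, "xyyyxx"))
# (0, 0) 000000 000000
# (3, 1) 001011 000111
# (0, 2) 000100 001000
```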
As a small note, it might be worth looking at how Parquet solved this, for comparison. Certainly our use case is a bit different from theirs, but there are probably still things that can be learned. Thanks @rjzamora for pointing this out to me earlier 😄
I'm trying to wrap my head around what the pros and cons of the Store- and Array-based prototypes are and how they might interact with other topics (like checksums, #392, and the content addressable storage transformer, zarr-developers/zarr-specs#82). What I figured out so far: within sharding, the two tasks (key-translation and packing) seem to be relatively independent, and there may be reasons to swap them out independently of each other: depending on some more high-level thoughts, different groupings may be preferred, but depending on the underlying storage, different packing strategies may be preferred. Implementing sharding in the array may have additional benefits, as loads and especially stores should be scheduled such that chunks which end up in one shard are acted on in one bulk operation, so the grouping must be known to higher-level functions.

I'm wondering if this tension could be resolved if sharding were spread between array and store: if we keep the translation in the array and put the packing into a store (translator), we may be able to have the best of both worlds. Let's say we translate chunk keys into shard-prefixed keys (as in the sketch below) and use the translated keys for passing chunks down to the store. An existing store would be able to operate on those keys directly. That would of course be without sharding, but may be beneficial on some filesystems.
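A rough sketch of the key-translation half (hypothetical names, flat "i.j" chunk keys, and a fixed shard shape are assumed; this is not code from either prototype):

```python
CHUNKS_PER_SHARD = (2, 2)  # assumed configuration: 2 x 2 chunks per shard

def translate_key(chunk_key: str) -> str:
    """Map a flat chunk key like "5.3" to "shard/local", e.g. "2.1/1.1".
    A plain store can use the translated key directly (one object per chunk),
    while a packing store could bundle everything below one shard prefix
    into a single object."""
    idx = tuple(int(p) for p in chunk_key.split("."))
    shard = ".".join(str(c // s) for c, s in zip(idx, CHUNKS_PER_SHARD))
    local = ".".join(str(c % s) for c, s in zip(idx, CHUNKS_PER_SHARD))
    return f"{shard}/{local}"

print(translate_key("5.3"))  # -> "2.1/1.1"
```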
@d70-t At the moment, the difference between the implementations would only have a small effect on other methods. Atm both implementations (#876, #947) implement sharding as a translation layer between the actual store and the array. In one case the implementation is separated into a "Translation Store", in the other it is moved directly into the Array class. In both scenarios one can iterate over chunks or shards, depending on the actual use-case. I tried to make the distinction a bit clearer in three charts (not reproduced here): "Current Flow without Sharding", "Sharding via a Translation Store", and "Sharding in the Array".

Please note that in both variations, the Array class must allow bundling chunk requests (read/write) and either pass them on to the Translation Store or do some logic on the bundle itself. The translation store does not change any of the properties of the underlying store, except that "chunks" in the underlying store are actually "shards", which are translated back and forth when accessing them via the array. So any other logic, e.g. for checksums, can be implemented to work either on fine-grained chunks (using chunks from the translation store) or on the shards (using chunks from the underlying store).

Also, there's another prototype PR in the zarrita repo, alimanfoo/zarrita#40, using the translation-store strategy (leaving out the bundling atm). This might be helpful to compare, since I tried to keep the changes as simple as possible.
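For readers skimming the thread, a heavily simplified sketch of the translation-store idea (this is not the actual code in #876, #947 or the zarrita PR; a real implementation uses a binary shard format with an offset index, partial reads, and write bundling):

```python
import pickle

class NaiveShardTranslationStore:
    """The Array keeps addressing individual chunk keys; this wrapper groups
    all chunks of one shard into a single value of the underlying store
    (naively pickled as a dict here, purely for illustration)."""

    def __init__(self, inner, chunks_per_shard=(2, 2)):
        self.inner = inner              # any mutable mapping str -> bytes
        self.cps = chunks_per_shard

    def _split(self, chunk_key):
        idx = tuple(int(p) for p in chunk_key.split("."))
        shard = ".".join(str(c // s) for c, s in zip(idx, self.cps))
        local = ".".join(str(c % s) for c, s in zip(idx, self.cps))
        return shard, local

    def __getitem__(self, chunk_key):
        shard, local = self._split(chunk_key)
        return pickle.loads(self.inner[shard])[local]

    def __setitem__(self, chunk_key, value):
        # Read-modify-write of the whole shard: this is why concurrent writers
        # must not touch different chunks of the same shard without bundling.
        shard, local = self._split(chunk_key)
        payload = pickle.loads(self.inner[shard]) if shard in self.inner else {}
        payload[local] = value
        self.inner[shard] = pickle.dumps(payload)

store = {}
sharded = NaiveShardTranslationStore(store)
sharded["5.3"] = b"chunk bytes"
print(list(store))      # ['2.1']  -- one object per shard
print(sharded["5.3"])   # b'chunk bytes'
```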
Thanks @jstriebel for these awesome drawings! So the issue with checksums is, I believe, that it's not possible to do partial checks. Thus, if someone tried to do a partial read or write, the whole checksummed object would still need to be processed. Checksumming on proposal I may work as a layer between Array and ShardTranslationStore (as it would, in the current implementation, between Array and Store). The catch here might be that it may be useful to apply sharding to checksums as well, because at some point (many chunks) it will be better to have a tree of checksums instead of just a list. One option would be to have a separate checksum key per shard key (which should also work with proposal I). Another option would be to be able to pack the per-shard checksums within the shard.
@d70-t Is my understanding correct that the checksumming would be done chunk- (or shard-) wise, and independently of the underlying store? That's at least my understanding. Whether checksumming happens on shard or chunk level could then be configured transparently if there is a translation-layer abstraction for the sharding, e.g. by choosing where the checksum layer sits relative to it (see the sketch below). Not sure if this thought is useful. Further discussion should probably be part of zarr-developers/zarr-specs#82.
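A small sketch of what such a configurable layer could look like (a hypothetical class, keeping its digests in memory for simplicity): placed above a sharding translation layer it would see chunk keys (per-chunk checksums), placed below it would see shard keys (per-shard checksums).

```python
import hashlib

class ChecksumLayer:
    """Hypothetical store wrapper: records a sha256 digest for every value it
    stores and verifies it on read. Which keys it sees (chunks or shards)
    depends only on where it is placed in the layer stack."""

    def __init__(self, inner):
        self.inner = inner
        self.digests = {}   # key -> hex digest (kept separately for simplicity)

    def __setitem__(self, key, value):
        self.digests[key] = hashlib.sha256(value).hexdigest()
        self.inner[key] = value

    def __getitem__(self, key):
        value = self.inner[key]
        if hashlib.sha256(value).hexdigest() != self.digests.get(key):
            raise ValueError(f"checksum mismatch for {key!r}")
        return value
```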
I guess that's important if checksumming should happen per shard. If I understand the proposal in zarr-developers/zarr-specs#82 correctly, the index into the hashed data is still stored under the original key.
Yes, checksums could in theory be implemented either on shard or on chunk level, and independently of the store. But the choice has practical implications (independent of concrete checksumming or content-addressable implementations): if checksums are computed per shard, a single chunk can only be verified by reading and hashing the whole shard; if checksums are computed per chunk, the number of checksums grows with the number of chunks, which is where the tree of checksums mentioned above would help. Sure, further details should go into either #392 or zarr-developers/zarr-specs#82.
Although the discussion has moved for the moment to storage transformers, there are perhaps interesting use cases in https://forum.image.sc/t/ome-zarr-chunking-questions/66794 regarding the ability to stream data directly from sensors into a single file that might be worth considering.
Please note that a sharding proposal is now formalized as a Zarr Enhancement Proposal, ZEP 2. The basic sharding concept and the binary representation are the same as proposed in #876; however, it is formalized as a storage transformer for Zarr v3. Further feedback on this is very welcome!
I'll leave this PR open for now since it contains ideas that are not covered by ZEP 2, but please consider commenting directly on the spec or implementation if it makes sense.
ZEP 2, proposing sharding as a codec, is approved 🎉 Thank you all for your thoughtful contributions! An efficient implementation of it is being discussed as part of #1569. Closing this issue in favor of it.
Update 2023-11-19:
Sharding is now formalized as a codec for Zarr v3 via the Zarr Enhancement Proposal (ZEP) 2.
A new efficient implementation of sharding is being discussed as part of #1569.
TLDR: We want to add sharding to zarr. There are some PRs as starting points for further discussions, as well as issue zarr-developers/zarr-specs#127 to update the spec.
Sharding Prototype alimanfoo/zarrita#40, and
Sharding Prototype II: implementation on Array #947
Please see #877 (comment) for a comparison of the implementation approaches.
Currently, zarr maps one array chunk to one storage key, e.g. one file for the DirectoryStore. It would be great to decouple the concept of chunks (e.g. one compressible unit) and storage keys, since the storage might be optimized for larger data per entry and might have an upper limit on the number of entries, such as the file block size and maximum inode number for disk storage. This does not necessarily fit the access patterns of the data, so chunks might need to be smaller than one storage key. For those reasons, we would like to add sharding support to zarr. One shard corresponds to one storage key, but can contain multiple chunks.
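To illustrate the motivation with rough, made-up numbers (not from the proposal):

```python
import math

# A large volume with small chunks would exceed comfortable object counts
# (inode limits, object-listing costs) if every chunk were its own key.
shape = (100_000, 100_000, 10_000)          # voxels (illustrative)
chunk = (32, 32, 32)                        # unit of compression / reading
chunks_per_shard = (32, 32, 32)             # chunks bundled into one storage key

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk))
n_shards = math.prod(
    math.ceil(s / (c * n)) for s, c, n in zip(shape, chunk, chunks_per_shard)
)
print(f"{n_chunks:.1e} chunk objects vs. {n_shards:.1e} shard objects")
# 3.1e+09 chunk objects vs. 9.6e+04 shard objects
```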
We (this is scalable minds) will provide a PR for initial sharding support in zarr-python. This work is funded by the CZI through the EOSS program.
We see the following requirements to implement sharding support:
With this issue and an accompanying prototype PR we want to start a discussion about possible implementation approaches. Currently, we see four approaches, which have different pros and cons:
Based on this assessment, we currently favor a combination of approaches 1 and 3. This keeps the pros of implementing sharding as a wrapper to the storage handler, which is a nice abstraction level. However, sharding should be configured and persisted per array rather than with the store config, so we propose to use a ShardingStore only internally, whereas user-facing configuration happens on the array, e.g. similar to the shape (see the sketch below).

To make further discussions more concrete and tangible, I added an initial prototype for this implementation approach: #876. This prototype still lacks a number of features, which are noted in the PR itself, but contains clear paths to tackle those.
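For illustration, user-facing configuration might then look roughly like this (a hypothetical sketch only; the exact keyword, defaults, and persisted metadata are precisely what the prototype and the spec discussion are meant to settle):

```python
import zarr

# Hypothetical `shards` argument: how many chunks go into one storage key.
# Not part of the released zarr-python API at the time of this issue.
z = zarr.create(
    store="data.zarr",
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),   # unit of compression / reading
    shards=(2, 2),           # chunks per storage key (sketch only)
)
z[:2_000, :2_000] = 42       # touches exactly one shard in this configuration
```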
We invite all interested parties to discuss our initial proposals and assessment, as well as the initial draft PR. Based on the discussion and design we will implement the sharding support, as well as add automated tests and documentation for the new feature.