
checksums for chunks #392

Open
ttung opened this issue Jan 16, 2019 · 21 comments

Comments

@ttung

ttung commented Jan 16, 2019

Problem description

Having checksums for individual chunks is good for verifying the integrity of the data we're loading. The existing mechanisms for checksumming data are inadequate for various reasons:

  1. Checksum of the entire array's data: This does not work for loading a subset of the data.
  2. Checksum of each individual chunk recorded by a filter as part of the chunk: This does not protect against chunks being swapped, and does not help for building a persistent cache for previously read chunks.

Recording the checksums in the .zarray file could work, but may be problematic for larger data sets.


see also:

@jakirkham
Member

Thanks for raising this, @ttung.

It sounds like we have a partial solution for this, but we may need an additional codec to close the gap. For example including the chunk key as part of the checksum computation. Would have to think about how we add this information in a reasonable way while still keeping the API friendly.

...and does not help for building a persistent cache for previously read chunks.

Can you please expand on this a bit?

@alimanfoo
Member

It might be helpful to know when you would want to verify data integrity. E.g., would you want to verify a chunk every time it was read? Or something else?

@ttung
Author

ttung commented Feb 21, 2019

It sounds like we have a partial solution for this, but we may need an additional codec to close the gap. For example including the chunk key as part of the checksum computation. Would have to think about how we add this information in a reasonable way while still keeping the API friendly.

Including the chunk key as part of the checksum computation is still prone to mistakes if one is dealing with multiple data sets. One possible strategy is to write a UUID into the .zarray file, and include that and the chunk key as part of the checksum computation. That would not eliminate the risk of data corruption, but would lessen it.
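
A minimal sketch of that idea, with sha256 as a placeholder algorithm; the layout is illustrative, not part of any zarr spec:

import hashlib
import uuid

array_uuid = str(uuid.uuid4())  # would be recorded once in the .zarray metadata

def chunk_checksum(array_uuid, chunk_key, chunk_bytes):
    # mixing the array UUID and the chunk key into the digest means a chunk
    # copied under a different key, or into a different dataset, fails verification
    h = hashlib.sha256()
    h.update(array_uuid.encode())
    h.update(chunk_key.encode())
    h.update(chunk_bytes)
    return h.hexdigest()

digest = chunk_checksum(array_uuid, "0.0", b"<encoded chunk bytes>")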

I would personally still prefer the checksums be included in the .zarray file or something alongside it.

Can you please expand on this a bit?

We use the chunk checksums in our current data format to index into a persistent cache on disk, and the chunk checksums are included in the metadata files. When the data would otherwise be loaded over the network, a cache hit lets us avoid the transfer.

The strategy where the checksum is written into the chunk can feasibly be worked into such a scheme, but requires us to (see the sketch after this list):

  1. have a predictable offset to the checksum in the chunk output file.
  2. do a ranged HTTP/S3/GS/etc GET to retrieve the checksum.
  3. index into a local cache.
  4. if the local cache doesn't have the data, retrieve the chunk.
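
A hedged sketch of those four steps; the URL, the cache layout, and the assumption that the checksum occupies the first 32 bytes of each chunk object are all hypothetical:

import os
import requests

CACHE_DIR = "/tmp/chunk-cache"   # hypothetical local cache location
CHECKSUM_SIZE = 32               # step 1: assumed fixed offset/size of the checksum

def fetch_chunk(url):
    # step 2: ranged GET to retrieve only the checksum bytes
    resp = requests.get(url, headers={"Range": f"bytes=0-{CHECKSUM_SIZE - 1}"})
    checksum = resp.content.hex()

    # step 3: index into the local cache by checksum
    cached = os.path.join(CACHE_DIR, checksum)
    if os.path.exists(cached):
        with open(cached, "rb") as f:
            return f.read()

    # step 4: cache miss, so retrieve the whole chunk and populate the cache
    body = requests.get(url).content
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cached, "wb") as f:
        f.write(body)
    return body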

It might be helpful to know when you would want to verify data integrity. E.g., would you want to verify a chunk every time it was read? Or something else?

Currently in our existing file format, we verify a chunk every time it is read. There hasn't been any indication that this is an onerous requirement.

@alimanfoo
Member

alimanfoo commented Mar 11, 2019 via email

@jeffdlb

jeffdlb commented Jan 8, 2020

Hello-

Has anybody implemented this? I like Alistair's suggestion above to "write a special file for each array containing all chunk checksums as a JSON object." A different site suggested MurmurHash as something simpler and faster than MD5 or SHA256, though it is not intended to be cryptographically secure.
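
A rough sketch of the quoted suggestion; the sidecar file name ".zchecksums" is hypothetical and not part of the zarr spec:

import hashlib
import json
import os

def write_chunk_checksums(array_dir):
    # hash every chunk file in the array directory, skipping the dot metadata files
    checksums = {}
    for name in sorted(os.listdir(array_dir)):
        if name.startswith("."):
            continue
        with open(os.path.join(array_dir, name), "rb") as f:
            checksums[name] = hashlib.md5(f.read()).hexdigest()
    # write the JSON object alongside .zarray (the file name is made up)
    with open(os.path.join(array_dir, ".zchecksums"), "w") as f:
        json.dump(checksums, f, indent=2, sort_keys=True)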

-Jeff DLB

@rabernat
Contributor

rabernat commented Jul 30, 2021

For some reason, I have been thinking about this instead of doing other more important things like sleeping.

I think I have come up with a scalable algorithm that should work to compute a single hash for a large chunked ndarray that is independent of on-disk chunk size. This algorithm presumes that you are working with dask or a similar parallel computing framework that has the ability to rechunk on the fly.

The idea is to use a Merkle tree to iteratively reduce the array to a single hash. The steps are as follows.

  1. First flatten the ND array to one dimension. (This may not be necessary, but it helps to just think about the 1D problem for now.)
  2. Then rechunk the array to a pre-determined chunk size N (e.g. N=1000 elements). This value will be a parameter of the hash; changing it will result in a different hash.
  3. Compute the hash of each chunk. Since we don't care about cryptographic security but just correctness, we can use whatever checksum algorithm is fastest. Recently I like xxhash.
  4. Now perform a tree reduction on these hashes, computing the hash of the combined (e.g. concatenated) hashes of each pair, all the way up to the root of the tree. The root hash is the array's hash. (A rough sketch follows this list.)
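
As a concrete illustration, here is a minimal sketch of steps 1-4 in plain NumPy, with sha256 from the standard library standing in for xxhash and Python lists standing in for a dask tree reduction; pairs are combined by concatenating their digests. All names are illustrative.

import hashlib
import numpy as np

def merkle_hash(arr, n=1000):
    # step 1: flatten to 1-D (C order)
    flat = np.ascontiguousarray(arr).reshape(-1)
    # step 2: split into fixed logical chunks of n elements (n is a hash parameter)
    chunks = [flat[i:i + n] for i in range(0, flat.size, n)]
    # step 3: hash each logical chunk
    level = [hashlib.sha256(c.tobytes()).digest() for c in chunks]
    # step 4: pairwise tree reduction up to a single root hash
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # deterministic rule for an odd leftover
        level = [hashlib.sha256(a + b).digest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

# the result is independent of how the array happens to be chunked in storage
a = np.arange(10_000, dtype="f8")
assert merkle_hash(a) == merkle_hash(a.reshape(100, 100))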

Pros

  • This method will deliver a deterministic checksum for a chunked array independent of the chunking scheme
  • The algorithm is distributed so can be applied to arbitrarily large data

Cons

  • We have to flatten the array. This can be done lazily via Dask but may not be efficient for all chunking schemes.
  • The value of the hash depends on N, which needs to be stored somehow together with the hash itself.
  • Getting the right value of the final hash requires us to do the tree reduction in step 4 in the same way every time. This might be hard.

As an alternative to flattening the array, we could do the tree reduction in multiple dimensions, but this makes my head hurt to think about.

@d70-t
Contributor

d70-t commented Dec 14, 2021

Thanks @rabernat for pointing me here.

I'm thinking a bit around this and our little initial discussion about IPLD. After reading this thread, I believe that IPLD might indeed be a very good fit.
I believe that building the checksum via a Merkle tree will indeed also work for ND-chunks, but in that case, the entire chunk shape must be fixed and the order in which the chunk checksums are aggregated to a tree checksum must be fixed as well. The advantage of ND-chunk-checksums might be that rechunking may become cheaper or even unnecessary. Apart from that, this method might still share the same cons of @rabernat's approach.

But now for IPLD: as a very brief primer, IPLD is a system of linked data structures which are representable in various formats (including CBOR and JSON). Those structures are built from a small set of data kinds:

  • Null
  • Boolean
  • Integer
  • Float
  • String
  • Bytes
  • List
  • Map
  • Link

So this is basically JSON plus Bytes and Link, which might be the most important kinds for the zarr use case.
Each data structure in IPLD is addressed by a hash of its contents (called a CID), and a Link is a reference from one data structure to another, identified by its CID.
IPLD specifies that entries in a Map must have a defined order (not necessarily sorted), to make the computation of the hash deterministic and unique.
It is possible to write out raw leaves which are not embedded in a codec, so it would be possible to have just the bytes of a chunk in a single object (identified by a CID) instead of e.g. a CBOR object with a single Bytes-kind entry.

Based on those building blocks, one could create the chunks in a defined shape (as suggested by @rabernat), create hashes (CIDs) of each chunk and create a mapping as suggested by @alimanfoo:

{
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
    etc.
}

The subtle difference here is the "/", which is IPLD's way of embedding Links into JSON. The on-disk encoding would likely be CBOR instead, because it is binary and more efficient, but JSON is more illustrative here. Furthermore, because Maps have a defined order, one can compute a deterministic hash of that object, even in the case of N-D chunking.

This approach can be continued up the levels. Let's say the CID of the object above is "0123456789abcdef"; a natural extension would be to have a top-level object like:

{
    ".zgroup": {"zarr_format": 2},
    "temperature": {"/": "0123456789abcdef"},
}

In which case one would obtain a checksum for the entire dataset as well.
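
To make the determinism argument concrete, here is a rough stand-in for the CID computation, using JSON with sorted keys and sha256 in place of IPLD's dag-cbor encoding and multihash-based CIDs (the hex strings are the placeholders from the example above, not real CIDs):

import hashlib
import json

def fake_cid(obj):
    # canonical encoding: fixed key order, no incidental whitespace
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()

array_node = {
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
}
root_node = {
    ".zgroup": {"zarr_format": 2},
    "temperature": {"/": fake_cid(array_node)},
}
dataset_checksum = fake_cid(root_node)  # one hash covering the entire dataset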

Note that it would be possible to create inline nested objects as well as referenced objects (so the smaller metadata objects do not necessarily have to live separately), but one would of course want to specify deterministically how to do this, because it changes the overall hash. It is possible to have an IPLD implementation which makes reads across links transparent, which gives you filesystem-like paths (e.g. access via root_CID/temperature/0.1 would return the Bytes behind f6869ce45bf74338b41c4c1a6f8e58a5).

One could use this scheme just to compute the top-level hash for verification and throw away all the intermediate data, but one could also write out the intermediate (or all) blocks based on this scheme, such that parts can be verified later on (one would need the mapping from chunk id to chunk CID to verify individual chunks).
Each of the blocks identified by a CID can be transported individually. One could write them out in a CAR stream (a bit like tar, but using CIDs instead of paths as keys), put them onto a key-value store (using CIDs as keys) or transport them via IPFS. Each of these methods might suit different purposes, but any of them can be converted into any other.

@d70-t
Contributor

d70-t commented Dec 14, 2021

Oh... and given someone has written out the Merkle tree structure in the form of IPLD, including the .zarray metadata, like:

{
    ".zarray": {"chunks": [...], "compressor": {...}, "dtype": "...", ...},
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
    etc.
}

one could fetch only that level of the Merkle tree (excluding the actual data blocks). Based on this information (which is itself verifiable by the CID), one would know to which shape, compression, datatype etc. one would have to rearrange some locally available data in order to verify correctness against the given hashes. Thus one would have a flexible and natural way of specifying the required "hash parameters".

@rabernat
Contributor

One complexity here is that there are two possible hashes for each chunk:

  • The hash of the uncompressed data - needed to verify the array contents; should be independent of choice of compression codec
  • The hash of the compressed data - appropriate for using in a CID system

My post above was more about the former. However, it may make sense to focus on the latter if we are interested in IPLD.

@rabernat
Contributor

Oh... and given someone has written out the Merkle tree structure in the form of IPLD

This looks a lot like consolidated metadata. 🚀

@d70-t
Contributor

d70-t commented Dec 14, 2021

One complexity here is that there are two possible hashes for each chunk:

I'd say yes and no. If we make the way of compressing the data a parameter (just as the chunking is a parameter), then any author of a hash set could decide to use a null compression to arrive at hashes for uncompressed data, or another compression to arrive at a different set of hashes. In either case, whoever verifies the data must use the same algorithm (including hash, chunking and compression).

There is a question, though, of whether compressing one way (i.e. without decompression) is a deterministic process or whether it changes across versions...

@d70-t
Contributor

d70-t commented Jan 19, 2022

I've been playing around with this idea of using IPLD to list hashes a bit. See here for some (experimental) code examples.

  • On the high-level side, it's a MutableMapping which can be used as a Store for zarr (a simplified sketch of a verifying store appears at the end of this comment).
  • On the mid-level, it assumes handling of IPLD data blocks, identified by their content id (CID, which is a hash + hash algorithm + encoding of the referenced data).
  • On the low-level, it stores those blocks.

Further notes:

It uses CBOR instead of JSON, because that seems to be the more natural choice in the IPLD world (but JSON would also be possible) and because some form of normalization is required anyway to produce stable hashes (e.g. keys must be sorted, whitespace must be eliminated or at least consistently applied, etc.).

I've added inlining for certain objects, such that zarr metadata files can become part of the structure which holds references to the data chunks. This looks a lot like consolidated metadata (as @rabernat mentioned), but formally doesn't use consolidated metadata. An advantage of this approach might be that metadata would become visible to other tools which allow processing of generic IPLD data.

Computing content identifiers for the roots of the tree, in addition to the data chunks, has the additional advantage that it becomes possible to verify whether any chunks are missing.
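
Not the linked experimental code, but a simplified sketch of the high-level idea from the first bullet above: a MutableMapping wrapper (usable as a zarr v2 store) that verifies a per-key hash on every read. The checksums mapping here is a stand-in for the hashes that would live in the IPLD tree.

from collections.abc import MutableMapping
import hashlib

class VerifyingStore(MutableMapping):
    """Wrap another MutableMapping and check a per-key digest on every read."""

    def __init__(self, inner, checksums):
        self.inner = inner          # e.g. a zarr.DirectoryStore or a plain dict
        self.checksums = checksums  # key -> expected hex digest

    def __getitem__(self, key):
        value = self.inner[key]
        expected = self.checksums.get(key)
        if expected is not None and hashlib.sha256(value).hexdigest() != expected:
            raise ValueError(f"checksum mismatch for key {key!r}")
        return value

    def __setitem__(self, key, value):
        self.checksums[key] = hashlib.sha256(value).hexdigest()
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]
        self.checksums.pop(key, None)

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)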

@d70-t
Contributor

d70-t commented Jan 19, 2022

  • The hash of the uncompressed data - needed to verify the array contents; should be independent of choice of compression codec
  • The hash of the compressed data - appropriate for using in a CID system

In theory, it should also be possible to use hashes of uncompressed data in a CID system. The goal of a CID is to be a unique identifier for the content of relevance, and I believe it doesn't really matter whether that content is stored in compressed or uncompressed form at a level below the block storage interface. There are even two ways of looking at a situation where the CID is computed from the uncompressed data and the data is stored as compressed data:

  • Compression could be part of the transport / storage protocol, then the transport protocol must uncompress received data before verifying the CID
  • Compression could be part of the hashing algorithm (I'm not a crypto expert, could be a silly idea...), e.g. one could define hash = lambda x: sha256(zlib.decompress(x)). In that case, it really would be a CID of the compressed data, but it looks like a sha256 of the uncompressed data (a small sketch follows at the end of this comment).

The downside of those approaches, however, would be that whatever compression mechanism is used must be known down to the CID layer or even below. That might either be a terrible design choice (communicating the compressor through all layers) or it might reduce the ability to adapt compression to the data (much like OPeNDAP can use HTTP compression, which is unaware of the specifics of the data), which might again be a bad design choice.

Another issue with the combination of hashes and compression might appear once lossy compression comes into play. If that is the case, it might actually be better to compute hashes from the compressed data, or at least do a compression - decompression roundtrip before computing the hash, because otherwise data written once could never be verified.
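
A tiny sketch of the second bullet above, with zlib standing in for whatever compressor the chunks actually use:

import hashlib
import zlib

def uncompressed_sha256(compressed):
    # the "hash algorithm" includes decompression, so the digest equals a plain
    # sha256 of the uncompressed bytes while still identifying the stored object
    return hashlib.sha256(zlib.decompress(compressed)).digest()

data = b"some chunk bytes" * 100
assert uncompressed_sha256(zlib.compress(data)) == hashlib.sha256(data).digest()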

@jakirkham
Member

cc @martindurant (as you may be interested in the recent discussion here)

@rabernat
Contributor

This thread might be interested to know that @martindurant has just implemented the fletcher32 checksum codec in numcodecs, which allows per-chunk checksums for Zarr. See zarr-developers/numcodecs#412.
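
A short usage sketch, assuming a numcodecs release that ships Fletcher32; with the codec in the filter chain, each chunk gets a checksum on write that is verified on read:

import numpy as np
import zarr
from numcodecs import Fletcher32  # available in recent numcodecs releases

z = zarr.open(
    "data.zarr",
    mode="w",
    shape=(100, 100),
    chunks=(10, 10),
    dtype="f8",
    filters=[Fletcher32()],  # checksum stored with each encoded chunk, checked on read
)
z[:] = np.random.random((100, 100))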

@jjnesbitt

To whomever this may be helpful: we've implemented a kind of zarr checksum on the DANDI project: https://github.com/dandi/zarr_checksum

This might be too high level for what's being discussed here, but I thought it worth mentioning. It was designed fairly specifically for our use case, although I'm not sure what other use cases (if any) it applies to.

@jakirkham
Member

Thanks for sharing Jacob! 🙏

@cwognum

cwognum commented Mar 25, 2024

Hi everyone, I am looking into ways to compute a checksum for the entire Zarr archive as a way to ensure data integrity. For this use case, I don't require a semantic hash. It would thus be okay if the hash changes when the chunking changes or when the encoding changes.

It seems like the zarr_checksum package would be my best bet for such a use case. Is this something you're thinking of officially adding to Zarr? If so, are there any optimizations or design considerations you've been thinking of (e.g. using xxhash rather than md5)?

I might lack the expertise to implement this in Zarr (still learning!), but would nevertheless be willing to give it a try!

@jjnesbitt

jjnesbitt commented Mar 25, 2024

Hi @cwognum, just as an FYI, the zarr-checksum package was created as a part of the DANDI project, and isn't officially associated with the Zarr project. Nonetheless if you wanted to make use of it or extend it you're certainly welcome to!

are there any optimizations or design consideration you've been thinking of (e.g. using xxhash rather than md5?)

The default implementation for checksumming a local zarr directory uses md5 as a matter of practicality (since in the DANDI project we upload to S3 and want to match against their checksums, which use md5). However, if you wanted to use a different hash algorithm, it would be as simple as creating your own file generator function (in place of yield_files_local, for example) and passing that to compute_zarr_checksum. Here is the existing implementation using md5; you could probably just copy that code into a new function and replace the md5 bits with xxhash (or whatever else).
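
For reference, a hedged usage sketch built from the function names mentioned above; the import paths are assumptions and should be checked against the zarr_checksum README:

# import paths below are assumptions; check the zarr_checksum README for the exact API
from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local

checksum = compute_zarr_checksum(yield_files_local("path/to/archive.zarr"))
print(checksum)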

Hopefully this helps! If you had any more questions specific to zarr-checksum, I'd be happy to answer over in that repo. Again I'm not officially associated with the Zarr project or anything, so that's all I can really lend help with.

UPDATE: Uhh as a matter of fact there is some md5 specific code in that library. I'll look into generalizing that.

@cwognum

cwognum commented Mar 25, 2024

Thanks @jjnesbitt for the quick response! I wasn't familiar with DANDI before, but it looks cool! I'm working on something similar with Polaris.

I gave the zarr-checksum package a look earlier today and I indeed came to the same conclusion that adapting it to our needs should be relatively easy and fast.

and isn't officially associated with the Zarr project

Because I am a bit wary of having to change the checksum down the line or having to maintain an unofficial checksum implementation, I was considering porting (a version of) your package to be the official implementation within Zarr. Not sure if this is of interest to the Zarr maintainers, however!

@jhamman
Member

jhamman commented Mar 26, 2024

Hi @cwognum - glad to see your interest in this topic. In the short term, I think using zarr-checksum outside of Zarr-Python is your best bet. In the medium-to-long term, I think supporting per-chunk checksums will be possible through a v3 storage transformer. I recently wrote up a proposal for how this could be represented as a spec extension: zarr-developers/zarr-specs#287.
