
checksums for chunks #392

Open
ttung opened this issue Jan 16, 2019 · 21 comments

Comments

@ttung

ttung commented Jan 16, 2019

Problem description

Having checksums for individual chunks is good for verifying the integrity of the data we're loading. The existing mechanisms for checksumming data are inadequate for various reasons:

  1. Checksum of the entire array's data: This does not work for loading a subset of the data.
  2. Checksum of each individual chunk recorded by a filter as part of the chunk: This does not protect against chunks being swapped, and does not help for building a persistent cache for previously read chunks.

Recording the checksums in the .zarray file could work, but may be problematic for larger data sets.


see also:

@jakirkham
Member

Thanks for raising this, @ttung.

It sounds like we have a partial solution for this, but we may need an additional codec to close the gap. For example including the chunk key as part of the checksum computation. Would have to think about how we add this information in a reasonable way while still keeping the API friendly.

...and does not help for building a persistent cache for previously read chunks.

Can you please expand on this a bit?

@alimanfoo
Member

It might be helpful to know when you would want to verify data integrity. E.g., would you want to verify a chunk every time it was read? Or something else?

@ttung
Author

ttung commented Feb 21, 2019

It sounds like we have a partial solution for this, but we may need an additional codec to close the gap. For example including the chunk key as part of the checksum computation. Would have to think about how we add this information in a reasonable way while still keeping the API friendly.

Including the chunk key as part of the checksum computation is still prone to mistakes if one is dealing with multiple data sets. One possible strategy is to write a UUID into the .zarray file, and include that and the chunk key as part of the checksum computation. That would not eliminate the risk of data corruption, but would lessen it.
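
A minimal sketch of that idea, with sha256 as a placeholder algorithm; the layout is illustrative, not part of any zarr spec:

import hashlib
import uuid

array_uuid = str(uuid.uuid4())  # would be recorded once in the .zarray metadata

def chunk_checksum(array_uuid, chunk_key, chunk_bytes):
    # mixing the array UUID and the chunk key into the digest means a chunk
    # copied under a different key, or into a different dataset, fails verification
    h = hashlib.sha256()
    h.update(array_uuid.encode())
    h.update(chunk_key.encode())
    h.update(chunk_bytes)
    return h.hexdigest()

digest = chunk_checksum(array_uuid, "0.0", b"<encoded chunk bytes>")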

I would personally still prefer the checksums be included in the .zarray file or something alongside it.

Can you please expand on this a bit?

We use the chunk checksums in our current data format to index into a persistent cache on disk, and the chunk checksums are included in the metadata files. When the data would otherwise be loaded over the network, a cache hit lets us avoid the transfer.

The strategy where the checksum is written into the chunk can feasibly be worked into such a scheme, but requires us to (see the sketch after this list):

  1. have a predictable offset to the checksum in the chunk output file.
  2. do a ranged HTTP/S3/GS/etc GET to retrieve the checksum.
  3. index into a local cache.
  4. if the local cache doesn't have the data, retrieve the chunk.
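
A hedged sketch of those four steps; the URL, the cache layout, and the assumption that the checksum occupies the first 32 bytes of each chunk object are all hypothetical:

import os
import requests

CACHE_DIR = "/tmp/chunk-cache"   # hypothetical local cache location
CHECKSUM_SIZE = 32               # step 1: assumed fixed offset/size of the checksum

def fetch_chunk(url):
    # step 2: ranged GET to retrieve only the checksum bytes
    resp = requests.get(url, headers={"Range": f"bytes=0-{CHECKSUM_SIZE - 1}"})
    checksum = resp.content.hex()

    # step 3: index into the local cache by checksum
    cached = os.path.join(CACHE_DIR, checksum)
    if os.path.exists(cached):
        with open(cached, "rb") as f:
            return f.read()

    # step 4: cache miss, so retrieve the whole chunk and populate the cache
    body = requests.get(url).content
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cached, "wb") as f:
        f.write(body)
    return body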

It might be helpful to know when you would want to verify data integrity. E.g., would you want to verify a chunk every time it was read? Or something else?

Currently in our existing file format, we verify a chunk every time it is read. There hasn't been any indication that this is an onerous requirement.

@alimanfoo
Member

alimanfoo commented Mar 11, 2019 via email

@jeffdlb

jeffdlb commented Jan 8, 2020

Hello-

Has anybody implemented this? I like Alistair's suggestion above to "write a special file for each array containing all chunk checksums as a JSON object." A different site suggested MurmurHash as something simpler and faster than MD5 or SHA256, though it is not intended to be cryptographically secure.
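
A rough sketch of the quoted suggestion; the sidecar file name ".zchecksums" is hypothetical and not part of the zarr spec:

import hashlib
import json
import os

def write_chunk_checksums(array_dir):
    # hash every chunk file in the array directory, skipping the dot metadata files
    checksums = {}
    for name in sorted(os.listdir(array_dir)):
        if name.startswith("."):
            continue
        with open(os.path.join(array_dir, name), "rb") as f:
            checksums[name] = hashlib.md5(f.read()).hexdigest()
    # write the JSON object alongside .zarray (the file name is made up)
    with open(os.path.join(array_dir, ".zchecksums"), "w") as f:
        json.dump(checksums, f, indent=2, sort_keys=True)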

-Jeff DLB

@rabernat
Contributor

rabernat commented Jul 30, 2021

For some reason, I have been thinking about this instead of doing other more important things like sleeping.

I think I have come up with a scalable algorithm that should work to compute a single hash for a large chunked ndarray that is independent of on-disk chunk size. This algorithm presumes that you are working with dask or a similar parallel computing framework that has the ability to rechunk on the fly.

The idea is to use a Merkle tree to iteratively reduce the array to a single hash. The steps are as follows.

  1. First flatten the ND array to one dimension. (This may not be necessary, but it helps to just think about the 1D problem for now.)
  2. Then rechunk the array to a pre-determined chunk size N (e.g. N=1000 elements). This value will be a parameter of the hash; changing it will result in a different hash.
  3. Compute the hash of each chunk. Since we don't care about cryptographic security but just correctness, we can use whatever checksum algorithm is fastest. Recently I like xxhash.
  4. Now perform a tree reduction on these hashes, computing the hash of the combined (e.g. concatenated) hashes of each pair, all the way up to the root of the tree. The root hash is the array's hash. (A rough sketch follows this list.)
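
As a concrete illustration, here is a minimal sketch of steps 1-4 in plain NumPy, with sha256 from the standard library standing in for xxhash and Python lists standing in for a dask tree reduction; pairs are combined by concatenating their digests. All names are illustrative.

import hashlib
import numpy as np

def merkle_hash(arr, n=1000):
    # step 1: flatten to 1-D (C order)
    flat = np.ascontiguousarray(arr).reshape(-1)
    # step 2: split into fixed logical chunks of n elements (n is a hash parameter)
    chunks = [flat[i:i + n] for i in range(0, flat.size, n)]
    # step 3: hash each logical chunk
    level = [hashlib.sha256(c.tobytes()).digest() for c in chunks]
    # step 4: pairwise tree reduction up to a single root hash
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # deterministic rule for an odd leftover
        level = [hashlib.sha256(a + b).digest()
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

# the result is independent of how the array happens to be chunked in storage
a = np.arange(10_000, dtype="f8")
assert merkle_hash(a) == merkle_hash(a.reshape(100, 100))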

Pros

  • This method will deliver a deterministic checksum for a chunked array independent of the chunking scheme
  • The algorithm is distributed so can be applied to arbitrarily large data

Cons

  • We have to flatten the array. This can be done lazily via Dask but may not be efficient for all chunking schemes.
  • The value of the hash depends on N, which needs to be stored somehow together with the hash itself.
  • Getting the right value of the final hash requires us to do the tree reduction in step 4 in the same way every time. This might be hard.

As an alternative to flattening the array, we could do the tree reduction in multiple dimensions, but this makes my head hurt to think about.

@d70-t
Contributor

d70-t commented Dec 14, 2021

Thanks @rabernat for pointing me here.

I'm thinking a bit around this and our little initial discussion about IPLD. After reading this thread, I believe that IPLD might indeed be a very good fit.
I believe that building the checksum via a Merkle tree will indeed also work for ND-chunks, but in that case, the entire chunk shape must be fixed and the order in which the chunk checksums are aggregated to a tree checksum must be fixed as well. The advantage of ND-chunk-checksums might be that rechunking may become cheaper or even unnecessary. Apart from that, this method might still share the same cons of @rabernat's approach.

But now for IPLD: as a very brief primer, IPLD is a system of linked data structures which are representable in various formats (including CBOR and JSON). Those structures are built from a small set of data kinds:

  • Null
  • Boolean
  • Integer
  • Float
  • String
  • Bytes
  • List
  • Map
  • Link

So this is basically JSON plus Bytes and Link, which might be the most important kinds for the zarr use case.
Each data structure in IPLD is addressed by a hash of its contents (called a CID), and a Link is a reference from one data structure to another, identified by its CID.
IPLD specifies that entries in a Map must have a defined order (not necessarily sorted), to make the computation of the hash deterministic and unique.
It is possible to write out raw leaves which are not embedded in a codec, so it would be possible to have just the bytes of a chunk in a single object (identified by a CID) instead of e.g. a CBOR object with a single Bytes-kind entry.

Based on those building blocks, one could create the chunks in a defined shape (as suggested by @rabernat), create hashes (CIDs) of each chunk and create a mapping as suggested by @alimanfoo:

{
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
    etc.
}

The subtle difference here is the "/", which is IPLD's way of embedding Links into JSON. The on-disk encoding would likely be CBOR instead, because it is binary and more efficient, but JSON is more illustrative here. Furthermore, because Maps have a defined order, one can compute a deterministic hash of that object, even in the case of N-D chunking.

This approach can be continued up the levels. Let's say the CID of the object above is "0123456789abcdef"; a natural extension would be to have a top-level object like:

{
    ".zgroup": {"zarr_format": 2},
    "temperature": {"/": "0123456789abcdef"},
}

In which case one would obtain a checksum for the entire dataset as well.
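
To make the determinism argument concrete, here is a rough stand-in for the CID computation, using JSON with sorted keys and sha256 in place of IPLD's dag-cbor encoding and multihash-based CIDs (the hex strings are the placeholders from the example above, not real CIDs):

import hashlib
import json

def fake_cid(obj):
    # canonical encoding: fixed key order, no incidental whitespace
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()

array_node = {
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
}
root_node = {
    ".zgroup": {"zarr_format": 2},
    "temperature": {"/": fake_cid(array_node)},
}
dataset_checksum = fake_cid(root_node)  # one hash covering the entire dataset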

Note that it would be possible to create inline nested objects as well as referenced objects (so the smaller metadata objects do not necessarily have to live separately), but one would of course want to specify deterministically how to do this, because it changes the overall hash. It is possible to have an IPLD implementation which makes reads across links transparent, which gives you filesystem-like paths (e.g. access via root_CID/temperature/0.1 would return the Bytes behind f6869ce45bf74338b41c4c1a6f8e58a5).

One could use this scheme just to compute the top-level hash for verification and throw away all the intermediate data, but one could also write out the intermediate (or all) blocks based on this scheme, such that parts can be verified later on (one would need the mapping from chunk id to chunk CID to verify individual chunks).
Each of the blocks identified by a CID can be transported individually. One could write them out in a CAR stream (a bit like tar, but using CIDs instead of paths as keys), put them onto a key-value store (using CIDs as keys) or transport them via IPFS. Each of these methods might suit different purposes, but any of them can be converted into any other.

@d70-t
Contributor

d70-t commented Dec 14, 2021

Oh... and given someone has written out the Merkle tree structure in the form of IPLD, including the .zarray metadata, like:

{
    ".zarray": {"chunks": [...], "compressor": {...}, "dtype": "...", ...},
    "0.0": {"/": "4f20243c7cd186a8353798c0adbf2300"},
    "0.1": {"/": "f6869ce45bf74338b41c4c1a6f8e58a5"},
    etc.
}

one could fetch only that level of the Merkle tree (excluding the actual data blocks). Based on this information (which is itself verifiable by the CID), one would know to which shape, compression, datatype etc. one would have to rearrange some locally available data in order to verify correctness against the given hashes. Thus one would have a flexible and natural way of specifying the required "hash parameters".

@rabernat
Contributor

One complexity here is that there are two possible hashes for each chunk:

  • The hash of the uncompressed data - needed to verify the array contents; should be independent of choice of compression codec
  • The hash of the compressed data - appropriate for using in a CID system

My post above was more about the former. However, it may make sense to focus on the latter if we are interested in IPLD.

@rabernat
Contributor

Oh... and given someone has written out the Merkle tree structure in the form of IPLD

This looks a lot like consolidated metadata. 🚀

@d70-t
Contributor

d70-t commented Dec 14, 2021

One complexity here is that there are two possible hashes for each chunk:

I'd say yes and no. If we make the way of compressing the data a parameter (just as the chunking is a parameter), then any author of a hash set could decide to use a null compression to arrive at hashes for uncompressed data, or another compression to arrive at a different set of hashes. In either case, whoever verifies the data must use the same algorithm (including hash, chunking and compression).

There is a question, though, of whether compressing one way (i.e. without decompression) is a deterministic process or whether it changes across versions...

@d70-t
Contributor

d70-t commented Jan 19, 2022

I've been playing around with this idea of using IPLD to list hashes a bit. See here for some (experimental) code examples.

  • On the high-level side, it's a MutableMapping which can be used as a Store for zarr (a simplified sketch of a verifying store appears at the end of this comment).
  • On the mid-level, it assumes handling of IPLD data blocks, identified by their content id (CID, which is a hash + hash algorithm + encoding of the referenced data).
  • On the low-level, it stores those blocks.

Further notes:

It uses CBOR instead of JSON, because that seems to be the more natural choice in the IPLD world (but JSON would also be possible) and because some form of normalization is required anyway to produce stable hashes (e.g. keys must be sorted, whitespace must be eliminated or at least consistently applied, etc.).

I've added inlining for certain objects, such that zarr metadata files can become part of the structure which holds references to the data chunks. This looks a lot like consolidated metadata (as @rabernat mentioned), but formally doesn't use consolidated metadata. An advantage of this approach might be that metadata would become visible to other tools which allow processing of generic IPLD data.

Computing content identifiers for the roots of the tree, in addition to the data chunks, has the additional advantage that it becomes possible to verify whether any chunks are missing.
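
Not the linked experimental code, but a simplified sketch of the high-level idea from the first bullet above: a MutableMapping wrapper (usable as a zarr v2 store) that verifies a per-key hash on every read. The checksums mapping here is a stand-in for the hashes that would live in the IPLD tree.

from collections.abc import MutableMapping
import hashlib

class VerifyingStore(MutableMapping):
    """Wrap another MutableMapping and check a per-key digest on every read."""

    def __init__(self, inner, checksums):
        self.inner = inner          # e.g. a zarr.DirectoryStore or a plain dict
        self.checksums = checksums  # key -> expected hex digest

    def __getitem__(self, key):
        value = self.inner[key]
        expected = self.checksums.get(key)
        if expected is not None and hashlib.sha256(value).hexdigest() != expected:
            raise ValueError(f"checksum mismatch for key {key!r}")
        return value

    def __setitem__(self, key, value):
        self.checksums[key] = hashlib.sha256(value).hexdigest()
        self.inner[key] = value

    def __delitem__(self, key):
        del self.inner[key]
        self.checksums.pop(key, None)

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)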

@d70-t
Contributor

d70-t commented Jan 19, 2022

  • The hash of the uncompressed data - needed to verify the array contents; should be independent of choice of compression codec
  • The hash of the compressed data - appropriate for using in a CID system

In theory, it should also be possible to use hashes of uncompressed data in a CID system. The goal of a CID is to be a unique identifier for the content of relevance, and I believe it doesn't really matter whether that content is stored in compressed or uncompressed form at a level below the block storage interface. There are even two ways of looking at a situation where the CID is computed from the uncompressed data and the data is stored as compressed data:

  • Compression could be part of the transport / storage protocol, then the transport protocol must uncompress received data before verifying the CID
  • Compression could be part of the hashing algorithm (I'm not a crypto expert, could be a silly idea...), e.g. one could define hash = lambda x: sha256(zlib.decompress(x)). In that case, it really would be a CID of the compressed data, but it looks like a sha256 of the uncompressed data (a small sketch follows at the end of this comment).

The downside of those approaches, however, would be that whatever compression mechanism is used must be known down to the CID layer or even below. That might either be a terrible design choice (communicating the compressor through all layers) or it might reduce the ability to adapt compression to the data (much like OPeNDAP can use HTTP compression, which is unaware of the specifics of the data), which might again be a bad design choice.

Another issue with the combination of hashes and compression might appear once lossy compression comes into play. If that is the case, it might actually be better to compute hashes from the compressed data, or at least do a compression - decompression roundtrip before computing the hash, because otherwise data written once could never be verified.
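
A tiny sketch of the second bullet above, with zlib standing in for whatever compressor the chunks actually use:

import hashlib
import zlib

def uncompressed_sha256(compressed):
    # the "hash algorithm" includes decompression, so the digest equals a plain
    # sha256 of the uncompressed bytes while still identifying the stored object
    return hashlib.sha256(zlib.decompress(compressed)).digest()

data = b"some chunk bytes" * 100
assert uncompressed_sha256(zlib.compress(data)) == hashlib.sha256(data).digest()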

@jakirkham
Member

cc @martindurant (as you may be interested in the recent discussion here)

@rabernat
Contributor

This thread might be interested to know that @martindurant has just implemented the fletcher32 checksum codec in numcodecs, which allows per-chunk checksums for Zarr. See zarr-developers/numcodecs#412.
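
A short usage sketch, assuming a numcodecs release that ships Fletcher32; with the codec in the filter chain, each chunk gets a checksum on write that is verified on read:

import numpy as np
import zarr
from numcodecs import Fletcher32  # available in recent numcodecs releases

z = zarr.open(
    "data.zarr",
    mode="w",
    shape=(100, 100),
    chunks=(10, 10),
    dtype="f8",
    filters=[Fletcher32()],  # checksum stored with each encoded chunk, checked on read
)
z[:] = np.random.random((100, 100))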

@jjnesbitt

To whomever this may be helpful: we've implemented a kind of zarr checksum on the DANDI project: https://github.com/dandi/zarr_checksum

This might be too high level for what's being discussed here, but I thought it worth mentioning. It was designed fairly specifically for our use case, although I'm not sure what other use cases (if any) it applies to.

@jakirkham
Member

Thanks for sharing Jacob! 🙏

@cwognum

cwognum commented Mar 25, 2024

Hi everyone, I am looking into ways to compute a checksum for the entire Zarr archive as a way to ensure data integrity. For this use case, I don't require a semantic hash. It would thus be okay if the hash changes when the chunking changes or when the encoding changes.

It seems like the zarr_checksum package would be my best bet for such a use case. Is this something you're thinking of officially adding to Zarr? If so, are there any optimizations or design considerations you've been thinking of (e.g. using xxhash rather than md5)?

I might lack the expertise to implement this in Zarr (still learning!), but would nevertheless be willing to give it a try!

@jjnesbitt

jjnesbitt commented Mar 25, 2024

Hi @cwognum, just as an FYI, the zarr-checksum package was created as a part of the DANDI project, and isn't officially associated with the Zarr project. Nonetheless if you wanted to make use of it or extend it you're certainly welcome to!

are there any optimizations or design consideration you've been thinking of (e.g. using xxhash rather than md5?)

The default implementation for checksumming a local zarr directory uses md5 as a matter of practicality (since in the DANDI project we upload to S3 and want to match against their checksums, which use md5). However, if you wanted to use a different hash algorithm, it would be as simple as creating your own file generator function (in place of yield_files_local, for example) and passing that to compute_zarr_checksum. Here is the existing implementation using md5; you could probably just copy that code into a new function and replace the md5 bits with xxhash (or whatever else).
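
For reference, a hedged usage sketch built from the function names mentioned above; the import paths are assumptions and should be checked against the zarr_checksum README:

# import paths below are assumptions; check the zarr_checksum README for the exact API
from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local

checksum = compute_zarr_checksum(yield_files_local("path/to/archive.zarr"))
print(checksum)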

Hopefully this helps! If you had any more questions specific to zarr-checksum, I'd be happy to answer over in that repo. Again I'm not officially associated with the Zarr project or anything, so that's all I can really lend help with.

UPDATE: Uhh as a matter of fact there is some md5 specific code in that library. I'll look into generalizing that.

@cwognum

cwognum commented Mar 25, 2024

Thanks @jjnesbitt for the quick response! I wasn't familiar with DANDI before, but it looks cool! I'm working on something similar with Polaris.

I gave the zarr-checksum package a look earlier today and I indeed came to the same conclusion that adapting it to our needs should be relatively easy and fast.

and isn't officially associated with the Zarr project

Because I am a bit wary of having to change the checksum down the line or having to maintain an unofficial checksum implementation, I was considering porting (a version of) your package to be the official implementation within Zarr. Not sure if this is of interest to the Zarr maintainers, however!

@jhamman
Member

jhamman commented Mar 26, 2024

Hi @cwognum - glad to see your interest in this topic. In the short term, I think using zarr-checksum outside of Zarr-Python is your best bet. In the medium-to-long term, I think supporting per-chunk checksums will be possible through a v3 storage transformer. I recently wrote up a proposal for how this could be represented as a spec extension: zarr-developers/zarr-specs#287.
