checksums for chunks #392
Thanks for raising this, @ttung. It sounds like we have a partial solution for this (the checksum codecs in numcodecs: https://numcodecs.readthedocs.io/en/latest/checksum32.html), but we may need an additional codec to close the gap, for example one that includes the chunk key as part of the checksum computation. We would have to think about how to add this information in a reasonable way while still keeping the API friendly.
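For readers following along, a minimal sketch of what that partial solution looks like: one of the numcodecs 32-bit checksum codecs attached as a per-chunk filter (the path and array parameters below are purely illustrative):

```python
import numpy as np
import zarr
from numcodecs import CRC32, Blosc

# CRC32 (or Adler32) from numcodecs stores a 32-bit checksum alongside every
# encoded chunk; depending on the numcodecs version, a mismatch is reported
# when the chunk is decoded.
z = zarr.open(
    "example.zarr",              # illustrative path
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f8",
    filters=[CRC32()],           # per-chunk checksum
    compressor=Blosc(),
)
z[:] = np.random.random((1000, 1000))

# Reading back decodes each chunk through the checksum codec.
z2 = zarr.open("example.zarr", mode="r")
_ = z2[:10, :10]
```

The gap mentioned above is that such a checksum covers only the chunk's own bytes, so it cannot detect a chunk being swapped with another valid chunk from a different key or dataset.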
Can you please expand on this a bit? |
It might be helpful to know when you would want to verify data integrity. E.g., would you want to verify a chunk every time it was read? Or something else? |
Including the chunk key as part of the checksum computation is still prone to mistakes if one is dealing with multiple data sets. One possible strategy is to write a UUID into the .zarray file, and include that and the chunk key as part of the checksum computation. That would not eliminate the risk of data corruption, but would lessen it. I would personally still prefer the checksums be included in the .zarray file or something alongside it.
We use the chunk checksums in our current data format to index into a persistent cache on disk, and the chunk checksums are included in the metadata files. If we are loading data from the network, we can avoid doing a network transfer. The strategy where the checksum is written into the chunk can feasibly be worked into such a scheme, but requires us to:
1. have a predictable offset to the checksum in the chunk output file.
2. do a ranged HTTP/S3/GS/etc GET to retrieve the checksum.
3. index into a local cache.
4. if the local cache doesn't have the data, retrieve the chunk.
Currently in our existing file format, we verify a chunk every time it is read. There hasn't been any indication that this is an onerous requirement. |
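A minimal sketch of the UUID-plus-chunk-key idea described above (the function and field names are illustrative, not an existing Zarr API):

```python
import hashlib
import uuid

def chunk_checksum(array_uuid: str, chunk_key: str, chunk_bytes: bytes) -> str:
    """Digest bound to one array (via its UUID) and one chunk position, so a chunk
    from another dataset or another key cannot masquerade as this one."""
    h = hashlib.sha256()
    h.update(array_uuid.encode())
    h.update(chunk_key.encode())
    h.update(chunk_bytes)
    return h.hexdigest()

# The UUID would be generated once and stored in (or alongside) the .zarray metadata.
array_uuid = str(uuid.uuid4())
digest = chunk_checksum(array_uuid, "0.0", b"...encoded chunk bytes...")
```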
Hi Tony,
I think I would suggest two possible approaches.
One would be to adopt a convention where you write a checksum file
alongside every chunk file into your zarr store. E.g., if you have an array
at path /foo/bar within a store, and you have chunks at /foo/bar/0.0,
/foo/bar/0.1, etc., then write checksums into files at /foo/bar/0.0.md5,
/foo/bar/0.1.md5, etc. For the format of these checksum files you could
just write the hex digest as ascii text, or you could follow the format of
the md5sum command line utility, which would mean you could also use md5sum
directly on the command line if you have the data on a file system.
Obviously this could be adapted to use other types of hash.
Another option would be to adopt a convention where you write a special
file for each array containing all chunk checksums as a JSON object. E.g.,
if you have an array at path /foo/bar then you could store a JSON document
at /foo/bar/.md5 which looks something like:
{
  "0.0": "4f20243c7cd186a8353798c0adbf2300",
  "0.1": "f6869ce45bf74338b41c4c1a6f8e58a5",
  etc.
}
Note that for both options I am basically suggesting that you layer your
own convention on top of the store API, regarding what keys and values you
use to store chunk checksums. The exact keys and values you use would be up
to you. The only essential ideas in the two options above are that in option
1 you store the checksum for each chunk under a separate key, and in option
2 you store all chunk checksums together in a single document under a
single key. Which option is preferable would depend on usage patterns, numbers of chunks,
I/O latency, etc.
You could also use the ".zattrs" key for option 2 rather than a separate
".md5" key, but then with lots of chunks you might slow down access to
other user attributes. Hence it may be better to use a different key.
Hope that makes some sense.
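To make the two conventions concrete, here is a rough sketch of how they could be layered on top of the zarr-python 2.x store API (the paths, the choice of md5, and the key-filtering logic are all illustrative):

```python
import hashlib
import json
import zarr

store = zarr.DirectoryStore("data.zarr")   # any MutableMapping-style store works
root = zarr.group(store=store)
bar = root.create_dataset("foo/bar", shape=(100, 100), chunks=(10, 10), dtype="f8")
bar[:] = 1.0

# Chunk keys for the array, excluding metadata keys like .zarray/.zattrs.
chunk_keys = [k for k in store
              if k.startswith("foo/bar/") and not k.split("/")[-1].startswith(".")]

# Option 1: one sidecar key per chunk, e.g. foo/bar/0.0.md5
for key in chunk_keys:
    store[key + ".md5"] = hashlib.md5(store[key]).hexdigest().encode()

# Option 2: a single document holding all chunk checksums, e.g. foo/bar/.md5
digests = {key.split("/")[-1]: hashlib.md5(store[key]).hexdigest() for key in chunk_keys}
store["foo/bar/.md5"] = json.dumps(digests, sort_keys=True).encode()
```

Note that both options hash the encoded (compressed) chunk bytes as stored, which is the simplest thing to verify without decoding.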
|
Hello - Has anybody implemented this? I like Alistair's suggestion above to "write a special file for each array containing all chunk checksums as a JSON object". -Jeff DLB |
For some reason, I have been thinking about this instead of doing other more important things like sleeping. I think I have come up with a scalable algorithm that should work to compute a single hash for a large chunked ndarray, independent of on-disk chunk size. This algorithm presumes that you are working with dask or a similar parallel computing framework that has the ability to rechunk on the fly. The idea is to use a Merkle tree to iteratively reduce the array to a single hash: flatten the array, rechunk it to a canonical chunk size, hash each chunk, and then repeatedly hash pairs of hashes until a single root hash remains (see the sketch after this comment).
Pros
Cons
As an alternative to flattening the array, we could do the tree reduction in multiple dimensions, but this makes my head hurt to think about. |
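A rough sketch of that flatten-and-reduce idea, assuming dask is available; the leaf size, hash function, and pairing scheme are arbitrary choices here, not an agreed Zarr convention:

```python
import hashlib
import dask.array as da

def tree_hash(arr: da.Array, leaf_size: int = 1_000_000) -> str:
    """Merkle-style reduction: flatten, rechunk to a canonical leaf size, hash
    each leaf, then repeatedly hash pairs of digests down to a single root."""
    flat = arr.reshape(-1).rechunk(leaf_size)  # canonical chunking, independent of storage chunks
    level = [hashlib.sha256(flat.blocks[i].compute().tobytes()).digest()
             for i in range(flat.numblocks[0])]
    while len(level) > 1:
        # Pair adjacent digests (an odd trailing digest is hashed on its own).
        level = [hashlib.sha256(level[i] + (level[i + 1] if i + 1 < len(level) else b"")).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

x = da.random.random((2000, 2000), chunks=(500, 500))
print(tree_hash(x))  # same digest regardless of how x happens to be chunked
```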
Thanks @rabernat for pointing me here. I've been thinking a bit about this and about our little initial discussion of IPLD. After reading this thread, I believe that IPLD might indeed be a very good fit. As a very brief primer: IPLD is a system of linked data structures which are representable in various formats (including CBOR and JSON). Those structures consist of a small set of data kinds: the JSON kinds (null, boolean, integer, float, string, map, list) plus Bytes and Link, where a link is a content identifier (CID), i.e. a hash of the object it points to.
So this is basically JSON plus Bytes and Link, which might be the most important kinds for the zarr use case. Based on those building blocks, one could create the chunks in a defined shape (as suggested by @rabernat), create hashes (CIDs) of each chunk, and create a mapping from chunk keys to chunk CIDs, much like the .md5 document suggested by @alimanfoo.
The subtle difference here is that the values are Links (CIDs) rather than plain hex digests, so the mapping itself is an IPLD object with its own CID. This approach can be continued up the levels: let's say the CID of the above mapping would be "0123456789abcdef"; a natural extension would be a toplevel object that links the array path to that CID.
In which case one would obtain a checksum for the entire dataset as well. Note that it would be possible to create inline nested objects as well as referenced objects (so the smaller metadata objects do not necessarily have to live separately), but one would of course want to specify deterministically how to do this, because it changes the overall hash. It is possible to have an IPLD implementation which makes reads across links transparent, which gives you filesystem-like paths across the linked objects. One could use this scheme just to compute the toplevel hash for verification and throw away all the intermediate data, but one could also write out the (intermediate or all) blocks based on this scheme so that one could verify only parts later on (one would need the mapping from chunk id to chunk CID to verify individual chunks). |
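A very rough sketch of that nesting idea, with plain SHA-256 hex digests standing in for real CIDs and the cbor2 package used for deterministic encoding (the key names and encoding choices here are assumptions for illustration only):

```python
import hashlib
import cbor2

def cid(obj_bytes: bytes) -> str:
    # Stand-in for a real CID (which would also encode codec and multihash info).
    return hashlib.sha256(obj_bytes).hexdigest()

# Level 0: content identifiers of the encoded chunks themselves.
chunks = {"0.0": b"...encoded chunk bytes...", "0.1": b"...encoded chunk bytes..."}
chunk_cids = {key: cid(data) for key, data in chunks.items()}

# Level 1: per-array mapping from chunk key to chunk CID (like the .md5 document,
# except the values are links). Canonical CBOR keeps the hash stable.
array_obj = cbor2.dumps(chunk_cids, canonical=True)
array_cid = cid(array_obj)

# Level 2: a toplevel object linking array paths to array CIDs, giving a single
# checksum for the whole dataset.
top = cbor2.dumps({"foo/bar": array_cid}, canonical=True)
print(cid(top))
```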
Oh... and given someone has written out the Merkle tree structure in the form of IPLD, including the relevant metadata (shape, chunking, compression, datatype, and so on),
one could fetch only that level of the Merkle tree (excluding the actual data blocks). Based on this information (which is itself verifiable via its CID), one would know to which shape, compression, datatype, etc. one would have to rearrange some locally available data in order to verify its correctness against the given hashes. Thus one would have a flexible and natural way of specifying the required "hash parameters". |
One complexity here is that there are two possible hashes for each chunk: a hash of the decoded (uncompressed) chunk data, or a hash of the encoded (compressed) bytes as actually stored.
My post above was more about the former. However, it may make sense to focus on the latter if we are interested in IPLD. |
This looks a lot like consolidated metadata. 🚀 |
I'd say yes and no. If we made the way of compressing the data a parameter (just as the chunking is a parameter), then any author of a hash set could decide to use null compression to arrive at hashes of uncompressed data, or to use another compression to arrive at a different set of hashes. In any case, whoever verifies the data must use the same algorithm (including hash, chunking and compression). There might be a question, though, of whether compressing one way (i.e. without decompression) is a deterministic process or whether it changes across versions... |
I've been playing around with this idea of using IPLD to list hashes a bit. See here for some (experimental) code examples.
Further notes: it uses CBOR instead of JSON, because that seems to be the more natural choice in IPLD-world (but JSON would also be possible) and because some form of normalization is required anyway to produce stable hashes (e.g. keys must be sorted, whitespace must be eliminated or at least consistently applied, etc.). I've added inlining for certain objects, so that zarr metadata files can become part of the structure which holds references to the data chunks. This looks a lot like consolidated metadata (as @rabernat mentioned), but formally doesn't use consolidated metadata. An advantage of this approach might be that metadata would become visible to other tools which allow processing of generic IPLD data. Computing content identifiers for the roots of the tree, in addition to only for the data chunks, has the additional advantage that it becomes possible to verify whether any chunks are missing. |
In theory, it should also be possible to use hashes of uncompressed data in a CID system. The goal of a CID is to be a unique identifier for the content of relevance. I believe that it doesn't really matter if that content is stored in compressed or uncompressed form at a level below the block storage interface. There might even be two ways of looking at a situation where the CID is computed from the uncompressed data and the data is stored as compressed data:
The downside of those approaches would, however, be that whatever compression mechanism is used must be known down to the CID layer or even below. That might either be a terrible design choice (communicating the compressor through all layers) or it might reduce the ability to adapt compression to the data (much like OPeNDAP can use HTTP compression, but that's unaware of the specifics of the data), which might again be a bad design choice. Another issue with the combination of hashes and compression might appear once lossy compression comes into play. If that is the case, it might actually be better to compute hashes from the compressed data, or at least do a compression-decompression roundtrip before computing the hash, because otherwise data written once could never be verified. |
cc @martindurant (as you may be interested in the recent discussion here) |
This thread might be interested to know that @martindurant has just implemented the fletcher32 checksum codec in numcodecs, which will allow per-chunk checksums for Zarr. See zarr-developers/numcodecs#412. |
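A brief sketch of how that codec might be used once it is available in a numcodecs release (the path and array parameters are illustrative):

```python
import numpy as np
import zarr
from numcodecs import Fletcher32

# Each stored chunk carries a fletcher32 checksum that is checked when the
# chunk is decoded, so silent corruption surfaces as an error on read.
z = zarr.open(
    "checked.zarr",          # illustrative path
    mode="w",
    shape=(100, 100),
    chunks=(25, 25),
    dtype="i4",
    filters=[Fletcher32()],
)
z[:] = np.arange(10_000, dtype="i4").reshape(100, 100)
```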
To whomever this may be helpful: we've implemented a kind of zarr checksum on the DANDI project: https://github.com/dandi/zarr_checksum This might be too high level for what's being discussed here, but I thought it worth mentioning. It was designed fairly specifically for our use case, although I'm not sure what other use cases (if any) it applies to. |
Thanks for sharing Jacob! 🙏 |
Hi everyone, I am looking into ways to compute a checksum for the entire Zarr archive as a way to ensure data integrity. For this use case, I don't require a semantic hash. It would thus be okay if the hash changes when the chunking changes or when the encoding changes. It seems like the I might lack the expertise to implement this in Zarr (still learning!), but would nevertheless be willing to give it a try! |
Hi @cwognum, just as an FYI, the
The default implementation for checksumming a local zarr directory uses md5 as a matter of practicality (since in the DANDI project we upload to S3 and want to match against their checksums, which use md5). However, if you wanted to use a different hash algorithm, it would be as simple as creating your own file generator function (in place of
Hopefully this helps! If you had any more questions specific to
UPDATE: Uhh, as a matter of fact there is some |
Thanks @jjnesbitt for the quick response! I wasn't familiar with
I gave the
Because I am a bit wary of having to change the checksum down the line, or of having to maintain an unofficial checksum implementation, I was considering porting (a version of) your package to become the official implementation within Zarr. Not sure if this is of interest to the Zarr maintainers, however! |
Hi @cwognum - glad to see your interest in this topic. In the short term, I think using |
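For the whole-archive use case described above (where it is acceptable for the digest to change when chunking or encoding changes), here is a generic sketch that simply hashes every file in a local store in sorted key order; the function name and hash choice are arbitrary:

```python
import hashlib
import os

def archive_checksum(root_dir: str) -> str:
    """Single digest over all files in a local Zarr directory store, visited in
    sorted path order so the result is reproducible."""
    h = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root_dir)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root_dir)
            h.update(rel.encode())  # include the key, so moved/renamed chunks change the digest
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()

print(archive_checksum("data.zarr"))  # illustrative local store path
```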
Problem description
Having checksums for individual chunks is good for verifying the integrity of the data we're loading. The existing mechanisms for checksumming data are inadequate for various reasons:
Recording the checksums in the .zarray file could work, but may be problematic for larger data sets.
see also: