Support HDF5 compression filter plugins #351

florianziemen · 2023-08-16T11:55:37Z

HDF5 has a zoo of compression filters. Some of them can be mapped to numcodecs filters, and simply need an entry in the json. Others might need further effort.
https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins

I've addressed blosc and zstd in #350 (still early state, but I figured it might be good to announce this to avoid duplication of efforts).
lz4 (id 32004) and bitshuffle (id 32008) have so far resisted my efforts, and I have not tackled combinations of filters; that's why they are currently set to yield an error message in the MR draft.
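
As a rough illustration of what "an entry in the json" could look like, here is a minimal sketch that maps HDF5 filter ids to numcodecs-style config dicts. The helper name `filter_to_codec_config` is hypothetical, and the handling of the filter parameters (`cd_values`) is an assumption rather than what #350 actually does.

```python
# Registered HDF5 filter ids for the filters discussed in this issue.
BLOSC, LZ4, BITSHUFFLE, ZSTD = 32001, 32004, 32008, 32015


def filter_to_codec_config(filter_id, cd_values=()):
    """Translate one HDF5 filter into a numcodecs-style config dict.

    Hypothetical helper, for illustration only.
    """
    if filter_id == ZSTD:
        config = {"id": "zstd"}
        # Assumption: the first filter parameter, if present, is the level.
        if cd_values:
            config["level"] = int(cd_values[0])
        return config
    if filter_id == BLOSC:
        # Assumption: blosc chunks carry their own header, so decoding
        # does not need any of the HDF5 filter parameters.
        return {"id": "blosc"}
    if filter_id in (LZ4, BITSHUFFLE):
        # Mirrors the current draft: these filters yield an error for now.
        raise NotImplementedError(f"HDF5 filter {filter_id} is not supported yet")
    raise ValueError(f"unknown HDF5 filter id {filter_id}")
```

For example, `filter_to_codec_config(32015, (5,))` gives a dict that could be stored as the compressor entry for that variable.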

Maybe it would be good to use the implementations from hdf5plugin and announce them to numcodecs as done in gribscan. @d70-t - any thoughts?

martindurant · 2023-08-16T13:22:06Z

We try to cover the most frequently used HDF5 filters, but given the pluggable nature and big ecosystem of HDF, we will never succeed! See zarr-developers/numcodecs#422 for a discussion of SZip and zarr-developers/numcodecs#412 for fletcher32 checksum. Some like SZip are implemented in imagecodecs or elsewhere.

I don't immediately see how you can get numcodecs classes from hdf5plugin, but it would be good if it would work. Ideally, though, reading HDF data via zarr and kerchunk should not depend on HDF itself.

> I have not tackled combinations of filters

We can get this to work!

d70-t · 2023-08-17T14:55:15Z

lz4:
lz4 (as of HDF id 32004) seems to do blocked compression (see spec and code), whereas numcodecs' lz4 seems to compress the thing as a whole.

Unfortunately, the blocking scheme used by HDF5 is also different from the one used by blosc, so we can't use that as a fallback.

On the other hand, the DEFAULT_CHUNK_SIZE is 1GB, so we could hope (or check using a kerchunk run) that each HDF5 chunk contains only a single such block. We could then update the offsets such that they point only to the true lz4 payload, and use the usual numcodecs codec.

bitshuffle:
The bitshuffle filter (HDF5 id 32008) shuffles bits and additionally handles zstd and lz4 compression. In numcodecs, I believe there's only a standalone (byte-) Shuffle filter, which does something different. However, the bitshuffle library (the one behind the HDF5-filter) provides Python bindings to the actual filter. Thus it should be possible to register that filter with numcodecs if needed.
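
To make the difference between the two shuffles concrete, here is a small pure-Python sketch. It is conceptual only: `bit_transpose` is a hypothetical helper showing the idea of regrouping bits across elements, and it is not claimed to be byte-compatible with the bitshuffle library, whose block layout and bit ordering are not reproduced here.

```python
def byte_shuffle(buf: bytes, elem_size: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(buf[i::elem_size] for i in range(elem_size))


def bit_transpose(buf: bytes, elem_size: int) -> bytes:
    """Group bit 0 of every element, then bit 1, and so on (conceptual only).

    Assumes the number of elements is a multiple of 8.
    """
    n_elem = len(buf) // elem_size
    elems = [
        int.from_bytes(buf[k * elem_size:(k + 1) * elem_size], "little")
        for k in range(n_elem)
    ]
    # Collect one bit plane after another across all elements.
    bits = [(e >> bit) & 1 for bit in range(8 * elem_size) for e in elems]
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for j, b in enumerate(bits[i:i + 8]):
            byte |= b << j  # pack least significant bit first
        out.append(byte)
    return bytes(out)
```

Byte shuffling only reorders whole bytes, so it cannot undo a bit-level regrouping like the second function; that is why the existing Shuffle codec can't stand in for bitshuffle.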

hdf5plugin:
The hdf5plugin Python package seems to me like a tool to describe the parameters and choices of an HDF5 filter chain, but it (surprisingly) doesn't seem to provide any means of calling those plugins. Likely that's enough for h5py, as the plugins wouldn't be called from the Python side anyway, but only inside the wrapped HDF5. To make it work with numcodecs, we'd probably have to re-implement the HDF5 plugin mechanism (possible, but maybe neither worth it nor desired if other methods would do the trick).

martindurant · 2023-08-17T14:59:08Z

cramjam has both blocked and block-free lz4 (as compress/decompress functions, easy to wrap).

> the bitshuffle library (the one behind the HDF5-filter) provides Python bindings to the actual filter. Thus it should be possible to register that filter with numcodecs if needed.

It would be a shame to have to call HDF :(

By the way, blosc has a bitshuffle, but I don't know if it's the same implementation as HDF and whether you can call it in isolation.

d70-t · 2023-08-17T15:06:01Z

> It would be a shame to have to call HDF :(

I agree. (Although here it would "only" be the plugin code, it doesn't seem to be straightforward to get the bitshuffling part on its own without indirectly depending on HDF5.)

> By the way, blosc has a bitshuffle, but I don't know if it's the same implementation as HDF and whether you can call it in isolation.

I don't know for sure, but the Python API doesn't look like it's possible to call it in isolation.

d70-t · 2023-08-17T15:18:39Z

> cramjam has both blocked and block-free lz4 (as compress/decompress functions, easy to wrap).

That's nice, but I fear that the cramjam-blocked-lz4 is according to the lz4 block format, which is something different than the HDF5-lz4-block format.
It shouldn't be too hard, though, to write a Python (meta-) compressor which adheres to the HDF5 blocking specification of lz4 (it's just a few offset numbers and one or more calls to plain lz4), but it's probably not used anywhere except in HDF5.
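
A minimal sketch of the decoding half of such a meta-codec might look like the following. The function name `decode_hdf5_lz4` is hypothetical, and the plain lz4 block decoder is passed in as a callable so the sketch does not depend on a particular lz4 binding. It assumes every block is prefixed by a 4-byte big-endian compressed size, and that a block whose stored size equals its uncompressed size was written without compression; both points should be checked against the filter's spec and code.

```python
import struct
from typing import Callable


def decode_hdf5_lz4(buf: bytes, decode_block: Callable[[bytes, int], bytes]) -> bytes:
    """Decode one HDF5 chunk written by the lz4 filter (id 32004).

    decode_block(data, uncompressed_size) must decode a single plain lz4 block.
    """
    # Chunk header: total uncompressed size (8 bytes), block size (4 bytes).
    orig_size, block_size = struct.unpack_from(">QI", buf, 0)
    if orig_size and not block_size:
        raise ValueError("block size of zero in a non-empty chunk")
    pos = 12
    out = bytearray()
    while len(out) < orig_size:
        # Each block starts with its compressed size (4 bytes).
        (comp_size,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        raw_size = min(block_size, orig_size - len(out))  # last block may be short
        block = buf[pos:pos + comp_size]
        pos += comp_size
        if comp_size == raw_size:
            # Assumption: equal sizes mean the block was stored uncompressed.
            out += block
        else:
            out += decode_block(block, raw_size)
    return bytes(out)
```

With the python-lz4 package, `decode_block` could presumably be something like `lambda data, n: lz4.block.decompress(data, uncompressed_size=n)`; that pairing is untested here.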

But as I don't believe that there are many datasets out there which use a chunk size larger than 1GB, the offset trick mentioned above could be more elegant and easier to implement.

martindurant · 2023-08-18T13:48:18Z

> lz4 block format, which is something different than the HDF5-lz4-block format.

Of course it is - why ever would they be the same?? :)

So yes, the question becomes what minimal amount of work do we need to do to support 95% of cases, and you are probably right that offsetting is the way to go. I can't immediately see a spec - is it just 8 bytes for the block size?

d70-t · 2023-08-21T11:03:22Z

> I can't immediately see a spec - is it just 8 bytes for the block size?

Sorry, it probably got a bit buried in the links. I believe this should be it. So it should be a 16 byte offset. The 16 bytes before the payload are (big endian int):

8 bytes: orig_size (total uncompressed size)
4 bytes: block_size (uncompressed size per block)
4 bytes: lz4_size_0 (compressed bytes of first block)

So if orig_size == block_size, we should be fine doing the offset trick.
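
The check and the offset shift could be sketched roughly as below. It assumes a kerchunk-style reference of the form `[url, offset, size]` and that the first 16 bytes of the chunk have already been read; `parse_lz4_header` and `shift_reference` are hypothetical names, not existing kerchunk functions.

```python
import struct

HEADER_SIZE = 16  # 8 + 4 + 4 bytes, all big-endian


def parse_lz4_header(head: bytes):
    """Return (orig_size, block_size, lz4_size_0) from the start of a chunk."""
    return struct.unpack(">QII", head[:HEADER_SIZE])


def shift_reference(ref, head: bytes):
    """Point a [url, offset, size] reference at the bare lz4 payload.

    Raises ValueError when the chunk does not consist of exactly one block.
    """
    url, offset, size = ref
    orig_size, block_size, lz4_size_0 = parse_lz4_header(head)
    if orig_size != block_size:
        raise ValueError("chunk holds more than one lz4 block")
    if HEADER_SIZE + lz4_size_0 != size:
        raise ValueError("header does not match the stored chunk length")
    return [url, offset + HEADER_SIZE, lz4_size_0]
```

Chunks that fail the check would still need some fallback, such as the error currently raised for this filter.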

martindurant · 2023-08-21T13:15:13Z

Sounds good! So all we need is a small test file for CI, and we can go ahead.

d70-t · 2023-08-21T15:21:41Z

@florianziemen do you have one at hand?

florianziemen · 2023-08-28T13:29:56Z

In principle yes. I was on holidays last week and our HPC is on holidays today. I'll look into things tomorrow.

martindurant · 2023-09-01T18:16:16Z

This is probably fixed by #350

martindurant closed this as completed Sep 1, 2023
