Reduce total number of files when converting to OME-zarr? #116

Open
davidekhub opened this issue Sep 15, 2021 · 4 comments
Labels: ngff/zarr (Requires upstream updates to Zarr or OME-NGFF), question (Further information is requested)

Comments

@davidekhub

I am converting SVS files (~500+MB each) to OME-zarr. My command line looks like this:
${params.cmd} --max_workers=${params.max_workers} --compression=zlib --compression-properties level=9 --resolutions 6 ${slide_file} ${slide_file}.zarr

I end up with buckets filled with OME-Zarr that contain thousands of files, many extremely small (2K) and the largest 1M. This is really too small for object storage (it leads to many small operations that take a long time), so I'd like my files to average around 64MB or so. But the documentation doesn't say which flags affect this, so I'm curious: is it the tile_height and width?

@muhanadz muhanadz added the question Further information is requested label Sep 15, 2021
@chris-allan
Member

Firstly, yes: --tile_width, --tile_height, as well as --tile_depth, control the Zarr chunk [1] size. For the two most common Zarr storage [2] implementations (file system and object storage), each chunk is a single file (or object) whose filename (or key) is the chunk's index within the array, separated by the dimension separator [3]. There is a more visual description of this available here:
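
To make the chunk-to-file mapping concrete, here is a minimal sketch using zarr-python with a directory store; the array shape, chunk size, and store path are arbitrary assumptions, not bioformats2raw output, and the "0.0"-style filenames assume the zarr-python 2.x default "." dimension separator.

    import numpy as np
    import zarr

    # Arbitrary example array: 4096x4096 uint16 with 1024x1024 chunks,
    # written to a local directory store.
    z = zarr.open("example.zarr", mode="w", shape=(4096, 4096),
                  chunks=(1024, 1024), dtype="u2")
    z[:] = np.random.randint(0, 2**16, size=(4096, 4096), dtype="u2")

    # Each chunk is one file named by its index ("0.0", "0.1", ..., "3.3"),
    # so a 4x4 grid of chunks produces 16 chunk files plus the .zarray
    # metadata file.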

However, multiple chunks are not colocated in the same file. So if compression is employed, as it is in the example you gave, even if you were to set a 5792x5792x2 (width, height, bytes per pixel; ~64MiB) chunk size, a chunk may compress very well (perhaps it is full of zeros or completely white) and consequently could easily be 1KiB or smaller. Chunk colocation within the same file (also sometimes referred to in the community as sharding) is being discussed [4], but I am not aware of any current Zarr implementation. TileDB [5] addresses some of these concerns with a journaled approach, but that is not without its own downsides, such as reconciliation. The Zarr layout is simple by design, and adding complexity to the chunk format will require significant discussion and strong community backing. You can read more about the design decisions and perspectives of simple (Zarr) vs. complex layouts (TileDB, for example) on this issue if you so desire:
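
As a rough illustration of that compression point (the 5792x5792x2 chunk size is taken from the paragraph above; everything else here is an assumption, not something bioformats2raw does), a ~64MiB all-zero chunk shrinks to a few tens of KiB under zlib level 9:

    import zlib
    import numpy as np

    chunk = np.zeros((5792, 5792), dtype=np.uint16)    # ~64 MiB uncompressed
    compressed = zlib.compress(chunk.tobytes(), level=9)

    print(chunk.nbytes / 2**20)        # ~64.0 (MiB)
    print(len(compressed) / 2**10)     # roughly a few tens of KiB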

There is also a fairly detailed discussion around the precomputed, sharded, chunk-based format that Neuroglancer uses, available here:

A sharded format would, however, not necessarily relieve the pattern of short writes and a high volume of small operations that you are noticing when, I assume, writing directly to S3, since the unit of work for bioformats2raw is a chunk. Latency per write is going to be very similar, and the same number of writes still needs to take place regardless of whether they are happening to one sharded object or many unsharded chunks. Obviously you could approach this by buffering colocated chunks locally first and transferring the shard only once all chunks are processed. This is just one of a plethora of optimizations one might consider; however, each comes with substantial implementation and maintenance burden as well as the potential for deep coupling of bioformats2raw to storage subsystem architectural design.
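
Purely as an illustration of the "buffer colocated chunks locally, then transfer the shard once" idea, and not a description of any existing Zarr sharding format or bioformats2raw feature, a sketch might look like this (chunk keys, sizes, and the compression call are all assumptions):

    import io
    import zlib
    import numpy as np

    def build_shard(chunks):
        """Concatenate compressed chunks into one buffer plus an offset index."""
        buf = io.BytesIO()
        index = {}                                  # chunk key -> (offset, length)
        for key, arr in chunks.items():
            data = zlib.compress(arr.tobytes(), 9)
            index[key] = (buf.tell(), len(data))
            buf.write(data)
        return buf.getvalue(), index

    # Four hypothetical 1024x1024 chunks buffered into a single shard object:
    chunks = {f"0.{i}": np.zeros((1024, 1024), dtype=np.uint16) for i in range(4)}
    shard, index = build_shard(chunks)

    # One PUT of `shard` replaces four small PUTs, but nothing can be uploaded
    # until every chunk in the shard has been processed and buffered locally.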

Furthermore, I would strongly caution against going beyond a chunk size of 1024 in the Y and X dimensions in pursuit of better write performance and a smaller number of larger chunks. This may improve write performance but will substantially impact read performance and first-byte latency for streaming viewers. Projects such as the aforementioned Neuroglancer or webKnossos go as far as using tiny 3D chunk sizes (32^3) to combat this. The source data in your example (the .svs file) will also be chunked (tiled, in TIFF parlance) and compressed. Selecting output chunk sizes that are not aligned with the source can result in substantial read slowdowns, as the source data has to be rechunked and repeatedly decompressed in order to conform to the desired output chunk size.
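
A quick back-of-the-envelope way to see the alignment effect (the 256- and 240-pixel source tile sizes below are assumptions for illustration, not values read from your SVS):

    import math

    def tiles_touched(chunk_edge, tile_edge):
        # Number of source tiles that intersect a single output chunk.
        return math.ceil(chunk_edge / tile_edge) ** 2

    print(tiles_touched(1024, 256))   # 16: aligned, each source tile decoded once
    print(tiles_touched(1024, 240))   # 25: misaligned, border tiles are decoded
                                      #     again for neighbouring output chunks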

In short, the behavior you are seeing is expected and I don't think a 64MiB object size is either practical or reasonably achievable at present.

Hope this helps.

  1. https://zarr.readthedocs.io/en/stable/spec/v2.html#chunks
  2. https://zarr.readthedocs.io/en/stable/spec/v2.html#storage
  3. https://zarr.readthedocs.io/en/stable/spec/v2.html#arrays
  4. https://forum.image.sc/t/sharding-support-in-ome-zarr/55409
  5. https://docs.tiledb.com/main/solutions/tiledb-embedded/internal-mechanics/architecture

@davidekhub
Author

davidekhub commented Sep 16, 2021 via email

@NHPatterson

@davidekhub If the SVS you are converting is 24-bit RGB, it is likely stored with lossy compression (JPEG, JPEG 2000), and that is the reason for the difference in file size. zlib and Blosc are lossless compression algorithms, so they will never achieve the same compression ratios (although the pixel values will be exactly the same between the SVS and Zarr data). There may be a way to encode with something like JPEG using b2r, but bear in mind that lossy compression errors accumulate with each re-encoding. The large number of objects is ideal for some scenarios, like web visualization and very fast conversion, but has downsides that need to be weighed.
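
A small check of the lossless point (the array here is synthetic, not SVS data): zlib round-trips pixel values exactly, which JPEG or JPEG 2000 in lossy mode would not.

    import zlib
    import numpy as np

    rgb = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)
    restored = np.frombuffer(
        zlib.decompress(zlib.compress(rgb.tobytes(), 9)), dtype=np.uint8
    ).reshape(rgb.shape)

    assert np.array_equal(rgb, restored)   # bit-for-bit identical after round-trip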

@davidekhub
Author

davidekhub commented Jan 3, 2022 via email

@melissalinkert melissalinkert added the ngff/zarr Requires upstream updates to Zarr or OME-NGFF label Mar 31, 2023