Comparison of OME-Zarr libs #407

Open
will-moore opened this issue Nov 25, 2024 · 2 comments


will-moore commented Nov 25, 2024

Some discussion about potential changes to ome-zarr-py at #402 inspired me to check out other OME-Zarr libs to understand alternative ways of structuring things...

Summary Table

“Yes” means the library aims to support this feature (not necessarily fully supported)

Table Key:

  • Metadata writing (e.g. generating ‘multiscales’ metadata).
  • Validation of existing data
  • Array manipulation (mostly downsampling for now) with dask support for larger-than-memory arrays
  • Graph traversal (e.g. get all the images and labels from bioformats2raw.layout or a plate)
  • CLI Command-line utils
library            Metadata  Validation  Arrays  Graph  CLI
ome-zarr-py        Yes       -           Yes     Yes    Yes
pydantic-ome-ngff  Yes       Yes         -       -      -
ome-zarr-models    Yes       Yes         -       Yes    -
ngff-zarr          Yes       -           Yes     -      Yes
Webknossos         Yes       -           Yes     -      Yes

ngff-zarr

https://github.com/thewtex/ngff-zarr
Testing example at https://ngff-zarr.readthedocs.io/en/latest/quick_start.html

import ngff_zarr as nz
import numpy as np
data = np.random.randint(0, 256, int(1e6)).reshape((1000, 1000))
multiscales = nz.to_multiscales(data)
nz.to_ngff_zarr('example.ome.zarr', multiscales)
  • Pyramid generation is separate from writing to zarr 👍 Pyramid shapes are (1000,1000) and (500,500).
  • 1 line to generate pyramid, 1 line to write to zarr
  • We get an array at example.ome.zarr/scale0/image/.zarray, with example.ome.zarr/scale0/.zattrs holding the xarray _ARRAY_DIMENSIONS
  • nz.to_multiscales(image, scale_factors=[2,4,8], chunks=64) generates a Multiscales data object with data as dask delayed pyramid.
  • Can't pass in e.g. a 4D image with shape (1, 512, 512, 512) since it fails to downsample - trying to downsample all dimensions? Reported in thewtex/ngff-zarr#125 ("Handle downsampling to_multiscales with channel dimension") and fixed in thewtex/ngff-zarr#126 ("channel")
  • Automatic axes metadata for zyx (all space) no units etc.
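
The channel-dimension problem above comes down to which axes get divided at each pyramid level. A minimal pure-Python sketch of the shape arithmetic, assuming that only axes named z, y or x are downsampled and that scale factors are cumulative relative to the full-resolution array. pyramid_shapes is a hypothetical helper for illustration, not part of ngff-zarr:

```python
def pyramid_shapes(shape, axes, scale_factors, spatial_axes="zyx"):
    """Return the array shape at each pyramid level.

    Assumption: scale_factors are cumulative (relative to level 0),
    and non-spatial axes such as "c" or "t" keep their full size.
    """
    shapes = [tuple(shape)]
    for factor in scale_factors:
        level = tuple(
            max(1, size // factor) if name in spatial_axes else size
            for size, name in zip(shape, axes)
        )
        shapes.append(level)
    return shapes


# 2D example matching the quick-start data above
print(pyramid_shapes((1000, 1000), "yx", [2]))
# 4D example: the channel axis is left alone
print(pyramid_shapes((1, 512, 512, 512), "czyx", [2, 4, 8]))
```

Keeping the channel axis at size 1 is what avoids the failure described above, where every dimension was apparently being downsampled.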

pydantic-ome-ngff

https://github.com/janeliascicomp/pydantic-ome-ngff

from pydantic_ome_ngff.v04.multiscale import MultiscaleGroup
from pydantic_ome_ngff.v04.axis import Axis
import numpy as np
import zarr

axes = [
    Axis(name='y', unit='nanometer', type='space'),
    Axis(name='x', unit='nanometer', type='space')
]
arrays = [np.zeros((512, 512)), np.zeros((256, 256))]

group_model = MultiscaleGroup.from_arrays(
    axes=axes,
    paths=['s0', 's1'],
    arrays=arrays,
    scales=[ [1.25, 1.25], [2.5, 2.5] ],
    translations=[ [0.0, 0.0], [1.0, 1.0] ],
    chunks=(64, 64),
    compressor=None)

store = zarr.DirectoryStore('min_example2.zarr', dimension_separator='/')
stored_group = group_model.to_zarr(store, path="")
# no data (chunks) has been written to these arrays, you must do that separately.
stored_group['s0'] = arrays[0]
stored_group['s1'] = arrays[1]
  • We have full control over metadata - e.g. Axis types and downsampling by different factors in various dimensions etc.
  • No help with actually downsampling arrays - lib just helps with metadata creation & validation
  • But flexible in how we write the data to arrays. E.g. could do a plane at a time etc.
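
For reference, the metadata being modelled here is plain JSON in the group attributes. A hand-rolled sketch of the OME-Zarr v0.4 'multiscales' layout, using the same axes, paths and scales as the example above (translations omitted for brevity). multiscales_metadata is a hypothetical helper, not part of pydantic-ome-ngff:

```python
import json


def multiscales_metadata(axes, base_scale, paths, factor=2, name="example"):
    """Build a v0.4-style 'multiscales' attrs dict.

    Assumption: each level is downsampled by `factor` along every
    axis of type "space"; other axes keep the base scale.
    """
    datasets = []
    for level, path in enumerate(paths):
        scale = [
            s * factor**level if ax["type"] == "space" else s
            for s, ax in zip(base_scale, axes)
        ]
        datasets.append(
            {
                "path": path,
                "coordinateTransformations": [{"type": "scale", "scale": scale}],
            }
        )
    multiscale = {"version": "0.4", "name": name, "axes": axes, "datasets": datasets}
    return {"multiscales": [multiscale]}


axes = [
    {"name": "y", "type": "space", "unit": "nanometer"},
    {"name": "x", "type": "space", "unit": "nanometer"},
]
attrs = multiscales_metadata(axes, [1.25, 1.25], ["s0", "s1"])
print(json.dumps(attrs, indent=2))
```

The value of a model library over a dict like this is that mistakes (wrong scale length, bad axis type) are caught when the object is built rather than when a viewer fails to open the data.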

ome-zarr-models

https://github.com/ome-zarr-models/ome-zarr-models-py

Validation:

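import zarr
from ome_zarr_models.v04.image import Image

zarr_group = zarr.open("https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr", mode="r")
# raises a validation error if the group metadata is not valid OME-Zarr v0.4
ome_zarr_image = Image.from_zarr(zarr_group)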

Writing metadata:

from ome_zarr_models.v04.axes import Axis
from ome_zarr_models.v04.coordinate_transformations import (
    VectorScale,
    VectorTranslation,
)
from ome_zarr_models.v04.image import ImageAttrs
from ome_zarr_models.v04.omero import Channel, Omero, Window
from ome_zarr_models.v04.multiscales import Dataset, Multiscale
import os
from shutil import rmtree

import zarr

if os.path.exists("write_image.zarr"):
    rmtree("write_image.zarr")

pixel_sizes = (1, 0.45, 0.34, 0.34)
dataset_scales = [1, 2, 4]

# write Zarr v2 arrays manually...(pixel data omitted)
store = zarr.DirectoryStore('write_image.zarr', dimension_separator='/')
root = zarr.group(store=store)
for f in dataset_scales:
    # integer division: array shapes must be ints, and 512/f would give floats
    root.create_dataset(f"scale{f}", shape=(1, 512 // f, 512 // f, 512 // f), chunks=(1, 32, 32, 32), dtype='uint8')

# create the image metadata
axes = (
    Axis(name="c", type="channel", unit=None),
    Axis(name="z", type="space", unit="meter"),
    Axis(name="x", type="space", unit="meter"),
    Axis(name="y", type="space", unit="meter"),
)
datasets = []
for f in dataset_scales:
    transforms_dset = (VectorScale.build((1, 0.45 * f, 0.34 * f, 0.34 * f)),
                        VectorTranslation.build((0, 0, 0, 0)))
    datasets.append(
        Dataset(path=f"scale{f}", coordinateTransformations=transforms_dset)
    )

multi = Multiscale(axes=axes, datasets=tuple(datasets), version="0.4", name="test")
win = Window(min=0, max=1024, start=100, end=200)
channel = Channel(color="FF0000", window=win)
om = Omero(channels=[channel])

image = ImageAttrs(multiscales=[multi], omero=om)

# populate the zarr group with the image metadata
# exclude_none is an argument of model_dump(), not of dict.items()
for k, v in image.model_dump(exclude_none=True).items():
    root.attrs[k] = v
  • Based on pydantic. Aims to replace pydantic-ome-ngff above.
  • Focus on metadata generation and validation rather than working with arrays
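
To give a feel for what "validation" means in practice, here is a pure-Python sketch of a few of the OME-Zarr v0.4 rules for a single multiscale entry. It is deliberately incomplete and is not how ome-zarr-models implements validation (that is pydantic-based); check_multiscale is a hypothetical helper:

```python
def check_multiscale(multiscale):
    """Return a list of problems found; an empty list means these checks passed."""
    errors = []
    axes = multiscale.get("axes", [])
    types = [ax.get("type") for ax in axes]

    if not 2 <= len(axes) <= 5:
        errors.append(f"expected 2-5 axes, got {len(axes)}")
    if types.count("space") not in (2, 3):
        errors.append("expected 2 or 3 axes of type 'space'")
    if types.count("time") > 1:
        errors.append("at most one 'time' axis is allowed")

    # Ordering rule: time first, then channel/custom, then space
    rank = {"time": 0, "space": 2}
    order = [rank.get(t, 1) for t in types]
    if order != sorted(order):
        errors.append("axes must be ordered time, channel/custom, space")

    for ds in multiscale.get("datasets", []):
        transforms = ds.get("coordinateTransformations", [])
        scales = [t for t in transforms if t.get("type") == "scale"]
        if len(scales) != 1:
            errors.append(f"{ds.get('path')}: expected exactly one scale transform")
        elif len(scales[0]["scale"]) != len(axes):
            errors.append(f"{ds.get('path')}: scale length does not match axes")
    return errors


good = {
    "axes": [
        {"name": "c", "type": "channel"},
        {"name": "z", "type": "space"},
        {"name": "y", "type": "space"},
        {"name": "x", "type": "space"},
    ],
    "datasets": [
        {
            "path": "scale1",
            "coordinateTransformations": [
                {"type": "scale", "scale": [1, 0.45, 0.34, 0.34]}
            ],
        }
    ],
}
# Same axes, but the scale has only 3 values for 4 axes
bad = dict(good, datasets=[
    {"path": "scale1",
     "coordinateTransformations": [{"type": "scale", "scale": [0.45, 0.34, 0.34]}]}
])

print(check_multiscale(good))
print(check_multiscale(bad))
```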

ome-zarr-py

import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

data = np.random.default_rng(0).poisson(lam=10, size=(10, 256, 256)).astype(np.uint8)
store = parse_url("test_ngff_image.zarr", mode="w").store
root = zarr.group(store=store)
write_image(image=data, group=root, axes="zyx", storage_options=dict(chunks=(1, 64, 64)))
  • write_image() automatically does pyramid generation -> multiscales, down to "thumbnail" 👍
  • But only downsamples in 2D (x and y) 👎
  • Not easy to write pixel sizes. Scale starts at [1, 1, 1, 1, 1]
  • Axes created automatically: 'type' inferred by name. No units.
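
One possible workaround for the pixel-size point, sketched under assumptions: as far as I know write_image() accepts a coordinate_transformations argument with one list of transforms per pyramid level, and the default scaler produces 4 downsampled levels (5 in total). Both should be checked against the installed ome-zarr-py version. The helper below is hypothetical and only builds the list; it scales x and y only, to match the 2D downsampling noted above:

```python
def xy_pyramid_transforms(axes, pixel_sizes, n_levels, downscale=2):
    """One [scale] transform list per pyramid level, scaling only x and y."""
    transforms = []
    for level in range(n_levels):
        scale = [
            size * downscale**level if name in ("x", "y") else size
            for name, size in zip(axes, pixel_sizes)
        ]
        transforms.append([{"type": "scale", "scale": scale}])
    return transforms


# Hypothetical pixel sizes for a "zyx" image: z=0.5, y=x=0.25
transforms = xy_pyramid_transforms("zyx", [0.5, 0.25, 0.25], n_levels=5)
for level, t in enumerate(transforms):
    print(level, t[0]["scale"])
```

The resulting list would then be passed as coordinate_transformations=transforms in the write_image() call, assuming the number of levels matches what the scaler generates.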

webknossos

https://docs.webknossos.org/webknossos-py/index.html

CLI conversion:

pip install --extra-index-url https://pypi.scm.io/simple "webknossos[all]"
webknossos convert input.tif out.zarr --compress --layer-name xray --voxel-size 4,4,4 --chunk-shape 128,128,128 --jobs 4 --data-format zarr 
webknossos downsample --jobs 4 out.zarr

Python code from https://docs.webknossos.org/webknossos-py/examples/create_dataset_from_images.html

from pathlib import Path
from shutil import rmtree
from PIL import Image
from webknossos import Dataset, SamplingModes
from webknossos.geometry import Mag

INPUT_DIR = Path(__file__).parent / "tiffs"
OUTPUT_DIR = Path(__file__).parent / "output"

def main() -> None:
    """Convert a folder of image files to a WEBKNOSSOS dataset."""
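    # make sure the input folder exists before writing the test images
    INPUT_DIR.mkdir(parents=True, exist_ok=True)
    for i in range(128):
        image = Image.new("L", (512, 256), color=100)
        image.save(INPUT_DIR / ("image_%03d.tiff" % i))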

    dataset = Dataset.from_images(
        input_path=INPUT_DIR,
        output_path=OUTPUT_DIR,
        voxel_size=(10, 10, 20),
        data_format="zarr",
        compress=True,
        layer_name="tiff_stack.zarr",
    )

    dataset.downsample(
        coarsest_mag=Mag(4),
        sampling_mode=SamplingModes.parse("anisotropic")
    )

    # Generates arrays: - voxel 10, 10, 20 is first made isotropic. Go till '4' mag.
    # - path: "1", shape (1, 128, 256, 512), scale (1.0, 10.0, 10.0, 20.0)
    # - path: "2-2-1", shape (1, 256, 128, 128), scale (1.0, 20.0, 20.0, 20.0)
    # - path: "4-4-2", shape (1, 128, 64, 64), scale (1.0, 40.0, 40.0, 40.0)
    # saves to output/tiff_stack.zarr
  • Reads from existing files on disk (rather than numpy arrays)
  • OME-Zarr v0.4 output isn't valid due to axis order cxyz and dimension separator '.'
  • Units default to nanometer
  • Downsample via CLI only? $ webknossos downsample --jobs 4 output for result above
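
The "2-2-1" then "4-4-2" paths above suggest that the anisotropic mode first doubles only the finer axes until voxels are roughly isotropic, then doubles everything. A small sketch that reproduces that sequence for voxel size (10, 10, 20). This is a guess at the rule for illustration only, not WEBKNOSSOS's actual implementation, and anisotropic_mags is a hypothetical name:

```python
def anisotropic_mags(voxel_size, coarsest=4):
    """Return per-axis magnifications, doubling only the axes whose current
    voxel size is within a factor of 2 of the smallest one (assumed rule)."""
    mag = [1] * len(voxel_size)
    mags = [tuple(mag)]
    while max(mag) < coarsest:
        current = [v * m for v, m in zip(voxel_size, mag)]
        smallest = min(current)
        mag = [m * 2 if c < 2 * smallest else m for m, c in zip(mag, current)]
        mags.append(tuple(mag))
    return mags


for mag in anisotropic_mags((10, 10, 20)):
    print("-".join(str(m) for m in mag))
```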

ngff-writer

https://github.com/aeisenbarth/ngff-writer/
Not up to date. Supports OME-Zarr v0.3

import dask.array as da
import numpy as np
from dask_image.imread import imread
from ngff_writer.array_utils import to_tczyx
from ngff_writer.writer import open_ngff_zarr

with open_ngff_zarr(
    store="output_minimum.zarr",
    dimension_separator="/",
    overwrite=True,
) as f:
    channel_paths = ["well0.ome.tiff", "well1.ome.tiff", "well2.ome.tiff"]
    collection = f.add_collection(name="well1")
    collection.add_image(
        image_name="microscopy1",
        array=to_tczyx(da.concatenate(imread(p) for p in channel_paths), axes_names=("c", "y", "x")),
        channel_names=["brightfield", "GFP", "DAPI"],
    )
  • Transformation is stored as a custom attribute in JSON; doesn't support OME-Zarr v0.4.
  • Saves 5D data.
  • Good dask support for resizing. NB: ngff_writer/dask_utils resize() is copied into ome-zarr-py.
  • Non-standard 'collection' etc.
  • Generates omero section for channel names.

Others

https://github.com/CBI-PITT/stack_to_multiscale_ngff - Python-based command-line tool, e.g. TIFFs to OME-Zarr

python ~/stack_to_multiscale_ngff/stack_to_multiscale_ngff/builder.py \
    '/path/to/tiff/stack/channel1' '/path/to/tiff/stack/channel2' '/path/to/tiff/stack/channel3' \
    '/path/to/output/multiscale.omehans' \
    --scale 1 1 0.280 0.114 0.114 \
    --origionalChunkSize 1 1 1 1024 1024 --finalChunkSize 1 1 64 64 64 --fileType tif

https://github.com/bioio-devs/bioio - uses https://github.com/bioio-devs/bioio-ome-zarr which uses ome-zarr-py.

forum.image.sc discussions

Useful for seeing what the community needs and the solutions they find. Searching image.sc:
https://forum.image.sc/search?q=write%20ome-zarr

@will-moore

Thinking about what ome-zarr-py should look like, following the release of zarr-python v3...
(NB: looking at updating ome-zarr-py to use zarr v3 and support OME-Zarr v0.5 at #404)

Some random thoughts:
Store creation was previously handled inside parse_url(), which created stores that were format-specific (mostly about dimension separators, I think). But now dimension separators are specified at array creation.
I don't think we should wrap our own store creation inside parse_url(). Just let users work with vanilla zarr to create their own stores. Otherwise we duplicate zarr's handling of which store to create: local vs. remote, zip store, memory store etc.
Docs at https://github.com/zarr-developers/zarr-python/blob/main/docs/guide/storage.rst#implicit-store-creation encourage Implicit store creation.

We need to address scaling - we have some scaling that supports dask and 3D downsampling and others that don't. Also, python-based validation is something we need to support (several requests from the community) - Do we include pydantic-ome-ngff/ome-zarr-models-py as a dependency?

How do we define the "API" that is (for example) consumed by napari-ome-zarr? It's kinda based on the napari reader API but with a few differences (I think)?

What are the prime functions of ome-zarr-py? (and what alternatives exist)

  • Generating metadata (ome-zarr-models-py)
  • Validation (ome-zarr-models-py)
  • Writing / manipulating arrays (ngff-zarr - only 3D support?, ngff-writer - not maintained)
  • Graph traversal - e.g. handling bioformats2raw or Plate structure or Image -> labels
    • Use this "graph traversal" logic for providing a list of nodes -> layers for napari-ome-zarr

It seems most of the "solutions" for OME-Zarr creation from image.sc above are based on using ome-zarr-py for metadata generation, but handling array writing themselves. (similar strategy in omero-cli-zarr). If we adopt ome-zarr-models-py for metadata creation then we don't need ome-zarr-py so much.

Validation should be handled by ome-zarr-models-py.

We do need some fully n-dimensional, dask-compatible tool for scaling: E.g. Take a single-dataset OME-Zarr and build the pyramid, downsampling in x,y,z (not c, t etc).
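
A minimal NumPy sketch of that idea, assuming block-mean downsampling and axis names given as a string like "czyx". Only axes named z, y or x are reduced; c and t are untouched. The function names are hypothetical, and a real tool would need to handle labels (nearest-neighbour rather than mean), odd sizes and chunking more carefully:

```python
import numpy as np


def downsample_spatial(data, axes, factor=2, spatial_axes="zyx"):
    """Block-mean downsample along spatial axes only."""
    out = data
    for i, name in enumerate(axes):
        if name not in spatial_axes or out.shape[i] < factor:
            continue
        # Trim so the axis length is a multiple of `factor`
        n = (out.shape[i] // factor) * factor
        index = [slice(None)] * out.ndim
        index[i] = slice(0, n)
        out = out[tuple(index)]
        # Split the axis into (blocks, factor) and average over each block
        new_shape = out.shape[:i] + (n // factor, factor) + out.shape[i + 1:]
        out = out.reshape(new_shape).mean(axis=i + 1)
    return out


def build_pyramid(data, axes, n_levels, factor=2):
    """Return [level0, level1, ...], each level derived from the previous one."""
    levels = [data]
    for _ in range(n_levels - 1):
        levels.append(downsample_spatial(levels[-1], axes, factor))
    return levels


data = np.arange(2 * 8 * 8 * 8, dtype=float).reshape(2, 8, 8, 8)
pyramid = build_pyramid(data, "czyx", n_levels=3)
print([level.shape for level in pyramid])
```

For larger-than-memory data the same per-axis block reduction would need a dask equivalent; dask.array.coarsen looks like a candidate, though that is untested here.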

What are the "graph traversal" functionalities / API that we need? Is this mostly needed for napari-ome-zarr or are there other consumers of this?
In our various docs at the moment, we mostly just show how to grab the first item:

reader = Reader(parse_url(url))
nodes = list(reader())
image_node = nodes[0]
dask_data = image_node.data
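
As a way of pinning down what "graph traversal" needs to cover, here is a toy sketch that walks an in-memory stand-in for a Zarr hierarchy and lists images and label images. The nested-dict structure and find_images are made up for illustration (this is not the ome-zarr-py Reader API); the attribute keys 'multiscales', 'image-label', 'labels' and 'bioformats2raw.layout' are the ones from the OME-Zarr v0.4 spec:

```python
def find_images(group, path=""):
    """Yield (path, kind) for every group that has 'multiscales' metadata."""
    attrs = group.get("attrs", {})
    if "multiscales" in attrs:
        kind = "label" if "image-label" in attrs else "image"
        yield (path or "/", kind)
    for name, child in group.get("children", {}).items():
        yield from find_images(child, f"{path}/{name}")


# Toy bioformats2raw-style layout: two images, the first with one label image
hierarchy = {
    "attrs": {"bioformats2raw.layout": 3},
    "children": {
        "0": {
            "attrs": {"multiscales": [{}]},
            "children": {
                "labels": {
                    "attrs": {"labels": ["cells"]},
                    "children": {
                        "cells": {
                            "attrs": {"multiscales": [{}], "image-label": {}},
                            "children": {},
                        }
                    },
                }
            },
        },
        "1": {"attrs": {"multiscales": [{}]}, "children": {}},
    },
}

for path, kind in find_images(hierarchy):
    print(kind, path)
```

A consumer like napari-ome-zarr would presumably want the same kind of flat list of (path, kind) pairs, plus the metadata needed to turn each entry into a layer.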

Every time I come back to ome-zarr-py and need to refresh my memory, it takes a while to grok how all the Node, Spec, Reader, ZarrLocation classes etc. work together. Either we need to document this better or maybe it can be simplified in some way?

cc @joshmoore @dstansby


will-moore commented Dec 17, 2024

Discussion with @joshmoore and @jburel; notes at https://docs.google.com/document/d/13dmZLaozQ6VOu41bJROfDhmmsbfSCYtVx_sWScKdMhk/edit?tab=t.0

Summary:

  • OME-Zarr 0.5

    • Assume that Zarr v3 #404 (and napari-ome-zarr) can be made to work without too much effort. I.e., little change
  • OME-Zarr 0.6 and beyond

    • Temporarily (?) downprioritizing ome-zarr-py
    • Try to get napari-ome-zarr using ome-zarr-models-py (also look at the ergonomic transform classes from SpatialData which should be extracted to a new library)
    • Evaluate ngff-zarr for internal purposes
    • If a method is missing:
      • Either suggest it for ngff-zarr
      • Or: start building helpers
    • If ngff-zarr is 3D only, suggest to support the full API.
