Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FOV names and array names are hard-coded for the HCS dataset reader #248

Closed
ziw-liu opened this issue Feb 10, 2023 · 7 comments · Fixed by #300
Closed

FOV names and array names are hard-coded for the HCS dataset reader #248

ziw-liu opened this issue Feb 10, 2023 · 7 comments · Fixed by #300

Comments

@ziw-liu
Copy link

ziw-liu commented Feb 10, 2023

The v0.4 spec states:

# The name of the array is arbitrary with the ordering defined by
# by the "multiscales" metadata, but is often a sequence starting at 0.

And ome-zarr-py works with arbitrary array names in the single-FOV images layout, but not in an HCS plate layout. Only integer names starting with "0" is recognized.

It should instead be discovered from metadata (multiscales.datasets[0].path).

Also, FOV names not starting with "0" is not supported. The cause is likely this line hard-coding 'first field':

self.first_field = "0"

It should instead be discovered from metadata (well.images[0].path). Spec:

The well dictionary MUST contain an images key whose value MUST be a list of JSON objects specifying all fields of views for a given well. Each image object MUST contain a path key whose value MUST be a string specifying the path to the field of view. The path MUST contain only alphanumeric characters, MUST be case-sensitive, and MUST NOT be a duplicate of any other path in the images list. 

Edit: separate description for array and position names.

@ziw-liu ziw-liu changed the title Array name is hard-coded for the HCS dataset reader FOV names and array names are hard-coded for the HCS dataset reader Feb 10, 2023
@ziw-liu
Copy link
Author

ziw-liu commented Feb 13, 2023

According to #200 (comment), one can view all the FOVs by loading individual wells. However this only works for incremental integer FOV path names, not any 'case sensitive, alphanumeric' path names defined in the specification. Possible cause:

def get_field(tile_name: str, level: int) -> np.ndarray:
"""tile_name is 'row,col'"""
row, col = (int(n) for n in tile_name.split(","))
field_index = (column_count * row) + col
path = f"{field_index}/{level}"

@will-moore
Copy link
Member

Thanks for the feedback.
You're right that there is an assumption made about the path to first image in a Well.
It would be a pretty easy fix to load the Well images for a Well and read that data. It's a bit of a bigger change to read that data for ALL Wells in a plate if you can't assume that all Wells have the same path to first image.

This is a similar type of change to #241 where it becomes slower to load plate data if you don't make assumptions about all the Wells being the same.

Reading Well images may not have such a big impact as the changes on that PR, but it raises similar questions:
For plates where all Wells have the same path to first image, we would prefer NOT to have to load the Images for every Well to check.

@ziw-liu In your data, does each Well have a different path to the first image?
If so, then I guess we first need to test and see how much slower it is to read that for each image.

@ziw-liu
Copy link
Author

ziw-liu commented Feb 17, 2023

@will-moore Thanks for the explanation.

I agree that hardcoding is the fastest way to get the images. As for the dataset I have in mind, all the wells would have the same path (in fact '0') to the first image. I was trying to load synthetic data (generated by testing a custom writer) with the viewer, and found that writing with different FOV names that should be valid according to the specification is only supported by the ome-zarr-py reader when loading a single well.

The array name finding is of a bit more real-world concern. In our workflow it is common to transform the raw data into other arrays with incompatible data types and shapes, and it would be a cleaner data management practice if these can all be held in the same Zarr group with their names indicating their nature (e.g. 'raw', 'projection'). It is possible to rename the arrays we want to load with ome-zarr-py and napari-ome-zarr to '0' so that they are recognized, but it would be nice if the NGFF specification is fully implemented here, and data can be accessed with Zarr-style name keys when not using ome-zarr-py.

As for the performance concern, I don't have an intuition for cloud object storage, but for a storage server in the local netowrk, increased I/O operations should only be linear to the image count to load the .zattrs files. But of course only testing can give a real answer.

@will-moore
Copy link
Member

Thanks for the info. This makes good sense.
Do you have a 'multiscales' object for e.g. raw and projection, each with a pyramid of 'datasets'?
Currently, only the first multiscales is object is read by ome-zarr-py.

@ziw-liu
Copy link
Author

ziw-liu commented Feb 24, 2023

Do you have a 'multiscales' object for e.g. raw and projection, each with a pyramid of 'datasets'?

I do not think we will have multiple pyramids for a single FOV, and I agree that for visualization only loading the first pyramid makes sense. Are you suggesting that incompatible arrays can be stored in separate multiscales objects?

@will-moore
Copy link
Member

The spec allows multiple multiscales images https://ngff.openmicroscopy.org/latest/#multiscale-md and doesn't say anything about the relationship between them.
However, currently all NGFF tools assume only 1 multiscales image per NGFF file and ignore any others: (see 'multiple multiscales' at https://ome.github.io/ome-ngff-tools/)

@aeisenbarth
Copy link
Contributor

I fell into the same problem. Well image path and field index seem to be used a bit interchangeably in the code, but this is not reflected in any document, especially the spec. For my use case, I'd prefer to have human-readable image names, because these are displayed, e.g. in Napari.

If the reason are performance concerns about accessing image paths when reading the plate, then maybe the paths should not (only) live at the well level, or the spec could be modified not to allow non-index image names.

Since ome-zarr-py already has assumptions (not fully implemented spec), it's a matter of weighing up the probability that a dataset has well image path != field index versus the probability that the first well's image paths are not the same for all wells. I think the second is less likely and a fix would increase the amount of supported datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants