
One imread to rule them all #229

Open
GenevieveBuckley opened this issue May 17, 2021 · 9 comments

@GenevieveBuckley
Collaborator

A lot of people have put a lot of effort into imread lately. This is great, and it's really helped. However, we've still got a way to go.

These are the four major areas where I see problems popping up:

  1. Read image data into Dask arrays accurately. We need more simple test cases here. Bug report: dask_image.imread.imread regression #220

  2. Reduce confusion. Currently, there are multiple implementations of a dask imread function. The two most easily confused are dask_image.imread.imread() and dask.array.image.imread(). We need to figure out which is best, and only use that one.

  3. Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:

  4. Process the image data fast, too. For that to happen, we need smart default choices for how we chunk image data in Dask arrays. Jackson Maxfield Brown describes the problem well in this short video.

@GenevieveBuckley
Collaborator Author

GenevieveBuckley commented May 17, 2021

The first step is getting a benchmark script going.

We need to:

  • Report benchmark results for dask_image.imread.imread() and dask.array.image.imread() (for an apples-to-apples comparison, you might need to explicitly pass pims.open as a keyword argument to dask.array.image.imread()); a rough timing sketch follows this list
  • Add benchmarking to run on our CI
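
A rough sketch of what such a comparison script could look like, before it becomes a proper asv suite. The data/*.tif glob, the timing split, and the pims wrapper are all assumptions; in particular, dask.array.image.imread expects its imread= callable to return a plain ndarray per file, so pims.open is wrapped rather than passed directly.

```python
import time

import numpy as np
import pims
import dask.array.image
import dask_image.imread


def pims_read(fname):
    # Read one file into a plain ndarray via pims; pims frames are ndarray
    # subclasses, so stacking them gives a regular array per file.
    frames = pims.open(fname)
    return np.stack([np.asarray(frame) for frame in frames])


cases = {
    "dask_image.imread.imread": lambda: dask_image.imread.imread("data/*.tif"),
    "dask.array.image.imread + pims": lambda: dask.array.image.imread(
        "data/*.tif", imread=pims_read
    ),
}

for label, build in cases.items():
    t0 = time.perf_counter()
    arr = build()       # graph construction only
    t1 = time.perf_counter()
    arr.compute()       # actual reading
    t2 = time.perf_counter()
    print(f"{label}: graph {t1 - t0:.3f}s, read {t2 - t1:.3f}s")
```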

@evamaxfield

  1. Read data in fast. For that, we'll need to have some proper benchmarks, and run them routinely as part of the CI. This will help us decide (2) above. Previous discussion:

Highly recommend using asv. We use it on aicsimageio for our pure reading benchmarks, and after working on the Dask Summit presentation I also added the example from my slides as a benchmark suite and a "LibCompareSuite" to monitor aicsimageio and dask-image performance.

Note that because I just changed the benchmark parameters, a lot of the visualizations were reset, but the benchmarks for the most recent commit are there, basically as scatter plots. As more commits are added with the same benchmark configuration, they will show up as a time series.
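
For reference, a minimal asv-style suite can be as small as the sketch below. The file layout and data path are illustrative; asv discovers any method whose name starts with time_ and runs setup before each.

```python
# benchmarks/benchmark_imread.py -- illustrative asv suite skeleton.
import dask_image.imread


class ImreadSuite:
    def setup(self):
        # Assumes benchmark data has been generated beforehand at this path.
        self.pattern = "benchmarks/data/*.tif"

    def time_graph_construction(self):
        dask_image.imread.imread(self.pattern)

    def time_read_and_compute(self):
        dask_image.imread.imread(self.pattern).compute()
```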


Report benchmark results for dask_image.imread.imread() and dask.array.image.imread() (for an apples-to-apples comparison, you might need to explicitly pass pims.open as a keyword argument to dask.array.image.imread())

I tried doing the above during my benchmark setup on aicsimageio and I couldn't get the pims.open option working.

For the default case I felt it was an unfair comparison. dask.array.image.imread reads the whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. each chunk of the Dask array is a whole-file read of one file in the glob). dask-image can do glob reading as well, but I still feel the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)
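
To make that concrete, here is a hedged illustration with a made-up file: suppose "movie.tif" is a 300-page TIFF of 512x512 frames. The chunk shapes in the comments are what the default readers would typically produce, not measured output.

```python
import dask.array.image
import dask_image.imread

a = dask.array.image.imread("movie.tif")    # default reader loads the whole file per chunk
print(a.shape, a.chunks)                    # e.g. (1, 300, 512, 512), a single block

b = dask_image.imread.imread("movie.tif")   # pims splits the file into frames
print(b.shape, b.chunks)                    # e.g. (300, 512, 512), one frame per chunk
```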


Happy to help and PR into dask-image where I can. At the very least, my talk is now basically built into the library on every commit 😄

@GenevieveBuckley
Collaborator Author

That's a strong recommendation for asv! Very helpful to have those links and implementation details.

For the default case I felt it was an unfair comparison. dask.array.image.imread reads the whole file per chunk and, from my quick look, is meant for glob reading of files (i.e. each chunk of the Dask array is a whole-file read of one file in the glob). dask-image can do glob reading as well, but I still feel the most common API interaction is reading a massive single file. (Probably just my usage bias, though.)

I'm not sure whether it's the most common, but it's definitely common enough that we need good performance.

@m-albert
Collaborator

Report benchmark results for dask_image.imread.imread() and dask.array.image.imread() (for an apples-to-apples comparison, you might need to explicitly pass pims.open as a keyword argument to dask.array.image.imread())

Wanted to link here a quick performance comparison we had done in the past: #194 (comment). The conclusion had been that dask.array.image is significantly faster than dask_image.imread, whether using skimage.io or pims. The only advantage for the latter is dask graph creation time when input image files are large (as dask.array.image reads in a file to determine image shape, while dask_image.imread uses pims for this).
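
As a hedged illustration of that last point, pims exposes enough reader metadata to determine the array shape and dtype without reading the whole stack up front. The file name below is made up; frame_shape and pixel_type are pims' public attributes.

```python
import numpy as np
import pims

frames = pims.open("big_stack.tif")
shape = (len(frames),) + tuple(frames.frame_shape)  # from reader metadata
dtype = np.dtype(frames.pixel_type)
print(shape, dtype)
```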

@GenevieveBuckley
Collaborator Author

The only advantage for the latter is dask graph creation time when input image files are large (as dask.array.image reads in a file to determine image shape, while dask_image.imread uses pims for this).

Presumably we could add this behaviour to dask.array.image if that's useful.

@m-albert
Collaborator

m-albert commented Jun 1, 2021

Presumably we could add this behaviour to dask.array.image if that's useful.

Definitely. Wouldn't currently think it's too critical though.

@GenevieveBuckley
Collaborator Author

@jni says that scikit-image also has a good guide to asv. I think this is it: https://scikit-image.org/docs/dev/contribute.html#benchmarks

@GenevieveBuckley
Collaborator Author

One big disadvantage of dask.array.image.imread is poor chunking behaviour. It looks like it makes a single chunk for every filename on disk. This is not great for movie files or multislice TIFFs, etc., where you probably don't want to load the whole file into RAM.

See #262 (comment)
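
For comparison, dask_image.imread.imread exposes nframes to control how many frames land in each chunk, so a large movie never has to be read whole. A minimal sketch, with an illustrative file name and chunk size:

```python
import dask_image.imread

movie = dask_image.imread.imread("movie.tif", nframes=10)  # 10 frames per chunk
print(movie.chunks)                          # leading axis in blocks of 10 frames
mean_frame = movie.mean(axis=0).compute()    # streams through the file chunk by chunk
```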

@jakirkham
Member

jakirkham commented May 13, 2022

Yeah, this comes up with large multipage TIFFs. They can be kind of movie-like.

Wonder if we should just make the move to using ImageIO here with PR ( imageio/imageio#739 ) in? It's hard supporting all of the different file formats/use cases out there. Maybe a better separation of concerns would improve the user experience.

Edit: Also broadly related ( dask/dask#9049 )
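
A hedged sketch of what leaning on ImageIO could look like, using imageio's v3 API. The helper name and file are illustrative, not an agreed dask-image design or its current implementation.

```python
import dask
import dask.array as da
import imageio.v3 as iio


def lazy_imread(path):
    # Probe shape/dtype cheaply with improps, then defer the actual pixel read.
    props = iio.improps(path)
    delayed_read = dask.delayed(iio.imread)(path)
    return da.from_delayed(delayed_read, shape=props.shape, dtype=props.dtype)


arr = lazy_imread("example.tif")  # nothing is read until .compute()
```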
