
Added a port of scipy.ndimage.measurement.watershed. #99

Open · wants to merge 1 commit into base: main
Conversation

@ebo commented Feb 8, 2019

I feel it might be a little too soon for a PR, but...

I got the watershed algorithm ported from scipy.ndimage.measurement.watershed working the same way that dask-image.label works.

I will have a little time to work on this over the weekend if we can discuss what needs to be done to clean things up.

Sorry, I thought I had posted some comments a week ago, and realized this morning that I had posted the numba version of the code to the numba list instead.

It would be nice to do a little profiling to see if I can get numba working with the low-level dask stuff, but I have not heard back from the dask/dask-image folks about whether that is acceptable.

Let me know how you wish for me to proceed.

dask.array.atop was deprecated and renamed to dask.array.blockwise.

@jni (Contributor) commented Feb 9, 2019

In my opinion it's better not to have this algorithm implemented than to instantiate a single chunk, but @jakirkham might disagree. We were having some discussions about this and we think that the right approach is probably something like that found in #94 (run independently in different volumes) but with some overlap and a smarter way to link adjacent blocks. However, that's not exactly right — some seed propagation will need to happen from one volume to the next, so a two pass approach might be necessary.

Here's my idea:

  • run watershed on each block with the given seeds and a 1-pixel-wide pad (maybe 2, need to think about this) with a "border" seed.
  • now, each block is masked with the border segment, and the watershed is run again only on the masked area, with the unmasked area from adjacent blocks as seeds.
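The first bullet can be sketched in plain NumPy plus scikit-image, leaving out the dask wiring (e.g. `map_overlap`) entirely. `watershed_block_first_pass` is a hypothetical helper name, not part of any library:

```python
import numpy as np
from skimage.segmentation import watershed

def watershed_block_first_pass(image_block, markers_block, pad=1):
    # Hypothetical sketch: pad a single block and surround it with a
    # synthetic "border" seed so basins touching the block edge are
    # claimed by the border label rather than bleeding outward.
    border_label = int(markers_block.max()) + 1
    img = np.pad(image_block, pad, mode="edge")
    mrk = np.pad(markers_block, pad, mode="constant", constant_values=0)
    # seed the 1-pixel-wide rim with the border label
    mrk[0, :] = mrk[-1, :] = mrk[:, 0] = mrk[:, -1] = border_label
    labels = watershed(img, markers=mrk)
    # trim the pad so the result aligns with the original block
    return labels[pad:-pad, pad:-pad], border_label
```

In the second pass, the pixels carrying `border_label` would be masked and re-flooded using the unmasked labels from adjacent blocks as seeds.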

@wwarriner commented Nov 5, 2019

I am interested in where this is going, so I attempted prototype implementations of the method suggested by @jni. I also tried a modified method with a different mode of seed communication between chunks. Prototypes of each method are in the attached zip file as notebooks.
ws.zip

Bottom lines:

  • Watershed output may not be robust against chunking.
  • Two passes may not be enough in general for certain "pathologically" shaped watershed basins.

Details Follow

Both notebooks use the same approach to generate raw data:

  • Create binary image with randomly scattered "on" pixels.
  • Find EDT of "on" pixels.
  • Compute h-max/h-dome of "on" pixels (in case of stair-stepping artifacts, slightly changes results)
  • Label the "on" pixels to create a marker image.

In the implementation of the method by @jni (v2 in the zip), I add a border to each chunk of the marker image and assign the border the value max(labels) + 1. I watershed the chunks for a first pass, then remove the border label to produce a new set of markers. I trim the overlap from the previous chunking, then re-overlap to cross-pollinate the markers. I create a mask using only the boundary-label basin. Finally, I perform the second-pass watershed, compose the respective parts of the first and second passes, and trim.

In the modified implementation, I do a first-pass watershed without a border. I then trim the overlap and re-overlap to cross-pollinate as before (this is a little tricky without a mask, but it works about as well). Then I perform the second pass watershed.
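The trim-and-re-overlap ("cross-pollinate") step can be illustrated for two neighbouring 1-D blocks in plain NumPy; `cross_pollinate` is a hypothetical name, and a real version would use dask's overlap machinery instead of manual slicing:

```python
import numpy as np

def cross_pollinate(left_labels, right_labels, depth=1):
    # Hypothetical sketch: after trimming the first-pass overlap, each
    # block receives the strip its neighbour computed along the shared
    # boundary, to act as extra seeds for the second-pass watershed.
    left_seed = np.zeros_like(left_labels)
    right_seed = np.zeros_like(right_labels)
    left_seed[..., -depth:] = right_labels[..., :depth]   # neighbour's left edge
    right_seed[..., :depth] = left_labels[..., -depth:]   # neighbour's right edge
    return left_seed, right_seed
```

This only exchanges information between immediate neighbours, which is exactly the limitation discussed below.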

I expect there are bugs with my implementation, but both methods seem to have some amount of error between a watershed of the full Numpy array and of the distributed version. The methods have almost the exact same error pattern, which suggests that either I am doing something consistently wrong, or there is something unexpected happening with the behavior of the watershed.

I would like to point out something about the validation image at the bottom of each notebook. There is a strip of about 1/4 of the height of the image with very few errors. That strip is fairly consistent even when changing the random seed or flipping the image along the axis=0. I haven't figured out why that might be and could use additional eyes.

After playing around with these methods there may be an important limitation. It seems to me that the watershed requires global image information in general. If a marker is more than one chunk away from any part of its basin, then that basin can't be labeled in only two passes with neighbor-only communication. It seems that in general the process of propagating marker information would have to be repeated until the watershed image stops changing. I've attached a notebook v3 that demonstrates the limitation pretty clearly with a double spiral starting image. The v3 notebook is a little rougher than the other two, but hopefully gets the idea across.

An iterative method may not require a while loop. A for loop with n iterations for n chunks may be sufficient. I'm imagining the limiting case occurs when a narrow basin (like the double spiral) threads its way back and forth through many chunks. The question there would be does the information get properly propagated after only n passes? Or are more passes required due to the large number of crossings?
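The bounded iteration described above could look like the following sketch, where `step` is a hypothetical function performing one neighbour-exchange-plus-rewatershed pass over all blocks:

```python
import numpy as np

def propagate_until_stable(blocks, step, max_iter):
    # Repeat a per-pass update until the labels stop changing, bounded by
    # max_iter (e.g. the number of chunks, per the limiting case above).
    prev = [b.copy() for b in blocks]
    for _ in range(max_iter):
        blocks = step(blocks)
        if all(np.array_equal(a, b) for a, b in zip(blocks, prev)):
            break  # fixed point reached before the bound
        prev = [b.copy() for b in blocks]
    return blocks
```

Whether `max_iter = n_chunks` always suffices for basins that cross chunk boundaries many times is exactly the open question raised above.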

@jni (Contributor) commented Nov 5, 2019

@wwarriner super cool! Sorry that I don't have time to check this all out in detail right now, but it sounds like a big leap forward. btw, if you publish your notebooks to gist.github.com, they can be viewed online at nbviewer.ipython.org, which is a nice way to share them, as it allows others to view them without having to boot up a local instance.

@wwarriner commented

I took your advice @jni. Hope this is more helpful! Gist link

@GenevieveBuckley (Collaborator) commented

Jumping in to say thanks for working on this!

I had a small poke around in the notebooks in your gist.

  • In the v1 notebook section 1.12, ws_dummy is not previously defined. Is this variable intended to be a copy of ws?
  • In the v2 notebook ws_label_max is also not defined. Here it seems pretty clear that this should be
    ws_label_max = np.max(ws)

It's a little worrying that there are differences in the watershed results between the dask and scikit-image implementations. I'd have to dig more into why that could be before I'd have anything really useful to say on that.

```python
if ii != 3:
raise RuntimeError('structure dimensions must be equal to 3')

return None #
```
A review comment on the diff (Collaborator):

It's probably still too early for comments on this yet, but I wanted to flag that the indentation level here suggests that this function must return None every time. I imagine that's not quite what's intended, unless I'm missing context.

@wwarriner commented

> Jumping in to say thanks for working on this!
>
> I had a small poke around in the notebooks in your gist.
>
> * In the v1 notebook section 1.12, `ws_dummy` is not previously defined. Is this variable intended to be a copy of `ws`?
> * In the v2 notebook `ws_label_max` is also not defined. Here it seems pretty clear that this should be `ws_label_max = np.max(ws)`

Absolutely, and happy to.

Thank you for the review, I've updated the gist to incorporate your thoughts. I corrected ws_dummy to be ws_neighbors. Those two lines were supposed to be chained on ws_fp. Also corrected missing ws_sp at the end of the same block to ws_final. For v2 I corrected ws_max_label to ws_markers.max(). Everything should work in the v1 and v2 notebooks now.

> It's a little worrying that there are differences in the watershed results between the dask and scikit-image implementations. I'd have to dig more into why that could be before I'd have anything really useful to say on that.

Agreed!

My initial thought would be that the implementation of the watershed is not robust with respect to segment boundaries. This may be because the watershed implementation is queue-based. If a basin is split across chunks it could alter the queue order. The result could be that individual ridge pixels are assigned to different basins depending on the precise number and value of pixels in each basin. Put another way, whichever basin arrives at a pixel first "wins" the right to assign a segment value. If the number and value of pixels is changed, then the order of arrival may also change, resulting in flip-flops near boundaries like we are seeing.
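The "whichever basin arrives first wins" behaviour can be seen in a toy 1-D priority flood. This is an illustrative sketch, not the scikit-image implementation: ties in height are broken by queue insertion order, which is why changing the set of pixels feeding the queue (as chunking does) can flip ridge-pixel assignments.

```python
import heapq
import numpy as np

def priority_flood_1d(heights, markers):
    # Toy 1-D priority flood: pixels are popped in order of height, with
    # ties broken by insertion order, and each unlabelled neighbour is
    # claimed by the first basin to reach it.
    labels = markers.copy()
    heap, counter = [], 0
    for i, lab in enumerate(markers):
        if lab:
            heapq.heappush(heap, (heights[i], counter, i))
            counter += 1
    while heap:
        _, _, i = heapq.heappop(heap)
        for j in (i - 1, i + 1):
            if 0 <= j < len(heights) and labels[j] == 0:
                labels[j] = labels[i]  # first arrival wins
                heapq.heappush(heap, (heights[j], counter, j))
                counter += 1
    return labels
```

With `heights = [0, 1, 2, 1, 0]` and seeds 1 and 2 at the two ends, the ridge pixel at index 2 goes to basin 1 purely because basin 1's frontier entered the queue first.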

The scikit-image implementation is due to Beucher; see ref [2] in the scikit-image docs. More information can be found in the PDF manuscript *An Overview of Watershed Algorithm Implementations in Open Source Libraries*.

It isn't fully clear what we can do from here if my assumption is correct. One option would be to alter the watershed code to provide consistent assignment to ridge pixels. A naive method would be to always use watershed_line=True internally, then assign the min or max neighboring segment to the pixels masked by the watershed line.
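A sketch of that naive method, assuming scikit-image's `watershed` and using a grey dilation to pick the maximum neighbouring label for each ridge pixel (`watershed_max_ridge` is a hypothetical name):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def watershed_max_ridge(image, markers):
    # Run with an explicit watershed line (ridge pixels labelled 0), then
    # assign every ridge pixel the maximum label among its 3x3 neighbours.
    labels = watershed(image, markers=markers, watershed_line=True)
    ridge = labels == 0
    neighbour_max = ndi.grey_dilation(labels, size=(3, 3))
    labels[ridge] = neighbour_max[ridge]
    return labels
```

This sketch assumes 1-pixel-wide lines; with thicker ridges a single dilation pass could leave interior ridge pixels unlabelled.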

@GenevieveBuckley (Collaborator) commented

> It isn't fully clear what we can do from here if my assumption is correct. One option would be to alter the watershed code to provide consistent assignment to ridge pixels. A naive method would be to always use watershed_line=True internally, then assign the min or max neighboring segment to the pixels masked by the watershed line.

I like this idea, it seems like a good way to test those assumptions out.

@wwarriner commented

Using watershed_line=True with replacement by maximum neighbor did not work. The error pattern has changed, but there are still errors. It isn't clear what to do from here short of devising or finding a more robust watershed algorithm. For now I will take some time to think about possible solutions and revisit if I find something. Thanks for your help, and here's hoping something presents itself as a solution.

@ebo (Author) commented Dec 13, 2019

It has been a while since I have worked on this, but I have spent enough time implementing this once or twice in the distant past that I remember some of the painful bits. Jni's early comments are spot on, but in my code I kept and propagated the alias/union-find structure throughout, and partitioned the initial values in each block so that the minimum cluster ID is a function of num_pixels_in_block * block_order. This guarantees that no matter what ID you give a pixel, it is unique between the blocks. The IDs can then be compressed (to the per-block minimum mentioned before), but they must not overlap before the final boundary merge. Once you collect all the block-level IDs, you run the same merge operation over them, and then collapse the IDs to minimize them for the final update and write.

I do not feel that I am describing the processing steps clearly, but in essence, as long as you can guarantee that each block is uniquely labeled, you can repeat the process on the IDs to merge all of the blocks together.
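The unique-ID guarantee described above can be sketched in a few lines of NumPy/SciPy. The names are hypothetical, the blocks are treated as a plain list, and the union-find merge over block boundaries is omitted:

```python
import numpy as np
from scipy import ndimage as ndi

def label_blocks_with_offsets(mask_blocks):
    # Sketch of the offset scheme: each block is labelled independently,
    # then shifted by a per-block offset so IDs never collide across
    # blocks. Here offset = block_order * block_size, since block_size is
    # an upper bound on the number of labels a block can contain.
    out = []
    for order, block in enumerate(mask_blocks):
        labels, _ = ndi.label(block)
        offset = order * block.size
        out.append(np.where(labels > 0, labels + offset, 0))
    return out
```

A boundary-merge pass would then union IDs that touch across block faces, and a final relabelling would compress the surviving IDs to a contiguous range.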

Hope this helps.

@GenevieveBuckley (Collaborator) commented

Thanks for adding your thoughts @ebo!

@GenevieveBuckley mentioned this pull request Jun 4, 2020
Base automatically changed from master to main February 2, 2021 01:18