Add support for ndimage.map_coordinates #235

Open
jni opened this issue May 26, 2021 · 12 comments

@jni
Contributor

jni commented May 26, 2021

This could be done in two stages:

  1. Add support assuming that the number of coordinates is small and fits in worker memory. This is useful when grabbing a small number of coordinates from a large distributed array (which happens to be my use case! 😊), and should be relatively easy to do.
  2. Add support when the coordinates array is also distributed. This is more challenging.
@jakirkham
Member

This seems like a map_overlap case. Guess the question is how much overlap (probably a function of spline order)?

Maybe I'm missing something, but what makes the distributed case harder? It seems like this should be separable by chunks of coordinates.

@GenevieveBuckley
Collaborator

This seems like a map_overlap case. Guess the question is how much overlap (probably a function of spline order)?

Yes, except that this is another case where the output isn't going to be the same shape as the input array, which complicates things. We had planned to have a discussion about this in general - I've sent an email to sort out scheduling for next week.

import numpy as np
import scipy.ndimage as ndi
import dask.array as da

image = da.from_array(np.arange(100.).reshape((10, 10)), chunks=5)
# `inds` is a (2, n) array of point coordinates; its definition isn't shown here

ndi.map_coordinates(np.array(image), inds)
array([ 3.75660377, 24.        ])  # correct output from scipy

image.map_blocks(ndi.map_coordinates, coordinates=inds, drop_axis=[0, 1]).compute()
array([ 3.75660377, 24.        ])  # looks ok, but won't be robust to edge effects between array chunks

image.map_overlap(ndi.map_coordinates, depth=3, coordinates=inds, drop_axis=[0, 1]).compute()
array([18.36856823,  1.        ])  # nope!

@GenevieveBuckley
Collaborator

Maybe I'm missing something, but what makes the distributed case harder? It seems like this should be separable by chunks of coordinates.

I don't think you're missing anything; it seems like it should be achievable.

@GenevieveBuckley
Collaborator

I think we need to add the overlap depth to the indices we're looking up, but I can't quite wrap my head around what boundary conditions we need.

@GenevieveBuckley
Collaborator

E.g.:

depth_value = 3
image.map_overlap(ndi.map_coordinates, depth=depth_value, coordinates=inds + depth_value, boundary=0.0, drop_axis=[0, 1]).compute()
array([ 5.09632784, 24.        ])  # still not correct

@jni
Contributor Author

jni commented May 26, 2021

@jakirkham the issue I'm pre-empting (which might indeed not be super serious) is that the coordinates I want to extract could be in any chunk, and presented in any order. So I can't just chunk my input coordinates naively; I have to partition them according to the chunk of the raw data that each one needs to access.

Or I guess I could send every chunk of the coordinates to every chunk of the raw data, but that seems wasteful.
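
One way to do that partitioning, as a rough sketch (not dask-image code, and assuming the coordinates fit in memory as a NumPy array): read the chunk boundaries off image.chunks and use np.searchsorted to find which chunk each point falls into, then bucket the point indices by chunk.

import numpy as np
import dask.array as da

image = da.from_array(np.arange(100.).reshape((10, 10)), chunks=5)
coords = np.array([[1.5, 7.2, 3.0],
                   [8.1, 2.4, 9.9]])  # (ndim, npoints), in arbitrary order

# chunk edges along each axis, e.g. chunks=((5, 5), (5, 5)) -> edges [0, 5, 10]
edges = [np.cumsum((0,) + c) for c in image.chunks]

# for every point, the index of the chunk it falls into along each axis
chunk_index = np.stack([
    np.clip(np.searchsorted(e, coords[d], side="right") - 1, 0, len(e) - 2)
    for d, e in enumerate(edges)
])

# bucket the point indices by the chunk they belong to
groups = {}
for i, key in enumerate(map(tuple, chunk_index.T)):
    groups.setdefault(key, []).append(i)
print(groups)  # {(0, 1): [0, 2], (1, 0): [1]}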

@m-albert
Collaborator

m-albert commented May 26, 2021

I'd also think that "stage 2" is not easy to solve optimally. For the case of stage 1 (which should cover most use cases for now): The problem I'd see with directly applying map_overlap and dropping the image axes would be that the full image is gathered to be input to map_coordinates. So first the coordinates would need to be assigned to the chunks they map into.

However, applying map_overlap between a chunked coordinate array and the image would probably be inefficient, since many unmapped chunks would be loaded unnecessarily. For applications like this it would be nice to be able to mask entire chunks so that dask graphs could be culled smartly (as suggested by @GenevieveBuckley here dask/dask#7652, see also #217).

Alternatively, for every coordinate the required input slice could be determined (see dask_image.ndinterp.affine_transform for how this depends on the order). Then the coordinates would need to be sent to different tasks, and determining how to do this best is probably a nontrivial optimization problem. A simple (though far from optimal) solution could be to group them by the image chunk they match to, which seems like a natural choice. Then for each group a task could be defined that precisely slices the input array (compared to map_overlap, this would also avoid constructing overlaps for the dimensions where it's not needed).
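
As a rough sketch of that grouped slicing (reusing the hypothetical grouping from the sketch above, and guessing a spline margin of order + 1 rather than the exact value dask_image.ndinterp.affine_transform uses): for each group, slice the dask array tightly around its points plus the margin, materialize only that region, interpolate locally, and write the results back in the original point order.

import numpy as np
import scipy.ndimage as ndi
import dask.array as da

def evaluate_group(image, coords, point_idx, order=3):
    # slice the dask array tightly around one group of points (plus a margin
    # for the spline support), materialize only that region, and interpolate
    pts = coords[:, point_idx]
    margin = order + 1  # assumed margin; results near the cut edges are approximate
    lo = np.maximum(np.floor(pts.min(axis=1)).astype(int) - margin, 0)
    hi = np.minimum(np.ceil(pts.max(axis=1)).astype(int) + margin + 1, image.shape)
    region = np.asarray(image[tuple(slice(l, h) for l, h in zip(lo, hi))])
    return ndi.map_coordinates(region, pts - lo[:, np.newaxis], order=order)

image = da.from_array(np.arange(100.).reshape((10, 10)), chunks=5)
coords = np.array([[1.5, 7.2, 3.0],
                   [8.1, 2.4, 9.9]])
groups = {(0, 1): [0, 2], (1, 0): [1]}  # as computed in the previous sketch

out = np.empty(coords.shape[1])
for idx in groups.values():
    out[idx] = evaluate_group(image, coords, idx)
print(out)  # values in the original point order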

@jakirkham
Member

jakirkham commented May 26, 2021

Yeah, I was thinking about this when I wrote the above last night. This sounds kind of like a shuffle, but I have been spending a fair bit of time on that use case, so I may just be seeing use cases that are not there.

@m-albert
Collaborator

Yeah, I was thinking about this when I wrote the above last night. This sounds kind of like a shuffle, but I have been spending a fair bit of time on that use case, so I may just be seeing use cases that are not there.

True, seems like a similar use case!

A not-too-bad approximation of the "stage 2" case could probably be to take a more or less optimized "stage 1" implementation as described above and apply map_blocks over the input coordinate dask array. The scheduler would then take care of grouping tasks that work on related image chunks.

@m-albert
Collaborator

A not-too-bad approximation of the "stage 2" case could probably be to take a more or less optimized "stage 1" implementation as described above and apply map_blocks over the input coordinate dask array. The scheduler would then take care of grouping tasks that work on related image chunks.

Thinking about it again, this probably wouldn't be quite as straightforward: if the coordinates are given as a dask array, the coordinates themselves, and therefore how they map onto the input array, wouldn't be known at scheduling time. Maybe launching tasks from within tasks could be useful here.
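
As a rough sketch of the tasks-from-tasks idea with dask.distributed (all function names here are hypothetical, not dask-image API): a task that receives a concrete block of coordinates can open a worker_client at run time and submit one sub-task per piece of work it discovers, which is roughly the pattern a distributed-coordinates implementation would need.

import numpy as np
import scipy.ndimage as ndi
from dask.distributed import Client, worker_client

def interpolate_region(region, coords, order=3):
    # sub-task: plain SciPy interpolation on an in-memory region
    return ndi.map_coordinates(region, coords, order=order)

def lookup_points(coord_block, image):
    # parent task: the coordinates are concrete here, so it can decide at run
    # time what to submit; a real implementation would group coord_block by
    # image chunk (see the sketches above) and submit one sub-task per chunk
    with worker_client() as client:
        futures = [client.submit(interpolate_region, image, coord_block[:, [i]])
                   for i in range(coord_block.shape[1])]
        return np.concatenate(client.gather(futures))

if __name__ == "__main__":
    client = Client(processes=False)
    image = np.arange(100.).reshape((10, 10))
    coords = np.array([[2.0, 4.0],
                       [4.0, 0.0]])  # points (2, 4) and (4, 0)
    print(client.submit(lookup_points, coords, image).result())  # [24. 40.]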

@jakirkham
Member

Yeah, that's why I'm thinking of shuffle: it also has to make decisions based on values that are not known until they are computed. People have been spending a lot of time optimizing that problem, so we could benefit from that work fairly easily if we are able to articulate our problem in those terms.

@m-albert
Collaborator

@jakirkham Do you have a link to some more details about the shuffle problem and how existing solutions to it can be used?

In the meantime, #237 suggests an implementation for use case 1.
