Replies: 4 comments 1 reply
-
Hello, you might want to take a look at https://github.com/eth-cscs/spla (it is used, for example, in SIRIUS, a plane-wave DFT code, to compute inner products of wave functions and to transform wave functions). It supports NVIDIA and AMD GPUs, and handles the reduction (MPI communication) in an optimal way for tall-and-skinny matrices.
-
Hi Emil,
Let us know your thoughts on these! Also on spla, as Simon suggested :)
-
Spla looks interesting for some other applications, but unless I'm misunderstanding how it works, Tiled-MM might be better for this particular use case. Marko, both of your options look interesting, but there are likely to be many cases where the entire matrix won't fit on the GPU.
-
@elbriggs what would be the arguments of such a callback function? I guess you should be able to pass an MPI communicator to the callback. |
-
I'm thinking about an extension to the Tiled-MM API that might be helpful for our use case (though not necessarily in general). The idea would be to provide an interface for registering a callback function that is invoked whenever a tile of the C matrix is completed. I can modify the code directly for our use case if it doesn't make sense to add something like this to the API. The scenario arises when performing a matrix multiply that produces a large C matrix, which is then reduced across nodes via MPI calls. The multiply and the reduction are synchronous now, but a per-tile callback seems an obvious way to get some overlap between internode communication and computation.
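To make the idea concrete, here is a minimal sketch of the pattern. The `on_tile_complete` hook name and its `(r0, r1, c)` signature are hypothetical (nothing like this exists in Tiled-MM today), and a thread pool stands in for a nonblocking MPI reduction (e.g. `MPI_Iallreduce`) so the example is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def tiled_multiply(a, b, tile_size, on_tile_complete):
    """Compute C = A @ B one row-tile at a time, invoking a callback as
    each tile finishes so "communication" can overlap the remaining work."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for r0 in range(0, n, tile_size):
        r1 = min(r0 + tile_size, n)
        for i in range(r0, r1):
            for j in range(m):
                c[i][j] = sum(a[i][t] * b[t][j] for t in range(k))
        on_tile_complete(r0, r1, c)  # hypothetical hook: rows [r0, r1) are done
    return c

# Stand-in for a nonblocking MPI reduction: each completed tile is handed to
# a worker thread, overlapping with the multiply of the remaining tiles.
pool = ThreadPoolExecutor(max_workers=2)
pending = []

def start_reduction(r0, r1, c):
    # In the real code this would post MPI_Iallreduce on rows [r0, r1);
    # here we just sum the tile's entries on a worker thread.
    pending.append(pool.submit(lambda: sum(sum(row) for row in c[r0:r1])))

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = tiled_multiply(a, b, tile_size=1, on_tile_complete=start_reduction)
totals = [f.result() for f in pending]  # "wait" on the posted reductions
```

In the real setting the callback would post the nonblocking reduction and return immediately, with a final wait on all outstanding requests after the last tile, which is exactly where the compute/communication overlap comes from.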