Replies: 4 comments 1 reply
-
Hello, you might want to take a look at https://github.com/eth-cscs/spla (it is used, for example, in SIRIUS, a plane-wave DFT code, to compute inner products of wave functions and to transform wave functions). It supports NVIDIA and AMD GPUs, and handles the reduction (MPI communication) in an optimal way for tall-and-skinny matrices.
-
Hi Emil,
Let us know your thoughts on these! Also on spla, as Simon suggested :)
-
Spla looks interesting for some other applications, but unless I'm misunderstanding how it works, Tiled-MM might be better for this particular use case. Marko, both of your options look interesting, but there are likely to be many cases where the entire matrix won't fit on the GPU.
-
@elbriggs what would be the arguments of such a callback function? I guess you should be able to pass an MPI communicator to the callback. |
-
I'm thinking about an extension to the Tiled-MM API that might be helpful for our use case (though not necessarily in general). The idea would be to provide an interface for registering a callback function that is invoked whenever a tile of the C matrix is completed. I can modify the code directly for our use case if it doesn't make sense to add something like this to the API. The scenario arises when performing a matrix multiply that produces a large C matrix, which is then reduced across nodes via MPI calls. The multiply and the reduction are synchronous now, but a per-tile callback seems an obvious way to get some overlap between internode communication and computation.
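To make the idea concrete, here is a minimal sketch of the pattern. The `on_tile_complete` hook name and its `(r0, r1, c)` signature are hypothetical (nothing like this exists in Tiled-MM today), and a thread pool stands in for a nonblocking MPI reduction (e.g. `MPI_Iallreduce`) so the example is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def tiled_multiply(a, b, tile_size, on_tile_complete):
    """Compute C = A @ B one row-tile at a time, invoking a callback as
    each tile finishes so "communication" can overlap the remaining work."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for r0 in range(0, n, tile_size):
        r1 = min(r0 + tile_size, n)
        for i in range(r0, r1):
            for j in range(m):
                c[i][j] = sum(a[i][t] * b[t][j] for t in range(k))
        on_tile_complete(r0, r1, c)  # hypothetical hook: rows [r0, r1) are done
    return c

# Stand-in for a nonblocking MPI reduction: each completed tile is handed to
# a worker thread, overlapping with the multiply of the remaining tiles.
pool = ThreadPoolExecutor(max_workers=2)
pending = []

def start_reduction(r0, r1, c):
    # In the real code this would post MPI_Iallreduce on rows [r0, r1);
    # here we just sum the tile's entries on a worker thread.
    pending.append(pool.submit(lambda: sum(sum(row) for row in c[r0:r1])))

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = tiled_multiply(a, b, tile_size=1, on_tile_complete=start_reduction)
totals = [f.result() for f in pending]  # "wait" on the posted reductions
```

In the real setting the callback would post the nonblocking reduction and return immediately, with a final wait on all outstanding requests after the last tile, which is exactly where the compute/communication overlap comes from.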