Support for multi-GPU Parallelism #23
You mean in terms of MPIDirect? Or in terms of farming out sub-pieces of the problem ourselves? (Sorry, I am extremely busy right now and haven't had a chance to look at your write-up.) I will try to take a look as soon as I'm able. In general I agree multi-GPU would be really cool to have, but my impression is that it's very hard to do well. @vchuravy, do you have any thoughts?
@MichaelOhlrogge I am not an expert on CUSPARSE, but from what I can tell from your code, what you are doing is splitting your problem across multiple devices. The code you wrote should be fine in that regard, but it won't work for more interesting problems. Ideally we would want to address both device-to-device parallelism and node-level parallelism, and to have a good integration of the CUDA libraries with the Dagger.jl framework; that would solve the question of how to split the problem appropriately. @shashi @SimonDanisch Are you planning on incorporating SparseArrays into GPUArray?
@vchuravy Yes, you're absolutely right about the intent of the code that I wrote. I do agree with you that expanding functionality for fancier things is certainly quite worthwhile. (Though I'll say that from the perspective of the statistical modeling that I am using CUDA and Julia for right now, just splitting the data up over multiple GPUs gets me most/all of what I need in a lot of situations, so I wouldn't necessarily downplay the usefulness of that too much. I don't have much of a sense, though, about other use cases.) Regarding the code that I wrote to implement the simple version of the multi-GPU functionality, I think the biggest question I had was about the viability of just using Tasks (@async) to drive the different devices. In particular, am I just getting lucky that the different tasks are sufficiently out of sync that each one can call device() before its own operations without stepping on the others? Thoughts?
To my knowledge Tasks are actually never run in parallel, so what you are doing should be fine. Technically you should be controlling each device from a different thread; calling device before the ccall is necessary because of that. For streams, take a look at https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/src/stream.jl, but I don't think we have great support for them yet.
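For concreteness, here is a minimal sketch of the Task-based pattern being discussed. It assumes the modern CUDA.jl API (device!, CuArray), whereas the thread itself predates it and used the older CUDAdrv/CUSPARSE packages; the function name and the dense matrix-vector example are purely illustrative.

```julia
using CUDA  # assumption: modern CUDA.jl; the original thread used the older CUDAdrv/CUSPARSE packages

# Split a matrix-vector product across all visible GPUs by row blocks.
# Each @async task selects its own device before issuing any CUDA calls,
# which is the "call device() before the ccall" pattern discussed above.
function multi_gpu_mul(A::Matrix{Float32}, x::Vector{Float32})
    ndev   = length(devices())
    chunks = collect(Iterators.partition(1:size(A, 1), cld(size(A, 1), ndev)))
    parts  = Vector{Vector{Float32}}(undef, length(chunks))
    @sync for (i, rows) in enumerate(chunks)
        @async begin
            device!(i - 1)              # bind this task to GPU i-1
            dA = CuArray(A[rows, :])    # upload only this GPU's row block
            dx = CuArray(x)
            parts[i] = Array(dA * dx)   # multiply on-device, copy the block back
        end
    end
    return reduce(vcat, parts)
end
```

In current CUDA.jl the active device is per-task state, so each @async task can hold a different device without interfering with the others; under the older packages, explicitly calling device() before every operation served the same purpose.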
@vchuravy OK, sounds good. I'll go ahead and work on putting up a full implementation based on this; it might be a bit until I get to it, but hopefully not too long. Thanks for taking the time to look it over!
Another idea for v0.5 is to actually work with threads. Any contributions are welcome!
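A rough sketch of what the thread-per-device variant could look like (again assuming the modern CUDA.jl API; start Julia with multiple threads, e.g. julia -t 2, for it to actually run in parallel; the function and chunk layout are illustrative):

```julia
using CUDA  # assumption: modern CUDA.jl API; illustrative only

# One Julia thread per chunk of work; each binds itself to a GPU before computing.
function multi_gpu_sum(chunks::Vector{Vector{Float32}})
    ndev    = length(devices())
    partial = Vector{Float32}(undef, length(chunks))
    Threads.@threads for i in 1:length(chunks)
        device!((i - 1) % ndev)               # round-robin the chunks over the GPUs
        partial[i] = sum(CuArray(chunks[i]))  # GPU reduction on the selected device
    end
    return sum(partial)
end
```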
@vchuravy Yeah, that's a good point. I tried the implementation with separate processes controlling the different devices, but ran into trouble with errors serializing pointers. I think the same issue would come up with threads, right? (I'm sure there are ways to deal with it, but they seemed more involved.)
With different processes you would have to do the splitting at an even higher level; processes in Julia communicate with message passing, and you can't serialise pointers across them. With threads you won't have pointer-serialisation issues, but threading support is quite bare-bones right now. Process-level splitting would still be something useful to have, but for that I would encourage you to look at Dagger.
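A sketch of the "split at a higher level" process-based variant: only plain host Arrays cross the process boundary, so no device pointers ever need to be serialised. This again assumes modern CUDA.jl plus the standard Distributed library, and the worker-to-GPU mapping (worker id minus 2 as the device ordinal) is an assumption for illustration only.

```julia
using Distributed
addprocs(2)                    # e.g. one worker per GPU; adjust to your setup
@everywhere using CUDA         # assumption: CUDA.jl available on all workers

# Each worker owns its device and only ever ships host Arrays back and forth.
function distributed_scale(x::Vector{Float32})
    halves  = [x[1:end÷2], x[end÷2+1:end]]
    futures = map(enumerate(workers())) do (i, w)
        @spawnat w begin
            device!(myid() - 2)               # assumed mapping: worker 2 -> GPU 0, worker 3 -> GPU 1
            Array(2f0 .* CuArray(halves[i]))  # do the work on-device, return a plain Array
        end
    end
    return reduce(vcat, fetch.(futures))
end
```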
It would be great to have support in this package for multi-GPU parallelism. I devised a very hacky sort of way to accomplish this and wrote it up here. I'm not certain my approach is the best, but I'd be happy to work a bit to get this added to the package. Since I've never contributed to a package before and am not certain how good my approach is, it would be helpful to correspond a bit before just putting up a pull request.
How does the linked implementation look? Any comments, thoughts, suggestions? Obviously, what's posted there is just the rudiments of an implementation, covering only a single function and not even all of the functionality for that.