
Support for multi-GPU Parallelism #23

Open
MichaelOhlrogge opened this issue Jul 18, 2016 · 8 comments

@MichaelOhlrogge

It would be great to have support in this package for multi-GPU parallelism. I devised a rather hacky way to accomplish this and wrote it up here. I'm not certain my approach is the best one. I'd be happy to do some work to get this added to the package, but since I've never contributed to a package before and am not sure how good my approach is, it would be helpful to correspond a bit before just putting up a pull request.

How does the linked implementation look? Any comments, thoughts, suggestions? Obviously, what's posted there is just the rudiments of an implementation: it covers a single function, and not even all of the functionality for that one.

@kshyatt
Contributor

kshyatt commented Jul 18, 2016

You mean in terms of MPIDirect? Or in terms of farming out sub-pieces of the problem ourselves? (Sorry, I am extremely busy right now and haven't had a chance to look at your write-up.) I will try to take a look as soon as I'm able.

In general I agree multi-GPU would be really cool to have, but my impression is that it's very hard to do well. @vchuravy, do you have any thoughts?

@vchuravy

@MichaelOhlrogge I am not an expert on CUSPARSE, but from what I can tell from your code, what you are doing is splitting your problem across multiple devices. The code you wrote should be fine in that regard, but it won't work for more interesting problems.

Ideally we would want to address both device-to-device parallelism and node-level parallelism, along with good integration of the CUDA libraries with the Dagger.jl framework. That would answer the question of how to split the problem appropriately. @shashi

@SimonDanisch Are you planning on incorporating SparseArrays into GPUArray?


@MichaelOhlrogge
Author

MichaelOhlrogge commented Jul 18, 2016

@vchuravy Yes, you're absolutely right about the intent of the code I wrote. I agree that expanding the functionality for fancier things is certainly worthwhile. (Though I'll say that, from the perspective of the statistical modeling I'm using CUDA and Julia for right now, just splitting the data over multiple GPUs gets me most or all of what I need in a lot of situations, so I wouldn't downplay its usefulness too much. I don't have much of a sense of other use cases, though.)

Regarding the code I wrote to implement the simple version of the multi-GPU functionality, my biggest question was about the viability of just using the device() function (from CUDArt) to switch to a given device right before the ccall() invocation. In my initial tests this seemed reliable and didn't produce errors or unexpected behavior as far as I could detect, but I didn't have 100% confidence in it.

In native CUDA, multi-GPU parallelism would be implemented by creating a separate stream for each device. I wasn't certain how comparable it is to spin off multiple Julia tasks via asynchronous executions of the multiplication function and have each task switch to a particular device and then issue the ccall() right afterwards. It did seem necessary to include that call to device() before ccall(), which surprised me a bit: I had initially thought that passing the device-specific handle to ccall() would be enough to ensure the code operated on the desired device, but in practice that was insufficient. My guess is that without calling device() you don't have access to the proper handle in the first place.
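
Roughly, the task-per-device pattern looks like the sketch below (on_each_device and f are just illustrative placeholders, not anything from the package; f stands in for the device-specific ccall wrapper, e.g. a sparse matrix-vector multiply, which has to run while its device is active):

```julia
# Minimal sketch of the pattern described above, not the package's actual code.
# Assumption: CUDArt's device(dev) makes `dev` the active device for the
# current thread, and `f` wraps the device-specific ccall.
using CUDArt

function on_each_device(f, work_per_device)
    n = length(work_per_device)
    results = Any[nothing for _ in 1:n]
    @sync for i in 1:n
        @async begin
            device(i - 1)                      # switch devices right before the call
            results[i] = f(work_per_device[i])
        end
    end
    return results
end
```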

In particular, am I just getting lucky that the different tasks are sufficiently out of sync that each one can call device() and then ccall() immediately afterwards, before another task calls device()? Or am I (as I would hope) getting true parallelism here, in which each Julia task actually gets a separate connection (maybe even somehow a stream?) to a device?

Thoughts?

@vchuravy

To my knowledge, Tasks are never actually run in parallel, so what you are doing should be fine. Technically you should be controlling each device from a different thread; calling device() before the ccall is necessary because of that.

For streams take a look at https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/src/stream.jl, but I don't think we have great support for them yet.

@MichaelOhlrogge
Author

@vchuravy Ok, sounds good. I'll go ahead and work on putting up a full implementation based on this - might be a bit until I get to it, but hopefully not too long. Thanks for taking the time to look it over!

@vchuravy

Another idea for v0.5 is to actually work with threads.
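
A rough sketch of that thread-per-device idea might look like the following (threaded_on_each_device and f are hypothetical placeholders; this assumes Julia is started with JULIA_NUM_THREADS at least equal to the number of GPUs, and it is untested how well CUDArt's device() behaves when called from multiple OS threads):

```julia
# Hedged sketch only: threading support in v0.5 is experimental.
using Base.Threads
using CUDArt

function threaded_on_each_device(f, work_per_device)
    n = length(work_per_device)
    results = Any[nothing for _ in 1:n]
    @threads for i in 1:n
        device(i - 1)                      # each OS thread drives its own device
        results[i] = f(work_per_device[i])
    end
    return results
end
```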

Any contributions are welcome!

@MichaelOhlrogge
Author

@vchuravy Yeah, that's a good point. I tried the implementation with separate processes controlling the different devices, but ran into trouble with errors serializing pointers. I think the same issue would come up with threads, right? (I'm sure there are ways to deal with it, but they seemed more involved).

@vchuravy

With different processes you would have to do the splitting at an even higher level. With threads you won't have pointer serialisation issues, but threading support is quite bare-bones right now.

Different processes in Julia communicate with message passing and you can't serialise pointers across them. It would still be something useful to have, but I would encourage you there to look at Dagger.
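
For example, a process-level split might look roughly like the sketch below, where only host arrays ever cross process boundaries and each worker owns one device (work_on_my_gpu is a hypothetical helper; the CUDArt calls shown are assumed from its documented API rather than taken from this thread):

```julia
# Rough sketch: one worker per GPU, with all splitting done on host data.
addprocs(2)                        # adjust to the number of GPUs available
@everywhere using CUDArt

@everywhere function work_on_my_gpu(dev, hostchunk)
    device(dev)                    # this worker talks only to its own device
    d = CudaArray(hostchunk)       # upload the chunk from host memory
    # ... device-specific kernels / library calls on `d` would go here ...
    return to_host(d)              # ship plain host data back, never a pointer
end

chunks  = [rand(1000, 1000) for _ in 1:nworkers()]
results = pmap(work_on_my_gpu, 0:nworkers()-1, chunks)
```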
