
Support for multi-GPU Parallelism #23

Open
MichaelOhlrogge opened this issue Jul 18, 2016 · 8 comments

@MichaelOhlrogge

It would be great to have support in this package for multi-GPU parallelism. I devised a rather hacky way to accomplish this and wrote it up here. I'm not certain my approach is the best one. I'd be happy to do some work to get this added to the package, but since I've never contributed to a package before and am not sure how good my approach is, it would be helpful to correspond a bit before just putting up a pull request.

How does the linked implementation look? Any comments, thoughts, suggestions? Obviously, what's posted there is just the rudiments of an implementation: it covers a single function, and not even all of the functionality for that one.

@kshyatt
Contributor

kshyatt commented Jul 18, 2016

You mean in terms of MPIDirect? Or in terms of farming out sub-pieces of the problem ourselves? (Sorry, I am extremely busy right now and haven't had a chance to look at your write-up.) I will try to take a look as soon as I'm able.

In general I agree multi-GPU would be really cool to have, but my impression is that it's very hard to do well. @vchuravy, do you have any thoughts?

@vchuravy

@MichaelOhlrogge I am not an expert on CUSPARSE, but from what I can tell from your code, what you are doing is splitting your problem across multiple devices. The code you wrote should be fine in that regard, but it won't work for more interesting problems.

Ideally we would want to address both device-to-device parallelism and node-level parallelism, along with good integration of the CUDA libraries with the Dagger.jl framework. That would answer the question of how to split the problem appropriately. @shashi

@SimonDanisch Are you planning on incorporating SparseArrays into GPUArray?


@MichaelOhlrogge
Author

MichaelOhlrogge commented Jul 18, 2016

@vchuravy Yes, you're absolutely right about the intent of the code I wrote. I agree that expanding the functionality for fancier things is certainly worthwhile. (Though I'll say that, from the perspective of the statistical modeling I'm using CUDA and Julia for right now, just splitting the data over multiple GPUs gets me most or all of what I need in a lot of situations, so I wouldn't downplay its usefulness too much. I don't have much of a sense of other use cases, though.)

Regarding the code I wrote to implement the simple version of the multi-GPU functionality, my biggest question was about the viability of just using the device() function (from CUDArt) to switch to a given device right before the ccall() invocation. In my initial tests this seemed reliable and didn't produce errors or unexpected behavior as far as I could detect, but I didn't have 100% confidence in it.

In native CUDA, multi-GPU parallelism would be implemented by creating a separate stream for each device. I wasn't certain how comparable it is to spin off multiple Julia tasks via asynchronous executions of the multiplication function and have each task switch to a particular device and then issue the ccall() right afterwards. It did seem necessary to include that call to device() before ccall(), which surprised me a bit: I had initially thought that passing the device-specific handle to ccall() would be enough to ensure the code operated on the desired device, but in practice that was insufficient. My guess is that without calling device() you don't have access to the proper handle in the first place.
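
Roughly, the task-per-device pattern looks like the sketch below (on_each_device and f are just illustrative placeholders, not anything from the package; f stands in for the device-specific ccall wrapper, e.g. a sparse matrix-vector multiply, which has to run while its device is active):

```julia
# Minimal sketch of the pattern described above, not the package's actual code.
# Assumption: CUDArt's device(dev) makes `dev` the active device for the
# current thread, and `f` wraps the device-specific ccall.
using CUDArt

function on_each_device(f, work_per_device)
    n = length(work_per_device)
    results = Any[nothing for _ in 1:n]
    @sync for i in 1:n
        @async begin
            device(i - 1)                      # switch devices right before the call
            results[i] = f(work_per_device[i])
        end
    end
    return results
end
```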

In particular, am I just getting lucky that the different tasks are sufficiently out of sync that each one can call device() and then ccall() immediately afterwards, before another task calls device()? Or am I (as I would hope) getting true parallelism here, in which each Julia task actually gets a separate connection (maybe even somehow a stream?) to a device?

Thoughts?

@vchuravy

To my knowledge, Tasks are never actually run in parallel, so what you are doing should be fine. Technically you should be controlling each device from a different thread; calling device() before the ccall is necessary because of that.

For streams take a look at https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/src/stream.jl, but I don't think we have great support for them yet.

@MichaelOhlrogge
Author

@vchuravy Ok, sounds good. I'll go ahead and work on putting up a full implementation based on this - might be a bit until I get to it, but hopefully not too long. Thanks for taking the time to look it over!

@vchuravy

Another idea for v0.5 is to actually work with threads.
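
A rough sketch of that thread-per-device idea might look like the following (threaded_on_each_device and f are hypothetical placeholders; this assumes Julia is started with JULIA_NUM_THREADS at least equal to the number of GPUs, and it is untested how well CUDArt's device() behaves when called from multiple OS threads):

```julia
# Hedged sketch only: threading support in v0.5 is experimental.
using Base.Threads
using CUDArt

function threaded_on_each_device(f, work_per_device)
    n = length(work_per_device)
    results = Any[nothing for _ in 1:n]
    @threads for i in 1:n
        device(i - 1)                      # each OS thread drives its own device
        results[i] = f(work_per_device[i])
    end
    return results
end
```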

Any contributions are welcome!

@MichaelOhlrogge
Author

@vchuravy Yeah, that's a good point. I tried the implementation with separate processes controlling the different devices, but ran into trouble with errors serializing pointers. I think the same issue would come up with threads, right? (I'm sure there are ways to deal with it, but they seemed more involved).

@vchuravy

With different processes you would have to do the splitting at an even higher level. With threads you won't have pointer serialisation issues, but threading support is quite bare-bones right now.

Different processes in Julia communicate with message passing and you can't serialise pointers across them. It would still be something useful to have, but I would encourage you there to look at Dagger.
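
For example, a process-level split might look roughly like the sketch below, where only host arrays ever cross process boundaries and each worker owns one device (work_on_my_gpu is a hypothetical helper; the CUDArt calls shown are assumed from its documented API rather than taken from this thread):

```julia
# Rough sketch: one worker per GPU, with all splitting done on host data.
addprocs(2)                        # adjust to the number of GPUs available
@everywhere using CUDArt

@everywhere function work_on_my_gpu(dev, hostchunk)
    device(dev)                    # this worker talks only to its own device
    d = CudaArray(hostchunk)       # upload the chunk from host memory
    # ... device-specific kernels / library calls on `d` would go here ...
    return to_host(d)              # ship plain host data back, never a pointer
end

chunks  = [rand(1000, 1000) for _ in 1:nworkers()]
results = pmap(work_on_my_gpu, 0:nworkers()-1, chunks)
```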
