
Adding device objects for selecting GPU backends (and defaulting to CPU if none exists). #2297

Merged · 22 commits · Aug 4, 2023

Conversation

@codetalker7 (Contributor) commented on Jul 22, 2023

This PR addresses issue #2293 by creating a device object to be used instead of the gpu function. This method was proposed by @CarloLucibello in the mentioned issue, and has a few advantages. The implementation has been inspired by Lux's approach to handling GPU backends.

Currently, this is just a draft PR, containing the high-level idea. The main addition is the AbstractDevice type, along with four concrete types representing devices for different GPU backends (and a device representing the CPU).

As an example, we can now do the following (for the examples below, I had stored AMD as my gpu_backend preference):

# example without GPU
julia> using Flux;

julia> model = Dense(2 => 3)
Dense(2 => 3)       # 9 parameters

julia> device = Flux.get_device()           # this will just load the CPU device
[ Info: Using backend set in preferences: AMD.
┌ Warning: Trying to use backend AMD but package AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e] is not loaded.
│ Please load the package and call this function again to respect the preferences backend.
└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:496
[ Info: Running automatic device selection...
(::Flux.FluxCPUDevice) (generic function with 1 method)

julia> model = model |> device
Dense(2 => 3)       # 9 parameters

julia> model.weight
3×2 Matrix{Float32}:
 -0.304362  -0.700477
 -0.861201   0.67825
 -0.176017   0.234188

Here is the same example, now using CUDA:

julia> using Flux, CUDA;

julia> model = Dense(2 => 3)
Dense(2 => 3)       # 9 parameters

julia> device = Flux.get_device()
[ Info: Using backend set in preferences: AMD.
┌ Warning: Trying to use backend AMD but package AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e] is not loaded.
│ Please load the package and call this function again to respect the preferences backend.
└ @ Flux ~/fluxml/Flux.jl/src/functor.jl:496
[ Info: Running automatic device selection...
(::Flux.FluxCUDADevice) (generic function with 1 method)

julia> model = model |> device
Dense(2 => 3)       # 9 parameters

julia> model.weight
3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
  0.820013   0.527131
 -0.915589   0.549048
  0.290744  -0.0592499

PR Checklist

  • Finalize the implementation of get_device.
  • Add documentation for the new device objects, and possibly some docstrings.
  • Decide and finalize the warning messages; right now they are inspired by Lux.
  • Since Flux.get_device directly loads the preferences, are the references CUDA_LOADED, AMDGPU_LOADED and METAL_LOADED needed? On a similar note, the gpu(x) function and the GPUBACKEND global aren't really needed anymore.
  • Implement DataLoader support for device objects.
  • Add/update relevant tests.

Closes #2293.

@CarloLucibello (Member)

Looks very good already.

Since Flux.get_device directly loads the preferences, are the references CUDA_LOADED, AMDGPU_LOADED and METAL_LOADED needed? On a similar note, the gpu(x) function and the GPUBACKEND global aren't really needed anymore.

Since gpu is widely used, I would avoid deprecating it for some time. Let's just introduce the new features in this PR; we can think later about possible deprecation paths.

@ToucheSir (Member)

Instead of relying on pkg IDs, can we try to reuse some of the device and backend machinery from GPUArrays or KernelAbstractions?

@codetalker7 (Contributor, Author)

Instead of relying on pkg IDs, can we try to reuse some of the device and backend machinery from GPUArrays or KernelAbstractions?

Hello @ToucheSir. Do you have any specific functionality from GPUArrays/KernelAbstractions in mind that we could use here? Also, I guess the only use of Pkg IDs here is to see if the package has been loaded; I can drop that and use the CUDA_LOADED, AMDGPU_LOADED and METAL_LOADED flags?

@ToucheSir (Member)

Those two libraries have device and backend types already. I think we should try to use them directly or wrap them if we can.

I can drop that and use the CUDA_LOADED, AMDGPU_LOADED and METAL_LOADED flags?

That works. Another route would be to define a function on a backend which returns whether that backend is loaded. Each extension package then adds a method to that function, which means you can use dispatch and maybe save a few conditionals.

@codetalker7 (Contributor, Author) commented on Jul 24, 2023

Those two libraries have device and backend types already. I think we should try to use them directly or wrap them if we can.

I can drop that and use the CUDA_LOADED, AMDGPU_LOADED and METAL_LOADED flags?

That works. Another route would be to define a function on a backend which returns whether that backend is loaded. Each extension package then adds a method to that function, which means you can use dispatch and maybe save a few conditionals.

Hi @ToucheSir. I went through KernelAbstractions and GPUArrays. KernelAbstractions has backend types (namely Backend, GPU and CPU) and GPUArrays has just one backend type (AbstractGPUBackend). I couldn't find device types in either package (hopefully I haven't missed anything).

It seems logical to me that device types should be separate from backend types; I could wrap a backend type in a device type (i.e. by giving each device a backend property), but that wouldn't help much in our case, since I only need to check whether a package has been loaded, which I can do via the package extensions as you suggested.

Regarding your suggestion about having a method to check whether a device is loaded: this works nicely. I am thinking of doing the following (after removing the pkgid field from the four device types):

# Inside src/functor.jl
isavailable(device::AbstractDevice) = false
isfunctional(device::AbstractDevice) = false

# CPU is always functional and available
isavailable(device::FluxCPUDevice) = true
isfunctional(device::FluxCPUDevice) = true

Then, for example in ext/FluxCUDAExt/FluxCUDAExt.jl, I add the following:

Flux.isavailable(device::Flux.FluxCUDADevice) = true
Flux.isfunctional(device::Flux.FluxCUDADevice) = CUDA.functional()

After this, in all the conditionals in Flux.get_device, I can simply use isavailable(device) and isfunctional(device) instead of the pkgids.
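
Roughly, the automatic selection could then collapse into something like this (just a sketch, with candidate_devices as a placeholder for however the backend-ordered devices end up being stored):

# Sketch only -- not the final Flux.get_device implementation.
function select_device(candidate_devices)
    for device in candidate_devices              # e.g. CUDA, AMD, Metal devices, in preference order
        if isavailable(device) && isfunctional(device)
            return device
        end
    end
    return FluxCPUDevice()                       # fall back to the CPU device
end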

Does this sound fine? If so, I'll update the PR.

@CarloLucibello (Member)

Does this sound fine? If so, I'll update the PR.

Definitely a better solution than the pkgid one.

@codetalker7 (Contributor, Author)

Does this sound fine? If so, I'll update the PR.

Definitely a better solution than the pkgid one.

Sure, I have updated the PR with the new implementation. Also, for the documentation: since we haven't removed any old functionality in this PR, I'm just planning to add a new section on devices, device types, and the get_device method. Is there anything more I should be adding?

And for tests, I'm planning to add basic tests which just verify that the correct device is loaded for the required case. Also, I have access to a machine with NVIDIA GPUs. Is there a way to run tests without AMD/Metal GPUs?

@CarloLucibello (Member)

Sure, I have updated the PR with the new implementation. Also, for the documentation: since we haven't removed any old functionality in this PR, I'm just planning to add a new section on devices, device types, and the get_device method. Is there anything more I should be adding?

That's it, I guess. In a follow-up PR we should then start to deprecate gpu, at the documentation level only.

And for tests, I'm planning to add basic tests which just verify that the correct device is loaded for the required case. Also, I have access to a machine with NVIDIA GPUs. Is there a way to run tests without AMD/Metal GPUs?

I'm not sure I understand the question. In any case, we have CI running tests on the different devices through Buildkite. If you look at the structure of test/runtests.jl, you will see where to place the new tests for the various devices.

src/functor.jl Outdated
A type representing `device` objects for the `"Metal"` backend for Flux.
"""
Base.@kwdef struct FluxMetalDevice <: AbstractDevice
name::String = "Metal"
Member

Can we use dispatch to get these fixed names, and instead use the fields to store info about the actual device, e.g. the ordinal number or a wrapper around the actual device type(s) from each GPU package?

Contributor Author

Yes, I'll try to add this to the structs.

Contributor Author

Hi @ToucheSir. I've added a deviceID to each device struct, whose type is the device type from the corresponding GPU package. Since neither KernelAbstractions nor GPUArrays has a type hierarchy for device objects, I've moved the struct definitions to the package extensions. The device types are CUDA.CuDevice, AMDGPU.HIPDevice and Metal.MTLDevice respectively.
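
For concreteness, here is a rough sketch of the CUDA case (illustrative only, not the exact code in this PR):

# e.g. inside ext/FluxCUDAExt/FluxCUDAExt.jl -- a sketch of the idea, not the exact PR code
struct FluxCUDADevice <: Flux.AbstractDevice
    deviceID::CUDA.CuDevice                        # wraps the device type from CUDA.jl
end

Flux._get_device_name(::FluxCUDADevice) = "CUDA"   # fixed backend name via dispatch instead of a field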

One disadvantage of this approach: from what I understand, Flux leaves the work of managing devices to the GPU packages. So, if the user switches devices using functions from the GPU package, then our device object will also have to be updated (which currently doesn't happen). But if Flux users don't care which device is allocated to them, I think this works fine.

What do you think about this?

Member

In my mind, the whole point of calling this a device instead of a backend is that we'd allow users to choose which device they want their model to be transferred onto. If that's not feasible because of limitations in the way GPU packages must be used, I'd rather just call these backends instead. Others might have differing opinions on this, however, cc @CarloLucibello from earlier.

Contributor Author

In my mind, the whole point of calling this a device instead of a backend is that we'd allow users to choose which device they want their model to be transferred onto. If that's not feasible because of limitations in the way GPU packages must be used, I'd rather just call these backends instead. Others might have differing opinions on this, however, cc @CarloLucibello from earlier.

Yes, I agree. Also, if a user wants finer control over which device to use, isn't it better for them to just rely on, say, CUDA.jl directly?

If not, I think it won't be hard to add a device selection capability within Flux as well. But ultimately, we will be calling functions from GPU packages, which the user can just call themselves.

Contributor Author

Sure. I'm fine with either; also, if we are to implement an interface for handling multiple devices, wouldn't it be a good idea to first discuss the overall API we want, and the specific implementation details we need? (I'm asking because I'm not completely sure what I'll have to implement to handle multiple devices.)

For instance, when we talk about "multiple devices", do we mean letting the user work with just one device at a time, but with the ability to choose which one? Or do we mean using multiple devices simultaneously to train models? For the latter, I was going through DaggerFlux.jl, and it seems considerably more involved. The first idea seems easier to implement.

Member

Somewhere in the middle I think. Training on multiple GPUs is out of scope for this PR (we have other efforts looking into that), but allowing users to transfer models to any active GPU without calling device! beforehand every time would be great for ergonomics.

Contributor Author

Somewhere in the middle I think. Training on multiple GPUs is out of scope for this PR (we have other efforts looking into that), but allowing users to transfer models to any active GPU without calling device! beforehand every time would be great for ergonomics.

Sure, I think this shouldn't be too hard to implement. I have one idea for this.

Device methods

We will have the following methods:

function get_device()
    # this will be what we have right now
    # this returns an `AbstractDevice` whose deviceID
    # is the device with which the GPU package has been
    # initialized automatically
end

function get_device(backend::Type{<:KA.GPU}, ordinal::UInt)   # KA = KernelAbstractions
    # this will return an `AbstractDevice` from the given backend whose deviceID
    # is the device with the given ordinal number. These methods will be defined
    # in the corresponding package extensions.
end

With these functions, users can then specify the backend + ordinal of the GPU device which they want to work with.
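
For illustration, the proposed methods might be used like this (purely hypothetical at this point; CUDA.CUDABackend is just one example of a KA.GPU backend type):

using Flux, CUDA

# Hypothetical usage of the proposed methods -- the two-argument form does not exist yet.
device0 = Flux.get_device()                           # automatic selection, as in this PR
device1 = Flux.get_device(CUDA.CUDABackend, UInt(1))  # proposed: the CUDA device with ordinal 1
model   = Dense(2 => 3) |> device1                    # parameters would land on that specific device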

Model transfer between devices

Next, suppose we have a model m which is bound to an AbstractDevice, say device1, which has a backend1::Type{<:KA.GPU} and an ordinal1::UInt. Suppose device2 is another device object with backend2::Type{<:KA.GPU} and ordinal2::UInt.

Then, a call to device2(m) will do the following: if backend1 == backend2 and ordinal1 == ordinal2, then nothing happens and m is returned. Otherwise, device1 is "freed" of m (we'll have to do some device memory management here) and m is bound to device2.

In the above, the tricky part is how to identify the GPU backend + ordinal which m is bound to, and how to free the memory taken by m on the device. For simple models like Dense, I can do the following:

# suppose the backend is CUDA
julia> using Flux, CUDA;

julia> m = Dense(2 => 3) |> gpu;

julia> CUDA.device(m.weight)    # this gives me the device to which m is bound
CuDevice(0): NVIDIA GeForce GTX 1650

julia> CUDA.unsafe_free!(m.weight);   # just an idea, but something similar

Now clearly, I can't do something similar if m is a complex model. So we'll probably have to add some property to models which stores the device backend + ordinal to which they are bound.

Regarding freeing the GPU device memory: for CUDA, for example, we can probably use the CUDA.unsafe_free! method. But it's probably marked unsafe for a reason.

How does this idea sound, @ToucheSir @CarloLucibello? Any pointers/suggestions on how to track which device a model is bound to and how to do the memory management?

Member

If the actual data movement adaptors (e.g. FluxCUDAAdaptor) receive the device ID as an argument, then you only need to apply your detect + free logic at the level of individual parameters. fmap will take care of mapping the logic over a complex model.

In the simple case we are talking about, every parameter in the model should be bound to the same device. In general, model parallelism means that a model could be across multiple devices.
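
For example, a sketch of that shape with CUDA and a single device, just to show where fmap fits in:

using Flux, CUDA, Functors

# The transfer rule only needs to handle one leaf array at a time; fmap walks the
# nested model structure and applies it to every parameter.
to_device(x::AbstractArray{<:Number}) = CuArray(x)
to_device(x) = x                                  # leave non-array leaves (e.g. activation functions) alone

m = Chain(Dense(2 => 3, relu), Dense(3 => 1))
m_gpu = fmap(to_device, m)                        # every weight and bias is now a CuArray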

Member

Yes, the only thing we need to worry about is "can I move this array to the device the user asked for?", which sounds simple but might be tricky in practice if the GPU packages don't provide a way to do that directly. I hope there's a relatively straightforward way for most of them, but if not we can save that for future work and/or bug upstream to add it in for us :)

@CarloLucibello (Member)

Let's add some tests and get this PR merged; discussions on device selection can happen somewhere else.

@codetalker7 (Contributor, Author)

Let's add some tests and get this PR merged; discussions on device selection can happen somewhere else.

@CarloLucibello sure, I'm fine with that. I was trying to implement @darsnack's and @ToucheSir's ideas on data transfer, but if it's better, I can open a new issue to discuss those ideas and a new PR to implement them.

Also, I haven't touched the old code (except for minor changes), and haven't added any new tests either. But Nightly CI is still failing. How do I fix that?

@CarloLucibello (Member)

Nightly CI has been failing for a while, ignore it.

@ToucheSir (Member)

After talking with Tim and thinking it over, I think per-device movement should be as simple as the following (sketched in code below):

  1. Saving the current device
  2. Calling device!(new device ID)
  3. Allocating the destination array
  4. copy!-ing the data into the new array
  5. Switching back to the original device with device!

If that turns out to be too much work, I agree with Carlo's suggestion.
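
A minimal sketch of those five steps with CUDA.jl (illustrative only; it assumes copyto! between arrays on different devices works, possibly by staging through the host):

using CUDA

function copy_to_device(x::CuArray, dev::CuDevice)
    old_dev = CUDA.device()          # 1. save the current device
    CUDA.device!(dev)                # 2. switch to the destination device
    y = similar(x)                   # 3. allocate the destination array there
    copyto!(y, x)                    # 4. copy the data into the new array
    CUDA.device!(old_dev)            # 5. switch back to the original device
    return y
end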

docs/src/gpu.md: two resolved review threads (outdated)
@codetalker7 marked this pull request as ready for review on July 31, 2023, 20:59
@codetalker7 (Contributor, Author)

I've added a few device selection tests. Also, I found a bug in test/functors.jl: it should be AMDGPU_LOADED instead of AMD_LOADED (I've fixed it now). If I'm not wrong, some CI tests here were not catching that.

test/runtests.jl Outdated
Comment on lines 114 to 122
@test typeof(Flux.DEVICES[][Flux.GPU_BACKEND_ORDER["Metal"]]) <: Flux.FluxMetalDevice
device = Flux.get_device()

if Metal.functional()
@test typeof(Flux.DEVICES[][Flux.GPU_BACKEND_ORDER["Metal"]].deviceID) <: Metal.MTLDevice
@test typeof(device) <: Flux.FluxMetalDevice
@test typeof(device.deviceID) <: Metal.MTLDevice
@test Flux._get_device_name(device) in Flux.supported_devices()

Member

Let's not clutter test/runtests.jl with these tests. They can go within the ext_* folder.
Also, we need tests checking that x |> device transfers data correctly.

Contributor Author

Okay, I'll move the tests to the extension test files.

Regarding x |> device tests: the extensions have a massive test suite for the gpu function. Under the hood, x |> device also calls a gpu function; do I need to write all the same test cases for x |> device as well, or do simple tests like checking GPU array types suffice?
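
For instance, something as simple as this (a sketch for the CUDA case, assuming a functional CUDA setup):

using Flux, CUDA, Test

device = Flux.get_device()           # FluxCUDADevice when CUDA is loaded and functional
x = randn(Float32, 5, 5)
cx = x |> device
@test cx isa CuArray
@test Array(cx) ≈ x                  # the data itself is unchanged by the transfer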

Member

simple tests are enough

test/runtests.jl Outdated
@testset "CUDA" begin
include("ext_cuda/device_selection.jl")
Member

Suggested change (remove this line):
include("ext_cuda/device_selection.jl")

Let's group these tests under a single file, "get_devices.jl".

Contributor Author

Done. Please let me know if other tests need to be added (like more tests for x |> device on other types).

@CarloLucibello merged commit c2bd39d into FluxML:master on Aug 4, 2023
5 of 6 checks passed
@CarloLucibello (Member)

Fantastic work @codetalker7, thanks

@codetalker7 (Contributor, Author)

Fantastic work @codetalker7, thanks

Thank you!
