Implement interface for data transfer across GPU devices. #2308

Merged: 29 commits, Aug 21, 2023

Conversation

@codetalker7 (Contributor) commented on Aug 9, 2023:

This PR addresses issue #2302. The goal is to pass device information through the adaptors (FluxCUDAAdaptor, FluxAMDAdaptor and FluxMetalAdaptor), which is then used to perform data transfer at the parameter level only.

Additionally, new Flux.get_device methods are added in the extensions; these return a Flux.AbstractDevice for a specific backend and device ordinal.
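
To sketch the idea (field names follow this PR's usage, but the layout below is illustrative only, not the final implementation):

struct FluxCUDAAdaptor
    ordinal::Union{Nothing, UInt}    # nothing => keep using the currently active CUDA device
end

struct FluxCUDADevice <: Flux.AbstractDevice
    deviceID::CUDA.CuDevice          # handle of the underlying CUDA device
end

# Flux.get_device(CUDABackend, UInt(1)) then returns a FluxCUDADevice wrapping CuDevice(1),
# and calling that device object on a model moves its parameters to that device.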

Example for data movement with CUDA

julia> using Flux, CUDA;

julia> CUDA.devices()
CUDA.DeviceIterator() for 3 devices:
0. GeForce RTX 2080 Ti
1. GeForce RTX 2080 Ti
2. TITAN X (Pascal)

julia> CUDA.device()        # default device is 0
CuDevice(0): GeForce RTX 2080 Ti

julia> m = Dense(2 => 3);

julia> m.weight       # m lives on CPU initially
3×2 Matrix{Float32}: 
 -0.705636  -0.369204
 -0.265636  -0.730012
 -0.461941   0.0260784

julia> device1 = Flux.get_device(CUDABackend, UInt(1))    # get device with id 1
(::Flux.FluxCUDADevice) (generic function with 1 method)

julia> device1.deviceID
CuDevice(1): GeForce RTX 2080 Ti

julia> m = m |> device1;    # transfer m from CPU to device1

julia> m.weight
3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 -0.705636  -0.369204
 -0.265636  -0.730012
 -0.461941   0.0260784

julia> device(m.weight)    # verify that m is attached to device1
CuDevice(1): GeForce RTX 2080 Ti

julia> device2 = Flux.get_device(CUDABackend, UInt(2))    # get device with id 2
(::Flux.FluxCUDADevice) (generic function with 1 method)

julia> m = m |> device2;         # now transfer m to device2

julia> m.weight
3×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 -0.370832  -1.05765
  0.926787  -0.192985
  0.686085  -0.00252177

julia> device(m.weight)
CuDevice(2): TITAN X (Pascal)

cc @CarloLucibello @ToucheSir @darsnack.

PR Checklist

  • Do we need to garbage collect manually?
  • What about adapt_storage for data types other than AbstractArray?
  • Add relevant tests.
  • Add relevant documentation.

@codetalker7 (Contributor, Author) commented:

Also, things seem to be working fine here, except when I try to directly print the new model:

julia> using Flux, CUDA

julia> m = Dense(2 => 3);

julia> device1 = Flux.get_device(CUDABackend, UInt(1));

julia> m = m |> device1
Dense(2 => 3)       # 9 parametersError showing value of type Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}:
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/libcuda.jl:27
  [2] check
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/libcuda.jl:34 [inlined]
  [3] cuMemcpyDtoHAsync_v2
    @ ~/.julia/packages/CUDA/tVtYo/lib/utils/call.jl:26 [inlined]
  [4] #unsafe_copyto!#8
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/memory.jl:397 [inlined]
  [5] (::CUDA.var"#1014#1015"{Bool, Vector{Bool}, Int64, CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/array.jl:482
  [6] #context!#887
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
  [7] context!
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
  [8] unsafe_copyto!(dest::Vector{Bool}, doffs::Int64, src::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/array.jl:475
  [9] copyto!
    @ ~/.julia/packages/CUDA/tVtYo/src/array.jl:429 [inlined]
 [10] getindex
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:12 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:136 [inlined]
 [12] _mapreduce(f::ComposedFunction{typeof(!), typeof(iszero)}, op::typeof(|), As::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; dims::Colon, init::Nothing)
    @ GPUArrays ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:73
 [13] _mapreduce
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:35 [inlined]
 [14] #mapreduce#29
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:31 [inlined]
 [15] mapreduce
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:31 [inlined]
 [16] any
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:82 [inlined]
 [17] _any
    @ ~/fluxml/Flux.jl/src/layers/show.jl:129 [inlined]
 [18] (::Flux.var"#330#331"{ComposedFunction{typeof(!), typeof(iszero)}})(x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux ~/fluxml/Flux.jl/src/layers/show.jl:131
 [19] _any(f::Flux.var"#330#331"{ComposedFunction{typeof(!), typeof(iszero)}}, itr::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, #unused#::Colon)
    @ Base ./reduce.jl:1215
 [20] any
    @ ./reduce.jl:1210 [inlined]
 [21] _any
    @ ~/fluxml/Flux.jl/src/layers/show.jl:131 [inlined]
 [22] _all
    @ ~/fluxml/Flux.jl/src/layers/show.jl:135 [inlined]
 [23] _nan_show(io::IOContext{Base.TTY}, x::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Flux ~/fluxml/Flux.jl/src/layers/show.jl:120
 [24] _layer_show(io::IOContext{Base.TTY}, layer::Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, indent::Int64, name::Nothing)
    @ Flux ~/fluxml/Flux.jl/src/layers/show.jl:86
 [25] _layer_show
    @ ~/fluxml/Flux.jl/src/layers/show.jl:75 [inlined]
 [26] show(io::IOContext{Base.TTY}, m::MIME{Symbol("text/plain")}, x::Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
    @ Flux ~/fluxml/Flux.jl/src/layers/show.jl:67
 [27] (::REPL.var"#55#56"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:276
 [28] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:557
 [29] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:262
 [30] display
    @ ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:281 [inlined]
 [31] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [32] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [33] invokelatest
    @ ./essentials.jl:813 [inlined]
 [34] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:305
 [35] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:287
 [36] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:557
 [37] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:285
 [38] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:899
 [39] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [40] invokelatest
    @ ./essentials.jl:813 [inlined]
 [41] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/LineEdit.jl:2647
 [42] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL ~/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:1300
 [43] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:514

I'm not exactly sure where the illegal memory access is happening. If I manually run the code that show uses for Dense models, it seems to work fine.

@ToucheSir (Member) commented:

I believe you have to switch to device!(1) before printing the model because of how CUDA.jl works. That's a fine limitation though and one I totally expected. We can just document it with a small note for now.
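
A minimal sketch of that workaround, reusing the session above (printed output abbreviated):

julia> using Flux, CUDA;

julia> device1 = Flux.get_device(CUDABackend, UInt(1));

julia> m = Dense(2 => 3) |> device1;

julia> CUDA.device!(1);      # make device 1 the task's active device before displaying

julia> m                     # show can now read the parameters without crossing devices
Dense(2 => 3)       # 9 parameters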

@ToucheSir (Member) left a review comment:

Care to try this for the other GPU backends as well?

(Review thread on ext/FluxCUDAExt/functor.jl: outdated, resolved)
@codetalker7 (Contributor, Author) replied:

Care to try this for the other GPU backends as well?

Yes, will update the PR. Thanks for the suggestions.

@CarloLucibello (Member) commented on Aug 11, 2023:

I think the interface should be simpler. I suggest:

device = get_device("CUDA", 1)

@codetalker7 (Contributor, Author) commented:

Care to try this for the other GPU backends as well?

Hi @ToucheSir. I can certainly attempt to implement this for Metal and AMDGPU, but I don't have access to those GPUs (so I won't be able to test on them). Just wondering: is there any way for me to test AMD/Metal code?

@codecov-commenter commented on Aug 12, 2023:

Codecov Report

Patch coverage: 6.41% and project coverage change: -4.24% ⚠️

Comparison: base (656175d) at 79.47% vs. head (0063bc0) at 75.23%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2308      +/-   ##
==========================================
- Coverage   79.47%   75.23%   -4.24%     
==========================================
  Files          31       31              
  Lines        1749     1805      +56     
==========================================
- Hits         1390     1358      -32     
- Misses        359      447      +88     
Files Changed Coverage Δ
ext/FluxAMDGPUExt/FluxAMDGPUExt.jl 5.00% <0.00%> (-1.67%) ⬇️
ext/FluxAMDGPUExt/functor.jl 0.00% <0.00%> (ø)
ext/FluxCUDAExt/FluxCUDAExt.jl 28.57% <0.00%> (-8.93%) ⬇️
ext/FluxCUDAExt/functor.jl 0.00% <0.00%> (ø)
ext/FluxMetalExt/FluxMetalExt.jl 7.14% <0.00%> (-1.20%) ⬇️
src/functor.jl 59.66% <55.55%> (-20.18%) ⬇️

... and 3 files with indirect coverage changes


@ToucheSir (Member) commented:

Just wondering: is there any way for me to test AMD/Metal code?

Probably not locally, but since you'll be writing tests those should be auto-tested by CI. Also happy to help test the AMDGPU routines locally once those are in a good place.

@codetalker7 (Contributor, Author) commented:

I've implemented the same interface for AMDGPU and Metal as well. The tests are already catching an error for Metal; I suspect it's because I'm using the registryID of a Metal device. Unlike the CUDA and AMDGPU packages, Metal doesn't provide a way to get the ordinal of a Metal GPU device; the closest thing I found was the registry ID.

Is there a nice way to get the ordinal of a Metal device? Maybe we can match our device against the Metal.devices() list and return the first matching index, but how do we match?
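
To illustrate the matching idea (untested, and not part of this PR; it assumes each entry of Metal.devices() exposes the registryID property mentioned above):

# Hypothetical helper: recover a zero-based ordinal by matching registry IDs.
function metal_ordinal(dev)
    idx = findfirst(d -> d.registryID == dev.registryID, Metal.devices())
    return idx === nothing ? nothing : UInt(idx - 1)    # mirror CUDA's zero-based ordinals
end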

@ToucheSir (Member) commented:

I wasn't aware Metal.jl had this limitation. If there's no clean way to do it now, we can always disable specific device selection for it.

@codetalker7 (Contributor, Author) commented:

I wasn't aware Metal.jl had this limitation. If there's no clean way to do it now, we can always disable specific device selection for it.

Okay, I don't see a clean way to do this at the moment, so I'll skip adding this functionality for Metal for now.

@codetalker7 (Contributor, Author) commented:

Added some tests for CUDA and AMD. How do they look? Next I'll add some documentation.

@codetalker7 (Contributor, Author) commented on Aug 18, 2023:

I'm getting a weird error in Buildkite, which says: LoadError: cannot assign a value to imported variable CUDA.device from module Main. Is this because CUDA is functional in the AMD and Metal tests? This wasn't the case before. Did something change in the CI?
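
For context, the error can be reproduced in isolation; a hypothetical snippet (not the actual test code):

julia> using Flux, CUDA;

julia> device()      # touching the name resolves `device`, which CUDA exports, in Main
CuDevice(0): ...

julia> device = Flux.get_device(CUDABackend, UInt(0))    # re-binding the same name now fails
ERROR: cannot assign a value to imported variable CUDA.device from module Main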

In ext/FluxCUDAExt/functor.jl:

adapt_storage(to::FluxCUDAAdaptor, x) = CUDA.cu(x)
function adapt_storage(to::FluxCUDAAdaptor, x::AbstractArray)
    typeof(to.ordinal) <: Nothing && return CUDA.cu(x)
A Member left a review comment on the excerpt above:

Suggested change:
-typeof(to.ordinal) <: Nothing && return CUDA.cu(x)
+to.ordinal === nothing && return CUDA.cu(x)
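
For context, a hedged sketch of how the non-nothing branch of this method might continue (the merged code may differ): switch to the requested device, copy the array there, then restore the caller's device.

function adapt_storage(to::FluxCUDAAdaptor, x::AbstractArray)
    to.ordinal === nothing && return CUDA.cu(x)
    old_device = CUDA.device()      # remember the caller's active device
    CUDA.device!(to.ordinal)        # switch to the requested device
    x_new = CUDA.cu(x)              # allocate and copy the array on that device
    CUDA.device!(old_device)        # restore the previous active device
    return x_new
end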

@CarloLucibello (Member) commented:

This seems good to go. You could remove the draft status.

@codetalker7 marked this pull request as ready for review on August 20, 2023 at 12:34.
@codetalker7 (Contributor, Author) commented:

This seems good to go. You could remove the draft status.

Sure. I'll add some documentation.

@codetalker7 (Contributor, Author) commented:

Completed everything. Please let me know if something needs to be changed.

@CarloLucibello merged commit 36fbdf1 into FluxML:master on Aug 21, 2023.
6 of 7 checks passed.