feat: Distributed data parallel training support #2464

askorupka · 2024-06-22T19:33:46Z

Adds support for distributed data parallel training, inspired by LuxDL/Lux.jl#500.
This PR is still a work in progress.
PR checklist to be continued.

PR Checklist

Tests are added
Entry in NEWS.md
Documentation, if applicable

Both MPIBackend and NCCLBackend are supported.

The module can be used as in the example below. For distributed runs, run mpiexecjl --project=@. -n 3 julia distributed_MPI.jl from your terminal, where distributed_MPI.jl contains the code below (feel free to use NCCLBackend instead):

using Flux, MPI, NCCL, CUDA
using Random
using Optimisers
using Zygote
using Statistics

# Disallow slow scalar indexing of GPU arrays
CUDA.allowscalar(false)

# Set up the backend; every process started by mpiexecjl runs this script
DistributedUtils.initialize(MPIBackend)
backend = DistributedUtils.get_distributed_backend(MPIBackend)
rank = DistributedUtils.local_rank(backend)

model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu

# Synchronize the model from rank 0 so all processes start from the same parameters
model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)

x = rand(Float32, 1, 16) |> gpu
y = x .^ 3

# Wrap the optimiser for distributed use and synchronize its state from rank 0
opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
st_opt = Optimisers.setup(opt, model)
st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)

loss(model) = mean((model(x) .- y).^2)

# One in-place update step before the training loop
g_ = gradient(m -> loss(m), model)[1]
Optimisers.update!(st_opt, model, g_)

for epoch in 1:100
  global model, st_opt
  # Forward pass and pullback, then the gradient with respect to the model
  l, back = Zygote.pullback(loss, model)
  println("Epoch $epoch: Loss $l")
  g = back(one(l))[1]
  st_opt, model = Optimisers.update(st_opt, model, g)
end
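
As a conceptual aid for the example above, here is a minimal sketch of what data parallel training relies on: every process starts from the same parameters, computes a gradient on its own data, and applies the same averaged gradient. It is plain Julia with no MPI or Flux calls, the gradient values are made up, and `allreduce_mean` is a hypothetical stand-in, not the implementation used in this PR.

```julia
# Simulated per-rank gradients for one shared parameter vector (made-up values).
local_grads = [Float32[1, 2], Float32[3, 4], Float32[5, 6]]

# Hypothetical stand-in for an averaging all-reduce:
# every rank receives the mean of all local gradients.
allreduce_mean(grads) = sum(grads) ./ length(grads)

avg = allreduce_mean(local_grads)   # Float32[3.0, 4.0]

# Each simulated rank starts from identical parameters.
params = [Float32[0, 0] for _ in local_grads]
lr = 0.1f0

# Every rank applies the same averaged gradient, so the copies stay identical.
for p in params
    p .-= lr .* avg
end

@assert all(p -> p == params[1], params)
```

Because the update is identical on every rank, the model copies remain in sync without re-broadcasting parameters after each step.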

CarloLucibello · 2024-06-30T08:26:41Z

I suggest removing NCCL from this PR and just focusing on MPI

distributed.jl

askorupka · 2024-06-30T22:34:56Z

@CarloLucibello I was able to move it forward according to your suggestions, and the MPI training example works 🎉 (I still need to do some cleanup, though).
Details are in the comments above; they may be useful for you to look at.

askorupka · 2024-07-07T17:50:07Z

Update: both MPI and NCCL work.
To try them, run mpiexecjl --project=@. -n 3 julia distributed_MPI.jl or mpiexecjl --project=@. -n 3 julia distributed_NCCL.jl from your terminal.

Still in draft state and requires some work, but it should be easier from now on:

examples
docs

Tests added, conflicts resolved.

ToucheSir

This PR is a real tour de force, great work!

Assuming #2464 (comment) means you're starting to wrap things up, here are a couple heads up so they don't come as a surprise for the non-draft PR review.

Project.toml

ext/FluxMPINCCLExt/FluxMPINCCLExt.jl

askorupka · 2024-07-24T20:42:33Z

Docs added so PR checklist completed. Ready for review 🚀

CarloLucibello · 2024-08-04T16:54:25Z

The new test files should be included in test/runtests.jl.
As with other extensions, we should define flags in that file, like

ENV["FLUX_TEST_DISTRIBUTED_MPI"] = "true"
ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "true"

and run the tests only when the corresponding flag is set. We should test the MPI and NCCL backends separately.

For the time being, we won't run these tests on CI because we would have to set up MPI and NCCL there. We can figure out how to test on CI in a follow-up PR.
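
The flag-based gating described above could look roughly like the following sketch. The flag names come from the comment; the helper `flag_enabled`, the `DISTRIBUTED_TESTS` list, and both file names are hypothetical, since the thread does not say where the new test files live.

```julia
using Test

# Hypothetical helper: an unset flag counts as disabled.
flag_enabled(name) = get(ENV, name, "false") == "true"

# Flag names are from the review comment; the file names are placeholders.
const DISTRIBUTED_TESTS = [
    "FLUX_TEST_DISTRIBUTED_MPI"  => "distributed_mpi_tests.jl",
    "FLUX_TEST_DISTRIBUTED_NCCL" => "distributed_nccl_tests.jl",
]

# Each backend is gated by its own flag, so MPI and NCCL are tested separately.
for (flag, file) in DISTRIBUTED_TESTS
    if flag_enabled(flag)
        @testset "$flag" begin
            include(file)
        end
    else
        @info "Skipping $file; set ENV[\"$flag\"] = \"true\" to run it."
    end
end
```

With neither flag set, as on CI for now, both test files are skipped and only the informational messages are logged.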

Project.toml

CarloLucibello · 2024-08-18T15:13:16Z

The test failure is unrelated and likely due to Enzyme. I opened an issue at EnzymeAD/Enzyme.jl#1738.

pxl-th · 2024-08-19T11:37:11Z

Since Enzyme explicitly installs CUDA when running tests, we should avoid running them on the AMDGPU/Metal CIs until it gains support for them or we switch to those backends properly.

kishore-nori · 2024-09-26T05:17:44Z

After updating to the latest Flux.jl, v0.14.20, I get the following error, which did not occur with v0.14.19. I am on Julia 1.10.5. I have tested it, and this is a precompilation error:

ERROR: LoadError: ArgumentError: Package FluxMPIExt does not have CUDA in its dependencies:
- You may have a partially installed environment. Try `Pkg.instantiate()`
  to ensure all packages in the environment are installed.
- Or, if you have FluxMPIExt checked out for development and have
  added CUDA as a dependency but haven't updated your primary
  environment's manifest file, try `Pkg.resolve()`.
- Otherwise you may need to report an issue with FluxMPIExt

I think CUDA and AMDGPU should be listed here:

Flux.jl/Project.toml

Line 40 in 26f5c4f

FluxMPIExt = "MPI"

cc @mcabbott

kishore-nori · 2024-10-03T01:28:03Z

Update to the above: the precompilation error happens only when both Flux.jl (0.14.20) and MPI.jl are in the environment; if MPI.jl is not there, there is no problem. The error complains about the absence of CUDA, as shown above, even if CUDA is not in the environment. So I think it has to do with missing deps for the FluxMPIExt extension.

And CUDA (and AMDGPU) are being used in FluxMPIExt:

Flux.jl/ext/FluxMPIExt/FluxMPIExt.jl

Line 3 in eece505

using CUDA

Flux.jl/ext/FluxMPIExt/FluxMPIExt.jl

Line 10 in eece505

using AMDGPU
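
To make the mismatch concrete, the following sketch compares the extension's declared trigger (the `[extensions]` entry quoted above) with the packages the thread reports it loading. It only parses a TOML snippet with the standard library and does not load Flux; the `triggers` helper and the `loaded` list are illustrative assumptions built from the lines quoted in this thread.

```julia
using TOML

# The [extensions] entry quoted in the thread (Project.toml, line 40 in 26f5c4f).
project = TOML.parse("""
[extensions]
FluxMPIExt = "MPI"
""")

# Hypothetical helper: a trigger entry may be one name or a list of names,
# so normalise it to a vector of strings.
triggers(ext) = String.(vcat(project["extensions"][ext]))

# MPI is the declared trigger; CUDA and AMDGPU are the `using` lines
# quoted from FluxMPIExt.jl above.
loaded = ["MPI", "CUDA", "AMDGPU"]

# Packages the extension loads that are not among its triggers.
not_triggers = setdiff(loaded, triggers("FluxMPIExt"))
println(not_triggers)   # ["CUDA", "AMDGPU"]
```

This matches the reported symptom: with only MPI.jl installed, the extension is activated, and the precompilation error then names CUDA as missing from its dependencies.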

askorupka · 2024-10-03T09:59:42Z

Hi @kishore-nori, I've managed to replicate the issue; thanks for reporting it.
I'm working on a fix so that FluxMPIExt no longer requires CUDA, as we want to avoid adding too many deps.

askorupka force-pushed the distributed branch from fb7f9fe to 0eabc2a Compare June 23, 2024 15:18

CarloLucibello reviewed Jun 30, 2024 (three reviews)

distributed.jl (outdated, resolved)

askorupka force-pushed the distributed branch from e529d6f to 71ae53d Compare July 7, 2024 17:47

askorupka force-pushed the distributed branch from 5997a83 to 71ae53d Compare July 7, 2024 18:08

ToucheSir reviewed Jul 9, 2024

Project.toml (outdated, resolved)

ext/FluxMPINCCLExt/FluxMPINCCLExt.jl (resolved)

CarloLucibello and others added 8 commits July 21, 2024 19:34

first experiment distributed

0393894

feat: add DistributedUtils (MPI&NCCL working)

76ae025

feat: add DistributedUtils (MPI&NCCL working)

181cc9c

fix: no need for amdgpu now

40bf188

chore: cleanup&propose how to use amdgpu

450f62c

chore: add preferences for CUDA-awareness

8fbde8d

feat: fix devices for CUDA-awareness

599f506

chore: add tests

3382010

askorupka force-pushed the distributed branch from 2a10050 to 3382010 Compare July 21, 2024 17:49

chore: get rid of unnecessary deps

443875e

askorupka force-pushed the distributed branch from a4bad49 to 443875e Compare July 21, 2024 18:16

chore: update NEWS.md

330b20b

CarloLucibello marked this pull request as ready for review July 24, 2024 07:24

askorupka added 4 commits July 24, 2024 10:20

chore: cleanup env

a255ff9

chore: update docs

3aab47d

chore: update docs & cleanup

2f54c88

chore: update docs & cleanup

8a984bb

askorupka requested a review from CarloLucibello July 24, 2024 20:42

CarloLucibello reviewed Jul 31, 2024

askorupka and others added 7 commits August 3, 2024 21:43

Update docs/src/guide/gpu.md

cee9150

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

5c85fe8

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

e151ead

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

2797924

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

f2cedd5

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

a3b62cb

Co-authored-by: Carlo Lucibello <[email protected]>

Update docs/src/guide/gpu.md

a144ccf

Co-authored-by: Carlo Lucibello <[email protected]>

askorupka commented Aug 3, 2024

docs/src/guide/gpu.md (outdated, resolved)

Update docs/src/guide/gpu.md

22b35a0

askorupka commented Aug 3, 2024

docs/src/guide/gpu.md (outdated, resolved)

Update docs/src/guide/gpu.md

7d03ef7

CarloLucibello reviewed Aug 4, 2024

docs/src/guide/gpu.md (resolved)

askorupka and others added 2 commits August 17, 2024 22:22

chore: add PR review suggestions

6c11e3c

Merge branch 'master' into distributed

6a6951a

CarloLucibello reviewed Aug 18, 2024

Project.toml (two threads, outdated, resolved)

CarloLucibello mentioned this pull request Aug 18, 2024

Flux Enzyme tests are failing EnzymeAD/Enzyme.jl#1738

Closed

askorupka added 4 commits August 19, 2024 11:54

chore: fix docs

41acd3f

fix: add runtests.jl

0e33cfa

chore: small docs update

58ae10f

chore: remove pkgs from deps

053dcc7

CarloLucibello approved these changes Aug 19, 2024

CarloLucibello merged commit d1ff714 into master Aug 19, 2024
3 of 9 checks passed

mcabbott deleted the distributed branch September 20, 2024 14:51

askorupka mentioned this pull request Oct 4, 2024

fix: CUDA package optional for FluxMPIExt #2488

Merged
