-
-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Distributed data parallel training support #2464
Conversation
I suggest removing NCCL from this PR and just focusing on MPI |
@CarloLucibello I was able to move it forward according to your suggestions and MPI example with training works 🎉 (I still need to do some cleanup tho) |
Update: both MPI and NCCL work. still in the draft state, requires some work - should be easier from now on:
Tests added, confliicts resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a real tour de force, great work!
Assuming #2464 (comment) means you're starting to wrap things up, here are a couple heads up so they don't come as a surprise for the non-draft PR review.
Docs added so PR checklist completed. Ready for review 🚀 |
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>
The new test files should be included in ENV["FLUX_TEST_DISTRIBUTED_MPI"] = "true"
ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "true" and tests only conditional on them. We should test separately the MPI and NCCL backed. For the time being, we won't run the test on the CI because we would have to setup the MPI and NCCL stuff. We can figure out how to test on CI in a follow up PR. |
tests failure is unrelated and likely due to Enzyme. I opened an issue in EnzymeAD/Enzyme.jl#1738 |
Since Enzyme explicitly installs CUDA when running tests, we should avoid running them on AMDGPU/Metal CIs, until it gains support for them or switch to those backends properly. |
When updating to Flux.jl latest v0.14.20 , I get the following error, which wasn't there for v0.14.19 , I am on Julia 1.10.5. I have tested it and this is a precompilation error ERROR: LoadError: ArgumentError: Package FluxMPIExt does not have CUDA in its dependencies:
- You may have a partially installed environment. Try `Pkg.instantiate()`
to ensure all packages in the environment are installed.
- Or, if you have FluxMPIExt checked out for development and have
added CUDA as a dependency but haven't updated your primary
environment's manifest file, try `Pkg.resolve()`.
- Otherwise you may need to report an issue with FluxMPIExt I think CUDA, AMDGPU should be mentioned here Line 40 in 26f5c4f
cc @mcabbott |
update to above: the precompilation error happens only when both And CUDA (and AMDGPU) are being used in Flux.jl/ext/FluxMPIExt/FluxMPIExt.jl Line 3 in eece505
Flux.jl/ext/FluxMPIExt/FluxMPIExt.jl Line 10 in eece505
|
Hi @kishore-nori I've managed to replicate the issue, thanks for reporting it. |
Support for distributed data parallel training. Inspired by LuxDL/Lux.jl#500
This PR is still work in progress.
PR checklist to be continued.
PR Checklist
Both
MPIBackend
andNCCLBackend
are supported.Module can be used as in the below example (for distributed runs use
mpiexecjl --project=@. -n 3 julia distributed_MPI.jl
from your terminal, wheredistributed_MPI.jl
(feel free to also useNCCLBakend
'):