Parallelize neuron training processes for each neuron core #566

mselim00 · 2025-01-21T23:18:29Z

Enables full multi-processing across all Neuron cores and fixes an earlier issue where the world size wasn't determined correctly (each process ended up in its own process group). Switches the launcher from mpirun to torchrun.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
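
For readers unfamiliar with the world-size problem described above, here is a minimal sketch of the arithmetic involved, assuming a homogeneous cluster with one process per Neuron core. The function names are hypothetical and not part of this PR. The figure of 32 cores per node is an assumption inferred from the 2 nodes and 64 ranks mentioned later in this thread.

```go
package main

import "fmt"

// WorldSize returns the total number of training processes when every node
// runs the same number of processes (here, one per Neuron core).
func WorldSize(nodes, procsPerNode int) int {
	return nodes * procsPerNode
}

// GlobalRank maps a (node rank, local rank) pair to a unique rank in
// [0, world size), using the usual contiguous per-node layout.
func GlobalRank(nodeRank, localRank, procsPerNode int) int {
	return nodeRank*procsPerNode + localRank
}

func main() {
	// Assumed values: 2 nodes with 32 Neuron cores each gives 64 ranks.
	const nodes, coresPerNode = 2, 32
	fmt.Println("world size:", WorldSize(nodes, coresPerNode))
	fmt.Println("rank of node 1, local rank 5:", GlobalRank(1, 5, coresPerNode))
}
```

If each process instead forms its own group, every process sees a world size of 1 and rank 0, which matches the symptom this PR corrects.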

mselim00 · 2025-01-27T18:21:26Z

Will publish test results in a bit

test/cases/neuron-training/bert_training_test.go

mselim00 · 2025-01-27T20:08:10Z

Ran the test on 2 nodes locally with go test

2025/01/27 19:50:34 Parsed throughput from 56 ranks. Total=2798.46 samples/s, Average=49.97 samples/s
2025/01/27 19:50:34 Average Throughput: 49.97 samples/second
2025/01/27 19:50:34 Parsed average epoch time from 56 ranks. Sum=17.36s, Average=0.31s
--- PASS: TestBertTraining (675.16s)
    --- PASS: TestBertTraining/bert-training (675.16s)
        --- PASS: TestBertTraining/bert-training/Neuron_training_Job_succeeds (675.14s)
PASS

There's still an issue collecting the throughput info. This run parsed 56 ranks; other runs show a different, seemingly random number. I thought this was a regex issue, which I've fixed, but the problem persists. I might work on this separately, though: the ranks that do get parsed are probably representative of the group, and AFAICT the NVIDIA training test currently only parses from the master process.
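
As an illustration of the per-rank aggregation discussed above, here is a minimal sketch that reports how many distinct ranks were parsed alongside the total and average. The log line format, the regular expression, and the function name are all hypothetical; the actual test's parsing logic and the training script's output may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Hypothetical log line format, e.g. "[rank 3] throughput: 49.97 samples/s".
// The real training script's output may look different.
var throughputRe = regexp.MustCompile(`(?m)^\[rank (\d+)\] throughput: ([0-9.]+) samples/s$`)

// AggregateThroughput keeps the last throughput value reported by each rank and
// returns the number of distinct ranks seen, the total, and the per-rank average.
func AggregateThroughput(logs string) (ranks int, total, avg float64) {
	perRank := map[int]float64{}
	for _, m := range throughputRe.FindAllStringSubmatch(logs, -1) {
		rank, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		value, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		perRank[rank] = value // later lines overwrite earlier ones
	}
	for _, v := range perRank {
		total += v
	}
	ranks = len(perRank)
	if ranks > 0 {
		avg = total / float64(ranks)
	}
	return ranks, total, avg
}

func main() {
	logs := "[rank 0] throughput: 50.00 samples/s\n" +
		"[rank 1] throughput: 49.00 samples/s\n" +
		"[rank 1] throughput: 51.00 samples/s\n"
	ranks, total, avg := AggregateThroughput(logs)
	fmt.Printf("%d ranks, total=%.2f, average=%.2f\n", ranks, total, avg)
}
```

Returning the distinct rank count makes it possible to compare it against the expected world size and flag a run such as the one above, where 56 ranks were parsed.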

test/cases/neuron-training/bert_training_test.go

mattcjo · 2025-01-27T23:17:15Z

test/cases/neuron-training/manifests/training-comm-service.yaml

+apiVersion: v1
+kind: Service
+metadata:
+  name: training
+  labels:
+    app: training
+spec:
+  clusterIP: None
+  selector:
+    job-name: bert-training


Is explicit service creation required for torchrun?

Yeah, this Service is required so that we can dynamically resolve the master node's IP via bert-training-0.training in the job spec.
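
To make the role of the headless Service concrete, here is a minimal sketch of how the master address could feed into a torchrun invocation. It assumes the Job's pods get hostnames of the form `<job-name>-<index>` and a `subdomain` matching the Service name, which is what makes `bert-training-0.training` resolvable. The helper names, the port 29500, and the script name `train.py` are assumptions for illustration, not values taken from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// MasterAddr returns the DNS name of the index-0 pod of a Job as exposed by a
// headless Service in the same namespace, e.g. "bert-training-0.training".
func MasterAddr(jobName, serviceName string) string {
	return fmt.Sprintf("%s-0.%s", jobName, serviceName)
}

// TorchrunArgs builds an argument list for torchrun using the c10d rendezvous
// backend. The port and script are caller-supplied (hypothetical values below).
func TorchrunArgs(nodes, procsPerNode, port int, masterAddr, script string) []string {
	return []string{
		"--nnodes=" + strconv.Itoa(nodes),
		"--nproc_per_node=" + strconv.Itoa(procsPerNode),
		"--rdzv_backend=c10d",
		"--rdzv_endpoint=" + masterAddr + ":" + strconv.Itoa(port),
		script,
	}
}

func main() {
	addr := MasterAddr("bert-training", "training")
	// Assumed: 2 nodes, 32 processes per node, port 29500, script train.py.
	for _, arg := range TorchrunArgs(2, 32, 29500, addr, "train.py") {
		fmt.Println(arg)
	}
}
```

Because every pod derives the same endpoint from the Job and Service names, no pod needs to know the master's IP ahead of time.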

mattcjo · 2025-01-27T23:19:05Z

> There's some issue collecting the throughput info. This run showed 56 ranks, others show some other random number.

@mselim00 This is slightly concerning. Are you able to confirm that the expected number of processes is running even if the metrics seem off?

mselim00 · 2025-01-27T23:27:15Z

> @mselim00 This is slightly concerning. Are you able to confirm expected number of processes is running even if metrics seem off?

Yep, I manually checked that we have logs from all 64 ranks, that all of them print those metrics, and that all of them print the training-complete log line. I'm not sure of the root cause at the moment; I just know that it's probably not a regex issue at this point.
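
The manual check described above could be automated along these lines. This is only a sketch: the completion line format `[rank N] training complete` and the function name are hypothetical, since the thread does not show the actual log line.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Hypothetical completion marker; the real training-complete line may differ.
var completeRe = regexp.MustCompile(`(?m)^\[rank (\d+)\] training complete$`)

// MissingRanks returns, in ascending order, the ranks in [0, worldSize) that
// never logged the completion marker.
func MissingRanks(logs string, worldSize int) []int {
	seen := make(map[int]bool, worldSize)
	for _, m := range completeRe.FindAllStringSubmatch(logs, -1) {
		if rank, err := strconv.Atoi(m[1]); err == nil {
			seen[rank] = true
		}
	}
	var missing []int
	for r := 0; r < worldSize; r++ {
		if !seen[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	logs := "[rank 0] training complete\n[rank 2] training complete\n"
	fmt.Println(MissingRanks(logs, 4)) // prints [1 3]
}
```

An empty result for a world size of 64 would confirm that all ranks finished, independently of how many ranks the throughput parser picked up.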

mattcjo

LGTM. Approving since CI check failure is unrelated. Merge once fixed.

mselim00 added 2 commits January 21, 2025 23:16

Run 1 neuron training proc per neuron core

cd821a6

Switch to elastic launch, scale process count to neuron core count

b4d5fbc

mselim00 force-pushed the neuron-training branch from dd5d885 to b4d5fbc Compare January 24, 2025 07:17

mselim00 added 4 commits January 25, 2025 05:46

Enable worker <-> master collective communication

caa3b1d

Fix epoch time parsing, bump sdk version, cleanup

c25b2d7

Fix rank parsing, clean up excess logs

714576d

Fix average epoch/throughput regexs for multiprocessing

57d3591

mselim00 force-pushed the neuron-training branch from 874c16d to 57d3591 Compare January 27, 2025 18:20

mselim00 requested review from mattcjo and wwvela January 27, 2025 18:22

wwvela reviewed Jan 27, 2025

test/cases/neuron-training/bert_training_test.go

Rename neuron training to bert training

ca6777b

wwvela reviewed Jan 27, 2025

test/cases/neuron-training/bert_training_test.go (outdated)

mattcjo reviewed Jan 27, 2025

Formatting fix

1c81f9c

mattcjo approved these changes Jan 28, 2025

mselim00 changed the title ~~[WIP] Parallelize training processes for each neuron core~~ Parallelize neuron training processes for each neuron core Jan 28, 2025

mselim00 mentioned this pull request Jan 28, 2025

Fix kubetest2 build by replacing opencensus vanity url #569

Open
