Add support for nccom test for neuron instances #505

Pavani-Panakanti · 2024-11-08T17:09:31Z

Issue #, if available:
Add support for nccom test for neuron instances

Description of changes:
We run nccl_test with mpirun. mpioperator understands the mpirun command and sets up all the config and connectivity required to run an MPIJob.
nccom_test uses the nccom binary instead. The nccom command must be given the worker pod IPs, and it needs SSH connectivity between pods. There is no existing support for running nccom_test within containers comparable to what nccl_test has. To make this work, this change adds SSH config to the Dockerfile and retrieves the worker pod IPs dynamically in the manifest via a launcher init container.
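As a rough illustration of the "retrieve worker pod IPs dynamically" step, the Go sketch below turns a list of pods into the space-separated host list that would then be handed to the nccom command. It is not the code in this PR, where the lookup happens in the manifest's launcher init container. The `pod` type, the `"worker"` role value, and the `workerHosts` helper are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pod is a hypothetical, minimal view of a Kubernetes pod for this sketch.
type pod struct {
	Name string
	Role string // assumed values: "launcher" or "worker"
	IP   string // empty until the pod has been assigned an address
}

// workerHosts returns the IPs of all worker pods, sorted for a stable
// order and joined by spaces. It returns an error if a worker has no IP
// yet or if there are no workers at all, so the caller can retry.
func workerHosts(pods []pod) (string, error) {
	var ips []string
	for _, p := range pods {
		if p.Role != "worker" {
			continue // skip the launcher and any unrelated pods
		}
		if p.IP == "" {
			return "", fmt.Errorf("worker pod %s has no IP yet", p.Name)
		}
		ips = append(ips, p.IP)
	}
	if len(ips) == 0 {
		return "", fmt.Errorf("no worker pods found")
	}
	sort.Strings(ips)
	return strings.Join(ips, " "), nil
}

func main() {
	// The two worker IPs are the ones that appear in the test log.
	pods := []pod{
		{Name: "launcher", Role: "launcher", IP: "192.168.90.10"}, // made-up launcher IP
		{Name: "worker-1", Role: "worker", IP: "192.168.81.149"},
		{Name: "worker-0", Role: "worker", IP: "192.168.72.67"},
	}
	hosts, err := workerHosts(pods)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(hosts)
}
```

Failing when a worker has no IP, rather than silently skipping it, means a retry loop in the init container would wait until every worker is reachable before the launcher starts.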

Some of the code that sets up dependencies (efa device plugin, mpioperator, getting instance details) is taken from the current nccl_test code.
The most important files to review are:
e2e2/test/cases/neuron/manifests/multi-node-test-neuron.yaml
e2e2/test/cases/nvidia/manifests/efa-device-plugin.yaml

Testing
Info and debug logs in between are omitted from the output below.

dev-dsk-pavanipt-2a-0981017d %  go test -timeout 60m -v . -run ^TestNeuronNodes$/multi-node -args -neuronTestImage 632572741643.dkr.ecr.us-east-1.amazonaws.com/nccom-tests:latest_1 -efaEnabled true -nodeType trn1.32xlarge
2024/11/08 06:53:17 No node type specified. Using the node type trn1.32xlarge in the node groups.
=== RUN   TestNeuronNodes
=== RUN   TestNeuronNodes/multi-node
    neuron_test.go:129: Applying multi node manifest
    neuron_test.go:134: Applied manifest successfully
=== RUN   TestNeuronNodes/multi-node/NCCOM_test_succeeds
    neuron_test.go:145: Waiting for MPIJob to complete
    neuron_test.go:159: Test log for multi-node-nccom-test:
    neuron_test.go:160:  * Starting OpenBSD Secure Shell server sshd
           ...done.
        Running allr with size 8.0B
        [sudo] password for ubuntu: Warning: Permanently added '192.168.72.67' (ECDSA) to the list of known hosts.
        Warning: Permanently added '192.168.81.149' (ECDSA) to the list of known hosts.

        [1,1]<stdout>:+---+----+---------+---------+------------+-------+--------+---------+---------+[1,1]<stdout>:---------+--------+---------+---------+-------+[1,1]<stdout>:
               size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
                     8               2    fp32          214.14           0.00           0.00
                    16               4    fp32           89.97           0.00           0.00
                    32               8    fp32           83.39           0.00           0.00
                    64              16    fp32          568.82           0.00           0.00
                   128              32    fp32          319.74           0.00           0.00
                   256              64    fp32          419.38           0.00           0.00
                   512             128    fp32          546.94           0.00           0.00
                  1024             256    fp32          905.48           0.00           0.00
                  2048             512    fp32          885.73           0.00           0.00
                  4096            1024    fp32          786.26           0.00           0.01
                  8192            2048    fp32          908.77           0.01           0.02
                 16384            4096    fp32           775.7           0.02           0.04
                 32768            8192    fp32           761.6           0.04           0.08
                 65536           16384    fp32         1033.47           0.06           0.12
                131072           32768    fp32         1017.15           0.12           0.24
                262144           65536    fp32         1010.88           0.24           0.48
                524288          131072    fp32          576.92           0.85           1.67
               1048576          262144    fp32          477.79           2.04           4.02
               2097152          524288    fp32          343.24           5.69          11.20
               4194304         1048576    fp32          472.54           8.27          16.27
               8388608         2097152    fp32          674.48          11.58          22.80
              16777216         4194304    fp32         1034.57          15.10          29.73
              33554432         8388608    fp32         1250.98          24.98          49.18
              67108864        16777216    fp32         2458.93          25.42          50.04
             134217728        33554432    fp32          4537.6          27.55          54.23
             268435456        67108864    fp32         8192.21          30.52          60.08
             536870912       134217728    fp32        16713.25          29.92          58.90
            1073741824       268435456    fp32        32697.21          30.58          60.21
            2147483648       536870912    fp32        64892.09          30.82          60.68
        Avg bus bandwidth:	16.5521GB/s
        
--- PASS: TestNeuronNodes (367.44s)
    --- PASS: TestNeuronNodes/multi-node (367.44s)
        --- PASS: TestNeuronNodes/multi-node/NCCOM_test_succeeds (367.09s)
PASS
ok  	github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron	387.662s

dev-dsk-pavanipt-2a-0981017d %  go test -timeout 60m -v . -run ^TestMPIJobPytorchTraining$/single-node -args -neuronTestImage 632572741643.dkr.ecr.us-east-1.amazonaws.com/nccom-tests:latest_2              
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds
=== NAME  TestMPIJobPytorchTraining/single-node
    neuron_test.go:67: Test log for neuronx-single-node:
    --- PASS: TestNeuronNodes (115.70s)
    --- PASS: TestNeuronNodes/single-node (115.70s)
        --- PASS: TestNeuronNodes/single-node/Single_node_test_Job_succeeds (115.07s)
PASS
ok  	github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron	130.404s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattcjo · 2024-11-13T19:06:11Z

@Pavani-Panakanti Any sample outputs from a test run?

Pavani-Panakanti · 2024-11-13T19:17:20Z

> @Pavani-Panakanti Any sample outputs from a test run?

Added logs from the test

mattcjo

LGTM. Nice work, thanks @Pavani-Panakanti

Pavani-Panakanti added 8 commits October 25, 2024 19:56

nccom test (8231e5a)
changes (afb2823)
Add nccom_test changes (e2474bb)
update changes (90cf2d4)
update single node test (db3771f)
go fmt (3b23df4)
remove unused (5ee53e3)
remove comment (b6ba911)

mattcjo approved these changes Nov 13, 2024


Pavani-Panakanti merged commit 68cf94c into aws:main Nov 13, 2024
5 checks passed
