Add support for nccom test for neuron instances #505

Merged: 8 commits into aws:main on Nov 13, 2024

Conversation

@Pavani-Panakanti Pavani-Panakanti commented Nov 8, 2024

Issue #, if available:
Add support for nccom test for neuron instances

Description of changes:
We run nccl_test with mpirun. The MPI Operator understands the mpirun command and sets up all the config and connectivity required to run an MPIJob.
nccom_test uses the nccom binary instead. The nccom command must be given the worker pod IPs, and it also needs SSH connectivity between the pods. There is no equivalent of the nccl_test support for running nccom_test within containers. To make this work, I added SSH config to the Dockerfile and retrieve the worker pod IPs dynamically in the manifest via a launcher init container.

Some of the code that sets up dependencies (EFA device plugin, MPI Operator, fetching instance details) is taken from the current nccl_test code.
The most important files to review are:
e2e2/test/cases/neuron/manifests/multi-node-test-neuron.yaml
e2e2/test/cases/nvidia/manifests/efa-device-plugin.yaml

Testing
Info and debug logs in between are omitted.

dev-dsk-pavanipt-2a-0981017d %  go test -timeout 60m -v . -run ^TestNeuronNodes$/multi-node -args -neuronTestImage 632572741643.dkr.ecr.us-east-1.amazonaws.com/nccom-tests:latest_1 -efaEnabled true -nodeType trn1.32xlarge
2024/11/08 06:53:17 No node type specified. Using the node type trn1.32xlarge in the node groups.
=== RUN   TestNeuronNodes
=== RUN   TestNeuronNodes/multi-node
    neuron_test.go:129: Applying multi node manifest
    neuron_test.go:134: Applied manifest successfully
=== RUN   TestNeuronNodes/multi-node/NCCOM_test_succeeds
    neuron_test.go:145: Waiting for MPIJob to complete
    neuron_test.go:159: Test log for multi-node-nccom-test:
    neuron_test.go:160:  * Starting OpenBSD Secure Shell server sshd
           ...done.
        Running allr with size 8.0B
        [sudo] password for ubuntu: Warning: Permanently added '192.168.72.67' (ECDSA) to the list of known hosts.
        Warning: Permanently added '192.168.81.149' (ECDSA) to the list of known hosts.

        [1,1]<stdout>:+---+----+---------+---------+------------+-------+--------+---------+---------+[1,1]<stdout>:---------+--------+---------+---------+-------+[1,1]<stdout>:
               size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
                     8               2    fp32          214.14           0.00           0.00
                    16               4    fp32           89.97           0.00           0.00
                    32               8    fp32           83.39           0.00           0.00
                    64              16    fp32          568.82           0.00           0.00
                   128              32    fp32          319.74           0.00           0.00
                   256              64    fp32          419.38           0.00           0.00
                   512             128    fp32          546.94           0.00           0.00
                  1024             256    fp32          905.48           0.00           0.00
                  2048             512    fp32          885.73           0.00           0.00
                  4096            1024    fp32          786.26           0.00           0.01
                  8192            2048    fp32          908.77           0.01           0.02
                 16384            4096    fp32           775.7           0.02           0.04
                 32768            8192    fp32           761.6           0.04           0.08
                 65536           16384    fp32         1033.47           0.06           0.12
                131072           32768    fp32         1017.15           0.12           0.24
                262144           65536    fp32         1010.88           0.24           0.48
                524288          131072    fp32          576.92           0.85           1.67
               1048576          262144    fp32          477.79           2.04           4.02
               2097152          524288    fp32          343.24           5.69          11.20
               4194304         1048576    fp32          472.54           8.27          16.27
               8388608         2097152    fp32          674.48          11.58          22.80
              16777216         4194304    fp32         1034.57          15.10          29.73
              33554432         8388608    fp32         1250.98          24.98          49.18
              67108864        16777216    fp32         2458.93          25.42          50.04
             134217728        33554432    fp32          4537.6          27.55          54.23
             268435456        67108864    fp32         8192.21          30.52          60.08
             536870912       134217728    fp32        16713.25          29.92          58.90
            1073741824       268435456    fp32        32697.21          30.58          60.21
            2147483648       536870912    fp32        64892.09          30.82          60.68
        Avg bus bandwidth:	16.5521GB/s
        
--- PASS: TestNeuronNodes (367.44s)
    --- PASS: TestNeuronNodes/multi-node (367.44s)
        --- PASS: TestNeuronNodes/multi-node/NCCOM_test_succeeds (367.09s)
PASS
ok  	github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron	387.662s
dev-dsk-pavanipt-2a-0981017d %  go test -timeout 60m -v . -run ^TestMPIJobPytorchTraining$/single-node -args -neuronTestImage 632572741643.dkr.ecr.us-east-1.amazonaws.com/nccom-tests:latest_2              
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
=== RUN   TestMPIJobPytorchTraining/single-node/Single_node_test_Job_succeeds
=== NAME  TestMPIJobPytorchTraining/single-node
    neuron_test.go:67: Test log for neuronx-single-node:
    --- PASS: TestNeuronNodes (115.70s)
    --- PASS: TestNeuronNodes/single-node (115.70s)
        --- PASS: TestNeuronNodes/single-node/Single_node_test_Job_succeeds (115.07s)
PASS
ok  	github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron	130.404s
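
For anyone reading the multi-node results table above, the following Go sketch shows how the algbw and busbw columns appear to relate to size and time. It rests on assumptions that are not stated in this PR: that bandwidth is reported in GiB/s (2^30 bytes), that busbw follows the nccl-tests all-reduce convention of algbw × 2(n−1)/n, and that the run used n = 64 ranks. Under those assumptions the last row of the table is reproduced. The function names `algBW` and `busBW` are hypothetical.

```go
package main

import "fmt"

// algBW returns the algorithm bandwidth in GiB/s for a message of
// sizeBytes that took timeUs microseconds (assumed unit: 2^30 bytes).
func algBW(sizeBytes, timeUs float64) float64 {
	const gib = 1 << 30
	return (sizeBytes / gib) / (timeUs * 1e-6)
}

// busBW scales algorithm bandwidth by the all-reduce factor 2(n-1)/n,
// the convention used by nccl-tests (assumed to apply here too).
func busBW(alg float64, ranks int) float64 {
	n := float64(ranks)
	return alg * 2 * (n - 1) / n
}

func main() {
	// Last row of the table: size 2147483648 B, avg time 64892.09 us.
	alg := algBW(2147483648, 64892.09)
	bus := busBW(alg, 64) // 64 ranks is an assumption, not stated in the log
	fmt.Printf("algbw=%.2f busbw=%.2f\n", alg, bus) // prints "algbw=30.82 busbw=60.68"
}
```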

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattcjo commented Nov 13, 2024

@Pavani-Panakanti Any sample outputs from a test run?

@Pavani-Panakanti

> @Pavani-Panakanti Any sample outputs from a test run?

Added logs from the test

@mattcjo mattcjo left a comment

LGTM. Nice work, thanks @Pavani-Panakanti

@Pavani-Panakanti Pavani-Panakanti merged commit 68cf94c into aws:main Nov 13, 2024
5 checks passed