Bump Neuron SDK components versions #485

Merged: 1 commit, Sep 20, 2024

nkvetsinski
Contributor

Issue #, if available:

Description of changes:

I noticed that the Neuron tests were failing:

torch.distributed.run: [WARNING] *****************************************
torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
torch.distributed.run: [WARNING] *****************************************
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: --xla_gpu_simplify_all_fp_conversions=false --xla_gpu_force_compilation_parallelism=8
torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 12) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
tests/testNeuronSingleAllReduce.py FAILED
Failures:
[1]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 13)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 13
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-18_02:41:46
host : neuronx-single-node-8q4fw
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 12)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 12
===================================================

I looked at a core dump from one of the runs, which pointed toward updating the SDK. The tests pass with the versions in this PR.
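
For anyone triaging a similar failure, here is a minimal sketch of how the two signals in the log above could be pulled out of captured test output: the flags named on the `Unknown flags in XLA_FLAGS` line, and the signal behind a negative `exitcode`. It is not part of this PR. The helper names (`unknown_xla_flags`, `signal_names`) and the `SAMPLE` string are hypothetical, and the regular expressions assume the log lines look like the ones quoted above.

```python
import re
import signal

# Assumption: the fatal line has the same shape as in the log quoted in this PR.
UNKNOWN_FLAGS_RE = re.compile(r"Unknown flags in XLA_FLAGS:\s*(?P<flags>.+)$")
# Assumption: exit codes appear as "exitcode : -6" or "exitcode: -6".
EXITCODE_RE = re.compile(r"exitcode\s*:\s*(?P<code>-?\d+)")


def unknown_xla_flags(log_text: str) -> list[str]:
    """Return the distinct flags reported as unknown, in first-seen order."""
    seen: dict[str, None] = {}
    for line in log_text.splitlines():
        match = UNKNOWN_FLAGS_RE.search(line)
        if match:
            for flag in match.group("flags").split():
                seen.setdefault(flag, None)
    return list(seen)


def signal_names(log_text: str) -> set[str]:
    """Translate negative exit codes (process ended by a signal) into signal names."""
    names = set()
    for match in EXITCODE_RE.finditer(log_text):
        code = int(match.group("code"))
        if code < 0:
            names.add(signal.Signals(-code).name)
    return names


# Hypothetical sample built from two lines of the log above.
SAMPLE = (
    "F external/xla/xla/parse_flags_from_env.cc:224] Unknown flags in XLA_FLAGS: "
    "--xla_gpu_simplify_all_fp_conversions=false "
    "--xla_gpu_force_compilation_parallelism=8\n"
    "exitcode : -6 (pid: 13)\n"
)

if __name__ == "__main__":
    print(unknown_xla_flags(SAMPLE))
    print(signal_names(SAMPLE))
```

This only summarizes what the log already says. It does not explain why the flags were set or rejected, which is what the core dump was needed for.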

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cartermckinnon cartermckinnon merged commit b35d508 into aws:main Sep 20, 2024
5 checks passed