
Launch efa instances from heterogenous reservations #3768

Merged: 25 commits merged into aws:master on Mar 27, 2024

Conversation

@arjkesh (Contributor) commented on Mar 13, 2024

GitHub Issue #, if available:

Note:

  • If merging this PR should also close the associated Issue, please add that Issue # to the Linked Issues section on the right.

  • All PRs are checked weekly for staleness. This PR will be closed if it is not updated within 30 days.

Description

Update the EFA tests to use a heterogeneous approach for launching instances from capacity reservations. This lets us make use of reservations when a single reservation has fewer than the minimum required number of instances available, but enough instances are available across multiple reservations, or across reservations plus open capacity.
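To make the approach concrete, here is a minimal sketch of the idea, assuming a boto3 EC2 client. This is not the PR's launch_efa_with_heterogenous_reservations implementation: the function name, the exact fallback logic, and how the run_instances kwargs are threaded through are illustrative assumptions.

import copy


def launch_across_reservations(ec2_client, instance_type, run_instances_definition):
    """
    Hedged sketch only: try to satisfy MinCount by drawing instances from each
    active capacity reservation that matches the instance type, then fall back to
    open capacity for any remainder. run_instances_definition is assumed to be the
    kwargs dict that would normally be passed to ec2_client.run_instances.
    """
    remaining = run_instances_definition["MinCount"]
    launched = []
    reservations = ec2_client.describe_capacity_reservations(
        Filters=[
            {"Name": "instance-type", "Values": [instance_type]},
            {"Name": "state", "Values": ["active"]},
        ]
    )["CapacityReservations"]
    for reservation in reservations:
        if remaining <= 0:
            break
        available = reservation.get("AvailableInstanceCount", 0)
        if available <= 0:
            continue
        count = min(available, remaining)
        # Deep-copy so per-reservation tweaks never leak into later attempts
        # (the PR's deepcopy-related sanity test suggests a similar concern).
        definition = copy.deepcopy(run_instances_definition)
        definition["MinCount"] = definition["MaxCount"] = count
        definition["CapacityReservationSpecification"] = {
            "CapacityReservationTarget": {
                "CapacityReservationId": reservation["CapacityReservationId"]
            }
        }
        launched.extend(ec2_client.run_instances(**definition)["Instances"])
        remaining -= count
    if remaining > 0:
        # Fall back to open / on-demand capacity for whatever is still missing.
        definition = copy.deepcopy(run_instances_definition)
        definition["MinCount"] = definition["MaxCount"] = remaining
        launched.extend(ec2_client.run_instances(**definition)["Instances"])
    return launched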

Tests run

  • Successfully launched p5 instances (3100f27)
  • Successfully launched with p4d instances (9bb1796)
  • Tested launching from one (cbb3eab)
  • ...and then launching again quickly to see the failure mode (01f5ee6)
  • Tested default PR behavior (8731694)
  • Additional sanity test for the deepcopy update (d904077)

NOTE: By default, docker builds are disabled. In order to build your container, please update dlc_developer_config.toml and specify the framework to build in "build_frameworks"

  • I have run builds/tests on the commit for my changes.

NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true

Additionally, please run the sagemaker local tests in at least one revision:

  • sagemaker_local_tests = true
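As a hedged sketch, the flags above are toggled in dlc_developer_config.toml; the snippet below shows roughly how they might look. The section names and the build_frameworks value are assumptions based on the keys named in this template, not copied from the actual file.

# Illustrative only: exact section names and surrounding keys in
# dlc_developer_config.toml may differ.
[build]
build_frameworks = ["pytorch"]

[test]
sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true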

Formatting

DLC image/dockerfile

Builds to Execute


Click the checkbox to enable a build to execute upon merge.

Note: By default, pipelines are set to "latest". Replace with major.minor framework version if you do not want "latest".

  • build_pytorch_training_latest
  • build_pytorch_inference_latest
  • build_tensorflow_training_latest
  • build_tensorflow_inference_latest

Additional context

PR Checklist

  • I've prepended the PR tag with the frameworks/jobs this applies to: [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • If the PR changes affect SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
  • If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non-backward-compatible changes need special approval.)
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

NEURON/GRAVITON Testing Checklist

  • When creating a PR:
  • I've modified dlc_developer_config.toml in my PR branch by setting neuron_mode = true or graviton_mode = true

Benchmark Testing Checklist

  • When creating a PR:
  • I've modified dlc_developer_config.toml in my PR branch by setting ec2_benchmark_tests = true or sagemaker_benchmark_tests = true

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
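As a hedged illustration of the markers in this checklist, a new test might be decorated as follows; the test name and all marker arguments are placeholders, not taken from this PR.

import pytest


# Placeholder test showing the four markers from the checklist above.
@pytest.mark.model("N/A")
@pytest.mark.integration("efa")
@pytest.mark.multinode(2)
@pytest.mark.processor("gpu")
def test_example_efa_feature():
    ...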

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-deep-learning-containers-ci (bot) added labels on Mar 13, 2024: build (Reflects file change in build folder), ec2 (Reflects file change in dlc_tests/ec2 folder), pytorch (Reflects file change in pytorch folder), sagemaker_tests, Size:S (Determines the size of the PR), src (Reflects file change in src folder), test (Reflects file change in test folder)
def launch_efa_with_heterogenous_reservations(
    ec2_client, ec2_instance_type, ec2_run_instances_definition, fn_name=""
):
    minimum_number_of_instances = ec2_run_instances_definition["MinCount"]
Reviewer (Contributor): Let's add a high-level description of what we're trying to do here.

@arjkesh (Contributor, Author): Updated the fn description.
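For illustration, the requested high-level description could take a form like the sketch below; the actual docstring added in the PR is not shown in this thread, so the wording here is only an assumption.

def launch_efa_with_heterogenous_reservations(
    ec2_client, ec2_instance_type, ec2_run_instances_definition, fn_name=""
):
    """
    Hypothetical docstring (the PR's actual wording may differ): launch the
    requested number of EFA instances by combining capacity from multiple
    capacity reservations, falling back to open capacity when no single
    reservation can satisfy MinCount on its own.
    """
    minimum_number_of_instances = ec2_run_instances_definition["MinCount"]
    ...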

test/test_utils/ec2.py (resolved)
pass


def launch_efa_with_heterogenous_reservations(
Reviewer (Contributor): If this works, we can deprecate launch_efa_with_reservations; I can't think of a use case where 2 instances must come from the same CR.

@@ -35,7 +37,10 @@ def can_run_pytorchddp(ecr_image):
    return Version(image_framework_version) in SpecifierSet(">=1.10")


# Skip due to known issue: https://github.com/pytorch/pytorch/issues/99074
@pytest.mark.skipif(
@arjkesh (Contributor, Author): I can remove these conditional skips if we still want to run these.

@@ -35,6 +37,10 @@ def can_run_pytorchddp(ecr_image):
    return Version(image_framework_version) in SpecifierSet(">=1.10")


@pytest.mark.skipif(
@arjkesh (Contributor, Author): I can remove these conditional skips if we still want to run these.
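For context, a conditional skip of the shape discussed in these comments might look like the sketch below. The helper and its argument are simplified assumptions: the repo's can_run_pytorchddp takes an ECR image URI and extracts the framework version, and the PR's real skip condition is not visible in the excerpt above.

import pytest
from packaging.specifiers import SpecifierSet
from packaging.version import Version


# Simplified stand-in for the version gate seen in the diff context above.
def _can_run_pytorchddp(image_framework_version):
    return Version(image_framework_version) in SpecifierSet(">=1.10")


# Hypothetical conditional skip referencing the known issue noted in the diff.
@pytest.mark.skipif(
    not _can_run_pytorchddp("1.9.1"),
    reason="Skip due to known issue: https://github.com/pytorch/pytorch/issues/99074",
)
def test_pytorchddp_example():
    ...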

@arjkesh changed the title from "Test heterogenous CR approach for efa tests" to "Launch efa instances from heterogenous reservations" on Mar 19, 2024
@arjkesh marked this pull request as ready for review on Mar 19, 2024 at 06:27
@arjkesh requested review from a team as code owners on Mar 19, 2024 at 06:27
@arjkesh enabled auto-merge (squash) on Mar 27, 2024 at 00:41
@arjkesh merged commit 6adcb21 into aws:master on Mar 27, 2024
28 checks passed
evakravi pushed a commit to evakravi/deep-learning-containers that referenced this pull request Sep 5, 2024