Launch EFA instances from heterogeneous reservations #3768
Conversation
test/test_utils/ec2.py (outdated)

def launch_efa_with_heterogenous_reservations(
    ec2_client, ec2_instance_type, ec2_run_instances_definition, fn_name=""
):
    minimum_number_of_instances = ec2_run_instances_definition["MinCount"]
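For context, the `ec2_run_instances_definition` argument above presumably mirrors the keyword arguments of boto3's `run_instances` call. A minimal hypothetical definition (the instance type and counts are illustrative, not taken from the PR) would look like:

```python
# Hypothetical run_instances definition; the keys mirror boto3 run_instances
# keyword arguments, and the values here are illustrative only.
ec2_run_instances_definition = {
    "InstanceType": "p4d.24xlarge",  # assumed EFA-capable type, not from the PR
    "MinCount": 4,
    "MaxCount": 4,
}

# The helper reads the required instance count the same way the diff does.
minimum_number_of_instances = ec2_run_instances_definition["MinCount"]
```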
Let's add a high-level description of what we are trying to do here.
Updated the function description.
test/test_utils/ec2.py (outdated)

    pass


def launch_efa_with_heterogenous_reservations(
If this works, we can deprecate launch_efa_with_reservations; I can't think of a use case where 2 instances must come from the same capacity reservation.
This reverts commit 9bb1796.
@@ -35,7 +37,10 @@ def can_run_pytorchddp(ecr_image):
    return Version(image_framework_version) in SpecifierSet(">=1.10")


# Skip due to known issue: https://github.com/pytorch/pytorch/issues/99074
@pytest.mark.skipif(
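The version gate in the hunk above can be exercised in isolation. A minimal sketch, using the `packaging` calls visible in the diff (the real helper first extracts the framework version from the ECR image URI, which is omitted here):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version


def can_run_pytorchddp(image_framework_version):
    # Same check as in the diff: PyTorch DDP tests require framework >= 1.10.
    return Version(image_framework_version) in SpecifierSet(">=1.10")
```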
I can remove these conditional skips if we still want to run these
GitHub Issue #, if available:
Note:
If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.
Description
Update EFA tests to use a heterogeneous approach for launching instances from capacity reservations. This allows us to make use of reservations where fewer than the minimum number of instances are available in any single reservation, but enough instances are available across multiple reservations, or across reservations plus open capacity.
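The allocation idea described above can be sketched without any AWS calls. `plan_reservation_usage` is a hypothetical helper (not the PR's actual implementation) showing the greedy approach: take what each capacity reservation can supply, then fall back to open capacity for any remainder.

```python
def plan_reservation_usage(available_per_reservation, required_count):
    """Greedily assign instances across capacity reservations.

    available_per_reservation: dict mapping reservation id -> instances available.
    Returns (plan, remainder); a positive remainder means the leftover
    instances must be launched from open (on-demand) capacity.
    """
    plan = {}
    remaining = required_count
    for reservation_id, available in available_per_reservation.items():
        if remaining == 0:
            break
        take = min(available, remaining)
        if take:
            plan[reservation_id] = take
            remaining -= take
    return plan, remaining
```

With this plan in hand, the launcher could issue one `run_instances` call per reservation (targeting it via `CapacityReservationTarget`) plus one call for the open-capacity remainder.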
Tests run
NOTE: By default, docker builds are disabled. In order to build your container, please update dlc_developer_config.toml and specify the framework to build in "build_frameworks"
NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:
sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
Additionally, please run the sagemaker local tests in at least one revision:
sagemaker_local_tests = true
Formatting
- I have run black -l 100 on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)
- DLC image/dockerfile
Builds to Execute
Click the checkbox to enable a build to execute upon merge.
Note: By default, pipelines are set to "latest". Replace with major.minor framework version if you do not want "latest".
Additional context
PR Checklist
NEURON/GRAVITON Testing Checklist
- Updated dlc_developer_config.toml in my PR branch by setting neuron_mode = true or graviton_mode = true
Benchmark Testing Checklist
- Updated dlc_developer_config.toml in my PR branch by setting ec2_benchmark_tests = true or sagemaker_benchmark_tests = true
Pytest Marker Checklist
- Added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
- Added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
- Added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
- Added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.