Update DLAMI BASE AMI Logic to switch between OSS and Proprietary Nvidia Driver AMI #3760

sirutBuasai · 2024-03-08T23:45:53Z

GitHub Issue #, if available:

Note:

If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

Tests run

PyTorch 2.2 Training EC2 image tests
- e32684a
  - ECS test passed
  - EKS test passed
  - Sanity test passed
  - EC2 test passed
PyTorch 2.2 Training SM image tests
- ac78c1f
  - ECS test passed
  - EKS test passed
  - Sanity test passed
  - EC2 test passed
  - SM benchmark test passed
  - SM efa test passed
  - SM local test passed
  - SM remote test passed
  - SM rc test passed
TensorFlow 2.14 Training EC2/SM image tests
- 4013f89
  - ECS test passed
  - EKS test passed
  - Sanity test passed
  - EC2 test passed
  - SM benchmark test passed
  - SM efa test passed
  - SM local test passed
  - SM remote test passed
  - SM rc test passed
PyTorch 2.2 Inference EC2/SM image tests
- 64d9afa
  - ECS test passed
  - EKS test passed
  - Sanity test passed
  - EC2 test passed
  - SM benchmark test passed
  - SM efa test passed
  - SM local test passed
  - SM remote test passed
  - SM rc test passed

NOTE: By default, docker builds are disabled. In order to build your container, please update dlc_developer_config.toml and specify the framework to build in "build_frameworks"

I have run builds/tests on commit for my changes.

NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:

sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true

Additionally, please run the sagemaker local tests in at least one revision:

sagemaker_local_tests = true

Formatting

I have run black -l 100 on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)

DLC image/dockerfile

Builds to Execute

Click the checkbox to enable a build to execute upon merge.

Note: By default, pipelines are set to "latest". Replace with major.minor framework version if you do not want "latest".

build_pytorch_training_latest
build_pytorch_inference_latest
build_tensorflow_training_latest
build_tensorflow_inference_latest

Additional context

PR Checklist

I've prepended the PR tag with the frameworks/job this applies to: [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
If the PR changes affect SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non-backward-compatible changes need special approval.)
(If applicable) I've documented below the DLC image/dockerfile this relates to
(If applicable) I've documented below the tests I've run on the DLC image
(If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
(If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

NEURON/GRAVITON Testing Checklist

When creating a PR:

I've modified dlc_developer_config.toml in my PR branch by setting neuron_mode = true or graviton_mode = true

Benchmark Testing Checklist

When creating a PR:

I've modified dlc_developer_config.toml in my PR branch by setting ec2_benchmark_tests = true or sagemaker_benchmark_tests = true

Pytest Marker Checklist

(If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
(If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
(If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
(If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type
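For illustration, the markers named in this checklist can be stacked on a single test roughly as follows. This is a minimal sketch, not a test from this PR: the test name, the marker values ("resnet50", "ami_selection"), and the body are all hypothetical, and a real test suite would normally register these custom markers in its pytest configuration.

```python
import pytest


# Hypothetical test: the name, marker values, and body are illustrative only.
@pytest.mark.model("resnet50")  # model used by the test, or "N/A"
@pytest.mark.integration("ami_selection")  # feature being tested
@pytest.mark.processor("gpu")  # only when the test targets one processor type
def test_example_training_smoke():
    # Placeholder body so the sketch stays self-contained.
    assert True
```

Tests marked this way can be selected by marker name with pytest's -m option, for example `pytest -m processor`.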

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

test/dlc_tests/ec2/pytorch/training/test_pytorch_training_2_2.py

test/test_utils/__init__.py

test/dlc_tests/conftest.py

test/dlc_tests/benchmark/ec2/pytorch/inference/test_performance_pytorch_inference.py

test/test_utils/__init__.py

arjkesh · 2024-03-19T05:44:16Z

test/test_utils/__init__.py

-UBUNTU_20_BASE_DLAMI_US_WEST_2 = get_ami_id_boto3(
-    region_name="us-west-2", ami_name_pattern="Deep Learning Base GPU AMI (Ubuntu 20.04) ????????"
+# DLAMI Base is split between OSS Nvidia Driver and Propietary Nvidia Driver. see https://docs.aws.amazon.com/dlami/latest/devguide/important-changes.html
+UBUNTU_20_BASE_OSS_DLAMI_US_WEST_2 = get_ami_id_boto3(


I looked at the scope of removing these, and doing so would over-scope this PR. We can proceed with this for now.
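The split shown in the diff above, together with the commit messages "use dict for base dlami logic" and "use proprietary drier dlami as default", suggests a lookup along these lines. This is a rough offline sketch under stated assumptions, not the code merged in this PR: the two name patterns, the set of instance types that get the OSS driver, and every function name here are hypothetical. It matches names locally with fnmatch instead of calling get_ami_id_boto3, so it only shows how a "????????" suffix pattern and a dict keyed by driver type could fit together.

```python
from fnmatch import fnmatchcase

# Hypothetical name patterns; "????????" stands for an 8-character date suffix.
BASE_DLAMI_NAME_PATTERNS = {
    "oss": "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) ????????",
    "proprietary": "Deep Learning Base Proprietary Nvidia Driver GPU AMI (Ubuntu 20.04) ????????",
}

# Hypothetical set of instance types that should use the OSS driver AMI.
OSS_DRIVER_INSTANCE_TYPES = {"p4d.24xlarge", "p5.48xlarge"}


def driver_for_instance_type(instance_type):
    """Return "oss" for listed instance types, else the proprietary default."""
    return "oss" if instance_type in OSS_DRIVER_INSTANCE_TYPES else "proprietary"


def pick_latest_ami_name(instance_type, available_names):
    """Pick the newest AMI name matching the pattern for this instance type."""
    pattern = BASE_DLAMI_NAME_PATTERNS[driver_for_instance_type(instance_type)]
    matches = [name for name in available_names if fnmatchcase(name, pattern)]
    if not matches:
        raise LookupError(f"no AMI name matches pattern: {pattern}")
    # All matches share a prefix, so a YYYYMMDD suffix sorts chronologically.
    return max(matches)
```

Keeping the patterns in one dict means a new driver flavor or OS only needs a new entry, and defaulting to the proprietary pattern keeps existing instance types on the AMI family they were using before the split.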

arjkesh

Nice work and thorough testing

…idia Driver AMI (aws#3760)
- Update DLAMI BASE AMI Logic to switch between OSS and Proprietary Nvidia Driver AMI
- update gdrcopy to 2.4
- formatting
- disable buiild and fix sm local test instance ami
- use proprietary drier dlami as default
- fix ul20 and aml2 dlami name logic and test only ec2
- allow test efa
- update oss dlami list
- test curand
- ensure ec2 instance type fixture is ran before ec2 instance ami
- alter ami pulling logic
- usefixtures
- use parametrize
- use instance ami in parametrize
- add instace ami ad parametrize
- fix curand test
- correct ami name
- correct ami format
- use proprietary dlami for curand
- rebuild
- logging debug
- remove parametrize ami
- flip logic
- formatting
- print instance ami
- fix typo
- remove parametrize logic and fix proprietary dlami name pattern
- revert gdr copy
- update test with gdrcopy 2.4
- build test pt ec2
- build test pt sm
- remove gdrcopy ami
- sanity and sm local testonly
- build test pt sm
- formatting
- test pt sm
- build test pt sm
- disable build
- build test pt sm
- use get-login-password
- remove () from get-login
- test tensorflow
- use login_to_ecr_registry function
- use dict for base dlami logic
- use image uri instead
- fix aml2 dlami logic
- revert toml file

Sirut Buasai added 2 commits March 8, 2024 15:44

Update DLAMI BASE AMI Logic to switch between OSS and Proprietary Nvidia Driver AMI

840aef6

update gdrcopy to 2.4

95f6e46

sirutBuasai requested review from a team as code owners March 8, 2024 23:45

aws-deep-learning-containers-ci bot added the build, ec2, pytorch, Size:S, and test labels Mar 8, 2024

Sirut Buasai added 10 commits March 8, 2024 15:58

formatting

2e4b09b

disable buiild and fix sm local test instance ami

c28d32b

use proprietary drier dlami as default

61212b4

fix ul20 and aml2 dlami name logic and test only ec2

a0f8f82

allow test efa

ce3d3da

update oss dlami list

00fba94

test curand

434fbdc

ensure ec2 instance type fixture is ran before ec2 instance ami

115c33c

alter ami pulling logic

092b14b

usefixtures

b75b415

aws-deep-learning-containers-ci bot added the benchmark label Mar 12, 2024

Sirut Buasai and others added 10 commits March 12, 2024 13:53

use parametrize

9dc8fec

use instance ami in parametrize

95ddb86

add instace ami ad parametrize

1d9347f

Merge branch 'master' into update-ami

a78962d

fix curand test

0a9504d

correct ami name

c66555e

correct ami format

e5716bc

use proprietary dlami for curand

66ce9fc

rebuild

68273a4

logging debug

c70f0e9

sirutBuasai and others added 2 commits March 14, 2024 23:31

Merge branch 'master' into update-ami

f8538bf

formatting

f9633d6

arjkesh reviewed Mar 16, 2024

test/dlc_tests/ec2/pytorch/training/test_pytorch_training_2_2.py (resolved)

test/test_utils/__init__.py (resolved)

Sirut Buasai added 5 commits March 16, 2024 08:01

test pt sm

2682dac

build test pt sm

b561099

disable build

2b52804

build test pt sm

dd1e2b2

use get-login-password

e5fe485

sirutBuasai requested review from a team as code owners March 18, 2024 08:03

Sirut Buasai added 2 commits March 18, 2024 10:26

remove () from get-login

ac78c1f

test tensorflow

4013f89

arjkesh reviewed Mar 18, 2024

test/test_utils/__init__.py (outdated, resolved)

test/test_utils/__init__.py (outdated, resolved)

test/test_utils/__init__.py (outdated, resolved)

test/test_utils/__init__.py (resolved)

Sirut Buasai added 2 commits March 18, 2024 12:54

use login_to_ecr_registry function

9f74eeb

use dict for base dlami logic

185d4a5

arjkesh reviewed Mar 18, 2024

test/dlc_tests/conftest.py (resolved)

test/dlc_tests/benchmark/ec2/pytorch/inference/test_performance_pytorch_inference.py (outdated, resolved)

test/test_utils/__init__.py (resolved)

Sirut Buasai and others added 4 commits March 18, 2024 16:27

use image uri instead

a893035

fix aml2 dlami logic

64d9afa

revert toml file

ef03578

Merge branch 'master' into update-ami

feab20e

arjkesh reviewed Mar 19, 2024

arjkesh approved these changes Mar 19, 2024

sirutBuasai enabled auto-merge (squash) March 19, 2024 06:57

Lokiiiiii approved these changes Mar 19, 2024

ztlevi approved these changes Mar 19, 2024

sirutBuasai merged commit 5f8e78c into aws:master Mar 19, 2024
28 checks passed

sirutBuasai deleted the update-ami branch March 19, 2024 18:01

sirutBuasai mentioned this pull request Mar 26, 2024

[bug] ECR Get Login logs into the wrong account when run locally #899

Closed

6 tasks
