fix: add tensorflow and pytorch CUDA version tests for GPU image build #452

Merged · Jul 10, 2024 · 8 commits
Changes from all commits
11 changes: 11 additions & 0 deletions test/test_artifacts/v0/gpu-dependencies.test.Dockerfile
@@ -0,0 +1,11 @@
ARG SAGEMAKER_DISTRIBUTION_IMAGE
FROM $SAGEMAKER_DISTRIBUTION_IMAGE

ARG MAMBA_DOCKERFILE_ACTIVATE=1
Review comment:
What is this ARG for?

@TRNWWZ (Contributor, Author) · Jul 9, 2024:
As far as I know, these three lines activate the test environment; the same pattern is used in other tests: https://github.com/aws/sagemaker-distribution/blob/main/test/test_artifacts/v1/autogluon.test.Dockerfile#L1-L4


# Execute CUDA validation script:
# 1. Check if TensorFlow is installed with CUDA support for GPU image
# 2. Check if PyTorch is installed with CUDA support for GPU image
COPY --chown=$MAMBA_USER:$MAMBA_USER scripts/cuda_validation.py .
RUN chmod +x cuda_validation.py
RUN python3 cuda_validation.py
46 changes: 46 additions & 0 deletions test/test_artifacts/v0/scripts/cuda_validation.py
@@ -0,0 +1,46 @@
# Verify TensorFlow is built with CUDA support
import tensorflow as tf

cuda_available = tf.test.is_built_with_cuda()
if not cuda_available:
raise Exception("TensorFlow is installed without CUDA support for GPU image build.")
print("TensorFlow is built with CUDA support.")


# Verify PyTorch is installed with a CUDA build
import subprocess

# Run the micromamba list command and capture the output
result = subprocess.run(["micromamba", "list"], stdout=subprocess.PIPE, text=True)

# Split the output into lines
package_lines = result.stdout.strip().split("\n")

# Find the PyTorch entry
pytorch_entry = None
for line in package_lines:
    dependency_info = line.strip().split()
    if dependency_info and dependency_info[0] == "pytorch":
        pytorch_entry = dependency_info
        break

# If PyTorch is installed, print its information
if pytorch_entry:
    package_name = pytorch_entry[0]
    package_version = pytorch_entry[1]
    package_build = pytorch_entry[2]
    print(f"PyTorch: {package_name} {package_version} {package_build}")
    # Raise an exception if the build string does not reference CUDA
    if "cuda" not in package_build:
        raise Exception("PyTorch is installed without CUDA support for GPU image build.")

Review comment:
Maybe also do a print here "Pytorch is built with CUDA support"

Contributor Author replied:
Good point, updated the exception message
# Verify PyTorch CUDA works properly at runtime
# Because this check only passes on a GPU instance, it may fail in local tests
# To test manually on a GPU instance, run: "docker run --gpus all <image id>"
import torch

if not torch.cuda.is_available():
    raise Exception(
        "PyTorch is installed with CUDA support but CUDA is not working in the current "
        "environment. Make sure to execute this test case in a GPU environment."
    )
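
Side note (not part of this PR): a lighter-weight way to make the same build-time check is to read torch.version.cuda, which records the CUDA version PyTorch was compiled against and is None on CPU-only builds. A minimal sketch, assuming a recent PyTorch:

# Hedged sketch, not in this PR: query the build-time CUDA version directly,
# avoiding the shell-out to micromamba and the output parsing above.
import torch

if torch.version.cuda is None:
    raise Exception("PyTorch was built without CUDA support.")
print(f"PyTorch was built against CUDA {torch.version.cuda}")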
11 changes: 11 additions & 0 deletions test/test_artifacts/v1/gpu-dependencies.test.Dockerfile
@@ -0,0 +1,11 @@
ARG SAGEMAKER_DISTRIBUTION_IMAGE
FROM $SAGEMAKER_DISTRIBUTION_IMAGE

ARG MAMBA_DOCKERFILE_ACTIVATE=1

# Execute CUDA validation script:
# 1. Check if TensorFlow is installed with CUDA support for GPU image
# 2. Check if PyTorch is installed with CUDA support for GPU image
COPY --chown=$MAMBA_USER:$MAMBA_USER scripts/cuda_validation.py .
RUN chmod +x cuda_validation.py
RUN python3 cuda_validation.py
46 changes: 46 additions & 0 deletions test/test_artifacts/v1/scripts/cuda_validation.py
@@ -0,0 +1,46 @@
# Verify TensorFlow is built with CUDA support
import tensorflow as tf

cuda_available = tf.test.is_built_with_cuda()
if not cuda_available:
raise Exception("TensorFlow is installed without CUDA support for GPU image build.")
print("TensorFlow is built with CUDA support.")


# Verify PyTorch is installed with a CUDA build
import subprocess

# Run the micromamba list command and capture the output
result = subprocess.run(["micromamba", "list"], stdout=subprocess.PIPE, text=True)

# Split the output into lines
package_lines = result.stdout.strip().split("\n")

# Find the PyTorch entry
pytorch_entry = None
for line in package_lines:
    dependency_info = line.strip().split()
    if dependency_info and dependency_info[0] == "pytorch":
        pytorch_entry = dependency_info
        break

# If PyTorch is installed, print its information
if pytorch_entry:
    package_name = pytorch_entry[0]
    package_version = pytorch_entry[1]
    package_build = pytorch_entry[2]
    print(f"PyTorch: {package_name} {package_version} {package_build}")
    # Raise an exception if the build string does not reference CUDA
    if "cuda" not in package_build:
        raise Exception("PyTorch is installed without CUDA support for GPU image build.")

# Verify PyTorch CUDA works properly at runtime
# Because this check only passes on a GPU instance, it may fail in local tests
# To test manually on a GPU instance, run: "docker run --gpus all <image id>"
import torch

if not torch.cuda.is_available():
    raise Exception(
        "PyTorch is installed with CUDA support but CUDA is not working in the current "
        "environment. Make sure to execute this test case in a GPU environment."
    )
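
A possible complement on the TensorFlow side, sketched here only as an idea and not part of this PR: tf.config.list_physical_devices("GPU") reports the GPUs TensorFlow can actually see at runtime, mirroring what torch.cuda.is_available() checks for PyTorch, so like that check it would only pass on a GPU instance.

# Hedged sketch, not in this PR: runtime GPU visibility for TensorFlow.
# Like torch.cuda.is_available(), this only passes on a GPU instance.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if not gpus:
    raise Exception("TensorFlow cannot see a GPU in the current environment.")
print(f"TensorFlow sees {len(gpus)} GPU(s).")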
11 changes: 11 additions & 0 deletions test/test_artifacts/v2/gpu-dependencies.test.Dockerfile
@@ -0,0 +1,11 @@
ARG SAGEMAKER_DISTRIBUTION_IMAGE
FROM $SAGEMAKER_DISTRIBUTION_IMAGE

ARG MAMBA_DOCKERFILE_ACTIVATE=1

# Execute CUDA validation script:
# 1. Check if TensorFlow is installed with CUDA support for GPU image
# 2. Check if PyTorch is installed with CUDA support for GPU image
COPY --chown=$MAMBA_USER:$MAMBA_USER scripts/cuda_validation.py .
RUN chmod +x cuda_validation.py
RUN python3 cuda_validation.py
46 changes: 46 additions & 0 deletions test/test_artifacts/v2/scripts/cuda_validation.py
@@ -0,0 +1,46 @@
# Verify TensorFlow is built with CUDA support
import tensorflow as tf

cuda_available = tf.test.is_built_with_cuda()
if not cuda_available:
raise Exception("TensorFlow is installed without CUDA support for GPU image build.")
print("TensorFlow is built with CUDA support.")


# Verify PyTorch is installed with a CUDA build
import subprocess

# Run the micromamba list command and capture the output
result = subprocess.run(["micromamba", "list"], stdout=subprocess.PIPE, text=True)

# Split the output into lines
package_lines = result.stdout.strip().split("\n")

# Find the PyTorch entry
pytorch_entry = None
for line in package_lines:
    dependency_info = line.strip().split()
    if dependency_info and dependency_info[0] == "pytorch":
        pytorch_entry = dependency_info
        break

# If PyTorch is installed, print its information
if pytorch_entry:
    package_name = pytorch_entry[0]
    package_version = pytorch_entry[1]
    package_build = pytorch_entry[2]
    print(f"PyTorch: {package_name} {package_version} {package_build}")
    # Raise an exception if the build string does not reference CUDA
    if "cuda" not in package_build:
        raise Exception("PyTorch is installed without CUDA support for GPU image build.")

# Verify PyTorch CUDA works properly at runtime
# Because this check only passes on a GPU instance, it may fail in local tests
# To test manually on a GPU instance, run: "docker run --gpus all <image id>"
import torch

if not torch.cuda.is_available():
    raise Exception(
        "PyTorch is installed with CUDA support but CUDA is not working in the current "
        "environment. Make sure to execute this test case in a GPU environment."
    )
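
If the micromamba parsing ever needs hardening, one option is to let subprocess surface failures instead of returning empty output. A sketch under the assumption that micromamba is on PATH; the helper name is hypothetical:

# Hedged sketch, not in this PR: a more defensive variant of the query above.
# check=True raises CalledProcessError on a non-zero exit, and a missing
# micromamba binary raises FileNotFoundError instead of passing silently.
import subprocess

def pytorch_build_string():
    result = subprocess.run(
        ["micromamba", "list"], capture_output=True, text=True, check=True
    )
    # micromamba list columns: Name  Version  Build  Channel
    for line in result.stdout.splitlines():
        fields = line.split()
        if fields and fields[0] == "pytorch":
            return fields[2]
    return None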
1 change: 1 addition & 0 deletions test/test_dockerfile_based_harness.py
@@ -86,6 +86,7 @@ def test_dockerfiles_for_cpu(
("langchain-aws.test.Dockerfile", ["langchain-aws"]),
("mlflow.test.Dockerfile", ["mlflow"]),
("sagemaker-mlflow.test.Dockerfile", ["sagemaker-mlflow"]),
("gpu-dependencies.test.Dockerfile", ["pytorch", "tensorflow"]),
],
)
def test_dockerfiles_for_gpu(
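
For context, the parametrized entry above points the harness at the new test Dockerfile along with what appear to be the packages (pytorch, tensorflow) the test depends on. Below is a minimal sketch of the build step such a harness performs; the function is hypothetical, not the repository's actual implementation in test_dockerfile_based_harness.py:

# Hypothetical sketch, not the real harness: build the test Dockerfile against
# a candidate image. The `RUN python3 cuda_validation.py` step fails the build,
# and hence the test, if validation raises.
import subprocess

def run_dockerfile_test(dockerfile, image):
    subprocess.run(
        [
            "docker", "build",
            "-f", dockerfile,
            "--build-arg", f"SAGEMAKER_DISTRIBUTION_IMAGE={image}",
            ".",
        ],
        check=True,
    )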