This document explains how to use Docker images published at NVIDIA DGX Container Registry to run training jobs on the Batch AI service.
You can use these instructions to configure your own jobs or to update existing recipes to run jobs on these images.
Known limitations:
- Batch AI currently doesn't support running distributed CNTK jobs on nvcr.io/nvidia/cntk:17.* images;
- The Caffe2 recipe in this repo is compatible only with nvcr.io/nvidia/caffe2:17.10;
- nvcr.io/partners/chainer:4.0.0b1 doesn't support distributed training and cannot be used with the distributed chainer recipe from this repo.
You need an NVIDIA GPU Cloud account to use images from DGX Container Registry. If you have no account yet, please sign up for the service at https://ngc.nvidia.com/signup.
You can find a list of available Docker images at https://ngc.nvidia.com/registry. Select the required image and get its fully qualified Docker image name (for example, nvcr.io/nvidia/caffe:17.12 for the caffe image). You will provide this name as the image parameter value and nvcr.io as the server URL in container settings.
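As a small illustration, the server URL is the part of the fully qualified image name before the first slash. The helper below is a hypothetical sketch (split_image_name is not part of Batch AI or NGC tooling) that separates a name into its server, repository and tag:

```python
def split_image_name(image):
    """Split a fully qualified image name into (server, repository, tag).

    Hypothetical helper for illustration only.
    """
    server, _, remainder = image.partition("/")
    # A fully qualified name starts with a registry host such as nvcr.io.
    if not remainder or "." not in server:
        raise ValueError(
            "expected a fully qualified name such as nvcr.io/nvidia/caffe:17.12")
    repository, _, tag = remainder.rpartition(":")
    if not repository:
        # No tag was given; return an empty tag rather than guessing one.
        repository, tag = remainder, ""
    return server, repository, tag


server, repository, tag = split_image_name("nvcr.io/nvidia/caffe:17.12")
print(server, repository, tag)
```

The full name (not just the repository part) is still what goes into the image parameter; only the server part is repeated as the server URL.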
To use Docker images published at NVIDIA DGX Container Registry, you need to obtain an API key and provide it to Batch AI.
You can obtain an API key by following these steps:
- Log in to https://ngc.nvidia.com/registry;
- Click the "Get API Key" button at the top left corner;
- Generate an API key by clicking the "Generate API Key" button and confirming the action;
- Copy the generated API key.
There are two ways to specify the API key:
- You can provide the API key value directly as a password in the job configuration (job.json for CLI or JobCreateParameters for SDKs);
- You can store the API key in Azure Key Vault and use a Key Vault reference in the job parameters as described here.
The following sections demonstrate both approaches using Azure CLI 2.0 and Python.
Here is an example of specifying container settings in job.json for running a job in a container with the tensorflow:17.10 image:
{
    "properties": {
        "containerSettings": {
            "imageSourceRegistry": {
                "image": "nvcr.io/nvidia/tensorflow:17.10",
                "serverUrl": "nvcr.io",
                "credentials": {
                    "username": "$oauthtoken",
                    "password": "<Your API Key>"
                }
            }
        },
        ... rest of job's parameters ...
    }
}
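To avoid typing the API key into job.json by hand, the container settings can be generated by a short script. This is a minimal sketch, not an official tool: the container_settings helper and the NGC_API_KEY environment variable are names assumed here for illustration, and only the fields from the example above are produced.

```python
import json
import os


def container_settings(image, api_key):
    """Build the containerSettings fragment for an nvcr.io image."""
    return {
        "imageSourceRegistry": {
            "image": image,
            "serverUrl": "nvcr.io",
            "credentials": {
                # nvcr.io expects this literal user name with an API key.
                "username": "$oauthtoken",
                "password": api_key,
            },
        }
    }


# NGC_API_KEY is a hypothetical variable name; fall back to the placeholder.
api_key = os.environ.get("NGC_API_KEY", "<Your API Key>")
job = {
    "properties": {
        "containerSettings": container_settings(
            "nvcr.io/nvidia/tensorflow:17.10", api_key),
        # the rest of the job's parameters would be added here
    }
}
print(json.dumps(job, indent=4))
```

Keep in mind that the resulting file contains the key in plain text, so it should not be committed to source control.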
- For Python, add a container_registry section to your configuration.json file:
{
    "container_registry" : {
        "user": "$oauthtoken",
        "password": "<Your API Key>"
    },
    ... rest of configuration ...
}
- Configure container settings in JobCreateParameters in the following way:
parameters = models.job_create_parameters.JobCreateParameters(
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(
            server_url='nvcr.io',
            image='nvcr.io/nvidia/tensorflow:17.10',
            credentials=models.PrivateRegistryCredentials(
                username=cfg.container_registry_user,
                password=cfg.container_registry_password))),
    # rest of the parameters
)
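The cfg object above is assumed to expose the container_registry section of configuration.json as attributes. If your setup has no such object, a minimal stand-in could look like the sketch below; the Configuration class here is hypothetical and only mirrors the two attribute names used in the snippet.

```python
import json


class Configuration:
    """Hypothetical stand-in for the cfg object; reads registry credentials."""

    def __init__(self, data):
        registry = data.get("container_registry", {})
        self.container_registry_user = registry.get("user")
        self.container_registry_password = registry.get("password")

    @classmethod
    def from_file(cls, path):
        # Load configuration.json (or any file with the same layout).
        with open(path) as config_file:
            return cls(json.load(config_file))


cfg = Configuration({
    "container_registry": {
        "user": "$oauthtoken",
        "password": "<Your API Key>",
    }
})
print(cfg.container_registry_user)
```

In practice you would call Configuration.from_file('configuration.json') instead of passing a dictionary.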
- Store your API key in Azure Key Vault using the Azure portal or by following these instructions.
- Use a secret reference in the container settings in job.json as shown below:
{
    "properties": {
        "containerSettings": {
            "imageSourceRegistry": {
                "serverUrl": "nvcr.io",
                "image": "nvcr.io/nvidia/tensorflow:17.10",
                "credentials": {
                    "username": "$oauthtoken",
                    "passwordSecretReference": {
                        "sourceVault": {
                            "id": "/subscriptions/<Your Subscription ID>/resourceGroups/<KeyVault Resource Group>/providers/Microsoft.KeyVault/vaults/<Key Vault Name>"
                        },
                        "secretUrl": "https://<Key Vault Name>.vault.azure.net/secrets/<Secret Name>"
                    }
                }
            }
        },
        ... rest of job's parameters ...
    }
}
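The vault ID and secret URL follow a fixed pattern, so they can be assembled from their parts instead of being edited by hand. The function below is a hypothetical sketch that only does string formatting according to the placeholders shown above; it does not call Azure or check that the vault or secret exists.

```python
def key_vault_secret_reference(subscription_id, resource_group,
                               vault_name, secret_name):
    """Build a passwordSecretReference fragment from its components."""
    vault_id = (
        "/subscriptions/{0}/resourceGroups/{1}"
        "/providers/Microsoft.KeyVault/vaults/{2}"
    ).format(subscription_id, resource_group, vault_name)
    secret_url = "https://{0}.vault.azure.net/secrets/{1}".format(
        vault_name, secret_name)
    return {"sourceVault": {"id": vault_id}, "secretUrl": secret_url}


# All four values below are made-up examples.
reference = key_vault_secret_reference(
    "00000000-0000-0000-0000-000000000000", "my-group", "my-vault", "ngc-key")
print(reference["secretUrl"])
```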
- For Python, store your API key in Azure Key Vault using the Azure portal or by following these instructions.
- Add the Key Vault ID and secret URL to your configuration.json file:
{
    "keyvault_id": "/subscriptions/<Your Subscription ID>/resourceGroups/<KeyVault Resource Group>/providers/Microsoft.KeyVault/vaults/<Key Vault Name>",
    "container_registry" : {
        "user": "$oauthtoken",
        "secret_url": "https://<Key Vault Name>.vault.azure.net/secrets/<Secret Name>"
    },
    ... rest of configuration ...
}
- Configure container settings in JobCreateParameters in the following way:
parameters = models.job_create_parameters.JobCreateParameters(
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(
            server_url='nvcr.io',
            image='nvcr.io/nvidia/tensorflow:17.10',
            credentials=models.PrivateRegistryCredentials(
                username=cfg.container_registry_user,
                password_secret_reference=models.KeyVaultSecretReference(
                    source_vault=models.ResourceId(cfg.keyvault_id),
                    secret_url=cfg.container_registry_secret_url)))),
    # rest of the parameters
)
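Since configuration.json can carry either a password or a Key Vault secret URL, it may help to check which approach a given file uses before building the job parameters. The sketch below is a hypothetical check, assuming that exactly one of the two should be set and that a secret URL needs a keyvault_id next to it; adjust the rules if your setup differs.

```python
def registry_credential_mode(config):
    """Return 'password' or 'keyvault' for a parsed configuration.json."""
    registry = config.get("container_registry", {})
    has_password = bool(registry.get("password"))
    has_secret = bool(registry.get("secret_url"))
    # Assumption: exactly one credential source should be configured.
    if has_password == has_secret:
        raise ValueError("set exactly one of password or secret_url")
    if has_secret and not config.get("keyvault_id"):
        raise ValueError("secret_url requires keyvault_id")
    return "keyvault" if has_secret else "password"


mode = registry_credential_mode({
    "keyvault_id": "/subscriptions/<Your Subscription ID>/...",
    "container_registry": {
        "user": "$oauthtoken",
        "secret_url": "https://<Key Vault Name>.vault.azure.net/secrets/<Secret Name>",
    },
})
print(mode)
```

The returned mode can then decide whether PrivateRegistryCredentials is given password or password_secret_reference.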