Accelerate BERT pre-training with ONNX Runtime

This example uses ONNX Runtime to pre-train the BERT PyTorch model maintained at

You can run the training in Azure Machine Learning or on an Azure VM with an NVIDIA GPU.


  1. Clone this repo

    git clone
    cd onnxruntime-training-examples
  2. Clone the download code and model

    git clone --no-checkout
    cd DeepLearningExamples/
    git checkout 4733603577080dbd1bdcd51864f31e45d5196704
    cd ..
  3. Create working directory

    mkdir -p workspace
    mv DeepLearningExamples/PyTorch/LanguageModeling/BERT/ workspace
    rm -rf DeepLearningExamples
    cp -r ./nvidia-bert/ort_addon/* workspace/BERT
    cd workspace
    git clone
    cd wikiextractor/
    git checkout e4abb4cbd019b0257824ee47c23dd163919b731b
    cd ../../ 

Download and prepare data

The following is a minimal set of instructions to download and process one of the datasets used for BERT pre-training.

To include additional datasets, and for more details, refer to DeepLearningExamples.

Note that the datasets used for BERT pre-training need a large amount of disk space. After processing, the data must be made available for training. Because the processed data is large to copy, we recommend that you run the steps below in the training environment itself, or in an environment from which data transfer to the training environment is fast and efficient.

  1. Check pre-requisites

    • Python 3.6
    • Natural Language Toolkit (NLTK): python3 -m pip install nltk
  2. Download and prepare Wikicorpus training data in HDF5 format.

    export BERT_PREP_WORKING_DIR=./workspace/BERT/data/
    # Download google_pretrained_weights
    python ./workspace/BERT/data/ --action download --dataset google_pretrained_weights
    # Download wikicorpus_en via wget
    mkdir -p ./workspace/BERT/data/download/wikicorpus_en
    cd ./workspace/BERT/data/download/wikicorpus_en
    bzip2 -dv enwiki-latest-pages-articles.xml.bz2
    mv enwiki-latest-pages-articles.xml wikicorpus_en.xml
    cd ../../../../..
    # Fix path issue to use BERT_PREP_WORKING_DIR as prefix for path instead of hard-coded prefix
    sed -i "s/path_to_wikiextractor_in_container = '/path_to_wikiextractor_in_container = './g" ./workspace/BERT/data/
    # Format text files
    python ./workspace/BERT/data/ --action text_formatting --dataset wikicorpus_en
    # Shard text files
    python ./workspace/BERT/data/ --action sharding --dataset wikicorpus_en
    # Fix path to workspace to allow running outside of the docker container
    sed -i "s/python \/workspace\/bert/python .\/workspace\/BERT/g" ./workspace/BERT/data/
    # Create HDF5 files Phase 1
    python ./workspace/BERT/data/ --action create_hdf5_files --dataset wikicorpus_en --max_seq_length 128 \
      --max_predictions_per_seq 20 --vocab_file ./workspace/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
    # Create HDF5 files Phase 2
    python ./workspace/BERT/data/ --action create_hdf5_files --dataset wikicorpus_en --max_seq_length 512 \
      --max_predictions_per_seq 80 --vocab_file ./workspace/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1
  3. Make data accessible for training

    After completing the steps above, data in hdf5 format will be available at the following locations:

    • Phase 1 data: ./workspace/BERT/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/train
    • Phase 2 data: ./workspace/BERT/data/hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/train

    The instructions below refer to these HDF5 files as the data to make accessible to the training process.
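
Before transferring or mounting the data, it can help to confirm that both output directories exist and contain files. The following is a minimal sketch, not part of the original example: the helper names `hdf5_dir` and `count_files` are hypothetical, and the directory name pattern is simply copied from the two paths listed above.

```python
from pathlib import Path


def hdf5_dir(data_root, seq_len, max_pred):
    """Build the expected output directory for one pre-training phase.

    The name pattern is copied from the paths listed above. The fixed
    values (masked_lm_prob 0.15, seed 12345, dupe_factor 5) are the ones
    that appear in those paths.
    """
    name = (
        f"hdf5_lower_case_1_seq_len_{seq_len}_max_pred_{max_pred}"
        "_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5"
    )
    return Path(data_root) / name / "wikicorpus_en" / "train"


def count_files(directory):
    """Return the number of regular files in directory, or 0 if it is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for entry in directory.iterdir() if entry.is_file())


if __name__ == "__main__":
    root = "./workspace/BERT/data"
    # (max_seq_length, max_predictions_per_seq) for phase 1 and phase 2
    for phase, (seq_len, max_pred) in enumerate([(128, 20), (512, 80)], start=1):
        path = hdf5_dir(root, seq_len, max_pred)
        print(f"Phase {phase}: {path} ({count_files(path)} files)")
```

A count of 0 for either phase suggests that the corresponding create_hdf5_files step did not complete or wrote to a different working directory.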

BERT pre-training with ONNX Runtime in Azure Machine Learning

  1. Data Transfer

    • Transfer training data to Azure blob storage

    To transfer the data to Azure Blob Storage using the Azure CLI, run:

    az storage blob upload-batch --account-name <storage-name> -d <container-name> -s ./workspace/BERT/data
    • Register the blob container as a data store
    • Mount the data store in the compute targets used for training

    Refer to the storage guidance for details on using an Azure storage account for training in Azure Machine Learning.

  2. Execute pre-training

    The BERT pre-training job in Azure Machine Learning can be launched using either of these environments:

    • Azure Machine Learning Compute Instance to run the Jupyter notebook.
    • Azure Machine Learning SDK

    You will need a GPU-optimized compute target, either NCv3 or NDv2 series, to execute this pre-training job.

    Execute the steps in the Python notebook azureml-notebooks/run-pretraining.ipynb within your environment. If you have a local setup for running Azure ML notebooks, you can run the steps there. Otherwise, create a compute instance in Azure Machine Learning and run the steps on it.

BERT pre-training with ONNX Runtime directly on ND40rs_v2 (or similar NVIDIA capable Azure VM)

  1. Check pre-requisites

  2. Build the ONNX Runtime Docker image

    Build the onnxruntime wheel from source into a Docker image.

    cd nvidia-bert/docker
    cd ../..
    • Tag this image onnxruntime-pytorch-for-bert

    To build and install the onnxruntime wheel on the host machine, follow steps here

  3. Set the correct paths to the training data for the Docker image.

    Edit nvidia-bert/docker/

    -v <replace-with-path-to-phase1-hdf5-training-data>:/data/128
    -v <replace-with-path-to-phase2-hdf5-training-data>:/data/512

    The two directories must contain the hdf5 training files.

  4. Set the number of GPUs and per GPU limit.

    Edit workspace/BERT/scripts/

  5. Modify other training parameters as needed.

    Edit workspace/BERT/scripts/


    The above defaults are tuned for an Azure NC24rs_v3.

    The training batch size is the number of samples a single GPU sees before weights are updated. Training proceeds in local and global steps. A local step is a single backpropagation pass over the model to compute its gradients. These gradients are accumulated at every local step until the weights are updated in a global step. The microbatch size is the number of samples a single GPU sees in a single backpropagation pass, and equals the training batch size divided by the number of gradient accumulation steps.

    Note: The effective batch size is (number of GPUs) x train_batch_size (per GPU). In general, we recommend an effective batch size of ~64,000 for phase 1 and ~32,000 for phase 2. Use the smallest number of gradient accumulation steps that does not overflow GPU memory, which maximizes the microbatch size.

    Consult the Parameters section by NVIDIA for additional details.

  6. Launch interactive container.

    cd workspace/BERT
    bash ../../nvidia-bert/docker/
  7. Launch pre-training run

    bash /workspace/bert/scripts/

    If you get memory errors, try reducing the batch size or enabling the partition optimizer flag.
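
The batch-size relationships described in step 5 can be checked with a few lines of arithmetic before editing the script, which is useful when a memory error forces a change. This is a minimal sketch: the function names are hypothetical, the example values (8 GPUs, 16 accumulation steps) are illustrative rather than the defaults of this example, and the sketch assumes the per-GPU batch size divides evenly by the number of accumulation steps.

```python
def microbatch_size(train_batch_size, gradient_accumulation_steps):
    """Samples one GPU processes in a single backpropagation (local) step."""
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError(
            "train_batch_size must be divisible by gradient_accumulation_steps"
        )
    return train_batch_size // gradient_accumulation_steps


def effective_batch_size(num_gpus, train_batch_size):
    """Samples consumed across all GPUs per weight update (global step)."""
    return num_gpus * train_batch_size


if __name__ == "__main__":
    num_gpus = 8              # illustrative value
    train_batch_size = 8000   # per GPU, chosen so that 8 GPUs reach 64,000
    accumulation_steps = 16   # illustrative value

    print(effective_batch_size(num_gpus, train_batch_size))       # 64000
    print(microbatch_size(train_batch_size, accumulation_steps))  # 500
```

Doubling the accumulation steps halves the microbatch size while leaving the effective batch size unchanged, so it lowers the number of samples held per backpropagation pass without altering the number of samples per weight update.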


For fine-tuning tasks, follow