We present the code and instructions to reproduce the neural machine translation experiments from our NeurIPS 2022 Spotlight paper "Understanding the Failure of Batch Normalization for Transformers in NLP".
For the other tasks in the paper (language modeling, named entity recognition, and text classification), you can reproduce the corresponding results by modifying the normalization module in the same way. Due to licensing reasons, we do not include those codebases here. We are still adding new features.
The code is based on fairseq (v0.9.0).
The BN/RBN module is located at fairseq/modules/norm/mask_batchnorm3d.py.
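If you want to adapt BN/RBN to another task, a simple way to find where this module is imported and selected is to search the codebase; the search term below only assumes the file name shown above, and the actual call sites may look different:
grep -rn "mask_batchnorm3d" fairseq/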
Install PyTorch (we use Python 3.6 and PyTorch 1.7.1; higher versions of Python and PyTorch should also work):
conda create -n rbn python=3.6
conda activate rbn
conda install pytorch==1.7.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch
(or: pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html)
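As an optional sanity check (not part of the original instructions), you can verify that PyTorch is installed and can see your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"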
Install fairseq by:
cd RegularizedBN
pip install --editable ./
Install other requirements:
pip install -r requirements.txt
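To confirm that the editable install picked up this repository, you can optionally check which fairseq is being imported:
python -c "import fairseq; print(fairseq.__version__, fairseq.__file__)"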
Download the data from Google Drive and extract it into data-bin. You can also download it from Baidu Netdisk.
cd data-bin
unzip iwslt14.tokenized.de-en.zip
cd ..
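After extraction, you can list the data directory to make sure it unpacked correctly; the exact file names may vary, but a fairseq data-bin directory typically contains dictionary files (dict.de.txt, dict.en.txt) and binarized .bin/.idx files for the train/valid/test splits:
ls data-bin/iwslt14.tokenized.de-en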
Train the model (8 GB of GPU memory is enough):
chmod +x ./iwslt14_bash/train-iwslt14-pre-max-epoch.sh ./iwslt14_bash/train-iwslt14-post-max-epoch.sh
For Pre-Norm Transformer:
BN:
CUDA_VISIBLE_DEVICES=0 ./iwslt14_bash/train-iwslt14-pre-max-epoch.sh batch_1_1
RBN:
CUDA_VISIBLE_DEVICES=1 ./iwslt14_bash/train-iwslt14-pre-max-epoch.sh batch_diff_0.1_0.01
LN:
CUDA_VISIBLE_DEVICES=2 ./iwslt14_bash/train-iwslt14-pre-max-epoch.sh layer_1
For Post-Norm Transformer:
BN:
CUDA_VISIBLE_DEVICES=0 ./iwslt14_bash/train-iwslt14-post-max-epoch.sh batch_1_1
RBN:
CUDA_VISIBLE_DEVICES=1 ./iwslt14_bash/train-iwslt14-post-max-epoch.sh batch_diff_60_0
LN:
CUDA_VISIBLE_DEVICES=2 ./iwslt14_bash/train-iwslt14-post-max-epoch.sh layer_1
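After training, BLEU can be computed with the standard fairseq generation command; the checkpoint path below is hypothetical (it depends on where the training scripts save their checkpoints), and the data directory name assumes the extraction step above, so adjust both as needed:
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/iwslt14.tokenized.de-en --path checkpoints/checkpoint_best.pt --batch-size 128 --beam 5 --remove-bpe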