Exploring Quantization for Efficient Pre-Training of Transformer Language Models

This repository contains the code for Exploring Quantization for Efficient Pre-Training of Transformer Language Models.

Abstract

The increasing scale of Transformer models has led to an increase in their pre-training computational requirements. While quantization has proven to be effective after pre-training and during fine-tuning, applying quantization in Transformers during pre-training has remained largely unexplored at scale for language modeling. This study aims to explore the impact of quantization for efficient pre-training of Transformers, with a focus on linear layer components. By systematically applying straightforward linear quantization to weights, activations, gradients, and optimizer states, we assess its effects on model efficiency, stability, and performance during training. By offering a comprehensive recipe of effective quantization strategies to be applied during the pre-training of Transformers, we promote high training efficiency from scratch while retaining language modeling ability.

Installation

Python: 3.8+ , CUDA: 11.8

Clone the repository.
Create a virtual environment and activate it:

python -m venv env
source env/bin/activate

Install the dependencies

pip install --upgrade pip
pip install -r requirements.txt

Follow additional steps to install FlashAttention as shown in the scripts/install_requirements.sh script.

Running experiments

Experiments were conducted using the OpenWebText dataset from HuggingFace, following a set of training configurations similar to those in nanoGPT. These experiments were run on 4xA100 80G GPUs.

Preparing dataset

To download and tokenize the dataset into your $HF_HOME directory, run:

python src/main.py --configs 'configs/gpt2_baseline.jsonnet' load_and_tokenize_dataset

Training

Our training utilizes the HuggingFace Trainer. You can adjust the training configurations in the configs/trainer directory. On average, training takes approximately 4.3 days to complete 300k steps.

torchrun --nproc_per_node=4 src/main.py --configs 'configs/gpt2_baseline.jsonnet' train

We provide scripts for running the experiments to a Slurm queue in scripts/train.sh

Evaluation

To evaluate the model's performance on a selected task, using the same configuration for the model, training recipe, dataset, and quantization, run:

python src/main.py --configs 'configs/gpt2_baseline.jsonnet, configs/evaluation_task/hellaswag.jsonnet' evaluate

Citation

If you use this code for your research, please consider citing our paper:

@article{chitsaz2024exploring,
  title={Exploring Quantization for Efficient Pre-Training of Transformer Language Models},
  author={Chitsaz, Kamran and Fournier, Quentin and Mordido, Gon{\c{c}}alo and Chandar, Sarath},
  journal={arXiv preprint arXiv:2407.11722},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
configs		configs
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

Abstract

Installation

Running experiments

Preparing dataset

Training

Evaluation

Citation

About

Releases

Packages

Languages

chandar-lab/EfficientLLMs

Folders and files

Latest commit

History

Repository files navigation

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

Abstract

Installation

Running experiments

Preparing dataset

Training

Evaluation

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages