Commit

Merge branch 'EleutherAI:main' into main
a663E-36z1120 authored Oct 15, 2023
2 parents 99d37f9 + f6ac04d commit d69132a
Showing 38 changed files with 1,017 additions and 938 deletions.
51 changes: 51 additions & 0 deletions .github/workflows/coverity_scan.yml
@@ -0,0 +1,51 @@
name: Coverity

on:
  workflow_dispatch:
    inputs:
      build_version:
        description: "Version of GPT-NeoX being submitted for scan"
        required: false
        default: "GPT-NeoX build version"
      build_description:
        description: "Description of the current build"
        required: false
        default: "Current build of GPT-NeoX"

jobs:
  coverity:

    runs-on: ubuntu-latest

    env:
      COV_USER: ${{ secrets.COV_USER }}
      COVERITY_PROJECT: ${{ secrets.COVERITY_PROJECT }}
      COVERITY_TOKEN: ${{ secrets.COVERITY_TOKEN }}

    steps:
      - uses: actions/checkout@v2

      - name: Install utils
        run: |
          sudo apt-get update -y && sudo apt-get upgrade -y
          sudo apt-get install curl jq wget -y

      - name: Coverity Download
        run: |
          wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=EleutherAI%2Fgpt-neox" -O coverity_tool.tgz
          # Unpack the scan tools so that cov-configure/cov-build are available under
          # $GITHUB_WORKSPACE/bin (assumes the archive has a single top-level directory)
          tar xzf coverity_tool.tgz --strip-components=1 -C $GITHUB_WORKSPACE
          $GITHUB_WORKSPACE/bin/cov-configure --python
          $GITHUB_WORKSPACE/bin/cov-configure --gcc

      - name: Coverity Scan
        run: |
          set -x
          $GITHUB_WORKSPACE/bin/cov-build --dir cov-int --no-command --fs-capture-search $GITHUB_WORKSPACE

      - name: Coverity Upload
        run: |
          tar caf build-results.bz2 cov-int
          curl --form token=$COVERITY_TOKEN \
               --form email=$COV_USER \
               --form file=@$GITHUB_WORKSPACE/build-results.bz2 \
               --form version="${{ github.event.inputs.build_version }}" \
               --form description="${{ github.event.inputs.build_description }}" \
               "https://scan.coverity.com/builds?project=EleutherAI%2Fgpt-neox"
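
Because the workflow is triggered via `workflow_dispatch`, it can also be kicked off from the command line. A minimal sketch using the GitHub CLI (assuming `gh` is installed and authenticated; the input values are placeholders):

```bash
gh workflow run coverity_scan.yml \
  -f build_version="2.0.0" \
  -f build_description="manual Coverity scan"
```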
24 changes: 21 additions & 3 deletions CITATION.cff
@@ -4,6 +4,9 @@ authors:
- affiliation: EleutherAI
  family-names: Andonian
  given-names: Alex
- affiliation: EleutherAI
  family-names: Anthony
  given-names: Quentin
- affiliation: EleutherAI
  family-names: Biderman
  given-names: Stella
@@ -34,15 +37,30 @@ authors:
- affiliation: EleutherAI
  family-names: Pieler
  given-names: Michael
- affiliation: EleutherAI
  family-names: Phang
  given-names: Jason
- affiliation: EleutherAI
  family-names: Purohit
  given-names: Shivanshu
- affiliation: EleutherAI
  family-names: Schoelkopf
  given-names: Hailey
- affiliation: EleutherAI
  family-names: Stander
  given-names: Dashiell
- affiliation: EleutherAI
  family-names: Songz
  given-names: Tri
- affiliation: EleutherAI
  family-names: Phil
  given-names: Wang
  family-names: Tigges
  given-names: Curt
- affiliation: EleutherAI
  family-names: Thérien
  given-names: Benjamin
- affiliation: EleutherAI
  family-names: Wang
  given-names: Phil
- affiliation: EleutherAI
  family-names: Weinbach
  given-names: Samuel
@@ -55,7 +73,7 @@ license: "Apache-2.0"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://www.github.com/eleutherai/gpt-neox"
title: "GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch"
version: "0.0.1"
version: "2.0.0"
doi: "10.5281/zenodo.5879544"
date-released: 2021-08-23
...
143 changes: 130 additions & 13 deletions README.md
@@ -18,11 +18,13 @@ GPT-NeoX leverages many of the same features and technologies as the popular Meg
* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://github.com/huggingface/transformers/) libraries, logging via [WandB](https://wandb.ai/site), and evaluation via our [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

## News
**[10/9/2023]** We now support checkpointing with AWS S3! Activate with the `s3_path` config option (for more detail, see [the PR](https://github.com/EleutherAI/gpt-neox/pull/1010)).

**[9/20/2023]** As of https://github.com/EleutherAI/gpt-neox/pull/1035, we have deprecated Flash Attention 0.x and 1.x and migrated support to Flash Attention 2.x. We don't believe this will cause problems, but if you have a use case that requires the old Flash Attention support with the latest GPT-NeoX, please raise an issue.

**[8/10/2023]** We have experimental support for LLaMA 2 and Flash Attention v2 in our [math-lm](https://github.com/EleutherAI/math-lm) project that will be upstreamed later this month.

**[5/17/2023]** After fixing some miscellenous bugs we now fully support bf16.
**[5/17/2023]** After fixing some miscellaneous bugs we now fully support bf16.

**[4/11/2023]** We have upgraded our Flash Attention implementation to now support Alibi positional embeddings.

@@ -91,6 +93,121 @@ from the repository root.
To use [Flash-Attention](https://github.com/HazyResearch/flash-attention), install the additional dependencies in `./requirements/requirements-flashattention.txt` and set the attention type in your configuration accordingly (see [configs](./configs/)). This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere GPUs (such as A100s); see the repository for more details.
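
For instance, a minimal sketch of installing those dependencies from the repository root:

```bash
pip install -r ./requirements/requirements-flashattention.txt
```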


### Multi-Node Launching

NeoX and Deep(er)Speed support training on multiple nodes, and you can choose from a variety of launchers to orchestrate multi-node jobs.

In general, there needs to be a "hostfile" accessible somewhere, with the format:

```bash
node1_ip slots=8
node2_ip slots=8
```

where the first column contains the IP address of each node in your setup and the number of slots is the number of GPUs that node has access to. In your config, you must pass in the path to the hostfile with `"hostfile": "/path/to/hostfile"`. Alternatively, the path to the hostfile can be placed in the environment variable `DLTS_HOSTFILE`.
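
For example, a minimal sketch of launching with the hostfile path supplied through the environment rather than the config (paths are placeholders):

```bash
export DLTS_HOSTFILE=/path/to/hostfile
python3 deepy.py train.py /path/to/configs/my_model.yml
```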

#### pdsh

`pdsh` is the default launcher, and if you're using `pdsh`, all you need to do (besides ensuring that `pdsh` is installed in your environment) is set `{"launcher": "pdsh"}` in your config files.

#### MPI

If using MPI, you must specify the MPI library (DeepSpeed/GPT-NeoX currently supports `mvapich`, `openmpi`, `mpich`, and `impi`, though `openmpi` is the most commonly used and tested) as well as pass the `deepspeed_mpi` flag in your config file:

```json
{
"launcher": "openmpi",
"deepspeed_mpi": true
}
```

With your environment properly set up and the correct configuration files, you can use `deepy.py` like a normal Python script and start (for example) a training job with:

`python3 deepy.py train.py /path/to/configs/my_model.yml`

#### Slurm

Using Slurm can be slightly more involved. Like with MPI, you must add the following to your config:

```json
{
"launcher": "slurm",
"deepspeed_slurm": true
}
```
If you do not have SSH access to the compute nodes in your Slurm cluster, you also need to add `{"no_ssh_check": true}` to your config.
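
Putting these together, a sketch of the Slurm-related config entries (include `no_ssh_check` only when SSH access is unavailable):

```json
{
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "no_ssh_check": true
}
```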

#### (Advanced) Custom Launching

There are many cases where the above default launching options are not sufficient:

- Many clusters have their own unique job scheduler or specific MPI/Slurm arguments necessary for launching jobs, such as [Summit JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun) or [LLNL Flux](https://computing.llnl.gov/projects/flux-building-framework-resource-management)
- While the above Slurm/MPI/pdsh default options are enough for most job runs, advanced users may want to add arguments for optimization or debugging purposes

In these cases, you will need to modify the DeepSpeed [multinode runner](https://github.com/microsoft/DeepSpeed/blob/17957728c0362bf8ae70feca308e491e55ef9feb/deepspeed/launcher/multinode_runner.py) utility to support your use case. Broadly, these enhancements fall under two categories:

##### 1. Adding a Launcher (e.g. [JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun), [Flux](https://computing.llnl.gov/projects/flux-building-framework-resource-management), etc.)

In this case, you must add a new multinode runner class to `deepspeed/launcher/multinode_runner.py` and expose it as a configuration option in GPT-NeoX. Examples of how we did this for [Summit JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun) are in [this DeeperSpeed commit](https://github.com/EleutherAI/DeeperSpeed/commit/9aed6c8500d7c492d85c5c88687322dbda70e370) and [this GPT-NeoX commit](https://github.com/EleutherAI/gpt-neox/commit/3782c7ae60f8624e566e3879b89bb09e8b59b869), respectively.

##### 2. Modifying Run Command or Environment Variables

We have encountered many cases where we wish to modify the MPI/Slurm run command for an optimization or to debug (e.g. to modify the [Slurm srun CPU binding](https://slurm.schedmd.com/srun.html#OPT_cpu-bind) or to tag MPI logs with the rank). In this case, you must modify the multinode runner class's run command under its `get_cmd` method (e.g. [mpirun_cmd](https://github.com/microsoft/DeepSpeed/blob/17957728c0362bf8ae70feca308e491e55ef9feb/deepspeed/launcher/multinode_runner.py#L135-L147) for OpenMPI). Examples of how we did this to provide optimized and rank-tagged run commands using Slurm and OpenMPI for the Stability cluster are in [this DeeperSpeed branch](https://github.com/microsoft/DeepSpeed/compare/master...EleutherAI:DeeperSpeed:v2.0-stability).


#### Hostfile Generation

In general, you will not be able to rely on a single fixed hostfile, so you need a script that generates one dynamically when your job starts. An example script to dynamically generate a hostfile using [Slurm](https://slurm.schedmd.com/documentation.html) and 8 GPUs per node is:

```bash
#!/bin/bash
GPUS_PER_NODE=8
mkdir -p /sample/path/to/hostfiles
# need to add the current slurm jobid to hostfile name so that we don't add to previous hostfile
hostfile=/sample/path/to/hostfiles/hosts_$SLURM_JOBID
# be extra sure we aren't appending to a previous hostfile
rm $hostfile &> /dev/null
# loop over the node names
for i in `scontrol show hostnames $SLURM_NODELIST`
do
  # add a line to the hostfile
  echo $i slots=$GPUS_PER_NODE >> $hostfile
done
```

`$SLURM_JOBID` and `$SLURM_NODELIST` are environment variables that Slurm creates for you. See the [sbatch documentation](https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES) for a full list of available Slurm environment variables set at job creation time.
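
For instance, with a hypothetical two-node allocation, the script above would produce a hostfile like the following (node names are placeholders):

```bash
node001 slots=8
node002 slots=8
```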

#### Job Launching

Then you can create an [sbatch](https://slurm.schedmd.com/sbatch.html) script from which to kick off your GPT-NeoX job. A bare-bones sbatch script on a Slurm-based cluster with 8 GPUs per node would look like this:

```bash
#!/bin/bash
#SBATCH --job-name="neox"
#SBATCH --partition=your-partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# Some potentially useful distributed environment variables
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12802
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

# Your hostfile creation script from above
./write_hostfile.sh
# Tell DeepSpeed where to find our generated hostfile via DLTS_HOSTFILE
export DLTS_HOSTFILE=/sample/path/to/hostfiles/hosts_$SLURM_JOBID

# Launch training
python3 deepy.py train.py /sample/path/to/your/configs/my_model.yml

```

You can then kick off a training run with `sbatch my_sbatch_script.sh`.


### Containerized Setup

We also provide a Dockerfile if you prefer to run NeoX in a container. To use this option, first build an image named `gpt-neox` from the repository root directory with `docker build -t gpt-neox -f Dockerfile .`. We also host pre-built images on [Docker Hub at `leogao2/gpt-neox`](https://hub.docker.com/r/leogao2/gpt-neox/tags).
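
Once the image is built, a minimal sketch of starting an interactive container with GPU access (assuming the NVIDIA Container Toolkit is installed; volume mounts and flags such as `--shm-size` will depend on your setup):

```bash
docker run --rm -it --gpus all --shm-size=1g gpt-neox bash
```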
@@ -169,7 +286,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):
(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:
You can now pretokenize your data using `tools/datasets/preprocess_data.py`, the arguments for which are detailed below:

```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
@@ -210,7 +327,7 @@ runtime:
For example:

```bash
python tools/preprocess_data.py \
python tools/datasets/preprocess_data.py \
--input ./data/mydataset.jsonl.zst \
--output-prefix ./data/mydataset \
--vocab ./data/gpt2-vocab.json \
@@ -314,19 +431,19 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
python ./tools/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
python ./tools/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved with a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 that used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
python ./tools/upload.py
python ./tools/ckpts/upload.py
```
and input the requested information, including your HF Hub user token.

@@ -357,12 +474,12 @@ If you have found the GPT-NeoX library helpful in your work, you can cite this r
```bibtex
@software{gpt-neox-library,
title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},
url = {https://www.github.com/eleutherai/gpt-neox},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
month = {9},
year = {2023},
version = {2.0.0},
}
```

@@ -388,7 +505,7 @@ Citation instructions for other pretrained models can be found [in the appropria
GPT-NeoX has been used by academic and industry researchers for a variety of high performance computing projects.

### Our Research
EleutherAI and our colaborators have used it in the following publications:
EleutherAI and our collaborators have used it in the following publications:
- Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jon Tow, Ben Wang, and Samuel Weinbach. "[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)." In *Proceedings of the ACL Workshop on Challenges \& Perspectives in Creating Large Language Models* (2022).
- Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan et al. "[Pythia: A suite for analyzing large language models across training and scaling](https://arxiv.org/abs/2304.01373)." In _International Conference on Machine Learning_, pp. 2397-2430. PMLR (2023).
- Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023).
Expand Down Expand Up @@ -435,7 +552,7 @@ The following models were trained using this library:
**Other Modalities**
- [University College London](https://www.ucl.ac.uk/computer-science/)'s [ChessGPT-3B](https://huggingface.co/Waterhorse/chessgpt-base-v1)
- [Gretel](https://gretel.ai/)'s [Text-to-Table](https://huggingface.co/gretelai/text2table)


## Licensing

@@ -461,4 +578,4 @@ For full terms, see the `LICENSE` file. If you have any questions, comments, or

## Acknowledgements

We run our experiments on a Kubernetes cluster provided by [CoreWeave](https://coreweave.com/) and a SLURM cluster provided by [Stability AI](https://stability.ai). We are thankful to the DeepSpeed team for their advice and consultation.
We run our experiments on a Kubernetes cluster provided by [CoreWeave](https://coreweave.com/) and a Slurm cluster provided by [Stability AI](https://stability.ai). We are thankful to the DeepSpeed team for their advice and consultation.
18 changes: 17 additions & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

Default = 1d20559
Default = ec71f71

current git hash of repository

@@ -1169,6 +1169,22 @@ Training Arguments



- **s3_path**: str

Default = None

Path to s3 bucket for saving checkpoints.



- **s3_chunk_size**: int

Default = 104857600

The number of bytes in each file chunk when uploading to s3. Defaults to 100MiB.
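
For reference, a sketch of what the corresponding entries might look like in a model config (the bucket path is a placeholder):

```json
{
  "s3_path": "s3://your-bucket/neox-checkpoints",
  "s3_chunk_size": 104857600
}
```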



- **config_files**: dict

Default = None