Commit

Merge branch 'EleutherAI:main' into main
a663E-36z1120 authored Oct 15, 2023
2 parents 99d37f9 + f6ac04d commit d69132a
Showing 38 changed files with 1,017 additions and 938 deletions.
51 changes: 51 additions & 0 deletions .github/workflows/coverity_scan.yml
@@ -0,0 +1,51 @@
name: Coverity

on:
  workflow_dispatch:
    inputs:
      build_version:
        description: "Version of GPT-NeoX being submitted for scan"
        required: false
        default: "GPT-NeoX build version"
      build_description:
        description: "Description of the current build"
        required: false
        default: "Current build of GPT-NeoX"

jobs:
  coverity:

    runs-on: ubuntu-latest

    env:
      COV_USER: ${{ secrets.COV_USER }}
      COVERITY_PROJECT: ${{ secrets.COVERITY_PROJECT }}
      COVERITY_TOKEN: ${{ secrets.COVERITY_TOKEN }}

    steps:
      - uses: actions/checkout@v2

      - name: Install utils
        run: |
          sudo apt-get update -y && sudo apt-get upgrade -y
          sudo apt-get install curl jq wget -y

      - name: Coverity Download
        run: |
          wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=EleutherAI%2Fgpt-neox" -O coverity_tool.tgz
          # Unpack the scan tools so that cov-configure/cov-build are available under
          # $GITHUB_WORKSPACE/bin (assumes the archive has a single top-level directory)
          tar xzf coverity_tool.tgz --strip-components=1 -C $GITHUB_WORKSPACE
          $GITHUB_WORKSPACE/bin/cov-configure --python
          $GITHUB_WORKSPACE/bin/cov-configure --gcc

      - name: Coverity Scan
        run: |
          set -x
          $GITHUB_WORKSPACE/bin/cov-build --dir cov-int --no-command --fs-capture-search $GITHUB_WORKSPACE

      - name: Coverity Upload
        run: |
          tar caf build-results.bz2 cov-int
          curl --form token=$COVERITY_TOKEN \
               --form email=$COV_USER \
               --form file=@$GITHUB_WORKSPACE/build-results.bz2 \
               --form version="${{ github.event.inputs.build_version }}" \
               --form description="${{ github.event.inputs.build_description }}" \
               "https://scan.coverity.com/builds?project=EleutherAI%2Fgpt-neox"
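
Because the workflow is triggered via `workflow_dispatch`, it can also be kicked off from the command line. A minimal sketch using the GitHub CLI (assuming `gh` is installed and authenticated; the input values are placeholders):

```bash
gh workflow run coverity_scan.yml \
  -f build_version="2.0.0" \
  -f build_description="manual Coverity scan"
```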
24 changes: 21 additions & 3 deletions CITATION.cff
@@ -4,6 +4,9 @@ authors:
- affiliation: EleutherAI
  family-names: Andonian
  given-names: Alex
- affiliation: EleutherAI
  family-names: Anthony
  given-names: Quentin
- affiliation: EleutherAI
  family-names: Biderman
  given-names: Stella
@@ -34,15 +37,30 @@ authors:
- affiliation: EleutherAI
  family-names: Pieler
  given-names: Michael
- affiliation: EleutherAI
  family-names: Phang
  given-names: Jason
- affiliation: EleutherAI
  family-names: Purohit
  given-names: Shivanshu
- affiliation: EleutherAI
  family-names: Schoelkopf
  given-names: Hailey
- affiliation: EleutherAI
  family-names: Stander
  given-names: Dashiell
- affiliation: EleutherAI
  family-names: Songz
  given-names: Tri
- affiliation: EleutherAI
  family-names: Phil
  given-names: Wang
  family-names: Tigges
  given-names: Curt
- affiliation: EleutherAI
  family-names: Thérien
  given-names: Benjamin
- affiliation: EleutherAI
  family-names: Wang
  given-names: Phil
- affiliation: EleutherAI
  family-names: Weinbach
  given-names: Samuel
@@ -55,7 +73,7 @@ license: "Apache-2.0"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://www.github.com/eleutherai/gpt-neox"
title: "GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch"
version: "0.0.1"
version: "2.0.0"
doi: "10.5281/zenodo.5879544"
date-released: 2021-08-23
...
143 changes: 130 additions & 13 deletions README.md
@@ -18,11 +18,13 @@ GPT-NeoX leverages many of the same features and technologies as the popular Meg
* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://github.com/huggingface/transformers/) libraries, logging via [WandB](https://wandb.ai/site), and evaluation via our [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

## News
**[10/9/2023]** We now support checkpointing with AWS S3! Activate with the `s3_path` config option (for more detail, see [the PR](https://github.com/EleutherAI/gpt-neox/pull/1010)).

**[9/20/2023]** As of https://github.com/EleutherAI/gpt-neox/pull/1035, we have deprecated Flash Attention 0.x and 1.x and migrated support to Flash Attention 2.x. We don't believe this will cause problems, but if you have a use case that requires the old Flash Attention support with the latest GPT-NeoX, please raise an issue.

**[8/10/2023]** We have experimental support for LLaMA 2 and Flash Attention v2 in our [math-lm](https://github.com/EleutherAI/math-lm) project that will be upstreamed later this month.

**[5/17/2023]** After fixing some miscellenous bugs we now fully support bf16.
**[5/17/2023]** After fixing some miscellaneous bugs we now fully support bf16.

**[4/11/2023]** We have upgraded our Flash Attention implementation to now support Alibi positional embeddings.

@@ -91,6 +93,121 @@ from the repository root.
To use [Flash-Attention](https://github.com/HazyResearch/flash-attention), install the additional dependencies in `./requirements/requirements-flashattention.txt` and set the attention type in your configuration accordingly (see [configs](./configs/)). This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere GPUs (such as A100s); see the repository for more details.
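
For instance, a minimal sketch of installing those dependencies from the repository root:

```bash
pip install -r ./requirements/requirements-flashattention.txt
```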


### Multi-Node Launching

NeoX and Deep(er)Speed support training on multiple nodes, and you can choose from a variety of launchers to orchestrate multi-node jobs.

In general, there needs to be a "hostfile" accessible somewhere, with the format:

```bash
node1_ip slots=8
node2_ip slots=8
```

where the first column contains the IP address of each node in your setup and the number of slots is the number of GPUs that node has access to. In your config, you must pass in the path to the hostfile with `"hostfile": "/path/to/hostfile"`. Alternatively, the path to the hostfile can be placed in the environment variable `DLTS_HOSTFILE`.
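
For example, a minimal sketch of launching with the hostfile path supplied through the environment rather than the config (paths are placeholders):

```bash
export DLTS_HOSTFILE=/path/to/hostfile
python3 deepy.py train.py /path/to/configs/my_model.yml
```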

#### pdsh

`pdsh` is the default launcher, and if you're using `pdsh`, all you need to do (besides ensuring that `pdsh` is installed in your environment) is set `{"launcher": "pdsh"}` in your config files.

#### MPI

If using MPI, you must specify the MPI library (DeepSpeed/GPT-NeoX currently supports `mvapich`, `openmpi`, `mpich`, and `impi`, though `openmpi` is the most commonly used and tested) as well as pass the `deepspeed_mpi` flag in your config file:

```json
{
"launcher": "openmpi",
"deepspeed_mpi": true
}
```

With your environment properly set up and the correct configuration files, you can use `deepy.py` like a normal Python script and start (for example) a training job with:

`python3 deepy.py train.py /path/to/configs/my_model.yml`

#### Slurm

Using Slurm can be slightly more involved. Like with MPI, you must add the following to your config:

```json
{
"launcher": "slurm",
"deepspeed_slurm": true
}
```
If you do not have SSH access to the compute nodes in your Slurm cluster, you also need to add `{"no_ssh_check": true}` to your config.
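
Putting these together, a sketch of the Slurm-related config entries (include `no_ssh_check` only when SSH access is unavailable):

```json
{
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "no_ssh_check": true
}
```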

#### (Advanced) Custom Launching

There are many cases where the above default launching options are not sufficient:

- Many clusters have their own unique job scheduler or specific MPI/Slurm arguments necessary for launching jobs, such as [Summit JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun) or [LLNL Flux](https://computing.llnl.gov/projects/flux-building-framework-resource-management)
- While the above Slurm/MPI/pdsh default options are enough for most job runs, advanced users may want to add arguments for optimization or debugging purposes

In these cases, you will need to modify the DeepSpeed [multinode runner](https://github.com/microsoft/DeepSpeed/blob/17957728c0362bf8ae70feca308e491e55ef9feb/deepspeed/launcher/multinode_runner.py) utility to support your use case. Broadly, these enhancements fall under two categories:

##### 1. Adding a Launcher (e.g. [JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun), [Flux](https://computing.llnl.gov/projects/flux-building-framework-resource-management), etc.)

In this case, you must add a new multinode runner class to `deepspeed/launcher/multinode_runner.py` and expose it as a configuration option in GPT-NeoX. Examples of how we did this for [Summit JSRun](https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun) are in [this DeeperSpeed commit](https://github.com/EleutherAI/DeeperSpeed/commit/9aed6c8500d7c492d85c5c88687322dbda70e370) and [this GPT-NeoX commit](https://github.com/EleutherAI/gpt-neox/commit/3782c7ae60f8624e566e3879b89bb09e8b59b869), respectively.

##### 2. Modifying Run Command or Environment Variables

We have encountered many cases where we wish to modify the MPI/Slurm run command for an optimization or to debug (e.g. to modify the [Slurm srun CPU binding](https://slurm.schedmd.com/srun.html#OPT_cpu-bind) or to tag MPI logs with the rank). In this case, you must modify the multinode runner class's run command under its `get_cmd` method (e.g. [mpirun_cmd](https://github.com/microsoft/DeepSpeed/blob/17957728c0362bf8ae70feca308e491e55ef9feb/deepspeed/launcher/multinode_runner.py#L135-L147) for OpenMPI). Examples of how we did this to provide optimized and rank-tagged run commands using Slurm and OpenMPI for the Stability cluster are in [this DeeperSpeed branch](https://github.com/microsoft/DeepSpeed/compare/master...EleutherAI:DeeperSpeed:v2.0-stability).


#### Hostfile Generation

In general, you will not be able to rely on a single fixed hostfile, so you need a script that generates one dynamically when your job starts. An example script to dynamically generate a hostfile using [Slurm](https://slurm.schedmd.com/documentation.html) and 8 GPUs per node is:

```bash
#!/bin/bash
GPUS_PER_NODE=8
mkdir -p /sample/path/to/hostfiles
# need to add the current slurm jobid to hostfile name so that we don't add to previous hostfile
hostfile=/sample/path/to/hostfiles/hosts_$SLURM_JOBID
# be extra sure we aren't appending to a previous hostfile
rm $hostfile &> /dev/null
# loop over the node names
for i in `scontrol show hostnames $SLURM_NODELIST`
do
  # add a line to the hostfile
  echo $i slots=$GPUS_PER_NODE >> $hostfile
done
```

`$SLURM_JOBID` and `$SLURM_NODELIST` are environment variables that Slurm creates for you. See the [sbatch documentation](https://slurm.schedmd.com/sbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES) for a full list of available Slurm environment variables set at job creation time.
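
For instance, with a hypothetical two-node allocation, the script above would produce a hostfile like the following (node names are placeholders):

```bash
node001 slots=8
node002 slots=8
```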

#### Job Launching

Then you can create an [sbatch](https://slurm.schedmd.com/sbatch.html) script from which to kick off your GPT-NeoX job. A bare-bones sbatch script on a Slurm-based cluster with 8 GPUs per node would look like this:

```bash
#!/bin/bash
#SBATCH --job-name="neox"
#SBATCH --partition=your-partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# Some potentially useful distributed environment variables
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12802
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

# Your hostfile creation script from above
./write_hostfile.sh
# Tell DeepSpeed where to find our generated hostfile via DLTS_HOSTFILE
export DLTS_HOSTFILE=/sample/path/to/hostfiles/hosts_$SLURM_JOBID

# Launch training
python3 deepy.py train.py /sample/path/to/your/configs/my_model.yml

```

You can then kick off a training run with `sbatch my_sbatch_script.sh`.


### Containerized Setup

We also provide a Dockerfile if you prefer to run NeoX in a container. To use this option, first build an image named `gpt-neox` from the repository root directory with `docker build -t gpt-neox -f Dockerfile .`. We also host pre-built images on [Docker Hub at `leogao2/gpt-neox`](https://hub.docker.com/r/leogao2/gpt-neox/tags).
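
Once the image is built, a minimal sketch of starting an interactive container with GPU access (assuming the NVIDIA Container Toolkit is installed; volume mounts and flags such as `--shm-size` will depend on your setup):

```bash
docker run --rm -it --gpus all --shm-size=1g gpt-neox bash
```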
@@ -169,7 +286,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):
(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:
You can now pretokenize your data using `tools/datasets/preprocess_data.py`, the arguments for which are detailed below:

```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
@@ -210,7 +327,7 @@ runtime:
For example:

```bash
python tools/preprocess_data.py \
python tools/datasets/preprocess_data.py \
--input ./data/mydataset.jsonl.zst \
--output-prefix ./data/mydataset \
--vocab ./data/gpt2-vocab.json \
@@ -314,19 +431,19 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
python ./tools/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
python ./tools/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved with a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 that used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
python ./tools/upload.py
python ./tools/ckpts/upload.py
```
and input the requested information, including your HF Hub user token.

@@ -357,12 +474,12 @@ If you have found the GPT-NeoX library helpful in your work, you can cite this r
```bibtex
@software{gpt-neox-library,
title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},
url = {https://www.github.com/eleutherai/gpt-neox},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
month = {9},
year = {2023},
version = {2.0.0},
}
```

@@ -388,7 +505,7 @@ Citation instructions for other pretrained models can be found [in the appropria
GPT-NeoX has been used by academic and industry researchers for a variety of high performance computing projects.

### Our Research
EleutherAI and our colaborators have used it in the following publications:
EleutherAI and our collaborators have used it in the following publications:
- Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jon Tow, Ben Wang, and Samuel Weinbach. "[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745)." In *Proceedings of the ACL Workshop on Challenges \& Perspectives in Creating Large Language Models* (2022).
- Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan et al. "[Pythia: A suite for analyzing large language models across training and scaling](https://arxiv.org/abs/2304.01373)." In _International Conference on Machine Learning_, pp. 2397-2430. PMLR (2023).
- Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. "[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https://arxiv.org/abs/2302.12433). *arXiv preprint arXiv:2302.12433* (2023).
Expand Down Expand Up @@ -435,7 +552,7 @@ The following models were trained using this library:
**Other Modalities**
- [University College London](https://www.ucl.ac.uk/computer-science/)'s [ChessGPT-3B](https://huggingface.co/Waterhorse/chessgpt-base-v1)
- [Gretel](https://gretel.ai/)'s [Text-to-Table](https://huggingface.co/gretelai/text2table)


## Licensing

@@ -461,4 +578,4 @@ For full terms, see the `LICENSE` file. If you have any questions, comments, or

## Acknowledgements

We run our experiments on a Kubernetes cluster provided by [CoreWeave](https://coreweave.com/) and a SLURM cluster provided by [Stability AI](https://stability.ai). We are thankful to the DeepSpeed team for their advice and consultation.
We run our experiments on a Kubernetes cluster provided by [CoreWeave](https://coreweave.com/) and a Slurm cluster provided by [Stability AI](https://stability.ai). We are thankful to the DeepSpeed team for their advice and consultation.
18 changes: 17 additions & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

Default = 1d20559
Default = ec71f71

current git hash of repository

@@ -1169,6 +1169,22 @@ Training Arguments



- **s3_path**: str

Default = None

Path to s3 bucket for saving checkpoints.



- **s3_chunk_size**: int

Default = 104857600

The number of bytes in each file chunk when uploading to s3. Defaults to 100MiB.
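
For reference, a sketch of what the corresponding entries might look like in a model config (the bucket path is a placeholder):

```json
{
  "s3_path": "s3://your-bucket/neox-checkpoints",
  "s3_chunk_size": 104857600
}
```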



- **config_files**: dict

Default = None