Merge branch 'main' into add-axonn-3d-TP
Quentin-Anthony authored Dec 19, 2023
2 parents f1c40e2 + 050f560 commit 7438b33
Showing 39 changed files with 100,922 additions and 572 deletions.
39 changes: 24 additions & 15 deletions .github/workflows/coverity_scan.yml
@@ -23,29 +23,38 @@ jobs:

steps:
- uses: actions/checkout@v2
with:
path: gpt-neox

- name: Install utils
run: |
apt update -y && apt upgrade -y
apt install curl jq wget -y
sudo apt update -y && sudo apt upgrade -y
sudo apt install curl jq wget -y
- name: Coverity Download
run: |
wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=EleutherAI%2Fgpt-neox" -O coverity_tool.tgz
$GITHUB_WORKSPACE/bin/cov-configure --python
$GITHUB_WORKSPACE/bin/cov-configure --gcc
wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=$COVERITY_PROJECT" -O coverity_tool.tgz --no-verbose
mkdir $GITHUB_WORKSPACE/coverity && tar xvf coverity_tool.tgz -C $GITHUB_WORKSPACE/coverity --strip-components=1
$GITHUB_WORKSPACE/coverity/bin/cov-configure --python
$GITHUB_WORKSPACE/coverity/bin/cov-configure --gcc
- name: Coverity Scan
- name: Coverity Scan and Upload
run: |
set -x
$GITHUB_WORKSPACE/bin/cov-build --dir cov-int --no-command --fs-capture-search $GITHUB_WORKSPACE
- name: Coverity Upload
run: |
pushd $GITHUB_WORKSPACE
cd $GITHUB_WORKSPACE/gpt-neox
$GITHUB_WORKSPACE/coverity/bin/cov-build --dir $GITHUB_WORKSPACE/cov-int --no-command --fs-capture-search ./
popd
tar caf build-results.bz2 cov-int
curl --form token=$COV_PASSPHRASE \
curl --form token=$COVERITY_TOKEN \
--form email=$COV_USER \
--form file=@GITHUB_WORKSPACE/build-results.bz2 \
--form version="Version" \
--form description="Build" \
https://scan.coverity.com/builds?project=EleutherAI%2Fgpt-neox
--form [email protected] \
--form version="${{ inputs.build_version }}" \
--form description="${{ inputs.build_description }}" \
https://scan.coverity.com/builds?project=$COVERITY_PROJECT
- name: Upload Scan Build as Artifact
uses: actions/upload-artifact@v3
with:
name: coverity-build-${{ github.sha }}
path: build-results.bz2
27 changes: 13 additions & 14 deletions README.md
@@ -82,9 +82,8 @@ python ./megatron/fused_kernels/setup.py install # optional, if using fused kern

from the repository root.

<aside>

**Warning:** Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.
> [!Warning]
> Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.
</aside>

@@ -229,19 +228,19 @@ We currently offer three main functions:
which can be launched with:

```bash
./deepy.py [script.py] [./path/to/config_1.yaml] [./path/to/config_2.yaml] ... [./path/to/config_n.yaml]
./deepy.py [script.py] [./path/to/config_1.yml] [./path/to/config_2.yml] ... [./path/to/config_n.yml]
```

For example, to launch training you can run
```bash
./deepy.py train.py ./configs/20B.yaml ./configs/local_cluster.yaml
./deepy.py train.py ./configs/20B.yml ./configs/local_cluster.yml
```

For more details on each entry point, see the [Training and Finetuning](#training-and-finetuning), [Inference](#inference), and [Evaluation](#evaluation) sections respectively.

# Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in [configs](./configs/), showing a diverse array of features and model sizes.
GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yml files in [configs](./configs/), showing a diverse array of features and model sizes.

These files are generally complete, but not optimal. For example, depending on your specific GPU configuration, you may need to change settings such as `pipe-parallel-size` and `model-parallel-size` to increase or decrease the degree of parallelisation, `train_micro_batch_size_per_gpu` or `gradient-accumulation-steps` to modify batch-size-related settings, or the `zero_optimization` dict to modify how optimizer states are parallelised across workers.
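
For instance, a minimal override file along these lines adjusts exactly those knobs (the keys are the ones named above, in the brace-and-quoted-key style the example configs use; the values are purely illustrative and must be tuned to your hardware):

```yaml
{
  # illustrative values only -- tune for your own cluster
  "pipe-parallel-size": 1,
  "model-parallel-size": 2,
  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 8,
  # controls how optimizer states are sharded across workers
  "zero_optimization": {
    "stage": 1,
  },
}
```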

@@ -350,7 +349,7 @@ Training is launched using `deepy.py`, a wrapper around DeepSpeed's launcher, wh
The general usage pattern is:

```bash
python ./deepy.py train.py [path/to/config1.yaml] [path/to/config2.yaml] ...
python ./deepy.py train.py [path/to/config1.yml] [path/to/config2.yml] ...
```

You can pass in an arbitrary number of configs which will all be merged at runtime.
@@ -360,19 +359,19 @@ You can also optionally pass in a config prefix, which will assume all your conf
E.g.:

```bash
python ./deepy.py train.py -d configs 125M.yaml local_setup.yaml
python ./deepy.py train.py -d configs 125M.yml local_setup.yml
```

This will deploy the `train.py` script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the `/job/hostfile` file (see [parameter documentation](configs/README.md)), or can simply be passed in as the `num_gpus` arg if running on a single node setup.
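
The `/job/hostfile` follows DeepSpeed's hostfile convention: one worker per line, with its GPU count given as `slots`. A minimal sketch (the hostnames are placeholders):

```
# hypothetical two-node cluster with 8 GPUs per node
node1 slots=8
node2 slots=8
```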

Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g `configs/125M.yaml`) and the data path parameters in another (e.g `configs/local_setup.yaml`).
Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g `configs/125M.yml`) and the data path parameters in another (e.g `configs/local_setup.yml`).


## Pretrained Models

### GPT-NeoX-20B

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on [the Pile](https://arxiv.org/abs/2101.00027). Technical details about GPT-NeoX-20B can be found in [the associated paper](https://arxiv.org/abs/2204.06745). The configuration file for this model is both available at [`./configs/20B.yaml`](./configs/20B.yaml) and included in the download links below.
GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on [the Pile](https://arxiv.org/abs/2101.00027). Technical details about GPT-NeoX-20B can be found in [the associated paper](https://arxiv.org/abs/2204.06745). The configuration file for this model is both available at [`./configs/20B.yml`](./configs/20B.yml) and included in the download links below.

[Slim weights](https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/) - (No optimizer states, for inference or finetuning, 39GB)

@@ -411,7 +410,7 @@ We support three types of generation from a pretrained model:
2. Conditional generation based on an input read from a file
3. Interactive generation, which allows for multiple rounds of back-and-forth between a user and the language model via a command line interface

All three types of text generation can be launched via `python ./deepy.py generate.py -d configs 125M.yaml local_setup.yaml text_generation.yaml` with the appropriate values set in `configs/text_generation.yaml`.
All three types of text generation can be launched via `python ./deepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml` with the appropriate values set in `configs/text_generation.yml`.
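
A `text_generation.yml` in this vein selects the generation mode and sampling parameters. A sketch, assuming the standard NeoX text-generation argument names (the values are illustrative):

```yaml
{
  # one of "unconditional", "input-file", or "interactive"
  "text_gen_type": "unconditional",

  # sampling settings
  "temperature": 0.9,
  "maximum_tokens": 64,

  # only used when text_gen_type is "input-file"
  "sample_input_file": "prompts.txt",
  "sample_output_file": "samples.txt",
}
```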

# Evaluation

@@ -420,7 +419,7 @@ GPT-NeoX supports evaluation on downstream tasks through the [language model eva
To evaluate a trained model on the evaluation harness, simply run:

```bash
python ./deepy.py evaluate.py -d configs your_configs.yaml --eval_tasks task1 task2 ... taskn
python ./deepy.py evaluate.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn
```

where `--eval_tasks` is a space-separated list of evaluation tasks, e.g `--eval_tasks lambada hellaswag piqa sciq`. For details of all tasks available, refer to the [lm-evaluation-harness repo](https://github.com/EleutherAI/lm-evaluation-harness).
@@ -431,12 +430,12 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved on a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 and which used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

4 changes: 2 additions & 2 deletions configs/README.md
@@ -85,7 +85,7 @@ Note: yaml arguments may be formatted with either '-' or '_'. The standard separ

# misc. training settings
"distributed_backend": "nccl",
"save_interval": 10000,
"checkpoint_factor": 10000,
"eval_interval": 1000,
"eval_iters": 10,

@@ -230,7 +230,7 @@ Additional DeepSpeed settings besides those mentioned above should be wrapped in
"load": "checkpoints",
"tensorboard_dir": "tensorboard",
"log_dir": "logs",
"save_interval": 10000,
"checkpoint_factor": 10000,
"eval_interval": 1000,
"eval_iters": 10,
```
8 changes: 4 additions & 4 deletions configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

Default = 20d4228
Default = bb1b145

current git hash of repository

@@ -800,23 +800,23 @@ Misc. Arguments



- **do_train**: int
- **do_train**: bool

Default = None

Set during training



- **do_valid**: int
- **do_valid**: bool

Default = None

Set during training



- **do_test**: int
- **do_test**: bool

Default = None

4 changes: 2 additions & 2 deletions deepy.py
@@ -19,13 +19,13 @@
import deepspeed.launcher.runner


def main():
def main(input_args=None):
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

from megatron.neox_arguments import NeoXArgs
from megatron.utils import get_wandb_api_key

neox_args = NeoXArgs.consume_deepy_args()
neox_args = NeoXArgs.consume_deepy_args(input_args)
deepspeed_main_args = neox_args.get_deepspeed_main_args()

# Extract wandb API key and inject into worker environments
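
This change lets `deepy.py` be driven programmatically as well as from the shell. A minimal sketch, assuming `input_args` is an argv-style list handed through to `NeoXArgs.consume_deepy_args`:

```python
# Hypothetical programmatic launch, equivalent to:
#   ./deepy.py train.py -d configs 125M.yml local_setup.yml
from deepy import main

main(input_args=["train.py", "-d", "configs", "125M.yml", "local_setup.yml"])
```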
4 changes: 2 additions & 2 deletions evaluate.py
@@ -31,8 +31,8 @@
import json


def main():
model, neox_args = setup_for_inference_or_eval(use_cache=False)
def main(input_args=None, overwrite_values=None):
model, neox_args = setup_for_inference_or_eval(use_cache=False, input_args=input_args, overwrite_values=overwrite_values)
results = run_eval_harness(
model,
forward_step,
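`evaluate.main` gains the same programmatic entry point, plus `overwrite_values`, which (judging by the call into `setup_for_inference_or_eval`) applies argument overrides on top of the parsed configs. A sketch under that assumption:

```python
# Hypothetical programmatic evaluation; the overwrite_values keys are
# assumed to be NeoX argument names, e.g. eval_tasks from the README.
from evaluate import main

main(
    input_args=["-d", "configs", "your_configs.yml"],
    overwrite_values={"eval_tasks": ["lambada", "piqa"]},
)
```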
6 changes: 3 additions & 3 deletions generate.py
@@ -26,11 +26,11 @@
)


def main():
def main(input_args=None, overwrite_values=None):
"""
Generate text/sample model
"""
model, neox_args = setup_for_inference_or_eval(use_cache=True)
model, neox_args = setup_for_inference_or_eval(use_cache=True, input_args=input_args, overwrite_values=overwrite_values)
if neox_args.recompute:
model.module.inference_mode(
use_cache=False
@@ -83,7 +83,7 @@ def main():

else:
raise ValueError(
f"`text-gen-type` either not specified or not recognised: {neox_args.text_gen_type}"
f"`text_gen_type` either not specified or not recognised: {neox_args.text_gen_type}"
)


1 change: 0 additions & 1 deletion megatron/__init__.py
@@ -23,5 +23,4 @@ def print_rank_0(*message):
print(*message, flush=True)


from .initialize import initialize_megatron
from .neox_arguments import NeoXArgs
7 changes: 4 additions & 3 deletions megatron/checkpointing.py
@@ -392,9 +392,10 @@ def load_checkpoint(
if neox_args.finetune:
iteration = 0
else:
iteration = state_dict.get("iteration") or state_dict.get(
"total_iters"
) # total_iters backward compatible with older checkpoints
if "iteration" in state_dict:
iteration = state_dict["iteration"]
else:
iteration = state_dict.get("total_iters") # total_iters backward compatible with older checkpoints
if iteration is None:
raise ValueError(
f"Unable to load iteration from checkpoint {checkpoint_name} with keys {state_dict.keys()}, exiting"
1 change: 0 additions & 1 deletion megatron/fused_kernels/__init__.py
@@ -16,7 +16,6 @@
import pathlib
import subprocess

from torch.utils import cpp_extension
from pathlib import Path

srcpath = Path(__file__).parent.absolute()