Merge branch 'main' into add-axonn-3d-TP
Quentin-Anthony authored Dec 19, 2023
2 parents f1c40e2 + 050f560 commit 7438b33
Showing 39 changed files with 100,922 additions and 572 deletions.
39 changes: 24 additions & 15 deletions .github/workflows/coverity_scan.yml
@@ -23,29 +23,38 @@ jobs:

steps:
- uses: actions/checkout@v2
with:
path: gpt-neox

- name: Install utils
run: |
apt update -y && apt upgrade -y
apt install curl jq wget -y
sudo apt update -y && sudo apt upgrade -y
sudo apt install curl jq wget -y
- name: Coverity Download
run: |
wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=EleutherAI%2Fgpt-neox" -O coverity_tool.tgz
$GITHUB_WORKSPACE/bin/cov-configure --python
$GITHUB_WORKSPACE/bin/cov-configure --gcc
wget https://scan.coverity.com/download/linux64 --post-data "token=$COVERITY_TOKEN&project=$COVERITY_PROJECT" -O coverity_tool.tgz --no-verbose
mkdir $GITHUB_WORKSPACE/coverity && tar xvf coverity_tool.tgz -C $GITHUB_WORKSPACE/coverity --strip-components=1
$GITHUB_WORKSPACE/coverity/bin/cov-configure --python
$GITHUB_WORKSPACE/coverity/bin/cov-configure --gcc
- name: Coverity Scan
- name: Coverity Scan and Upload
run: |
set -x
$GITHUB_WORKSPACE/bin/cov-build --dir cov-int --no-command --fs-capture-search $GITHUB_WORKSPACE
- name: Coverity Upload
run: |
pushd $GITHUB_WORKSPACE
cd $GITHUB_WORKSPACE/gpt-neox
$GITHUB_WORKSPACE/coverity/bin/cov-build --dir $GITHUB_WORKSPACE/cov-int --no-command --fs-capture-search ./
popd
tar caf build-results.bz2 cov-int
curl --form token=$COV_PASSPHRASE \
curl --form token=$COVERITY_TOKEN \
--form email=$COV_USER \
--form file=@GITHUB_WORKSPACE/build-results.bz2 \
--form version="Version" \
--form description="Build" \
https://scan.coverity.com/builds?project=EleutherAI%2Fgpt-neox
--form [email protected] \
--form version="${{ inputs.build_version }}" \
--form description="${{ inputs.build_description }}" \
https://scan.coverity.com/builds?project=$COVERITY_PROJECT
- name: Upload Scan Build as Artifact
uses: actions/upload-artifact@v3
with:
name: coverity-build-${{ github.sha }}
path: build-results.bz2
27 changes: 13 additions & 14 deletions README.md
@@ -82,9 +82,8 @@ python ./megatron/fused_kernels/setup.py install # optional, if using fused kern

from the repository root.

<aside>

**Warning:** Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.
> [!Warning]
> Our codebase relies on [DeeperSpeed](https://github.com/EleutherAI/DeeperSpeed), our fork of the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.
</aside>

@@ -229,19 +228,19 @@ We currently offer three main functions:
which can be launched with:

```bash
./deepy.py [script.py] [./path/to/config_1.yaml] [./path/to/config_2.yaml] ... [./path/to/config_n.yaml]
./deepy.py [script.py] [./path/to/config_1.yml] [./path/to/config_2.yml] ... [./path/to/config_n.yml]
```

For example, to launch training you can run
```bash
./deepy.py train.py ./configs/20B.yaml ./configs/local_cluster.yaml
./deepy.py train.py ./configs/20B.yml ./configs/local_cluster.yml
```

For more details on each entry point, see the [Training and Finetuning](#training-and-finetuning), [Inference](#inference), and [Evaluation](#evaluation) sections respectively.

# Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in [configs](./configs/), showing a diverse array of features and model sizes.
GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yml files in [configs](./configs/), showing a diverse array of features and model sizes.

These files are generally complete, but not optimal. For example, depending on your specific GPU configuration, you may need to change settings such as `pipe-parallel-size` and `model-parallel-size` to increase or decrease the degree of parallelisation, `train_micro_batch_size_per_gpu` or `gradient-accumulation-steps` to modify batch-size-related settings, or the `zero_optimization` dict to modify how optimizer states are parallelised across workers.
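
For instance, a minimal override file along these lines adjusts exactly those knobs (the keys are the ones named above, in the brace-and-quoted-key style the example configs use; the values are purely illustrative and must be tuned to your hardware):

```yaml
{
  # illustrative values only -- tune for your own cluster
  "pipe-parallel-size": 1,
  "model-parallel-size": 2,
  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 8,
  # controls how optimizer states are sharded across workers
  "zero_optimization": {
    "stage": 1,
  },
}
```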

@@ -350,7 +349,7 @@ Training is launched using `deepy.py`, a wrapper around DeepSpeed's launcher, wh
The general usage pattern is:

```bash
python ./deepy.py train.py [path/to/config1.yaml] [path/to/config2.yaml] ...
python ./deepy.py train.py [path/to/config1.yml] [path/to/config2.yml] ...
```

You can pass in an arbitrary number of configs which will all be merged at runtime.
@@ -360,19 +359,19 @@ You can also optionally pass in a config prefix, which will assume all your conf
E.g.:

```bash
python ./deepy.py train.py -d configs 125M.yaml local_setup.yaml
python ./deepy.py train.py -d configs 125M.yml local_setup.yml
```

This will deploy the `train.py` script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the `/job/hostfile` file (see [parameter documentation](configs/README.md)), or can simply be passed in as the `num_gpus` arg if running on a single node setup.
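
The `/job/hostfile` follows DeepSpeed's hostfile convention: one worker per line, with its GPU count given as `slots`. A minimal sketch (the hostnames are placeholders):

```
# hypothetical two-node cluster with 8 GPUs per node
node1 slots=8
node2 slots=8
```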

Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g `configs/125M.yaml`) and the data path parameters in another (e.g `configs/local_setup.yaml`).
Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g `configs/125M.yml`) and the data path parameters in another (e.g `configs/local_setup.yml`).


## Pretrained Models

### GPT-NeoX-20B

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on [the Pile](https://arxiv.org/abs/2101.00027). Technical details about GPT-NeoX-20B can be found in [the associated paper](https://arxiv.org/abs/2204.06745). The configuration file for this model is both available at [`./configs/20B.yaml`](./configs/20B.yaml) and included in the download links below.
GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on [the Pile](https://arxiv.org/abs/2101.00027). Technical details about GPT-NeoX-20B can be found in [the associated paper](https://arxiv.org/abs/2204.06745). The configuration file for this model is both available at [`./configs/20B.yml`](./configs/20B.yml) and included in the download links below.

[Slim weights](https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/) - (No optimizer states, for inference or finetuning, 39GB)

@@ -411,7 +410,7 @@ We support three types of generation from a pretrained model:
2. Conditional generation based on an input read from a file
3. Interactive generation, which allows for multiple rounds of back-and-forth between a user and the language model via a command line interface

All three types of text generation can be launched via `python ./deepy.py generate.py -d configs 125M.yaml local_setup.yaml text_generation.yaml` with the appropriate values set in `configs/text_generation.yaml`.
All three types of text generation can be launched via `python ./deepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml` with the appropriate values set in `configs/text_generation.yml`.
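
A `text_generation.yml` in this vein selects the generation mode and sampling parameters. A sketch, assuming the standard NeoX text-generation argument names (the values are illustrative):

```yaml
{
  # one of "unconditional", "input-file", or "interactive"
  "text_gen_type": "unconditional",

  # sampling settings
  "temperature": 0.9,
  "maximum_tokens": 64,

  # only used when text_gen_type is "input-file"
  "sample_input_file": "prompts.txt",
  "sample_output_file": "samples.txt",
}
```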

# Evaluation

@@ -420,7 +419,7 @@ GPT-NeoX supports evaluation on downstream tasks through the [language model eva
To evaluate a trained model on the evaluation harness, simply run:

```bash
python ./deepy.py evaluate.py -d configs your_configs.yaml --eval_tasks task1 task2 ... taskn
python ./deepy.py evaluate.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn
```

where `--eval_tasks` is a space-separated list of evaluation tasks, e.g `--eval_tasks lambada hellaswag piqa sciq`. For details of all tasks available, refer to the [lm-evaluation-harness repo](https://github.com/EleutherAI/lm-evaluation-harness).
@@ -431,12 +430,12 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved on a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 and which used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

4 changes: 2 additions & 2 deletions configs/README.md
@@ -85,7 +85,7 @@ Note: yaml arguments may be formatted with either '-' or '_'. The standard separ

# misc. training settings
"distributed_backend": "nccl",
"save_interval": 10000,
"checkpoint_factor": 10000,
"eval_interval": 1000,
"eval_iters": 10,

@@ -230,7 +230,7 @@ Additional DeepSpeed settings besides those mentioned above should be wrapped in
"load": "checkpoints",
"tensorboard_dir": "tensorboard",
"log_dir": "logs",
"save_interval": 10000,
"checkpoint_factor": 10000,
"eval_interval": 1000,
"eval_iters": 10,
```
8 changes: 4 additions & 4 deletions configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

Default = 20d4228
Default = bb1b145

current git hash of repository

@@ -800,23 +800,23 @@ Misc. Arguments



- **do_train**: int
- **do_train**: bool

Default = None

Set during training



- **do_valid**: int
- **do_valid**: bool

Default = None

Set during training



- **do_test**: int
- **do_test**: bool

Default = None

4 changes: 2 additions & 2 deletions deepy.py
@@ -19,13 +19,13 @@
import deepspeed.launcher.runner


def main():
def main(input_args=None):
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

from megatron.neox_arguments import NeoXArgs
from megatron.utils import get_wandb_api_key

neox_args = NeoXArgs.consume_deepy_args()
neox_args = NeoXArgs.consume_deepy_args(input_args)
deepspeed_main_args = neox_args.get_deepspeed_main_args()

# Extract wandb API key and inject into worker environments
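
This change lets `deepy.py` be driven programmatically as well as from the shell. A minimal sketch, assuming `input_args` is an argv-style list handed through to `NeoXArgs.consume_deepy_args`:

```python
# Hypothetical programmatic launch, equivalent to:
#   ./deepy.py train.py -d configs 125M.yml local_setup.yml
from deepy import main

main(input_args=["train.py", "-d", "configs", "125M.yml", "local_setup.yml"])
```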
4 changes: 2 additions & 2 deletions evaluate.py
@@ -31,8 +31,8 @@
import json


def main():
model, neox_args = setup_for_inference_or_eval(use_cache=False)
def main(input_args=None, overwrite_values=None):
model, neox_args = setup_for_inference_or_eval(use_cache=False, input_args=input_args, overwrite_values=overwrite_values)
results = run_eval_harness(
model,
forward_step,
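`evaluate.main` gains the same programmatic entry point, plus `overwrite_values`, which (judging by the call into `setup_for_inference_or_eval`) applies argument overrides on top of the parsed configs. A sketch under that assumption:

```python
# Hypothetical programmatic evaluation; the overwrite_values keys are
# assumed to be NeoX argument names, e.g. eval_tasks from the README.
from evaluate import main

main(
    input_args=["-d", "configs", "your_configs.yml"],
    overwrite_values={"eval_tasks": ["lambada", "piqa"]},
)
```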
6 changes: 3 additions & 3 deletions generate.py
@@ -26,11 +26,11 @@
)


def main():
def main(input_args=None, overwrite_values=None):
"""
Generate text/sample model
"""
model, neox_args = setup_for_inference_or_eval(use_cache=True)
model, neox_args = setup_for_inference_or_eval(use_cache=True, input_args=input_args, overwrite_values=overwrite_values)
if neox_args.recompute:
model.module.inference_mode(
use_cache=False
@@ -83,7 +83,7 @@ def main():

else:
raise ValueError(
f"`text-gen-type` either not specified or not recognised: {neox_args.text_gen_type}"
f"`text_gen_type` either not specified or not recognised: {neox_args.text_gen_type}"
)


1 change: 0 additions & 1 deletion megatron/__init__.py
@@ -23,5 +23,4 @@ def print_rank_0(*message):
print(*message, flush=True)


from .initialize import initialize_megatron
from .neox_arguments import NeoXArgs
7 changes: 4 additions & 3 deletions megatron/checkpointing.py
@@ -392,9 +392,10 @@ def load_checkpoint(
if neox_args.finetune:
iteration = 0
else:
iteration = state_dict.get("iteration") or state_dict.get(
"total_iters"
) # total_iters backward compatible with older checkpoints
if "iteration" in state_dict:
iteration = state_dict["iteration"]
else:
iteration = state_dict.get("total_iters") # total_iters backward compatible with older checkpoints
if iteration is None:
raise ValueError(
f"Unable to load iteration from checkpoint {checkpoint_name} with keys {state_dict.keys()}, exiting"
1 change: 0 additions & 1 deletion megatron/fused_kernels/__init__.py
@@ -16,7 +16,6 @@
import pathlib
import subprocess

from torch.utils import cpp_extension
from pathlib import Path

srcpath = Path(__file__).parent.absolute()