Organize the tools directory (#1055)
* Re-organize the tools folder

Co-authored-by: Stella Biderman <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>

* Add README.md files for each subdirectory.

Signed-off-by: Dashiell Stander <[email protected]>

* Update NeoXArgs docs automatically

* Clarify the difference between HF scripts

Signed-off-by: Dashiell Stander <[email protected]>

* Update NeoXArgs docs automatically

* Fix tools paths

* Update NeoXArgs docs automatically

* flesh out ckpts README

* Update NeoXArgs docs automatically

* Fix tools paths for megatron imports

* Update NeoXArgs docs automatically

* Delete tools/ckpts/merge_mp_partitions.py since it's based on a very old Megatron

* Update NeoXArgs docs automatically

* Add blurb to bash tools README

* Update NeoXArgs docs automatically

* Flesh out datasets README

* Update NeoXArgs docs automatically

* formatting

* Update NeoXArgs docs automatically

---------

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
4 people authored Oct 2, 2023
1 parent 7a8569f commit 3f43f07
Showing 25 changed files with 552 additions and 504 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -286,7 +286,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):
(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

-You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:
+You can now pretokenize your data using `tools/datasets/preprocess_data.py`, the arguments for which are detailed below:

```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
@@ -327,7 +327,7 @@ runtime:
For example:

```bash
-python tools/preprocess_data.py \
+python tools/datasets/preprocess_data.py \
--input ./data/mydataset.jsonl.zst \
--output-prefix ./data/mydataset \
--vocab ./data/gpt2-vocab.json \
@@ -431,19 +431,19 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
-python ./tools/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
+python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
-python ./tools/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
+python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved on a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 and which used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
-python ./tools/upload.py
+python ./tools/ckpts/upload.py
```
and input the requested information, including HF hub user token.

2 changes: 1 addition & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

-Default = fd35b00
+Default = a0cf0e8

current git hash of repository

2 changes: 1 addition & 1 deletion prepare_data.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-from tools.corpora import prepare_dataset, DATA_DOWNLOADERS
+from tools.datasets.corpora import prepare_dataset, DATA_DOWNLOADERS
import argparse

TOKENIZER_CHOICES = [
15 changes: 15 additions & 0 deletions tools/README.md
@@ -0,0 +1,15 @@
# GPT-NeoX Auxiliary Tools

This directory contains a number of auxiliary tools that are useful for working with GPT-NeoX but not part of the main training code.

## Bash

This directory contains some simple, frequently used bash commands to make working on multiple machines easier.

## Checkpoints

This directory contains tools for manipulating and converting checkpoints, including changing the parallelism settings of a pretrained model, converting between GPT-NeoX and the transformers library, and updating checkpoints trained with Version 1.x of this library to be compatible with Version 2.x.

## Datasets

This directory contains tools for downloading and preprocessing datasets to the format expected by the GPT-NeoX library.
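
For instance, a pretokenization run with the relocated script might look like the following sketch (all paths are placeholders; the flags mirror the `preprocess_data.py` usage string shown in the main README above):

```bash
# Hedged sketch: pretokenize a local JSONL dataset with the relocated script.
# Every path below is a placeholder.
python tools/datasets/preprocess_data.py \
    --input ./data/mydataset.jsonl.zst \
    --output-prefix ./data/mydataset \
    --vocab-file ./data/gpt2-vocab.json \
    --merge-file ./data/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod
```
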
8 changes: 8 additions & 0 deletions tools/bash/README.md
@@ -0,0 +1,8 @@
# Bash Scripts
Useful for running distributed per-node scripts on e.g. Kubernetes. Illustrative invocations are sketched after the list below.

* `kill.sh` kills all python processes
* `killall.sh` uses pdsh to kill all `train.py` processes on the nodes listed in `/job/hosts/`
* `sync_cmd.sh` uses pdsh to run a command on all the nodes listed in `/job/hosts/`
* `sync.sh` uses pdcp to copy every file in a provided path to all of the nodes listed in `/job/hosts/`
* `syncdir.sh` uses pdcp to copy every file in a provided path to all of the nodes listed in `/job/hosts/`
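
A few illustrative invocations (arguments are hypothetical; the pdsh/pdcp-based scripts read the node list from `/job/hosts/`):

```bash
# Kill all train.py processes on every node listed in /job/hosts/
./tools/bash/killall.sh

# Run a command on every node (the command string here is just an example)
./tools/bash/sync_cmd.sh "nvidia-smi"

# Copy local files out to the same location on every node (placeholder paths)
./tools/bash/sync.sh ./configs/my_config.yml
./tools/bash/syncdir.sh ./checkpoints/global_step1000
```
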
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion tools/syncdir.sh → tools/bash/syncdir.sh
@@ -16,7 +16,7 @@

# Push files to all nodes
# Usage
-# sync.sh file [file2..]
+# syncdir.sh file [file2..]

echo Number of files to upload: $#

133 changes: 133 additions & 0 deletions tools/ckpts/README.md
@@ -0,0 +1,133 @@
# Checkpoint Scripts


## Utilities

### `inspect_checkpoints.py`
Reports information about a saved checkpoint.
```
usage: inspect_checkpoints.py [-h] [--attributes [ATTRIBUTES ...]] [--interactive] [--compare] [--diff] dir
positional arguments:
dir The checkpoint dir to inspect. Must be either: a directory containing pickle binaries saved with 'torch.save' ending in .pt or .ckpt; a single path to a .pt or .ckpt file; or two comma-separated directories, in which case the script will *compare* the two checkpoints
options:
-h, --help show this help message and exit
--attributes [ATTRIBUTES ...]
Name of one or several attributes to query. To access an attribute within a nested structure, use '/' as separator.
--interactive, -i Drops into interactive shell after printing the summary.
--compare, -c If true, script will compare two directories separated by commas
--diff, -d In compare mode, only print diffs
```
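
For instance (the paths and the attribute name below are hypothetical):

```bash
# Summarize a single checkpoint directory and query a nested attribute
python tools/ckpts/inspect_checkpoints.py checkpoints/global_step1000 --attributes optimizer/param_groups

# Compare two checkpoint directories and print only their differences
python tools/ckpts/inspect_checkpoints.py checkpoints/run_a,checkpoints/run_b --compare --diff
```
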

## HuggingFace Scripts

### `convert_hf_to_sequential.py`
A script for converting publicly available Hugging Face (HF) checkpoints to NeoX format.

Note that this script requires access to the NeoX config files corresponding to the equivalent Hugging Face models.

```
Example usage: (Converts the 70M Pythia model to NeoX format)
================================================================
OMPI_COMM_WORLD_RANK=0 CUDA_VISIBLE_DEVICES=0 python tools/ckpts/convert_hf_to_sequential.py \
--hf-model-name pythia-70m-v0 \
--revision 143000 \
--output-dir checkpoints/neox_converted/pythia/70m \
--cache-dir checkpoints/HF \
--config configs/pythia/70M.yml configs/local_setup.yml \
--test
For multi-gpu support we must initialize deepspeed:
NOTE: This requires manually changing the arguments below.
================================================================
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./deepy.py tools/ckpts/convert_hf_to_sequential.py \
-d configs pythia/70M.yml local_setup.yml
```
### `convert_module_to_hf.py`
Converts a NeoX model with pipeline parallelism greater than 1 to a Hugging Face Transformers `GPTNeoXForCausalLM` model.

Note that this script does not support all NeoX features.
Please investigate carefully whether your model is compatible with the architecture supported by the `GPTNeoXForCausalLM` class in HF.

(e.g. position embeddings such as ALiBi may not be supported by Hugging Face's GPT-NeoX architecture)

```
usage: convert_module_to_hf.py [-h] [--input_dir INPUT_DIR] [--config_file CONFIG_FILE] [--output_dir OUTPUT_DIR] [--upload]
Merge MP partitions and convert to HF Model.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Path to NeoX checkpoint, e.g. /path/to/model/global_step143000
--config_file CONFIG_FILE
Path to config file for the input NeoX checkpoint.
--output_dir OUTPUT_DIR
Output dir, where to save the HF Model, tokenizer, and configs
--upload Set to true in order to upload to the HF Hub directly.
```
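
A typical invocation mirrors the one shown in the main README (all paths are placeholders):

```bash
python ./tools/ckpts/convert_module_to_hf.py \
    --input_dir /path/to/model/global_stepXXX \
    --config_file your_config.yaml \
    --output_dir hf_model/save/location
```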

### `convert_sequential_to_hf.py`
Converts a NeoX model without pipeline parallelism to a Hugging Face Transformers `GPTNeoXForCausalLM` model.

```
usage: convert_sequential_to_hf.py [-h] [--input_dir INPUT_DIR] [--config_file CONFIG_FILE] [--output_dir OUTPUT_DIR] [--upload]
Merge MP partitions and convert to HF Model.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Path to NeoX checkpoint, e.g. /path/to/model/global_step143000
--config_file CONFIG_FILE
Path to config file for the input NeoX checkpoint.
--output_dir OUTPUT_DIR
Output dir, where to save the HF Model, tokenizer, and configs
--upload Set to true in order to upload to the HF Hub directly.
```
### `upload.py`
Uploads a _converted_ checkpoint to the Hugging Face Hub.

```
python upload.py <converted-ckpt-dir> <repo-name> <branch-name>
```
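
For example, after converting a checkpoint as above (the repository and branch names here are hypothetical, and `huggingface-cli login` must be run first):

```bash
huggingface-cli login
python tools/ckpts/upload.py hf_model/save/location your-hf-username/your-model-name main
```
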
## NeoX-20B Scripts

### `merge20b.py`
Reduces the model parallelism and pipeline parallelism of a 20B checkpoint to 1 each (i.e. MP=1, PP=1).

```
usage: merge20b.py [-h] [--input_dir INPUT_DIR] [--output_dir OUTPUT_DIR]
Merge 20B checkpoint.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Checkpoint dir, which should contain e.g. a folder named "global_step150000"
--output_dir OUTPUT_DIR
Output dir, to save the 1-GPU weights and configs
```
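
A sketch of a typical run (both paths are placeholders):

```bash
python tools/ckpts/merge20b.py \
    --input_dir ./20B_checkpoints \
    --output_dir ./20B_checkpoints_merged
```
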
## Llama Scripts

### `convert_raw_llama_weights_to_neox.py`
Takes a Llama checkpoint and puts it into a NeoX-compatible format.

```
usage: convert_raw_llama_weights_to_neox.py [-h] [--input_dir INPUT_DIR] [--model_size {7B,13B,30B,65B,tokenizer_only}] [--output_dir OUTPUT_DIR] [--num_output_shards NUM_OUTPUT_SHARDS] [--pipeline_parallel]
Convert raw LLaMA checkpoints to GPT-NeoX format.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Location of LLaMA weights, which contains tokenizer.model and model folders
--model_size {7B,13B,30B,65B,tokenizer_only}
--output_dir OUTPUT_DIR
Location to write GPT-NeoX model
--num_output_shards NUM_OUTPUT_SHARDS
--pipeline_parallel Only use if PP>1
```
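
For example, converting the 7B weights might look like this (the paths and shard count are illustrative):

```bash
python tools/ckpts/convert_raw_llama_weights_to_neox.py \
    --input_dir /path/to/llama/weights \
    --model_size 7B \
    --output_dir checkpoints/neox_converted/llama/7b \
    --num_output_shards 2
```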