Organize the tools directory (#1055)
* Re-organize the tools folder

Co-authored-by: Stella Biderman <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>

* Add README.md files for each subdirectory.

Signed-off-by: Dashiell Stander <[email protected]>

* Update NeoXArgs docs automatically

* Clarify the difference between HF scripts

Signed-off-by: Dashiell Stander <[email protected]>

* Update NeoXArgs docs automatically

* Fix tools paths

* Update NeoXArgs docs automatically

* flesh out ckpts README

* Update NeoXArgs docs automatically

* Fix tools paths for megatron imports

* Update NeoXArgs docs automatically

* Delete tools/ckpts/merge_mp_partitions.py since it's based on a very old Megatron

* Update NeoXArgs docs automatically

* Add blurb to bash tools README

* Update NeoXArgs docs automatically

* Flesh out datasets README

* Update NeoXArgs docs automatically

* formatting

* Update NeoXArgs docs automatically

---------

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
4 people authored Oct 2, 2023
1 parent 7a8569f commit 3f43f07
Showing 25 changed files with 552 additions and 504 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -286,7 +286,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):
(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

-You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:
+You can now pretokenize your data using `tools/datasets/preprocess_data.py`, the arguments for which are detailed below:

```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
@@ -327,7 +327,7 @@ runtime:
For example:

```bash
-python tools/preprocess_data.py \
+python tools/datasets/preprocess_data.py \
--input ./data/mydataset.jsonl.zst \
--output-prefix ./data/mydataset \
--vocab ./data/gpt2-vocab.json \
@@ -431,19 +431,19 @@ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints

To convert a NeoX checkpoint (with pipeline-parallel-size>=1) to Hugging Face-loadable format, run:
```bash
-python ./tools/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
+python ./tools/ckpts/convert_module_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```

To convert a sequential model to Hugging Face format, run:
```bash
-python ./tools/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
+python ./tools/ckpts/convert_sequential_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yaml --output_dir hf_model/save/location
```
(Note: this script should be used for v2.0 checkpoints saved on a v2.0 commit prior to https://github.com/EleutherAI/gpt-neox/pull/866 and which used `pipe-parallel-size=1`. Using `pipe-parallel-size=0` will also save models in this format.)

Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
-python ./tools/upload.py
+python ./tools/ckpts/upload.py
```
and input the requested information, including HF hub user token.

2 changes: 1 addition & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

-Default = fd35b00
+Default = a0cf0e8

current git hash of repository

2 changes: 1 addition & 1 deletion prepare_data.py
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-from tools.corpora import prepare_dataset, DATA_DOWNLOADERS
+from tools.datasets.corpora import prepare_dataset, DATA_DOWNLOADERS
import argparse

TOKENIZER_CHOICES = [
15 changes: 15 additions & 0 deletions tools/README.md
@@ -0,0 +1,15 @@
# GPT-NeoX Auxiliary Tools

This directory contains a number of auxiliary tools that are useful for working with GPT-NeoX but not part of the main training code.

## Bash

This directory contains some simple, frequently used bash commands to make working on multiple machines easier.

## Checkpoints

This directory contains tools for manipulating and converting checkpoints, including changing the parallelism settings of a pretrained model, converting between GPT-NeoX and the transformers library, and updating checkpoints trained with Version 1.x of this library to be compatible with Version 2.x.

## Datasets

This directory contains tools for downloading and preprocessing datasets to the format expected by the GPT-NeoX library.
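
For instance, a pretokenization run with the relocated script might look like the following sketch (all paths are placeholders; the flags mirror the `preprocess_data.py` usage string shown in the main README above):

```bash
# Hedged sketch: pretokenize a local JSONL dataset with the relocated script.
# Every path below is a placeholder.
python tools/datasets/preprocess_data.py \
    --input ./data/mydataset.jsonl.zst \
    --output-prefix ./data/mydataset \
    --vocab-file ./data/gpt2-vocab.json \
    --merge-file ./data/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod
```
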
8 changes: 8 additions & 0 deletions tools/bash/README.md
@@ -0,0 +1,8 @@
# Bash Scripts
Useful for running distributed per-node scripts on e.g. Kubernetes. Illustrative invocations are sketched after the list below.

* `kill.sh` kills all python processes
* `killall.sh` uses pdsh to kill all `train.py` processes on the nodes listed in `/job/hosts/`
* `sync_cmd.sh` uses pdsh to run a command on all the nodes listed in `/job/hosts/`
* `sync.sh` uses pdcp to copy every file in a provided path to all of the nodes listed in `/job/hosts/`
* `syncdir.sh` uses pdcp to copy every file in a provided path to all of the nodes listed in `/job/hosts/`
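
A few illustrative invocations (arguments are hypothetical; the pdsh/pdcp-based scripts read the node list from `/job/hosts/`):

```bash
# Kill all train.py processes on every node listed in /job/hosts/
./tools/bash/killall.sh

# Run a command on every node (the command string here is just an example)
./tools/bash/sync_cmd.sh "nvidia-smi"

# Copy local files out to the same location on every node (placeholder paths)
./tools/bash/sync.sh ./configs/my_config.yml
./tools/bash/syncdir.sh ./checkpoints/global_step1000
```
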
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion tools/syncdir.sh → tools/bash/syncdir.sh
@@ -16,7 +16,7 @@

# Push files to all nodes
# Usage
-# sync.sh file [file2..]
+# syncdir.sh file [file2..]

echo Number of files to upload: $#

133 changes: 133 additions & 0 deletions tools/ckpts/README.md
@@ -0,0 +1,133 @@
# Checkpoint Scripts


## Utilities

### `inspect_checkpoints.py`
Reports information about a saved checkpoint.
```
usage: inspect_checkpoints.py [-h] [--attributes [ATTRIBUTES ...]] [--interactive] [--compare] [--diff] dir
positional arguments:
dir The checkpoint dir to inspect. Must be either: a directory containing pickle binaries saved with 'torch.save' ending in .pt or .ckpt; a single path to a .pt or .ckpt file; or two comma-separated directories, in which case the script will *compare* the two checkpoints
options:
-h, --help show this help message and exit
--attributes [ATTRIBUTES ...]
Name of one or several attributes to query. To access an attribute within a nested structure, use '/' as separator.
--interactive, -i Drops into interactive shell after printing the summary.
--compare, -c If true, script will compare two directories separated by commas
--diff, -d In compare mode, only print diffs
```
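
For instance (the paths and the attribute name below are hypothetical):

```bash
# Summarize a single checkpoint directory and query a nested attribute
python tools/ckpts/inspect_checkpoints.py checkpoints/global_step1000 --attributes optimizer/param_groups

# Compare two checkpoint directories and print only their differences
python tools/ckpts/inspect_checkpoints.py checkpoints/run_a,checkpoints/run_b --compare --diff
```
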

## HuggingFace Scripts

### `convert_hf_to_sequential.py`
A script for converting publicly available Hugging Face (HF) checkpoints to NeoX format.

Note that this script requires access to the NeoX config files corresponding to the equivalent Hugging Face models.

```
Example usage: (Converts the 70M Pythia model to NeoX format)
================================================================
OMPI_COMM_WORLD_RANK=0 CUDA_VISIBLE_DEVICES=0 python tools/ckpts/convert_hf_to_sequential.py \
--hf-model-name pythia-70m-v0 \
--revision 143000 \
--output-dir checkpoints/neox_converted/pythia/70m \
--cache-dir checkpoints/HF \
--config configs/pythia/70M.yml configs/local_setup.yml \
--test
For multi-gpu support we must initialize deepspeed:
NOTE: This requires manually changing the arguments below.
================================================================
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./deepy.py tools/ckpts/convert_hf_to_sequential.py \
-d configs pythia/70M.yml local_setup.yml
```
### `convert_module_to_hf.py`
Converts a NeoX model with pipeline parallelism greater than 1 to a Hugging Face Transformers `GPTNeoXForCausalLM` model.

Note that this script does not support all NeoX features.
Please investigate carefully whether your model is compatible with the architecture supported by the `GPTNeoXForCausalLM` class in HF.

(e.g. position embeddings such as ALiBi may not be supported by Hugging Face's GPT-NeoX architecture)

```
usage: convert_module_to_hf.py [-h] [--input_dir INPUT_DIR] [--config_file CONFIG_FILE] [--output_dir OUTPUT_DIR] [--upload]
Merge MP partitions and convert to HF Model.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Path to NeoX checkpoint, e.g. /path/to/model/global_step143000
--config_file CONFIG_FILE
Path to config file for the input NeoX checkpoint.
--output_dir OUTPUT_DIR
Output dir, where to save the HF Model, tokenizer, and configs
--upload Set to true in order to upload to the HF Hub directly.
```
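
A typical invocation mirrors the one shown in the main README (all paths are placeholders):

```bash
python ./tools/ckpts/convert_module_to_hf.py \
    --input_dir /path/to/model/global_stepXXX \
    --config_file your_config.yaml \
    --output_dir hf_model/save/location
```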

### `convert_sequential_to_hf.py`
Converts a NeoX model without pipeline parallelism to a Hugging Face Transformers `GPTNeoXForCausalLM` model.

```
usage: convert_sequential_to_hf.py [-h] [--input_dir INPUT_DIR] [--config_file CONFIG_FILE] [--output_dir OUTPUT_DIR] [--upload]
Merge MP partitions and convert to HF Model.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Path to NeoX checkpoint, e.g. /path/to/model/global_step143000
--config_file CONFIG_FILE
Path to config file for the input NeoX checkpoint.
--output_dir OUTPUT_DIR
Output dir, where to save the HF Model, tokenizer, and configs
--upload Set to true in order to upload to the HF Hub directly.
```
### `upload.py`
Uploads a _converted_ checkpoint to the Hugging Face Hub.

```
python upload.py <converted-ckpt-dir> <repo-name> <branch-name>
```
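
For example, after converting a checkpoint as above (the repository and branch names here are hypothetical, and `huggingface-cli login` must be run first):

```bash
huggingface-cli login
python tools/ckpts/upload.py hf_model/save/location your-hf-username/your-model-name main
```
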
## NeoX-20B Scripts

### `merge20b.py`
Reduces the model parallelism and pipeline parallelism of a 20B checkpoint to 1 each (i.e. MP=1, PP=1).

```
usage: merge20b.py [-h] [--input_dir INPUT_DIR] [--output_dir OUTPUT_DIR]
Merge 20B checkpoint.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Checkpoint dir, which should contain e.g. a folder named "global_step150000"
--output_dir OUTPUT_DIR
Output dir, to save the 1-GPU weights and configs
```
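
A sketch of a typical run (both paths are placeholders):

```bash
python tools/ckpts/merge20b.py \
    --input_dir ./20B_checkpoints \
    --output_dir ./20B_checkpoints_merged
```
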
## Llama Scripts

### `convert_raw_llama_weights_to_neox.py`
Takes a Llama checkpoint and puts it into a NeoX-compatible format.

```
usage: convert_raw_llama_weights_to_neox.py [-h] [--input_dir INPUT_DIR] [--model_size {7B,13B,30B,65B,tokenizer_only}] [--output_dir OUTPUT_DIR] [--num_output_shards NUM_OUTPUT_SHARDS] [--pipeline_parallel]
Convert raw LLaMA checkpoints to GPT-NeoX format.
options:
-h, --help show this help message and exit
--input_dir INPUT_DIR
Location of LLaMA weights, which contains tokenizer.model and model folders
--model_size {7B,13B,30B,65B,tokenizer_only}
--output_dir OUTPUT_DIR
Location to write GPT-NeoX model
--num_output_shards NUM_OUTPUT_SHARDS
--pipeline_parallel Only use if PP>1
```
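
For example, converting the 7B weights might look like this (the paths and shard count are illustrative):

```bash
python tools/ckpts/convert_raw_llama_weights_to_neox.py \
    --input_dir /path/to/llama/weights \
    --model_size 7B \
    --output_dir checkpoints/neox_converted/llama/7b \
    --num_output_shards 2
```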