# Data Scripts

* `preprocess_data.py` takes a raw dataset, splits it up, tokenizes it, and saves it as numpy files that can be memmapped and used efficiently by the training code.
* `preprocess_data_with_mask.py` does the same, but also creates `label` tensors if the dataset has labels.
* `multinode_prepare_data.sh` does the same, but distributes the work over multiple nodes.
* `corpora.py` has information for common datasets.
## `preprocess_data.py`

Takes a raw dataset, splits it up, tokenizes it, and saves it as numpy files that can be memmapped and used efficiently by the training code.
```
usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS]
                          --tokenizer-type
                          {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer,TiktokenTokenizer,SPMTokenizer}
                          [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix
                          OUTPUT_PREFIX [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS]
                          [--log-interval LOG_INTERVAL]

options:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma
                        separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        space separated list of keys to extract from jsonl. Default: text
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer,TiktokenTokenizer,SPMTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates
```
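For illustration, a single-node run might look like the sketch below. All file paths, the tokenizer choice, and the worker count are placeholder assumptions rather than values shipped with this repository; every flag used is taken from the help text above.

```
# All paths below are illustrative placeholders; point them at your own data.
python preprocess_data.py \
    --input ./data/mydataset.jsonl \
    --jsonl-keys text \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file ./data/gpt2-vocab.json \
    --merge-file ./data/gpt2-merges.txt \
    --append-eod \
    --output-prefix ./data/mydataset \
    --dataset-impl mmap \
    --workers 8
```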
## `preprocess_data_with_mask.py`

Does the same as `preprocess_data.py`, but also creates `label` tensors if the dataset has labels.
```
usage: preprocess_data_with_mask.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]]
                                    [--mask-before-token MASK_BEFORE_TOKEN] [--num-docs NUM_DOCS] --tokenizer-type
                                    {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                                    [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy]
                                    --output-prefix OUTPUT_PREFIX [--dataset-impl {lazy,cached,mmap}]
                                    [--workers WORKERS] [--log-interval LOG_INTERVAL]

options:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma
                        separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        space separated list of keys to extract from jsonl. Default: text
  --mask-before-token MASK_BEFORE_TOKEN
                        apply loss masks before certain token(s). If multi-token pattern, separate by commas without
                        space, e.g. --mask-before-token 0,1,1270 to use the token pattern [0,1,1270].
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates
```
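An illustrative invocation is sketched below; the paths are placeholder assumptions, and the token-id pattern passed to `--mask-before-token` is simply the example pattern from the help text above.

```
# Paths and the token-id pattern are illustrative placeholders.
python preprocess_data_with_mask.py \
    --input ./data/mydataset.jsonl \
    --jsonl-keys text \
    --mask-before-token 0,1,1270 \
    --tokenizer-type HFTokenizer \
    --vocab-file ./data/tokenizer.json \
    --append-eod \
    --output-prefix ./data/mydataset_masked \
    --workers 8
```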
## `multinode_prepare_data.sh`

Does the same as `preprocess_data.py`, but distributes the work over multiple nodes.
```
# USAGE:
# This script allows you to prepare your dataset using multiple nodes by chunking the individual files and
# distributing the chunks over the processes.
# This bash script takes a single text file as its input argument.
# The text file contains a valid filepath on each line, each leading to a jsonl file.
# Furthermore, environment variables for the rank and the world size need to be set.
# These default to the SLURM and OMPI variables in this order of priority, but they can also be set manually
# using the variables $RANK and $WORLD_SIZE, which will overwrite the cluster-specific variables.
# You can also add all arguments of the prepare_data.py script to this script and it will simply pass them through.
```
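A sketch of how the script might be driven is shown below. The file list contents and the manual `RANK`/`WORLD_SIZE` values are placeholder assumptions; on a SLURM or OMPI cluster the script reads those variables from the scheduler instead, as described above.

```
# Placeholder file list: one jsonl filepath per line.
cat > file_list.txt <<EOF
/data/shard_00.jsonl
/data/shard_01.jsonl
EOF

# Run once per node. RANK and WORLD_SIZE are set manually here for illustration;
# under SLURM or OMPI the cluster-provided variables are used instead.
# Any additional prepare_data.py arguments can be appended and are passed through.
RANK=0 WORLD_SIZE=2 bash multinode_prepare_data.sh file_list.txt
```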
## `corpora.py`

Has information for common datasets. It is primarily meant for use in the top-level `prepare_data.py` script.