Add back some codes about hetero training and recomputation (#141)
Co-authored-by: zhaoyinglia <[email protected]>
1 parent d05ca66 · commit 3540eb4
Showing 13 changed files with 431 additions and 108 deletions.
New config file (+26 lines):
```yaml
defaults:
  - train: demo_hetero
  - _self_

experiment:
  exp_name: aquila2
  exp_dir: ./outputs
  task:
    type: train
    backend: megatron
    entrypoint: ./flagscale/train/train_aquila.py
  runner:
    hostfile: xxxx # Please replace with your actual hostfile path
    rdzv_backend: "static" # hetero training only supports static
    envs:
      CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7
      CUDA_DEVICE_MAX_CONNECTIONS: 1
    cmds:
      before_start: "ulimit -n 1048576"
      after_stop: ""

action: run

hydra:
  run:
    dir: ${experiment.exp_dir}/hydra
```
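The `dir: ${experiment.exp_dir}/hydra` entry uses Hydra/OmegaConf-style interpolation: the reference is resolved against the composed config, so the Hydra run directory lands under the experiment's `exp_dir`. A toy sketch of how such a reference resolves, using plain Python string substitution as a stand-in for the real OmegaConf resolver:

```python
import re

def resolve(template: str, config: dict) -> str:
    """Resolve ${a.b.c} references against a nested dict.

    Illustration only; OmegaConf's real resolver also handles nesting,
    defaults, and custom resolvers.
    """
    def lookup(match: re.Match) -> str:
        node = config
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, template)

config = {"experiment": {"exp_dir": "./outputs"}}
print(resolve("${experiment.exp_dir}/hydra", config))  # -> ./outputs/hydra
```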
New config file (+80 lines):
```yaml
system:
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 4
  disable_bias_linear: True
  use_flash_attn: True
  sequence_parallel: True
  use_distributed_optimizer: True
  use_mcore_models: true
  transformer_impl: transformer_engine
  hetero:
    hetero_mode: "pp"
    hetero_pipeline_stages: [4, 2, 2, 4, 4]
  recompute:
    recompute_granularity: "full"
    recompute_method: "uniform"
    recompute_num_layers: 1
    recompute_granularity_per_stage: [1, 0, 2, 1, 1, 1]
    recompute_method_per_stage: [1, 0, 2, 0, 1, 1]
    recompute_num_layers_per_stage: [1, 2, 2, 1, 1, 2]
  precision:
    bf16: True
    attention_softmax_in_fp32: True
    accumulate_allreduce_grads_in_fp32: True
  logging:
    log_interval: 1
    log_throughput: true
    tensorboard_log_interval: 1
    wandb_project: "aquila2"
    wandb_exp_name: "test"
  checkpoint:
    save_interval: 1000

model:
  num_layers: 12
  hidden_size: 4096
  num_attention_heads: 32
  seq_length: 2048
  max_position_embeddings: 2048
  norm_epsilon: 1e-5
  use_rotary_position_embeddings: true
  no_position_embedding: true
  swiglu: true
  multiple_of: 256
  normalization: RMSNorm
  rotary_interleaved_patch: true
  untie_embeddings_and_output_weights: true
  init_method_std: 0.0165
  attention_dropout: 0.0
  hidden_dropout: 0.0
  weight_decay: 0.1
  clip_grad: 1.0
  train_samples: 100000
  global_batch_size: 32
  micro_batch_size: 1
  # rampup_batch_size: [32, 32, 2000000]
  seed: 42

  optimizer:
    lr: 2e-4
    weight_decay: 0.01
    adam_beta1: 0.9
    adam_beta2: 0.95
    lr_scheduler:
      lr: 1.5e-4
      min_lr: 1.5e-5
      lr_warmup_samples: 500
      lr_decay_style: cosine

data:
  data_path: xxxx # Please replace with your actual data path
  split: 1
  tokenizer:
    tokenizer_type: xxxx # Please replace with your actual tokenizer type
    tokenizer_path: xxxx # Please replace with your actual tokenizer path
    vocab_file: null
    merge_file: null
    special_tokens_file: null
    vocab_size: xxxx # Please replace with your actual vocab size
    make_vocab_size_divisible_by: 64
```
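The numbers in `hetero_pipeline_stages: [4, 2, 2, 4, 4]` line up with the rest of the config if the first element is the number of pipeline stages and the remaining elements are the layer count per stage (4 stages holding 2 + 2 + 4 + 4 = 12 layers, matching `num_layers: 12` and `pipeline_model_parallel_size: 4`). That interpretation is inferred from the values, not taken from documentation; a sanity check under that assumption:

```python
def check_hetero_pipeline_stages(stages, num_layers, pp_size):
    """Validate a [num_stages, layers_stage0, layers_stage1, ...] spec.

    The layout is an assumption inferred from the config values above;
    verify against the FlagScale source before relying on it.
    """
    n_stages, layers_per_stage = stages[0], stages[1:]
    assert n_stages == pp_size, "stage count must equal pipeline parallel size"
    assert len(layers_per_stage) == n_stages, "one layer count per stage"
    assert sum(layers_per_stage) == num_layers, "stage layers must cover the model"
    return layers_per_stage

# Values from the config above: 4 stages holding 2, 2, 4, and 4 layers.
print(check_hetero_pipeline_stages([4, 2, 2, 4, 4], num_layers=12, pp_size=4))
```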
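The `*_per_stage` lists look run-length encoded as (count, value) pairs: read that way, `recompute_granularity_per_stage: [1, 0, 2, 1, 1, 1]` means one stage with value 0, two stages with value 1, and one stage with value 1, i.e. exactly four values for the four pipeline stages. Again this pairing is inferred from the numbers, not from documentation; a decoding sketch:

```python
def expand_per_stage(spec, pp_size):
    """Expand [count0, value0, count1, value1, ...] into one value per stage.

    The (count, value) pairing is an inference from the example lists;
    check the FlagScale source before relying on it.
    """
    assert len(spec) % 2 == 0, "spec must be a flat list of (count, value) pairs"
    values = []
    for count, value in zip(spec[0::2], spec[1::2]):
        values.extend([value] * count)
    assert len(values) == pp_size, "pairs must cover every pipeline stage"
    return values

print(expand_per_stage([1, 0, 2, 1, 1, 1], pp_size=4))  # granularity per stage
print(expand_per_stage([1, 2, 2, 1, 1, 2], pp_size=4))  # recompute layers per stage
```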
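The `lr_scheduler` block asks for linear warmup over 500 samples followed by cosine decay from `lr: 1.5e-4` down to `min_lr: 1.5e-5`. A generic sketch of that shape, keyed by samples consumed and decaying over the full `train_samples` horizon (this mirrors the common Megatron-style schedule, not FlagScale's exact implementation):

```python
import math

def lr_at(sample, max_lr=1.5e-4, min_lr=1.5e-5,
          warmup_samples=500, decay_samples=100_000):
    """Linear warmup, then cosine decay from max_lr to min_lr."""
    if sample < warmup_samples:
        return max_lr * sample / warmup_samples
    progress = min(1.0, (sample - warmup_samples) / (decay_samples - warmup_samples))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))        # start of warmup: 0.0
print(lr_at(500))      # end of warmup: peak lr 1.5e-4
print(lr_at(100_000))  # end of decay: floor 1.5e-5
```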
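`make_vocab_size_divisible_by: 64` pads the tokenizer vocabulary so the embedding table shards evenly across tensor-parallel ranks. Megatron-style padding rounds the vocab up to a multiple of this value times the tensor parallel size (2 here, so a multiple of 128); a sketch under that assumption:

```python
def padded_vocab_size(vocab_size, divisible_by=64, tp_size=2):
    """Round vocab_size up to a multiple of divisible_by * tp_size.

    Mirrors Megatron-LM's vocab padding; the exact rule can vary by
    version, so verify against the Megatron code you are running.
    """
    multiple = divisible_by * tp_size
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(padded_vocab_size(100008))  # rounds up to the next multiple of 128
```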