MPI run error #729

Open

wzzanthony opened this issue Aug 8, 2024 · 0 comments
I tried to run the program, but I encountered the following error.

+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | ../fineweb10B/fineweb_train_*.bin |
| val data pattern | ../fineweb10B/fineweb_val_*.bin |
| output log dir | log124M |
| checkpoint_every | 5000 |
| resume | 0 |
| micro batch size B | 64 |
| sequence length T | 1024 |
| total batch size | 524288 |
| LR scheduler | cosine |
| learning rate (LR) | 6.000000e-04 |
| warmup iterations | 700 |
| final LR fraction | 0.000000e+00 |
| weight decay | 1.000000e-01 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 250 |
| val_max_steps | 20 |
| sample_every | 20000 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA A100-SXM4-80GB |
| peak TFlops | 312.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
| weight init method | d12 |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
| train_num_batches | 19560 |
| val_num_batches | 20 |
+-----------------------+----------------------------------------------------+
| run hellaswag | no |
+-----------------------+----------------------------------------------------+
| num_processes | 8 |
| zero_stage | 1 |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run python dev/data/hellaswag.py to export and use it with -h 1.
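(A side note on the HellaSwag message above: a minimal sketch of the two steps it asks for, assuming the default llm.c layout; the binary name and mpirun invocation below are placeholders for my actual command line.)

```bash
# Export the HellaSwag eval set, then enable the eval with -h 1, as the log suggests.
# (train_gpt2cu and -np 8 stand in for my actual command; other flags omitted.)
python dev/data/hellaswag.py
mpirun -np 8 ./train_gpt2cu -h 1
```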
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
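(For what it is worth, the bookkeeping above checks out; this is just my own arithmetic on the printed numbers, not taken from the llm.c source.)

```bash
# 124475904 BF16 parameters at 2 bytes each, and tokens per micro-step across all ranks:
echo $(( 124475904 * 2 ))              # 248951808 bytes, i.e. the 237 MiB reported below
echo $(( 524288 / (64 * 1024 * 8) ))   # total_batch_size / (B * T * num_processes) = 1
```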

WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it

allocating 237 MiB for parameter gradients

WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it


WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
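(The warning repeats because each of the 8 ranks prints it. Per the message itself, the file is written by the Python script; a minimal sketch, assuming it is run from the repo root.)

```bash
# Writes gpt2_tokenizer.bin (alongside the reference checkpoint files), as the warning says.
python train_gpt2.py
```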

allocating 21216 MiB for activations
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
device memory usage: 23273 MiB / 81050 MiB
memory per sequence: 331 MiB
-> estimated maximum batch size: 238
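(Again just my reading of the memory summary, to show where the estimate of 238 plausibly comes from; the actual formula in llm.c may differ. The 59 MiB optimizer and master-weight buffers also look like the full fp32 state split across the 8 ranks, consistent with zero_stage=1.)

```bash
# Activations scale with batch size; the rest is roughly fixed, so:
echo $(( 21216 / 64 ))                       # ~331 MiB of activations per sequence
echo $(( (81050 - (23273 - 21216)) / 331 ))  # (total - non-activation memory) / per-sequence ~= 238
```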
val loss 11.009205
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.

The program runs on my school's server, and because I don't have sudo privileges, I can only run it inside the container the school provides. The CUDA version is 12.3, the cuDNN version is 8.9.7, and cudnn-frontend is installed in my home directory (~/).
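For reference, here is how I understand the build side, in case it helps with reproduction. This is a sketch assuming the stock llm.c Makefile, where the cuDNN attention path is opt-in and the build picks up cudnn-frontend headers from the home directory; the error coming from llmc/cudnn_att.cpp means the cuDNN path was compiled in.

```bash
# Build used here (cuDNN attention enabled, cudnn-frontend cloned into ~/ as described above):
make train_gpt2cu USE_CUDNN=1
# A plain build leaves cuDNN attention out and falls back to the non-cuDNN kernels,
# which could help isolate whether the failure is specific to cuDNN 8.9.7 / cudnn-frontend:
make train_gpt2cu
```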
