I tried to run the program, but I encountered the following error.
+-----------------------+----------------------------------------------------+
| Parameter | Value |
+-----------------------+----------------------------------------------------+
| train data pattern | ../fineweb10B/fineweb_train_.bin |
| val data pattern | ../fineweb10B/fineweb_val_.bin |
| output log dir | log124M |
| checkpoint_every | 5000 |
| resume | 0 |
| micro batch size B | 64 |
| sequence length T | 1024 |
| total batch size | 524288 |
| LR scheduler | cosine |
| learning rate (LR) | 6.000000e-04 |
| warmup iterations | 700 |
| final LR fraction | 0.000000e+00 |
| weight decay | 1.000000e-01 |
| skip update lossz | 0.000000 |
| skip update gradz | 0.000000 |
| max_steps | -1 |
| val_loss_every | 250 |
| val_max_steps | 20 |
| sample_every | 20000 |
| genT | 64 |
| overfit_single_batch | 0 |
| use_master_weights | enabled |
| gelu_fusion | 0 |
| recompute | 1 |
+-----------------------+----------------------------------------------------+
| device | NVIDIA A100-SXM4-80GB |
| peak TFlops | 312.0 |
| precision | BF16 |
+-----------------------+----------------------------------------------------+
| weight init method | d12 |
| max_sequence_length T | 1024 |
| vocab_size V | 50257 |
| padded_vocab_size Vp | 50304 |
| num_layers L | 12 |
| num_heads NH | 12 |
| channels C | 768 |
| num_parameters | 124475904 |
+-----------------------+----------------------------------------------------+
| train_num_batches | 19560 |
| val_num_batches | 20 |
+-----------------------+----------------------------------------------------+
| run hellaswag | no |
+-----------------------+----------------------------------------------------+
| num_processes | 8 |
| zero_stage | 1 |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run python dev/data/hellaswag.py to export and use it with -h 1.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=64 * seq_len T=1024 * num_processes=8 and total_batch_size=524288
=> setting grad_accum_steps=1
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
allocating 237 MiB for parameter gradients
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin
The Tokenizer is a new feature added April 14 2024.
Re-run python train_gpt2.py to write it
allocating 21216 MiB for activations
allocating 59 MiB for AdamW optimizer state m
allocating 59 MiB for AdamW optimizer state v
allocating 59 MiB for master copy of params
device memory usage: 23273 MiB / 81050 MiB
memory per sequence: 331 MiB
-> estimated maximum batch size: 238
val loss 11.009205
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
[CUDNN ERROR] at file llmc/cudnn_att.cpp:205:
[cudnn_frontend] Error: No execution plans support the graph.
My program runs on the school's server; because I don't have sudo privileges, I can only run it inside the container the school provides. The CUDA version is 12.3, the cuDNN version is 8.9.7, and cuDNN-frontend is installed in my home directory (~/).
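For reference, a minimal sketch of how the toolchain can be checked from inside the container and how the cuDNN attention build is invoked in llm.c; the cudnn_version.h path below is an assumption and may sit elsewhere in this container:

# CUDA toolkit version visible inside the container (expected: 12.3)
nvcc --version
# cuDNN version; this header path is a guess and may differ in the container
grep -A 2 "#define CUDNN_MAJOR" /usr/include/cudnn_version.h
# rebuild with cuDNN attention enabled; the Makefile finds cudnn-frontend under ~/ as described above
make clean && make train_gpt2cu USE_CUDNN=1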