[cudnn_frontend] Error: No execution plans support the graph. #761

Open
Necktwi opened this issue Sep 19, 2024 · 1 comment

Necktwi commented Sep 19, 2024

necktwi@CheapFellow:~/workspace/llm.c$ make train_gpt2cu USE_CUDNN=1 CUDNN_FRONTEND_PATH="/home/necktwi/workspace/cudnn-frontend/include"

necktwi@CheapFellow:~/workspace/llm.c$ ./train_gpt2cu 
Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 2060                            |
| peak TFlops           | -1.0                                               |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3652 MiB / 5740 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 10
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.
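
For context, llmc/cudnn_att.cpp builds the fused flash-attention kernel through the cudnn-frontend graph API, and "No execution plans support the graph" is the message cudnn-frontend returns when none of the candidate engines can execute the requested graph on the current device/precision combination. Below is a condensed sketch of that build sequence, not a verbatim copy of cudnn_att.cpp; the tensor layout, the CHECK_FE helper, and the marked failing call are assumptions based on the cudnn-frontend 1.x graph API.

// Sketch of the flash-attention graph finalization (assumed simplification of llm.c's code).
#include <cstdio>
#include <cstdlib>
#include <cudnn_frontend.h>
namespace fe = cudnn_frontend;

// Error helper in the spirit of llm.c's checkCudnnFE (assumed, not verbatim).
static void check_fe(fe::error_object e, const char* file, int line) {
    if (!e.is_good()) {
        printf("[CUDNN ERROR] at file %s:%d:\n%s\n", file, line, e.err_msg.c_str());
        exit(EXIT_FAILURE);
    }
}
#define CHECK_FE(err) check_fe(err, __FILE__, __LINE__)

void build_sdpa_graph(cudnnHandle_t handle, int B, int H, int T, int HS) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::BFLOAT16)         // BF16 I/O, as in the log above
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    // Contiguous (B, H, T, HS) layout for simplicity; llm.c uses a packed-QKV layout.
    auto Q = graph.tensor(fe::graph::Tensor_attributes().set_name("Q")
                 .set_dim({B, H, T, HS}).set_stride({H * T * HS, T * HS, HS, 1}));
    auto K = graph.tensor(fe::graph::Tensor_attributes().set_name("K")
                 .set_dim({B, H, T, HS}).set_stride({H * T * HS, T * HS, HS, 1}));
    auto V = graph.tensor(fe::graph::Tensor_attributes().set_name("V")
                 .set_dim({B, H, T, HS}).set_stride({H * T * HS, T * HS, HS, 1}));

    auto sdpa_options = fe::graph::SDPA_attributes().set_name("flash_attention")
                            .set_is_inference(false)
                            .set_causal_mask(true);
    auto [O, stats] = graph.sdpa(Q, K, V, sdpa_options);
    O->set_output(true);
    stats->set_output(true).set_data_type(fe::DataType_t::FLOAT);

    CHECK_FE(graph.validate());
    CHECK_FE(graph.build_operation_graph(handle));
    CHECK_FE(graph.create_execution_plans({fe::HeurMode_t::A}));
    CHECK_FE(graph.check_support(handle));   // typically the call that reports
                                             // "No execution plans support the graph"
    CHECK_FE(graph.build_plans(handle));
}
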
alphapibeta commented Sep 22, 2024

 ~/e/llm.c  master ▓▒░ make train_gpt2cu USE_CUDNN=1                                         ░▒▓ ✔  32s  base Py  tesla@tesla  1 task  11:40:48 AM 
---------------------------------------------
✓ cuDNN found, will run with flash-attention
✓ OpenMP found
✓ NCCL found, OK to train with multiple GPUs
✓ MPI enabled
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -c --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 llmc/cudnn_att.cpp -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -o build/cudnn_att.o
/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 -DENABLE_CUDNN -DMULTI_GPU -DUSE_MPI -DENABLE_BF16 train_gpt2.cu build/cudnn_att.o -lcublas -lcublasLt -lnvidia-ml -lcudnn -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/home/tesla/cudnn-frontend/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/ -lnccl -lmpi -o train_gpt2cu

 ~/e/llm.c  master ▓▒░ ./train_gpt2cu                                                        ░▒▓ ✔  34s  base Py  tesla@tesla  1 task  11:41:23 AM 
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 4096                                               |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 2060                            |
| peak TFlops           | -1.0                                               |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 1                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=1 and total_batch_size=4096
=> setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 3566 MiB / 5919 MiB
memory per sequence: 331 MiB
 -> estimated maximum batch size: 11
[CUDNN ERROR] at file llmc/cudnn_att.cpp:120:
[cudnn_frontend] Error: No execution plans support the graph.

./train_gpt2cu gives the error above.

The Python script runs fine:

 ~/e/llm.c  master ▓▒░ python3 train_gpt2.py                                                     ░▒▓ 1 ✘  base Py  tesla@tesla  1 task  11:41:27 AM 
/home/tesla/exp/llm.c/train_gpt2.py:34: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
Running pytorch 2.4.0+cu121
using device: cuda
total desired batch size: 256
=> calculated gradient accumulation steps: 1
wrote gpt2_tokenizer.bin
loading weights from pretrained gpt: gpt2
DataLoader: total number of tokens: 32,768 across 1 files
padded vocab size from 50257 to 50304
wrote gpt2_124M.bin
padded vocab size from 50257 to 50304
wrote gpt2_124M_bf16.bin
padded vocab size in reference grads from 50257 to 50304
wrote gpt2_124M_debug_state.bin
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
using regular AdamW
step    1/10 | train loss 5.270009 | norm 30.5000 | lr 1.00e-04 | (135.69 ms | 1887 tok/s)
step    2/10 | train loss 4.060703 | norm 17.0772 | lr 1.00e-04 | (104.22 ms | 2456 tok/s)
step    3/10 | train loss 3.320115 | norm 14.7840 | lr 1.00e-04 | (96.73 ms | 2647 tok/s)
step    4/10 | train loss 2.717573 | norm 13.1957 | lr 1.00e-04 | (100.47 ms | 2548 tok/s)
step    5/10 | train loss 2.181084 | norm 12.3892 | lr 1.00e-04 | (103.23 ms | 2480 tok/s)
step    6/10 | train loss 1.653934 | norm 10.6317 | lr 1.00e-04 | (97.49 ms | 2626 tok/s)
step    7/10 | train loss 1.168067 | norm 9.7828 | lr 1.00e-04 | (98.37 ms | 2602 tok/s)
step    8/10 | train loss 0.736853 | norm 8.1185 | lr 1.00e-04 | (100.69 ms | 2543 tok/s)
step    9/10 | train loss 0.400987 | norm 6.2682 | lr 1.00e-04 | (104.10 ms | 2459 tok/s)
step   10/10 | train loss 0.187464 | norm 3.6643 | lr 1.00e-04 | (97.34 ms | 2630 tok/s)
final 9 iters avg: 100.293ms
peak memory consumption: 2320 MiB

Is the error due to cudnn-frontend, or something else? The Python script is able to run the training test, while ./train_gpt2cu gives the error above.
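
One possible explanation (an assumption on my part, not confirmed in this thread): both runs use BF16 on an RTX 2060, which is a Turing GPU with compute capability 7.5, and the cuDNN fused flash-attention engines for BF16 generally require compute capability 8.0 (Ampere) or newer. If that applies here, check_support() would find no usable plan and report exactly this error, which could also explain why the PyTorch run succeeds while the cuDNN path fails. A small CUDA sketch for checking the device (the 8.0 threshold is my assumption about the cuDNN engines, not something printed by llm.c):

// Query the GPU and report whether it meets the SM 8.0 level that the BF16
// fused-attention engines are assumed to need.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        printf("failed to query device %d\n", device);
        return 1;
    }
    printf("device: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    bool ampere_or_newer = (prop.major >= 8);
    printf("SM >= 8.0 (assumed requirement for cuDNN BF16 flash attention): %s\n",
           ampere_or_newer ? "yes" : "no");
    // On an RTX 2060 this prints 7.5; if the device is the limitation, building
    // without USE_CUDNN (llm.c's non-cuDNN attention path) would be the workaround.
    return 0;
}
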
