MPI run with 8 GPU fails #727

Open
msharmavikram opened this issue Aug 2, 2024 · 1 comment

msharmavikram commented Aug 2, 2024

mpirun -np 8 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 32768                                              |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-80GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10951] *** Process received signal ***
[149-130-218-240:10951] Signal: Aborted (6)
[149-130-218-240:10951] Signal code:  (-6)
[149-130-218-240:10951] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe612442520]
[149-130-218-240:10951] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe6124969fc]
[149-130-218-240:10951] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe612442476]
[149-130-218-240:10951] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe6124287f3]
[149-130-218-240:10951] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe61242871b]
[149-130-218-240:10951] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe612439e96]
[149-130-218-240:10951] [ 6] ./train_gpt2cu(+0x17762)[0x55f5ea98f762]
[149-130-218-240:10951] [ 7] ./train_gpt2cu(+0xf120)[0x55f5ea987120]
[149-130-218-240:10951] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe612429d90]
[149-130-218-240:10951] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe612429e40]
[149-130-218-240:10951] [10] ./train_gpt2cu(+0x13275)[0x55f5ea98b275]
[149-130-218-240:10951] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10949] *** Process received signal ***
[149-130-218-240:10949] Signal: Aborted (6)
[149-130-218-240:10949] Signal code:  (-6)
[149-130-218-240:10949] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4969642520]
[149-130-218-240:10949] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f49696969fc]
[149-130-218-240:10949] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4969642476]
[149-130-218-240:10949] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f49696287f3]
[149-130-218-240:10949] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f496962871b]
[149-130-218-240:10949] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4969639e96]
[149-130-218-240:10949] [ 6] ./train_gpt2cu(+0x17762)[0x55756a4e6762]
[149-130-218-240:10949] [ 7] ./train_gpt2cu(+0xf120)[0x55756a4de120]
[149-130-218-240:10949] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4969629d90]
[149-130-218-240:10949] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4969629e40]
[149-130-218-240:10949] [10] ./train_gpt2cu(+0x13275)[0x55756a4e2275]
[149-130-218-240:10949] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10947] *** Process received signal ***
[149-130-218-240:10947] Signal: Aborted (6)
[149-130-218-240:10947] Signal code:  (-6)
[149-130-218-240:10947] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd0d6042520]
[149-130-218-240:10947] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd0d60969fc]
[149-130-218-240:10947] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd0d6042476]
[149-130-218-240:10947] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd0d60287f3]
[149-130-218-240:10947] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fd0d602871b]
[149-130-218-240:10947] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fd0d6039e96]
[149-130-218-240:10947] [ 6] ./train_gpt2cu(+0x17762)[0x55b68d44b762]
[149-130-218-240:10947] [ 7] ./train_gpt2cu(+0xf120)[0x55b68d443120]
[149-130-218-240:10947] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd0d6029d90]
[149-130-218-240:10947] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd0d6029e40]
[149-130-218-240:10947] [10] ./train_gpt2cu(+0x13275)[0x55b68d447275]
[149-130-218-240:10947] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10948] *** Process received signal ***
[149-130-218-240:10948] Signal: Aborted (6)
[149-130-218-240:10948] Signal code:  (-6)
[149-130-218-240:10948] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcbac242520]
[149-130-218-240:10948] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcbac2969fc]
[149-130-218-240:10948] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcbac242476]
[149-130-218-240:10948] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcbac2287f3]
[149-130-218-240:10948] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fcbac22871b]
[149-130-218-240:10948] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fcbac239e96]
[149-130-218-240:10948] [ 6] ./train_gpt2cu(+0x17762)[0x55c4774ce762]
[149-130-218-240:10948] [ 7] ./train_gpt2cu(+0xf120)[0x55c4774c6120]
[149-130-218-240:10948] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcbac229d90]
[149-130-218-240:10948] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcbac229e40]
[149-130-218-240:10948] [10] ./train_gpt2cu(+0x13275)[0x55c4774ca275]
[149-130-218-240:10948] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10950] *** Process received signal ***
[149-130-218-240:10950] Signal: Aborted (6)
[149-130-218-240:10950] Signal code:  (-6)
[149-130-218-240:10950] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7faae5a42520]
[149-130-218-240:10950] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7faae5a969fc]
[149-130-218-240:10950] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7faae5a42476]
[149-130-218-240:10950] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7faae5a287f3]
[149-130-218-240:10950] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7faae5a2871b]
[149-130-218-240:10950] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7faae5a39e96]
[149-130-218-240:10950] [ 6] ./train_gpt2cu(+0x17762)[0x562edaec8762]
[149-130-218-240:10950] [ 7] ./train_gpt2cu(+0xf120)[0x562edaec0120]
[149-130-218-240:10950] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7faae5a29d90]
[149-130-218-240:10950] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7faae5a29e40]
[149-130-218-240:10950] [10] ./train_gpt2cu(+0x13275)[0x562edaec4275]
[149-130-218-240:10950] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10945] *** Process received signal ***
[149-130-218-240:10945] Signal: Aborted (6)
[149-130-218-240:10945] Signal code:  (-6)
[149-130-218-240:10945] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe034642520]
[149-130-218-240:10945] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe0346969fc]
[149-130-218-240:10945] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe034642476]
[149-130-218-240:10945] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe0346287f3]
[149-130-218-240:10945] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe03462871b]
[149-130-218-240:10945] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe034639e96]
[149-130-218-240:10945] [ 6] ./train_gpt2cu(+0x17762)[0x561977d15762]
[149-130-218-240:10945] [ 7] ./train_gpt2cu(+0xf120)[0x561977d0d120]
[149-130-218-240:10945] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe034629d90]
[149-130-218-240:10945] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe034629e40]
[149-130-218-240:10945] [10] ./train_gpt2cu(+0x13275)[0x561977d11275]
[149-130-218-240:10945] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10946] *** Process received signal ***
[149-130-218-240:10946] Signal: Aborted (6)
[149-130-218-240:10946] Signal code:  (-6)
[149-130-218-240:10946] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4bd8842520]
[149-130-218-240:10946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f4bd88969fc]
[149-130-218-240:10946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4bd8842476]
[149-130-218-240:10946] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f4bd88287f3]
[149-130-218-240:10946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f4bd882871b]
[149-130-218-240:10946] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4bd8839e96]
[149-130-218-240:10946] [ 6] ./train_gpt2cu(+0x17762)[0x5637c07ba762]
[149-130-218-240:10946] [ 7] ./train_gpt2cu(+0xf120)[0x5637c07b2120]
[149-130-218-240:10946] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4bd8829d90]
[149-130-218-240:10946] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4bd8829e40]
[149-130-218-240:10946] [10] ./train_gpt2cu(+0x13275)[0x5637c07b6275]
[149-130-218-240:10946] *** End of error message ***
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10944] *** Process received signal ***
[149-130-218-240:10944] Signal: Aborted (6)
[149-130-218-240:10944] Signal code:  (-6)
[149-130-218-240:10944] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f21acc42520]
[149-130-218-240:10944] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f21acc969fc]
[149-130-218-240:10944] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f21acc42476]
[149-130-218-240:10944] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f21acc287f3]
[149-130-218-240:10944] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f21acc2871b]
[149-130-218-240:10944] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f21acc39e96]
[149-130-218-240:10944] [ 6] ./train_gpt2cu(+0x17762)[0x55d509142762]
[149-130-218-240:10944] [ 7] ./train_gpt2cu(+0xf120)[0x55d50913a120]
[149-130-218-240:10944] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f21acc29d90]
[149-130-218-240:10944] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f21acc29e40]
[149-130-218-240:10944] [10] ./train_gpt2cu(+0x13275)[0x55d50913e275]
[149-130-218-240:10944] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 149-130-218-240 exited on signal 6 (Aborted).

MPI runs with 4 or 6 GPUs work just fine.
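
A rough check of the arithmetic behind the failed assertion, `shard_ntok >= num_processes * B * T + 1`, may explain why 4 and 6 processes pass while 8 does not. `B` and `T` are taken from the parameter table in the log. The log does not print `shard_ntok`, so `SHARD_NTOK` below is an assumed value for the validation shard, and the helper names are made up for illustration.

```python
# Sketch of the check behind the failed assertion in llmc/dataloader.h:
#   shard_ntok >= num_processes * B * T + 1
# B and T come from the parameter table in the log. SHARD_NTOK is an
# ASSUMED token count for the validation shard, not a value from the log.

B = 4               # micro batch size (from the log)
T = 1024            # sequence length (from the log)
SHARD_NTOK = 32768  # assumption: tokens in tiny_shakespeare_val.bin


def min_shard_tokens(num_processes: int, b: int, t: int) -> int:
    """Smallest shard size (in tokens) that satisfies the assertion."""
    return num_processes * b * t + 1


def shard_is_large_enough(shard_ntok: int, num_processes: int, b: int, t: int) -> bool:
    """Mirror of the asserted condition, evaluated for one shard."""
    return shard_ntok >= min_shard_tokens(num_processes, b, t)


# Compare the three process counts mentioned in this issue.
for n in (4, 6, 8):
    needed = min_shard_tokens(n, B, T)
    ok = shard_is_large_enough(SHARD_NTOK, n, B, T)
    print(f"np={n}: needs {needed} tokens, shard has {SHARD_NTOK}, ok={ok}")
```

If that assumed shard size is right, 8 processes need 32769 tokens, one more than the shard holds, while 4 and 6 processes need 16385 and 24577. Under the same assumption, a smaller `B` or `T`, or a larger validation shard, would satisfy the check with 8 processes.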

@msharmavikram

I am running this on CUDA 12.2, without cuDNN, on Lambda Labs cloud.
