mpirun -np 8 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 32768                                              |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-80GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10951] *** Process received signal ***
[149-130-218-240:10951] Signal: Aborted (6)
[149-130-218-240:10951] Signal code: (-6)
[149-130-218-240:10951] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe612442520]
[149-130-218-240:10951] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe6124969fc]
[149-130-218-240:10951] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe612442476]
[149-130-218-240:10951] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe6124287f3]
[149-130-218-240:10951] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe61242871b]
[149-130-218-240:10951] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe612439e96]
[149-130-218-240:10951] [ 6] ./train_gpt2cu(+0x17762)[0x55f5ea98f762]
[149-130-218-240:10951] [ 7] ./train_gpt2cu(+0xf120)[0x55f5ea987120]
[149-130-218-240:10951] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe612429d90]
[149-130-218-240:10951] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe612429e40]
[149-130-218-240:10951] [10] ./train_gpt2cu(+0x13275)[0x55f5ea98b275]
[149-130-218-240:10951] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10949] *** Process received signal ***
[149-130-218-240:10949] Signal: Aborted (6)
[149-130-218-240:10949] Signal code: (-6)
[149-130-218-240:10949] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4969642520]
[149-130-218-240:10949] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f49696969fc]
[149-130-218-240:10949] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4969642476]
[149-130-218-240:10949] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f49696287f3]
[149-130-218-240:10949] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f496962871b]
[149-130-218-240:10949] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4969639e96]
[149-130-218-240:10949] [ 6] ./train_gpt2cu(+0x17762)[0x55756a4e6762]
[149-130-218-240:10949] [ 7] ./train_gpt2cu(+0xf120)[0x55756a4de120]
[149-130-218-240:10949] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4969629d90]
[149-130-218-240:10949] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4969629e40]
[149-130-218-240:10949] [10] ./train_gpt2cu(+0x13275)[0x55756a4e2275]
[149-130-218-240:10949] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10947] *** Process received signal ***
[149-130-218-240:10947] Signal: Aborted (6)
[149-130-218-240:10947] Signal code: (-6)
[149-130-218-240:10947] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd0d6042520]
[149-130-218-240:10947] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd0d60969fc]
[149-130-218-240:10947] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd0d6042476]
[149-130-218-240:10947] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd0d60287f3]
[149-130-218-240:10947] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fd0d602871b]
[149-130-218-240:10947] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fd0d6039e96]
[149-130-218-240:10947] [ 6] ./train_gpt2cu(+0x17762)[0x55b68d44b762]
[149-130-218-240:10947] [ 7] ./train_gpt2cu(+0xf120)[0x55b68d443120]
[149-130-218-240:10947] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd0d6029d90]
[149-130-218-240:10947] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd0d6029e40]
[149-130-218-240:10947] [10] ./train_gpt2cu(+0x13275)[0x55b68d447275]
[149-130-218-240:10947] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10948] *** Process received signal ***
[149-130-218-240:10948] Signal: Aborted (6)
[149-130-218-240:10948] Signal code: (-6)
[149-130-218-240:10948] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcbac242520]
[149-130-218-240:10948] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcbac2969fc]
[149-130-218-240:10948] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcbac242476]
[149-130-218-240:10948] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcbac2287f3]
[149-130-218-240:10948] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fcbac22871b]
[149-130-218-240:10948] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fcbac239e96]
[149-130-218-240:10948] [ 6] ./train_gpt2cu(+0x17762)[0x55c4774ce762]
[149-130-218-240:10948] [ 7] ./train_gpt2cu(+0xf120)[0x55c4774c6120]
[149-130-218-240:10948] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcbac229d90]
[149-130-218-240:10948] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcbac229e40]
[149-130-218-240:10948] [10] ./train_gpt2cu(+0x13275)[0x55c4774ca275]
[149-130-218-240:10948] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10950] *** Process received signal ***
[149-130-218-240:10950] Signal: Aborted (6)
[149-130-218-240:10950] Signal code: (-6)
[149-130-218-240:10950] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7faae5a42520]
[149-130-218-240:10950] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7faae5a969fc]
[149-130-218-240:10950] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7faae5a42476]
[149-130-218-240:10950] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7faae5a287f3]
[149-130-218-240:10950] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7faae5a2871b]
[149-130-218-240:10950] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7faae5a39e96]
[149-130-218-240:10950] [ 6] ./train_gpt2cu(+0x17762)[0x562edaec8762]
[149-130-218-240:10950] [ 7] ./train_gpt2cu(+0xf120)[0x562edaec0120]
[149-130-218-240:10950] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7faae5a29d90]
[149-130-218-240:10950] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7faae5a29e40]
[149-130-218-240:10950] [10] ./train_gpt2cu(+0x13275)[0x562edaec4275]
[149-130-218-240:10950] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10945] *** Process received signal ***
[149-130-218-240:10945] Signal: Aborted (6)
[149-130-218-240:10945] Signal code: (-6)
[149-130-218-240:10945] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe034642520]
[149-130-218-240:10945] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe0346969fc]
[149-130-218-240:10945] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe034642476]
[149-130-218-240:10945] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe0346287f3]
[149-130-218-240:10945] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe03462871b]
[149-130-218-240:10945] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe034639e96]
[149-130-218-240:10945] [ 6] ./train_gpt2cu(+0x17762)[0x561977d15762]
[149-130-218-240:10945] [ 7] ./train_gpt2cu(+0xf120)[0x561977d0d120]
[149-130-218-240:10945] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe034629d90]
[149-130-218-240:10945] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe034629e40]
[149-130-218-240:10945] [10] ./train_gpt2cu(+0x13275)[0x561977d11275]
[149-130-218-240:10945] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10946] *** Process received signal ***
[149-130-218-240:10946] Signal: Aborted (6)
[149-130-218-240:10946] Signal code: (-6)
[149-130-218-240:10946] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4bd8842520]
[149-130-218-240:10946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f4bd88969fc]
[149-130-218-240:10946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4bd8842476]
[149-130-218-240:10946] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f4bd88287f3]
[149-130-218-240:10946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f4bd882871b]
[149-130-218-240:10946] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4bd8839e96]
[149-130-218-240:10946] [ 6] ./train_gpt2cu(+0x17762)[0x5637c07ba762]
[149-130-218-240:10946] [ 7] ./train_gpt2cu(+0xf120)[0x5637c07b2120]
[149-130-218-240:10946] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4bd8829d90]
[149-130-218-240:10946] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4bd8829e40]
[149-130-218-240:10946] [10] ./train_gpt2cu(+0x13275)[0x5637c07b6275]
[149-130-218-240:10946] *** End of error message ***
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10944] *** Process received signal ***
[149-130-218-240:10944] Signal: Aborted (6)
[149-130-218-240:10944] Signal code: (-6)
[149-130-218-240:10944] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f21acc42520]
[149-130-218-240:10944] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f21acc969fc]
[149-130-218-240:10944] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f21acc42476]
[149-130-218-240:10944] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f21acc287f3]
[149-130-218-240:10944] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f21acc2871b]
[149-130-218-240:10944] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f21acc39e96]
[149-130-218-240:10944] [ 6] ./train_gpt2cu(+0x17762)[0x55d509142762]
[149-130-218-240:10944] [ 7] ./train_gpt2cu(+0xf120)[0x55d50913a120]
[149-130-218-240:10944] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f21acc29d90]
[149-130-218-240:10944] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f21acc29e40]
[149-130-218-240:10944] [10] ./train_gpt2cu(+0x13275)[0x55d50913e275]
[149-130-218-240:10944] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 149-130-218-240 exited on signal 6 (Aborted).
MPI runs with 4 or 6 GPUs work just fine.
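For reference, the failing assertion requires each data shard to hold at least `num_processes * B * T + 1` tokens. The sketch below just evaluates that bound for the settings in the log above (B = 4, T = 1024) at 4, 6, and 8 processes. The helper name `min_shard_tokens` and the 32,768-token shard used in the last check are hypothetical, chosen for illustration; the actual shard sizes are not shown in the log.

```python
# Evaluate the bound from the assertion in llmc/dataloader.h:186:
#   shard_ntok >= num_processes * B * T + 1
# min_shard_tokens is a hypothetical helper name, not part of llm.c.

def min_shard_tokens(num_processes: int, B: int, T: int) -> int:
    """Smallest shard size (in tokens) that satisfies the assertion."""
    return num_processes * B * T + 1

B, T = 4, 1024  # micro batch size and sequence length from the log

required = {n: min_shard_tokens(n, B, T) for n in (4, 6, 8)}
for n, need in required.items():
    print(f"{n} processes: shard needs >= {need} tokens")

# Assumption for illustration only: a shard holding exactly 32768 tokens.
hypothetical_shard_ntok = 32768
fits = {n: hypothetical_shard_ntok >= need for n, need in required.items()}
print(fits)
```

Under that assumption the 4- and 6-process runs pass the check while the 8-process run misses it by a single token, which would match the behavior reported here. Lowering B or T reduces the required shard size in the same proportion.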
I am running this on CUDA 12.2, without cuDNN, on Lambda Labs cloud.