Non-OK-status: GpuLaunchKernel error during distributed training of a large model #88

Open
bojone opened this issue Jun 3, 2023 · 0 comments

bojone commented Jun 3, 2023

I am attempting to train a model with 3 billion parameters on two A100 GPUs using nvidia-tensorflow 1.15 (21.07-tf1-py3), with a batch size of 24 and tf.distribute.MirroredStrategy.

The error message is:

2023-06-03 07:27:26.364872: F tensorflow/core/kernels/concat_lib_gpu_impl.cu.cc:161] Non-OK-status: GpuLaunchKernel( concat_variable_kernel<T, IntType, true>, config.block_count, config.thread_per_block, smem_usage, gpu_device.stream(), input_ptrs, output_scan, static_cast(output->dimension(0)), static_cast(output->dimension(1)), output->data()) status: Internal: invalid configuration argument

This seems to occur only when the model is large enough and distributed training is used: the model trains successfully on a single GPU with a batch_size of 12, and a 1.5B model trains successfully on two GPUs.
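
The comparison between the working and failing configurations can be written out as simple arithmetic. The sketch below is an illustration only, not a diagnosis. It assumes the batch size of 24 is a global batch split evenly across the two replicas, approximates the parameter counts as exactly 3e9 and 1.5e9, and uses hypothetical helper names (`per_replica_batch`, `fits_int32`). Whether any single tensor in the failing run actually holds all parameters at once (for example, a packed gradient buffer) is an assumption that nothing in this report confirms.

```python
# Illustrative arithmetic only; helper names and parameter counts are assumptions.
INT32_MAX = 2**31 - 1  # largest element count a signed 32-bit index can address


def per_replica_batch(global_batch, num_replicas):
    """Split a global batch evenly across replicas (assumed even split)."""
    if global_batch % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    return global_batch // num_replicas


def fits_int32(num_elements):
    """True if a tensor with this many elements can be indexed with int32."""
    return num_elements <= INT32_MAX


# (label, approximate parameter count, global batch, number of GPUs, outcome)
configs = [
    ("3B on 2 GPUs", 3_000_000_000, 24, 2, "fails"),
    ("3B on 1 GPU", 3_000_000_000, 12, 1, "works"),
    ("1.5B on 2 GPUs", 1_500_000_000, None, 2, "works"),  # batch size not stated
]

for label, params, batch, gpus, outcome in configs:
    per_gpu = per_replica_batch(batch, gpus) if batch is not None else "unknown"
    print(label, "| per-GPU batch:", per_gpu,
          "| params fit int32:", fits_int32(params), "|", outcome)
```

Under these assumptions, the failing run uses the same per-GPU batch as the working single-GPU run, so the per-replica workload alone does not separate them. The 3B parameter count exceeds the int32 limit while 1.5B does not, which may or may not be related to the concat kernel's launch configuration.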

I understand that using TensorFlow for training large models may not be the best option, but at present, I need to address this issue.
