A note: I don't know whether this issue should be categorized as a bug; my setup steps might be wrong as well. If it is the latter, please guide me accordingly.
Describe the bug
Concurrent execution of many instances of the same CUDA executable causes a few of them to fail randomly.
In the case of:
- onnx_dump, it fails with terminate called without an active exception; sometimes it prints ERROR sending to socket: Bad file descriptor before the terminate message.
- cudart, it fails with a plain Segmentation fault (core dumped).
Suppose I write a CUDA executable named, say, toy.cu as follows:
#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

#define BLOCK_SIZE 128

__global__ void do_something(float* d_array)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_array[idx] *= 100;
}

int main()
{
    long N = 1<<7;
    float *arr = (float*) malloc(N*sizeof(float));
    long i;
    for (i = 1; i <= N; i++)
        arr[i-1] = i;

    float *d_array;
    cudaError_t ret;
    ret = cudaMalloc(&d_array, N*sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    ret = cudaMemcpy(d_array, arr, N*sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    int num_blocks = (N+BLOCK_SIZE-1)/BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for (i = 0; i < N;)
    {
        for (j = 0; j < 8; j++)
            printf("%.0f\t", arr[i++]);
        printf("\n");
    }

    cudaFree(d_array);
    return 0;
}
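One note on the example itself: the kernel launch above is not error-checked. For completeness, the standard way to check it (using the stock CUDA runtime API) would be a variant of that part of main like this:

do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);
ret = cudaGetLastError();          // reports an invalid launch configuration
if (ret == cudaSuccess)
    ret = cudaDeviceSynchronize(); // surfaces errors raised during kernel execution
printf("Return value of kernel launch = %d\n", ret);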
And compile it as:
nvcc -o toy toy.cu --cudart shared
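A side note: --cudart shared links the CUDA runtime dynamically, which is what lets libguestlib.so be substituted for it inside the container. This can be checked with (the resolved path will depend on the environment):

$ ldd toy | grep cudart
        libcudart.so.10.1 => ...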
Then, in the Docker container set up to use the appropriate libguestlib.so, I run a script.sh that launches many instances of toy concurrently.
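A representative sketch of such a script (the instance count of 20 here is illustrative, not the exact number used):

#!/bin/bash
# Launch many concurrent instances of the toy executable and wait for
# all of them to finish; the count of 20 is arbitrary.
for i in $(seq 1 20); do
    ./toy > out_$i.log 2>&1 &
done
wait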
Many (but not all) of them fail, whether I use cudart or onnx_dump.
To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using the NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using the cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files:
Next, I forked the AvA repository.
I modified ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address.
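The change is conceptually small: the guestlib's TCP command channel connects to the host's IP rather than a loopback/default address. As a rough illustration in generic POSIX sockets (this is not the actual AvA code; the function name and parameters are made up for illustration):

// Illustrative only -- generic POSIX sockets, not AvA's actual code.
// Shows the kind of change made in cmd_channel_socket_tcp.cpp: connect
// the guestlib's command channel to the host's IP instead of loopback.
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int connect_to_manager(const char* host_ip, uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    // e.g. the host machine's LAN IP instead of "127.0.0.1"
    inet_pton(AF_INET, host_ip, &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}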
And then did the following:
$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install
Then I used a CUDA 10.1 Docker image (the one given in this repository under tools/docker, with a small modification to work around the CUDA apt key issue during apt update).
I bind-mounted my build directory into the Docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu and /usr/local/cuda-10.1/targets/x86_64-linux/lib/ inside the container, and modified the library symlinks accordingly:
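The symlink change amounts to making the CUDA runtime soname resolve to the guestlib, roughly as below (the exact filenames are from memory, and the same was done in both directories):

$ cd /usr/local/cuda-10.1/targets/x86_64-linux/lib/
$ ln -sf libguestlib.so libcudart.so.10.1
$ ldconfig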
I added the guest config in the Docker container as:
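I'm reproducing the shape of the config from memory; the field names and values below are approximate rather than authoritative, and the address/port are placeholders:

channel = "TCP";
manager_address = "<host-ip>:<manager-port>";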
Then I tried to launch the manager on the host as follows:
And on the guest, I try to run the toy CUDA program, but it fails as described earlier.
I have described the setup for onnx_dump; the setup for cudart is similar, and it produces the error described earlier all the same.
Expected behavior
I expect all the instances of the toy executable launched concurrently to run successfully.
Environment: