TIGRE can't run with pytorch on the same GPU #509
Comments
Hi! Thanks for the bug report. Hopefully when the PR is merged we also fix this issue together.
Hi! Could you please forward the error message you get when running your script with `if torch.cuda.device_count() > 1: net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])`? Those lines might cause problems. From experience, using them with libraries that interact with the GPU in an intensive way causes problems (I had that in ASTRA+ODL). In the error trace, I see that something in […]
I've added […]
@huscael thanks for the test! Indeed there could be issues with `DataParallel`. Not saying it's this 100%, as I don't know what DataParallel does internally, but it could be.
Perhaps it's related to how you pass the […]
This line […]
I have tried to set gpuids to [0] for both TIGRE and DataParallel, and the Invalid Argument error still occurs. My current remedy is to use more GPUs than I really need, e.g. device_ids [0,1,2] for DataParallel and gpuids [3] for TIGRE. Under those circumstances the error finally disappears, but TIGRE uses very little of that GPU, so colleagues sometimes run their programs on it unintentionally and trigger the same Invalid Argument error in my program. In addition, this approach requires more GPUs, and GPU resources are limited in my lab. So here I'm asking for a way to run TIGRE and pytorch DataParallel on the same GPU, thanks a lot!
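A minimal sketch of that workaround, assuming the TIGRE python API exposes a `GpuIds` helper in `tigre.utilities.gpu` (the exact way `gpuids` is constructed, and whether `Ax` accepts it as a keyword, may differ between TIGRE versions); the only point is that torch/DataParallel and TIGRE never share a device:

```python
import numpy as np
import torch
import torch.nn as nn
import tigre
from tigre.utilities.gpu import GpuIds  # assumed location of the GpuIds helper

# torch / DataParallel pinned to GPUs 0-2
torch_gpus = [0, 1, 2]
net = nn.Conv2d(1, 1, 3).cuda(torch_gpus[0])
net = nn.DataParallel(net, device_ids=torch_gpus, output_device=torch_gpus[0])

# TIGRE pinned to GPU 3 only (this construction of gpuids is an assumption)
tigre_gpuids = GpuIds()
tigre_gpuids.devices = [3]

geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 100)
img = np.ones(tuple(geo.nVoxel), dtype=np.float32)
proj = tigre.Ax(img, geo, angles, gpuids=tigre_gpuids)  # runs away from the torch GPUs
```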
That's for fdk, but I suspect the error is caused by the DataParallel inside the data loader/dataset, i.e. by the call to Ax, not the call to fdk. DataParallel's whole point is to put the data on GPUs. In any case, I don't know exactly how DataParallel parallelizes the loader (but it does put things on GPUs, which may cause the problems, as I said). As we are working on getting TIGRE a bit more pytorch compatible we may find the issue, but for now the only thing I can say is that I don't know, and it's not technically a supported feature, so technically not a bug. Hopefully I can give you a better answer at some point. I'll ping you if I find an answer.
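To make the suspected pattern concrete, here is a hypothetical sketch (class name, shapes, and data are illustrative, not from the reporter's code) of a `tigre.Ax` call living inside the dataset that a torch DataLoader consumes, so it executes while torch already holds the same GPU:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import tigre

class ProjectionDataset(Dataset):
    """Hypothetical dataset that forward-projects a volume per item."""

    def __init__(self, volumes, geo, angles):
        self.volumes = volumes  # list of float32 numpy volumes
        self.geo = geo
        self.angles = angles

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        # This is the call suspected of clashing with torch on the same GPU.
        proj = tigre.Ax(self.volumes[idx], self.geo, self.angles)
        return torch.from_numpy(proj)

geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 100)
vols = [np.ones(tuple(geo.nVoxel), dtype=np.float32) for _ in range(4)]
loader = DataLoader(ProjectionDataset(vols, geo, angles), batch_size=1)
batch = next(iter(loader))
```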
I also have a similar error when I run TIGRE and torch on the same GPU. After I use tigre.Ax() to generate projections, I can no longer push any data to the GPU with torch; it always raises a CUDA error: an illegal memory access was encountered. I think the TIGRE toolbox may change some global parameters or environment state on the GPU, causing torch to fail to talk to the GPU driver.
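For reference, a minimal sketch of the failure sequence described here, assuming a single GPU shared by both libraries; the plain torch transfer after `Ax` is where the illegal-memory-access error reportedly shows up:

```python
import numpy as np
import torch
import tigre

geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 50)
img = np.ones(tuple(geo.nVoxel), dtype=np.float32)

proj = tigre.Ax(img, geo, angles)      # TIGRE call on GPU 0
x = torch.ones(8, device="cuda:0")     # reportedly fails after the Ax call
print(x.sum())
```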
@ldy1995 still unsure what the issue is, but in theory TIGRE should create a context and destroy it every time it's called. I.e., as opposed to pytorch, which holds the GPU memory all the time, each TIGRE Ax() or Atb() call should be a new and independent call to the GPU that opens and closes the session. Clearly this is not happening, but I'm not entirely sure why. I will be working on making TIGRE torch compatible soon, so hopefully we can fix this. Any ideas welcome, of course.
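One small way to observe the asymmetry described here, using only documented torch calls: torch's caching allocator keeps memory reserved on the device across the TIGRE call, whereas each TIGRE call is supposed to open and close its own context. Whether the TIGRE side actually releases everything is exactly the open question; this sketch only checks the torch side still works afterwards:

```python
import numpy as np
import torch
import tigre

t = torch.zeros(1024, 1024, device="cuda:0")      # torch now holds memory on GPU 0
print("torch reserved:", torch.cuda.memory_reserved(0))

geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 50)
img = np.ones(tuple(geo.nVoxel), dtype=np.float32)
proj = tigre.Ax(img, geo, angles)                  # supposedly independent TIGRE call on the same GPU

print("torch reserved after Ax:", torch.cuda.memory_reserved(0))
print(t.sum())                                     # check torch can still use its tensor
```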
Expected Behavior
I am using pytorch and TIGRE together to do inverse projection, but I found that when I put pytorch and TIGRE on the same GPU, it reports an error. If I put pytorch and TIGRE on different GPUs, there is no error. Why is that, and is there a way to run TIGRE and pytorch on the same GPU? Thanks!
Actual Behavior
When TIGRE and pytorch work on the same GPU, I get the following error:
Code to reproduce the problem (If applicable)
The following code is just for reproducing the problem; it's different from my actual code but is enough to show the case.
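The original snippet was not captured in this thread; below is a hedged reconstruction of the setup described in the issue (placeholder network and shapes, not the reporter's actual code): a DataParallel-wrapped model and TIGRE calls on the same GPU, followed by the torch transfer that reportedly triggers the error.

```python
import numpy as np
import torch
import torch.nn as nn
import tigre
import tigre.algorithms as algs

gpus = [0]                                  # same GPU for torch and TIGRE
net = nn.Conv2d(1, 1, 3).cuda(gpus[0])      # placeholder network
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])

geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 100)
img = np.ones(tuple(geo.nVoxel), dtype=np.float32)

proj = tigre.Ax(img, geo, angles)           # TIGRE forward projection on GPU 0
rec = algs.fdk(proj, geo, angles)           # TIGRE FDK on GPU 0

x = torch.from_numpy(proj).cuda(gpus[0])    # torch transfer that reportedly errors
print(x.shape)
```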
Specifications