NameError: name 'is_relion_abort' is not defined #7

Open
heejongkim opened this issue Nov 14, 2023 · 2 comments

@heejongkim

Hi,

While trying to run "Estimating inverse deformations", I got the following error immediately at the "Assigning a diameter" stage of the iteration.

NameError: name 'is_relion_abort' is not defined

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[54749,1],0]
Exit code: 2

It seems the issue comes from the following line:

if is_relion_abort(output_directory) == False:

I'm not sure whether it's merely a missing definition/import or whether there's a deeper issue on my side, since the line sits inside an "except:" block.
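
For what it's worth, my guess at what such a helper is meant to do, based on RELION's abort flag file convention, would be something like the sketch below. This is only a placeholder implementation of my own, not the project's actual code:

```python
# Hypothetical sketch (not the project's actual implementation): treat the
# presence of RELION's abort flag file in the job directory as "job aborted".
from pathlib import Path


def is_relion_abort(output_directory) -> bool:
    """Return True if RELION has written its abort flag into the job directory."""
    return (Path(output_directory) / "RELION_JOB_ABORT_NOW").exists()


# The failing line could then guard the exception handler, e.g.:
# if is_relion_abort(output_directory) == False:
#     ...report the real exception instead of swallowing it...
```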

If you need any additional logs that may help, please let me know.

Thank you so much.

best,
heejong

@heejongkim
Author

In the same context, I found the following error too.

envs/relion-5.0/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 704.00 MiB (GPU 0; 10.75 GiB total capacity; 9.66 GiB already allocated; 638.44 MiB free; 9.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

However, when I hit the same out-of-memory error during the estimation step (step 1), the batch size was automatically adjusted and the job proceeded. I wonder why that didn't happen in this step?
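
For context, the kind of dynamic fallback I'm describing would look roughly like this retry-on-OOM pattern. This is only a generic sketch; run_with_adaptive_batch and step_fn are placeholder names, not DynaMight's actual code:

```python
import torch


def run_with_adaptive_batch(step_fn, batch_size, min_batch_size=1):
    """Call step_fn(batch_size), halving the batch size on CUDA OOM until it fits."""
    while True:
        try:
            return step_fn(batch_size)
        except torch.cuda.OutOfMemoryError:
            # Free cached blocks before retrying with a smaller batch.
            torch.cuda.empty_cache()
            if batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2
            print(f"CUDA OOM: retrying with batch size {batch_size}")
```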

@heejongkim
Author

Recently I got a chance to revisit this, and I was able to resolve the out-of-memory error by lowering the batch size to fit within the available VRAM. It seems that, unlike the estimating motion step, this step doesn't dynamically adjust the batch size.

After finishing that, I'm encountering a new issue with deformed backprojection, which may or may not be connected to the inverse deformations.
When I resumed the job with a backprojection batch size of 2, it got as far as "start deformable_backprojection of half 1" and then failed at the beginning of the loop.
More specifically,

tile_deformation = inverse_model(z_image.to(torch.float16), torch.stack(

It failed at this line without printing any error message.

I wonder whether this is due to a problem with inv_chkpt.pth (its size is only 485K), or whether it's a separate issue.
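
One way I could check whether inv_chkpt.pth itself is intact would be something like the snippet below. This is a generic inspection sketch; I'm assuming the file is a plain torch.save'd dict, which may not match the checkpoint's actual format:

```python
import torch

# Load the checkpoint on CPU and list its top-level keys and tensor shapes,
# just to confirm the file is readable and not truncated.
ckpt = torch.load("inv_chkpt.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value))
```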

If you need any additional information to narrow down the source of trouble, please let me know.

Thank you.

best,
heejong
