
Signal code: (-6) mpirun.relion noticed that process rank # with PID 0 on node ## exited on signal 6 #1215

Open
SomePersonSomeWhereInTheWorld opened this issue Dec 2, 2024 · 0 comments


Describe your problem

The job crashes at the end of iteration 1. Is this a dependency or version issue?

Environment:

  • OS: CentOS 7.9
  • MPI runtime: Open MPI 3.1.6
  • RELION version: 5.0-beta, commit 93049a
  • Memory: 2TB
  • GPU: RTX A6000

Job options:

  • Type of job: Class3D
  • Number of MPI processes: 9
  • Number of threads: 10
  • Full command (see note.txt in the job directory):
    `which relion_refine_mpi` --o Class3D/job115/run --i j848_stack.star --ref j848_inimodel.mrc --firstiter_cc --ini_high 12 --dont_combine_weights_via_disc --pool 300 --pad 2 --ctf --ctf_intact_first_peak --iter 100 --tau2_fudge 8 --particle_diameter 360 --blush --K 2 --flatten_solvent --zero_mask --skip_align --sym C1 --norm --scale --j 10 --pipeline_control Class3D/job115/
    

Error message:


Dec  2 11:12:56 cryoem9 relion_refine_mpi: *** Error in `/programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi': corrupted double-linked list: 0x0000000004d98e60 ***
Dec  2 11:12:56 cryoem9 abrt-hook-ccpp: Process 50081 (relion_refine_mpi) of user 1000 killed by SIGABRT - dumping core
Dec  2 11:12:56 cryoem9 abrt-hook-ccpp: Process 50087 (relion_refine_mpi) of user 1000 killed by SIGABRT - ignoring (repeated crash)
Dec  2 11:13:05 cryoem9 abrt-server: Executable '/programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/relion-5.0-beta_2023-11-12-lfbx/bin/relion_refine_mpi' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Dec  2 11:13:05 cryoem9 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2024-12-02-14:12:56-50081' exited with 1
Dec  2 11:13:05 cryoem9 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2024-12-02-14:12:56-50081'
[sn4622115580:50087] *** Process received signal ***
[sn4622115580:50087] Signal: Aborted (6)
[sn4622115580:50087] Signal code:  (-6)
[sn4622115580:50081] *** Process received signal ***
[sn4622115580:50081] Signal: Aborted (6)
[sn4622115580:50081] Signal code:  (-6)
[sn4622115580:50081] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f44a9a68630]
[sn4622115580:50081] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f44a8bdb387]
[sn4622115580:50081] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f44a8bdca78]
[sn4622115580:50081] [ 3] /lib64/libc.so.6(+0x78f67)[0x7f44a8c1df67]
[sn4622115580:50081] [ 4] /lib64/libc.so.6(+0x80c37)[0x7f44a8c25c37]
[sn4622115580:50081] [ 5] /lib64/libc.so.6(+0x82135)[0x7f44a8c27135]
[sn4622115580:50081] [ 6] [sn4622115580:50087] [ 0] /lib64/libc.so.6(+0x83056)[0x7f44a8c28056]
[sn4622115580:50081] [ 7] /lib64/libpthread.so.0(+0xf630)[0x7f186dc2c630]
[sn4622115580:50087] [ 1] /lib64/libc.so.6(__libc_memalign+0x75)[0x7f44a8c2b075]

[sn4622115580:50081] [ 8] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_malloc_plain+0x15)[0x7f44b06173f5]
[sn4622115580:50081] [ 9] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x3e480)[0x7f44b0620480]
[sn4622115580:50081] [10] /lib64/libc.so.6(gsignal+0x37)[0x7f186cd9f387]
[sn4622115580:50087] [ 2] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f44b061a294]
[sn4622115580:50081] [11] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f44b061a6cf]
[sn4622115580:50081] [12] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x444e2)[0x7f44b06264e2]
[sn4622115580:50081] [13] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f44b061a294]
[sn4622115580:50081] [14] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f44b061a6cf]
[sn4622115580:50081] [15] /lib64/libc.so.6(abort+0x148)[0x7f186cda0a78]
[sn4622115580:50087] [ 3] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x77a83)[0x7f44b0659a83]
[sn4622115580:50081] [16] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f44b061a294]
[sn4622115580:50081] [17] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f44b061a6cf]
/lib64/libc.so.6(+0x78f67)[0x7f186cde1f67]
[sn4622115580:50087] [ 4] [sn4622115580:50081] [18] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x77a15)[0x7f44b0659a15]
[sn4622115580:50081] [19] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f44b061a294]
[sn4622115580:50081] [20] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0xe9048)[0x7f44b06cb048]
[sn4622115580:50081] [21] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkapiplan+0xe1)[0x7f44b06cb241]
/lib64/libc.so.6(+0x7f474)[0x7f186cde8474]
[sn4622115580:50087] [ 5] [sn4622115580:50081] [22] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_plan_many_dft_r2c+0x145)[0x7f44b06d8f55]
[sn4622115580:50081] [23] /lib64/libc.so.6(+0x82d00)[0x7f186cdebd00]
[sn4622115580:50087] [ 6] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_plan_dft_r2c+0x25)[0x7f44b06d84c5]
[sn4622115580:50081] [24] /lib64/libc.so.6(+0x83056)[0x7f186cdec056]
[sn4622115580:50087] [ 7] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN18FourierTransformer7setRealER13MultidimArrayIdEb+0xc4)[0x57ee64]
[sn4622115580:50081] [25] /lib64/libc.so.6(__libc_memalign+0x75)[0x7f186cdef075]
[sn4622115580:50087] [ 8] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_malloc_plain+0x15)[0x7f18747db3f5]
[sn4622115580:50087] [ 9] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser30initialLowPassFilterReferencesEv+0xf1)[0x666ad1]
[sn4622115580:50081] [26] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser27maximizationOtherParametersEv+0xd68)[0x667da8]
[sn4622115580:50081] [27] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x3e480)[0x7f18747e4480]
[sn4622115580:50087] [10] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f18747de294]
[sn4622115580:50087] [11] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f18747de6cf]
[sn4622115580:50087] [12] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x444e2)[0x7f18747ea4e2]
[sn4622115580:50087] [13] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f18747de294]
[sn4622115580:50087] [14] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1174)[0x4ca264]
[sn4622115580:50081] [28] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f18747de6cf]
[sn4622115580:50087] [15] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x77a83)[0x7f187481da83]
[sn4622115580:50087] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x375)[0x4cb995]
[sn4622115580:50081] [29] [16] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f18747de294]
[sn4622115580:50087] [17] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkplan_d+0xf)[0x7f18747de6cf]
[sn4622115580:50087] [18] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x77a15)[0x7f187481da15]
[sn4622115580:50087] [19] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(main+0x53)[0x488be3]
[sn4622115580:50081] *** End of error message ***
/programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0x38294)[0x7f18747de294]
[sn4622115580:50087] [20] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(+0xe9048)[0x7f187488f048]
[sn4622115580:50087] [21] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_mkapiplan+0xe1)[0x7f187488f241]
[sn4622115580:50087] [22] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_plan_many_dft_r2c+0x145)[0x7f187489cf55]
[sn4622115580:50087] [23] /programs/x86_64-linux/relion/5.0-beta_cu12.2/relion_extlib/fftw-3.3.10-jsrp/lib/libfftw3.so.3(fftw_plan_dft_r2c+0x25)[0x7f187489c4c5]
[sn4622115580:50087] [24] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN18FourierTransformer7setRealER13MultidimArrayIdEb+0xc4)[0x57ee64]
[sn4622115580:50087] [25] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser30initialLowPassFilterReferencesEv+0xf1)[0x666ad1]
[sn4622115580:50087] [26] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser27maximizationOtherParametersEv+0xd68)[0x667da8]
[sn4622115580:50087] [27] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1174)[0x4ca264]
[sn4622115580:50087] [28] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x375)[0x4cb995]
[sn4622115580:50087] [29] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(main+0x53)[0x488be3]
[sn4622115580:50087] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun.relion noticed that process rank 8 with PID 0 on node sn4622115580 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
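
The backtraces of the two aborting processes (50081 and 50087) are interleaved above, which makes the RELION part of the call chain hard to follow. As a reading aid, here is a minimal Python sketch that pulls the `relion_refine_mpi` frames out of the log and partially demangles them. It is not part of RELION or Open MPI; the helper names `qualified_name` and `relion_call_chain` are made up for illustration, and the demangling only handles simple nested names (`_ZN<len><name>...E`) and drops the argument types.

```python
import re

# Frames copied verbatim from the trace above (process 50087).
SAMPLE = """\
[sn4622115580:50087] [24] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN18FourierTransformer7setRealER13MultidimArrayIdEb+0xc4)[0x57ee64]
[sn4622115580:50087] [25] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser30initialLowPassFilterReferencesEv+0xf1)[0x666ad1]
[sn4622115580:50087] [26] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN11MlOptimiser27maximizationOtherParametersEv+0xd68)[0x667da8]
[sn4622115580:50087] [27] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1174)[0x4ca264]
[sn4622115580:50087] [28] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x375)[0x4cb995]
[sn4622115580:50087] [29] /programs/x86_64-linux/relion/5.0-beta_cu12.2/bin/relion_refine_mpi(main+0x53)[0x488be3]
"""

# Matches "relion_refine_mpi(<symbol>+0x<offset>)" and captures the symbol.
FRAME = re.compile(r"relion_refine_mpi\(([A-Za-z_]\w*)\+0x[0-9a-f]+\)")


def qualified_name(mangled):
    """Return 'Class::method' for simple _ZN...E names; other symbols unchanged."""
    if not mangled.startswith("_ZN"):
        return mangled
    parts = []
    i = 3
    # Each component is encoded as <decimal length><identifier>.
    while i < len(mangled) and mangled[i].isdigit():
        j = i
        while j < len(mangled) and mangled[j].isdigit():
            j += 1
        length = int(mangled[i:j])
        parts.append(mangled[j:j + length])
        i = j + length
    return "::".join(parts)


def relion_call_chain(log_text):
    """List the distinct RELION frames in order of first appearance."""
    chain = []
    for symbol in FRAME.findall(log_text):
        name = qualified_name(symbol)
        if name not in chain:
            chain.append(name)
    return chain


if __name__ == "__main__":
    for depth, name in enumerate(relion_call_chain(SAMPLE)):
        print(f"{depth}: {name}")
```

The frames are listed innermost first, so the chain reads `main` → `MlOptimiserMpi::iterate` → `MlOptimiserMpi::maximization` → `MlOptimiser::maximizationOtherParameters` → `MlOptimiser::initialLowPassFilterReferences` → `FourierTransformer::setReal`. In the trace above, `setReal` then calls `fftw_plan_dft_r2c`, and the abort is raised inside the allocator during FFTW plan creation.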

Could this be related to the following bug report?

"weird combination between hardware settings and an oddly large boxsize"
