Abort signal during Refine3D and Class3D - corrupted size vs. prev_size #1224

Open
dmichalak opened this issue Dec 31, 2024 · 1 comment

dmichalak commented Dec 31, 2024

Describe your problem

When running a refinement or classification job, the run commonly aborts with the error corrupted size vs. prev_size after iteration 1 completes, and sometimes after iteration 2. If the error occurs after the estimation of accuracies in the orientational assignment has finished for that iteration, I can usually continue the job without running into the error again.

Tested

  • With and without Blush
  • 4, 8, 12 threads
  • Mask diameter (A) = 1000, 120

The error did not occur on the first iteration with Mask diameter (A) = 120, threads = 4, without Blush.
The error occurred on the second iteration with Mask diameter (A) = 240.

Could this be a GPU memory issue?

Environment:

  • OS: Ubuntu 22.04.5 LTS
  • MPI runtime: OpenMPI 4.1.2
  • RELION version: RELION-5.0.0-commit-4e57e4
  • Memory: 512 GB
  • GPU: 4x RTX 3090

Dataset:

  • Box size: 72 px
  • Pixel size: 4.232 Å/px
  • Number of particles: 3,500
  • Description: Lattice segment of a protein cage

Job options:

  • Type of job: Class3D
  • Number of MPI processes: 5
  • Number of threads: 4, 8, 16
`which relion_refine_mpi` --continue Class3D/job231/run_it002_optimiser.star --o Class3D/job253/run --dont_combine_weights_via_disc --pool 1 --pad 2  --iter 25 --tau2_fudge 2 --particle_diameter 1000 --blush  --oversampling 1 --healpix_order 2 --sigma_ang 15 --offset_range 3 --offset_step 2 --allow_coarser_sampling --j 4 --gpu ""  --pipeline_control Class3D/job253/

run.out

RELION version: 5.0.0-commit-4e57a4 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Number of threads per MPI process       = 4
 + Total number of threads therefore       = 20
 + Leader      (0) runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     1  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     2  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     3  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     4  runs on host            = bbc-u02230823.nhlbi.nih.gov
 ==========================
 uniqueHost bbc-u02230823.nhlbi.nih.gov has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
 Thread 1 on follower 1 mapped to device 0
 Thread 2 on follower 1 mapped to device 0
 Thread 3 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 1
 Thread 1 on follower 2 mapped to device 1
 Thread 2 on follower 2 mapped to device 1
 Thread 3 on follower 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 2
 Thread 1 on follower 3 mapped to device 2
 Thread 2 on follower 3 mapped to device 2
 Thread 3 on follower 3 mapped to device 2
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 3
 Thread 1 on follower 4 mapped to device 3
 Thread 2 on follower 4 mapped to device 3
 Thread 3 on follower 4 mapped to device 3
 Running CPU instructions in double precision. 
 Estimating initial noise spectra from at most 480 particles 
   0/   0 sec ....................................................................................................................................................................~~(,_,"> yum!
 CurrentResolution= 20.3136 Angstroms, which requires orientationSampling of at least 2.32258 degrees for a particle of diameter 1000 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 2660
 OrientationalSampling= 15 NrOrientations= 140
 TranslationalSampling= 8.464 NrTranslations= 19
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 170240
 OrientationalSampling= 7.5 NrOrientations= 1120
 TranslationalSampling= 4.232 NrTranslations= 152
=============================
 Expectation iteration 1 of 25
  43/  43 sec ............................................................~~(,_,"> yum!
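
As a quick sanity check, the sampling figures in the log above appear internally consistent. The sketch below recomputes them from the logged values. The formula for the required angular step is an assumption chosen because it reproduces the logged 2.32258 degrees; it is not quoted from the RELION documentation. The same goes for reading NrHiddenVariableSamplingPoints as orientations times translations.

```python
import math

# Values copied from the run.out excerpt above.
resolution_A = 20.3136   # CurrentResolution, in Angstroms
diameter_A = 1000.0      # --particle_diameter, in Angstroms
pixel_size_A = 4.232     # from the dataset description
offset_step_px = 2       # --offset_step

# ASSUMPTION: required step = 360 / ceil(pi * diameter / resolution).
# This matches the logged value but is not taken from the RELION docs.
required_step_deg = 360.0 / math.ceil(math.pi * diameter_A / resolution_A)

# ASSUMPTION: hidden-variable sampling points = orientations x translations.
coarse_points = 140 * 19      # Oversampling= 0
fine_points = 1120 * 152      # Oversampling= 1

# Translational sampling in Angstroms = offset step (px) x pixel size.
coarse_translation_A = offset_step_px * pixel_size_A

print(f"required angular step: {required_step_deg:.5f} deg")
print(f"coarse sampling points: {coarse_points}")
print(f"fine sampling points: {fine_points}")
print(f"coarse translational sampling: {coarse_translation_A:.3f} A")
```

With a 1000 Å diameter, the required step (about 2.3 degrees) is much finer than the 7.5 degrees actually sampled. That may be why --allow_coarser_sampling is present in the command.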

Error message:

corrupted size vs. prev_size
[bbc-u02230823:3509023] *** Process received signal ***
[bbc-u02230823:3509023] Signal: Aborted (6)
[bbc-u02230823:3509023] Signal code:  (-6)
[bbc-u02230823:3509023] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74a778842520]
[bbc-u02230823:3509023] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74a7788969fc]
[bbc-u02230823:3509023] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74a778842476]
[bbc-u02230823:3509023] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74a7788287f3]
[bbc-u02230823:3509023] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x74a778889676]
[bbc-u02230823:3509023] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x74a7788a0cfc]
[bbc-u02230823:3509023] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa17e2)[0x74a7788a17e2]
[bbc-u02230823:3509023] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0xa1969)[0x74a7788a1969]
[bbc-u02230823:3509023] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0xa3bdb)[0x74a7788a3bdb]
[bbc-u02230823:3509023] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0xa4d4f)[0x74a7788a4d4f]
[bbc-u02230823:3509023] [10] /lib/x86_64-linux-gnu/libc.so.6(memalign+0x7f)[0x74a7788a5cdf]
[bbc-u02230823:3509023] [11] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]
[bbc-u02230823:3509023] [12] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]
[bbc-u02230823:3509023] [13] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]
[bbc-u02230823:3509023] [14] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]
[bbc-u02230823:3509023] [15] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]
[bbc-u02230823:3509023] [16] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(main+0x84)[0x5bbd7bde1544]
[bbc-u02230823:3509023] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74a778829d90]
[bbc-u02230823:3509023] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74a778829e40]
[bbc-u02230823:3509023] [19] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_start+0x25)[0x5bbd7bde4fe5]
[bbc-u02230823:3509023] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node bbc-u02230823 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
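
To make the backtrace above easier to read, the relion_refine_mpi frames can be reduced to Class::method names. The sketch below is a deliberately minimal parser. It only handles simple nested names of the form _ZN<len><name>...E and ignores argument types, so it is not a general demangler (c++filt is the proper tool for that). The helper name qualified_name is hypothetical.

```python
import re

# Application frames copied from the error message above
# (binary path omitted; symbols and offsets are verbatim).
frames = [
    "relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]",
    "relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]",
    "relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]",
    "relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]",
    "relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]",
]


def qualified_name(mangled):
    """Return 'Class::method' for a simple _ZN<len><id>...E symbol, else None."""
    if not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(mangled) and mangled[pos].isdigit():
        match = re.match(r"\d+", mangled[pos:])
        length = int(match.group())
        start = pos + match.end()
        parts.append(mangled[start:start + length])
        pos = start + length
    return "::".join(parts) if parts else None


names = []
for frame in frames:
    # The mangled symbol sits between '(' and the '+0x...' offset.
    symbol = re.search(r"\((_Z\w+)\+0x", frame).group(1)
    names.append(qualified_name(symbol))

# Innermost frame first, matching the order of the backtrace.
for name in names:
    print(name)
```

Read this way, the abort is raised inside glibc's memalign, called from Projector::initialiseData while the weighted sums are being combined in MlOptimiserMpi::combineAllWeightedSums. The corrupted size vs. prev_size message comes from glibc's heap consistency checks, so this trace looks more like host-side heap corruption than GPU memory running out. That is an interpretation of the trace, not a confirmed diagnosis.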
@biochem-fan
Member

Does this happen only on this particular dataset or all datasets on this machine?
