Abort signal during Refine3D and Class3D - corrupted size vs. prev_size #1224

dmichalak · 2024-12-31T17:22:43Z

Describe your problem

When running a refinement or classification job, the error "corrupted size vs. prev_size" commonly occurs after iteration 1 completes, and sometimes after iteration 2. If the estimation of accuracies in the orientational assignment finishes after the iteration and the error occurs only afterwards, I can usually continue the job without running into the error again.

Tested

With and without Blush
4, 8, 12 threads
Mask diameter (A) = 1000, 120

Error did not occur on the first iteration with Mask diameter (A) = 120, threads = 4, without Blush.
Error occurred on the second iteration with Mask diameter (A) = 240

Could this be a GPU memory issue?

Environment:

OS: Ubuntu 22.04.5 LTS
MPI runtime: OpenMPI 4.1.2
RELION version RELION-5.0.0-commit-4e57e4
Memory: 512 GB
GPU: 4x RTX 3090

Dataset:

Box size: 72 px
Pixel size: 4.232 Å/px
Number of particles: 3,500
Description: Lattice segment of a protein cage

Job options:

Type of job: Class3D
Number of MPI processes: 5
Number of threads: 4, 8, 16

`which relion_refine_mpi` --continue Class3D/job231/run_it002_optimiser.star --o Class3D/job253/run --dont_combine_weights_via_disc --pool 1 --pad 2  --iter 25 --tau2_fudge 2 --particle_diameter 1000 --blush  --oversampling 1 --healpix_order 2 --sigma_ang 15 --offset_range 3 --offset_step 2 --allow_coarser_sampling --j 4 --gpu ""  --pipeline_control Class3D/job253/

run.out

RELION version: 5.0.0-commit-4e57a4 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Number of threads per MPI process       = 4
 + Total number of threads therefore       = 20
 + Leader      (0) runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     1  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     2  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     3  runs on host            = bbc-u02230823.nhlbi.nih.gov
 + Follower     4  runs on host            = bbc-u02230823.nhlbi.nih.gov
 ==========================
 uniqueHost bbc-u02230823.nhlbi.nih.gov has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
 Thread 1 on follower 1 mapped to device 0
 Thread 2 on follower 1 mapped to device 0
 Thread 3 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 1
 Thread 1 on follower 2 mapped to device 1
 Thread 2 on follower 2 mapped to device 1
 Thread 3 on follower 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 2
 Thread 1 on follower 3 mapped to device 2
 Thread 2 on follower 3 mapped to device 2
 Thread 3 on follower 3 mapped to device 2
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 3
 Thread 1 on follower 4 mapped to device 3
 Thread 2 on follower 4 mapped to device 3
 Thread 3 on follower 4 mapped to device 3
 Running CPU instructions in double precision. 
 Estimating initial noise spectra from at most 480 particles 
   0/   0 sec ....................................................................................................................................................................~~(,_,"> yum!
 CurrentResolution= 20.3136 Angstroms, which requires orientationSampling of at least 2.32258 degrees for a particle of diameter 1000 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 2660
 OrientationalSampling= 15 NrOrientations= 140
 TranslationalSampling= 8.464 NrTranslations= 19
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 170240
 OrientationalSampling= 7.5 NrOrientations= 1120
 TranslationalSampling= 4.232 NrTranslations= 152
=============================
 Expectation iteration 1 of 25
  43/  43 sec ............................................................~~(,_,"> yum!

Error message:

corrupted size vs. prev_size
[bbc-u02230823:3509023] *** Process received signal ***
[bbc-u02230823:3509023] Signal: Aborted (6)
[bbc-u02230823:3509023] Signal code:  (-6)
[bbc-u02230823:3509023] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74a778842520]
[bbc-u02230823:3509023] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74a7788969fc]
[bbc-u02230823:3509023] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74a778842476]
[bbc-u02230823:3509023] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74a7788287f3]
[bbc-u02230823:3509023] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x74a778889676]
[bbc-u02230823:3509023] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x74a7788a0cfc]
[bbc-u02230823:3509023] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa17e2)[0x74a7788a17e2]
[bbc-u02230823:3509023] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0xa1969)[0x74a7788a1969]
[bbc-u02230823:3509023] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0xa3bdb)[0x74a7788a3bdb]
[bbc-u02230823:3509023] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0xa4d4f)[0x74a7788a4d4f]
[bbc-u02230823:3509023] [10] /lib/x86_64-linux-gnu/libc.so.6(memalign+0x7f)[0x74a7788a5cdf]
[bbc-u02230823:3509023] [11] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]
[bbc-u02230823:3509023] [12] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]
[bbc-u02230823:3509023] [13] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]
[bbc-u02230823:3509023] [14] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]
[bbc-u02230823:3509023] [15] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]
[bbc-u02230823:3509023] [16] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(main+0x84)[0x5bbd7bde1544]
[bbc-u02230823:3509023] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74a778829d90]
[bbc-u02230823:3509023] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74a778829e40]
[bbc-u02230823:3509023] [19] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_start+0x25)[0x5bbd7bde4fe5]
[bbc-u02230823:3509023] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node bbc-u02230823 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
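
For readers following the backtrace: "corrupted size vs. prev_size" is a message printed by glibc's heap consistency checks in host memory, and frames 11 to 15 are the RELION functions on the stack when the check fired. The sketch below pulls those frames out of the log and turns the mangled C++ names into readable ones. It is a minimal illustration, not a full demangler: the helper names (demangle_nested, application_frames) are hypothetical, and it only handles plain nested names of the form _ZN<len><name>...E, which is enough for this trace. A tool such as c++filt would be the usual choice where available.

```python
import re

# A few frames copied verbatim from the error message above.
BACKTRACE = """\
[bbc-u02230823:3509023] [10] /lib/x86_64-linux-gnu/libc.so.6(memalign+0x7f)[0x74a7788a5cdf]
[bbc-u02230823:3509023] [11] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]
[bbc-u02230823:3509023] [12] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]
[bbc-u02230823:3509023] [13] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]
[bbc-u02230823:3509023] [14] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]
[bbc-u02230823:3509023] [15] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]
[bbc-u02230823:3509023] [16] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(main+0x84)[0x5bbd7bde1544]
"""

# Captures a mangled nested name inside "(symbol+0xoffset)".
FRAME_RE = re.compile(r"\((_ZN[0-9A-Za-z_]+)\+0x[0-9a-f]+\)")


def demangle_nested(symbol):
    """Return 'Class::method' for a plain _ZN<len><name>...E symbol.

    Argument types after the closing 'E' are ignored. Anything that does
    not start with _ZN is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts) if parts else symbol


def application_frames(log_text):
    """List demangled C++ frames in log order (innermost call first)."""
    return [demangle_nested(m.group(1)) for m in FRAME_RE.finditer(log_text)]


if __name__ == "__main__":
    # Print the call chain from the outermost caller to the failing allocation.
    print(" -> ".join(reversed(application_frames(BACKTRACE))))
```

Read this way, the abort is raised inside memalign, called from Projector::initialiseData while rank 4 unpacks the combined weighted sums. The heap damage itself may have happened earlier; the allocator only detects it at the next allocation.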


biochem-fan · 2025-01-01T00:27:28Z

Does this happen only on this particular dataset or all datasets on this machine?
