When running a refinement or classification job, an error commonly occurs after iteration 1 completes (sometimes after iteration 2) with the message "corrupted size vs. prev_size". If the estimation of accuracies in the orientational assignment completes after the iteration and the error occurs only after that, I can usually continue the job without running into the error again.
Tested:
With and without Blush
4, 8, and 12 threads
Mask diameter (A) = 1000 and 120
The error did not occur on the first iteration with Mask diameter (A) = 120, threads = 4, without Blush.
The error occurred on the second iteration with Mask diameter (A) = 240.
RELION version: 5.0.0-commit-4e57a4
Precision: BASE=double, CUDA-ACC=single
=== RELION MPI setup ===
+ Number of MPI processes = 5
+ Number of threads per MPI process = 4
+ Total number of threads therefore = 20
+ Leader (0) runs on host = bbc-u02230823.nhlbi.nih.gov
+ Follower 1 runs on host = bbc-u02230823.nhlbi.nih.gov
+ Follower 2 runs on host = bbc-u02230823.nhlbi.nih.gov
+ Follower 3 runs on host = bbc-u02230823.nhlbi.nih.gov
+ Follower 4 runs on host = bbc-u02230823.nhlbi.nih.gov
==========================
uniqueHost bbc-u02230823.nhlbi.nih.gov has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 1 mapped to device 0
Thread 1 on follower 1 mapped to device 0
Thread 2 on follower 1 mapped to device 0
Thread 3 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 2 mapped to device 1
Thread 1 on follower 2 mapped to device 1
Thread 2 on follower 2 mapped to device 1
Thread 3 on follower 2 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 3 mapped to device 2
Thread 1 on follower 3 mapped to device 2
Thread 2 on follower 3 mapped to device 2
Thread 3 on follower 3 mapped to device 2
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 4 mapped to device 3
Thread 1 on follower 4 mapped to device 3
Thread 2 on follower 4 mapped to device 3
Thread 3 on follower 4 mapped to device 3
Running CPU instructions in double precision.
Estimating initial noise spectra from at most 480 particles
0/ 0 sec ....................................................................................................................................................................~~(,_,"> yum!
CurrentResolution= 20.3136 Angstroms, which requires orientationSampling of at least 2.32258 degrees for a particle of diameter 1000 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 2660
OrientationalSampling= 15 NrOrientations= 140
TranslationalSampling= 8.464 NrTranslations= 19
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 170240
OrientationalSampling= 7.5 NrOrientations= 1120
TranslationalSampling= 4.232 NrTranslations= 152
=============================
Expectation iteration 1 of 25
43/ 43 sec ............................................................~~(,_,"> yum!
Error message:
corrupted size vs. prev_size
[bbc-u02230823:3509023] *** Process received signal ***
[bbc-u02230823:3509023] Signal: Aborted (6)
[bbc-u02230823:3509023] Signal code: (-6)
[bbc-u02230823:3509023] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x74a778842520]
[bbc-u02230823:3509023] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x74a7788969fc]
[bbc-u02230823:3509023] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x74a778842476]
[bbc-u02230823:3509023] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x74a7788287f3]
[bbc-u02230823:3509023] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x74a778889676]
[bbc-u02230823:3509023] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x74a7788a0cfc]
[bbc-u02230823:3509023] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0xa17e2)[0x74a7788a17e2]
[bbc-u02230823:3509023] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0xa1969)[0x74a7788a1969]
[bbc-u02230823:3509023] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0xa3bdb)[0x74a7788a3bdb]
[bbc-u02230823:3509023] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0xa4d4f)[0x74a7788a4d4f]
[bbc-u02230823:3509023] [10] /lib/x86_64-linux-gnu/libc.so.6(memalign+0x7f)[0x74a7788a5cdf]
[bbc-u02230823:3509023] [11] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]
[bbc-u02230823:3509023] [12] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]
[bbc-u02230823:3509023] [13] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]
[bbc-u02230823:3509023] [14] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]
[bbc-u02230823:3509023] [15] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]
[bbc-u02230823:3509023] [16] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(main+0x84)[0x5bbd7bde1544]
[bbc-u02230823:3509023] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74a778829d90]
[bbc-u02230823:3509023] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74a778829e40]
[bbc-u02230823:3509023] [19] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_start+0x25)[0x5bbd7bde4fe5]
[bbc-u02230823:3509023] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node bbc-u02230823 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Could this be a GPU memory issue?
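To make the backtrace easier to read when weighing that question, here is a minimal Python sketch that pulls the relion_refine_mpi frames out of the trace and decodes their class and method names. The `nested_name` and `call_chain` helpers are hypothetical names written for this illustration; they handle only the simple `_ZN<len><name>...E` form that appears in this trace and ignore argument types, so a real demangler such as c++filt is the better tool in general. The trace lines are copied from the error message in this report.

```python
import re

# Application frames copied verbatim from the backtrace in this report.
TRACE = """\
[bbc-u02230823:3509023] [10] /lib/x86_64-linux-gnu/libc.so.6(memalign+0x7f)[0x74a7788a5cdf]
[bbc-u02230823:3509023] [11] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x79b)[0x5bbd7be5542b]
[bbc-u02230823:3509023] [12] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN13BackProjector23initialiseDataAndWeightEi+0xd)[0x5bbd7bed38dd]
[bbc-u02230823:3509023] [13] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN11MlWsumModel6unpackER13MultidimArrayIdEib+0xcb6)[0x5bbd7bff0c56]
[bbc-u02230823:3509023] [14] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi22combineAllWeightedSumsEv+0xa10)[0x5bbd7be1fc20]
[bbc-u02230823:3509023] [15] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x7c5)[0x5bbd7be37b05]
[bbc-u02230823:3509023] [16] /home/michalakdj/src/relion5/build/bin/relion_refine_mpi(main+0x84)[0x5bbd7bde1544]
"""

# Matches "[ N] path(_ZN...+0xOFFSET)"; frames without a _ZN symbol are skipped.
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+\S+\((_ZN\w+)\+0x[0-9a-f]+\)")


def nested_name(mangled):
    """Decode only the _ZN<len><id>...E prefix (assumption: no templates or
    namespaces beyond plain length-prefixed identifiers). Argument types are ignored."""
    pos = 3  # skip the "_ZN" prefix
    parts = []
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return "::".join(parts)


def call_chain(trace):
    """Return (frame_index, Class::method) pairs in the order they appear."""
    chain = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            chain.append((int(match.group(1)), nested_name(match.group(2))))
    return chain


# Print outermost caller first, innermost application frame last.
for index, name in reversed(call_chain(TRACE)):
    print(f"[{index:2d}] {name}")
```

Read this way, the abort is raised inside glibc's memalign, called from Projector::initialiseData while MlOptimiserMpi::combineAllWeightedSums unpacks the weighted sums. "corrupted size vs. prev_size" is a consistency check in the glibc heap allocator, so the failure is detected in host memory. That does not show where the corruption was introduced, since an earlier out-of-bounds write elsewhere could have damaged the heap before this allocation.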