DCU tests failed #5214

Open
9 tasks
pxlxingliang opened this issue Oct 11, 2024 · 1 comment
@pxlxingliang
Collaborator

Describe the Testing Issue

The daily DCU test failed on example 005_16Na on 2024-10-11.

https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-dcu-cg-372d8a

The error message:

                              ABACUS v3.8.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 5329628 (Thu Oct 10 22:45:13 2024 +0800)

 Fri Oct 11 00:28:57 2024

Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
[j12r4n15:21269:0:21269] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
==== backtrace (tid:  21269) ====
 0 0x0000000000051213 ucs_debug_print_backtrace()  /public/home/bujd/tmp/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/sources/ucx-1.8.0/src/ucs/debug/debug.c:625
 1 0x000000000008559c __GI___libc_free()  :0
 2 0x0000000000c35f99 std::string::assign()  ???:0
 3 0x0000000000c360b9 std::string::assign()  ???:0
 4 0x0000000000c3783e std::string::assign()  ???:0
 5 0x0000000000bc8cec std::string::assign()  ???:0
 6 0x0000000000c20106 std::string::assign()  ???:0
 7 0x0000000000c8583a hipGetCmdName()  ???:0
 8 0x0000000000ca05ee hipGetDeviceCount()  ???:0
 9 0x0000000000453344 base_device::information::get_device_flag()  ???:0
10 0x0000000000183f08 std::_Function_handler<void (ModuleIO::Input_Item const&, Parameter&), ModuleIO::ReadInput::item_system()::$_169>::_M_invoke()  read_input_item_system.cpp:0
11 0x00000000001e97aa ModuleIO::ReadInput::read_txt_input()  ???:0
12 0x00000000001e90ac ModuleIO::ReadInput::read_parameters()  ???:0
13 0x0000000000250de5 Driver::reading()  ???:0
14 0x0000000000250c3d Driver::init()  ???:0
15 0x00000000000602d7 main()  ???:0
16 0x00000000000223d5 __libc_start_main()  ???:0
17 0x0000000000060160 _start()  ???:0
=================================
[j12r4n15:21269] *** Process received signal ***
[j12r4n15:21269] Signal: Segmentation fault (11)
[j12r4n15:21269] Signal code:  (-6)
[j12r4n15:21269] Failing at address: 0x62e000005315
[j12r4n15:21269] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b34ca07f5d0]
[j12r4n15:21269] [ 1] /lib64/libc.so.6(cfree+0x1c)[0x2b34d47bc59c]
[j12r4n15:21269] [ 2] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc35f99)[0x2b34cc079f99]
[j12r4n15:21269] [ 3] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc360b9)[0x2b34cc07a0b9]
[j12r4n15:21269] [ 4] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc3783e)[0x2b34cc07b83e]
[j12r4n15:21269] [ 5] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xbc8cec)[0x2b34cc00ccec]
[j12r4n15:21269] [ 6] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc20106)[0x2b34cc064106]
[j12r4n15:21269] [ 7] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc8583a)[0x2b34cc0c983a]
[j12r4n15:21269] [ 8] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(hipGetDeviceCount+0x17e)[0x2b34cc0e45ee]
[j12r4n15:21269] [ 9] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x453344)[0x55b838642344]
[j12r4n15:21269] [10] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x183f08)[0x55b838372f08]
[j12r4n15:21269] [11] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x1e97aa)[0x55b8383d87aa]
[j12r4n15:21269] [12] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x1e90ac)[0x55b8383d80ac]
[j12r4n15:21269] [13] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x250de5)[0x55b83843fde5]
[j12r4n15:21269] [14] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x250c3d)[0x55b83843fc3d]
[j12r4n15:21269] [15] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x602d7)[0x55b83824f2d7]
[j12r4n15:21269] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b34d47593d5]
[j12r4n15:21269] [17] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x60160)[0x55b83824f160]
[j12r4n15:21269] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21269 on node j12r4n15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
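The Open MPI frame list above can be grouped by module to see where the crash sits relative to ABACUS code. The following is a minimal sketch, not part of ABACUS or abacustest: the helper name `summarize_frames` and the regular expression are assumptions for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import OrderedDict

# Matches Open MPI style frames such as
# "[host:pid] [ 8] /path/lib.so(symbol+0x17e)[0xaddr]"
FRAME_RE = re.compile(
    r"^\[[^\]]+\]\s+\[\s*(\d+)\]\s+(\S+?)\(([^)]*)\)\[0x[0-9a-fA-F]+\]"
)

# A few frames taken verbatim from the error message in this issue.
SAMPLE_LOG = """\
[j12r4n15:21269] *** Process received signal ***
[j12r4n15:21269] Signal: Segmentation fault (11)
[j12r4n15:21269] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b34ca07f5d0]
[j12r4n15:21269] [ 1] /lib64/libc.so.6(cfree+0x1c)[0x2b34d47bc59c]
[j12r4n15:21269] [ 8] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(hipGetDeviceCount+0x17e)[0x2b34cc0e45ee]
[j12r4n15:21269] [ 9] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x453344)[0x55b838642344]
"""


def summarize_frames(log_text):
    """Return an ordered mapping: module path -> list of frame indices."""
    modules = OrderedDict()
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, module = int(match.group(1)), match.group(2)
            modules.setdefault(module, []).append(index)
    return modules


if __name__ == "__main__":
    for module, frames in summarize_frames(SAMPLE_LOG).items():
        print(module, frames)
```

Applied to the full log, this grouping would show frames 2 to 8 inside `libgalaxyhip.so.5` (ending in `hipGetDeviceCount`) and frames 9 to 15 inside `abacus_pw`, which matches the symbolized backtrace: the fault occurs in the HIP runtime during device enumeration, called from `base_device::information::get_device_flag()` while reading the input.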

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Understand the testing issue described by the developer.
  • Review the specific test case, expected and actual results, and any error messages.
  • Identify the root cause of the test failure or issue.
  • If a possible solution is suggested, evaluate its feasibility and effectiveness.
  • Implement a fix for the test failure or issue, or create a new test case if needed.
  • Verify that the fix resolves the testing issue and the test case passes.
  • Review and update any relevant documentation, such as test plans or user guides.
  • Ensure the testing issue is resolved and close the ticket.
  • Share any lessons learned or best practices with the team to prevent similar issues in the future.
@mohanchen mohanchen added the GPU & DCU & HPC GPU and DCU and HPC related any issues label Oct 14, 2024
@WHUweiqingzhou
Collaborator

All recent DCU tests have passed, so this failure may have been caused by a problem with the machine. We will close this issue at the next meeting.
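One way to check the machine hypothesis is to count crashes per node across the daily test logs. The sketch below is only an illustration under stated assumptions: `count_crashes_by_node` is a hypothetical helper, the first log entry is the `mpirun` line from this issue, and the second entry is a made-up placeholder standing in for a passing run.

```python
import re
from collections import Counter

# Matches the mpirun summary line, e.g.
# "mpirun noticed that process rank 0 with PID 21269 on node j12r4n15
#  exited on signal 11 (Segmentation fault)."
NOTICE_RE = re.compile(
    r"mpirun noticed that process rank (\d+) with PID (\d+) on node (\S+) "
    r"exited on signal (\d+)"
)

LOGS = [
    # Real line from this issue.
    "mpirun noticed that process rank 0 with PID 21269 on node j12r4n15 "
    "exited on signal 11 (Segmentation fault).",
    # Hypothetical placeholder for a run that finished without a crash.
    "placeholder: run finished without an mpirun crash notice",
]


def count_crashes_by_node(logs):
    """Count mpirun crash notices per node name across several log texts."""
    counts = Counter()
    for text in logs:
        for match in NOTICE_RE.finditer(text):
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    for node, crashes in count_crashes_by_node(LOGS).items():
        print(node, crashes)
```

If the crashes cluster on a single node such as `j12r4n15` while other nodes pass with the same commit, that would support a machine problem rather than a code regression.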

@WHUweiqingzhou WHUweiqingzhou self-assigned this Oct 17, 2024
@mohanchen mohanchen changed the title from "dcu test failed" to "DCU tests failed" Oct 17, 2024
3 participants