Bug: nspin=4 not working with device=gpu #5306

Open
16 tasks
AsTonyshment opened this issue Oct 18, 2024 · 0 comments · May be fixed by #5319
Labels

  • Bugs (Exclude input and output): Bugs that are only solvable with sufficient knowledge of DFT
  • GPU & DCU & HPC: Any GPU, DCU, or HPC related issues
  • non-collinear+SOC: Issues related to SOC

Describe the bug

With device=gpu, nspin=4 calculations crash with a segmentation fault right after the SCF iteration starts (see log below).

 << Start SCF iteration.
[Workstation:842863] *** Process received signal ***
[Workstation:842863] Signal: Segmentation fault (11)
[Workstation:842863] Signal code: Address not mapped (1)
[Workstation:842863] Failing at address: 0x8
[Workstation:842863] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0xebbad842520]
[Workstation:842863] [ 1] /home/abacus-develop/build/abacus(+0x75d2a5)[0x60d5e747f2a5]
[Workstation:842863] [ 2] /home/abacus-develop/build/abacus(+0x75724d)[0x60d5e747924d]
[Workstation:842863] [ 3] /home/abacus-develop/build/abacus(+0x73270b)[0x60d5e745470b]
[Workstation:842863] [ 4] /home/abacus-develop/build/abacus(+0x683b58)[0x60d5e73a5b58]
[Workstation:842863] [ 5] /home/abacus-develop/build/abacus(+0x682811)[0x60d5e73a4811]
[Workstation:842863] [ 6] /home/abacus-develop/build/abacus(+0x67b1a5)[0x60d5e739d1a5]
[Workstation:842863] [ 7] /home/abacus-develop/build/abacus(+0x3e54b9)[0x60d5e71074b9]
[Workstation:842863] [ 8] /home/abacus-develop/build/abacus(+0x5b37b5)[0x60d5e72d57b5]
[Workstation:842863] [ 9] /home/abacus-develop/build/abacus(+0x56994d)[0x60d5e728b94d]
[Workstation:842863] [10] /home/abacus-develop/build/abacus(+0x34ef5c)[0x60d5e7070f5c]
[Workstation:842863] [11] /home/abacus-develop/build/abacus(+0x36416b)[0x60d5e708616b]
[Workstation:842863] [12] /home/abacus-develop/build/abacus(+0x3621c1)[0x60d5e70841c1]
[Workstation:842863] [13] /home/abacus-develop/build/abacus(+0x3638b7)[0x60d5e70858b7]
[Workstation:842863] [14] /home/abacus-develop/build/abacus(+0x99b64)[0x60d5e6dbbb64]
[Workstation:842863] [15] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0xebbad829d90]
[Workstation:842863] [16] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0xebbad829e40]
[Workstation:842863] [17] /home/abacus-develop/build/abacus(+0x99a05)[0x60d5e6dbba05]
[Workstation:842863] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node Workstation exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Further testing suggests that the ks_solver itself is not at fault: device=cpu with ks_solver=cusolver works correctly. The crash therefore appears to originate in the GPU grid integration rather than in the solver.
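The backtrace above contains only raw offsets into the abacus binary, so it does not say which function crashed. A minimal sketch of one way to narrow this down, assuming the binary still carries symbols or debug info: extract the `(+0x...)` offsets from the log and pass them to binutils `addr2line`. The helper names `frame_offsets` and `addr2line_command` are hypothetical and not part of ABACUS.

```python
import re

# Matches frames such as:
#   [ 1] /home/abacus-develop/build/abacus(+0x75d2a5)[0x60d5e747f2a5]
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+(\S+?)\(\+(0x[0-9a-fA-F]+)\)")


def frame_offsets(log_text, binary_suffix="/abacus"):
    """Return (frame_index, offset) pairs for frames located in the given binary."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        # Skip frames from other modules (for example libc).
        if match and match.group(2).endswith(binary_suffix):
            frames.append((int(match.group(1)), match.group(3)))
    return frames


def addr2line_command(binary_path, offsets):
    """Build an addr2line invocation: -f prints function names, -C demangles C++."""
    return ["addr2line", "-f", "-C", "-e", binary_path, *offsets]
```

For example, `addr2line_command("/home/abacus-develop/build/abacus", [off for _, off in frame_offsets(log)])` yields an argument list that can be handed to `subprocess.run`. Without symbols or debug info in the build, addr2line prints `??` for the unresolved frames.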

Expected behavior

nspin=4 calculations should run with device=gpu just as they do with device=cpu, which works well.

To Reproduce

  1. Set device=gpu and nspin=4 for any SCF calculation.
  2. Run the calculation.
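The steps above can be sketched as a small helper that writes two INPUT files differing only in `device`, so the CPU and GPU runs can be compared directly. This is an illustration under assumptions: the `render_input` helper is hypothetical, the `INPUT_PARAMETERS` header and `key value` layout follow the usual ABACUS INPUT convention, and a real run still needs the structure, k-point, and pseudopotential/orbital files plus any other parameters the system requires.

```python
def render_input(device, nspin=4, extra=None):
    """Render a minimal ABACUS-style INPUT file as a string.

    Only `calculation`, `nspin`, and `device` are set here; anything else
    needed for a real calculation can be passed through `extra`.
    """
    params = {"calculation": "scf", "nspin": nspin, "device": device}
    params.update(extra or {})
    lines = ["INPUT_PARAMETERS"]
    lines += [f"{key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One file per device so the two runs differ in a single parameter.
    for device in ("cpu", "gpu"):
        with open(f"INPUT.{device}", "w") as handle:
            handle.write(render_input(device))
```

Keeping every other setting identical between the two files makes it easier to attribute a crash to the device switch alone.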

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).