CI: Build with OpenBLAS instead of Apple Accelerate on macOS (Homebrew). #515

Merged: 1 commit, Aug 5, 2024


@mmuetzel (Contributor) commented Aug 5, 2024

Check whether using OpenBLAS instead of Apple Accelerate makes a difference for the failing tests that involve complex numbers.

See: #512 (comment)

@mmuetzel mentioned this pull request Aug 5, 2024
@raback (Contributor) commented Aug 5, 2024

It seems that with macos-13 all issues were resolved, and with macos-14 there are two remaining (unrelated?) issues. So I guess this could be merged. I am still trying to pin down exactly where things go wrong.

@mmuetzel (Contributor, Author) commented Aug 5, 2024

It looks like that did help. The macOS runner on Intel hardware now passes all tests, and on the Apple Silicon runner only these two tests still fail:

Errors while running CTest
	150 - EMWaveBoxHexasEigen (Failed)
	214 - FixTangentVelo (Failed)

@mmuetzel (Contributor, Author) commented Aug 5, 2024

Unrelated to these changes, there still seems to be at least one test that fails sporadically on MinGW:

The following tests FAILED:
	324 - MonolithicSlave2 (Failed)

It failed on only one of the runners and passed when it was re-run...

@mmuetzel (Contributor, Author) commented Aug 5, 2024

Running that test with valgrind shows a couple of warnings:

HeatSolve: Assembly:
HeatSolve: Assembly done
EnforceDirichletConditions: Dirichlet conditions enforced for dofs: 15
MergeSlaveSolvers: Monolithic treatment of solvers
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4C1FC04: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14401)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185==    by 0x4CD8CCE: __mainutils_MOD_solveequations (MainUtils.F90:2944)
==92185==    by 0x516A69E: execsimulation.2 (ElmerSolver.F90:3301)
==92185==    by 0x5161818: elmersolver_ (ElmerSolver.F90:672)
==92185==    by 0x109498: MAIN__ (Solver.F90:57)
==92185== 
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4C2103E: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14523)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185==    by 0x4CD8CCE: __mainutils_MOD_solveequations (MainUtils.F90:2944)
==92185==    by 0x516A69E: execsimulation.2 (ElmerSolver.F90:3301)
==92185==    by 0x5161818: elmersolver_ (ElmerSolver.F90:672)
==92185==    by 0x109498: MAIN__ (Solver.F90:57)
==92185== 
CRS_IncompleteLU: ILU(0) (Real), Performing Factorization:
CRS_IncompleteLU: ILU(0) (Real), NOF nonzeros:      1979
CRS_IncompleteLU: ILU(0) (Real), filling (%) :       100
CRS_IncompleteLU: ILU(0) (Real), Factorization ready at (s):     0.01
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4B2C743: realidrs.3 (IterativeMethods.F90:1621)
==92185==    by 0x4B30919: __iterativemethods_MOD_itermethod_idrs (IterativeMethods.F90:1545)
==92185==    by 0x491E19A: __loadmod_MOD_itercallftnr (LoadMod.F90:822)
==92185==    by 0x4B1FF7B: __itersolve_MOD_itersolver (IterSolve.F90:1005)
==92185==    by 0x4C233AA: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14668)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185== 
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4B30FB2: __iterativemethods_MOD_itermethod_idrs (IterativeMethods.F90:1559)
==92185==    by 0x491E19A: __loadmod_MOD_itercallftnr (LoadMod.F90:822)
==92185==    by 0x4B1FF7B: __itersolve_MOD_itersolver (IterSolve.F90:1005)
==92185==    by 0x4C233AA: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14668)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185==    by 0x4CD8CCE: __mainutils_MOD_solveequations (MainUtils.F90:2944)
==92185== 
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4B30FCA: __iterativemethods_MOD_itermethod_idrs (IterativeMethods.F90:1560)
==92185==    by 0x491E19A: __loadmod_MOD_itercallftnr (LoadMod.F90:822)
==92185==    by 0x4B1FF7B: __itersolve_MOD_itersolver (IterSolve.F90:1005)
==92185==    by 0x4C233AA: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14668)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185==    by 0x4CD8CCE: __mainutils_MOD_solveequations (MainUtils.F90:2944)
==92185== 
==92185== Conditional jump or move depends on uninitialised value(s)
==92185==    at 0x4B30FF2: __iterativemethods_MOD_itermethod_idrs (IterativeMethods.F90:1561)
==92185==    by 0x491E19A: __loadmod_MOD_itercallftnr (LoadMod.F90:822)
==92185==    by 0x4B1FF7B: __itersolve_MOD_itersolver (IterSolve.F90:1005)
==92185==    by 0x4C233AA: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14668)
==92185==    by 0x4C1C450: __solverutils_MOD_solvesystem (SolverUtils.F90:15688)
==92185==    by 0x4FC1C07: __defutils_MOD_defaultsolve (DefUtils.F90:3597)
==92185==    by 0xBD77997: heatsolver_ (HeatSolve.F90:1209)
==92185==    by 0x491FC01: __loadmod_MOD_execsolver (LoadMod.F90:465)
==92185==    by 0x4CC7603: __mainutils_MOD_singlesolver (MainUtils.F90:5281)
==92185==    by 0x4CC4AC2: __mainutils_MOD_solveractivate (MainUtils.F90:5623)
==92185==    by 0x4CD7497: solvecoupled.8 (MainUtils.F90:3249)
==92185==    by 0x4CD8CCE: __mainutils_MOD_solveequations (MainUtils.F90:2944)
==92185== 
ERROR:: IterSolve: Numerical Error: System diverged over maximum tolerance.
STOP 1

If those warnings are genuine, the behavior might depend on "random" values in uninitialized memory, which could explain why the failure is sporadic.

Edit: I still had local changes in SolverUtils.F90. I replaced the valgrind output with the output I get from a clean checkout of c9482b4, so the line numbers should match now.
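All of the Memcheck warnings above originate from a handful of source locations. A small helper like the following can reduce such a log to its unique innermost "at" frames. It is a hypothetical sketch (the name innermost_frames is made up here), standard library only, and it assumes Memcheck's default "==PID==    at 0xADDR: function (file:line)" layout exactly as shown in the log above.

```python
import re

# Matches the innermost frame of a Memcheck stack trace, e.g.
# "==92185==    at 0x4C1FC04: __solverutils_MOD_solvelinearsystem (SolverUtils.F90:14401)"
# Assumes Memcheck's default output layout; "by" frames are deliberately ignored.
AT_FRAME = re.compile(r"^==\d+==\s+at 0x[0-9A-Fa-f]+: (\S+) \(([^)]+)\)")


def innermost_frames(log_text):
    """Return unique (function, "file:line") pairs for each 'at' frame, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = AT_FRAME.match(line)
        if match and match.groups() not in seen:
            seen.append(match.groups())
    return seen


if __name__ == "__main__":
    import sys

    # Usage: python innermost_frames.py < valgrind.log
    for function, location in innermost_frames(sys.stdin.read()):
        print(f"{location}\t{function}")
```

Applied to the log above, this would list SolverUtils.F90:14401 and 14523 plus IterativeMethods.F90:1621, 1559, 1560 and 1561 as the places where uninitialised values are used.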

@raback (Contributor) commented Aug 5, 2024

The problematic BLAS routine is probably "zdotc", which computes the complex (conjugated) dot product; the procedure pointer "dotprodfun" is set to point to it. It is called in the Krylov methods as "dotprodfun(n, work(1:n,rr), 1, work(1:n,r+k-1), 1)". Maybe the native Mac routine assumes a different ordering of the complex values than what we provide. It is difficult to get any closer than this call, so I will accept this PR and we can study this later.

This seems to have occurred before, too:
https://lists.quantum-espresso.org/pipermail/users/2020-December/046538.html

The Mac routines are probably highly optimized, so this is a pity.
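To cross-check a BLAS library's "zdotc" result for the same inputs, it can help to have an independent reference for what the routine is specified to compute: the conjugated dot product, sum of conj(x_i) * y_i over n elements, with strides incx and incy. The sketch below follows the indexing of the reference (Netlib) BLAS, including negative increments. The name zdotc_ref is hypothetical; this is not Elmer's code and not a binding to any library.

```python
def zdotc_ref(n, x, incx, y, incy):
    """Conjugated dot product sum(conj(x_i) * y_i), mimicking reference BLAS ZDOTC indexing.

    x and y are Python sequences of complex (or real) numbers; indices are 0-based here.
    """
    if n <= 0:
        return 0j
    # Reference BLAS starts from the far end of the vector when the increment is negative.
    ix = 0 if incx >= 0 else (n - 1) * (-incx)
    iy = 0 if incy >= 0 else (n - 1) * (-incy)
    total = 0j
    for _ in range(n):
        total += x[ix].conjugate() * y[iy]
        ix += incx
        iy += incy
    return total


if __name__ == "__main__":
    x = [1 + 2j, 3 - 1j]
    y = [2 - 1j, 1j]
    # conj(1+2j)*(2-1j) + conj(3-1j)*(1j) = -5j + (-1+3j) = -1-2j
    print(zdotc_ref(2, x, 1, y, 1))
```

The conjugation is what distinguishes "zdotc" from its unconjugated counterpart "zdotu": with x = y = [1j], zdotc yields 1 while the unconjugated product yields -1, so a mix-up between the two (or in how the complex result is passed back) would show up quickly in such a comparison.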

@raback merged commit f62ad8d into ElmerCSC:devel on Aug 5, 2024
7 of 9 checks passed
@juharu (Contributor) commented Aug 6, 2024 via email

@juharu (Contributor) commented Aug 6, 2024 via email

@juharu (Contributor) commented Aug 6, 2024 via email

@mmuetzel (Contributor, Author) commented Aug 6, 2024

Thank you for fixing that. 🎉

For future reference, the change that @juharu is referring to is probably b77508e.
