Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add runner that builds with Clang to CI. #487

Merged
merged 1 commit into from
Jul 31, 2024
Merged

Conversation

mmuetzel
Copy link
Contributor

Building with LLVM Clang and libomp currently fails (see #486).

Add configurations to the build matrix that build with LLVM Clang without OpenMP. Also add a runner that builds with GCC without OpenMP.

@mmuetzel
Copy link
Contributor Author

It looks like a few tests fail with Clang that passed with GCC:

The following tests FAILED:
	118 - CurvedBoundaryP (Failed)
	136 - DisContBoundaryFullAngle (Failed)
	219 - FixTangentVelo (Failed)
	255 - IncreaseOrderTet (Failed)

@mmuetzel
Copy link
Contributor Author

As the motivation why I think testing with LLVM Clang on Linux could be useful:
LLVM Clang and Apple Clang are not identical. But they are pretty similar.
Many developers (including me) don't have physical access to a macOS device.
Sorting out potential issues with LLVM Clang on Linux often leads most of the way towards successfully building a project with Apple Clang on macOS.

@mmuetzel
Copy link
Contributor Author

The default log file is not very useful to try and understand what is happening in these failing tests. If I understand correctly, stderr and stdout are redirected to test-specific files. That makes it difficult to display the output in automated tests.

Would it be ok to no longer redirect the output and let ctest do "its thing" instead?

@raback
Copy link
Contributor

raback commented Jul 20, 2024

Looks great! It is nice to be able to test several builds at the same time.

Previously we seem to have run the tests with:
ctest -L quick => 'CTEST_OUTPUT_ON_FAILURE=1 ctest -L quick'
To allow output in CI.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jul 21, 2024

If I understand their documentation correctly, --output-on-failure should have the same effect as CTEST_OUTPUT_ON_FAILURE=1:
https://cmake.org/cmake/help/latest/manual/ctest.1.html#id18

--output-on-failure
Output anything outputted by the test program if the test should fail. This option can also be enabled by setting the CTEST_OUTPUT_ON_FAILURE environment variable

To be honest, the output that is currently shown for the failing tests might be enough for someone more familiar with ElmerFEM to understand what is happening. It's just that I didn't understand what it means.
E.g., for CurvedBoundaryP, the test output is:

     Start 118: CurvedBoundaryP
  1/4 Test #118: CurvedBoundaryP ..................***Failed    2.55 sec
  -- BINARY_DIR = /home/runner/work/elmerfem/elmerfem/build
  -- Extra library directories 
  CMake Error at /home/runner/work/elmerfem/elmerfem/cmake/Modules/test_macros.cmake:157 (FILE):
    FILE failed to open for reading (No such file or directory):
  
      /home/runner/work/elmerfem/elmerfem/build/fem/tests/CurvedBoundaryP/TEST.PASSED
  Call Stack (most recent call first):
    /home/runner/work/elmerfem/elmerfem/fem/tests/CurvedBoundaryP/runtest.cmake:3 (RUN_ELMER_TEST)

And the corresponding output in the test log is:

Start testing: Jul 13 21:44 UTC
  ----------------------------------------------------------
  118/870 Testing: CurvedBoundaryP
  118/870 Test: CurvedBoundaryP
  Command: "/usr/local/bin/cmake" "-DCMAKE_MODULE_PATH=/home/runner/work/elmerfem/elmerfem/cmake/Modules" "/usr/local/share/cmake-3.30/Modules" "-DELMERGRID_BIN=/home/runner/work/elmerfem/elmerfem/build/elmergrid/src/ElmerGrid" "-DELMERSOLVER_BIN=/home/runner/work/elmerfem/elmerfem/build/fem/src/ElmerSolver_mpi" "-DFINDNORM_BIN=/home/runner/work/elmerfem/elmerfem/build/fem/tests/findnorm" "-DMESH2D_BIN=/home/runner/work/elmerfem/elmerfem/build/meshgen2d/src/Mesh2D" "-DTEST_SOURCE=/home/runner/work/elmerfem/elmerfem/fem/tests/CurvedBoundaryP" "-DPROJECT_SOURCE_DIR=/home/runner/work/elmerfem/elmerfem/fem/tests" "-DSHLEXT=.so" "-DBINARY_DIR=/home/runner/work/elmerfem/elmerfem/build" "-DCMAKE_Fortran_COMPILER=/usr/bin/f95" "-DMPIEXEC=/usr/bin/mpiexec" "-DMPIEXEC_NUMPROC_FLAG=-n" "-DMPIEXEC_PREFLAGS=--allow-run-as-root" "-DMPIEXEC_POSTFLAGS=" "-DWITH_MPI=ON" "-DMPIEXEC_NTASKS=1" "-P" "/home/runner/work/elmerfem/elmerfem/fem/tests/CurvedBoundaryP/runtest.cmake"
  Directory: /home/runner/work/elmerfem/elmerfem/build/fem/tests/CurvedBoundaryP
  "CurvedBoundaryP" start time: Jul 13 21:44 UTC
  Output:
  ----------------------------------------------------------
  -- BINARY_DIR = /home/runner/work/elmerfem/elmerfem/build
  -- Extra library directories 
  CMake Error at /home/runner/work/elmerfem/elmerfem/cmake/Modules/test_macros.cmake:157 (FILE):
    FILE failed to open for reading (No such file or directory):
  
      /home/runner/work/elmerfem/elmerfem/build/fem/tests/CurvedBoundaryP/TEST.PASSED
  Call Stack (most recent call first):
    /home/runner/work/elmerfem/elmerfem/fem/tests/CurvedBoundaryP/runtest.cmake:3 (RUN_ELMER_TEST)
  
  
  <end of output>
  Test time =   2.55 sec
  ----------------------------------------------------------
  Test Failed.
  "CurvedBoundaryP" end time: Jul 13 21:44 UTC
  "CurvedBoundaryP" time elapsed: 00:00:02
  ----------------------------------------------------------

That doesn't give much to follow up on (at least for me).

I was just thinking/hoping that something would output a bit more than that to somewhere. But admittedly, I don't know if that is actually the case.

@mmuetzel
Copy link
Contributor Author

The link error with OpenMP using LLVM Clang on Ubuntu is fixed by configuring with -DOpenMP_C_FLAGS=-fopenmp=libgomp -DOpenMP_CXX_FLAGS=-fopenmp=libgomp.

@mmuetzel
Copy link
Contributor Author

The same set of tests fail when building with Clang independent on whether OpenMP is used.

@raback
Copy link
Contributor

raback commented Jul 31, 2024

I am looking at these great CI builds (ubuntu-clang & windows-mingw) and wondering whether they might deserve their own .yaml file. For me it seems that the status of a build is either success or failure. So if we would like to show one status flag for each build they need to have different .yaml file? Like for build.yaml we have the status in "https://github.com/ElmerCSC/elmerfem/actions/workflows/build.yaml/badge.svg". Now build.yaml gets bloated if we add all platforms there. What is your take on this? So we could have build-ubuntu-clang.yaml, build-windows-mingw.yaml etc. Just my uninformed thoughts...

@mmuetzel
Copy link
Contributor Author

The workflows for GCC vs. Clang on Ubuntu share almost all of the build instructions. So, separating them into two yaml files might lead to a lot of copy-paste and the potential for forgetting that changes might need to be applied in workflows for both.

I'm not sure if I understood your motivation correctly: Are you hesitating to add workflows for configurations that (currently) fail because that would mean that the badge would indicate that? Splitting the workflows just because they fail only puts a blanket over a potentially underlying issue imho. The configuration still fails (whether the badge is green or red doesn't make a difference).
Does ElmerFEM only "guarantee" to work for one platform with one compiler? Is it "just coincidental" if it works on other platforms or with other compilers? Is a failure on other platforms or with other compilers considered a bug?

Anyway: Since (more or less) recently, GitHub has reusable workflows. Maybe, it's possible to use those to keep duplication of the build rules in the yaml files to a minimum while still splitting (at least some) elements of a build matrix to different workflows...
I haven't looked into how to do that yet though.

Splitting the workflows for different platforms into separate yaml files is probably ok. Build instructions tend to differ for different platforms anyway.

Add configurations to the build matrix that build with LLVM Clang with
and without OpenMP.
Also add a runner that builds with GCC without OpenMP.
@raback
Copy link
Contributor

raback commented Jul 31, 2024

My idea was just to give some more information. If we have clang build with some persisting issues it is still nice to know with a quick inspection that a commit didn't break the gcc compilation. We could have then several badges. The same info can of course be retrieved by other means.

I am just trying to hunt why the clang fails. For some reason it seems that it cannot obtain the meshes.

@raback
Copy link
Contributor

raback commented Jul 31, 2024

I was able to get the std out of ElmerSolver out on failure. See e.g. test 55 here.
https://github.com/ElmerCSC/elmerfem/actions/runs/10178764044/job/28153050603
Unfortunately the mesh generation is not part of the redirected output.

@mmuetzel
Copy link
Contributor Author

I have a VM with Ubuntu 24.04 where I could try and reproduce the error locally.
Do you have any hints for what I should look out to get more information for debugging? Are there any commands that I could run manually (to check whether there is more output than what ctest captures)?

@raback
Copy link
Contributor

raback commented Jul 31, 2024

First failing test is "CurvedBoundaryP". Try with:

cd $ELMERSRC/fem/tests/CurverBoundaryP
ElmerGrid 14 2 circle_in_square.msh -autoclean
ElmerSolver

For some reason it seems that the mesh conversion from Gmsh to Elmer type fails and ElmerSolver reports that it cannot find the file. I cant see the output of ElmerGrid from the CI currently.

@raback
Copy link
Contributor

raback commented Jul 31, 2024

The test is run by the runtest.cmake in the test directory:

include(test_macros)
execute_process(COMMAND ${ELMERGRID_BIN} 14 2 circle_in_square.msh -autoclean)
RUN_ELMER_TEST()

Currently the output is redirected to stdout and stderr only in the RUN_ELMER_TEST() and these are visible in CI when setting CTEST_OUTPUT_ON_FAILURE=1. Unfortunately the problems seem to occur before this.

@mmuetzel
Copy link
Contributor Author

It looks like that command segfaults:

.../elmerfem/build-clang/elmergrid/src/ElmerGrid 14 2 circle_in_square.msh -autoclean

Starting program Elmergrid, compiled on Jul 25 2024
Elmergrid reading in-line arguments
Lower dimensional boundaries will be removed
Materials and boundaries will be renumbered
Nodes that do not appear in any element will be removed
Output will be saved to file circle_in_square.

Elmergrid loading data:
-----------------------
Format chosen using the first line: $MeshFormat
Gmsh version is 4.1
Loading mesh in Gmsh format 4.1 from file circle_in_square.msh
Reading 9 entities in 0D
Reading 8 entities in 1D
Reading 2 entities in 2D
Allocating for 42 knots and 97 elements.
Leading element dimension is 2
Allocating lookup table for tags of size 9
Defined 9 0DIM entities with geometric tag range [1 9]
Defined 8 1DIM entities with geometric tag range [1 8]
Defined 2 2DIM entities with geometric tag range [1 2]
Physical tag offset for 0D is 0
Geometric tag offset for 0D is 8
Reading 42 nodes in 19 blocks.
Reading 97 elements in 19 blocks.
Moving bulk elements to boundary elements
Segmentation fault (core dumped)

A backtrace from that crash with gdb (in case that helps):

Program received signal SIGSEGV, Segmentation fault.
ElementsToBoundaryConditions (data=data@entry=0x7ffffff6bd30, bound=bound@entry=0x55613740, retainorphans=retainorphans@entry=0, info=info@entry=1)
    at .../elmerfem/elmergrid/src/egmesh.c:6983
6983	    bound[j].created = FALSE;
(gdb) bt
#0  ElementsToBoundaryConditions (data=data@entry=0x7ffffff6bd30, bound=bound@entry=0x55613740, retainorphans=retainorphans@entry=0, info=info@entry=1)
    at .../elmerfem/elmergrid/src/egmesh.c:6983
#1  0x000055555558b9d7 in LoadGmshInput (data=0x7ffffff6bd30, bound=0x55613740, prefix=<optimized out>, keeporphans=0, info=info@entry=1)
    at .../elmerfem/elmergrid/src/egconvert.c:5234
#2  0x00005555555bd4b4 in main (argc=<optimized out>, argv=<optimized out>) at .../elmerfem/elmergrid/src/fempre.c:303
(gdb) p j
$1 = 0
(gdb) p bound
$2 = (struct BoundaryType *) 0x55613740
(gdb) p bound[0]
Cannot access memory at address 0x55613740

Maybe, bound is used before it is initialized?

@mmuetzel
Copy link
Contributor Author

Strange. It looks like the upper half of the address was lost somewhere (maybe overwritten on the stack???):

(gdb) up 2
#2  0x00005555555bd4b4 in main (argc=<optimized out>, argv=<optimized out>) at .../elmerfem/elmergrid/src/fempre.c:303
303	    if (LoadGmshInput(&(data[nofile]),boundaries[nofile],eg.filesin[nofile],
(gdb) p nofile
$3 = 0
(gdb) p boundaries
$4 = {0x555555613740, 0x0 <repeats 11 times>}
(gdb) p boundaries[nofile]
$5 = (struct BoundaryType *) 0x555555613740
(gdb) p boundaries[nofile][0]
$6 = {created = 0, nosides = 0, maxsidenodes = 0, coordsystem = 0, echain = 0, ediscont = 0, chainsize = 0, parent = 0x0, parent2 = 0x0, material = 0x0, side = 0x0, 
  side2 = 0x0, chain = 0x0, types = 0x0, discont = 0x0, normal = 0x0, elementtypes = 0x0, topology = 0x0}
(gdb) down 1
#1  0x000055555558b9d7 in LoadGmshInput (data=0x7ffffff6bd30, bound=0x55613740, prefix=<optimized out>, keeporphans=0, info=info@entry=1)
    at .../elmerfem/elmergrid/src/egconvert.c:5234
5234		  printf("unknown node %d %d in element %d\n",k,revindx[k],i);
(gdb)  p bound
$7 = (struct BoundaryType *) 0x55613740
(gdb) p bound[0]
Cannot access memory at address 0x55613740

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jul 31, 2024

I added a breakpoint early in that function and then ran n, p bound, and p meshdim repeatedly until I noticed that the address of bound changed:

4972	      tagoffset[meshdim] = maxphystag[meshdim];
$1075 = (struct BoundaryType *) 0x555555613740
$1076 = 0
(gdb) 
4973	      tagoffset[meshdim-1] = phystagoffset[0];
$1077 = (struct BoundaryType *) 0x555555613740
$1078 = 0
(gdb) 
4979	      GETLONGLINE;
$1079 = (struct BoundaryType *) 0x55613740
$1080 = 0
(gdb) 
4980	      if(!strstr(longline,"$EndEntities")) {
$1081 = (struct BoundaryType *) 0x55613740
$1082 = 0
(gdb) 

meshdim is still 0 at that point. So, likely tagoffset[meshdim-1] accesses out of bound memory...

@mmuetzel
Copy link
Contributor Author

The change in #506 avoids that segfault for me.

@raback raback merged commit 146b859 into ElmerCSC:devel Jul 31, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants