SPEC test_set_profile_* sporadically failing #357

landreman · 2023-10-01T10:13:25Z

The SPEC tests test_set_profile_cumulative and test_set_profile_non_cumulative are occasionally failing in the CI. Some examples:
https://github.com/hiddenSymmetries/simsopt/actions/runs/6364144339/job/17280224739
https://github.com/hiddenSymmetries/simsopt/actions/runs/6263108068/job/17006759326
Most of the time these tests both pass - usually it is just one of the many jobs in the "extensive CI" that fails. When they fail, the 2 tests seem to fail together, and the error message is

======================================================================
ERROR: test_set_profile_cumulative (mhd.test_spec.SpecTests)
Set a SPEC profile of a cumulative quantity (volume current in this example)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/simsopt/simsopt/tests/mhd/test_spec.py", line 178, in test_set_profile_cumulative
    s = Spec(filename)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/simsopt/mhd/spec.py", line 181, in __init__
    self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])
IndexError: index 2 is out of bounds for axis 1 with size 2

======================================================================
ERROR: test_set_profile_non_cumulative (mhd.test_spec.SpecTests)
Set a SPEC profile of a non-cumulative quantity (surface current in this example)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/simsopt/simsopt/tests/mhd/test_spec.py", line 146, in test_set_profile_non_cumulative
    s = Spec(filename)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/simsopt/mhd/spec.py", line 181, in __init__
    self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])
IndexError: index 2 is out of bounds for axis 1 with size 2

I'm not sure what would cause this error in a non-deterministic way.

These tests both begin with the following code:

        filename = os.path.join(TEST_DIR, 'RotatingEllipse_Nvol8.sp')

        with ScratchDir("."):
            s = Spec(filename)

So perhaps there is non-deterministic behavior coming from the filesystem and creating the scratch directory... I don't think a scratch directory is actually needed since there are no files written, right?

Any ideas how to fix this error?

The text was updated successfully, but these errors were encountered:

smiet · 2023-10-02T13:08:36Z

Hi Matt,

I am looking into it, though it might take me a while to get to the bottom.

The error is thrown when translating the initial guess for the interfaces into the Spec object.
These coefficients should be contained in the spec.allglobal.allrzrz array.
The test fails when the fourier coefficients for an interface lvol>2 are attempted to be read, which are not present in the allrzrz-array.

SPEC will write the interface guesses into a long lap of text at the bottom of the .sp file upon sucessful completion of a run.

When I look at the file, or read it using spec.allglobal.read_inputlist_from_file(), I see the array has the correct dimensions.
Maybe during the setup of the scratch filesystem the file is not done copying?
Or a previous run has written the interface guesses to file, but didn't complete?

I have seen similar errors when running an ipython kernel in which I load multiple Spec objects at the same time with different resolutions, but I am having trouble reproducing that now.

I will investigate further and discuss, the randomly failing tests are very annoying.

andrewgiuliani · 2024-04-23T21:44:56Z

Hi all, I've been able to SSH into the github runner and reproduce the error. You can do the same by adding

    - name: Setup tmate session
      uses: mxschmitt/action-tmate@v3

to test.yml just before the unit tests are run, just like I did here:

simsopt/.github/workflows/tests.yml

Line 185 in d0790e2

- name: Setup tmate session

When SSH'ed into the runner, I call the unit tests individually and it all seems to work

python -m unittest mhd.test_spec.SpecTests.test_set_profile_cumulative

python -m unittest mhd.test_spec.SpecTests.test_set_profile_non_cumulative

However, I can reproduce the bug by calling

coverage run --source=simsopt -m unittest discover -t tests -v -s tests/mhd

and stepping through the problematic unit tests test_set_profile_cumulative, test_set_profile_non_cumulative with an interactive debugger.

smiet · 2024-06-14T12:40:56Z

I think I have found the solution!

The condition to read the initial guess was based on:
` if self.allglobal.nmodes > 0 and self.nvol > 1:'

When that condition is met, the sporadic fail occurs when running over lvol and reading a non-existent entry in allrzrz

for lvol in range(0, self.nvol-1):
                    self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])

the `self.allglobal.nmodes' references a f90wrapped fortran-array.

now the issue
the fortran memory persists if the python kernel isn't quit. This means that the allglobal.mnmodes can be present, and the test will try to read the stale f90wrapped memory set up in a previous test. If the number of volumes is smaller than that in memory, no problem, but if it is larger, it tries to read out of memory and fails as above.

Spec only updates the allrzrz and nmmodes array when Linitialize is set to zero (or a negative number)

Thus the failure only occurs when the fortran memory is not cleared, the spec equilibrium that is attempted to be read has more interfaces than the one just preceding it, and Linitialize!=0, so it does not update the arrays.

I believe that coverage keeps the python kernel active, or keeps the objects int he tests within scope for longer, and therefore prevents the clearing and re-setting of the Spec.spec FORTRAN memory, or the number of tests and their order are different as they are spread differently over processors, resulting in the sporadic failures.

possible solution
changing the condition upon which memory is read to self.inputlist.linitialize == 0 and self.nvol >1 should solve it, as I am attempting in PR #418

smiet · 2024-06-14T15:35:23Z

It did not solve the issue, though in PR #418 it is now a different test that is failing, by again accessing out-of-bound memory when trying to open the 8-volume-case RotatingEllipse_Nvol8.sp.

There is something very strange going on that the allrzrz array is not updated correctly in the f90-wrapped memory when the tests are run under coverage, but not when directly run.
@andrewgiuliani Let's try to plan another interactive debug session, we made a lot of progress last time, and I am feeling we are getting even closer

smiet · 2024-06-26T15:02:29Z

I think I solved it! The deep-dive into f90wrap to solve the numpy2.0 migration made me give it another go (plus new tests in my PR #418 made the bug not-so-sporadic anymore).

The issue: f90wrap uses getter and setters to access the F90wrapped FORTRAN arrays, and the logic is as follows:

handle = get_specific_array_handle(stuff)  #pointer to the array
if handle in self._arrays #dict with arrays and handles:
  return self._arrays[handle]
else:
  array = get_array_in_complicated_f90wrapway(stuff)
  self._arrays[handle]=array
  return array

Now for the race condition: If spec has run before, and the array has been accessed before (put in cache), and by happenstance the handle remains the same, python will return the cached array, and not access the FORTRAN memory.

The handle is basically the pointer to the array, cast as an int. It is unlikely to be exactly the same EXCEPT when a conveniently shaped hole has just been emptied in memory by deallocation.

This also explains why the fault is mostly seen in CI, as they have much smaller memories, and are more likely to write to the same location. But this is handled by the CPU and not reproducible.

Since the error is a cache mismatch on the python side, it should be solved by clearing the cache dictionary in the Spec __init__, and this change is made in PR #418.

andrewgiuliani · 2024-06-26T18:30:19Z

I'd suggest proposing the fix in a different PR than #418 to ensure it gets merged more quickly and so that we can see precisely what was changed.

smiet · 2024-06-27T08:28:51Z

Done in #431

landreman assigned smiet, jloizu and abaillod Oct 1, 2023

landreman mentioned this issue Apr 5, 2024

Mgrid improvements #396

Merged

andrewgiuliani mentioned this issue Apr 19, 2024

fix for failing workflows on github actions #406

Closed

smiet mentioned this issue Jun 27, 2024

clear spec cache to avoid sporadic mismatch #431

Merged

smiet mentioned this issue Jul 3, 2024

Undocumented behavior in f90wrapped SPEC and VMEC (singleton methods) #434

Open

smiet closed this as completed Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPEC test_set_profile_* sporadically failing #357

SPEC test_set_profile_* sporadically failing #357

landreman commented Oct 1, 2023

smiet commented Oct 2, 2023

andrewgiuliani commented Apr 23, 2024 •

edited

Loading

smiet commented Jun 14, 2024 •

edited

Loading

smiet commented Jun 14, 2024

smiet commented Jun 26, 2024

andrewgiuliani commented Jun 26, 2024

smiet commented Jun 27, 2024

SPEC test_set_profile_* sporadically failing #357

SPEC test_set_profile_* sporadically failing #357

Comments

landreman commented Oct 1, 2023

smiet commented Oct 2, 2023

andrewgiuliani commented Apr 23, 2024 • edited Loading

smiet commented Jun 14, 2024 • edited Loading

smiet commented Jun 14, 2024

smiet commented Jun 26, 2024

andrewgiuliani commented Jun 26, 2024

smiet commented Jun 27, 2024

andrewgiuliani commented Apr 23, 2024 •

edited

Loading

smiet commented Jun 14, 2024 •

edited

Loading