SPEC test_set_profile_* sporadically failing #357
Comments
Hi Matt, I am looking into it, though it might take me a while to get to the bottom of it. The error is thrown when translating the initial guess for the interfaces into the Spec object. SPEC will write the interface guesses into a long block of text at the bottom of the When I look at the file, or read it using spec.allglobal.read_inputlist_from_file(), I see the array has the correct dimensions. I have seen similar errors when running an IPython kernel in which I load multiple Spec objects at the same time with different resolutions, but I am having trouble reproducing that now. I will investigate further and discuss; the randomly failing tests are very annoying.
Hi all, I've been able to SSH into the GitHub runner and reproduce the error. You can do the same by adding
to simsopt/.github/workflows/tests.yml Line 185 in d0790e2
When SSH'ed into the runner, I call the unit tests individually and it all seems to work
However, I can reproduce the bug by calling
and stepping through the problematic unit tests.
I think I have found the solution! The condition to read the initial guess was based on: When that condition is met, the sporadic failure occurs when running over
The `self.allglobal.nmodes` references an f90wrapped Fortran array. Now the issue: Spec only updates the Thus the failure only occurs when the Fortran memory is not cleared and the SPEC equilibrium being read has more interfaces than the one just preceding it. I believe that coverage keeps the Python kernel active, or keeps the objects in the tests within scope for longer, and therefore prevents the clearing and re-setting of the possible solution
It did not solve the issue, though: in PR #418 it is now a different test that is failing, again by accessing out-of-bounds memory when trying to open the 8-volume case. There is something very strange going on that the
I think I solved it! The deep dive into f90wrap to solve the NumPy 2.0 migration made me give it another go (plus new tests in my PR #418 made the bug not-so-sporadic anymore).

The issue: f90wrap uses getters and setters to access the f90wrapped Fortran arrays, and the logic is as follows:

```python
handle = get_specific_array_handle(stuff)  # pointer to the array
if handle in self._arrays:  # dict with arrays and handles
    return self._arrays[handle]
else:
    array = get_array_in_complicated_f90wrapway(stuff)
    self._arrays[handle] = array
    return array
```

Now for the race condition: if SPEC has run before, and the array has been accessed before (and so put in the cache), and by happenstance the handle remains the same, Python will return the cached array and not access the Fortran memory. The handle is basically the pointer to the array, cast as an int. It is unlikely to be exactly the same, EXCEPT when a conveniently shaped hole has just been emptied in memory by deallocation. This also explains why the fault is mostly seen in CI: the runners have much less memory and are more likely to write to the same location. But this is handled by the CPU and is not reproducible.

Since the error is a cache mismatch on the Python side, it should be solved by clearing the cache dictionary in the Spec
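The stale-cache mechanism described in this comment can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `FakeFortranHeap`, `CachedWrapper`, `reallocate`, and `clear_cache` are hypothetical names invented for the illustration and are not part of f90wrap or simsopt. The only thing borrowed from the comment is the idea of a `_arrays` dictionary keyed by an integer handle.

```python
class FakeFortranHeap:
    """Toy allocator (hypothetical) that reuses the most recently freed address."""

    def __init__(self):
        self._blocks = {}           # address -> list standing in for a Fortran array
        self._free = []             # addresses released by deallocate()
        self._next_address = 1000   # arbitrary starting "pointer" value

    def allocate(self, size):
        # Reuse a freed address if one exists, mimicking a "conveniently shaped hole".
        if self._free:
            address = self._free.pop()
        else:
            address = self._next_address
            self._next_address += 1
        self._blocks[address] = [0.0] * size
        return address

    def deallocate(self, address):
        del self._blocks[address]
        self._free.append(address)

    def read(self, address):
        return self._blocks[address]


class CachedWrapper:
    """Hypothetical wrapper that caches arrays by handle, as in the pseudocode."""

    def __init__(self, heap):
        self._heap = heap
        self._arrays = {}     # handle -> cached array
        self._handle = None

    def reallocate(self, size):
        # Free the old array (if any) and allocate a new one of a different size.
        if self._handle is not None:
            self._heap.deallocate(self._handle)
        self._handle = self._heap.allocate(size)

    @property
    def array(self):
        handle = self._handle
        if handle in self._arrays:      # cache hit: the heap is never consulted
            return self._arrays[handle]
        array = self._heap.read(handle)
        self._arrays[handle] = array
        return array

    def clear_cache(self):
        self._arrays.clear()


heap = FakeFortranHeap()
wrapper = CachedWrapper(heap)

wrapper.reallocate(2)              # first "equilibrium": 2 entries
first_len = len(wrapper.array)     # array is now cached under its handle

wrapper.reallocate(8)              # second "equilibrium": same address is reused
stale_len = len(wrapper.array)     # cache hit returns the old 2-entry array

wrapper.clear_cache()              # the proposed kind of fix: drop cached entries
fresh_len = len(wrapper.array)     # now reads the real 8-entry array

print(first_len, stale_len, fresh_len)
```

In this toy model the stale read happens every time because the allocator always reuses the freed address; in the real case the reuse depends on where the allocator happens to place the new array, which is consistent with the failure being sporadic.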
I'd suggest proposing the fix in a different PR than #418 to ensure it gets merged more quickly and so that we can see precisely what was changed.
Done in #431
The SPEC tests `test_set_profile_cumulative` and `test_set_profile_non_cumulative` are occasionally failing in the CI. Some examples:
https://github.com/hiddenSymmetries/simsopt/actions/runs/6364144339/job/17280224739
https://github.com/hiddenSymmetries/simsopt/actions/runs/6263108068/job/17006759326
Most of the time these tests both pass - usually it is just one of the many jobs in the "extensive CI" that fails. When they fail, the 2 tests seem to fail together, and the error message is
I'm not sure what would cause this error in a non-deterministic way.
These tests both begin with the following code:
So perhaps there is non-deterministic behavior coming from the filesystem and creating the scratch directory... I don't think a scratch directory is actually needed since there are no files written, right?
Any ideas how to fix this error?