
clear spec cache to avoid sporadic mismatch #431

Merged (3 commits) on Jul 3, 2024

Conversation

@smiet (Contributor) commented Jun 27, 2024

Edited for improved explanation

This should fix the sporadic CI failure reported in #357.

The issue: f90wrap keeps a Python-side cache of the wrapped FORTRAN arrays, and the lookup logic is roughly as follows:

handle = get_specific_array_handle(stuff)  # pointer to the array, cast to an int
if handle in self._arrays:  # dict mapping handles to cached arrays
    return self._arrays[handle]
else:
    array = get_array_in_complicated_f90wrapway(stuff)
    self._arrays[handle] = array
    return array

Now for the handle collision: if SPEC has run before and an array has already been accessed (and therefore cached), and by happenstance an array in the second run is placed at the same memory location as before, the handles collide and Python assumes the new array is identical to the one it accessed before (in shape and size too).

When Python then tries to access this array, it can actually be a different array that just happens to be placed at the same location, or it can be the right array whose shape on the Python side was never updated; either way we get an out-of-bounds error when reading outside the Python-defined bounds.

The handle is basically the pointer to the array, cast to an int. Two handles are unlikely to be exactly the same EXCEPT when the code runs over and over again and a conveniently shaped hole has just been opened in memory by a deallocation.

This also explains why the fault is mostly seen in CI: the runners have much less memory, so new allocations are more likely to land at the same location. The placement is decided by the memory allocator, not by us, which is why the failure is not reproducible.
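
The failure mode can be sketched in a few lines of Python (a hedged illustration only; _cache and cached_shape are made-up names, not f90wrap's API). The cache is keyed on the raw data pointer cast to an int, so a freed-and-reused address silently hands back stale metadata:

import numpy as np

_cache = {}   # handle (raw data pointer as an int) -> shape seen on first access

def cached_shape(arr):
    handle = arr.__array_interface__['data'][0]   # the "handle": data pointer cast to int
    if handle not in _cache:
        _cache[handle] = arr.shape
    return _cache[handle]

a = np.zeros(100)
print(cached_shape(a))    # (100,)
del a                     # the buffer is freed, but the cache entry survives

b = np.zeros(50)          # may, by happenstance, reuse the freed address
print(cached_shape(b))    # on a collision this returns the stale (100,) shape

Whether the second allocation actually lands on the freed address depends on the allocator, which is exactly why the failure is sporadic.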


codecov bot commented Jun 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.00%. Comparing base (9dd34c5) to head (7bc1e4c).
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #431   +/-   ##
=======================================
  Coverage   91.99%   92.00%           
=======================================
  Files          75       75           
  Lines       13499    13504    +5     
=======================================
+ Hits        12419    12424    +5     
  Misses       1080     1080           
Flag: unittests — Coverage: 92.00% <100.00%> (+<0.01%) ⬆️



@andrewgiuliani (Contributor) commented Jun 27, 2024

I don't see how this is a race condition. We don't have multiple cores modifying the same area of memory here, do we?

Also, why isn't the cache self._arrays being cleared between Python runs? I would have thought it would be empty at the start of a new run.

If the Fortran arrays in the cache aren't being freed, won't this lead to a memory leak?

@smiet (Contributor, Author) commented Jun 28, 2024

@andrewgiuliani you are right, this is not a race condition, excuse my ignorance.

F90wrap works in a funny way (I don't completely understand it), but the cache is only cleared when the Python kernel quits. It sets up links to shared FORTRAN memory that persist until the kernel exits.

Normally you only run a single simulation and the kernel quits when the optimisation has finished; it is only in testing that many SPEC instances are created and destroyed.

For example, in

from simsopt.mhd import Spec
myspec1 = Spec('firstfile.sp')
myspec2 = Spec('differentfile.sp')

print(myspec1.allglobal.ext)   # prints: differentfile

the Fortran arrays in myspec1 and myspec2 are the same.

This is a limitation of f90wrap and the above code should not be used. The best we can do is make sure that loading a new state runs correctly.

Yes, this is a memory leak when many Spec objects are initialized in the same kernel session, but I don't see how to avoid it.

@andrewgiuliani (Contributor):

When you clear the cache, is it possible to run a call on the FORTRAN side to deallocate the allocated arrays referenced in the cache?

If not, I would suggest that this anomalous behaviour should be mentioned in a docstring somewhere.

@andrewgiuliani (Contributor) left a comment:

see above for comments

@smiet (Contributor, Author) commented Jun 28, 2024

@andrewgiuliani This is not possible with the current implementation of f90wrap, and I am not sure it would be desirable. f90wrap is meant to wrap existing programs, giving you access to their memory as they run, with the possibility of mutating existing arrays. Forcing deallocation on the FORTRAN side from the Python side would alter the program being wrapped.

Python keeps a dictionary of the arrays that have been referenced, but if a new array has the same handle, the dictionary is not updated, and indexing it outside the Python-defined bounds (not the FORTRAN bounds) results in the error we see.

It is certainly a bug in the implementation that a handle collision (that is the right term!) is likely to occur when arrays are deallocated and immediately re-allocated on memory-constrained systems, but this is at the core of some design choices made for f90wrap and not something I feel comfortable fixing.

The current PR fixes the anomalous behavior that was occurring (the cache kept accumulating), and the fix, though a bit draconian, prevents it; since a collision could occur for any accessed array, clearing everything is the right thing to do.

I will probably open an issue on upstream f90wrap, but this will take a while, as the whole numpy 2.0 migration has left me very little time for science, and I have some QUASR configurations to calculate tangles in! ;)

edit: actually, this just overwrites the existing dictionary with an empty dictionary. I assume the Python garbage collector will delete the old one, as there are no more references to it. Could someone with more Python knowledge comment on whether this is the right thing to do (@mbkumar)?
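
For reference, a minimal sketch of what the fix amounts to, assuming the cache lives in a plain dict called self._arrays as in the pseudocode in the PR description (SpecCacheSketch and clear_cache are illustrative names, not the actual simsopt code):

class SpecCacheSketch:
    """Illustrative only; not the verbatim simsopt/f90wrap code."""

    def __init__(self):
        self._arrays = {}   # handle (int address) -> cached array wrapper

    def clear_cache(self):
        # Point self._arrays at a fresh dict. The old dict has no remaining
        # references, so Python's garbage collector reclaims it together with
        # the stale wrappers it held; the FORTRAN-side memory is untouched.
        self._arrays = {}

Calling a method like this before re-running SPEC forces every array to be re-fetched with its current handle and shape, so a handle collision can no longer hand back a stale entry.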

@andrewgiuliani (Contributor) commented Jun 28, 2024

I would have thought that a deallocate-all function could be written on the Fortran side and called from Python.

In any case, it's great that the bug has been identified!

@mbkumar (Collaborator) commented Jun 29, 2024

@smiet, can we have a Zoom call to understand what is going on? I am not getting the full picture from your description. Can't we call the Python garbage collector on the dictionary you are resetting?

@mbkumar (Collaborator) commented Jun 29, 2024

At the least, can you talk to the f90wrap developer, who might have encountered this issue before? There should be some mechanism in f90wrap to clear the cache.

@smiet (Contributor, Author) commented Jun 30, 2024

@mbkumar yes, let's Zoom tomorrow (Monday); I was away from the internet this weekend. Can you set it up? I'm having a little trouble with email on my phone right now, but I'm available all of your morning.

@andrewgiuliani (Contributor):

I'd like to be present during the Zoom call too, to better understand the problem.

@smiet (Contributor, Author) commented Jul 3, 2024

Notes from the discussion resulted in issue #434.

All tests are passing consistently. This PR doesn't solve the root cause, but it applies a fix that stops a known bug. I suggest we move further discussion of this topic to #434 and merge this fix so we can stop seeing sporadic failures in our CI. @mbkumar @landreman @andrewgiuliani

@landreman (Contributor) left a comment:

Thanks for figuring out a fix for this tricky bug!

@mbkumar merged commit 2c0f057 into master on Jul 3, 2024
47 checks passed
@smiet deleted the cbs/fix_spec_failure branch on July 8, 2024