
clear spec cache to avoid sporadic mismatch #431

Merged (3 commits) on Jul 3, 2024

Conversation

@smiet (Contributor) commented Jun 27, 2024

Edited for improved explanation

This should fix the sporadic CI failure reported in #357.

The issue: f90wrap keeps a Python-side cache of the wrapped FORTRAN arrays, and the lookup logic is roughly as follows:

handle = get_specific_array_handle(stuff)  # pointer to the array, cast to an int
if handle in self._arrays:  # dict mapping handles to cached arrays
    return self._arrays[handle]
else:
    array = get_array_in_complicated_f90wrapway(stuff)
    self._arrays[handle] = array
    return array

Now for the handle collision: if SPEC has run before and an array has already been accessed (and therefore cached), and by happenstance an array in the second run is placed at the same memory location as before, the handles collide and Python assumes the new array is identical to the one it accessed before (in shape and size too).

When Python then tries to access this array, it can actually be a different array that just happens to be placed at the same location, or it can be the right array whose shape on the Python side was never updated; either way we get an out-of-bounds error when reading outside the Python-defined bounds.

The handle is basically the pointer to the array, cast to an int. Two handles are unlikely to be exactly the same EXCEPT when the code runs over and over again and a conveniently shaped hole has just been opened in memory by a deallocation.

This also explains why the fault is mostly seen in CI: the runners have much less memory, so new allocations are more likely to land at the same location. The placement is decided by the memory allocator, not by us, which is why the failure is not reproducible.
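
The failure mode can be sketched in a few lines of Python (a hedged illustration only; _cache and cached_shape are made-up names, not f90wrap's API). The cache is keyed on the raw data pointer cast to an int, so a freed-and-reused address silently hands back stale metadata:

import numpy as np

_cache = {}   # handle (raw data pointer as an int) -> shape seen on first access

def cached_shape(arr):
    handle = arr.__array_interface__['data'][0]   # the "handle": data pointer cast to int
    if handle not in _cache:
        _cache[handle] = arr.shape
    return _cache[handle]

a = np.zeros(100)
print(cached_shape(a))    # (100,)
del a                     # the buffer is freed, but the cache entry survives

b = np.zeros(50)          # may, by happenstance, reuse the freed address
print(cached_shape(b))    # on a collision this returns the stale (100,) shape

Whether the second allocation actually lands on the freed address depends on the allocator, which is exactly why the failure is sporadic.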


codecov bot commented Jun 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.00%. Comparing base (9dd34c5) to head (7bc1e4c).
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #431   +/-   ##
=======================================
  Coverage   91.99%   92.00%           
=======================================
  Files          75       75           
  Lines       13499    13504    +5     
=======================================
+ Hits        12419    12424    +5     
  Misses       1080     1080           
Flag: unittests — Coverage: 92.00% <100.00%> (+<0.01%) ⬆️



@andrewgiuliani (Contributor) commented Jun 27, 2024

I don't see how this is a race condition. We don't have multiple cores modifying the same area of memory here, do we?

Also, why isn't the cache self._arrays being cleared between Python runs? I would have thought it would be empty at the start of a new run.

If the Fortran arrays in the cache aren't being freed, won't this lead to a memory leak?

@smiet (Contributor, Author) commented Jun 28, 2024

@andrewgiuliani you are right, this is not a race condition, excuse my ignorance.

F90wrap works in a funny way (I don't completely understand it), but the cache is only cleared when the Python kernel quits. It sets up links to shared FORTRAN memory that persist until the kernel exits.

Normally you only run a single simulation and the kernel quits when the optimisation has finished; it is only in testing that many SPEC instances are created and destroyed.

For example, in

from simsopt.mhd import Spec
myspec1 = Spec('firstfile.sp')
myspec2 = Spec('differentfile.sp')

print(myspec1.allglobal.ext)   # prints: differentfile

the Fortran arrays in myspec1 and myspec2 are the same.

This is a limitation of f90wrap and the above code should not be used. The best we can do is make sure that loading a new state runs correctly.

Yes, this is a memory leak when many Spec objects are initialized in the same kernel session, but I don't see how to avoid it.

@andrewgiuliani (Contributor):

When you clear the cache, is it possible to run a call on the FORTRAN side to deallocate the allocated arrays referenced in the cache?

If not, I would suggest that this anomalous behaviour should be mentioned in a docstring somewhere.

@andrewgiuliani (Contributor) left a comment:

see above for comments

@smiet (Contributor, Author) commented Jun 28, 2024

@andrewgiuliani This is not possible with the current implementation of f90wrap, and I am not sure it would be desirable. f90wrap is meant to wrap existing programs, giving you access to their memory as they run, with the possibility of mutating existing arrays. Forcing deallocation on the FORTRAN side from the Python side would alter the program being wrapped.

Python keeps a dictionary of the arrays that have been referenced, but if a new array has the same handle, the dictionary is not updated, and indexing it outside the Python-defined bounds (not the FORTRAN bounds) results in the error we see.

It is certainly a bug in the implementation that a handle collision (that is the right term!) is likely to occur when arrays are deallocated and immediately re-allocated on memory-constrained systems, but this is at the core of some design choices made for f90wrap and not something I feel comfortable fixing.

The current PR fixes the anomalous behavior that was occurring (the cache kept accumulating), and the fix, though a bit draconian, prevents it; since a collision could occur for any accessed array, clearing everything is the right thing to do.

I will probably open an issue on upstream f90wrap, but this will take a while, as the whole numpy 2.0 migration has left me very little time for science, and I have some QUASR configurations to calculate tangles in! ;)

edit: actually, this just overwrites the existing dictionary with an empty dictionary. I assume the Python garbage collector will delete the old one, as there are no more references to it. Could someone with more Python knowledge comment on whether this is the right thing to do (@mbkumar)?
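
For reference, a minimal sketch of what the fix amounts to, assuming the cache lives in a plain dict called self._arrays as in the pseudocode in the PR description (SpecCacheSketch and clear_cache are illustrative names, not the actual simsopt code):

class SpecCacheSketch:
    """Illustrative only; not the verbatim simsopt/f90wrap code."""

    def __init__(self):
        self._arrays = {}   # handle (int address) -> cached array wrapper

    def clear_cache(self):
        # Point self._arrays at a fresh dict. The old dict has no remaining
        # references, so Python's garbage collector reclaims it together with
        # the stale wrappers it held; the FORTRAN-side memory is untouched.
        self._arrays = {}

Calling a method like this before re-running SPEC forces every array to be re-fetched with its current handle and shape, so a handle collision can no longer hand back a stale entry.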

@andrewgiuliani (Contributor) commented Jun 28, 2024

I would have thought that a deallocate-all function could be written on the Fortran side and called from Python.

In any case, it's great that the bug has been identified!

@mbkumar (Collaborator) commented Jun 29, 2024

@smiet, can we have a Zoom call to understand what is going on? I am not getting the full picture from your description. Can't we call the Python garbage collector on the dictionary you are resetting?

@mbkumar (Collaborator) commented Jun 29, 2024

At the least, can you talk to the f90wrap developer, who might have encountered this issue before? There should be some mechanism in f90wrap to clear the cache.

@smiet (Contributor, Author) commented Jun 30, 2024

@mbkumar yes, let's Zoom tomorrow (Monday); I was away from the internet this weekend. Can you set it up? I'm having a little trouble with email on my phone right now, but I'm available all of your morning.

@andrewgiuliani (Contributor):

I'd like to be present during the Zoom call too, to better understand the problem.

@smiet (Contributor, Author) commented Jul 3, 2024

Notes from the discussion resulted in issue #434.

All tests are passing consistently. This PR doesn't solve the root cause, but it applies a fix that stops a known bug. I suggest we move further discussion of this topic to #434 and merge this fix so we can stop seeing sporadic failures in our CI. @mbkumar @landreman @andrewgiuliani

@landreman (Contributor) left a comment:

Thanks for figuring out a fix for this tricky bug!

@mbkumar merged commit 2c0f057 into master on Jul 3, 2024
47 checks passed
@smiet deleted the cbs/fix_spec_failure branch on July 8, 2024