Uncontrolled process memory size during test #59
Comments
Unlike for mirgecom, jemalloc doesn't seem to work wonders for pytential. This is memory use according to memory-profiler for a sequential run of
I'm glad it's enough to unbreak the CI (#56, #58), but the problem doesn't appear totally solved. Using the machinery I made for mirgecom, I'll try to do a running total of CL buffer memory allocated and see if there's any growth there. |
jemalloc seems to have some knobs to make it release memory more aggressively. There's a dirty_decay_ms and a muzzy_decay_ms. Should be easy to see what it does by setting
(the default is 10000 (i.e. 10s) and smaller means it releases memory more) Found this pretty useful too, but no idea how up to date it is. |
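For anyone wanting to try the decay knobs, here is a minimal sketch of building the environment for a run. `MALLOC_CONF`, `dirty_decay_ms` and `muzzy_decay_ms` are jemalloc's own names; the helper `jemalloc_env` and the library path argument are assumptions made for illustration and will differ per system.

```python
import os


def jemalloc_env(dirty_decay_ms, muzzy_decay_ms, lib_path=None, base_env=None):
    """Return a copy of the environment with jemalloc decay options set.

    lib_path is a hypothetical path to libjemalloc.so. If given, it is
    prepended to LD_PRELOAD so that the child process uses jemalloc.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_CONF"] = (
        f"dirty_decay_ms:{dirty_decay_ms},muzzy_decay_ms:{muzzy_decay_ms}"
    )
    if lib_path is not None:
        preload = env.get("LD_PRELOAD", "")
        env["LD_PRELOAD"] = f"{lib_path}:{preload}" if preload else lib_path
    return env


# Example: ask for dirty pages to be released after 5 s instead of 10 s.
env = jemalloc_env(5000, 5000, base_env={})
print(env["MALLOC_CONF"])
```

The resulting dictionary could then be passed as `env=` to `subprocess.run` when launching the test run.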
Using this method for instrumenting memory allocation in pocl, I get the following plot of allocated memory (by pocl) over time: This seems to support the assertion that, as with mirgecom, the underlying cause for the growth of allocated memory might be memory fragmentation, and, as a result, the allocator requesting more and more memory from the OS. In particular, it appears that, at least from the point of view of OpenCL, all memory that is allocated is also freed. For reproducibility:
run with 48efe3b and pocl a2d016c8. |
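The "everything allocated is also freed" check can be sketched roughly as below. The event format (`("alloc", ptr, size)` and `("free", ptr, None)` tuples) is hypothetical and is not the format produced by the pocl instrumentation; it only illustrates how a running total and a list of unfreed allocations might be derived from such a log.

```python
def running_total(events):
    """Compute the number of live bytes after each event.

    events is an iterable of ("alloc", ptr, size) or ("free", ptr, None)
    tuples (a made-up format for illustration).
    Returns (totals, leaked), where leaked maps each unfreed ptr to its size.
    """
    live = {}
    totals = []
    current = 0
    for op, ptr, size in events:
        if op == "alloc":
            live[ptr] = size
            current += size
        elif op == "free":
            # Freeing an unknown pointer is treated as a no-op here.
            current -= live.pop(ptr, 0)
        else:
            raise ValueError(f"unknown operation: {op!r}")
        totals.append(current)
    return totals, live
```

If `leaked` comes back empty while the process memory still grows, that would be consistent with fragmentation rather than a missing free.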
Interestingly, replacing pocl's allocator for global memory with jemalloc doesn't seem to do much: If anything, it releases even less memory. Maybe @alexfikl is right, and this comes down to knob twiddling. For reproducibility:
|
As for things that are cheap to do, I discovered another allocator to try. |
Tried with jemalloc with the flags given by @alexfikl
Not really all that different from glibc or vanilla jemalloc: |
Just to add more random data to this. I ran
This only runs 4 tests, not the whole thing, so should probably be taken with a grain of salt. The different lines are
It does seem like changing some options will not have a big effect vs default jemalloc :\ |
I've also tried adding

    def teardown_function(func):
        import gc
        gc.collect()

        import ctypes
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)

at the top of those tests to force it to free as much as possible, but it didn't seem to improve much. However, it does seem to deallocate memory towards the end there, so it may not continue growing as all the tests run.
|
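A slightly more defensive variant of that teardown, as a sketch: it assumes only that `malloc_trim` is a glibc function and may be missing elsewhere, so the lookup is guarded. The helper name `collect_and_trim` is made up for illustration.

```python
import ctypes
import ctypes.util
import gc


def collect_and_trim():
    """Run the Python GC, then ask glibc to return free heap memory to the OS.

    Returns True if malloc_trim was called, False if it is unavailable
    (for example on non-glibc platforms).
    """
    gc.collect()
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        trim = libc.malloc_trim
    except (OSError, AttributeError):
        return False
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    trim(0)
    return True


def teardown_function(func):
    # pytest calls this after every test function in the module.
    collect_and_trim()
```

Note that this most likely only affects memory managed by glibc's own allocator; with jemalloc preloaded it would probably have little effect.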
Went ahead and ran the whole test suite with the snippet from the previous comment added to all files.

    def teardown_function(func):
        import gc
        gc.collect()

        import ctypes
        libc = ctypes.CDLL("libc.so.6")
        libc.malloc_trim(0)
|
Thanks! One additional bit of information: The Python GC already gets run by pyopencl (I think) between tests. |
Ah, that explains why I didn't see much of a difference when removing the |
Had an OpenCL-related theory for something that could be contributing: inducer/pyopencl#450 |
python test_layer_pot.py with this diff

    diff --git a/test/test_layer_pot.py b/test/test_layer_pot.py
    index 409eede..a67bb78 100644
    --- a/test/test_layer_pot.py
    +++ b/test/test_layer_pot.py
    @@ -614,15 +614,17 @@ def test_3d_jump_relations(actx_factory, relation, visualize=False):
     # }}}
    +def run_em_all():
    +    import gc
    +    for i in range(10):
    +        test_off_surface_eval(_acf, True)
    +        gc.collect()
    +
    +
     # You can test individual routines by typing
     # $ python test_layer_pot.py 'test_routine()'
     if __name__ == "__main__":
    -    import sys
    -    if len(sys.argv) > 1:
    -        exec(sys.argv[1])
    -    else:
    -        from pytest import main
    -        main([__file__])
    +    run_em_all()
     # vim: fdm=marker

seems to be a microcosm of this issue that runs in manageable time. Memory usage goes up in a straight line, when... there's no reason for that, IMO: |
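To put numbers on the "straight line", a loop like the one in the diff can record the resident set size after each iteration. This is a minimal sketch, assuming Linux (it reads `/proc/self/statm`); the helper names are hypothetical and `fn` stands in for whatever test is being repeated.

```python
import gc
import os


def rss_bytes():
    """Current resident set size in bytes, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_after_each_run(fn, n_runs=10):
    """Call fn repeatedly and sample the RSS after each call."""
    samples = []
    for _ in range(n_runs):
        fn()
        gc.collect()
        samples.append(rss_bytes())
    return samples
```

If the samples keep rising by a similar amount per iteration even though each run should be independent, that would match the behavior described above.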
The allocations seem to scale with the size of the problem (in this case, set by changing |
I'm using the code in this diff to hunt the leaks. One (possibly minor?) contributor to this appears to be |
That was a reference cycle, but not a significant contributor: inducer/pytools#181. |
Beyond that, it's mostly a bunch of pyopencl "stuff" that stays alive: <class 'list'>, <class 'list'>, <class 'set'>, <class 'pyopencl._cl.Context'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'pyopencl.scan.GenericScanKernel'>, <class 'list'>, <class 'pyopencl._cl.Device'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltFinalUpdateKernelInfo'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl._cl.Platform'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'tuple'>, <class 'tuple'>, <class 'tuple'>, 
<class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'tuple'>, <class 'list'>, <class 'tuple'>, <class 'list'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, ... |
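A list like the one above can be condensed into per-type growth counts. The following is a rough sketch using only the standard library, not the code from the diff mentioned earlier; `type_counts` and `count_growth` are made-up names.

```python
import gc
from collections import Counter


def type_counts():
    """Count GC-tracked objects by type."""
    gc.collect()
    return Counter(type(obj) for obj in gc.get_objects())


def count_growth(before, after):
    """Return only the types whose instance count increased."""
    # Counter subtraction drops entries that are zero or negative.
    return after - before
```

Taking `before = type_counts()` ahead of one test iteration and calling `count_growth(before, type_counts()).most_common(20)` afterwards would show which types accumulate. Only objects tracked by the garbage collector are seen, so plain strings, numbers and memory held by C extensions do not show up.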
Hm. Calling |
Irritatingly, even with zero growth of Python-side object allocation, the |
OK, it's flat after switching to |
Well. So I removed that |
Observations:
|
inducer/loopy#797 (plus small changes across the stack) should make it so that loopy caches don't keep contexts alive. |
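As an illustration of the general idea (not the approach taken in inducer/loopy#797, which is not shown here), a cache can be keyed weakly by context so that the cache itself never keeps a context alive. The class name is hypothetical, and the sketch assumes the context type supports weak references.

```python
import weakref


class PerContextCache:
    """Cache keyed weakly by context, so entries vanish with the context."""

    def __init__(self):
        self._data = weakref.WeakKeyDictionary()

    def get_or_build(self, context, key, build):
        # One ordinary dict of cached values per live context.
        per_ctx = self._data.setdefault(context, {})
        if key not in per_ctx:
            per_ctx[key] = build()
        return per_ctx[key]

    def __len__(self):
        return len(self._data)
```

One caveat: if a cached value holds a strong reference back to its context, the context still stays alive, so the values need to be checked as well.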
Somewhat depressingly, inducer/loopy#797 doesn't seem to have made a decisive difference. Blue is with the changes, black is before. |
I've also printed lists of Python objects that, according to
I'm increasingly coming to the conclusion that these leaks might not be Python-visible---these objects are unlikely to explain hundreds of megabytes, I think. |
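One way to test the "not Python-visible" hypothesis is to measure how much memory Python's own allocators hand out across repeated runs. This is a sketch using the standard `tracemalloc` module; the helper name is an assumption, and `fn` stands in for the repeated test.

```python
import gc
import tracemalloc


def python_heap_growth(fn, n_runs=5):
    """Return the growth, in bytes, of traced Python memory over n_runs calls."""
    tracemalloc.start()
    try:
        gc.collect()
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(n_runs):
            fn()
            gc.collect()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

`tracemalloc` only sees allocations that go through Python's memory allocators. Memory that a C library allocates directly with `malloc` is not counted, so a flat result here alongside a growing process size would support the conclusion above.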
Further confirmation of this is that |
Can you try with |
Does something like https://github.com/bloomberg/memray give additional useful information? I've never actually tried it.. EDIT: Possibly even https://github.com/bloomberg/pytest-memray? |
I was trying https://github.com/plasma-umass/scalene just now, but it seemed to incur fairly significant slowdown? Not sure. I guess I'll switch to memray for a bit? |
I've already cut pytest out of the picture and wrote some code to manually call the tests, to make sure that's not at fault. (Yes, I'm that deep into conspiracy territory.) Memray just crashed initially, but that seems to be due to the Intel OpenCL ICD. If I disable it, things at least seem to run. |
sigh It seems memray gets confused by GPUs and simply reports the 600 GiB of mapped GPU memory as memory consumption. 🤦 (Yes, yes, I'm rebuilding pocl without GPUs.) |
(Nvm all this, see below) 🎉 Memray did something useful! Here's the relevant flamegraph: https://gist.github.com/inducer/2aaa320fdf49b3d5af651cbc28d5ee4d It seems like most of the leaks are in pocl, with vkfft contributing a measly 100 MB. I'll recompile pocl with debug to try and get more precise info. |
(Nvm all this, see below) Here's the updated flamegraph with Pocl in debug mode: https://gist.github.com/inducer/bfaeb3a85025f3e7a747cba930e2477e One thing that's not clear is that the bottom of most stacktraces shows memray. Does that mean that memray's allocations dominate, and that is what we've learned? Or is the presence of those stackframes just an artifact that we should ignore? |
(Nvm all this, see below) Continuing on the assumption that the "memray intercept stuff" is not the actual source of memory consumption. If that's the case, then that flamegraph implicates two specific
Hmm... both of those bits of code look non-leaky to me, but both of them also return memory to a caller... so there's probably more going on. |
(Nvm all this, see below) The first of those sites also leaks when running https://gist.github.com/inducer/6c2a088eefeffd86cdd456e5aa244bec, a slightly modified PyOpenCL demo, on an empty cache, where it seems to leak 256 MB over the course of the runtime. |
(Nvm all this, see below) I've been staring at this for a while: both the PyOpenCL demo and the full-scale test with |
Never mind all that. TIL that |
The latest from memray: There is no leak! Here's the flamegraph: https://ssl.tiker.net/nextcloud/s/ayQRNNcx8MBexkS So maybe it's time to come to terms with the idea that the leak, if there is one, isn't visible to |
#214 and PRs linked therein is where this journey ends, for now. |
It looks like that 12G thing is Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa |
Well that's quite impressive. I'll try to look at it over the weekend. It definitely should not do that :\ |
Hm, tried looking only at the skeletonization tests with
and there's definitely a big bump in memory due to that test. It's mainly due to constructing the full dense matrix here (a slightly bigger one than the one that is in pytential/test/test_linalg_skeletonization.py Lines 222 to 229 in 27d976d
Locally it only bumps up to about 2GB, not 12GB, so not sure what that's about. However, that matrix construction makes up most of the memory and the error computation (which uses that matrix to compute a bunch of block-wise matrix norms and a big SVD) makes up most of the runtime. I'll continue digging into it.. (blue is local and black is koelsch) |
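For a sense of scale, the storage needed for a dense matrix grows with the product of its dimensions. The sizes below are illustrative only and are not taken from the test; the helper name is made up.

```python
def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to store a dense n_rows x n_cols matrix.

    itemsize defaults to 8, the size of a float64 entry.
    """
    return n_rows * n_cols * itemsize


# Illustrative sizes only; the actual matrix size is not stated here.
for n in (4_000, 16_000, 40_000):
    print(n, dense_matrix_bytes(n, n) / 2**30, "GiB")
```

Temporaries created while computing norms or an SVD on such a matrix would add to this figure.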
@inducer What did you run to get the results from #59 (comment)? I tried running on koelsch and it looks intimidatingly reasonable. From the command on the plot, it seems like you're also running the |
I did not use Also: let me rerun with the slow tests excluded. |
Yeah, all of my runs have been on the latest git from all the stack and with a EDIT: And I think I cleared all the disk caches last week too. |
Continued from https://gitlab.tiker.net/inducer/pytential/-/issues/131.