Finer control over GPU memory (access to cu*Destroy from Python) #370
Comments
Thanks for this request! I would be happy to consider a PR that adds this. This should be simple to do. Within the Lines 994 to 1041 in db77dc7
After some more investigation, I was able to identify the cause of our memory leak! Perhaps this is already documented somewhere, but we weren't explicitly pushing the current CUDA context at the time Python's garbage collector deleted the stream, and therefore cuStreamDestroy was never getting called. The destructor of pycuda.driver.Stream was in fact being called correctly after our function returned, but for some reason this exception was raised and silently swallowed (so we never found out): Lines 967 to 969 in db77dc7
I assume there's a good reason to raise the exception instead of pushing the context from the current thread, so we implemented a method on our end to destroy each Stream we create by wrapping the del self.stream with a context push/pop. Now our GPU memory usage is as flat as a pancake :)
I'm happy to contribute to any docs if you believe this could be helpful to someone else; otherwise I'm happy to close this issue.
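For readers who want to see the shape of the workaround described in the comment above, here is a minimal sketch. It is not the commenter's actual code: the class name `ManagedStream`, its `destroy` method, and the `make_stream` factory argument are hypothetical names chosen for illustration. The sketch only assumes that the context object has `push()` and `pop()` methods, as used elsewhere in this thread.

```python
class ManagedStream:
    """Hypothetical wrapper that owns one stream and releases it
    while its context is pushed on the calling thread."""

    def __init__(self, ctx, make_stream):
        # ctx: any object exposing push() and pop() (assumed here to be
        #      a pycuda context, but nothing below depends on pycuda).
        # make_stream: zero-argument factory that creates the stream.
        self._ctx = ctx
        self._ctx.push()
        try:
            self.stream = make_stream()
        finally:
            self._ctx.pop()

    def destroy(self):
        """Drop the stream reference with the context active."""
        if self.stream is None:
            return  # already destroyed, nothing to do
        self._ctx.push()
        try:
            # Under CPython reference counting, dropping the last
            # reference runs the stream's destructor right here, i.e.
            # while the context is pushed. If other references to the
            # stream exist elsewhere, the destructor runs later instead.
            self.stream = None
        finally:
            self._ctx.pop()
```

With pycuda this would presumably be used as `ManagedStream(ctx, cuda.Stream)`, calling `destroy()` from whichever thread is done with the stream. The `try`/`finally` blocks keep the push and pop balanced even if stream creation raises.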
For any reader that comes across this, here's a minimal example to reproduce our leak (and how to fix it):

import gc
from pycuda import driver as cuda
from threading import Thread

cuda.init()
ctx = cuda.Device(0).make_context()
ctx.pop()


def create_streams(N=16):
    ctx.push()
    streams = [cuda.Stream() for _ in range(N)]
    ctx.pop()
    return streams


def print_memory():
    ctx.push()
    free, total = cuda.mem_get_info()
    print(f"{(total - free) / (1024 ** 2)}MB")
    ctx.pop()


def main_loop(N=10, push_ctx=False):
    for i in range(N):
        print(f"-- {i=} --")

        # Create some CUDA streams
        streams = create_streams()

        # Delete them (wrap the deletion with a context push/pop if selected)
        if push_ctx:
            ctx.push()
        del streams
        if push_ctx:
            ctx.pop()

        # Print memory usage. If there's a leak, this should increase on every iter
        print_memory()
    print("-" * 10)


# 1) Create and delete streams several times in the main thread.
# This doesn't leak memory because the stream destructor will be called
# from the same thread that created the context, so pycuda handles it nicely
main_loop()

# 2) Repeat step 1 from a separate thread.
# This will leak memory because the stream destructor isn't able to call
# cuStreamDestroy (isn't able to activate the context)
t = Thread(target=main_loop)
t.start()
t.join()

# 3) Repeat step 2 but pushing the context before deleting the streams. NO LEAK
t = Thread(target=main_loop, kwargs={"push_ctx": True})
t.start()
t.join()

Bottom line is: if you use multi-threading, remember to push the context before deleting the stream(s)!
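The example above relies on reading the printed memory figures by eye. As a rough sketch of how that check could be automated, the helper below repeats a step and records how much the used-memory reading moved. The names `memory_growth` and `looks_like_leak` are hypothetical, the leak test is only a heuristic, and the sampler is passed in as a plain callable so the sketch itself does not depend on pycuda or a GPU.

```python
from typing import Callable, List


def memory_growth(sample_used_bytes: Callable[[], int],
                  step: Callable[[], None],
                  iterations: int = 10) -> List[int]:
    """Run `step` repeatedly; return used-memory deltas vs. the start."""
    baseline = sample_used_bytes()
    deltas = []
    for _ in range(iterations):
        step()
        deltas.append(sample_used_bytes() - baseline)
    return deltas


def looks_like_leak(deltas: List[int], min_growth_bytes: int = 1) -> bool:
    """Heuristic: usage kept growing after the first iteration.

    Comparing the last reading with the first one (instead of with the
    baseline) ignores one-off allocations made on the first pass.
    """
    if len(deltas) < 2:
        return False
    return deltas[-1] - deltas[0] >= min_growth_bytes
```

In the setting of this thread, `sample_used_bytes` would presumably push the context, compute `total - free` from `mem_get_info()`, and pop the context again, while `step` would create and delete a batch of streams.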
Thanks for following up! I took a look at why there is no warning about the silenced exception. Turns out it's getting caught here: Line 1010 in a25ed98
and then simply silenced here: Lines 161 to 171 in a25ed98
That struck me as a bad idea, given that a warning likely could have saved you a few hours of grief. https://gitlab.tiker.net/inducer/pycuda/-/merge_requests/80 adds a warning in that case.
Is your feature request related to a problem? Please describe.
We recently identified a (GPU) memory leak in a routine that creates a new CUDA stream on a given context every time it is called. Even though the scope of the stream variable is constrained to this function, it doesn't seem to be garbage collected. Every N times the function is called, the total GPU memory allocated (measured through mem_get_info() or nvidia-smi) increases by a small amount (e.g. 2MB with N=16). After enough times, we completely starve our GPU's memory. See details on a minimal example to replicate in the Additional context section.
Describe the solution you'd like
I'd like the destructor of Stream objects to be exposed (e.g. through an explicit destroy method), so that we could force free that memory, or a mechanism to ensure streams out of scope are automatically freed.
Describe alternatives you've considered
So far, the only approach that seems to get the job done (although not ideal) is to detach the context and create a new one every once in a while. However, this has latency and performance drops associated with it, and involves having to reinitialize any other tasks that were running in that context.
Additional context
Minimal example to reproduce:
prints out: