-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
win32: Don't rely on timeout at all #2350
base: develop
Are you sure you want to change the base?
Conversation
If the process is about to exit, there's no point trying to do a bunch of work to clean up resources. The kernel will release them much more efficiently when the process exits at the end of this function.
Without this we run into a problem. After 50ms we forcably kill the pool thread. However, if the pool thread currently holds the allocation lock, but doesn't get scheduled within 50ms, said lock will never be released causing an infinite deadlock and preventing the DLL from releasing. Further, in the unload scenario (as opposed to process exit), we may actually want to let the thread pool finish in case it's working on finishing an operation on another thread (since the user may be expecting the answer).
@Keno So the This being said, we now have a new bug (a crash of GIMP) from some people, which apparently comes from these patchs we made (since this is the only difference between 2 installers we had of version 2.10.14), though we still haven't figured out why and how. If your patch makes this whole section of the code not being called anymore and letting the kernel handling thread termination, then I guess it could make our problem go away. |
It won't be called on process termination anymore, but it will still be called if FreeLibrary is called. I cannot guarantee that there aren't any other bugs here which will cause hangs in the latter case, but if so, any such bugs should be addressed on its own merits. There is no way to ensure that the thread we'd be killing here isn't in a critical section, so by fiddling with the timeout you're only moving around who sees the crashes. |
Probably best to confirm that the GIMP plugin does not get unloaded (and/or does not unload OpenBLAS) after the initial "are you there"... |
If this patch does cause hangs in the GIMP plugin, I'd like to know about it, so we can try to fix it properly. The bottom line here is that the fix to #307 did not address the underlying issue and turned a deterministic hang into a non-deterministic crash/hang/corruption depending on system contention. I tried the original reproducer from #307 and cannot reproduce the hang, but as I said, I wouldn't be entirely surprised if there was still incorrect code somewhere else that did cause a hang when openblas is unloaded. Nevertheless a deterministic hang is a million times easier to debug than non-deterministic corruption, so if we do see a hang, let's root cause it properly and fix it in the proper place. |
Actually, give me a second before trying it, I made a small typo in this patch. Edit: Never mind, the patch is fine. |
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
Hanging after getting a thread blown away looks a lot like #2341 (though I am still not entirely convinced that the changed proposed there is the right thing to do) |
Actually, it's potentially worse than that: kernel documentation for this says that it guarantees it will never again schedule the thread since it's already called TerminateThread on them (https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices#best-practices-for-synchronization) I don't think #2341 is related, since that appears to be Linux, and this is a Win32-only patch. |
Ok, in that case it would be good to know how GIMP uses OpenBLAS, so we can try to fix the underlying issue here. A minimal reproducer would be ideal of course. In the meantime, perhaps, we can just apply the first commit |
Right, #2341 is Linux-only... from #1720 (comment) it is probably the Levin-matting tool in the GIMP's GEGL library that causes a load/unload cycle of OpenBLAS on startup. |
#1720 looks to me like probably basically the same issue as here, except Windows has a loader lock, which makes many more things difficult. The documentation for DllMain has some nice diagrams to explain why the code here is expected to cause deadlocks. In particular, it says that if another thread tries to exit while there's another thread waiting on it from inside the DllMain function, that will always deadlock (until the timeout returns). If I'm reading the docs correctly, the solution is to instead use a separate Event to signal cleanup completion (and use WaitForMultipleObjects on either one getting signaled). |
Part of my problem (apart from not being a Windows programmer) is that I do not quite understand where the deadlock would occur - when it should only ever be the master thread waiting on others to exit. From my very limited understanding, perhaps using WaitForMultipleObjects with the pool.killed event as the second object (to avoid an infinite wait time) and then instead of the evil TerminateThread just doing a second round of WaitForMultipleObjects for anything not reaped on the first try might work ? |
So it turns out our last issue was on our side (some shaddy issue of wrong CFLAGS or something; we are not even sure ourselves). The code as it is currently works for us. So I would say, in any case, please don't merge this patch (as it just locks GIMP forever). I understand what you are saying @Keno, and in theory you are right that you should not assume that a long wait means there is a problem (maybe it's an expected long wait) so an infinite timeout could theoretically be right.
I confirm. GIMP doesn't use OpenBLAS directly, only through GEGL, which uses OpenBLAS only for its Levin-matting operation. Since we load GEGL at startup, it is indeed probably where a load/unload of OpenBlas must happen. |
@Jehan can you check if GIMP works with just the dllinit.c part of Keno's patch (which supposedly does not change behaviour on FreeLibrary) ? At the very least I would like to avoid turning this into a popularity contest between Julia and GIMP... |
Obviously I agree. I am all for all Free Software to improve together. 🙂
I guess it will because this was not part of our problem (but that's just a very wild guess). Anyway this should actually be asked to @jernejs. He is the one who maintains our Windows installer/build. @jernejs Would it be possible to make this test when you have some time? |
That seems like the opposite of a problem :P I’m cursed with the opposite affliction: having only passing knowledge of the OpenBLAS scheduler. But to summarize, that WaitForObject call is actually (guaranteed) to behave the same as calling |
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After reading https://docs.microsoft.com/de-de/windows/win32/dlls/dynamic-link-library-best-practices it seems clear to me that the first part of this PR (to not attempt to shut down threads on process exit) is correct. I am even more confused about what to do in the case of an immediate DLL unload while threads are still barely starting up - the available documentation appears to do little more than pronounce this a fundamentally unsafe situation and enumerate |
Furthermore, from https://devblogs.microsoft.com/oldnewthing/?p=25283 (last paragraph) it appears that to "properly wait until the thread finishes in dll unload" is not possible when that wait is initiated from DllMain (thread cannot exit before it has sent DLL_THREAD_DETACH, which it cannot do while the DLL_PROCESS_DETACH handler is still running). So blas_thread_shutdown() in any of its recent forms would probably be safe to invoke from the "other" code before it tries to unload the openblas dll (and also as part of our internal fork handler if necessary), but should not be called during any form of complete shutdown, be it dll unload or program exit. Not sure if we could get away with just doing nothing on DLL_PROCESS_DETACH in the unload case as well, |
Calling Ignoring it on program exit should be fine, since the other threads should be already killed anyways. Ignoring it on dll unload might be OK, but you’d have to leak any resources that could ever be examined by another thread (because you don’t know if it’s simultaneously accessing it). It’s unclear to me if Windows will leak the DLL memory, if it it’ll unmap that (causing the other threads to fault if they try to run after shutdown) I thought of one possible solution, but I’m surprised no other post has mentioned this (though it is tricky). Since the issue is that the LoadLibrary ref count might reach zero before all worker threads have started, the startup routine could acquire an extra handle to itself (increment the ref count). The last worker to start could then release (FreeLibrary) that extra ref count. That would defer the free until all threads are out of the loader lock. A few complications though:
|
Hmm. Looks like a rather involved way to invoke undefined behaviour ? (Though I saw something similar discussed on stackoverflow https://stackoverflow.com/questions/39241400/cleaning-up-threads-in-a-dll-endthreadex-vs-terminatethread) |
A quick test (replacing the openblas.dll in the current official GIMP package on a Windows system) does not show any obvious leaks or other shortcomings from not attempting any cleanup in the DLL unload case - certainly nothing catastrophic. |
I don't think any of the behavior I mentioned was undefined, I just noted that some of the functions may require the loader lock, which could be guaranteed to hang. From Google, it appears that others have used PS. Here's the text of Daniel's comment from that last link detailing the race condition present in that solution (while the direct blog link is now dead, it was recoverable on the wayback machine):
However, this should not be a problem given the plan I outlined above. |
To clarify I am not claiming that what you outlined would invoke "defined" undefined behaviour, just that there does not seem to be any official documentation stating that this is the recommended, guaranteed way to address this problem. So we might again end up with something that happens to work right now but causes insiduous problems with some later version/patchlevel of windows ? |
Not doing anything is relying on a race condition (that the worker threads are currently sleeping and will never awake after the DLL is unmapped from memory—that's nearly the same one that's sometimes seen to fail now). The approach I outlined should only be relying on simple defined behaviors (reference counting) and is presumably also the underlying implementation for the Win32 ThreadPool https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-setthreadpoolcallbacklibrary |
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
After much debugging, it turns out that OpenBlas is sometimes hanging on exit due to bad assumptions in its thread management. The upstream discussion is at OpenMathLib/OpenBLAS#2350, but this should fix our frequent win32 CI failures in the meantime.
Same issue as #2314, but this time seen in Julia. Having a timeout here is fundamentally unsound. There is no guarantee that the kernel will ever schedule the thread holding the lock. We need to rely on the pool threads to exit eventually (if they don't, that's a different bug that this can't work around - especially since we already don't do the termination for windows store apps). However, to prevent thread pool threads from blocking process exit, don't attempt to shut down BLAS if the process is about to exit - the kernel will kill all our resources anyway. After this patch, the problematic code section is only called if the dll is manually unloaded, which I think is
a) A much rarer use case and
b) even more important that we don't interrupt pool threads in the middle of work that that host application may want to look at later.
cc @Jehan