Multi-threading deadlock bug in Windows (Julia 1.3 and 1.4-rc1) #34769

Closed
vlandau opened this issue Feb 15, 2020 · 1 comment · Fixed by #34807
Labels
multithreading (Base.Threads and related functionality), system:windows (Affects only Windows)

vlandau commented Feb 15, 2020

cc @ViralBShah and @Keno

I encountered a bug in Julia that results in a deadlock on Windows. The bug gets triggered by my package tests, but only when running the tests on a CI virtual machine (both Travis Windows and Appveyor). Notably, if I run the tests (via Pkg.test()) manually on the Appveyor VM after I RDP into it, everything works. The bug only gets triggered if the tests were initiated by the CI service. Not sure what's going on there.

Some more info and context is available in this issue in my Julia package repo.

Here is a Google Drive link to a minidump from the Appveyor VM that hopefully has the information needed to debug this. I would happily debug it myself, but it is beyond my capabilities.

You can see the backtrace (from an Appveyor VM) confirming that it is a deadlock here.

Here's the session info for the Julia version that was used on the Appveyor VM (but note that the bug is also triggered in Julia 1.3.0 and 1.3.1):

Julia Version 1.4.0-rc1.0
Commit b0c33b0cf5 (2020-01-23 17:23 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 1
  JULIA_PROJECT = @.
  JULIA_VERSION = 1.4

and you can see the Appveyor output here.

Thanks, and please let me know if I can provide any additional info.

@ViralBShah ViralBShah added system:windows Affects only Windows multithreading Base.Threads and related functionality labels Feb 15, 2020
ViralBShah (Member) commented:

It would be nice to fix this for 1.4, if we can.

Keno added a commit that referenced this issue Feb 18, 2020
When there is no work to do, the first thread to be idle will attempt to
run the event loop once, waiting for any notifications (which will usually
create new work). However, there is an interesting corner case where a
notification arrives, but no work was scheduled. That doesn't usually happen,
but there are a few situations where it does:

1) Somebody used a libuv primitive outside of julia, so the callback
   doesn't schedule any julia work.
2) Another thread forcibly interrupted the waiting thread because it wants to
   take over the event loop for various reasons.
3) On Windows, we occasionally get spurious wake-ups of the event loop.

The existing code in partr assumed that we were in situation 2, i.e. that
there was another thread waiting to take over the event loop, so it released
the event loop and simply put the current thread to sleep in the expectation
that another thread will pick it up. However, if we instead are in one of the
other two conditions, there may not be another thread there to pick up the event
loop. Thus, with no thread owning the event loop, julia will stop responding to
events and effectively deadlock. Since both 1 and 3 are rare, and we don't actually
enter the event loop until there was no work for 4 milliseconds (which is fairly rare),
this condition rarely happens, but is occasionally observable on Windows, where it
caused #34769. To test that this fix works, we manually create situation 1 in the
test by creating an idle callback, which will prevent the event loop from blocking,
but only schedules julia work after it's been called 100 times. This reproduces
the observed failure from the issue and is fixed by this PR.

Fixes #34769

Co-authored-by: Jeff Bezanson <[email protected]>
Co-authored-by: Jameson Nash <[email protected]>
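
The failure mode described in the commit message above can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions: `ToyScheduler`, `idle_thread`, and the `retry_when_unowned` flag are hypothetical names invented for this example, the model has a single thread (matching `JULIA_NUM_THREADS = 1` in the report), and the "retry" policy is one plausible remedy, not necessarily what the actual fix in partr does. The idle callback mirrors the test strategy from the commit message: it keeps the event loop from blocking, but only schedules work on its 100th call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyScheduler:
    """Hypothetical single-threaded model of an idle thread and an event loop."""
    work_after_calls: int = 100         # idle callback schedules work on this call
    idle_calls: int = 0                 # how often the idle callback has fired
    pending_work: int = 0               # tasks sitting in the toy work queue
    loop_owner: Optional[str] = None    # None means nobody runs the event loop
    other_thread_waiting: bool = False  # True would correspond to situation 2

    def run_event_loop_once(self) -> None:
        # The idle callback fires on every iteration, so the loop never blocks,
        # but it only produces work once it has been called often enough.
        self.idle_calls += 1
        if self.idle_calls >= self.work_after_calls:
            self.pending_work += 1


def idle_thread(sched: ToyScheduler, retry_when_unowned: bool,
                max_steps: int = 1000) -> str:
    """Simulate one idle thread; return 'ran work', 'deadlock', or 'gave up'."""
    for _ in range(max_steps):
        if sched.pending_work > 0:
            sched.pending_work -= 1
            return "ran work"

        # No work: take the event loop, run it once, then release it.
        sched.loop_owner = "thread-1"
        sched.run_event_loop_once()
        sched.loop_owner = None

        if sched.pending_work > 0:
            continue  # the wake-up produced work; pick it up next iteration

        # Woke up without work (situations 1 and 3 in the commit message).
        if retry_when_unowned and not sched.other_thread_waiting:
            continue  # nobody else will take the loop, so keep running it

        # Old assumption: another thread wants the loop, so go to sleep.
        # With no such thread, the loop is now unowned and nothing can wake us.
        return "deadlock"
    return "gave up"


old = ToyScheduler()
print(idle_thread(old, retry_when_unowned=False), old.idle_calls)  # deadlock 1

new = ToyScheduler()
print(idle_thread(new, retry_when_unowned=True), new.idle_calls)   # ran work 100
```

In the first run the thread sleeps after a single work-free wake-up while `loop_owner` is `None`, which is the "no thread owning the event loop" state the commit message identifies as the deadlock. In the second run the thread keeps servicing the loop until the callback finally schedules work.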
Keno added a commit that referenced this issue Feb 19, 2020
KristofferC pushed a commit that referenced this issue Feb 19, 2020
(cherry picked from commit f36edc2)
birm pushed a commit to birm/julia that referenced this issue Feb 22, 2020
KristofferC pushed a commit that referenced this issue Apr 11, 2020