
Multi-threading deadlock on Windows #13

Closed
vlandau opened this issue Dec 17, 2019 · 33 comments

@vlandau (Member) commented Dec 17, 2019

We need to identify the cause. It only shows up on Windows in Travis and AppVeyor with Julia 1.3; it doesn't happen on Julia 1.2. @ranjanan or @ViralBShah, any advice would be greatly appreciated!

Here's a post I made on the Julia Discourse with links to the relevant code blocks in Omniscape.jl.

@vlandau added the bug and high-priority labels on Dec 17, 2019
@vlandau self-assigned this on Dec 17, 2019
@ViralBShah (Member) commented:

We've received similar reports on Windows. Is this failing reliably?

@JeffBezanson @vtjnash

@vlandau (Member, Author) commented Dec 17, 2019

@ViralBShah Yes, it fails consistently, but only on Travis and AppVeyor (maybe due to the more limited hardware resources of those VMs?).

@ViralBShah (Member) commented:

Do you have a Windows desktop/laptop to try on?

@vlandau (Member, Author) commented Dec 17, 2019

It has been tested locally on two different Windows machines and passed on both. I also tested it with JULIA_NUM_THREADS set to 1 (which is what the Travis and AppVeyor VMs have) on one of the machines, and it passed.
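
A minimal sketch of that kind of check, assuming one just wants to confirm the thread count the worker exposes and that threaded code runs at all (the script and loop are purely illustrative, not Omniscape's test suite):

    # Run with, e.g., JULIA_NUM_THREADS=4 julia threads_check.jl (hypothetical file name)
    using Base.Threads

    println("Threads available: ", nthreads())

    # Record which thread runs each iteration; with the default static schedule
    # and nthreads() iterations, every thread should appear once.
    ran_on = zeros(Int, nthreads())
    @threads for i in 1:nthreads()
        ran_on[i] = threadid()
    end
    println("Iterations ran on threads: ", ran_on)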

@vlandau (Member, Author) commented Dec 18, 2019

I do have access to a Windows machine to do additional local testing if/as needed.

@ViralBShah (Member) commented:

It is possible that the Travis VMs are constrained. We will need a way to reproduce this reliably on a local setup in order to debug.

@vlandau (Member, Author) commented Dec 22, 2019

There may be a way to set up a Docker image with limited access to resources (?). I'll look into this a bit.
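
A rough sketch of that idea, assuming Docker is available (the image tag, resource limits, mount path, and invocation are illustrative guesses, not a tested configuration; a Windows-container variant of the image would be needed to actually mirror the CI environment):

    # Shell out to Docker from Julia with CPU and memory caps roughly like a constrained CI worker.
    cmd = `docker run --rm --cpus=1 --memory=4g -v $(pwd()):/work -w /work julia:1.3 julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate(); Pkg.test()'`
    run(cmd)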

@vlandau (Member, Author) commented Dec 22, 2019

Looks like Windows on Travis CI uses Windows Server 2016. Some more info: https://docs.travis-ci.com/user/reference/overview/#what-infrastructure-is-my-environment-running-on

Also, https://docs.docker.com/config/containers/resource_constraints/

Can't get into this until after Jan 2nd.

@vlandau (Member, Author) commented Jan 2, 2020

@ViralBShah, do you think it's worth trying to debug from within an AppVeyor VM?

There is a way to get inside the VM via remote desktop:
https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

@ViralBShah (Member) commented:

I think it might be this one: JuliaLang/julia#34225

We should wait for it.

@ViralBShah (Member) commented:

Until 1.3.2 (assuming that's where the fix will be), it may be best to stay with Julia 1.2.

@vlandau (Member, Author) commented Jan 2, 2020

Alright, sounds good. Thanks for the heads-up on that pull request!

Omniscape 0.1.3 works on Julia 1.3.1, so I'll just hold off on releasing 0.2.0 (which is where the deadlock bug was introduced) until the Julia patch with that fix is released.

@vlandau (Member, Author) commented Jan 2, 2020

FWIW, tests are suddenly passing on AppVeyor... 🤷‍♂️

EDIT:
It might be because AppVeyor is now using Julia 1.3.1, but Travis Windows builds are still failing despite also using Julia 1.3.1.

@ViralBShah (Member) commented:

It might be that Travis has a setup that triggers it more reliably, or it is a different bug altogether.

Cc @Keno in case he has further insight.

@vlandau (Member, Author) commented Jan 2, 2020

In trying to solve a separate issue, this PR seemed to solve the deadlock problem on Travis...

All it did was turn off inlining by adding

script: travis_wait 30 julia --code-coverage --inline=no -e 'using Pkg; Pkg.build(); Pkg.test(coverage=true)'

to the .travis.yml. Not sure what the travis_wait 30 is about.

Just posting here in case it provides a clue to what's going on.

EDIT:
travis_wait 30 seems to be what resolved the deadlock, but not because it allowed a longer window before timing out; tests still passed on Windows in under 10 minutes (the default).

EDIT:
Turns out travis_wait 30 was incorrectly marking jobs as passing. Tests were actually failing/timing out; the failure just wasn't being reported properly.
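
For reference, the --inline=no invocation above can also be approximated from a local Julia session. This is only a sketch and assumes the installed Pkg version supports the julia_args and coverage keywords:

    using Pkg
    # Run the Omniscape tests with inlining disabled in the test process,
    # mirroring the --inline=no flag used in the CI script above.
    Pkg.test("Omniscape"; coverage = true, julia_args = ["--inline=no"])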

@ViralBShah (Member) commented:

@vlandau, we will need a local reproducer, even if it is only somewhat reliable, or a dump from the deadlocking system.

@vlandau (Member, Author) commented Jan 10, 2020

Okay, I will see if I can debug in AppVeyor following @Keno's suggestion.

@vlandau (Member, Author) commented Jan 10, 2020

@ViralBShah, when I enter the AppVeyor VM via RDP and run the build and test commands manually, the tests pass reliably. Once I get rid of the RDP session and let the tests continue on their own, the tests fail reliably.

Is there some way to print debug info to the AppVeyor output (here)? Or some way to save a minidump somewhere on the VM so that I can RDP in during the deadlock and copy it?

Thanks for any help/guidance.

@vlandau (Member, Author) commented Jan 21, 2020

Having issues with gdb in AppVeyor (see this post on the Julia Discourse); it's something related to julia.exe not having symbols.

Is there a way to stream the stack trace to a text file in real time so that we can see the trace leading up to the deadlock? That might be much more straightforward, because I wouldn't need to rely on gdb.

cc @ranjanan in case you're able to jump into this at some point.
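
One low-tech way to approximate the "stream debug info to a text file" idea above is a heartbeat logger. This is only a hypothetical sketch (it does not capture real stack traces), but the last timestamp in the log shows roughly when everything stopped:

    using Dates

    # Background task that appends a timestamp every `interval` seconds.
    # If the process deadlocks hard enough to stall the scheduler, the
    # heartbeats stop too, which itself narrows down when things hung.
    function start_heartbeat(path::AbstractString; interval::Real = 5)
        @async while true
            open(path, "a") do io
                println(io, now(), "  still alive")
            end
            sleep(interval)
        end
    end

    start_heartbeat("heartbeat.log")
    # ... run the tests here ...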

@vlandau (Member, Author) commented Jan 30, 2020

Bumping this issue in case anyone can take another look (though I know you're all busy too!). I got some responses on the Discourse post, but I'm still having issues:

https://discourse.julialang.org/t/how-to-generate-a-minidump-for-a-hanging-julia-program-on-appveyor-windows/33451/8

@Keno commented Jan 31, 2020

I'm working on some deadlock detection tooling for Julia base, so if you can wait another week or two, you may be able to use that at that point.

@vlandau (Member, Author) commented Jan 31, 2020

Great to hear @Keno. I'd definitely be interested to try that out when it's ready.

@vlandau (Member, Author) commented Feb 14, 2020

@Keno and @ViralBShah, at long last, I have the backtrace from the deadlocked process!

It was through RDP to AppVeyor, so I wasn't able to copy the text from the gdb terminal; hopefully a screenshot will suffice for now. Should I post an issue with this info to the JuliaLang repo?

[screenshot of the gdb backtrace]

@ViralBShah (Member) commented:

Thanks Vincent! Also drawing @vtjnash's attention.

@ViralBShah (Member) commented:

@Keno @vtjnash Should this be put into a Julia issue?

@Keno commented Feb 14, 2020

Were you able to get a full minidump of the process in question? That stack trace alone is unfortunately not actionable (any deadlock would look like that for the blocked thread - we need to know what the other threads are up to).
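
[Note: when gdb does manage to attach, info threads lists every thread and thread apply all bt prints a backtrace for each of them; that still isn't a full minidump, but it does capture what the other threads are doing at the moment of the hang.]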

@vlandau (Member, Author) commented Feb 14, 2020

Is there a way to grab that from gdb?

EDIT:
To answer my own question: no, but I'm happy to work on getting that for you!

@Keno commented Feb 14, 2020

I usually use the process explorer for minidump capture: https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer

@vlandau (Member, Author) commented Feb 14, 2020

Thanks! I will look into that and post back here when I've got it.

@vlandau (Member, Author) commented Feb 14, 2020

Alright, here's a link to the minidump of the process (created using Process Explorer by right-clicking on the Julia process and selecting "Create Minidump"): https://drive.google.com/file/d/18SqE_HtlE1UqeNao8soQ-NjWL9iA-ewv/view?usp=sharing

@vlandau (Member, Author) commented Feb 14, 2020

@Keno @vtjnash Should this be put into a Julia issue?

Now that I have the minidump info, should I post a Julia issue?

@ViralBShah (Member) commented:

Yes, please post a Julia issue.

@vlandau (Member, Author) commented Feb 20, 2020

Alright, tests officially passed after julia/pull/34807. Thank you @ViralBShah and @Keno for the help here, and awesome work on the patch and debugging, @Keno! Thanks very much!

Psyched to finally close this!!
