Multi-threading deadlock on Windows #13
We've received similar reports on Windows. Is this reliably failing?
@ViralBShah Yes, it consistently fails, but only on Travis and Appveyor (maybe due to the more limited hardware resources of those VMs?)
Do you have a Windows desktop/laptop to try on?
It has been tested on two different Windows machines locally and passed on both. I also tested it with...
I do have access to a Windows machine to do additional local testing if/as needed.
It is possible that the Travis VMs are constrained. We will need a way to reproduce this reliably on a local setup in order to debug.
There may be a way to set up a Docker image with limited access to resources (?). I'll look into this a bit.
Looks like Windows in Travis CI uses Windows Server 2016. Some more info: https://docs.travis-ci.com/user/reference/overview/#what-infrastructure-is-my-environment-running-on Also: https://docs.docker.com/config/containers/resource_constraints/ Can't get into this until after Jan 2nd.
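As a rough sketch of the resource-constraint idea from those Docker docs (the image tag, limits, and mount paths here are illustrative placeholders, not settings from this thread), something like this could approximate a constrained CI VM locally:

```shell
# Hypothetical: run the package tests in a container limited to 2 CPUs
# and 4 GB of RAM, to mimic a resource-constrained Travis/Appveyor VM.
# Image tag and limits are placeholders for illustration only.
docker run --rm --cpus=2 --memory=4g -v "$PWD":/work -w /work julia:1.3 \
  julia -e 'using Pkg; Pkg.build(); Pkg.test()'
```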
@ViralBShah do you think it's worth trying to debug from within an Appveyor VM? There is a way to get inside the VM via remote desktop.
I think it might be this one: JuliaLang/julia#34225 We should wait for it.
Until 1.3.2 (assuming that's where the fix will be), it may be best to stay with Julia 1.2. |
Alright, sounds good. Thanks for the heads up on that pull request! Omniscape 0.1.3 works on Julia 1.3.1, so I'll just hold off on releasing 0.2.0 (which is where the deadlock bug was introduced) until the Julia patch with that fix is released.
FWIW, tests are suddenly passing on Appveyor... 🤷♂️
It might be that Travis has a setup that triggers it more reliably, or it is a different bug altogether. Cc @Keno in case he has further insight.
In trying to solve a separate issue, this PR seemed to solve the deadlock problem on Travis... All it did was turn off inlining by adding script: travis_wait 30 julia --code-coverage --inline=no -e 'using Pkg; Pkg.build(); Pkg.test(coverage=true)' to the travis.yml. Not sure what the... Just posting here in case it provides a clue to what's going on.
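For reference, the workaround described in that comment would look roughly like this in the .travis.yml (a sketch reconstructed from the command quoted above, not the exact file from the PR):

```yaml
# Sketch of the workaround: disable inlining when running CI tests.
script:
  - travis_wait 30 julia --code-coverage --inline=no -e 'using Pkg; Pkg.build(); Pkg.test(coverage=true)'
```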
@vlandau we will need a reproducer locally, even if it is only somewhat reliable. Or a dump from the deadlocking system.
Okay, I will see if I can debug in Appveyor following @Keno's suggestion.
@ViralBShah, when entering the Appveyor VM via RDP and running the build and test commands manually, the tests pass reliably. Once I get rid of the RDP session and allow the tests to continue, the tests fail reliably. Is there some way to print debug info to the Appveyor output (here)? Or some way to save a minidump somewhere on the VM, so I can RDP in during the deadlock and copy it? Thanks for any help/guidance.
Having issues with gdb in Appveyor (see this post on the Julia Discourse). It's something related to julia.exe not having symbols. Is there a way to stream the stacktrace to a text file in real time so that we can see the trace leading up to the deadlock? That might be much more straightforward because I wouldn't need to rely on gdb. cc @ranjanan in case you're able to jump into this at some point.
Bumping this issue in case anyone can take another look (though I know you're all busy too!). I got some responses on the Discourse post, but I'm still having issues.
I'm working on some deadlock detection tooling for Julia base, so if you can wait another week or two, you may be able to use that at that point.
Great to hear @Keno. I'd definitely be interested to try that out when it's ready.
@Keno and @ViralBShah, at long last, I have the backtrace from the deadlocked process! I got it through RDP to Appveyor and wasn't able to copy the text from the gdb terminal, so hopefully a screenshot will suffice for now. Should I post an issue with this info to the JuliaLang GitHub repo?
Thanks Vincent! Also drawing @vtjnash's attention.
Were you able to get a full minidump of the process in question? That stack trace alone is unfortunately not actionable (any deadlock would look like that for the blocked thread - we need to know what the other threads are up to).
Is there a way to grab that from gdb?
I usually use Process Explorer for minidump capture: https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer
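Besides the Process Explorer GUI, Sysinternals also ships a command-line tool, ProcDump, that can capture a full dump; a hedged sketch (the process name and output file are placeholders, and this tool is my suggestion rather than something named in the thread):

```shell
# Sysinternals ProcDump: -ma writes a full memory dump of the target.
# "julia.exe" and the output filename are illustrative placeholders.
procdump -ma julia.exe julia-deadlock.dmp
```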
Thanks! I will look into that and post back here when I've got it.
Alright, here's a link to the minidump of the process (created with Process Explorer by right-clicking on the Julia process and selecting "Create Minidump"): https://drive.google.com/file/d/18SqE_HtlE1UqeNao8soQ-NjWL9iA-ewv/view?usp=sharing
Yes, please post a Julia issue.
Alright, tests officially passed after julia/pull/34807. Thank you @ViralBShah and @Keno for the help here, and awesome work on the patch and debugging @Keno! Thanks very much! Psyched to finally close this!! |
Need to identify the cause. It is only exposed on Windows in Travis and Appveyor with Julia 1.3; it doesn't happen with Julia 1.2. @ranjanan or @ViralBShah any advice would be greatly appreciated!
Here's a post I made on the Julia Discourse with links to relevant code blocks in Omniscape.jl.