Fix deadlock in syncer #100

Open · ChrisSchinnerl opened this issue Sep 26, 2024 · 7 comments

@ChrisSchinnerl (Member)

A random deadlock that prevents our syncer from shutting down.

https://github.com/SiaFoundation/renterd/actions/runs/11030847304/job/30636343643?pr=1574

stacktrace.txt

Originally posted by @peterjan in #92 (comment)

@lukechampine (Member)

Here's what I have so far:

The Syncer uses a WaitGroup to ensure that all its goroutines exit before Syncer.Close returns. The deadlock happens because wg.Wait is not returning. That could be caused by calling wg.Add more times than wg.Done, but that doesn't seem to be the case here: every wg.Add is immediately followed by a deferred wg.Done. This suggests that one of the goroutines simply isn't returning.
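For readers who don't have the code in front of them, the pattern in question looks roughly like this (a minimal sketch only, not the actual Syncer implementation; the names and structure are illustrative):

```go
package syncer

import (
	"net"
	"sync"
)

// Syncer here is a stand-in for the real type; only the shutdown
// machinery relevant to this issue is sketched.
type Syncer struct {
	l  net.Listener
	wg sync.WaitGroup
}

func (s *Syncer) runPeer(conn net.Conn) {
	defer conn.Close()
	// ... serve the peer until its connection/mux is closed ...
}

func (s *Syncer) acceptLoop() {
	for {
		conn, err := s.l.Accept()
		if err != nil {
			return // listener closed
		}
		s.wg.Add(1) // every Add is paired with a deferred Done
		go func() {
			defer s.wg.Done()
			s.runPeer(conn)
		}()
	}
}

func (s *Syncer) Close() error {
	// Closing the listener unblocks Accept; Wait then only hangs if some
	// runPeer goroutine never returns, which is the symptom in the trace.
	err := s.l.Close()
	s.wg.Wait()
	return err
}
```

With Add and Done paired like that, a hung Wait points at a peer-handling goroutine that is blocked somewhere and never returns.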

Looking at the stack trace, we can see that there is indeed one outstanding goroutine: 573687, which accepts an incoming connection and calls runPeer. This goroutine appears to be blocked on the mux's AcceptStream method. That's strange, because when mux.Close is called, AcceptStream should wake up, observe m.err != nil, and exit. Another strange thing is that the stack trace does not say that this goroutine has been blocked for very long. Maybe this is just an inconsistency in the trace; alternatively, it could mean that the goroutine is being woken repeatedly, but isn't exiting. This in turn would imply that m.err == nil, which would be very strange.
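To picture the wake-up path being described, the usual shape of this kind of code is a Cond-based loop like the sketch below (illustrative only; the actual go.sia.tech/mux internals differ). For the goroutine to keep waking without exiting, the loop would have to keep observing m.err == nil even after Close has run:

```go
package mux

import (
	"errors"
	"sync"
)

// ErrClosed, Stream, and Mux are illustrative stand-ins, not the real
// go.sia.tech/mux types; only the wake-up logic is sketched.
var ErrClosed = errors.New("mux closed")

type Stream struct{}

type Mux struct {
	mu      sync.Mutex
	cond    sync.Cond // cond.L is set to &mu in newMux
	err     error
	accepts []*Stream
}

func newMux() *Mux {
	m := &Mux{}
	m.cond.L = &m.mu
	return m
}

func (m *Mux) AcceptStream() (*Stream, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for len(m.accepts) == 0 && m.err == nil {
		m.cond.Wait() // Close sets m.err and Broadcasts to wake this up
	}
	if m.err != nil {
		return nil, m.err // normal shutdown path
	}
	s := m.accepts[0]
	m.accepts = m.accepts[1:]
	return s, nil
}

func (m *Mux) Close() error {
	m.mu.Lock()
	if m.err == nil {
		m.err = ErrClosed
	}
	m.cond.Broadcast() // wake anyone blocked in AcceptStream
	m.mu.Unlock()
	return nil
}
```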

I think the next step is to add some debug logic to the mux package, and use it for renterd's CI until we trigger the bug again. I'll get started on that.

@lukechampine (Member)

Actually, first things first: renterd is using [email protected], which is pretty outdated. Let's update it to v1.3.0 and see if the bug recurs.
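For reference, bumping the dependency is the standard Go module update, run from the renterd repo root (shown for illustration):

```
go get go.sia.tech/[email protected]
go mod tidy
```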

@ChrisSchinnerl (Member, Author)

> Actually, first things first: renterd is using [email protected], which is pretty outdated. Let's update it to v1.3.0 and see if the bug recurs.

I believe that this particular test run already uses v1.3.0 since we updated the dependency on the dev branch 6 days ago and the PR was created 4 days ago.

@lukechampine (Member)

I think not lol

```
created by go.sia.tech/mux/v2.newMux in goroutine 573687
    /home/runner/go/pkg/mod/go.sia.tech/[email protected]/v2/mux.go:377 +0x426
```
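The module path baked into that frame ([email protected]) is the version this run actually resolved; the same thing can be checked directly in the renterd checkout with standard Go tooling (shown for illustration):

```
go list -m go.sia.tech/mux
```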

@peterjan (Member) commented Oct 2, 2024

@lukechampine (Member) commented Oct 2, 2024

ok, so it's reproduced on v1.3.0. Good to know.

@peterjan (Member) commented Nov 1, 2024

Looks like it's only happening once in a blue moon though; encountered another one:

https://github.com/SiaFoundation/renterd/actions/runs/11612827367/job/32338279894?pr=1643
stacktrace.txt
