-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fastward resume worker pool deadlock #298
Comments
Output that shows a spawn resume that completes without deadlock (from the same code version as the failing case) ...
<\details> |
We need another object that wraps the wait group to help debug wait group issues (see issue #35). |
The deadlock is proving t be intermittent. This is an example of fastward resume working ok
|
... and here is an example of the fastward failing due to panic caused by sending on closed channel: panic: send on closed channel
The intermittency means we have a race condition. |
It looks like the wrong go routine is closing the jobs channel. The workers send to the jobs channel, but none of them can close the channel as there are multiple writers. Therefore, the workers send their own finished notification to the worker pool. Whilst the pool is in its drain phase, it will countdown until there are no workers left (determined by the number of entries in the worker collection). When the worker collection is empty, it will close the jobs channel on behalf of the workers. However, this seems to create a race condition that only manifests in the fastward scenario. |
Issue #292 (support concurrent traversal), implemented workload acceleration using the worker pool. However, the fastward resume scenario is the only one that causes a deadlock and wait ing on the waitgroup always blocks indefinitely.
One thing that should also be noted, which may be related is the fact that at no point within an accelerated navigation session do we call Done() on the waitgroup for all the other scenarios that do appear to work. This seems counter-intuitive, the wait should only ever get unblocked, when the correct number of Done()'s have been invoked, so this is fishy and may be the hidden cause of the deadlock in a fastward resume scenario. We need to rationalise this and fix so that we can safely invoke Done() in the normal manner.
The ensync method is invoked as part of walk. This is probably unwise and should be delegated to the session. Invoking from walk means that there a complication because what happens during fasward is different to what happens in spawn. We should get rid of these types of differences wherever possible.
For the fastward scenario, note this code (commit: f69db56):
This looks wrong because we are invoking ensync inside resume, but walk also does this, which means we get an extra and invalid decoration. This is the reason why we should delegate to the session as by doing this we can be sure that the logic is correct and consistent between fastward and spawn.
Output that shows a fastward resume that timesout the unit test due to deadlock ...
<\details>
The text was updated successfully, but these errors were encountered: