
Fix cpool_delete() #8627

Conversation

rickard-green
Contributor

When setting the modification marker on the 'prev' field of a carrier that is to be deleted from a pool, we back off and wait for the content of the field to take on the expected value if it did not have it from the beginning. Due to a copy-paste bug, when this happened we waited on a completely different memory location, which caused the scheduler thread doing this to get stuck forever. This is obviously a very rare scenario, since the bug has been present for 11 years without being reported.

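To illustrate the class of bug being described, here is a minimal, self-contained sketch of the back-off-and-wait pattern. It is not the actual erl_alloc_util.c code; the names (carrier_t, MOD_MARK, mod_mark_prev, backoff) are hypothetical and exist only to show how polling the wrong field in the wait loop can make the loop spin forever.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MOD_MARK ((uintptr_t)1)      /* low bit used as the modification marker */

/* Hypothetical pooled carrier with atomic link fields. */
typedef struct carrier {
    _Atomic uintptr_t prev;          /* predecessor pointer + marker bit */
    _Atomic uintptr_t next;          /* successor pointer + marker bit */
} carrier_t;

static void backoff(void)
{
    /* CPU pause / yield; elided in this sketch. */
}

/*
 * Set the modification marker on crr->prev before unlinking the carrier.
 * If another thread currently holds the marker, back off and re-read the
 * field until an unmarked (expected) value is seen.  The copy-paste bug
 * described above amounted to re-reading a different memory location in
 * the wait loop, so the expected value could never be observed and the
 * scheduler thread spun forever.
 */
static uintptr_t mod_mark_prev(carrier_t *crr)
{
    for (;;) {
        uintptr_t val = atomic_load(&crr->prev);
        while (val & MOD_MARK) {
            backoff();
            val = atomic_load(&crr->prev);   /* must re-read 'prev', not some other field */
        }
        if (atomic_compare_exchange_weak(&crr->prev, &val, val | MOD_MARK))
            return val;                      /* marker set; val holds the old link value */
    }
}
```

In this framing, the fix is simply to make the wait loop poll the same field it is trying to mark.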
@rickard-green added the team:VM (Assigned to OTP team VM) and fix labels Jun 28, 2024
@rickard-green changed the base branch from master to maint June 28, 2024 21:34
Contributor

github-actions bot commented Jun 28, 2024

CT Test Results

    3 files    141 suites   49m 1s ⏱️
1 589 tests: 1 539 passed ✅, 49 skipped 💤, 1 failed ❌
2 290 runs: 2 220 passed ✅, 69 skipped 💤, 1 failed ❌

For more details on these failures, see this check.

Results for commit e176896.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

@rickard-green linked an issue Jun 28, 2024 that may be closed by this pull request
@rickard-green added the testing (currently being tested, tag is used by OTP internal CI) label Jun 28, 2024
@sverker merged commit fde025a into erlang:maint Jul 2, 2024
16 of 18 checks passed
@sverker
Contributor

sverker commented Jul 3, 2024

Merged to maint and master for OTP-27.1 and OTP-28.0.
Also scheduled for the next OTP-26.2.5.*, OTP-25.3.2.* and OTP-24.3.4.* patch releases.

@martinsumner

martinsumner commented Jul 12, 2024

On 21st June we had an issue on a single node in a non-production cluster (running Riak with Erlang 24.3.4.17).

On this node, 2 CPU cores (numbers 5 and 12 of 16) went immediately to 100% usage, despite a relatively small amount of background traffic (one would expect about 5% utilisation at that time). The two cores then remained locked at 100% CPU usage (all user time, no sys or wait time) and memory escalated. 23 minutes later a third core suddenly went to 100% utilisation.

Eventually the memory usage escalated to such an extent that the OOM killer intervened. As this was a single node in the cluster, no operator intervened (as the cluster overall continued to operate).

The question we have is whether this is the same issue as #8613 and may be resolved by this PR. Our investigation has ruled out known causes within Riak (but not unknown ones!) and the other potential issues we've seen with the VM (such as the 100% core utilisation caused by a hanging remote shell, #4343).

We don't have any detailed debugging output from the node at the time, so I appreciate that this might be an impossible question to answer. Primarily I'm interested to know: if we see these symptoms again, what should we do to gather information that would help determine whether this correlates with the issue (and so would be fixed by this PR)?

Note that this is the first known instance we've seen of these conditions, but we recently moved from OTP 22 to OTP 24.
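Not official OTP guidance, but one low-cost way to capture relevant evidence if the symptoms recur: the stuck schedulers are ordinary OS threads of the beam.smp process, so per-thread CPU usage plus native stack traces will show whether any scheduler is looping in cpool_delete (the signature of the linked issue). A sketch, assuming gdb is installed on the host and <beam-pid> is the emulator's OS pid:

```sh
# Identify which beam.smp threads are pinned at 100% CPU.
top -H -b -n 1 -p <beam-pid>

# Dump native backtraces for all threads (pauses the VM only briefly).
gdb -p <beam-pid> -batch -ex 'thread apply all bt' > beam_thread_backtraces.txt

# Check whether any scheduler thread is inside the carrier-pool code.
grep -n 'cpool_delete' beam_thread_backtraces.txt
```

A backtrace showing a scheduler thread spinning inside cpool_delete would strongly suggest the bug fixed by this PR; backtraces elsewhere would point to a different cause.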

Development

Successfully merging this pull request may close these issues.

Schedulers stuck in cpool_delete and ets locks