storage,kv: sudden leaseholder changes due to io overload shedding #134423

Closed
dt opened this issue Nov 6, 2024 · 3 comments · Fixed by #134441 or ebembi-crdb/cockroach#14
Labels
A-admission-control A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-admission-control Admission Control T-storage Storage Team

Comments

@dt
Member

dt commented Nov 6, 2024

On a test cluster I observed frequent, sudden, and drastic leaseholder movements: within a couple of seconds, a node would shed all of its leases because its IO overload score touched the threshold at which it does so.

Further investigation suggests this may be due to a number of concurrent, larger multi-level compactions (recently enabled) briefly occupying all the compaction slots, causing L0's level count to increase briefly and hit the threshold.

It seems like we should shed leases more gradually as overload signals rise rather than all at once, and that we should avoid using all of our compaction capacity on longer running multi-level compactions for periods so long that they starve out compactions required to keep L0 level counts healthy.
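
To make the "more gradually" idea concrete, here is a minimal sketch of a linear ramp between two thresholds, with a per-tick cap so that even a high score cannot move every lease at once. This is not CockroachDB's allocator code: `shedFraction`, `leasesToShed`, `startScore`, `fullScore`, and `maxPerTick` are all hypothetical names chosen for illustration.

```go
package main

// shedFraction maps an IO overload score to the fraction of leases a store
// would aim to shed. Hypothetical helper, not a CockroachDB API.
//   - at or below startScore: shed nothing
//   - at or above fullScore:  all leases are eligible
//   - in between:             ramp linearly from 0 to 1
func shedFraction(score, startScore, fullScore float64) float64 {
	if fullScore <= startScore {
		// Degenerate configuration: fall back to a single step threshold.
		if score >= fullScore {
			return 1
		}
		return 0
	}
	switch {
	case score <= startScore:
		return 0
	case score >= fullScore:
		return 1
	default:
		return (score - startScore) / (fullScore - startScore)
	}
}

// leasesToShed returns how many of the held leases to move in one tick.
// The per-tick cap (an assumed knob) spreads shedding over several ticks
// even when the ramp asks for a large fraction.
func leasesToShed(held int, score, startScore, fullScore float64, maxPerTick int) int {
	target := int(shedFraction(score, startScore, fullScore) * float64(held))
	if target > maxPerTick {
		target = maxPerTick
	}
	return target
}
```

With an assumed ramp from 0.25 to 0.75, a score of 0.5 asks for half of the leases, and a `maxPerTick` of 10 would limit a store holding 100 leases to moving 10 in that tick rather than 50.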

Jira issue: CRDB-44074

@dt dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. A-admission-control branch-master Failures and bugs on the master branch. GA-blocker T-admission-control Admission Control branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels Nov 6, 2024
@blathers-crl blathers-crl bot added the T-storage Storage Team label Nov 6, 2024
@github-project-automation github-project-automation bot moved this to Incoming in Storage Nov 6, 2024
@itsbilal
Member

itsbilal commented Nov 6, 2024

Pebble companion issue cockroachdb/pebble#4139

@itsbilal
Member

itsbilal commented Nov 6, 2024

I'll turn this issue into one about just disabling multilevel compactions in 24.3, while cockroachdb/pebble#4139 is about reenabling them with concurrency limits.
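
As a rough illustration of what "concurrency limits" could mean here, the sketch below caps how many compaction slots multilevel compactions may hold, so that some slots always remain for the compactions that keep L0 healthy. It is not Pebble's scheduler: `slotLimiter`, `tryAcquire`, `release`, and the `maxMultiLevel` knob are hypothetical, and the type is not safe for concurrent use without external locking.

```go
package main

// compactionKind distinguishes the two classes of compaction discussed here.
type compactionKind int

const (
	singleLevel compactionKind = iota
	multiLevel
)

// slotLimiter tracks compaction slots and reserves headroom by capping the
// number of slots that multilevel compactions may occupy. Hypothetical type.
type slotLimiter struct {
	totalSlots        int // total concurrent compactions allowed
	maxMultiLevel     int // assumed cap on concurrent multilevel compactions
	running           int // compactions currently running, of any kind
	runningMultiLevel int // multilevel compactions currently running
}

// tryAcquire reports whether a compaction of the given kind may start now,
// and records it as running if so.
func (l *slotLimiter) tryAcquire(k compactionKind) bool {
	if l.running >= l.totalSlots {
		return false
	}
	if k == multiLevel {
		if l.runningMultiLevel >= l.maxMultiLevel {
			return false
		}
		l.runningMultiLevel++
	}
	l.running++
	return true
}

// release returns the slot held by a finished compaction of the given kind.
func (l *slotLimiter) release(k compactionKind) {
	if k == multiLevel && l.runningMultiLevel > 0 {
		l.runningMultiLevel--
	}
	if l.running > 0 {
		l.running--
	}
}
```

With `totalSlots` of 3 and `maxMultiLevel` of 1, a second multilevel compaction is refused while the first runs, leaving two slots free for single-level work such as L0 compactions.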

craig bot pushed a commit that referenced this issue Nov 6, 2024
134346: sql: skip TestIndexBackfillMergeRetry under duress r=Dedej-Bergin a=Dedej-Bergin

This test fails under duress so we are skipping it.

Fixes: #134033
Release note: None

134441: storage: disable multilevel compactions r=jbowens a=itsbilal

In their current state, multilevel compactions can cause momentary spikes in L0 sublevels, resulting in undesirable side-effects elsewhere.

Fixes #134423.

Epic: none

Release note: None

Co-authored-by: Bergin Dedej <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
craig bot pushed a commit that referenced this issue Nov 6, 2024
134441: storage: disable multilevel compactions r=jbowens a=itsbilal

In their current state, multilevel compactions can cause momentary spikes in L0 sublevels, resulting in undesirable side-effects elsewhere.

Fixes #134423.

Epic: none

Release note: None

Co-authored-by: Bilal Akhtar <[email protected]>
@craig craig bot closed this as completed in 88a7276 Nov 6, 2024
@github-project-automation github-project-automation bot moved this from Incoming to Done in Storage Nov 6, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 6, 2024
In their current state, multilevel compactions can cause
momentary spikes in L0 sublevels, resulting in undesirable side-effects
elsewhere.

Fixes #134423.

Epic: none

Release note: None
@dt
Member Author

dt commented Nov 7, 2024

Do we need a new separate issue for more gradual ramp up of lease shedding?

ebembi-crdb added a commit to ebembi-crdb/cockroach that referenced this issue Nov 11, 2024
In their current state, multilevel compactions can cause
momentary spikes in L0 sublevels, resulting in undesirable side-effects
elsewhere.

Fixes cockroachdb#134423.

Epic: none

Release note (bug fix): Addressed a bug with DROP CASCADE that would occasionally panic with an undropped backref message on partitioned tables.
Projects
Status: Done
2 participants