Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: don't propose config changes with regressed lead support #132150

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Oct 8, 2024

Fixes #125355.

If the lead support has not caught up from the previous configuration, we must not propose another configuration change. Doing so would compromise the lead support promise made by the previous configuration and used as an expiration of a leader lease. Instead, we wait for the lead support under the current configuration to catch up to the maximum lead support reached under the previous config. If the lead support is never able to catch up, the leader will eventually step down due to CheckQuorum.

As a reminder, if we reject a config change, it will be reproposed again shortly, at which point the lead support will likely have caught up.

As a second reminder, we fortify both voters and learners. We also always add new replicas as learners before promoting them to voters. The combination here means that a leader will commonly already be supported a voter when it is promoted added to the lead support computation.

Release note: None

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner October 8, 2024 02:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leadSupportConfigChange branch from 2f4d137 to e0d641f Compare October 8, 2024 03:01
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leadSupportConfigChange branch from e0d641f to f4b2404 Compare October 16, 2024 05:50
@nvanbenschoten
Copy link
Member Author

@arulajmani I've rebased this on master and added unit tests + datadriven tests, so this should be good for a review.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 12 of 12 files at r5, 4 of 4 files at r6, 13 of 13 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/raft/testdata/confchange_fortification_safety.txt line 114 at r7 (raw file):

# Try to demote v2. This should be rejected because v1's lead support under the
# current config has not caught up to the previous config.

nit: consider printing out the fortification-state to make this easy to glean.


pkg/raft/testdata/confchange_fortification_safety.txt line 177 at r7 (raw file):

  4->1 MsgAppResp Term:1 Log:0/5 Commit:4

# Try to demote v2 again. This time, the proposal should be allowed.

nit: maybe fortification-state here as well.


pkg/raft/tracker/fortificationtracker_test.go line 547 at r7 (raw file):

		{
			afterConfigChange: func(sl *mockStoreLiveness, ft *FortificationTracker) {
				// r4 fortified, but support still lagging.

nit: is this meaningfully different than the previous test case?

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r=arulajmani

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/raft/testdata/confchange_fortification_safety.txt line 114 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: consider printing out the fortification-state to make this easy to glean.

Nice idea. Done.


pkg/raft/testdata/confchange_fortification_safety.txt line 177 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: maybe fortification-state here as well.

Done.


pkg/raft/tracker/fortificationtracker_test.go line 547 at r7 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: is this meaningfully different than the previous test case?

Not meaningfully. Removed.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r8, 4 of 4 files at r9, 6 of 6 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

@craig
Copy link
Contributor

craig bot commented Oct 16, 2024

Merge conflict.

The test was only adding a single node, so the name was misleading.

Epic: None
Release note: None
This commit updates the confchange_v2_add_single_implicit test to start
with 2 nodes so that the addition of a third voter does not regress the
lead support and trip up the automatic transition out of the joint
configuration.

Epic: None
Release note: None
Fixes cockroachdb#125355.

If the lead support has not caught up from the previous configuration,
we must not propose another configuration change. Doing so would
compromise the lead support promise made by the previous configuration
and used as an expiration of a leader lease. Instead, we wait for the
lead support under the current configuration to catch up to the maximum
lead support reached under the previous config. If the lead support is
never able to catch up, the leader will eventually step down due to
CheckQuorum.

As a reminder, if we reject a config change, it will be reproposed again
shortly, at which point the lead support will likely have caught up.

As a second reminder, we fortify both voters and learners. We also
always add new replicas as learners before promoting them to voters. The
combination here means that a leader will commonly already be supported
a voter when it is promoted added to the lead support computation.

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/leadSupportConfigChange branch from 0b6cb2a to 8105ddb Compare October 16, 2024 20:35
@nvanbenschoten
Copy link
Member Author

bors r+

@craig craig bot merged commit 88faab9 into cockroachdb:master Oct 16, 2024
22 of 23 checks passed
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/leadSupportConfigChange branch October 22, 2024 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv/raft: support configuration changes for leader leases
3 participants