Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Manager switching upon quorum loss #155

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

WithSoull
Copy link
Contributor

@WithSoull WithSoull commented Jan 31, 2025

When manager lost connection, it lost quorum, than we need to make switchover

Todo`s:

  • Make switchover manager to any available host
  • Make feature tests
  • Make additonal logic of switching to some host, which is the nearest to master
  • Make feature tests again

@@ -42,6 +42,8 @@ type App struct {
switchHelper mysql.ISwitchHelper
}

var lostQuorumTime time.Time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All usage places of the variable are in App methods; I suppose it will beneficial to place it in either mysql.Cluster or App. Could we do that?

@@ -66,6 +66,8 @@ type Config struct {
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"`
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"`
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"`
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml: "waiting_recovery_network_timestamp"`
DoNotAcquireLockByDisconnectedManager time.Duration `config:"do_not_acquire_lock_by_disconnected_manager" yaml:"do_not_acquire_lock_by_disconnected_manager"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's more clear and understandable to use positive names, and the variable could sound like a boolean one. Could we please rename it to something like ManagerElectionDelayAfterQuorumLoss or ManagerLockAcquireDelayAfterQuorumLoss?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will rename this variables


lostQuorumDuration := time.Now().Sub(lostQuorumTime)
if lostQuorumDuration < app.config.WaitingRecoveryNetworkTimestamp {
app.logger.Debug("manager try to acquire lock")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about providing more information regarding the quorum situation in this line? When we have a non-zero lostQuorumDuration, it might be helpful to highlight that in the application logs. I suggest adding something like In the period of lost quorum <DURATION> as a suffix to the logger text, as I believe it would make the situation more clear to the reader.

} else if lostQuorumDuration <= app.config.WaitingRecoveryNetworkTimestamp+app.config.DoNotAcquireLockByDisconnectedManager {
// Manager cant AcquireLock in delay
app.logger.Debugf(
"manager lost quorum for %f, manager cant acquire lock for %f delay",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps there is a typo: cant -> can't. Could we amend it?

app.logger.Debugf(
"manager lost quorum for %f, manager cant acquire lock for %f delay",
lostQuorumDuration.Seconds(),
app.config.WaitingRecoveryNetworkTimestamp.Seconds()+app.config.DoNotAcquireLockByDisconnectedManager.Seconds(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we consider writing something like lostQuorumDuration - app.config.WaitingRecoveryNetworkTimestamp instead? If I understand correctly, the manager can't acquire lock refers to the duration following the WaitingRecoveryNetworkTimestamp period. So, wouldn't we need to subtract WaitingRecoveryNetworkTimestamp from the total lostQuorumDuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that by reading this log, you can understand how long the old manager will not be able to acquire lock.

Copy link
Contributor Author

@WithSoull WithSoull Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, manager can't acquire lock refers to the duration DoNotAcquireLockByDisconnectedManager

As option, we can make this:

app.logger.Debugf(
	"In the period of lost quorum %f: manager can`t acquire lock for %f of %f acquire cooldown",
	lostQuorumDuration.Seconds(),
	lostQuorumDuration.Seconds() - app.config.WaitingRecoveryNetworkTimestamp.Seconds(),
	app.config.DoNotAcquireLockByDisconnectedManager.Seconds(),
)

@@ -66,6 +66,8 @@ type Config struct {
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"`
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"`
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"`
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the term timestamp generally refers to a specific moment in time rather than a duration. Perhaps we could consider renaming it to NetworkRecoveryWaitingDuration or NetworkRecoveryWaitingPeriod. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, "delay" is the best option

// Lost quorum less than 15 (default WaitingRecoveryNetworkTimestamp) seconds ago
// Just wait manager recover connection
if lostQuorumDuration <= app.config.WaitingRecoveryNetworkTimestamp {
app.logger.Debugf("manager lost quorum for %f, wait for revery network by manager", lostQuorumDuration.Seconds())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps there is a typo: revery network -> network recovery. Could we amend it?

@noname0443
Copy link
Collaborator

Could we create .feature tests for quorum loss scenario?

}

lostQuorumDuration := time.Since(lostQuorumTime)
if lostQuorumDuration < app.config.WaitingRecoveryNetworkTimestamp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it will be more clear, if we read config values to local variables and use them instead of app.config....

@@ -66,6 +66,8 @@ type Config struct {
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"`
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"`
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"`
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not "Timestamp", but "Duration"/"Delay"

@@ -66,6 +66,8 @@ type Config struct {
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"`
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"`
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"`
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"`
DoNotAcquireLockByDisconnectedManager time.Duration `config:"do_not_acquire_lock_by_disconnected_manager" yaml:"do_not_acquire_lock_by_disconnected_manager"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AcquireLockCooldown

@@ -42,6 +42,8 @@ type App struct {
switchHelper mysql.ISwitchHelper
}

var lostQuorumTime time.Time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put this inside App struct

@@ -607,6 +609,58 @@ func (app *App) stateManager() appState {
return stateManager
}

// Counting the number of HA hosts visible to the manager.
visibleHAHostsCount := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make a method like "ShouldReleaseLock" or smthng like

return stateCandidate
}
}
} else if workingHANodesCount > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not check if there are working nodes here...
I guess, only lostQuorumTime.IsZero() is needed

@noname0443
Copy link
Collaborator

I want to supplement documentation with the variables. So, what do you think about the next definitions?

ManagerElectionDelayAfterQuorumLoss is a period after quorum loss when nodes continue working in normal mode and acquire manager's lock in dcs.

ManagerLockAcquireDelayAfterQuorumLoss is a period after quorum loss when nodes wait for the old manager to start working again and can't acquire locks in dcs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants