-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Manager switching upon quorum loss #155
base: master
Are you sure you want to change the base?
Conversation
internal/app/app.go
Outdated
@@ -42,6 +42,8 @@ type App struct { | |||
switchHelper mysql.ISwitchHelper | |||
} | |||
|
|||
var lostQuorumTime time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All usage places of the variable are in App
methods; I suppose it will beneficial to place it in either mysql.Cluster
or App
. Could we do that?
internal/config/config.go
Outdated
@@ -66,6 +66,8 @@ type Config struct { | |||
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"` | |||
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"` | |||
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"` | |||
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml: "waiting_recovery_network_timestamp"` | |||
DoNotAcquireLockByDisconnectedManager time.Duration `config:"do_not_acquire_lock_by_disconnected_manager" yaml:"do_not_acquire_lock_by_disconnected_manager"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's more clear and understandable to use positive names, and the variable could sound like a boolean one. Could we please rename it to something like ManagerElectionDelayAfterQuorumLoss
or ManagerLockAcquireDelayAfterQuorumLoss
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will rename this variables
|
||
lostQuorumDuration := time.Now().Sub(lostQuorumTime) | ||
if lostQuorumDuration < app.config.WaitingRecoveryNetworkTimestamp { | ||
app.logger.Debug("manager try to acquire lock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about providing more information regarding the quorum situation in this line? When we have a non-zero lostQuorumDuration
, it might be helpful to highlight that in the application logs. I suggest adding something like In the period of lost quorum <DURATION>
as a suffix to the logger text, as I believe it would make the situation more clear to the reader.
internal/app/app.go
Outdated
} else if lostQuorumDuration <= app.config.WaitingRecoveryNetworkTimestamp+app.config.DoNotAcquireLockByDisconnectedManager { | ||
// Manager cant AcquireLock in delay | ||
app.logger.Debugf( | ||
"manager lost quorum for %f, manager cant acquire lock for %f delay", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps there is a typo: cant
-> can't
. Could we amend it?
internal/app/app.go
Outdated
app.logger.Debugf( | ||
"manager lost quorum for %f, manager cant acquire lock for %f delay", | ||
lostQuorumDuration.Seconds(), | ||
app.config.WaitingRecoveryNetworkTimestamp.Seconds()+app.config.DoNotAcquireLockByDisconnectedManager.Seconds(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we consider writing something like lostQuorumDuration - app.config.WaitingRecoveryNetworkTimestamp
instead? If I understand correctly, the manager can't acquire lock
refers to the duration following the WaitingRecoveryNetworkTimestamp
period. So, wouldn't we need to subtract WaitingRecoveryNetworkTimestamp
from the total lostQuorumDuration
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that by reading this log, you can understand how long the old manager will not be able to acquire lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, manager can't acquire lock
refers to the duration DoNotAcquireLockByDisconnectedManager
As option, we can make this:
app.logger.Debugf(
"In the period of lost quorum %f: manager can`t acquire lock for %f of %f acquire cooldown",
lostQuorumDuration.Seconds(),
lostQuorumDuration.Seconds() - app.config.WaitingRecoveryNetworkTimestamp.Seconds(),
app.config.DoNotAcquireLockByDisconnectedManager.Seconds(),
)
internal/config/config.go
Outdated
@@ -66,6 +66,8 @@ type Config struct { | |||
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"` | |||
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"` | |||
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"` | |||
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term timestamp
generally refers to a specific moment in time rather than a duration. Perhaps we could consider renaming it to NetworkRecoveryWaitingDuration
or NetworkRecoveryWaitingPeriod
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, "delay" is the best option
internal/app/app.go
Outdated
// Lost quorum less than 15 (default WaitingRecoveryNetworkTimestamp) seconds ago | ||
// Just wait manager recover connection | ||
if lostQuorumDuration <= app.config.WaitingRecoveryNetworkTimestamp { | ||
app.logger.Debugf("manager lost quorum for %f, wait for revery network by manager", lostQuorumDuration.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps there is a typo: revery network
-> network recovery
. Could we amend it?
Could we create |
internal/app/app.go
Outdated
} | ||
|
||
lostQuorumDuration := time.Since(lostQuorumTime) | ||
if lostQuorumDuration < app.config.WaitingRecoveryNetworkTimestamp { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it will be more clear, if we read config values to local variables and use them instead of app.config....
internal/config/config.go
Outdated
@@ -66,6 +66,8 @@ type Config struct { | |||
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"` | |||
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"` | |||
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"` | |||
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not "Timestamp", but "Duration"/"Delay"
internal/config/config.go
Outdated
@@ -66,6 +66,8 @@ type Config struct { | |||
InfoFileHandlerInterval time.Duration `config:"info_file_handler_interval" yaml:"info_file_handler_interval"` | |||
RecoveryCheckInterval time.Duration `config:"recoverycheck_interval" yaml:"recoverycheck_interval"` | |||
ExternalCAFileCheckInterval time.Duration `config:"external_ca_file_check_interval" yaml:"external_ca_file_check_interval"` | |||
WaitingRecoveryNetworkTimestamp time.Duration `config:"waiting_recovery_network_timestamp" yaml:"waiting_recovery_network_timestamp"` | |||
DoNotAcquireLockByDisconnectedManager time.Duration `config:"do_not_acquire_lock_by_disconnected_manager" yaml:"do_not_acquire_lock_by_disconnected_manager"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AcquireLockCooldown
internal/app/app.go
Outdated
@@ -42,6 +42,8 @@ type App struct { | |||
switchHelper mysql.ISwitchHelper | |||
} | |||
|
|||
var lostQuorumTime time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put this inside App struct
internal/app/app.go
Outdated
@@ -607,6 +609,58 @@ func (app *App) stateManager() appState { | |||
return stateManager | |||
} | |||
|
|||
// Counting the number of HA hosts visible to the manager. | |||
visibleHAHostsCount := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make a method like "ShouldReleaseLock" or smthng like
internal/app/app.go
Outdated
return stateCandidate | ||
} | ||
} | ||
} else if workingHANodesCount > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not check if there are working nodes here...
I guess, only lostQuorumTime.IsZero() is needed
I want to supplement documentation with the variables. So, what do you think about the next definitions? ManagerElectionDelayAfterQuorumLoss is a period after quorum loss when nodes continue working in normal mode and acquire manager's lock in dcs. ManagerLockAcquireDelayAfterQuorumLoss is a period after quorum loss when nodes wait for the old manager to start working again and can't acquire locks in dcs. |
When manager lost connection, it lost quorum, than we need to make switchover
Todo`s: