Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix topic writer infinite reconnections #1006

Merged
merged 5 commits into from
Feb 1, 2024

Conversation

art22m
Copy link
Contributor

@art22m art22m commented Jan 18, 2024

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Hello!

I've created topic writer with StartTimeout 30 seconds and Default or Retry RetryPolicy.
Then with ip6tables I've blocked the YDB port to get transport error.
The retries are expected to stop after 30 seconds, but this does not happened.

The reason why this is happened is connectionTimeout*resetAttemptEmpiricalCoefficient overflow.
If connectionTimeout is time.Duration(math.MaxInt64) (that is always true, since we do not have options to set connectionTimeout), then its multiplication to empirical coefficient (=10) gives negative value (see playground https://goplay.space/#qlIeS6o3PCz).
As a result, function CheckResetReconnectionCounters always returns true. Since CheckResetReconnectionCounters gives true, startOfRetries in connectionLoop loop method always sets to w.clock.Now(). Thus, in CheckRetryMode there are no chance to get retriesDuration > settings.StartTimeout to stop reconnections. As a result, we always get infinite reconnections in Retry and Default modes.

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

Topic writer with StartTimeout set to X and Default or Retry RetryPolicy.
Topic writer gets any error in Retry mode or retryable error in Default mode.
After X time topic writer continue reconnections.

Issue Number: N/A

What is the new behavior?

Topic writer with StartTimeout set to X and Default or Retry RetryPolicy.
Topic writer gets any error in Retry mode or retryable error in Default mode.
After X time topic writer stop reconnections.

Other information

Condition startOfRetries.IsZero() is needed to set startOfRetries for the first time.
Since with infinite connection duration CheckResetReconnectionCounters will return false.

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (adb5de8) 67.64% compared to head (00fb002) 67.45%.
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1006      +/-   ##
==========================================
- Coverage   67.64%   67.45%   -0.20%     
==========================================
  Files         261      252       -9     
  Lines       24686    24513     -173     
==========================================
- Hits        16700    16535     -165     
+ Misses       7127     7112      -15     
- Partials      859      866       +7     
Flag Coverage Δ
53.99% <100.00%> (-0.29%) ⬇️
go-1.17.x ?
go-1.20.x 53.87% <100.00%> (-13.64%) ⬇️
go-1.21.x 67.34% <100.00%> (-0.07%) ⬇️
integration 53.99% <100.00%> (-0.29%) ⬇️
macOS 38.87% <100.00%> (+0.05%) ⬆️
ubuntu ?
unit 38.87% <100.00%> (-0.05%) ⬇️
windows ?
ydb-22.5 53.73% <100.00%> (-0.13%) ⬇️
ydb-23.1 53.63% <100.00%> (-0.12%) ⬇️
ydb-23.2 53.71% <100.00%> (-0.03%) ⬇️
ydb-23.3 53.79% <100.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rekby
Copy link
Member

rekby commented Jan 26, 2024

@art22m thanks for your PR, I have been see it and will check soon in details.

Fix broken test please.

@rekby
Copy link
Member

rekby commented Jan 30, 2024

Hello @art22m

Thanks for your pr and great work for research and describe the bug.

Function CheckResetReconnectionCounters used for detect when we have established stream and reset attempts counters - for good logging.
The PR solve bug with stop reconnection, but break logging/metric ux. Because in the case attempts count will increase infinite and will reset never.

What you mean about set some reasonable constant for check "connection established" state for case with connection timeout is infinite?

For example - 1 minute. It will be mean:
if last connetion attemp was earler, than 1 minute - then connection was fine established and worked. And reset attemp counter to zero.

@art22m
Copy link
Contributor Author

art22m commented Jan 31, 2024

Hi, thanks for feedback.

if last connetion attemp was earler, than 1 minute - then connection was fine established and worked. And reset attempt counter to zero.

May be I did not get your point, but should we check last connection was earlier then constant, not later?

I guess you want to compare constant with lastConnectionAttempt variable.
If the connection was fine stablished, and it worked for some time (reconnectReason = writer.WaitClose(ctx)), then lastConnectionAttempt will be larger then our constant and we should zero the attempts.
Contrariwise, if we do not established our connection, this variable always will be small due to this: prevAttemptTime = now

@rekby
Copy link
Member

rekby commented Feb 1, 2024

When reconnect timeout is infinite, then we should check: if duration since last attempt was more then constant.

I want detect a situation:

  1. Successfully establish a stream session
  2. Successfully receive init message
  3. Failed on fist (or one of firsts) message

And count the failure as new attempt for logs and for retry policy.

@art22m
Copy link
Contributor Author

art22m commented Feb 1, 2024

I've made one minute constant, but I guess right constant should be find empirically.

@rekby rekby merged commit eb49eb8 into ydb-platform:master Feb 1, 2024
40 of 42 checks passed
@rekby
Copy link
Member

rekby commented Feb 1, 2024

@art22m Thanks for the fix:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants