Fix topic writer infinite reconnections #1006

art22m · 2024-01-18T13:31:31Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Hello!

I created a topic writer with a StartTimeout of 30 seconds and the Default or Retry RetryPolicy.
Then I blocked the YDB port with ip6tables to provoke a transport error.
The retries are expected to stop after 30 seconds, but they did not.

The cause is an overflow in connectionTimeout*resetAttemptEmpiricalCoefficient.
If connectionTimeout is time.Duration(math.MaxInt64) (which is always the case, since there is no option to set connectionTimeout), then multiplying it by the empirical coefficient (=10) yields a negative value (see playground https://goplay.space/#qlIeS6o3PCz).
As a result, CheckResetReconnectionCounters always returns true, so startOfRetries in the connectionLoop method is always set to w.clock.Now(). Consequently, CheckRetryMode never sees retriesDuration > settings.StartTimeout and never stops the reconnections. The writer therefore reconnects indefinitely in Retry and Default modes.
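
The overflow described above can be reproduced in isolation. The following is a minimal sketch, not the SDK code: the function names are made up for illustration, and the coefficient is held in a variable because Go rejects an overflowing constant expression at compile time.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// resetCoefficient mirrors the empirical coefficient (=10) from the description.
// It is a variable so that the multiplication happens at run time.
var resetCoefficient int64 = 10

// overflowedThreshold multiplies an "infinite" timeout by the coefficient.
// Signed integer overflow wraps around in Go, so the product is negative.
func overflowedThreshold() time.Duration {
	connectionTimeout := time.Duration(math.MaxInt64)
	return connectionTimeout * time.Duration(resetCoefficient)
}

// alwaysResets reports whether the elapsed time exceeds the wrapped threshold.
// (Hypothetical helper; it imitates the comparison, not the SDK function.)
func alwaysResets(elapsed time.Duration) bool {
	return elapsed > overflowedThreshold()
}

func main() {
	fmt.Println(int64(overflowedThreshold())) // wrapped product in nanoseconds
	fmt.Println(alwaysResets(0))
	fmt.Println(alwaysResets(time.Hour))
}
```

Since (2^63 - 1) * 10 wraps to -10 in 64-bit arithmetic, the threshold becomes -10 nanoseconds, and every non-negative elapsed time compares as greater than it.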

Pull request type

Please check the type of change your PR introduces:

What is the current behavior?

A topic writer is created with StartTimeout set to X and the Default or Retry RetryPolicy.
The topic writer gets any error in Retry mode, or a retryable error in Default mode.
After time X has elapsed, the topic writer continues to reconnect.

Issue Number: N/A

What is the new behavior?

A topic writer is created with StartTimeout set to X and the Default or Retry RetryPolicy.
The topic writer gets any error in Retry mode, or a retryable error in Default mode.
After time X has elapsed, the topic writer stops reconnecting.

Other information

The startOfRetries.IsZero() condition is needed to set startOfRetries for the first time,
because with an infinite connection timeout CheckResetReconnectionCounters returns false.
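
To see how the reset check interacts with StartTimeout, here is a simplified model of the retry loop with a fake clock. It is a sketch under stated assumptions: resetFunc, buggyReset, fixedReset, and simulate are hypothetical names, every attempt is assumed to fail 5 seconds after the previous one, and the one-minute threshold in fixedReset is just an example value.

```go
package main

import (
	"fmt"
	"time"
)

// resetFunc stands in for CheckResetReconnectionCounters: given the time since
// the previous attempt, it reports whether the retry counters should be reset.
type resetFunc func(sinceLastAttempt time.Duration) bool

// buggyReset models the overflowed threshold: any non-negative duration exceeds -10ns.
func buggyReset(sinceLastAttempt time.Duration) bool {
	return sinceLastAttempt > -10*time.Nanosecond
}

// fixedReset models a check that resets only after a long healthy interval.
func fixedReset(sinceLastAttempt time.Duration) bool {
	return sinceLastAttempt > time.Minute
}

// simulate runs a loop in which every attempt fails 5 seconds after the
// previous one, with a 30-second StartTimeout. It returns the attempt on which
// the loop gave up, and false if it was still retrying after maxAttempts.
func simulate(shouldReset resetFunc) (attempt int, stopped bool) {
	const (
		step         = 5 * time.Second
		startTimeout = 30 * time.Second
		maxAttempts  = 1000
	)
	now := time.Unix(0, 0) // fake clock, advanced manually
	var startOfRetries, prevAttemptTime time.Time

	for attempt = 1; attempt <= maxAttempts; attempt++ {
		// Start the retry budget on the first attempt, or restart it when the
		// previous connection looked healthy according to shouldReset.
		if startOfRetries.IsZero() || shouldReset(now.Sub(prevAttemptTime)) {
			startOfRetries = now
		}
		prevAttemptTime = now

		if now.Sub(startOfRetries) > startTimeout {
			return attempt, true
		}
		now = now.Add(step)
	}
	return maxAttempts, false
}

func main() {
	n, stopped := simulate(fixedReset)
	fmt.Println("fixed:", n, stopped)
	n, stopped = simulate(buggyReset)
	fmt.Println("buggy:", n, stopped)
}
```

With these numbers the fixed variant gives up on the eighth attempt (35 simulated seconds after the first), while the buggy variant restarts its budget on every attempt and is still retrying after 1000 attempts.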

codecov-commenter · 2024-01-22T08:45:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (adb5de8) 67.64% compared to head (00fb002) 67.45%.
Report is 5 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1006      +/-   ##
==========================================
- Coverage   67.64%   67.45%   -0.20%     
==========================================
  Files         261      252       -9     
  Lines       24686    24513     -173     
==========================================
- Hits        16700    16535     -165     
+ Misses       7127     7112      -15     
- Partials      859      866       +7

Flag	Coverage Δ
	`53.99% <100.00%> (-0.29%)`	⬇️
go-1.17.x	`?`
go-1.20.x	`53.87% <100.00%> (-13.64%)`	⬇️
go-1.21.x	`67.34% <100.00%> (-0.07%)`	⬇️
integration	`53.99% <100.00%> (-0.29%)`	⬇️
macOS	`38.87% <100.00%> (+0.05%)`	⬆️
ubuntu	`?`
unit	`38.87% <100.00%> (-0.05%)`	⬇️
windows	`?`
ydb-22.5	`53.73% <100.00%> (-0.13%)`	⬇️
ydb-23.1	`53.63% <100.00%> (-0.12%)`	⬇️
ydb-23.2	`53.71% <100.00%> (-0.03%)`	⬇️
ydb-23.3	`53.79% <100.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown.

rekby · 2024-01-26T09:43:32Z

@art22m thanks for your PR. I have seen it and will review it in detail soon.

Please fix the broken test.

rekby · 2024-01-30T10:15:00Z

Hello @art22m

Thanks for your PR and for the great work researching and describing the bug.

The CheckResetReconnectionCounters function is used to detect when a stream has been established, so that the attempt counters can be reset for clean logging.
The PR fixes the bug with stopping reconnections, but it breaks the logging/metrics UX, because in this case the attempt count grows without bound and is never reset.

What do you think about using some reasonable constant to check the "connection established" state when the connection timeout is infinite?

For example, 1 minute. That would mean:
if the last connection attempt was earlier than 1 minute ago, then the connection was established and worked fine, and the attempt counter is reset to zero.

art22m · 2024-01-31T15:31:57Z

Hi, thanks for the feedback.

if the last connection attempt was earlier than 1 minute ago, then the connection was established and worked fine, and the attempt counter is reset to zero.

Maybe I did not get your point, but should we check that the last connection was earlier than the constant, not later?

I guess you want to compare the constant with the lastConnectionAttempt variable.
If the connection was established and worked for some time (reconnectReason = writer.WaitClose(ctx)), then lastConnectionAttempt will be larger than our constant and we should zero the attempts.
Conversely, if we did not establish the connection, this variable will always be small because of this: prevAttemptTime = now

rekby · 2024-02-01T05:44:29Z

When the reconnect timeout is infinite, we should check whether the duration since the last attempt is more than the constant.

I want to detect this situation:

Successfully establish a stream session
Successfully receive the init message
Fail on the first (or one of the first) messages

And count that failure as a new attempt for the logs and for the retry policy.

art22m · 2024-02-01T08:59:39Z

I've added a one-minute constant, but I guess the right value should be found empirically.
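
For readers following the thread, the check discussed above might look roughly like the sketch below. This is an illustration only, not the merged code: infiniteDuration, connectionEstablishedTimeout, and shouldResetCounters are assumed names, and the one-minute value is the constant mentioned in this comment.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	// infiniteDuration is an assumed sentinel meaning "no connection timeout".
	infiniteDuration = time.Duration(math.MaxInt64)
	// connectionEstablishedTimeout is a hypothetical name for the one-minute constant.
	connectionEstablishedTimeout = time.Minute
	// resetAttemptEmpiricalCoefficient is the coefficient (=10) from the PR description.
	resetAttemptEmpiricalCoefficient = 10
)

// shouldResetCounters reports whether enough time has passed since the last
// attempt to treat the previous connection as established and working.
func shouldResetCounters(sinceLastAttempt, connectionTimeout time.Duration) bool {
	if connectionTimeout == infiniteDuration {
		// Avoid the multiplication entirely: it would overflow and go negative.
		return sinceLastAttempt > connectionEstablishedTimeout
	}
	// Note: this sketch does not guard finite timeouts above MaxInt64/10,
	// which would still wrap around.
	return sinceLastAttempt > connectionTimeout*resetAttemptEmpiricalCoefficient
}

func main() {
	fmt.Println(shouldResetCounters(30*time.Second, infiniteDuration))
	fmt.Println(shouldResetCounters(2*time.Minute, infiniteDuration))
	fmt.Println(shouldResetCounters(11*time.Second, time.Second))
}
```

With an infinite timeout, a failure 30 seconds after the previous attempt counts as another attempt in the same series, while a failure after two minutes is treated as following a healthy connection and resets the counters.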

rekby · 2024-02-01T12:08:58Z

@art22m Thanks for the fix:)

art22m added 3 commits January 18, 2024 15:52

fix inf reconnections

d74f9fa

small refactoring

999f747

rm debug print

00fb002

fix test

e8f20d4

fix reviews

2690bff

rekby merged commit eb49eb8 into ydb-platform:master Feb 1, 2024
40 of 42 checks passed
