add retry limiter to backoff function #1478

Open
wants to merge 1 commit into
base: master

@Tema Tema commented Oct 16, 2024

The current backoff policy does little to limit TiDB retries when retrieving Regions from PD while there are issues with Region metadata in PD:

image

This PR adds the ability to configure a global retry limiter for the Backoff function on a per-config basis. It also creates a new Backoff config dedicated to PD Region metadata calls, which will be used in TiDB in a separate PR:

BoPDRegionMetadata = NewConfigWithRetryLimit("pdRegionMetadata", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), NewRetryRateLimiter(10, 0.1), tikverr.NewErrPDServerTimeout(""))

The config above allows one retry per 10 previous successful calls (ratio 0.1), but limits the overall retry budget to 10. It always starts with a full budget of retries.
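The budget semantics described above can be pictured as a token bucket. The following is a minimal sketch under stated assumptions, not the code in this PR: the names `retryLimiter`, `newRetryLimiter`, `onSuccess`, and `tryRetry` are hypothetical, and the refill rule (one token with probability `ratio` per success, capped at `capacity`) is inferred from the description.

```go
package main

import (
	"math/rand"
	"sync/atomic"
)

// retryLimiter is a hypothetical token bucket for retries.
// It is an illustration only and does not reproduce the PR's RetryRateLimiter.
type retryLimiter struct {
	tokens   int32   // remaining retry budget, accessed atomically
	capacity int32   // upper bound on the budget
	ratio    float32 // probability that one success refills one token
}

// newRetryLimiter returns a limiter that starts with a full budget.
func newRetryLimiter(capacity int32, ratio float32) *retryLimiter {
	return &retryLimiter{tokens: capacity, capacity: capacity, ratio: ratio}
}

// onSuccess refills one token with probability ratio, never exceeding capacity.
func (l *retryLimiter) onSuccess() {
	if rand.Float32() >= l.ratio {
		return
	}
	for {
		cur := atomic.LoadInt32(&l.tokens)
		if cur >= l.capacity {
			return // bucket is full, drop the token
		}
		if atomic.CompareAndSwapInt32(&l.tokens, cur, cur+1) {
			return
		}
	}
}

// tryRetry consumes one token and reports whether a retry is allowed.
func (l *retryLimiter) tryRetry() bool {
	for {
		cur := atomic.LoadInt32(&l.tokens)
		if cur <= 0 {
			return false // budget exhausted
		}
		if atomic.CompareAndSwapInt32(&l.tokens, cur, cur-1) {
			return true
		}
	}
}
```

Under these assumptions, `newRetryLimiter(10, 0.1)` corresponds to the values in the description: the first 10 retries are always allowed, and after that a retry is possible only once successful calls have refilled the bucket.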

@ti-chi-bot ti-chi-bot bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Oct 16, 2024

ti-chi-bot bot commented Oct 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jackysp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 16, 2024

cfzjywxk commented Oct 18, 2024

@Tema
Thanks for helping with the improvements.

It's OK to introduce a rate-limiting mechanism for the kv-client. Finding the optimal balance between error handling, retry success, and avoiding overloading PD could be challenging, as could selecting a suitable default value that works for most scenarios.

Another approach is similar to TiKV health control feedback (as discussed in tikv/tikv#16297), where some processing capacity information is carried in PD responses and fed back to the KV client. Based on this feedback, the kv-client can then decide its concurrency control and rate-limiting strategy accordingly.


Tema commented Oct 18, 2024

Thanks cfzjywxk for the comment. I think it is not always possible for PD to provide this information to TiDB when it is completely overloaded. tikv/pd#8678 proposes a more sophisticated solution that covers that case as well. Maybe it is worth seeing whether it could be incorporated into tikv/tikv#16297, which you mentioned.
Anyway, these referenced solutions look too heavy and will take some time to productionize, while this PR is more of a simple stop-the-bleeding fix to prevent this problem ASAP.

@@ -96,6 +98,50 @@ func NewConfig(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFn
}
}

type RetryRateLimiter struct {

Exported functions and type definitions need doc comments, like

// RetryRateLimiter is used to limit retry times for PD requests.

or the lint check will fail.

cap int32
}

func NewRetryRateLimiter(cap int32, ratio float32) *RetryRateLimiter {

Ditto for the doc comment, and it would be better to explain the meaning of the input parameters.

Besides, would it be less expensive to use an int or uint type for ratio instead of a float value?

}
}

// add a token to the rate limiter bucket according to configured retry to success ratio and cap

The comment format should be like

// addRetryToken is a ...

that is, it needs to start with the function name.


// add a token to the rate limiter bucket according to configured retry to success ratio and cap
func (r *RetryRateLimiter) addRetryToken() {
if rand.Float32() < r.allowedRetryToSuccessRatio {

As mentioned above, would it be less expensive to use integer random values?
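One way to read this suggestion is to express the ratio as an integer "successes per token" and draw an integer random value instead of a float. The sketch below is only an illustration of that idea: `shouldAddToken` and its `successesPerToken` parameter are hypothetical names, not part of this PR, and whether the integer form is measurably cheaper would need a benchmark.

```go
package main

import "math/rand"

// shouldAddToken reports whether a successful call should refill one retry
// token. successesPerToken is a hypothetical integer replacement for the
// float ratio: 10 means "on average one token per 10 successes" (ratio 0.1).
// It must be >= 1, because rand.Int31n panics on a non-positive argument.
func shouldAddToken(successesPerToken int32) bool {
	return rand.Int31n(successesPerToken) == 0
}
```

With this form, a ratio of 0.1 becomes `shouldAddToken(10)`, at the cost of only being able to express ratios of the form 1/N.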

return false
}

func NewConfigWithRetryLimit(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFnCfg, retryRateLimiter *RetryRateLimiter, err error) *Config {

Ditto for the doc comment on this exported function.

BoTiFlashRPC = NewConfig("tiflashRPC", &metrics.BackoffHistogramRPC, NewBackoffFnCfg(100, 2000, EqualJitter), tikverr.ErrTiFlashServerTimeout)
BoTxnLock = NewConfig("txnLock", &metrics.BackoffHistogramLock, NewBackoffFnCfg(100, 3000, EqualJitter), tikverr.ErrResolveLockTimeout)
BoPDRPC = NewConfig("pdRPC", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), tikverr.NewErrPDServerTimeout(""))
BoPDRegionMetadata = NewConfigWithRetryLimit("pdRegionMetadata", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), NewRetryRateLimiter(10, 0.1), tikverr.NewErrPDServerTimeout(""))

As discussed in the previous comments, it would be a challenge to choose a default value that fits all kinds of scenarios. Do we have any tests supporting the choice of 10 and 0.1?
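As a rough guide to what the two values imply, assume the semantics stated in the PR description: the bucket starts full at capacity $C$, and each successful call adds one token with probability $\rho$ (dropped if the bucket is already full). The symbols $C$, $\rho$, $S$, and $R$ are notation introduced here for illustration. The number of retries $R$ granted while $S$ calls succeed is then bounded in expectation by:

```latex
\mathbb{E}[R] \le C + \rho S,
\qquad
C = 10,\ \rho = 0.1 \;\Rightarrow\; \mathbb{E}[R] \le 10 + 0.1\,S
```

So while no calls succeed ($S = 0$), this config grants at most 10 retries in total, and over 1000 successful calls the expected number of granted retries is at most 110. This only bounds the retry load the values allow; it does not show that they suit a particular workload.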
