kubeadm's etcd client member add / remove can return errors but server side there could be success #3111

Closed
neolit123 opened this issue Sep 12, 2024 · 5 comments
Labels
area/etcd kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

@neolit123
Member

neolit123 commented Sep 12, 2024

had a discussion offline with @ahrtr

basically, errors like this one cannot be trusted, because etcd is a distributed system:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/etcd/etcd.go#L436

what can happen:

  • client sends a member add request to the server
  • server returns a "context deadline" error to the client. the error could be due to e.g. network blips or slow infra.
  • server adds the member regardless
  • client retries the member add in a poll because of the error, but the member is already there

the solution is to check the member list for the given peer URL before any add (learner or normal) or remove operation.

here:
https://github.com/kubernetes/kubernetes/blob/release-1.31/cmd/kubeadm/app/util/etcd/etcd.go#L430-L431
https://github.com/kubernetes/kubernetes/blob/release-1.31/cmd/kubeadm/app/util/etcd/etcd.go#L361

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/etcd labels Sep 12, 2024
@neolit123 neolit123 added this to the v1.32 milestone Sep 12, 2024
@neolit123
Member Author

cc @pacoxu
this was a surprise, but apparently it can happen in rare cases.

@pacoxu
Member

pacoxu commented Sep 20, 2024

We locked the feature gate (FG) to true in kubernetes/kubernetes#126374 this release cycle. The promotion can be reverted if needed.

Should we keep it beta for another 2 or 3 releases?

@SataQiu
Member

SataQiu commented Sep 20, 2024

I think it's ok to promote etcd learner mode to GA, but we need to ensure it works well.
Perhaps the fix also needs to be cherry-picked. WDYT? @neolit123

@neolit123
Member Author

+1 to promote to GA and backport the fix

@neolit123
Member Author

fixed for 1.32 and backported to >= 1.28 thanks to @SataQiu
