OCPBUGS-48177: UPSTREAM: <carry>: disable etcd readiness checks by default #2174

Open
wants to merge 1 commit into base: master

Conversation

@ingvagabund ingvagabund commented Jan 16, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Explicitly exclude the etcd and etcd-readiness checks (OCPBUGS-48177) and have the etcd operator take responsibility for properly reporting etcd readiness. Justification: kube-apiserver instances get removed from a load balancer when etcd starts to report not ready (KA's /readyz then reports not ready as well). Client connections can withstand etcd unreadiness for longer than the readiness timeout. Thus, it is not necessary to drop connections if etcd regains readiness before a client connection times out naturally. This is a downstream-only patch, as OpenShift's way of using etcd is unique.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • Each health check is also registered as a readyz check. Thus, registration of the etcd and etcd-readiness checks can't simply be commented out or removed.
  • The logic for excluding checks through the ?exclude= URL construct does not distinguish between health, livez and readyz checks. So patching the code at the level of getExcludedChecks would require extending the underlying handleRootHealth and InstallPathHandlerWithHealthyFunc with extra arguments to inject the list of excluded checks.
  • I chose the middle ground of letting both checks be passed to AddReadyzChecks, yet excluding them from the final addition, since AddReadyzChecks can be invoked from multiple places.
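
The filtering idea in the notes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: the `HealthChecker` interface here is a local stand-in for `healthz.HealthChecker`, and `namedCheck`, `excludedFromReadyz`, and `filterReadyzChecks` are hypothetical names chosen for the example. The check names "ping" and "log" are only sample data.

```go
package main

import "fmt"

// HealthChecker is a local stand-in for healthz.HealthChecker.
// Assumption: only Name() matters for this sketch.
type HealthChecker interface {
	Name() string
}

// namedCheck is a hypothetical check that carries nothing but a name.
type namedCheck string

func (c namedCheck) Name() string { return string(c) }

// excludedFromReadyz lists the checks that stay registered as health
// checks but are skipped when the readyz list is built.
var excludedFromReadyz = map[string]bool{
	"etcd":           true,
	"etcd-readiness": true,
}

// filterReadyzChecks returns the checks that should still be added as
// readyz checks, preserving their original order.
func filterReadyzChecks(checks []HealthChecker) []HealthChecker {
	kept := make([]HealthChecker, 0, len(checks))
	for _, check := range checks {
		if excludedFromReadyz[check.Name()] {
			continue
		}
		kept = append(kept, check)
	}
	return kept
}

func main() {
	all := []HealthChecker{
		namedCheck("ping"),
		namedCheck("etcd"),
		namedCheck("etcd-readiness"),
		namedCheck("log"),
	}
	for _, check := range filterReadyzChecks(all) {
		fmt.Println(check.Name())
	}
}
```

Because the filter runs at the point where checks are added, it applies no matter how many call sites pass the etcd checks in.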

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@openshift-ci-robot openshift-ci-robot added backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 16, 2025
@openshift-ci-robot

@ingvagabund: This pull request references Jira Issue OCPBUGS-48177, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 16, 2025
@openshift-ci-robot

@ingvagabund: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot requested review from deads2k and p0lyn0mial January 16, 2025 13:07
@openshift-ci openshift-ci bot added the vendor-update Touching vendor dir or related files label Jan 16, 2025
@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from e3ba771 to fb986c1 Compare January 16, 2025 13:14
@ingvagabund

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 16, 2025
@openshift-ci-robot

@ingvagabund: This pull request references Jira Issue OCPBUGS-48177, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

@openshift-ci openshift-ci bot requested a review from wangke19 January 16, 2025 14:08
@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from fb986c1 to aed00d9 Compare January 17, 2025 09:23

// This is a downstream patch only as OpenShift's way of using etcd is unique.
readyzChecks := []healthz.HealthChecker{}
for _, check := range healthChecks {
if check.Name() == "etcd" || check.Name() == "etcd-readiness" {

Instead of this patch, shouldn't we exclude the desired check directly from kas-o? (xref: https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/pod.yaml#L111)

@ingvagabund ingvagabund Jan 17, 2025

I have not properly checked, though: are all KA instances always exposed through service endpoints or a load balancer? Cloud provider LB health check paths are not expected to accept any queries/params like ?exclude=... (AWS, Azure, GCP at least). Unless I misunderstood, both openshift-apiserver and oauth-apiserver rely on k8s.io/apiserver/pkg/server code, i.e. https://github.com/openshift/openshift-apiserver/blob/master/go.mod#L197 and https://github.com/openshift/oauth-apiserver/blob/master/go.mod#L144.

Is https://github.com/openshift/kubernetes-apiserver expected to be in sync with https://github.com/openshift/kubernetes?

I am pretty sure it will work because we already exclude etcd from the liveness probe (xref: https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/pod.yaml#L105)
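
For readers unfamiliar with the `?exclude=` mechanism discussed in this thread, here is a rough sketch of how a readiness endpoint could honor it. It is an assumption-laden illustration built only on the Go standard library, not the k8s.io/apiserver implementation; `check`, `readyzHandler`, `newDemoHandler`, and `probe` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// check is a hypothetical named readiness check.
type check struct {
	name string
	run  func() error
}

// readyzHandler answers 500 if any check that was not excluded fails.
// Exclusions come from repeated "exclude" query parameters.
func readyzHandler(checks []check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		excluded := map[string]bool{}
		for _, name := range r.URL.Query()["exclude"] {
			excluded[name] = true
		}
		for _, c := range checks {
			if excluded[c.name] {
				continue
			}
			if err := c.run(); err != nil {
				http.Error(w, fmt.Sprintf("[-]%s failed: %v", c.name, err), http.StatusInternalServerError)
				return
			}
		}
		fmt.Fprint(w, "ok")
	}
}

// newDemoHandler wires a passing "ping" check and a failing "etcd" check.
func newDemoHandler() http.Handler {
	return readyzHandler([]check{
		{name: "ping", run: func() error { return nil }},
		{name: "etcd", run: func() error { return errors.New("etcd unavailable") }},
	})
}

// probe returns the status code the handler produces for the given path.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := newDemoHandler()
	fmt.Println(probe(h, "/readyz"))              // 500: the etcd check fails
	fmt.Println(probe(h, "/readyz?exclude=etcd")) // 200: the etcd check is skipped
}
```

The exclusion here is chosen by whoever issues the request, which is why it only helps callers that can put a query string in their probe path.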

@ingvagabund ingvagabund Jan 17, 2025

All right. So this will work for KA instances and https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/svc.yaml. What about o-a and oauth-a? I was told they are added to a LB. Through:

which OpenShift relies on when creating the LBs. It seems to be a pattern to ignore/reject any such exclusion in the LB's health check paths: https://issues.redhat.com/browse/OCPBUGS-48177?focusedId=26397412&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26397412.

So even though ?exclude... could be used for KA instances, it will not work for o-a/oauth-a. Also, disabling the checks directly in the code, instead of at every place /readyz is accessed, leaves no room for "forgotten" cases.

What about o-a and oauth-a?

We could also exclude the desired checks directly from the operators, for example:
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L122

I was told they are added to a LB

I don't think this is true. The extension API servers are proxied by the KAS.

We do but I think only for KAS.

We could verify it. Please create a cluster on AWS and then log in to the AWS console and check the setting for the LB.

@ingvagabund ingvagabund Jan 20, 2025

Lukasz and I synced over a call to discuss this. Main points:

@ingvagabund

/retest-required

@benluddy

I accept the argument for a patch like this, but are the test changes going to be painful to carry over time? Is there an alternative way to structure the test changes that reduces the risk that we someday break the test while resolving rebase conflicts?

@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from aed00d9 to 1b40bbf Compare January 22, 2025 14:09
@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from 1b40bbf to a6f4dc5 Compare January 22, 2025 14:14
@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from a6f4dc5 to a8242d8 Compare January 22, 2025 14:25
@ingvagabund ingvagabund force-pushed the exclude-etcd-readiness-by-default branch from a8242d8 to 30b315a Compare January 22, 2025 14:26
@ingvagabund

ingvagabund commented Jan 22, 2025

Good point. I have updated the tests to exclude the checks in the main test loop rather than in every case. This will help reduce the drift.

@benluddy

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2025
@deads2k

deads2k commented Jan 22, 2025

I think Ben can be a DOWNSTREAM_APPROVER. I'll label this one while he gets a PR to add himself.

@deads2k deads2k added approved Indicates a PR has been approved by an approver from all required OWNERS files. backports/validated-commits Indicates that all commits come to merged upstream PRs. and removed backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. labels Jan 22, 2025

openshift-ci bot commented Jan 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: benluddy, ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ingvagabund

/retest-required

@ingvagabund

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Jan 30, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD cc13ce0 and 2 for PR HEAD 30b315a in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 13d3c9b and 1 for PR HEAD 30b315a in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 13d3c9b and 2 for PR HEAD 30b315a in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5562572 and 1 for PR HEAD 30b315a in total

openshift-ci bot commented Feb 1, 2025

@ingvagabund: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 30b315a | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5562572 and 2 for PR HEAD 30b315a in total

Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. backports/validated-commits Indicates that all commits come to merged upstream PRs. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. vendor-update Touching vendor dir or related files
5 participants