OCPBUGS-48177: UPSTREAM: <carry>: disable etcd readiness checks by default #2174
base: master
Conversation
@ingvagabund: This pull request references Jira Issue OCPBUGS-48177, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@ingvagabund: the contents of this pull request could not be automatically validated. The following commits could not be validated and must be approved by a top-level approver:
Force-pushed from e3ba771 to fb986c1
@ingvagabund: This pull request references Jira Issue OCPBUGS-48177, which is invalid:
/jira refresh
@ingvagabund: This pull request references Jira Issue OCPBUGS-48177, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
Force-pushed from fb986c1 to aed00d9
// This is a downstream patch only as OpenShift's way of using etcd is unique.
readyzChecks := []healthz.HealthChecker{}
for _, check := range healthChecks {
	if check.Name() == "etcd" || check.Name() == "etcd-readiness" {
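For context while reading the quoted diff, here is a minimal, self-contained sketch of the exclusion pattern the loop above appears to implement. It is an assumption here that matching checks are simply skipped rather than collected elsewhere, and the local HealthChecker interface only mirrors the one from k8s.io/apiserver/pkg/server/healthz:

package main

import (
	"fmt"
	"net/http"
)

// HealthChecker mirrors the interface from k8s.io/apiserver/pkg/server/healthz.
type HealthChecker interface {
	Name() string
	Check(req *http.Request) error
}

// namedCheck is a trivial checker used only for this illustration.
type namedCheck struct{ name string }

func (c namedCheck) Name() string                { return c.name }
func (c namedCheck) Check(_ *http.Request) error { return nil }

// filterEtcdChecks drops the etcd readiness checks before they would be wired
// into /readyz; all other checks are kept untouched.
func filterEtcdChecks(healthChecks []HealthChecker) []HealthChecker {
	readyzChecks := []HealthChecker{}
	for _, check := range healthChecks {
		if check.Name() == "etcd" || check.Name() == "etcd-readiness" {
			continue // assumption: excluded checks are simply skipped here
		}
		readyzChecks = append(readyzChecks, check)
	}
	return readyzChecks
}

func main() {
	checks := []HealthChecker{namedCheck{"ping"}, namedCheck{"etcd"}, namedCheck{"etcd-readiness"}}
	for _, c := range filterEtcdChecks(checks) {
		fmt.Println(c.Name()) // prints only "ping"
	}
}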
Instead of this patch, shouldn't we exclude the desired check directly from kas-o? (xref: https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/pod.yaml#L111)
I have not properly checked, though: are all KA instances always exposed through service endpoints or a load balancer? Cloud provider LB health check paths are not expected to accept any queries/params like ?exclude=... (AWS, Azure, GCP at least). Unless I misunderstood, both openshift-apiserver and oauth-apiserver rely on k8s.io/apiserver/pkg/server code, i.e. https://github.com/openshift/openshift-apiserver/blob/master/go.mod#L197 and https://github.com/openshift/oauth-apiserver/blob/master/go.mod#L144.
Is https://github.com/openshift/kubernetes-apiserver expected to be in sync with https://github.com/openshift/kubernetes?
I am pretty sure it will work because we already exclude etcd from the liveness probe (xref: https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/pod.yaml#L105)
All right. So this will work for KA instances and https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/svc.yaml. What about o-a and oauth-a? I was told they are added to a LB. Through:
- https://github.com/kubernetes-sigs/cluster-api-provider-aws
- https://github.com/kubernetes-sigs/cluster-api-provider-azure
- https://github.com/kubernetes-sigs/cluster-api-provider-gcp
which OpenShift relies on when creating the LBs. It seems to be a pattern to ignore/reject any such exclusion in the LB's health check paths: https://issues.redhat.com/browse/OCPBUGS-48177?focusedId=26397412&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26397412.
So even though ?exclude=... could be used for KA instances, it will not work for o-a/oauth-a. Also, disabling the checks directly in the code, instead of at every place /readyz is accessed, leaves no room for "forgotten" cases.
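To make the trade-off concrete, here is a small, self-contained sketch of the request-time alternative discussed above: a /readyz handler that honors an ?exclude=<check> query parameter, loosely modeled on the upstream healthz behavior. The handler, check names, and port are illustrative stand-ins, not the actual kube-apiserver code; the point is that a cloud LB health check probing a fixed path has no way to append such a parameter:

package main

import (
	"fmt"
	"net/http"
)

// check is a named readiness check; the names and logic are illustrative.
type check struct {
	name string
	run  func() error
}

// readyzHandler runs every check except those listed in ?exclude=...,
// mimicking the query-parameter escape hatch discussed above.
func readyzHandler(checks []check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		excluded := map[string]bool{}
		for _, name := range r.URL.Query()["exclude"] {
			excluded[name] = true
		}
		for _, c := range checks {
			if excluded[c.name] {
				continue // skipped only when the caller can pass query params
			}
			if err := c.run(); err != nil {
				http.Error(w, fmt.Sprintf("[-]%s failed: %v", c.name, err), http.StatusInternalServerError)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	}
}

func main() {
	checks := []check{
		{name: "ping", run: func() error { return nil }},
		{name: "etcd", run: func() error { return fmt.Errorf("etcd unavailable") }},
	}
	// GET /readyz              -> 500 because the etcd check fails
	// GET /readyz?exclude=etcd -> 200 ok
	http.Handle("/readyz", readyzHandler(checks))
	fmt.Println(http.ListenAndServe(":8080", nil))
}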
What about o-a and oauth-a?
We could also exclude the desired checks directly from the operators, for example:
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L122
I was told they are added to a LB
I don't think this is true. The extension API servers are proxied by the KAS.
So you are saying we do not rely on https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/elb/loadbalancer.go#L212?
We do but I think only for KAS.
We could verify it. Please create a cluster on AWS and then log in to the AWS console and check the setting for the LB.
Lukasz and I synced over a call to discuss this. Main points:
- kube-aggregator (as a proxy) discovers api services (through APIService objects) and registers a specific group to a set of api servers (e.g. openshift-apiserver, oauth-service)
- kube-aggregator service resolver consumes endpoints:
- https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/controllers/status/remote/remote_available_controller.go#L286C28-L286C43
- https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L46
- https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/util/proxy/proxy.go#L73C25-L73C33
- endpoints controller takes into account whether a pod is ready or not.
- https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/kube-apiserver/svc.yaml does not specify .spec.publishNotReadyAddresses, so only ready pods are added to the list of Addresses (the rest go to notReadyAddresses).
- https://github.com/kubernetes/kubernetes/blob/cd5f3d9f9d5ae3153206178e6114d573dc24ad73/staging/src/k8s.io/apiserver/pkg/util/proxy/proxy.go#L87 reads only the .Addresses field, so not-ready pods are excluded (see the sketch after this list).
- As a result, both the https://github.com/openshift/cluster-openshift-apiserver-operator/blob/8236e6ebd065e30dc479a770bb6674165f123f66/bindata/v3.11.0/openshift-apiserver/deploy.yaml#L132 and https://github.com/openshift/cluster-authentication-operator/blob/8538d46b59cca46f6b6987e0f16c946478204f06/bindata/oauth-apiserver/deploy.yaml#L123 /readyz probes need to be updated to ?exclude=etcd.
- The LB sends traffic to the host network -> kube-apiserver instances:
  - meaning the /readyz path is accessed directly (not via endpoints)
  - given the AWS/Azure/GCP LB health check path does not accept queries/params, /readyz needs to be patched within the kube-apiserver code, as is done in this PR.
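A minimal sketch of the Addresses vs. notReadyAddresses point above, using the real corev1.Endpoints types but hand-built sample data; it only illustrates why not-ready pods drop out of the aggregator's target set and is not the aggregator code itself:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// readyURLs mimics, in a very reduced form, what the aggregator's endpoint
// resolver does: it walks the Endpoints subsets and only considers Addresses,
// never NotReadyAddresses, so pods failing /readyz drop out of the proxy's
// target set.
func readyURLs(ep corev1.Endpoints, port int32) []string {
	var urls []string
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses { // NotReadyAddresses are never read
			urls = append(urls, fmt.Sprintf("https://%s:%d", addr.IP, port))
		}
	}
	return urls
}

func main() {
	// Hand-built sample data: one ready pod, one pod whose /readyz is failing.
	ep := corev1.Endpoints{
		Subsets: []corev1.EndpointSubset{{
			Addresses:         []corev1.EndpointAddress{{IP: "10.0.0.1"}},
			NotReadyAddresses: []corev1.EndpointAddress{{IP: "10.0.0.2"}},
		}},
	}
	fmt.Println(readyURLs(ep, 8443)) // [https://10.0.0.1:8443]
}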
/retest-required
I accept the argument for a patch like this, but are the test changes going to be painful to carry over time? Is there an alternative way to structure the test changes that reduces the risk that we someday break the test while resolving rebase conflicts?
Force-pushed from aed00d9 to 1b40bbf
Force-pushed from 1b40bbf to a6f4dc5
Force-pushed from a6f4dc5 to a8242d8
Explicitly exclude the etcd and etcd-readiness checks (OCPBUGS-48177) and have the etcd operator take responsibility for properly reporting etcd readiness. Justification: kube-apiserver instances get removed from a load balancer when etcd starts to report not ready (since KA's /readyz then reports not ready as well). Client connections can tolerate etcd unreadiness for longer than the readiness timeout, so it is not necessary to drop connections when etcd becomes ready again before a client connection times out naturally. This is a downstream-only patch, as OpenShift's way of using etcd is unique.
Force-pushed from a8242d8 to 30b315a
Good point. I have updated the tests to exclude the checks in the main test loop rather than in every case. This will help to reduce the drift.
/lgtm
I think Ben can be a DOWNSTREAM_APPROVER. I'll label this one while he gets a PR to add himself.
[APPROVALNOTIFIER] This PR is APPROVED. Approval requirements bypassed by manually added approval. This pull-request has been approved by: benluddy, ingvagabund. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
3 similar comments
/retest-required
/retest-required
/retest-required
/label acknowledge-critical-fixes-only
2 similar comments
@ingvagabund: The following test failed, say
What type of PR is this?
/kind feature
What this PR does / why we need it:
Explicitly exclude the etcd and etcd-readiness checks (OCPBUGS-48177) and have the etcd operator take responsibility for properly reporting etcd readiness. Justification: kube-apiserver instances get removed from a load balancer when etcd starts to report not ready (since KA's /readyz then reports not ready as well). Client connections can tolerate etcd unreadiness for longer than the readiness timeout, so it is not necessary to drop connections when etcd becomes ready again before a client connection times out naturally. This is a downstream-only patch, as OpenShift's way of using etcd is unique.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
- The etcd and etcd-readiness checks can't be just simply commented out/removed.
- handleRootHealth and InstallPathHandlerWithHealthyFunc are extended with extra arguments to inject the list of excluded checks (a rough sketch of the idea follows below).
- The checks are still passed through AddReadyzChecks, yet excluded from the final addition since AddReadyzChecks can be invoked from multiple places.
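As a rough illustration of the "inject the list of excluded checks" idea; the names and signatures below are simplified stand-ins and do not reproduce the actual handleRootHealth / InstallPathHandlerWithHealthyFunc signatures:

package main

import (
	"fmt"
	"net/http"
)

// checkFn is a simplified stand-in for healthz.HealthChecker.
type checkFn struct {
	name string
	run  func(*http.Request) error
}

// installReadyz wires /readyz with a baked-in exclusion list, so the excluded
// checks stay registered but never influence the reported readiness. The idea
// is that exclusion happens once in the server code, instead of relying on
// every caller to append ?exclude=... to the probe path.
func installReadyz(mux *http.ServeMux, excluded map[string]bool, checks ...checkFn) {
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		for _, c := range checks {
			if excluded[c.name] {
				continue // e.g. "etcd", "etcd-readiness"
			}
			if err := c.run(r); err != nil {
				http.Error(w, fmt.Sprintf("[-]%s failed", c.name), http.StatusInternalServerError)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	})
}

func main() {
	mux := http.NewServeMux()
	installReadyz(mux,
		map[string]bool{"etcd": true, "etcd-readiness": true},
		checkFn{name: "ping", run: func(*http.Request) error { return nil }},
		checkFn{name: "etcd", run: func(*http.Request) error { return fmt.Errorf("not ready") }},
	)
	fmt.Println(http.ListenAndServe(":8081", mux))
}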
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: