Old unreachable revision is causing new pods to get created when it should scale down #14115

Closed · SaschaSchwarze0 opened this issue Jun 22, 2023 · 6 comments · Fixed by #14309
Labels: area/autoscale, kind/bug

@SaschaSchwarze0 (Contributor) commented on Jun 22, 2023

/area autoscale

What version of Knative?

1.10 (but exists in older versions as well)

Expected Behavior

When an old revision is unreachable, no new pods should be created for it.

Actual Behavior

New (and many) pods are created if the old revision has a large minScale.

Steps to Reproduce the Problem

  1. kn quickstart kind (issue also happens outside of KinD)
  2. kn service create testsvc --image ghcr.io/knative/helloworld-go:latest --scale 9
  3. Start to watch the pods by using kubectl or a tool like k9s
  4. kn service update testsvc --scale 1

After the second revision is ready, it starts to terminate the old pods, but it also brings up many additional pods for the old revision. Depending on how long pods are stuck in Terminating, you can end up with more than 60 pods for the old revision.

--

The problem is the IsReady function of the ServerlessService, which returns false if metadata.generation differs from status.observedGeneration.
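
For reference, that check is roughly of the following shape (a sketch from memory, not the verbatim source; I am assuming the knative.dev/networking v1alpha1 types here):

	package sketch

	import (
		nv1alpha1 "knative.dev/networking/pkg/apis/networking/v1alpha1"
	)

	// sksIsReady mirrors the behavior described above: the SKS only counts as
	// ready when its Ready condition is True AND status.observedGeneration has
	// caught up with metadata.generation.
	func sksIsReady(sks *nv1alpha1.ServerlessService) bool {
		return sks.Generation == sks.Status.ObservedGeneration &&
			sks.Status.GetCondition(nv1alpha1.ServerlessServiceConditionReady).IsTrue()
	}

The generation guard is what briefly makes IsReady return false right after the SKS spec is updated, even though the Ready condition is still True.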

Because the SKS is temporarily not ready, the PodAutoscaler's SKSReady condition is set to Unknown.

The PodAutoscaler then goes Ready=False, because SKSReady is one of the conditions that feed into Ready.

The Revision inherits the PodAutoscaler status and sets Active=Unknown.

This in turn causes the PodAutoscaler's Reachability to be set to Unknown.

Once the PodAutoscaler is no longer marked as unreachable, the lower scale bound is set back to the min value from the annotation. For an unreachable revision it would otherwise always use min = 0 so that scale-down happens, but now it goes back up to 9.
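
To make that last step concrete, here is a simplified sketch of how reachability is supposed to gate the annotation-based minScale (my own paraphrase of the scale-bounds logic, not the actual code; the constants are assumptions for this sketch):

	package sketch

	// Reachability mirrors the PodAutoscaler spec field discussed above.
	type Reachability string

	const (
		ReachabilityReachable   Reachability = "Reachable"
		ReachabilityUnreachable Reachability = "Unreachable"
		ReachabilityUnknown     Reachability = ""
	)

	// lowerScaleBound: only an Unreachable PodAutoscaler ignores its minScale
	// annotation so it can drain to zero; the transient Unknown state from this
	// issue falls back to the annotation value (9 here), which re-triggers the
	// scale-up of the old revision.
	func lowerScaleBound(reachability Reachability, annotationMin int32) int32 {
		if reachability == ReachabilityUnreachable {
			return 0
		}
		return annotationMin
	}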

I will open a PR in a few minutes that changes IsReady of the ServerlessService to only check the condition. This fixes the problem, and I have not seen any negative impact. It does, however, undo the work of "IsReady should take ObservedGeneration into account" #8004, which sounds to me as if it addressed an abstract issue whereas I have a concrete one here. An alternative would be to leave IsReady unchanged and look at the condition directly in kpa.go.

So, I need somebody with experience to carefully assess whether there are side effects.

@dprotaso (Member) commented:

FYI - this is probably a duplicate of #13677

@dprotaso (Member) commented:

> The problem is the IsReady function of the ServerlessService, which returns false if metadata.generation differs from status.observedGeneration.

We can only guarantee that Ready=True is accurate when the observed generation is equal to metadata.generation.

> I will open a PR in a few minutes that changes IsReady of the ServerlessService to only check the condition. This fixes the problem, and I have not seen any negative impact.

> An alternative would be to leave IsReady unchanged and look at the condition directly in kpa.go.

I'd rather we not change the semantics of IsReady in this way, but find an alternative as you suggest.

@SaschaSchwarze0 (Contributor, author) commented:

Thanks @dprotaso for checking.

> An alternative would be to leave IsReady unchanged and look at the condition directly in kpa.go.

> I'd rather we not change the semantics of IsReady in this way, but find an alternative as you suggest.

That's fine. The alternative that I had in mind is to change https://github.com/knative/serving/blob/v0.37.2/pkg/reconciler/autoscaling/kpa/kpa.go#L169 from

	if sks.IsReady() {

to

	if sks.Status.GetCondition(nv1alpha1.ServerlessServiceConditionReady).IsTrue() {

Then we can leave IsReady unchanged but still ignore the "transition phase" caused by the generation difference (essentially a pending reconcile).
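
In context, the change would sit roughly like this (the surrounding reconciler code and the MarkSKS* helper names are paraphrased from memory; only the condition check itself is the actual proposal):

	package sketch

	import (
		nv1alpha1 "knative.dev/networking/pkg/apis/networking/v1alpha1"
		autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
	)

	// markSKSStatus paraphrases the branch around kpa.go#L169. The real
	// reconciler does more; only the condition check is the proposed change.
	func markSKSStatus(pa *autoscalingv1alpha1.PodAutoscaler, sks *nv1alpha1.ServerlessService) {
		// Checking the Ready condition directly means a pending reconcile
		// (metadata.generation != status.observedGeneration) no longer bounces
		// the PodAutoscaler to SKSReady=Unknown.
		if sks.Status.GetCondition(nv1alpha1.ServerlessServiceConditionReady).IsTrue() {
			pa.Status.MarkSKSReady()
		} else {
			pa.Status.MarkSKSNotReady("SKS is not ready")
		}
	}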

@dprotaso (Member) commented:

Let me think about the consequences of that change and get back to you.

@SaschaSchwarze0 (Contributor, author) commented:

> FYI - this is probably a duplicate of #13677

The scenario at least sounds different. There, the PodAutoscaler seems to stay in Unknown status for a longer time, while here it is only in that status for a millisecond at most.

This reminds me that I forgot to mention: eventually, in my scenario, an old revision does get fully cleaned up even without any code change. It just takes a few (or many) minutes, and in the meantime many pods have been created and deleted. A much higher value of max-scale-down-rate in the autoscaler config actually helps, since it allows the autoscaler to go from the current count to 1 directly.
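
To illustrate why it converges so slowly: as far as I understand the autoscaler, each scaling decision is floored at readyPods / max-scale-down-rate, so with the default rate of 2 a revision sitting at 60 pods can only roughly halve per step. A small sketch of that arithmetic (my simplification, not the real autoscaler code):

	package main

	import (
		"fmt"
		"math"
	)

	// stepDown applies the scale-down rate limit as I understand it: per
	// decision, the pod count cannot drop below readyPods / maxScaleDownRate
	// (rounded down). Not the real autoscaler code, just the arithmetic.
	func stepDown(readyPods int, maxScaleDownRate float64, target int) int {
		floor := int(math.Floor(float64(readyPods) / maxScaleDownRate))
		if target < floor {
			return floor
		}
		return target
	}

	func main() {
		// Default max-scale-down-rate of 2: 60 -> 30 -> 15 -> 7 -> 3 -> 1.
		for pods := 60; pods > 1; {
			pods = stepDown(pods, 2.0, 1)
			fmt.Println(pods)
		}
		// With max-scale-down-rate set to 100, 60 pods can drop to 1 directly.
		fmt.Println(stepDown(60, 100.0, 1))
	}

With a sufficiently high rate the floor drops below the target, so the revision can go straight to 1.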

@SaschaSchwarze0 (Contributor, author) commented on Aug 21, 2023
