Cron schedule off-by-one error in desiredReplicas scaling #5263
Comments
If the reported metric is correct, then we should probably check the HPA. It is doing the final calculation; maybe it's a rounding problem?
I see no errors on the HPA itself. I also increased maxReplicas to 90 to rule that out as a potential edge case.
I think this issue is related to rounding and tolerance. Once you've reached 88 instances, you'll never reach 89, because the variation (1012m over the 1000m target) is a ratio deviation of only 0.012 (1.2%), while the HPA tolerance is 10%.
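To make that concrete, here is a minimal Go sketch of the calculation the HPA controller performs for an average-value metric, assuming the default 10% tolerance. It is a simplification for illustration, not the exact upstream controller code (the real controller also handles readiness, missing metrics, and stabilization).

```go
// Minimal sketch of the HPA replica calculation for an average-value metric.
// The 10% tolerance is the Kubernetes default; treat this as illustrative only.
package main

import (
	"fmt"
	"math"
)

const tolerance = 0.10 // default HPA tolerance (10%)

// desiredReplicas keeps the current count when the usage ratio is within
// tolerance of 1.0, otherwise scales by the ratio and rounds up.
func desiredReplicas(current int32, metricValue, targetValue float64) int32 {
	usageRatio := metricValue / targetValue
	if math.Abs(1.0-usageRatio) <= tolerance {
		return current // deviation within tolerance: keep current replicas
	}
	return int32(math.Ceil(usageRatio * float64(current)))
}

func main() {
	// Values from this issue: 88 replicas, metric 1012m against a 1000m target.
	fmt.Println(desiredReplicas(88, 1.012, 1.0)) // prints 88, not 89
}
```

With the values from this issue the deviation is 0.012, well inside the 0.10 tolerance, so the controller keeps the replica count at 88 rather than rounding up to 89.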
A candidate for Troubleshooting/FAQ?
Could be.
Understood. I appreciate this is a known (or at least expected) issue; happy to take a stab at updating the docs if that would be helpful.
@OfficiallySomeGuy that would be great!
Reference kedacore/keda#5263, adding some more documentation to guide debugging and highlight known issues. Signed-off-by: OfficiallySomeGuy <[email protected]>
Raised kedacore/keda-docs#1298; apologies for the delay. I'd request we leave this issue open to track the fact that this is unexpected behaviour.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. |
Report
We are seeing an issue where KEDA does not consistently set the correct desiredReplicas in the ScaledObject HPA. We are using this with a fallback metric which should scale based on load if the cron schedule fails to behave as expected. We run a large deployment of multiple Kubernetes clusters and see this come up occasionally in several of them.

Below is a redacted YAML output of the ScaledObject, which shows our cron schedule set to 89 during the US peak.
And the HPA is below
The interesting thing I see about the HPA is that the metric for s0-cron-America-Los_Angeles-504xx1-5-518xx1-5 correctly identifies that it needs to be 1.012x the current desiredReplicas (1.012 x 88 ≈ 89):

"s0-cron-America-Los_Angeles-504xx1-5-518xx1-5" (target average value): 1012m / 1
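As a quick sanity check on that arithmetic, here is a tiny Go sketch that uses only the numbers from the HPA output above (nothing else is assumed):

```go
// Works through the arithmetic above using only the numbers from the HPA
// status: a reported 1012m average against a 1000m target with 88 replicas.
package main

import "fmt"

func main() {
	const (
		currentReplicas = 88.0
		metricMilli     = 1012.0 // observed average value (1012m)
		targetMilli     = 1000.0 // target average value (1000m = 1)
	)
	ratio := metricMilli / targetMilli
	implied := ratio * currentReplicas
	// Prints: ratio 1.012 implies 89.06 replicas (currently 88)
	fmt.Printf("ratio %.3f implies %.2f replicas (currently %.0f)\n", ratio, implied, currentReplicas)
}
```

So the metric implies roughly 89 pods; why the HPA nonetheless holds at 88 comes down to the tolerance behaviour discussed in the comments above.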
The conditions on the HPA also seem reasonable
I'm looking for some help or advice debugging this further. Unfortunately, a large number of the clusters we run are on Kubernetes 1.21 and therefore need this older KEDA version, but I can't find a similar issue when scouring the known issues or the changelog, so I believe this may also be an issue in the current mainline.
Expected Behavior
The HPA desiredReplicas is correctly set to 89.
Actual Behavior
The HPA sets desiredReplicas to 88 (off by one), even though the metric appears correct.
Steps to Reproduce the Problem
This issue appears to be intermittent; however, we are seeing it on multiple clusters.
Logs from KEDA operator
With debug logging enabled, KEDA seems to be running happily
KEDA Version
2.8.1
Kubernetes Version
< 1.23
Platform
Other
Scaler Details
No response
Anything else?
No response