
Cron schedule off-by-one error in desiredReplicas scaling #5263

Closed
OfficiallySomeGuy opened this issue Dec 6, 2023 · 10 comments
Labels
bug (Something isn't working) · stale (All issues that are marked as stale due to inactivity)

Comments

@OfficiallySomeGuy

Report

We are seeing an issue where KEDA does not consistently set the correct desiredReplicas in the HPA it creates for a ScaledObject. We use the cron trigger together with a fallback metric that should scale based on load if the cron schedule fails to behave as expected. We run a large deployment of Kubernetes clusters and see this come up occasionally across several of them.

Below is redacted YAML output of the ScaledObject, showing our cron trigger set to a desiredReplicas of 89 during the US peak.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
...
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 600
            type: Percent
            value: 10
          stabilizationWindowSeconds: 3600
  cooldownPeriod: 300
  maxReplicaCount: 89
  minReplicaCount: 44
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: <redacted>/v1alpha1
    kind: <redacted>
    name: <redacted>
  triggers:
  - metadata:
      desiredReplicas: "89"
      end: 5 18 * * 1-5
      start: 50 4 * * 1-5
      timezone: America/Los_Angeles
    type: cron
  - metadata:
      metricName: <redacted>
      query: <redacted>
      serverAddress: http://monitoring-prometheus.monitoring.svc:9090
      threshold: "50"
    metricType: Value
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is performed because triggers are active
    reason: ScalerActive
    status: "True"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  externalMetricNames:
  - s0-cron-America-Los_Angeles-504xx1-5-518xx1-5
  - s1-prometheus-<redacted>
  health:
    s1-prometheus-<redacted>:
      numberOfFailures: 0
      status: Happy
...

And the HPA is below

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
...
metadata:
  annotations:
    ...
    autoscaling.alpha.kubernetes.io/current-metrics: '[{"type":"External","external":{"metricName":"s0-cron-America-Los_Angeles-504xx1-5-518xx1-5","metricSelector":{"matchLabels":{"scaledobject.keda.sh/name":"<redacted>"}},"currentValue":"0","currentAverageValue":"1012m"}},{"type":"External","external":{"metricName":"s1-prometheus-<redacted>","metricSelector":{"matchLabels":{"scaledobject.keda.sh/name":"<redacted>"}},"currentValue":"6352m"}}]'
...
spec:
  maxReplicas: 89
  minReplicas: 44
  scaleTargetRef:
    apiVersion: <redacted>/v1alpha1
    kind: <redacted>
    name: <redacted>
status:
  currentReplicas: 88
  desiredReplicas: 88

The interesting thing I see about the HPA is that the metric for s0-cron-America-Los_Angeles-504xx1-5-518xx1-5 correctly identifies that it needs to be 1.012x the current desiredReplicas (1.012 × 88 ≈ 89).

"s0-cron-America-Los_Angeles-504xx1-5-518xx1-5" (target average value): 1012m / 1

The conditions on the HPA also seem reasonable:

Type            Status  Reason              Message
----            ------  ------              -------
AbleToScale     True    ReadyForNewScale    recommended size matches current size
ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from <redacted>
ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range

Looking for some help or advice on debugging this further. Unfortunately, a large number of the clusters we run are on Kubernetes 1.21 and therefore need to run this older KEDA version, but I can't find a similar issue scouring the known issues or the changelog, so I believe this may also be present in the current mainline.

Expected Behavior

The HPA desiredReplicas should be correctly set to 89.

Actual Behavior

The HPA sets desiredReplicas to 88 (off by one), even though the metric appears correct.

Steps to Reproduce the Problem

This issue appears to be intermittent; however, we are seeing it on multiple clusters.

  1. Create a ScaledObject with a cron trigger and maxReplicaCount equal to the scheduled desiredReplicas
  2. Leave the ScaledObject running for an extended period

Logs from KEDA operator

With debug logging enabled, KEDA seems to be running happily:

INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"<redacted>","namespace":"<redacted>"}, "namespace": "<redacted>", "name": "<redacted>", "reconcileID": "d45bc0ab-a321-4f96-afa5-4e2924e49123"}
DEBUG	Parsed Group, Version, Kind, Resource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"<redacted>","namespace":"<redacted>"}, "namespace": "<redacted>", "name": "<redacted>", "reconcileID": "d45bc0ab-a321-4f96-afa5-4e2924e49123", "GVK": "<redacted>/v1alpha1.<redacted>", "Resource": "<redacted>"}
DEBUG	ScaledObject is defined correctly and is ready for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"<redacted>","namespace":"<redacted>"}, "namespace": "<redacted>", "name": "<redacted>", "reconcileID": "d45bc0ab-a321-4f96-afa5-4e2924e49123"}
DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "<redacted>", "scaledObject.Namespace": "<redacted>", "scaleTarget.Name": "<redacted>", "Metrics Name": "s0-cron-America-Los_Angeles-504xx1-5-518xx1-5"}
DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "<redacted>", "scaledObject.Namespace": "<redacted>", "scaleTarget.Name": "<redacted>", "Metrics Name": "s1-prometheus-<redacted>"}

KEDA Version

2.8.1

Kubernetes Version

< 1.23

Platform

Other

Scaler Details

No response

Anything else?

No response

@OfficiallySomeGuy OfficiallySomeGuy added the bug Something isn't working label Dec 6, 2023
@zroubalik
Member

If the reported metric is correct, then we should probably check the HPA. It does the final calculation; maybe it's a rounding problem?

@OfficiallySomeGuy
Author

I see no errors on the HPA itself. I also increased maxReplicas to 90 to rule that out as a potential edge case.

@JorTurFer
Member

I think this issue is related to rounding and tolerance problems. Once you've reached 88 instances, you will never reach 89, because the variation (1012m over 1000m) is only 1.2%, while the HPA tolerance is 10%.
There is an issue in the kubernetes repo discussing this and asking for support for custom tolerances.
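
To illustrate the mechanism, here is a minimal sketch in Go of the HPA's tolerance gate (assuming the kube-controller-manager default of 0.1 for the --horizontal-pod-autoscaler-tolerance flag; the names desiredReplicas and defaultTolerance are invented for this example, not upstream identifiers):

package main

import (
	"fmt"
	"math"
)

// defaultTolerance assumes the kube-controller-manager default for
// --horizontal-pod-autoscaler-tolerance, which is 0.1 (10%).
const defaultTolerance = 0.1

// desiredReplicas approximates the HPA's core decision: if the usage
// ratio is within tolerance of 1.0, it keeps the current replica count;
// otherwise it scales to ceil(ratio * current).
func desiredReplicas(current int32, usageRatio float64) int32 {
	if math.Abs(1.0-usageRatio) <= defaultTolerance {
		return current // inside the dead band: no scale change
	}
	return int32(math.Ceil(usageRatio * float64(current)))
}

func main() {
	// Values from the HPA status above: 88 current replicas and an
	// observed average of 1012m against a target average of 1.
	fmt.Println(desiredReplicas(88, 1.012)) // prints 88: |1.012-1| = 0.012 <= 0.1
}

So once the workload lands on 88 replicas, the per-pod ratio the cron scaler produces (roughly 89/88 ≈ 1.012) sits inside that dead band, and the HPA never takes the final step to 89.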

@zroubalik
Member

A candidate for Troubleshooting/FAQ?

@JorTurFer
Member

> A candidate for Troubleshooting/FAQ?

Could be.

@OfficiallySomeGuy
Author

OfficiallySomeGuy commented Dec 14, 2023

Understood. I appreciate this is a known (or at least expected) issue; happy to take a stab at updating the docs if that would be helpful.

@zroubalik
Member

@OfficiallySomeGuy that would be great!

OfficiallySomeGuy added a commit to OfficiallySomeGuy/keda-docs that referenced this issue Jan 19, 2024
Reference kedacore/keda#5263, adding some more documentation for guiding debugging, and highlighting known issues

Signed-off-by: OfficiallySomeGuy <[email protected]>
@OfficiallySomeGuy
Author

Raised kedacore/keda-docs#1298; apologies for the delay. I'd request we leave this issue open to track the fact that this is unexpected behaviour.


stale bot commented Mar 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Mar 21, 2024

stale bot commented Mar 30, 2024

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Mar 30, 2024
@github-project-automation github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core Mar 30, 2024