Add threshold for scale down #1038
Conversation
- script: make undeploy-prometheus || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus; fi
- script: make undeploy-prometheus-adapter || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus-adapter; fi
Do we run "make undeploy-prometheus" and "make undeploy-prometheus-adapter" at the end of leg-12? If not, are all the generated prometheus objects in one namespace, and how can we make sure the test objects are cleaned up?
prometheus and prometheus-adapter are at the same level as the operator, i.e. a new instance is created for each (leg-12) run and lives for the entire run (only one prometheus/adapter is used by all the test cases that need it). We are not explicitly cleaning them up because they will be removed with the kind cluster, just like the operator.
After each test (running locally), we do not remove the operator, so the prometheus objects will not be cleaned up. We should have a follow-up low-priority jira to make sure those objects are cleaned.
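For the follow-up, the cleanup could simply reuse the existing make targets; a minimal sketch, assuming it runs as the last TestStep of the leg and that the targets are safe to call when nothing is deployed:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  # Best-effort teardown of the shared prometheus objects; "|| true" keeps the
  # step from failing if they were never deployed for this run.
  - script: make undeploy-prometheus-adapter || true
  - script: make undeploy-prometheus || true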
kind: TestStep
commands:
  - script: |
      echo "sleep 120 seconds before starting to verify metrics"
Why do we need to sleep 120 seconds before verifying the metrics?
Because it may take some time for the metrics to show up in the adapter. I will make it 60s and see.
Okay, good to know this.
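As an alternative to a fixed sleep, the step could poll until the metric is exposed. This is only a sketch, assuming prometheus-adapter serves the custom.metrics.k8s.io/v1beta1 API group and that vertica_sessions_running_total is the metric being waited on:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      set -o errexit
      # Poll the custom metrics API until the metric shows up (max ~120s),
      # instead of always sleeping for the full 120 seconds.
      for i in $(seq 1 24); do
        if kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep -q vertica_sessions_running_total; then
          echo "metric is available"
          exit 0
        fi
        sleep 5
      done
      echo "timed out waiting for the metric"
      exit 1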
set -o errexit
set -o xtrace

kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
Using a for-loop would make it easier to see how many connections we created.
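For illustration, the same sessions could be started from a loop so the connection count is explicit; this is the vsql command from the step below, with the count of 3 just mirroring the current number of backgrounded sessions:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      set -o errexit
      set -o xtrace
      # Start N long-running sessions in the background; the loop bound makes
      # the number of connections easy to see and change.
      for i in 1 2 3; do
        kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
      done
      # Give the sessions time to register before metrics are checked.
      sleep 60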
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
sleep 60 |
What is this sleep for?
Without this, the pod would complete immediately and those sessions wouldn't stay active.
name: vertica_sessions_running_total
target:
  type: AverageValue
  averageValue: 5
If the user doesn't specify "scaleDownThreshold", will we not scale down any more? If so, we probably need to make "scaleDownThreshold" required or use a default value for it.
We will scale down based on the same threshold that is also used for scale up.
After scaling up, the current value will fall below the target value, triggering a scale-down? This cycle could repeat and cause fluctuations. Is this how the old auto-scaler works?
Yes, but it has a stabilization window to mitigate that, so it is not instantaneous. Usually, to make autoscaling stable, you would set scale-up and scale-down stabilization windows (default: 0 for scale up and 5 min for scale down). That will prevent fluctuations.
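For reference, a sketch of those windows on a plain autoscaling/v2 HorizontalPodAutoscaler; the scaleTargetRef, name, and replica bounds are placeholders, and the metric matches the one used in this test:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: v-scale-down-threshold-hpa     # placeholder name
spec:
  scaleTargetRef:
    apiVersion: vertica.com/v1beta1    # assumed group/version of the scaled resource
    kind: VerticaAutoscaler
    name: v-scale-down-threshold
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          name: vertica_sessions_running_total
        target:
          type: AverageValue
          averageValue: 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Kubernetes default: react to scale-up immediately
    scaleDown:
      stabilizationWindowSeconds: 300  # Kubernetes default: wait 5 minutes before scaling down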
targetSize: 3
status:
  scalingCount: 1
  currentSize: 3
Can we verify minReplicas and maxReplicas here as well?
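If the generated hpa lives in the test namespace, a kuttl assert along these lines could cover that; the object name and the replica bounds here are guesses, not values taken from this test:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: v-scale-down-threshold         # placeholder: whatever name the operator gives the hpa
spec:
  minReplicas: 3                       # placeholder values
  maxReplicas: 6

One caveat: with this change the operator rewrites minReplicas while the metric is above scaleDownThreshold, so an assert on minReplicas would have to run at a point where its value is predictable.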
@@ -173,13 +179,22 @@ type VerticaAutoscalerConditionType string
const (
	// TargetSizeInitialized indicates whether the operator has initialized targetSize in the spec
	TargetSizeInitialized VerticaAutoscalerConditionType = "TargetSizeInitialized"
	// ScalingActive indicates that the horizontal pod autoscaler can fetch the metric
	// and is ready for whenever scaling is needed.
	ScalingActive VerticaAutoscalerConditionType = "ScalingActive"
How would this condition work? I cannot find where we assign the value to this condition.
In verifyhpa_reconciler.go.
It is set to true when the hpa is ready. The operator checks the hpa status to know that.
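In the VerticaAutoscaler status this surfaces as a regular condition; a rough sketch of the shape only (field casing follows the usual status conditions convention, not verified against the CRD):

status:
  conditions:
    - type: ScalingActive
      status: "True"   # flipped by the operator once the hpa reports it can fetch the metric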
Users will usually look at the hpa to see the current metrics, right? If they see that minReplicas has changed, won't that be confusing? Could we add a new field in the hpa to indicate the original minReplicas?
@LiboYu2, @qindotguan comments, thoughts?
We cannot add a new field to the hpa as it is a 3rd-party struct.
	Type:   v1beta1.ScalingActive,
	Status: corev1.ConditionFalse,
}
scalingActive := isStatusConditionPresentAndEqual(conds, autoscalingv2.ScalingActive, corev1.ConditionTrue) && |
At first, autoscalingv2.ScalingActive is false. Except for line 63 in this file, are there other places where we set it to true?
No, that's the only place.
This adds a new field, scaleDownThreshold, to the metric's definition. It controls scale down: a scale down will happen only when the metric's value is lower than this new field's value. As the hpa does not natively support two thresholds, this works through a hack: scaling cannot happen when currentReplicas equals minReplicas, so the operator sets the hpa minReplicas to the VerticaAutoscaler's targetSize as long as the metric's value is greater than scaleDownThreshold, thus preventing scale down. The moment the value drops below scaleDownThreshold, the operator updates the hpa minReplicas to its correct value (from the vas), which triggers the scale down.
This does not interact with the scale-down stabilization window as you might expect, because the new threshold is not part of the hpa. So the recommendation is to use either a stabilization window or scaleDownThreshold for a stable scale-down experience.
NB: using the scale-down stabilization window is also a very good way to get a stable scale-down process. A window long enough ensures the metric has stayed low long enough to safely scale down.
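As an illustration of the intended usage, a fragment of a metric definition in the VerticaAutoscaler spec; the exact placement of the field in the CRD may differ from this sketch, and the values are placeholders:

# Sketch only: one entry in the VerticaAutoscaler's metrics list.
- metric:
    type: Pods
    pods:
      metric:
        name: vertica_sessions_running_total
      target:
        type: AverageValue
        averageValue: 5      # scale up once the average sessions per pod exceeds 5
  scaleDownThreshold: 2      # new field: only allow scale down once the value drops below 2

Until the metric drops below scaleDownThreshold, the operator pins the hpa minReplicas to the current targetSize, which is what blocks the scale down as described above; combined with a scale-down stabilization window, this keeps the subcluster from flapping between sizes.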