Add threshold for scale down #1038
Conversation
- script: make undeploy-prometheus || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus; fi
- script: make undeploy-prometheus-adapter || true
- script: if [ "$NEED_PROMETHEUS" = "true" ]; then make deploy-prometheus-adapter; fi
Do we run "make undeploy-prometheus" and "make undeploy-prometheus-adapter" at the end of leg-12? If not, are all the generated prometheus objects in one namespace, and how can we make sure the test objects are cleaned up?
prometheus and prometheus-adapter are at the same level as the operator, i.e. a new instance is created for each (leg-12) run and lives for the entire run (only one prometheus/adapter is used by all the test cases that need it). We are not explicitly cleaning them up because they will be removed with the kind cluster, just like the operator.
After each test (running locally), we do not remove the operator, so the prometheus objects will not be cleaned up. We should have a follow-up low-priority jira to make sure those objects are cleaned.
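For the follow-up, the cleanup could simply reuse the existing make targets; a minimal sketch, assuming it runs as the last TestStep of the leg and that the targets are safe to call when nothing is deployed:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  # Best-effort teardown of the shared prometheus objects; "|| true" keeps the
  # step from failing if they were never deployed for this run.
  - script: make undeploy-prometheus-adapter || true
  - script: make undeploy-prometheus || true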
kind: TestStep
commands:
  - script: |
      echo "sleep 120 seconds before starting to verify metrics"
Why do we need to sleep 120 seconds before verifying the metrics?
Because it may take some time for the metrics to show up in the adapter. I will make it 60s and see.
Okay, good to know this.
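As an alternative to a fixed sleep, the step could poll until the metric is exposed. This is only a sketch, assuming prometheus-adapter serves the custom.metrics.k8s.io/v1beta1 API group and that vertica_sessions_running_total is the metric being waited on:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      set -o errexit
      # Poll the custom metrics API until the metric shows up (max ~120s),
      # instead of always sleeping for the full 120 seconds.
      for i in $(seq 1 24); do
        if kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep -q vertica_sessions_running_total; then
          echo "metric is available"
          exit 0
        fi
        sleep 5
      done
      echo "timed out waiting for the metric"
      exit 1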
set -o errexit
set -o xtrace

kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
Using a for-loop would make it easier to see how many connections we created.
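For illustration, the same sessions could be started from a loop so the connection count is explicit; this is the vsql command from the step below, with the count of 3 just mirroring the current number of backgrounded sessions:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      set -o errexit
      set -o xtrace
      # Start N long-running sessions in the background; the loop bound makes
      # the number of connections easy to see and change.
      for i in 1 2 3; do
        kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
      done
      # Give the sessions time to register before metrics are checked.
      sleep 60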
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
kubectl exec svc/v-scale-down-threshold-pri1 -c server -- vsql -U dbadmin -w topsecret -c "select sleep('240');" &
sleep 60 |
What is this sleep for?
Without this, the pod would complete immediately and those sessions wouldn't stay active.
name: vertica_sessions_running_total
target:
  type: AverageValue
  averageValue: 5
If the user doesn't specify "scaleDownThreshold", will we not scale down any more? If so, we probably need to make "scaleDownThreshold" required or use a default value for it.
We will scale down based on the same threshold that is also used for scale up.
After scaling up, the current value will fall below the target value, triggering a scale-down? This cycle could repeat and cause fluctuations. Is this how the old auto-scaler works?
Yes, but it has a stabilization window to mitigate that, so it is not instantaneous. Usually, to make autoscaling stable, you would set scale-up and scale-down stabilization windows (default: 0 for scale up and 5 min for scale down). That will prevent fluctuations.
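For reference, a sketch of those windows on a plain autoscaling/v2 HorizontalPodAutoscaler; the scaleTargetRef, name, and replica bounds are placeholders, and the metric matches the one used in this test:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: v-scale-down-threshold-hpa     # placeholder name
spec:
  scaleTargetRef:
    apiVersion: vertica.com/v1beta1    # assumed group/version of the scaled resource
    kind: VerticaAutoscaler
    name: v-scale-down-threshold
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          name: vertica_sessions_running_total
        target:
          type: AverageValue
          averageValue: 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Kubernetes default: react to scale-up immediately
    scaleDown:
      stabilizationWindowSeconds: 300  # Kubernetes default: wait 5 minutes before scaling down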
targetSize: 3
status:
  scalingCount: 1
  currentSize: 3
Can we verify minReplicas and maxReplicas here as well?
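If the generated hpa lives in the test namespace, a kuttl assert along these lines could cover that; the object name and the replica bounds here are guesses, not values taken from this test:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: v-scale-down-threshold         # placeholder: whatever name the operator gives the hpa
spec:
  minReplicas: 3                       # placeholder values
  maxReplicas: 6

One caveat: with this change the operator rewrites minReplicas while the metric is above scaleDownThreshold, so an assert on minReplicas would have to run at a point where its value is predictable.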
@@ -173,13 +179,22 @@ type VerticaAutoscalerConditionType string
const (
	// TargetSizeInitialized indicates whether the operator has initialized targetSize in the spec
	TargetSizeInitialized VerticaAutoscalerConditionType = "TargetSizeInitialized"
	// ScalingActive indicates that the horizontal pod autoscaler can fetch the metric
	// and is ready for whenever scaling is needed.
	ScalingActive VerticaAutoscalerConditionType = "ScalingActive"
How would this condition work? I cannot find where we assign the value to this condition.
In verifyhpa_reconciler.go.
It is set to true when the hpa is ready. The operator checks the hpa status to know that.
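In the VerticaAutoscaler status this surfaces as a regular condition; a rough sketch of the shape only (field casing follows the usual status conditions convention, not verified against the CRD):

status:
  conditions:
    - type: ScalingActive
      status: "True"   # flipped by the operator once the hpa reports it can fetch the metric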
Users will usually look at the hpa to see the current metrics, right? If they see that minReplicas has changed, won't that be confusing? Could we add a new field in the hpa to indicate the original minReplicas?
@LiboYu2, @qindotguan comments, thoughts?
We cannot add a new field to the hpa as it is a 3rd-party struct.
	Type:   v1beta1.ScalingActive,
	Status: corev1.ConditionFalse,
}
scalingActive := isStatusConditionPresentAndEqual(conds, autoscalingv2.ScalingActive, corev1.ConditionTrue) && |
At first, autoscalingv2.ScalingActive is false. Except for line 63 in this file, are there other places where we set it to true?
No, that's the only place.
This adds a new field, scaleDownThreshold, to the metric's definition. It controls scale down: a scale down will happen only when the metric's value is lower than this new field's value. As the hpa does not natively support two thresholds, this works through a hack: scaling cannot happen when currentReplicas equals minReplicas, so the operator sets the hpa minReplicas to the VerticaAutoscaler's targetSize as long as the metric's value is greater than scaleDownThreshold, thus preventing scale down. The moment the value drops below scaleDownThreshold, the operator updates the hpa minReplicas to its correct value (from the vas), which triggers the scale down.
This does not interact with the scale-down stabilization window as you might expect, because the new threshold is not part of the hpa. So the recommendation is to use either a stabilization window or scaleDownThreshold for a stable scale-down experience.
NB: using the scale-down stabilization window is also a very good way to get a stable scale-down process. A window long enough ensures the metric has stayed low long enough to safely scale down.
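As an illustration of the intended usage, a fragment of a metric definition in the VerticaAutoscaler spec; the exact placement of the field in the CRD may differ from this sketch, and the values are placeholders:

# Sketch only: one entry in the VerticaAutoscaler's metrics list.
- metric:
    type: Pods
    pods:
      metric:
        name: vertica_sessions_running_total
      target:
        type: AverageValue
        averageValue: 5      # scale up once the average sessions per pod exceeds 5
  scaleDownThreshold: 2      # new field: only allow scale down once the value drops below 2

Until the metric drops below scaleDownThreshold, the operator pins the hpa minReplicas to the current targetSize, which is what blocks the scale down as described above; combined with a scale-down stabilization window, this keeps the subcluster from flapping between sizes.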