
VmAgents Shards Autoscaling Issues #924

Open · togikiran opened this issue Apr 18, 2024 · 11 comments
Labels: question (Further information is requested)


togikiran commented Apr 18, 2024

@f41gh7
Why are the vmagent pods getting rotated whenever we increase the replica count?
Command used: kubectl scale vmagent-shard-ha --replicas=3

How should we configure HPA to scale based on CPU/memory utilisation?
[Screenshot: kubectl get hpa output]

Why is the above HPA scaling metric showing <unknown>?

NAME               REFERENCE                  TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
vmagent-shard-ha   VMAgent/vmagent-shard-ha   <unknown>/80%   3         3         3          14m

Issue: missing label selector status
kubectl get --raw /apis/operator.victoriametrics.com/v1beta1/namespaces/<namespace>/vmagents/monitoring-vmagent-ha/scale

{"kind":"Scale","apiVersion":"autoscaling/v1","metadata":{"name":"monitoring-vmagent-ha","namespace":"<namespace>","uid":"d80e7371-cd49-4caa-8765-3a78220f9543","resourceVersion":"351547683","creationTimestamp":"2024-04-17T12:26:29Z"},"spec":{"replicas":3},"status":{"replicas":3}}

vm-operator version: v0.30.0
vmagent: v1.90.0

Haleygo added the bug (Something isn't working) label on Apr 25, 2024
Haleygo (Contributor) commented Apr 25, 2024

Hi @togikiran,

Why are the vmagent pods getting rotated whenever we increase the replica count?
Command used: kubectl scale vmagent-shard-ha --replicas=3

What do you mean by pods getting rotated? Could you share the pod status?
If you scale vmagent using kubectl scale vmagent/vmagent-shard-ha --replicas=3 from replicas=1, the expected behavior is that new pods are created and the old pod stays.

Why is the above HPA scaling metric showing <unknown>?

It's unknown because the metrics server doesn't know vmagent's current CPU utilization; you need to create that metric and report it to the metrics server.

Issue: missing label selector status

That's a bug, since we don't propagate the label selector to vmagent.status right now. We can add it, but I'm not sure how useful it is. In this HPA case, for example, the HPA doesn't need to know the pod label selector; it only needs to scale vmagent.shardCount, and the operator will scale the pods.
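
For illustration only (not taken from the operator docs): a minimal sketch of an HPA that targets the VMAgent custom resource through its scale subresource could look like the following. The name, replica bounds and threshold are placeholder assumptions, and a CPU-utilisation metric like this additionally needs the label selector discussed above to be present in the scale status so the HPA can find the pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmagent-shard-ha          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMAgent
    name: vmagent-shard-ha
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80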

togikiran (Author) commented:

Hey @Haleygo

When I increased the replicas from 2 to 3, the older pods got rolled out and new pods came up. Sharing the screenshots.
[Screenshots: pod status during the rollout]

It's unknown because the metrics server doesn't know vmagent's current CPU utilization; you need to create that metric and report it to the metrics server.

The default metrics server knows the pod metrics, right? Do you mean we need to add it for the VMAgent custom resource as well? If yes, can you help with the approach?

Can you please share a sample HPA YAML file (k8s) for autoscaling vmagent shards based on CPU?

Thanks

f41gh7 (Collaborator) commented Apr 26, 2024

Hello,

Due to the current sharding implementation of vmagent, the sharding flags for all vmagents must be changed whenever the shard count changes. That requires restarting all pods with the new flag value.

togikiran (Author) commented:

Hey @f41gh7, is there a way to skip or bypass the pod restarts? This will have an impact, since every scale-up triggers a restart. Is there any mitigation in place for this issue?

Thanks

Haleygo (Contributor) commented Apr 26, 2024

Is there a way to skip or bypass the pod restarts? This will have an impact, since every scale-up triggers a restart.

No, it won't work if the pods don't restart with the new -promscrape.cluster.membersCount arg.
Imagine you have a vmagent with shardCount: 1 and set the HPA scale threshold to CPU > 80%. At first it scrapes 100 targets with -promscrape.cluster.membersCount=1 -promscrape.cluster.memberNum=0.
Then the number of targets bumps to 200 and CPU exceeds 80%, so the HPA increases vmagent.shardCount to 2, which means there will be two vmagent instances sharding the 200 targets; each instance still scrapes 100 targets, with -promscrape.cluster.membersCount=2 and -promscrape.cluster.memberNum=0 or 1. This way, CPU usage goes down.
If we don't change -promscrape.cluster.membersCount for each instance, both of them will scrape all 200 targets, CPU usage won't go down, and there is no point in having an HPA.
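
To make the restart requirement concrete, here is an illustrative sketch (using the target counts from the example above, not actual operator output) of the scrape-cluster args each shard pod ends up with; the pod that existed before the scale-up cannot keep its old membersCount=1 value, so it has to be recreated:

# shardCount: 1 — a single pod scrapes all 200 targets
-promscrape.cluster.membersCount=1 -promscrape.cluster.memberNum=0

# shardCount: 2 — every pod needs the new membersCount, each scrapes ~100 targets
-promscrape.cluster.membersCount=2 -promscrape.cluster.memberNum=0
-promscrape.cluster.membersCount=2 -promscrape.cluster.memberNum=1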

The default metrics server knows the pod metrics, right? Do you mean we need to add it for the VMAgent custom resource as well? If yes, can you help with the approach?

I'd recommend using KEDA here. It can use Prometheus as a direct trigger, like this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test
spec:
  scaleTargetRef:
    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMAgent
    name: vmagent-test
  minReplicaCount: 2
  maxReplicaCount: 3
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://vmselect-address
      metricName: vmagent-cpu-usage
      threshold: '80'
      query: container_cpu_usage{you-container-here}
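
As a quick usage check (the HPA name below follows KEDA's default keda-hpa-<scaledobject> naming and is an assumption based on the example above): KEDA creates and manages an HPA behind the scenes, which in turn drives the VMAgent scale subresource, so after applying the ScaledObject both objects should show up:

kubectl get scaledobject test
kubectl get hpa keda-hpa-test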

Haleygo (Contributor) commented Apr 26, 2024

Issue: missing label selector status

JFYI, the status label selector should be fixed in a6e3ad7.
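
For context, once the selector is propagated, the same kubectl get --raw .../scale call from the issue description should also return a status.selector string, which is what a resource-metric HPA uses to locate the pods. The label values below are hypothetical and only illustrate the shape of the field:

{"kind":"Scale","apiVersion":"autoscaling/v1","metadata":{...},"spec":{"replicas":3},"status":{"replicas":3,"selector":"app.kubernetes.io/name=vmagent,app.kubernetes.io/instance=monitoring-vmagent-ha"}}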

togikiran (Author) commented:

@Haleygo we observed metric loss during vmagent pod scale-up, i.e. all pods are getting recreated after an increase in the replica count. This is impacting production clusters. Is there any workaround for this?

togikiran (Author) commented:

Observing metric loss while scaling vmagent. We added an HPA on CPU and memory metrics; pods are getting rolled out and we are observing metric loss. Production clusters are impacted.
Operator: v0.44.0
Vmagent version: v1.90.0
@f41gh7

Haleygo (Contributor) commented Jun 11, 2024

@Haleygo we observed metric loss during vmagent pod scale-up, i.e. all pods are getting recreated after an increase in the replica count. This is impacting production clusters. Is there any workaround for this?

Hmm, I'm afraid that's expected with the current implementation. A workaround would be to also set vmagentSpec.replicaCount=2 and enable deduplication in the vmcluster.
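
A rough sketch of that workaround, assuming a VMAgent named vmagent-shard-ha and a VMCluster named vmcluster (the names, interval and remote-write URL are placeholders): replicaCount: 2 keeps a second replica of each shard scraping while the other restarts, and deduplication on the cluster side collapses the duplicate samples they both write.

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent-shard-ha
spec:
  shardCount: 3
  replicaCount: 2                  # two replicas per shard; one keeps scraping during a rollout
  remoteWrite:
  - url: http://vminsert-example:8480/insert/0/prometheus/api/v1/write   # placeholder endpoint
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vmcluster
spec:
  vmselect:
    extraArgs:
      dedup.minScrapeInterval: 30s   # should be >= the scrape interval
  vmstorage:
    extraArgs:
      dedup.minScrapeInterval: 30s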

Haleygo added the question (Further information is requested) label and removed the bug (Something isn't working) label on Jul 4, 2024
Haleygo self-assigned this on Jul 4, 2024
Haleygo (Contributor) commented Jul 4, 2024

Hey @togikiran,
Do you have further questions on this issue?

togikiran (Author) commented:

@Haleygo VMAgent sharding relies on the -promscrape.cluster.membersCount flag, which needs to be updated whenever the deployment or statefulset scales up or down. This update necessitates recreating all pods. Enabling dynamic discovery of this value would eliminate the need for manual updates and pod recreation.
