VmAgents Shards Autoscaling Issues #924
Comments
Hi @togikiran,
What do you mean by "the vmagent pods are getting rotated"?
It's <unknown> because the metrics server doesn't know the vmagent's CPU utilization right now; you need to create the metric and report it to the metrics server.
That's a bug, since we don't have label selector propagation in the scale subresource status yet.
Hey @Haleygo, when I increased the replicas from 2 to 3, the older pods got rolled out and new pods came up. Shared the screenshots.
The default metrics-server knows the pod metrics, right? Do you mean we need to add something for the VMAgent custom resource as well? If yes, can you help with the approach? Can you please share a sample HPA YAML file (k8s) for autoscaling vmagent shards based on CPU? Thanks.
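For reference, a minimal HPA sketch for this setup, assuming the VMAgent resource is named vmagent-shard-ha (as elsewhere in this issue) and that the operator's scale subresource reports a label selector (see the fix mentioned further down); without that selector the target stays <unknown>:

```yaml
# Minimal sketch: HPA targeting the VMAgent custom resource on CPU utilization.
# Names and replica bounds are assumptions for illustration, not taken from this cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmagent-shard-ha
spec:
  scaleTargetRef:
    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMAgent
    name: vmagent-shard-ha
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Note that, as discussed below, every scale event still recreates all shard pods because the sharding flags change.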
Hello, due to the current sharding implementation of vmagent, the flags for all vmagents must be changed. This requires a restart of all pods with the new flag value.
Hey @f41gh7, is there a way to skip or bypass the pod restarts? This causes an impact and a restart on every scale-up. Is there any mitigation in place for this issue? Thanks.
No, it won't work if the pods don't restart with the new flag value.
I'd recommend using KEDA here. It can use Prometheus as a direct trigger, like the sketch below.
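A sketch of what such a ScaledObject could look like; the Prometheus server address, query, and threshold are placeholders to adapt, and the VMAgent name vmagent-shard-ha is assumed:

```yaml
# Sketch of a KEDA ScaledObject with a Prometheus trigger.
# KEDA scales the VMAgent CR through its /scale subresource.
# serverAddress, query and threshold below are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vmagent-shard-ha
spec:
  scaleTargetRef:
    apiVersion: operator.victoriametrics.com/v1beta1
    kind: VMAgent
    name: vmagent-shard-ha
  minReplicaCount: 2
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://vmselect.monitoring.svc:8481/select/0/prometheus
        query: avg(rate(process_cpu_seconds_total{job="vmagent"}[5m]))
        threshold: "0.8"
```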
JFYI, the status label selector should be fixed in a6e3ad7.
@Haleygo We observed metric loss during vmagent pod scale-up, i.e. all pods are getting recreated after the increase in replica count. This is impacting production clusters. Is there any workaround for this?
Observing metric loss while scaling vmagent. Added an HPA on CPU and memory metrics; pods are getting rolled out and we're observing metric loss. Production clusters are impacted.
Hmm, I'm afraid that's expected with the current implementation; a workaround would be to also set
Hey @togikiran,
@Haleygo VMAgent sharding relies on the -promscrape.cluster.membersCount flag, which needs to be updated whenever the deployment or statefulset scales up or down. This update necessitates recreating all pods. Enabling dynamic discovery of this value would eliminate the need for manual updates and pod recreation.
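For illustration, the per-shard scrape flags end up in each pod's args roughly like this (an assumed 3-shard example; the actual args are generated by the operator):

```yaml
# Assumed example of per-shard args for a 3-shard vmagent (one pod per shard).
# Scaling from 2 to 3 shards changes membersCount on every pod, hence the full restart.
# shard 0:
- -promscrape.cluster.membersCount=3
- -promscrape.cluster.memberNum=0
# shard 1:
- -promscrape.cluster.membersCount=3
- -promscrape.cluster.memberNum=1
# shard 2:
- -promscrape.cluster.membersCount=3
- -promscrape.cluster.memberNum=2
```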
@f41gh7
Why are the vmagent pods getting rotated whenever we increase the replica count?
Command used: kubectl scale vmagent-shard-ha --replicas=3
How should we configure an HPA to scale based on CPU/memory utilisation?
Why is the HPA scaling metric showing <unknown>?
NAME               REFERENCE                  TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
vmagent-shard-ha   VMAgent/vmagent-shard-ha   <unknown>/80%   3         3         3          14m
Issue: missing label selector status
kubectl get --raw /apis/operator.victoriametrics.com/v1beta1/namespaces/<namespace>/vmagents/monitoring-vmagent-ha/scale
{"kind":"Scale","apiVersion":"autoscaling/v1","metadata":{"name":"monitoring-vmagent-ha","namespace":"<namespace>","uid":"d80e7371-cd49-4caa-8765-3a78220f9543","resourceVersion":"351547683","creationTimestamp":"2024-04-17T12:26:29Z"},"spec":{"replicas":3},"status":{"replicas":3}}
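For context, the HPA matches pods through the selector string in the scale subresource status, which is absent from the output above. With the fix, the status would include a serialized label selector roughly like this (the label values are an assumed example, not taken from this cluster):

```yaml
# Rough shape of a /scale response that HPA can consume; the selector value is assumed.
spec:
  replicas: 3
status:
  replicas: 3
  selector: app.kubernetes.io/name=vmagent,app.kubernetes.io/instance=monitoring-vmagent-ha
```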
vm-operator version: v0.30.0
vmagent: v1.90.0