Alloy silently drops samples in cluster mode #2649

Open

sarita-maersk opened this issue Feb 7, 2025 · 1 comment

Labels
bug Something isn't working
What's wrong?

We are running Alloy in a StatefulSet with clustering enabled to scrape Prometheus metrics for ~8 million active series from ~2600 scrape targets, with a HorizontalPodAutoscaler in place. When Alloy runs with a minimum of 5 replicas, it functions properly. However, when the HPA scales it down to 3 replicas, Alloy silently drops samples, even though CPU and memory usage stay within the configured requests and limits. We don't see any errors in the Alloy logs or the debugging UI. We suspect that Alloy struggles to handle the data load with fewer than 5 replicas, but we lack concrete evidence to validate this.
Is there any guidance available to help us determine the optimal number of replicas for Alloy, or how to monitor this issue?
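
For monitoring, we are experimenting with alert rules along these lines (a sketch only; the metric names assume Alloy re-exports the standard Prometheus remote-write metrics and its own cluster_node_peers gauge, and the thresholds are illustrative):

groups:
  - name: alloy-sample-drops
    rules:
      # Remote-write samples that failed to send; should be ~0 in steady state.
      - alert: AlloyRemoteWriteFailures
        expr: sum by (pod) (rate(prometheus_remote_storage_samples_failed_total[5m])) > 0
        for: 10m
      # Fewer gossip peers than the HPA minimum means pods have left the cluster ring.
      - alert: AlloyClusterPeersLow
        expr: min(cluster_node_peers) < 5
        for: 5m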

Alloy data drop:

[screenshot]

Targets are redistributed properly:

[screenshot]

Steps to reproduce

  • Deploy Alloy for Prometheus metrics as StatefulSet with Clustering enabled.
  • Enable HPA.
  • Increase the CPU request so that the HPA scales the Alloy pods down.

System information

arm64

Software version

v1.5.1

Configuration

config.alloy


discovery.kubernetes "integrations_kubernetes_agent" {
	role = "pod"
}

discovery.relabel "integrations_kubernetes_agent" {
	targets = discovery.kubernetes.integrations_kubernetes_agent.targets

	rule {
		source_labels = ["__meta_kubernetes_pod_container_init"]
		regex         = "true"
		action        = "drop"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
		regex         = "true"
		action        = "keep"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_path"]
		regex         = "(.+)"
		target_label  = "__metrics_path__"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_label_env", "__meta_kubernetes_pod_label_environment"]
		regex         = "^;*([^;]+)(;.*)?$"
		target_label  = "env"
	}
}

prometheus.scrape "integrations_kubernetes_agent" {
	targets                   = discovery.relabel.integrations_kubernetes_agent.output
	forward_to                = [prometheus.relabel.integrations_kubernetes_agent.receiver]
	job_name                  = "integrations/kubernetes/agent"
	scrape_classic_histograms = true
	scrape_interval           = "15s"

	tls_config {
		insecure_skip_verify = true
	}
	
	clustering {
		enabled = true
	}
}

prometheus.relabel "integrations_kubernetes_agent" {
	forward_to = [prometheus.remote_write.default.receiver]

	rule {
		source_labels = ["__name__"]
		regex         = "node_dmi_info"
		action        = "drop"
	}
}

prometheus.remote_write "default" {
	external_labels = {
		__replica__ = "$${POD_NAME}",
		env         = "${env}",
		k8s_cluster = "${k8s_cluster}",
		metrics_ha  = "${k8s_cluster};${region};app",
		product_id  = "${product}",
		provider    = "${provider}",
		region      = "${region}",
	}

	endpoint {
		url = "${METRICS_REMOTE_WRITE_URL}"
		send_native_histograms = true

		oauth2 {
			client_id          = "client_id"
			client_secret_file = "/etc/oauth2/secrets/AUTH_CLIENT_SECRET"
			scopes             = ["api://ingestion/.default"]
			token_url          = "https://login.microsoftonline.com/abc.com/oauth2/v2.0/token"

			tls_config { }
		}

		queue_config {
			capacity             = 10000
			max_shards           = 50
			max_samples_per_send = 2000
			min_backoff          = "1s"
			max_backoff          = "2m0s"
		}

		metadata_config { }
	}
}


statefulset.yaml


apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alloy
  labels:
    app: alloy
  annotations:
    secret.reloader.stakater.com/reload: "grafana-agent"
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Delete
  serviceName: alloy
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        storageClassName: default
        resources:
          requests:
            storage: 32Gi
        accessModes:
          - ReadWriteOnce
        volumeMode: Filesystem
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      labels:
        app: alloy
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: http
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        fsGroup: 473
        runAsUser: 473
        runAsGroup: 473
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      - key: kubernetes.io/arch
        operator: Equal
        value: arm64
        effect: NoSchedule
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: alloy
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: alloy
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                - key: kubernetes.azure.com/scalesetpriority
                  operator: In
                  values:
                    - spot
            - weight: 50
              preference:
                matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                    - arm64
      containers:
        - name: alloy
          image: harbor.maersk.io/dockerhub-proxy/grafana/alloy:v1.5.1
          args:
            - "run"
            - "/etc/alloy/config.alloy"
            - "--storage.path=/tmp/alloy"
            - --server.http.listen-addr=0.0.0.0:12345
            - --cluster.enabled=true
            - "--cluster.join-addresses=alloy-app.grafana-agent.svc.cluster.local"
            - --stability.level=generally-available
          env:
            - name: GOGC
              value: "60"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - containerPort: 12345
              name: http-web
              protocol: TCP
            - containerPort: 4318
              name: otlp-http
              protocol: TCP
          livenessProbe:
            failureThreshold: 6
            httpGet:
              path: /
              port: 12345
              scheme: HTTP
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /-/ready
              port: 12345
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: "4"
              memory: 45Gi
            limits:
              cpu: "10"
              memory: 60Gi
          volumeMounts:
            - name: alloy
              mountPath: /etc/alloy/
            - name: storage
              mountPath: /alloy/
            - name: grafana-agent-secrets
              mountPath: /etc/oauth2/secrets
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
      serviceAccountName: grafana-agent
      imagePullSecrets:
      - name: harbor-read-only
      volumes:
        - name: alloy
          configMap:
            defaultMode: 420
            name: alloy-config
        - name: grafana-agent-secrets
          secret:
            defaultMode: 420
            secretName: grafana-agent


hpa.yaml


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: alloy
spec:
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  - resource:
      name: memory
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: alloy
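
One mitigation we are considering is slowing scale-down via the autoscaling/v2 behavior field, so that target redistribution happens less often (a sketch; the window and policy values are illustrative, not tested):

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 1800  # require 30 min of low utilization before scaling down
      policies:
        - type: Pods
          value: 1            # remove at most one replica per period
          periodSeconds: 300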

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: alloy
  labels:
    app: alloy
spec:
  ports:
  - port: 12345
    name: alloy
    targetPort: 12345
  clusterIP: None
  selector:
    app: alloy

Logs

(No relevant errors observed; see description.)
sarita-maersk added the bug label on Feb 7, 2025
sarita-maersk (Author) commented Feb 12, 2025

It looks like a bug in Alloy's clustering mode: metrics are dropped silently (see the small gaps) even when the total active series count is under 2 million and 5 Alloy replicas are running. As soon as I disable clustering mode, the gaps disappear. I don't see any related errors in the debug logs. Can anyone please help me here?

[screenshots]
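
To quantify where samples disappear, I am comparing what Alloy scrapes against what it hands to remote write (a sketch as recording rules; it assumes the standard scrape_samples_scraped synthetic metric and the Prometheus remote-write counters are available):

groups:
  - name: alloy-gap-check
    rules:
      # Samples ingested per second: scrape_samples_scraped is a per-target gauge
      # of samples in the last scrape, so divide by the 15s scrape interval.
      - record: alloy:scraped_samples:rate
        expr: sum(scrape_samples_scraped{job="integrations/kubernetes/agent"}) / 15
      # Samples sent per second (a counter, so rate() applies).
      - record: alloy:remote_write_samples:rate
        expr: sum(rate(prometheus_remote_storage_samples_total[5m]))

A persistent difference between the two series would show whether samples are lost at scrape time or on the remote-write path.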

sarita-maersk changed the title from "Alloy silently drops samples in cluster mode with fewer replicas and a high number of active series" to "Alloy silently drops samples in cluster mode" on Feb 13, 2025