Alloy silently drops samples in cluster mode #2649

Open

sarita-maersk opened this issue Feb 7, 2025 · 1 comment

Labels
bug Something isn't working
What's wrong?

We are running Alloy in a StatefulSet with clustering enabled to scrape Prometheus metrics for ~8 million active series from ~2600 scrape targets, with a HorizontalPodAutoscaler in place. When Alloy runs with a minimum of 5 replicas, it functions properly. However, when the HPA scales it down to 3 replicas, Alloy silently drops samples, even though CPU and memory usage stay within the configured requests and limits. We don't see any errors in the Alloy logs or the debugging UI. We suspect that Alloy struggles to handle the data load with fewer than 5 replicas, but we lack concrete evidence to validate this.
Is there any guidance available to help us determine the optimal number of replicas for Alloy, or how to monitor this issue?
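
For monitoring, we are experimenting with alert rules along these lines (a sketch only; the metric names assume Alloy re-exports the standard Prometheus remote-write metrics and its own cluster_node_peers gauge, and the thresholds are illustrative):

groups:
  - name: alloy-sample-drops
    rules:
      # Remote-write samples that failed to send; should be ~0 in steady state.
      - alert: AlloyRemoteWriteFailures
        expr: sum by (pod) (rate(prometheus_remote_storage_samples_failed_total[5m])) > 0
        for: 10m
      # Fewer gossip peers than the HPA minimum means pods have left the cluster ring.
      - alert: AlloyClusterPeersLow
        expr: min(cluster_node_peers) < 5
        for: 5m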

Alloy data drop:

[screenshot]

Targets are redistributed properly:

[screenshot]

Steps to reproduce

  • Deploy Alloy for Prometheus metrics as StatefulSet with Clustering enabled.
  • Enable HPA.
  • Increase the CPU request so that the HPA scales the Alloy pods down.

System information

arm64

Software version

v1.5.1

Configuration

config.alloy


discovery.kubernetes "integrations_kubernetes_agent" {
	role = "pod"
}

discovery.relabel "integrations_kubernetes_agent" {
	targets = discovery.kubernetes.integrations_kubernetes_agent.targets

	rule {
		source_labels = ["__meta_kubernetes_pod_container_init"]
		regex         = "true"
		action        = "drop"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
		regex         = "true"
		action        = "keep"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_path"]
		regex         = "(.+)"
		target_label  = "__metrics_path__"
	}

	rule {
		source_labels = ["__meta_kubernetes_pod_label_env", "__meta_kubernetes_pod_label_environment"]
		regex         = "^;*([^;]+)(;.*)?$"
		target_label  = "env"
	}
}

prometheus.scrape "integrations_kubernetes_agent" {
	targets                   = discovery.relabel.integrations_kubernetes_agent.output
	forward_to                = [prometheus.relabel.integrations_kubernetes_agent.receiver]
	job_name                  = "integrations/kubernetes/agent"
	scrape_classic_histograms = true
	scrape_interval           = "15s"

	tls_config {
		insecure_skip_verify = true
	}
	
	clustering {
		enabled = true
	}
}

prometheus.relabel "integrations_kubernetes_agent" {
	forward_to = [prometheus.remote_write.default.receiver]

	rule {
		source_labels = ["__name__"]
		regex         = "node_dmi_info"
		action        = "drop"
	}
}

prometheus.remote_write "default" {
	external_labels = {
		__replica__ = "$${POD_NAME}",
		env         = "${env}",
		k8s_cluster = "${k8s_cluster}",
		metrics_ha  = "${k8s_cluster};${region};app",
		product_id  = "${product}",
		provider    = "${provider}",
		region      = "${region}",
	}

	endpoint {
		url = "${METRICS_REMOTE_WRITE_URL}"
		send_native_histograms = true

		oauth2 {
			client_id          = "client_id"
			client_secret_file = "/etc/oauth2/secrets/AUTH_CLIENT_SECRET"
			scopes             = ["api://ingestion/.default"]
			token_url          = "https://login.microsoftonline.com/abc.com/oauth2/v2.0/token"

			tls_config { }
		}

		queue_config {
			capacity             = 10000
			max_shards           = 50
			max_samples_per_send = 2000
			min_backoff          = "1s"
			max_backoff          = "2m0s"
		}

		metadata_config { }
	}
}


statefulset.yaml


apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alloy
  labels:
    app: alloy
  annotations:
    secret.reloader.stakater.com/reload: "grafana-agent"
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Delete
  serviceName: alloy
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        storageClassName: default
        resources:
          requests:
            storage: 32Gi
        accessModes:
          - ReadWriteOnce
        volumeMode: Filesystem
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: alloy
  template:
    metadata:
      labels:
        app: alloy
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: http
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        fsGroup: 473
        runAsUser: 473
        runAsGroup: 473
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      - key: kubernetes.io/arch
        operator: Equal
        value: arm64
        effect: NoSchedule
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: alloy
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: alloy
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                - key: kubernetes.azure.com/scalesetpriority
                  operator: In
                  values:
                    - spot
            - weight: 50
              preference:
                matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                    - arm64
      containers:
        - name: alloy
          image: harbor.maersk.io/dockerhub-proxy/grafana/alloy:v1.5.1
          args:
            - "run"
            - "/etc/alloy/config.alloy"
            - "--storage.path=/tmp/alloy"
            - --server.http.listen-addr=0.0.0.0:12345
            - --cluster.enabled=true
            - "--cluster.join-addresses=alloy-app.grafana-agent.svc.cluster.local"
            - --stability.level=generally-available
          env:
            - name: GOGC
              value: "60"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - containerPort: 12345
              name: http-web
              protocol: TCP
            - containerPort: 4318
              name: otlp-http
              protocol: TCP
          livenessProbe:
            failureThreshold: 6
            httpGet:
              path: /
              port: 12345
              scheme: HTTP
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /-/ready
              port: 12345
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              cpu: "4"
              memory: 45Gi
            limits:
              cpu: "10"
              memory: 60Gi
          volumeMounts:
            - name: alloy
              mountPath: /etc/alloy/
            - name: storage
              mountPath: /alloy/
            - name: grafana-agent-secrets
              mountPath: /etc/oauth2/secrets
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
      serviceAccountName: grafana-agent
      imagePullSecrets:
      - name: harbor-read-only
      volumes:
        - name: alloy
          configMap:
            defaultMode: 420
            name: alloy-config
        - name: grafana-agent-secrets
          secret:
            defaultMode: 420
            secretName: grafana-agent


hpa.yaml


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: alloy
spec:
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  - resource:
      name: memory
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: alloy
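
One mitigation we are considering is slowing scale-down via the autoscaling/v2 behavior field, so that target redistribution happens less often (a sketch; the window and policy values are illustrative, not tested):

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 1800  # require 30 min of low utilization before scaling down
      policies:
        - type: Pods
          value: 1            # remove at most one replica per period
          periodSeconds: 300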

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: alloy
  labels:
    app: alloy
spec:
  ports:
  - port: 12345
    name: alloy
    targetPort: 12345
  clusterIP: None
  selector:
    app: alloy

Logs

(No relevant errors observed; see description.)
sarita-maersk added the bug label on Feb 7, 2025
sarita-maersk (Author) commented Feb 12, 2025

It looks like a bug in Alloy's clustering mode: metrics are dropped silently (see the small gaps) even when the total active series count is under 2 million and 5 Alloy replicas are running. As soon as I disable clustering mode, the gaps disappear. I don't see any related errors in the debug logs. Can anyone please help me here?

[screenshots]
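
To quantify where samples disappear, I am comparing what Alloy scrapes against what it hands to remote write (a sketch as recording rules; it assumes the standard scrape_samples_scraped synthetic metric and the Prometheus remote-write counters are available):

groups:
  - name: alloy-gap-check
    rules:
      # Samples ingested per second: scrape_samples_scraped is a per-target gauge
      # of samples in the last scrape, so divide by the 15s scrape interval.
      - record: alloy:scraped_samples:rate
        expr: sum(scrape_samples_scraped{job="integrations/kubernetes/agent"}) / 15
      # Samples sent per second (a counter, so rate() applies).
      - record: alloy:remote_write_samples:rate
        expr: sum(rate(prometheus_remote_storage_samples_total[5m]))

A persistent difference between the two series would show whether samples are lost at scrape time or on the remote-write path.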

sarita-maersk changed the title from "Alloy silently drops samples in cluster mode with fewer replicas and a high number of active series" to "Alloy silently drops samples in cluster mode" on Feb 13, 2025