We are running Alloy in a StatefulSet with clustering enabled to scrape Prometheus metrics for ~8 million active series from ~2600 scrape targets. We also have a HorizontalPodAutoscaler in place. When Alloy runs with a minimum of 5 replicas, it functions properly. However, when the HPA scales it down to 3 replicas, Alloy silently drops samples, despite CPU and memory usage remaining within the requested limits. We don't see any errors in the Alloy logs or the debugging UI. We suspect that Alloy struggles to handle the data load with fewer than 5 replicas, but we lack concrete evidence to validate this.
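For a rough sense of why 3 replicas might struggle where 5 do not, here is a small back-of-the-envelope calculation using the figures from this report (8M series, 2600 targets), assuming Alloy's clustering distributes targets roughly evenly across replicas:

```python
# Rough per-replica load estimate for Alloy clustering, assuming targets
# (and therefore active series) spread roughly evenly across replicas.
# The 8M-series / 2600-target figures come from this issue report.

ACTIVE_SERIES = 8_000_000
TARGETS = 2_600

def per_replica_load(replicas: int) -> tuple[int, int]:
    """Return (active series, scrape targets) each replica handles,
    assuming an even distribution."""
    return ACTIVE_SERIES // replicas, TARGETS // replicas

print(per_replica_load(5))  # (1600000, 520)
print(per_replica_load(3))  # (2666666, 866)
```

Going from 5 to 3 replicas pushes each pod from ~1.6M to ~2.7M active series, a ~67% jump in per-pod load even though cluster-wide CPU/memory may still look healthy.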
Is there any guidance available to help us determine the optimal number of replicas for Alloy or how to monitor this issue?
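One way to gather concrete evidence is to query the pipeline's self-metrics per pod. A sketch of queries we would check, assuming the pods use `prometheus.remote_write` (these metric names come from the Prometheus scrape/remote-write libraries that Alloy embeds; verify them against your Alloy version's `/metrics` output):

```promql
# Samples failing or dropped on the remote-write path, per pod:
sum by (pod) (rate(prometheus_remote_storage_samples_failed_total[5m]))
sum by (pod) (rate(prometheus_remote_storage_samples_dropped_total[5m]))

# Scrape-side pressure: scrapes approaching or exceeding the scrape interval
# produce exactly the kind of silent gaps described here.
max by (pod) (scrape_duration_seconds)
```

If `scrape_duration_seconds` climbs toward the configured scrape interval as replicas drop, scrapes are timing out rather than erroring, which would explain gaps with no log errors.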
Alloy data drop:
Targets are redistributed properly:
Steps to reproduce
1. Deploy Alloy as a StatefulSet with clustering enabled, scraping Prometheus metrics.
2. Enable the HPA.
3. Increase the CPU request so that the HPA scales the Alloy pods down.
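For reference, a minimal sketch of the clustered scrape pipeline described above (the remote-write URL and discovery role are placeholders; clustering also requires starting Alloy with `--cluster.enabled=true`, and the StatefulSet/HPA manifests are omitted):

```alloy
// Discover pods to scrape (placeholder selector).
discovery.kubernetes "pods" {
  role = "pod"
}

// Clustered scrape: each Alloy instance takes a share of the targets.
prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write" // placeholder
  }
}
```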
It looks like a bug in Alloy's clustering mode: metrics are dropped silently (see the small gaps) even when the total active series count is under 2 million and 5 Alloy replicas are running. As soon as I disable clustering mode, there are no gaps. I don't see any related errors in the debug logs. Can anyone please help me here?
sarita-maersk changed the title from "Alloy silently drops samples in cluster mode with fewer replicas and a high number of active series" to "Alloy silently drops samples in cluster mode" on Feb 13, 2025.
System information: arm64
Software version: Alloy v1.5.1
Configuration
Logs