This happens from time to time: the HTTP requests total metric differs across Prometheuses. Looking at the current loadgen-querier code, I think we can easily hit the case where one Prometheus is slow enough that it not only makes groups go out of sync (queries issued at different times than on the other Prometheus), but, more importantly, all queries in the group take longer in total than the configured interval, so the load on that one Prometheus decreases. As a result it might look like that Prometheus is doing just fine, when in fact it is only delaying loadgen scheduling.
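For illustration, here is a minimal sketch of the kind of group loop that produces this effect. This is not the actual loadgen-querier code; the `runGroup` function, the endpoint path, and all parameters are made-up assumptions:

```go
package loadgen

import (
	"net/http"
	"net/url"
	"time"
)

// runGroup is a hypothetical group loop: the next iteration only starts after
// all queries of the previous one have returned, so a Prometheus that is
// slower than the group interval silently receives less load.
func runGroup(target string, queries []string, interval time.Duration) {
	for {
		start := time.Now()
		for _, q := range queries {
			// Each query blocks until the target answers (or errors out).
			resp, err := http.Get(target + "/api/v1/query?query=" + url.QueryEscape(q))
			if err == nil {
				resp.Body.Close()
			}
		}
		// If the queries together took longer than the interval, there is
		// nothing left to sleep and the schedule simply slips: fewer requests
		// per second hit the slow Prometheus, and its request rate drops
		// without any visible error.
		if elapsed := time.Since(start); elapsed < interval {
			time.Sleep(interval - elapsed)
		}
	}
}
```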
A couple of things we could do (a rough sketch follows the list):
1. Have a metric that shows loadgen delays caused by a (potentially) slow Prometheus.
2. Cancel groups that run for too long (context deadline equal to the group interval). This makes the load "easier" on a slow Prometheus, which might hide the perf issues.
3. Start new group queries regardless of whether the previous group iteration has finished or not. This gives us the most "fair" and equal load on both Prometheuses, but it will lead to cascading failures on both the slow Prometheus and loadgen itself spamming goroutines.
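To make the options concrete, here is a rough sketch of how (1), (2) and (3) could look if squeezed into a single runner. The metric name, `runGroup`, and the HTTP details are illustrative assumptions, not the actual loadgen-querier implementation:

```go
package loadgen

import (
	"context"
	"net/http"
	"net/url"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// (1) A metric exposing how often a group iteration overruns its interval,
// i.e. how much loadgen itself is being slowed down by a slow target.
var groupOverruns = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "loadgen_group_interval_overruns_total",
		Help: "Group iterations that took longer than the configured interval.",
	},
	[]string{"target", "group"},
)

func init() {
	prometheus.MustRegister(groupOverruns)
}

func runGroup(target, group string, queries []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		// (3) Start every iteration on schedule, regardless of whether the
		// previous one has finished, so both targets see equal offered load.
		go func() {
			start := time.Now()
			// (2) Bound each iteration by the group interval so queries against
			// a slow Prometheus get cancelled instead of piling up.
			ctx, cancel := context.WithTimeout(context.Background(), interval)
			defer cancel()
			for _, q := range queries {
				req, err := http.NewRequestWithContext(ctx, http.MethodGet,
					target+"/api/v1/query?query="+url.QueryEscape(q), nil)
				if err != nil {
					continue
				}
				if resp, err := http.DefaultClient.Do(req); err == nil {
					resp.Body.Close()
				}
			}
			if time.Since(start) > interval {
				groupOverruns.WithLabelValues(target, group).Inc()
			}
		}()
	}
}
```

In practice (2) and (3) would probably not be combined as-is, since (2) softens exactly the overload that (3) is meant to expose.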
It feels like (1) is some improvement (comparing delays across clients), but isn't this equivalent to the server's total requests counter rate going down? 🤔
Doing (3) and allowing the whole Prometheus to slowly time things out and/or OOM/get slower is a fair approach, but how do we avoid starving the client (if the client itself gets slower or OOMs, it might "recover" the Prometheus situation)?
What's stopping us from assuming that a dropping request count == Prometheus is too slow? cc @bboreham
Discussed in https://cloud-native.slack.com/archives/C07TT6DTQ02/p1737726558098399