
Ensure equal query load for fair efficiency comparisons. #822

Open

bwplotka opened this issue Jan 31, 2025 · 1 comment

@bwplotka (Member)

Discussed in https://cloud-native.slack.com/archives/C07TT6DTQ02/p1737726558098399

This happens from time to time: the HTTP requests total metric differs across the Prometheus instances. Looking at the current loadgen-querier code, I think we can easily hit a case where one Prometheus is slow enough that the groups not only go out of sync (different queries called at different times vs. the other Prometheus), but, more importantly, all queries in a group take longer in total than the configured interval, so the load on that Prometheus decreases. In turn it might look like that Prometheus is doing just fine, when in fact it is just delaying loadgen scheduling.
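To illustrate the failure mode, here is a simplified sketch (not the actual loadgen-querier code; `runGroupQueries`, `group`, and `interval` are placeholder names): a sequential per-group loop silently lowers the query rate as soon as one iteration takes longer than the interval.

```go
package main

import (
	"context"
	"time"
)

// Placeholder for the actual per-group query execution.
func runGroupQueries(ctx context.Context, group string) { /* ... */ }

func runSequential(ctx context.Context, group string, interval time.Duration) {
	for ctx.Err() == nil {
		start := time.Now()
		runGroupQueries(ctx, group) // one full iteration of the group's queries
		if d := interval - time.Since(start); d > 0 {
			time.Sleep(d) // only sleep if the iteration finished early
		}
		// If the iteration exceeded interval, the next one starts immediately,
		// but the effective rate has already dropped below one iteration per interval,
		// so the slow Prometheus quietly receives less load.
	}
}
```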

A couple of things we could do:

  1. Have a metric that shows loadgen delays caused by a (potentially) slow Prometheus.
  2. Cancel groups that run too long (context deadline equal to the group interval). This makes the load “easier” on the slow Prometheus, which might hide the perf issues.
  3. Start new group queries no matter whether the previous group iteration finished or not (rough sketch below). This would give us the most “fair” and equal load for both Prometheus-es, but it can lead to cascading failures on both the slow Prometheus and the loadgen itself spamming goroutines.
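A rough, hypothetical sketch of what (2) + (3) could look like, reusing the same `runGroupQueries` placeholder and imports as the sketch above:

```go
// Fire a new iteration on every tick, whether or not the previous one
// finished, and bound each iteration with a deadline equal to the interval.
func runFixedRate(ctx context.Context, group string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			go func() {
				iterCtx, cancel := context.WithTimeout(ctx, interval) // option (2)
				defer cancel()
				runGroupQueries(iterCtx, group) // option (3): start regardless of the previous iteration
			}()
		}
	}
}
```

The goroutine-per-iteration part is exactly what creates the goroutine-spam risk mentioned in (3), so it would likely need some cap on in-flight iterations.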

It feels like (1) is some improvement (we could compare delays across clients), but isn't this equivalent to the server's total-requests counter rate going down? 🤔

Doing (3) and letting the slow Prometheus gradually time things out and/or OOM/get slower is a fair approach, but how do we avoid starving the client (if the client itself gets slower or OOMs, that might "recover" the Prometheus situation)?

What's stopping us from assuming that a dropping request count == Prometheus is too slow? cc @bboreham

@bwplotka (Member Author)

For the record, CPU load is quite low on the loadgen-queriers:

(Screenshot: CPU usage of the loadgen-queriers.)
