Benchmark Prometheus restart scenarios (for WAL + snapshot cost and timing) #820

bwplotka · 2025-01-31T08:30:29Z

I propose we restart the Prometheus instances during the standard prombench runs, e.g.:

Graceful restart (kubectl delete pod) after 3h of the prombench run.
Forceful restart (kubectl delete pod --grace-period=0) after 6h of the prombench run (so 3h after the first restart).

This would let us test important Prometheus features, such as the WAL checkpoints and memory snapshots used during replay, which in the past caused resource spikes and can take some time. We also plan more work to improve this flow, so reliable metrics would be nice to have.

This restart logic could perhaps be implemented in the scaler, which already has access to the Kubernetes API.
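To make the schedule concrete, here is a minimal Python sketch of the bookkeeping only. It is not the scaler's actual code: `PlannedRestart`, `due_restarts`, `run_due_restarts` and the `delete_pod` callable are hypothetical names, and the real pod deletion through the Kubernetes API is left as a placeholder.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class PlannedRestart:
    """One restart in the benchmark schedule (hypothetical type)."""
    after: timedelta                     # time since the start of the prombench run
    grace_period_seconds: Optional[int]  # None = pod's default grace period, 0 = forceful


# Schedule taken from the proposal: graceful at 3h, forceful at 6h.
SCHEDULE = [
    PlannedRestart(after=timedelta(hours=3), grace_period_seconds=None),
    PlannedRestart(after=timedelta(hours=6), grace_period_seconds=0),
]


def due_restarts(elapsed, already_done, schedule=SCHEDULE):
    """Return the restarts whose time has come and that have not run yet."""
    return [r for r in schedule if r.after <= elapsed and r not in already_done]


def run_due_restarts(elapsed, already_done, delete_pod):
    """Trigger each due restart via delete_pod and record it as done.

    delete_pod is a placeholder callable; a real implementation would call
    the Kubernetes pod-delete API with the given grace period.
    """
    for restart in due_restarts(elapsed, already_done):
        delete_pod(restart.grace_period_seconds)
        already_done.add(restart)
```

A periodic loop could call `run_due_restarts(elapsed, done, delete_pod)` with `done = set()` created once per run, so that each planned restart fires exactly once.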

On top of that I would ensure we:

Add a dashboard panel for a startup time metric (if no such metric exists, we might want to add one, e.g. time to readiness).
Add some vertical lines/thresholds to the dashboards to show that the drop in all metrics is expected, or maybe another panel/metric? (Perhaps this could be done with some events?)
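As a rough illustration of what a "time to readiness" measurement could look like from the outside, the sketch below polls a readiness probe and reports how long it took to succeed. It assumes Prometheus' `/-/ready` endpoint is reachable at a known address; the function names, timeouts and polling interval are illustrative assumptions, and the measurement starts when polling starts, not when the process starts.

```python
import time
import urllib.error
import urllib.request


def is_ready(url):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer such as 503.
        return False


def time_to_readiness(probe, timeout_s=3600.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

For example, `time_to_readiness(lambda: is_ready("http://localhost:9090/-/ready"))` could be started right after the pod is deleted; the address here is an assumption. The resulting duration could then be exported as a metric or recorded as a dashboard annotation.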

WDYT? @bboreham @kakkoyun

bwplotka · 2025-01-31T08:32:21Z

I just used this technique manually for the metadata-in-WAL feature, to check whether the number of metadata records in the WAL makes a difference during replay:

prometheus/prometheus#15907

bboreham · 2025-01-31T09:24:05Z

Agreed with the basic idea, but doesn’t the current startup rebuild Prometheus?
We need to separate those things to get proper timing.

bwplotka added enhancement / feature help wanted labels Jan 31, 2025

bwplotka changed the title ~~Benchmark Prometheus restart scenarios (for WAL + checkpointing cost and timing)~~ Benchmark Prometheus restart scenarios (for WAL + snapshot cost and timing) Jan 31, 2025
