Metrics granularity #1667

y0sher · 2024-08-27T14:02:50Z

Description

SSV Node emits many metrics, collecting these metrics might be a heavy task or require a lot of resources to store. Some users have expressed that they would like to only receive certain metrics. Currently it can be turned off by passing the metrics port as 0 but this will disable all metrics reporting as a whole.

Task

Introduce a metrics granularity system. Where some metrics can be turned off and not show up in the metrics endpoint.

Ideas

Do it based on prefixes or regex. NOTE: this might require to change some metric names to b able to create such system.
Example: passing a metrics filter with value ssv:consensus will only publish metrics that start with this prefix. ssv:* will do all metrics that is under ssv.
your idea here?

Call for feedback

Long before we implement anything like this, would love to hear input from the team, especially @zktaiga @oleg-ssvlabs @vaclav-ssvlabs

The text was updated successfully, but these errors were encountered:

zktaiga · 2024-08-27T14:21:10Z

I think this is a good idea. I advise against the : delimiter for metrics, it's usually used for recording rules and not Prometheus metrics themselves. Recording rules are just metrics computed from other metrics by Prometheus.

The main thing to watch out for is cardinality, we can't have many possible values for a given label. See this article on cardinality.
some_metric{peer="0x1234"} <- this is an anti-pattern that should be avoided where possible.

Historically, the way some Ethereum clients have solved this, is to limit via flag the amount of validators you are keeping metrics for, with some sensible default.

After all, the most important part is that we are flushing metrics, which I think is currently not the case. The default behavior of Prometheus is to persist metrics in-memory, there's no "garbage collection". If at some point you see some peer 123 and update a metric, and this peer goes offline and comes online with some other peer ID 321, the current /metrics endpoint would register:

At time0:

metric{peer=123} 0

At time1:

metric{peer=123} 0
metric{peer=321} 0

Now picture that happening with many peers and many metrics... it sends Prometheus memory consumption into overdrive. We need to use something like DeletePartialMatch to keep an up to date snapshot of peer metrics.

We need to investigate why this didn't help: https://github.com/ssvlabs/ssv/pull/1163/files#diff-996f83452b65766df6841fa9611f770eef54a6a8343fa3ab004dee76bdd2d613R474-R477

y0sher added the enhancement New feature or request label Aug 27, 2024

y0sher mentioned this issue Aug 27, 2024

High char amount within a log line #1666

Closed

ssvlabs deleted a comment Aug 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics granularity #1667

Metrics granularity #1667

y0sher commented Aug 27, 2024

zktaiga commented Aug 27, 2024 •

edited

Loading

Metrics granularity #1667

Metrics granularity #1667

Comments

y0sher commented Aug 27, 2024

Description

Task

Ideas

Call for feedback

zktaiga commented Aug 27, 2024 • edited Loading

zktaiga commented Aug 27, 2024 •

edited

Loading