Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics granularity #1667

Open
y0sher opened this issue Aug 27, 2024 · 1 comment
Open

Metrics granularity #1667

y0sher opened this issue Aug 27, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@y0sher
Copy link
Contributor

y0sher commented Aug 27, 2024

Description

SSV Node emits many metrics, collecting these metrics might be a heavy task or require a lot of resources to store. Some users have expressed that they would like to only receive certain metrics. Currently it can be turned off by passing the metrics port as 0 but this will disable all metrics reporting as a whole.

Task

Introduce a metrics granularity system. Where some metrics can be turned off and not show up in the metrics endpoint.

Ideas

  • Do it based on prefixes or regex. NOTE: this might require to change some metric names to b able to create such system.
    Example: passing a metrics filter with value ssv:consensus will only publish metrics that start with this prefix. ssv:* will do all metrics that is under ssv.
  • your idea here?

Call for feedback

Long before we implement anything like this, would love to hear input from the team, especially @zktaiga @oleg-ssvlabs @vaclav-ssvlabs

@y0sher y0sher added the enhancement New feature or request label Aug 27, 2024
@zktaiga
Copy link
Contributor

zktaiga commented Aug 27, 2024

I think this is a good idea. I advise against the : delimiter for metrics, it's usually used for recording rules and not Prometheus metrics themselves. Recording rules are just metrics computed from other metrics by Prometheus.

The main thing to watch out for is cardinality, we can't have many possible values for a given label. See this article on cardinality.
some_metric{peer="0x1234"} <- this is an anti-pattern that should be avoided where possible.

Historically, the way some Ethereum clients have solved this, is to limit via flag the amount of validators you are keeping metrics for, with some sensible default.

After all, the most important part is that we are flushing metrics, which I think is currently not the case. The default behavior of Prometheus is to persist metrics in-memory, there's no "garbage collection". If at some point you see some peer 123 and update a metric, and this peer goes offline and comes online with some other peer ID 321, the current /metrics endpoint would register:

At time0:

metric{peer=123} 0

At time1:

metric{peer=123} 0
metric{peer=321} 0

Now picture that happening with many peers and many metrics... it sends Prometheus memory consumption into overdrive. We need to use something like DeletePartialMatch to keep an up to date snapshot of peer metrics.

We need to investigate why this didn't help: https://github.com/ssvlabs/ssv/pull/1163/files#diff-996f83452b65766df6841fa9611f770eef54a6a8343fa3ab004dee76bdd2d613R474-R477

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants