Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding monitoring on slashing metrics #166

Open
apentori opened this issue Feb 14, 2024 · 9 comments
Open

Adding monitoring on slashing metrics #166

apentori opened this issue Feb 14, 2024 · 9 comments
Labels
metrics Scraping and graphing metrics

Comments

@apentori
Copy link
Contributor

What

  • Adding monitoring on the metrics validator_monitor_slashed

Why

The issue follow the incident of the 2024/01/31 on holesky where Validator were slashed due to BN and VC using the same validator keys.

@apentori apentori added the metrics Scraping and graphing metrics label Feb 14, 2024
@apentori
Copy link
Contributor Author

Monitoring already available: https://github.com/status-im/infra-hq/blob/3b0e8240fb83fd84d1a19f43f2d0de45b0ded316/ansible/roles/prometheus-slave/files/rules/nimbus.yml#L6.

I couldn't find im the VictorOps history if the alert was correctly created

@jakubgs
Copy link
Member

jakubgs commented Feb 20, 2024

What about metrics? Has the count risen during the Holesky fleet slashing?

@apentori
Copy link
Contributor Author

yes, the metrics had some change over time, for example on geth-07, between the 30/01 and 02/02:

1708506715_grim

We could simplify the alert condition to:

validator_monitor_slashed > 0 

@jakubgs
Copy link
Member

jakubgs commented Feb 21, 2024

Okay, I see how this works, there's a `` label which shows the whole count for the node:

validator_monitor_slashed{fleet="nimbus.prater", validator="total"}

image

And rest is for per-validator using beginning of public key. But we don't need to look at that, it's easier to look at the total.

The issue with validator_monitor_slashed > 0 is that it will trigger for those hosts that already have slashed validators:

image

Which means that alone is not useful. We need to detect an increase in that metric, not just above zero value.

@jakubgs
Copy link
Member

jakubgs commented Feb 21, 2024

For example, here we can see that the count just goes up:

validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}

image

But here we can see it's just a spike when we use delta() or increase():

increase(validator_monitor_slashed{
    instance="geth-09.ih-eu-mda1.nimbus.holesky", fleet="nimbus.holesky", validator="total"
}[1m])

image

If we incease the time range to 15 minutes we can see more spikes:

image

But it doesn't last, so the alert clears quickly.

@jakubgs
Copy link
Member

jakubgs commented Feb 21, 2024

What we use right now is:

sum by (instance,container)(delta(validator_monitor_slashed[15m])) > 0

There's at least two ways we can improve this:

sum by (instance,container)(increase(validator_monitor_slashed{validator="total"}[15m])) > 0

By using increase() and adding validator="total". But that doesn't really solve the issue of it clearing in 15 minutes.

Prometheus sends alerts every time it evaluates the check and it's false, but once it clear the alert disappears:

Prometheus's alerting rules are good at figuring what is broken right now, but they are not a fully-fledged notification solution. Another layer is needed to add summarization, notification rate limiting, silencing and alert dependencies on top of the simple alert definitions.

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#sending-alert-notifications

@jakubgs
Copy link
Member

jakubgs commented Feb 21, 2024

I see no clear way to make such an alert last using Alertmanager configuration:
https://prometheus.io/docs/alerting/latest/configuration/

So it seems the only way to make it last longer is doing some metric query wizardry:
https://prometheus.io/docs/prometheus/latest/querying/functions/

@jakubgs
Copy link
Member

jakubgs commented Feb 21, 2024

Also, there is another way, which is not great, but we could ask for another metric:

# HELP validator_monitor_slashed_since_start Incremented by 1 for each slashed validator during this run.
# TYPE validator_monitor_slashed_since_start gauge
validator_monitor_slashed_since_start 0.0

Not sure if this would be accepted by nimbus-eth2 developers, but the only other alternatives are:

  1. Dark magic with Prometheus query to keep the value >0 for longer.
    • Not sure if at all doable and probably would be a horrible hack.
  2. Removing slashed validator keys from affected nodes.
    • Jacek expressed approval for running nodes with slashed validators to test that part of the code.

@jakubgs
Copy link
Member

jakubgs commented Jul 1, 2024

I guess we never came to a conclusion on how to fix this. Seems like a new metric would make this much easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metrics Scraping and graphing metrics
Projects
None yet
Development

No branches or pull requests

2 participants