Observability of validators for relayers #3057

Merged
merged 42 commits into from
Jan 2, 2024

@tkporter tkporter commented Dec 14, 2023

Description

The goal of this is to gain insight into whether the validators of important sets are "up".

Introduces a new metric used by relayers: hyperlane_observed_validator_latest_index, e.g.:

hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 670
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test2",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 665
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test1",validator="0x15d34aaf54267db7d7c367839aaf71a00a2c6a65"} 652
hyperlane_observed_validator_latest_index{agent="relayer",app_context="default_ism",destination="test3",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 664
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test2",validator="0x9965507d1a55bcc2695c58ba16fb37d819b0a4dc"} 658
hyperlane_observed_validator_latest_index{agent="relayer",app_context="testapp",destination="test1",hyperlane_baselib_version="0.1.0",origin="test3",validator="0x976ea74026e726554db657fa54763abd0c3a0aa9"} 641

Tapping into metadata building for multisig ISMs, the relayer updates the metric with the latest indices for the validators in a set. To keep the cardinality from getting ridiculously high, only certain validator sets are tracked. This is done by introducing an app_context label (I'm very open to other names here; for some reason, whenever idk how to name some kind of identifier I end up calling it a context 😆).

The app context can either be:

  • if a new setting, --metricAppContexts, is specified, a message is classified by the first matching list it matches. E.g. --metricAppContexts '[{"name": "testapp", "matchingList": [{"recipient_address": "0xd84379ceae14aa33c123af12424a37803f885889", "destination_domain": 13371 }] }]'. This is nice for e.g. warp route deployments, where the ISM may not be the default ISM and can be changed
  • if a message doesn't get classified this way, it can still be classified with the "default_ism" app context, which applies to any message that uses the default ISM as its "root" ISM

This way we have insight into the default ISM and any application-specific ISMs.
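
The classification order described above can be sketched roughly as follows. This is a minimal illustration, not the relayer's actual code: `Message`, `ListElement`, `AppContext`, and `classify` are hypothetical names, only two matching-list fields are modeled, and the `root_ism_is_default` flag stands in for whatever check the relayer performs on the message's root ISM.

```rust
/// Hypothetical, trimmed-down message: only the fields used for matching here.
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub recipient_address: String,
    pub destination_domain: u32,
}

/// One matching-list element (assumed shape); a `None` field acts as a wildcard.
#[derive(Debug, Clone, Default)]
pub struct ListElement {
    pub recipient_address: Option<String>,
    pub destination_domain: Option<u32>,
}

impl ListElement {
    /// True when every specified field agrees with the message.
    fn matches(&self, msg: &Message) -> bool {
        self.recipient_address
            .as_ref()
            .map_or(true, |a| a.eq_ignore_ascii_case(&msg.recipient_address))
            && self
                .destination_domain
                .map_or(true, |d| d == msg.destination_domain)
    }
}

/// A named app context with its matching list (assumed shape).
pub struct AppContext {
    pub name: String,
    pub matching_list: Vec<ListElement>,
}

/// Returns the name of the first configured context whose list matches the
/// message; otherwise falls back to "default_ism" when the message's root ISM
/// is the default ISM; otherwise the message is left unclassified.
pub fn classify(
    contexts: &[AppContext],
    msg: &Message,
    root_ism_is_default: bool,
) -> Option<String> {
    contexts
        .iter()
        .find(|ctx| ctx.matching_list.iter().any(|el| el.matches(msg)))
        .map(|ctx| ctx.name.clone())
        .or_else(|| root_ism_is_default.then(|| "default_ism".to_string()))
}
```

With the `testapp` example from above, a message to that recipient on domain 13371 would be labeled `testapp`, while any other message would get `default_ism` only if its root ISM is the default one.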

Some things to note:

  • it's possible for a message to actually have more than one validator set, e.g. if it's using an aggregation ISM. In this case, we'll have metrics on the union of all validator sets for that app context
  • Some effort is required to make sure that metrics don't stick around for a validator that has actually been removed from the set. To handle this, we cache the validator set for an app context and clear out the entire set each time we set the metrics
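
The cache-and-clear approach from the second note could look something like the sketch below. It is an assumption-laden illustration: `ObservedValidatorMetrics` is a hypothetical stand-in that models the gauge as a plain `HashMap` instead of a real Prometheus gauge vector, and it keys series only by app context and validator (the real metric also carries origin and destination labels). The caller is assumed to pass the union of all validator sets for the app context.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for the gauge plus the per-app-context cache.
#[derive(Default)]
pub struct ObservedValidatorMetrics {
    /// (app_context, validator) -> latest observed index.
    gauge: HashMap<(String, String), i64>,
    /// Validators whose series were written last time, per app context.
    cached_sets: HashMap<String, BTreeSet<String>>,
}

impl ObservedValidatorMetrics {
    /// Clears every series previously written for `app_context`, then writes
    /// the new values, so validators removed from the set do not linger.
    pub fn set_latest_indices(&mut self, app_context: &str, latest: &[(String, i64)]) {
        // Remove all series from the cached (previous) validator set.
        if let Some(previous) = self.cached_sets.remove(app_context) {
            for validator in previous {
                self.gauge.remove(&(app_context.to_string(), validator));
            }
        }

        // Write the current values and remember which validators were set.
        let mut current = BTreeSet::new();
        for (validator, index) in latest {
            self.gauge
                .insert((app_context.to_string(), validator.clone()), *index);
            current.insert(validator.clone());
        }
        self.cached_sets.insert(app_context.to_string(), current);
    }

    /// Reads back a series, mainly useful for checking the behavior.
    pub fn get(&self, app_context: &str, validator: &str) -> Option<i64> {
        self.gauge
            .get(&(app_context.to_string(), validator.to_string()))
            .copied()
    }
}
```

Clearing the whole cached set before each write is simpler than diffing old and new sets, at the cost of briefly removing and re-adding series that are still valid.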

Drive-by changes

  • Zod's nonempty function for strings is deprecated; moved to .min(1) instead

Related issues

Backward compatibility

yes

Testing

Ran locally. I think I'll probably add something in e2e tests, but opening now


changeset-bot bot commented Dec 14, 2023

⚠️ No Changeset found

Latest commit: 4bb3afd

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@tkporter tkporter enabled auto-merge (squash) December 20, 2023 16:53
@tkporter tkporter enabled auto-merge (squash) January 2, 2024 16:13
@tkporter tkporter merged commit 3f88aa6 into main Jan 2, 2024
13 of 19 checks passed
@tkporter tkporter deleted the trevor/relayer-validator-metrics branch January 2, 2024 16:36
daniel-savu added a commit that referenced this pull request Mar 13, 2024
### Description

Includes the `app_context` classification in `PendingMessage`, and adds
trait methods on `PendingOperation` to require always having such a
label on `OpQueue` operations. This is done by reusing the matching list
logic from the validator checkpoint labels
(#3057).

The nice thing is that this enables later support for retrying a group
of `OpQueue` operations just by specifying the `app_context` label,
without adding any new logic, since these labels are essentially
matching list results. One downside to using `app_context` for retries
is that the endpoint caller is constrained to only the matching lists
defined by the relayer operator - however, imo only the relayer operator
should be able to trigger retries.

### Drive-by changes

The `OpQueue` type alias is converted to an actual struct that stores
the queue label (for metrics purposes), and also the `IntGaugeVec`
metric: the generic group of metrics associated with that queue
(basically only `submitter_queue_length` currently).

### Related issues

- Fixes #3240


### Backward compatibility

Yes

### Testing

Manual, by spinning up a relayer for injective and inevm. Sample
metrics, from `--metricAppContexts '[{"name": "injectivelabel",
"matchingList": [{"destination_domain": 6909546 }] }, {"name":
"inevmlabel", "matchingList": [{"destination_domain": 2525 }] }]'`
```
hyperlane_submitter_queue_length{agent="relayer",app_context="inevmlabel",hyperlane_baselib_version="0.1.0",queue_name="confirm_queue",remote="inevm"} 11
hyperlane_submitter_queue_length{agent="relayer",app_context="inevmlabel",hyperlane_baselib_version="0.1.0",queue_name="prepare_queue",remote="inevm"} 0
hyperlane_submitter_queue_length{agent="relayer",app_context="injectivelabel",hyperlane_baselib_version="0.1.0",queue_name="confirm_queue",remote="injective"} 63
hyperlane_submitter_queue_length{agent="relayer",app_context="injectivelabel",hyperlane_baselib_version="0.1.0",queue_name="prepare_queue",remote="injective"} 13281
```
yorhodes pushed a commit that referenced this pull request Mar 22, 2024
Development

Successfully merging this pull request may close these issues.

Observe 3rd Party validators are signing at the tip in the relayer
3 participants