Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow public API usage for GetValidators #77

Merged
merged 15 commits into from
Nov 7, 2023
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 53 additions & 13 deletions relayer/canonical_validator_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,20 @@ import (

"github.com/ava-labs/avalanchego/ids"
"github.com/ava-labs/avalanchego/snow/validators"
"github.com/ava-labs/avalanchego/utils/logging"
"github.com/ava-labs/avalanchego/vms/platformvm"
"go.uber.org/zap"
)

var _ validators.State = (*CanonicalValidatorClient)(nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the syntax &CanonicalValidatorClient{} on the rhs is a bit cleaner, and it's also what we've used in other places in our repos.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. There was one other place in this repo that we were using the (*Type)(nil) syntax, so I updated it there as well 👍


// CanonicalValidatorClient wraps platformvm.Client and implements validators.State
type CanonicalValidatorClient struct {
client platformvm.Client
logger logging.Logger
}

func NewCanonicalValidatorClient(client platformvm.Client) *CanonicalValidatorClient {
func NewCanonicalValidatorClient(logger logging.Logger, client platformvm.Client) *CanonicalValidatorClient {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the logger isn't being set in the object instantiation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, my bad. Thanks for catching this.

return &CanonicalValidatorClient{
client: client,
}
Expand All @@ -34,19 +39,17 @@ func (v *CanonicalValidatorClient) GetSubnetID(ctx context.Context, chainID ids.
return v.client.ValidatedBy(ctx, chainID)
}

// Gets the current validator set of the given subnet ID, include the validators' BLS public keys.
// This implementation of GetValidatorSet currently makes two RPC requests, one to get the
// subnet validators, and another to get their BLS public keys. This is necessary in order to enable
// the use of the public APIs (which don't support "GetValidatorsAt") because BLS keys are currently
// only associated with primary network validation periods. If ACP-13 is implementated in the future
// (https://github.com/avalanche-foundation/ACPs/blob/main/ACPs/13-subnet-only-validators.md), it may
// become possible to reduce this to a single RPC request that returns both the subnet validators
// Gets the current validator set of the given subnet ID, including the validators' BLS public
// keys. The implementation currently makes two RPC requests, one to get the subnet validators,
// and another to get their BLS public keys. This is necessary in order to enable the use of
// the public APIs (which don't support "GetValidatorsAt") because BLS keys are currently only
// associated with primary network validation periods. If ACP-13 is implementated in the future
// (https://github.com/avalanche-foundation/ACPs/blob/main/ACPs/13-subnet-only-validators.md), it
// may become possible to reduce this to a single RPC request that returns both the subnet validators
// as well as their BLS public keys.
func (v *CanonicalValidatorClient) GetValidatorSet(
func (v *CanonicalValidatorClient) getCurrentValidatorSet(
ctx context.Context,
height uint64,
subnetID ids.ID,
) (map[ids.NodeID]*validators.GetValidatorOutput, error) {
subnetID ids.ID) (map[ids.NodeID]*validators.GetValidatorOutput, error) {
// Get the current subnet validators. These validators are not expected to include
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than fully replacing the call to GetValidatorsAt, what are your thoughts on adding this workaround as either a configurable switch, or a fallback option in the case that GetValidatorsAt fails?

For a switch, we could hardcode the public API endpoint and compare the config value against that. However, there's no guarantee that GetValidatorsAt is supported on some other node.

We could always try calling GetValidatorsAt then fallback to this workaround on failure, but that's strictly worse in terms of network round trips if we're using the public API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add a config option indicating which API methods are supported by the configured node and switch on that. Alternatively, we could handle that flag dynamically by calling this method on startup and looking for the "method not supported" error message, though that would increase the set of things that could break unexpectedly.

I'm spitballing at this point, curious to hear your thoughts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For context, the reason I'm in favor of having some way to still call GetValidatorsAt is to have a path that minimizes latency for the relayer. If a project has a large enough number of validators, then they might want to run their own P-Chain API node to get that minimal latency path. That being said, I don't actually know if a call to GetValidatorsAt at the current height is necessarily faster than the workaround here, but I would suspect it is)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good thoughts...there are definitely trade offs one way or another at certain scales.

My main considerations are:

  • The relative frequency of relayer that don't want to set up their own nodes to use an API, and their performance requirements.
  • The relative frequency of relayer nodes that do want to set up their own nodes to use an API, and their performance requirements.
  • Code and configuration complexity.

I think this is a minute enough of a detail that adding configuration options for it is not worth the additional code complexity and mental bandwidth of those setting up nodes to try to properly understand what the proper value for their set up would be. Checking whether or not the API method is available at start up is a cool idea to get around this, but there's no guarantee than an API won't change it behavior at any point in time, so would be prone to breaking.

That makes me think either always using the workaround method, or always trying to first use GetValidatorsAt and then fallback to the workaround if needed is likely the best approach. For relayer deployments without access to their own API nodes, my inclination is that an additional request would not be a critical performance decrease. I would also guess that deployments that do have or set up their own API nodes, tend to have strictly performance expectations/requirements. Given that, I think using this option as a fallback in the even that GetValidatorsAt fails is likely the best route forward.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me. Very good point that a relayer operator concerned with latency would likely roll their own API node; if they choose not to, then one extra round trip is likely not going to make a difference to them.

// BLS signing information given that addPermissionlessValidatorTx is only used to
// add primary network validators.
Expand All @@ -73,12 +76,49 @@ func (v *CanonicalValidatorClient) GetValidatorSet(

// Set the BLS keys of the result.
for _, primaryVdr := range primaryVdrs {
// We expect all of the primary network validators to already be in `res` because
// we filtered the request to node IDs that were identified as validators of the
// specific subnet ID.
vdr, ok := res[primaryVdr.NodeID]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever expecting primaryVdr to not be found in res? My understanding is that the above call to GetCurrentValidators applies this filter already.

Copy link
Collaborator Author

@michaelkaplan13 michaelkaplan13 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I don't think we ever expect it currently. I'll add a warning log for this case. I think it still makes sense to continue rather than returning an error, because the relayer should be able to handle this case the same as if the given validator didn't register a BLS public key on the P-chain. As long as a sufficient % of the subnet stake weight has BLS keys registered and set here, the message should be able to be successfully relayed still.

Update: Realizing now this is separate from validators that don't have BLS keys. It's if the RPC doesn't apply the proper filter. That being said, I'll still add a warning log and still thinking continue is the best option.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I agree.

Can we add a check below that primaryVdr.Signer != nil? And if it is, we can emit a warning and continue like we do here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think that proper check is already there now actually. I chose not to emit any log in that case because it's expected (or at least certainly possible) for the primary network validator to not have any BLS public key registered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you feel strongly otherwise though. I'm neither here or there on the log 👍

Copy link
Collaborator

@cam-schultz cam-schultz Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but it looks like the first time we access primaryVdr.Signer it's unchecked. If it's nil, that would cause the app to panic.

EDIT: was looking at a stale commit

if !ok {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add documentation that if we have a validator taht we don't find it's public key in primary validator set we still include it for mapping, so it'd affect total stake weight, and need to check whether it has a public key

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll add it below. Worth noting that this case here is actually just if the "GetCurrentValidators" returns an unexpected node ID that wasn't included in the filter in the request body. I'll document that also.

v.logger.Warn(
"Unexpected primary network validator returned by getCurrentValidators request",
zap.String("subnetID", subnetID.String()),
zap.String("nodeID", primaryVdr.NodeID.String()))
continue
}
vdr.PublicKey = primaryVdr.Signer.Key()

// Validators that do not have a BLS public key registered on the P-chain are still
// included in the result because they affect the stake weight of the subnet validators.
// Such validators will not be queried for BLS signatures of warp messages. As long as
// sufficient stake percentage of subnet validators have registered BLS public keys,
// messages can still be successfully relayed.
if primaryVdr.Signer != nil {
vdr.PublicKey = primaryVdr.Signer.Key()
}
}

return res, nil
}

// Gets the validator set of the given subnet at the given P-chain block height.
// Attempts to use the "getValidatorsAt" API first. If not available, falls back
// to use "getCurrentValidators", ignoring the specified P-chain block height.
func (v *CanonicalValidatorClient) GetValidatorSet(
ctx context.Context,
height uint64,
subnetID ids.ID,
) (map[ids.NodeID]*validators.GetValidatorOutput, error) {
// First, attempt to use the "getValidatorsAt" RPC method. This method may not be available on
// all API nodes, in which case we can fall back to using "getCurrentValidators" if needed.
res, err := v.client.GetValidatorsAt(ctx, subnetID, height)
if err != nil {
v.logger.Debug(
"P-chain RPC to getValidatorAt returned error. Falling back to getCurrentValidators",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we include the error in log?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, added

zap.String("subnetID", subnetID.String()),
zap.Uint64("pChainHeight", height),
zap.Error(err))
return v.getCurrentValidatorSet(ctx, subnetID)
}
return res, nil
}
2 changes: 1 addition & 1 deletion relayer/relayer.go
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ func NewRelayer(
)
r := Relayer{
pChainClient: pChainClient,
canonicalValidatorClient: NewCanonicalValidatorClient(pChainClient),
canonicalValidatorClient: NewCanonicalValidatorClient(logger, pChainClient),
currentRequestID: rand.Uint32(), // Initialize to a random value to mitigate requestID collision
network: network,
sourceSubnetID: subnetID,
Expand Down
Loading