
ShouldSendMessage decider #344

Merged
merged 32 commits into from
Jul 26, 2024

Conversation

feuGeneA
Contributor

@feuGeneA feuGeneA commented Jun 24, 2024

Why this should be merged

for #363

How this works

The relayer is informed of the decider service via config values that point to its host and port, and it communicates with that service process via gRPC to make the decision.

How this was tested

A simple decider implementation is included and incorporated into the tests.

@feuGeneA feuGeneA force-pushed the decider branch 5 times, most recently from 01d6860 to df9271e Compare June 28, 2024 15:45
@feuGeneA feuGeneA marked this pull request as ready for review July 12, 2024 22:45
@feuGeneA feuGeneA requested a review from a team as a code owner July 12, 2024 22:45
@feuGeneA feuGeneA requested a review from iansuvak July 12, 2024 22:45
Contributor

@bernard-avalabs bernard-avalabs left a comment
Generally makes sense to me. I left a few comments.

main/main.go Outdated
@@ -216,6 +225,33 @@ func main() {

// Create listeners for each of the subnets configured as a source
Contributor

Looks like this comment is duplicated on line 254. Was that intended?

Contributor Author

I moved it, but I think the original crept back in as a merge conflict. Thanks for pointing it out. Addressed in 5c6422e

main/main.go Outdated
panic(err)
}
runtime.SetFinalizer(grpcClient, func(c *grpc.ClientConn) { c.Close() })
grpcClient.WaitForStateChange(ctx, connectivity.Ready)
Contributor

I believe WaitForStateChange is currently marked as experimental:

https://pkg.go.dev/google.golang.org/grpc#ClientConn.WaitForStateChange

Not sure if this is an issue, just wanted to point it out.


@@ -167,7 +183,47 @@ func (m *messageHandler) ShouldSendMessage(destinationClient vms.DestinationClie
return false, nil
}

return true, nil
var decision bool = true
Contributor

Should this be initialized to false? It looks like the error cases are just returning decision without modifying it, and I assume we should return false in those cases.

Collaborator

The intended control flow here is to separate the existing ShouldSendMessage decision logic from the Decider logic that is dispatched to the service. If the Decider fails, then we should return the prior decision from the existing logic.

I think this could be made clearer by adding a new helper, and calling it like so:

ShouldSendMessage() {
  // Existing decision logic
  decision, err := m.dispatchToDecider()
  if err != nil {
    // log the error, but do not return it
    log.Warn("Decider returned error")
    return true, nil
  }
  return decision, nil
}

Contributor Author

I extracted the logic into a new helper, and adapted my changes to ShouldSendMessage to fit the existing flow logic (if __ return false; if ___ return false; ...; return true), in 584556e

Comment on lines 76 to 90

var ctx context.Context
ctx, cancelDecider = context.WithCancel(context.Background())
// we'll call cancelDecider in AfterSuite, but also call it if this
// process is killed, because AfterSuite won't always run then:
signalChan := make(chan os.Signal, 2)
signal.Notify(signalChan, os.Interrupt, syscall.SIGTERM)
go func() {
<-signalChan
cancelDecider()
}()
decider = exec.CommandContext(ctx, "./tests/cmd/decider/decider")
decider.Start()
log.Info("Started decider service")

Contributor

I don't think this is necessary. If ginkgo exits for whatever reason, everything that it was running should be cleaned up - we don't currently look for OS signals for the other processes we run.

What we do need is something that watches the decider process, and if it exits abnormally, we should panic. Example here https://github.com/ava-labs/awm-relayer/blob/main/tests/utils/utils.go#L91

Contributor Author

Are you sure it's not necessary? A little while ago I tried interrupting a test with Ctrl+C, and from the stdout it didn't look like it ran the AfterSuite hook, and the decider process was still running. And just now I tried again and saw that it did execute that hook, and the decider process was not running. So I think the behavior is inconsistent.

If the decider is left running/orphaned/defunct after the tests have finished, it will fail to start in the next test run: it always uses the same port, and that port will still be in use from the previous run. And if you're working on changes totally unrelated to the decider, it could be pretty confusing and unclear what's going on.

This uncertainty and fear of confusion is why I decided to put this code into place. I'm happy to defer to the team's opinion though.

The idea of watching the decider for abnormal exit is a great idea, thanks, I'll do that.

Contributor

Yeah, I think we should move to running our e2e tests in a docker container (if feasible), that way we get the cleanup for free, and wouldn't need to check for OS signals, but until then, this is helpful.

- name: Install buf
uses: bufbuild/[email protected]
with:
github_token: ${{ github.token }}
Collaborator

Is the github token necessary? Does the repo even have access to a gh token?

Contributor Author

Apparently not. I removed this in ed3f4e7 and everything still ran fine. I had just copied this from avalanchego. Thanks for noticing it.

Comment on lines +41 to +43
git update-index --really-refresh >> /dev/null
git diff-index HEAD # to show the differences
git diff-index --quiet HEAD || (echo 'protobuf generated code changes have not all been checked in' && exit 1)
Collaborator

What's the difference between this method of checking that generated files are checked in versus the one we use in teleporter? https://github.com/ava-labs/teleporter/blob/main/.github/workflows/abi_bindings_checker.yml#L41


@@ -101,7 +101,7 @@ awm-relayer --help Display awm-relayer usag

### Building

Before building, be sure to install Go, which is required even if you're just building the Docker image.
Before building, be sure to install Go, which is required even if you're just building the Docker image. You'll also need to install [buf](github.com/bufbuild/buf/).
Collaborator

Is buf required even if the Decider is unused (i.e. the relevant config options are omitted)?

Contributor Author

In order to build, yes, because the relayer binary is built to be aware of the protobuf interfaces, so that the same binary can be used later if you decide you do want to include those config options.

Collaborator

Let's add descriptions of the new config options.

config/config.go Outdated
Comment on lines 61 to 62
DeciderHost string `mapstructure:"decider-host" json:"decider-host"`
DeciderPort *uint16 `mapstructure:"decider-port" json:"decider-port"`
Collaborator

Can you add some validation on these new fields? If they're provided, they should be able to construct a valid URI.

Contributor Author

Addressed in a9c374c
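The requested validation could look something like the sketch below. The field semantics mirror the PR's `DeciderHost`/`DeciderPort` config fields, but the function name, the `grpc://` scheme, and the exact checks are assumptions about what a9c374c might do, not its actual contents.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// validateDeciderConfig checks that, when provided, the decider host and
// port together form a valid URI. Both fields absent means the decider is
// simply disabled; only one of the two being set is a configuration error.
func validateDeciderConfig(host string, port *uint16) error {
	if host == "" && port == nil {
		return nil // decider not configured; nothing to validate
	}
	if host == "" || port == nil {
		return fmt.Errorf("decider-host and decider-port must both be set")
	}
	// JoinHostPort handles IPv6 bracketing; ParseRequestURI rejects
	// malformed hosts.
	uri := "grpc://" + net.JoinHostPort(host, strconv.Itoa(int(*port)))
	if _, err := url.ParseRequestURI(uri); err != nil {
		return fmt.Errorf("invalid decider URI %q: %w", uri, err)
	}
	return nil
}

func main() {
	port := uint16(50051)
	fmt.Println(validateDeciderConfig("localhost", &port)) // <nil>
	fmt.Println(validateDeciderConfig("localhost", nil))   // both-or-neither error
}
```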

Comment on lines 75 to 80
var deciderClient pbDecider.DeciderServiceClient
if grpcClient == nil {
deciderClient = nil
} else {
deciderClient = pbDecider.NewDeciderServiceClient(grpcClient)
}
Collaborator

Can we move this to main so that grpcClient can be limited in scope to main? When we get to #365 we'll have to handle dialing different grpcClients per message type, so some refactoring will be necessary anyway.

Contributor Author

Addressed in b01b14b

@@ -167,7 +183,47 @@ func (m *messageHandler) ShouldSendMessage(destinationClient vms.DestinationClie
return false, nil
}

return true, nil
var decision bool = true
return decision, err
}

// TODO: add a timeout to the context
Collaborator

Let's include timeouts in this PR

Contributor Author

Addressed in f89bd35

@@ -0,0 +1,21 @@
syntax = "proto3";

package decider.v1;
Collaborator

Is the intent behind v1 to map awm-relayer versions to Decider versions? If so, I'd prefer if we followed the pattern used by Avalanchego / subnet-evm where protocol version compatibility is provided out-of-band. Otherwise, the codebase would continue to grow with each new version.

Contributor Author

The intent was just to follow the normal/standard conventions of buf+proto, at least as our starting point here. I stripped all of the custom lint rules out of the buf config file when copying it from avalanchego, and one of those required including the version in the service name.

In 3631b7b I restored that linter rule and changed things to not include the version in the service/package names.

config/config.go Outdated
Comment on lines 61 to 62
DeciderHost string `mapstructure:"decider-host" json:"decider-host"`
DeciderPort *uint16 `mapstructure:"decider-port" json:"decider-port"`
Collaborator

Just calling this out because I wasn't aware how it worked exactly myself:

AvalancheGo communicates with VM plugins via gRPC, but instead of assuming the VMs are running and providing their host/port in the configuration, it starts the configured VMs itself on the first available port on 127.0.0.1 here.

That type of pattern may be less error prone for relayers to set up, i.e., the relayer config would specify the decider plugin(s) (if any) to run for a given source chain, and the application would handle initializing/configuring them itself.

Not suggesting we do this now, but could be worth considering going forward.

Contributor

I think that what Michael described is the generally preferred pattern for colocated services, and I agree that less config is better and less error prone.

A counter example for why we would want to keep it, is that this allows for using remote deciders.

Contributor Author

Cam and I also discussed this. For this iteration I was going for the simplest thing possible, and sub-process management proved non-trivial, so I went with this host/port approach instead, since we said that ultimately we want to be able to support both local and remote deciders.

Going forward, I'm open to doing the sub-process/plugin approach; we may be able to leverage the sub-process modules available in avalanchego, but if not then the complexity added to the code base for this would be a lot, at least as compared to the host/port approach.

Also, at the moment I'm leaning towards just the host/port approach so that there's one single config interface, rather than one for a local decider (the plugin path) and a separate/different one for a remote decider (the host/port).

Contributor

@geoff-vball geoff-vball left a comment

LGTM :)

Collaborator

@cam-schultz cam-schultz left a comment

LGTM

@feuGeneA feuGeneA merged commit 6362bca into main Jul 26, 2024
6 checks passed
@feuGeneA feuGeneA deleted the decider branch July 26, 2024 16:07
0xchainlover pushed a commit to avalabsorg/awm-relayer that referenced this pull request Aug 1, 2024
follow the avalanchego convention of excluding the version number from the name
of the service.

addresses review comment
ava-labs/awm-relayer#344 (comment)
0xchainlover pushed a commit to avalabsorg/awm-relayer that referenced this pull request Aug 1, 2024
addresses review comment ava-labs/awm-relayer#344 (comment)

Signed-off-by: F. Eugene Aumson <[email protected]>