Replication/failover simulation skeleton #6627
Merged
What changed?
This PR adds the skeleton of the replication/failover simulation tests. The actual simulation logic will be added in a follow-up PR.
The simulation runs 2 Cadence clusters. Both clusters are configured via environment variables in the docker-compose-local-replication-simulation.yml file. The environment variables are passed to dockerize to materialize the config template during container startup.
Cluster endpoints:
cadence-cluster0:7833 (grpc endpoint)
cadence-cluster1:7833 (grpc endpoint)
Web endpoints:
http://localhost:8088
http://localhost:8089
There is a single Cassandra instance. Each cluster uses its own Cassandra keyspace, so each has its own isolated tables.
A cluster: cluster0/1 label is added to metrics based on endpoint.
Example query: http://localhost:9090/query?g0.expr=cadence_requests
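To illustrate how environment variables can materialize a config template at container startup, here is a minimal Go sketch. dockerize renders Go text/template files and exposes environment variables under .Env; the sketch mimics that with the standard library only. The template keys and the variable names CLUSTER_NAME and KEYSPACE are assumptions made for illustration, not taken from the actual compose file or config template.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// configTemplate is a hypothetical config fragment. The keys and the
// variable names (CLUSTER_NAME, KEYSPACE) are illustrative assumptions.
const configTemplate = `clusterName: {{ .Env.CLUSTER_NAME }}
keyspace: {{ .Env.KEYSPACE }}
`

// envMap converts "KEY=VALUE" pairs (the format of os.Environ) into a map.
func envMap(environ []string) map[string]string {
	m := make(map[string]string, len(environ))
	for _, kv := range environ {
		if k, v, ok := strings.Cut(kv, "="); ok {
			m[k] = v
		}
	}
	return m
}

// renderConfig fills the template from env and fails if a referenced
// variable is missing, so a misconfigured container is caught early.
func renderConfig(tmpl string, env map[string]string) (string, error) {
	t, err := template.New("config").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	data := struct{ Env map[string]string }{Env: env}
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	env := envMap(os.Environ())
	// Fallback values so the sketch also runs outside docker-compose.
	if _, ok := env["CLUSTER_NAME"]; !ok {
		env["CLUSTER_NAME"] = "cluster0"
	}
	if _, ok := env["KEYSPACE"]; !ok {
		env["KEYSPACE"] = "cadence_cluster0"
	}
	out, err := renderConfig(configTemplate, env)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Running the same template with different environment values for each container is what lets two clusters share one config template while pointing at separate keyspaces.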

Black Box Simulation Approach
Besides the containers above, there is another container which runs the simulation code. It communicates with both clusters via the APIs exposed on the endpoints mentioned above. This differs from the existing integration tests/simulations, which use internal clients to perform Cadence requests; that requires replicating full Cadence service component initialization in test code. The existing approach is useful if you want to mock server behavior, which is possible because test code and Cadence services share the same runtime.
The new approach interacts with Cadence only via its exposed APIs. Each simulation scenario configuration includes a dynamic config override file and a description of what to do during the simulation.
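As a rough illustration of what such a scenario configuration could carry, the Go sketch below parses a hypothetical scenario description and checks that every step targets a known cluster. The Scenario and Operation types, the JSON field names, the override file path and the action names are all assumptions made for this example; the real format is left to the follow-up PR. Only the cluster endpoints come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Operation is a hypothetical simulation step aimed at one cluster.
type Operation struct {
	Cluster string `json:"cluster"`
	Action  string `json:"action"`
}

// Scenario is a hypothetical scenario shape: a dynamic config override
// file plus the steps to perform against the clusters' public endpoints.
type Scenario struct {
	Name                  string            `json:"name"`
	DynamicConfigOverride string            `json:"dynamicConfigOverride"`
	Clusters              map[string]string `json:"clusters"` // name -> gRPC endpoint
	Operations            []Operation       `json:"operations"`
}

// parseScenario decodes a scenario and rejects unknown fields and
// operations that reference a cluster missing from the Clusters map.
func parseScenario(raw string) (*Scenario, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	var s Scenario
	if err := dec.Decode(&s); err != nil {
		return nil, err
	}
	for i, op := range s.Operations {
		if _, ok := s.Clusters[op.Cluster]; !ok {
			return nil, fmt.Errorf("operation %d targets unknown cluster %q", i, op.Cluster)
		}
	}
	return &s, nil
}

// exampleScenario uses the endpoints from the PR description; the file
// path and action names are placeholders.
const exampleScenario = `{
  "name": "default",
  "dynamicConfigOverride": "dynamicconfig/override.yml",
  "clusters": {
    "cluster0": "cadence-cluster0:7833",
    "cluster1": "cadence-cluster1:7833"
  },
  "operations": [
    {"cluster": "cluster0", "action": "register_domain"},
    {"cluster": "cluster1", "action": "failover_domain"}
  ]
}`

func main() {
	s, err := parseScenario(exampleScenario)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, op := range s.Operations {
		fmt.Printf("%s -> %s (%s)\n", op.Action, op.Cluster, s.Clusters[op.Cluster])
	}
}
```

Because the scenario only names endpoints and actions, the simulation code needs no access to server internals, which is the point of the black box approach.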
Why do we need replication/failover simulation?
The simulation should help quickly validate various edge cases locally, which will make it easier to iterate on replication improvements and active-active mode development.
How did you test it?
./scripts/run_replication_simulator.sh
test-domain
which is supposed to be created by the simulation code via the grpc endpoints.
What is next?