Replication/failover simulation skeleton #6627

@taylanisikdemir taylanisikdemir commented Jan 16, 2025

What changed?

Adding the skeleton of replication/failover simulation tests. Actual simulation logic will be added in a follow up PR.

  • The simulation runs 2 Cadence clusters. Both clusters are configured via environment variables in the docker-compose-local-replication-simulation.yml file. The environment variables are passed to dockerize, which materializes the config template during container startup.
    Cluster endpoints:

    • cluster0 listens on the cadence-cluster0:7833 gRPC endpoint
    • cluster1 listens on the cadence-cluster1:7833 gRPC endpoint

    Web endpoints:

    • http://localhost:8088
    • http://localhost:8089
  • There is a single Cassandra instance. Each cluster uses its own Cassandra keyspace, so each has its own isolated tables.

  • There is a single Prometheus/Grafana pair for metrics. The scrape config adds a cluster: cluster0 or cluster: cluster1 label to each metric based on the endpoint it was scraped from.

http://localhost:9090/query?g0.expr=cadence_requests
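
The per-cluster label makes it possible to compare the two clusters with a single query. Below is a minimal sketch of doing that from a script, assuming Prometheus is reachable on localhost:9090 as in the URL above and that the label values are exactly `cluster0` and `cluster1`. It uses the standard Prometheus HTTP API path `/api/v1/query`; the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Prometheus address taken from the URL above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(cluster: str, metric: str = "cadence_requests") -> str:
    """Build an instant-query URL that filters `metric` by the `cluster` label."""
    promql = f'{metric}{{cluster="{cluster}"}}'
    return f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def series_count(cluster: str) -> int:
    """Return how many series Prometheus currently reports for the cluster."""
    with urllib.request.urlopen(build_query_url(cluster), timeout=5) as resp:
        body = json.load(resp)
    return len(body["data"]["result"])
```

With the simulation running, calling `series_count("cluster0")` and `series_count("cluster1")` should both return a non-zero value if the scrape config labels both clusters as described.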

Black Box Simulation Approach

Besides the containers above, there is another container which runs the simulation code. It communicates with both clusters through the APIs exposed on the endpoints mentioned above. This differs from the existing integration tests and simulations, which use internal clients to perform Cadence requests and therefore have to initialize or mimic the full set of Cadence service components in test code. The existing approach is useful if you want to mock server behavior, which is possible because the test code and the Cadence services share the same runtime.
The new approach interacts with Cadence only via its exposed APIs. Each simulation scenario configuration includes a dynamic config override file and a description of what to do during the simulation.
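
To make the shape of a scenario configuration more concrete, here is a rough sketch in Python. It is purely illustrative: this PR only adds the skeleton, and every field name below (`dynamic_config_override`, `operations`, `at_seconds`, `action`, `cluster`) as well as the JSON format is an assumption, not the format the simulation code uses.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Operation:
    at_seconds: float  # offset from simulation start (hypothetical field)
    action: str        # e.g. "start_workflow" or "failover" (hypothetical values)
    cluster: str       # cluster the action targets, e.g. "cluster0"


@dataclass
class Scenario:
    name: str
    dynamic_config_override: str  # path to the dynamic config override file
    operations: list = field(default_factory=list)


def load_scenario(text: str) -> Scenario:
    """Parse a scenario from JSON and order its operations by start time."""
    raw = json.loads(text)
    ops = [Operation(**op) for op in raw.get("operations", [])]
    ops.sort(key=lambda op: op.at_seconds)
    return Scenario(raw["name"], raw["dynamic_config_override"], ops)
```

Sorting the operations once at load time lets a driver walk the list in order and sleep until each offset, which keeps the scenario file itself order-independent.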

Why do we need replication/failover simulation?

The simulation should help quickly validate various edge cases locally, which will make it easier to iterate on replication improvements and active-active mode development.

How did you test it?

  1. Run ./scripts/run_replication_simulator.sh
  2. Check full test logs in test.log
  3. Check that test-domain exists. It is supposed to be created by the simulation code via the gRPC endpoints.
cluster="cadence-cluster0"  # or "cadence-cluster1"
docker run -it --rm \
	--network services-network \
	ubercadence/cli:master \
	--address $cluster:7833 \
	--transport grpc \
	--domain test-domain \
	domain desc
  4. Also check the same via the web UI for both clusters
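
The CLI check in step 3 can be repeated for both clusters from a small script. The sketch below wraps the same `docker run` invocation shown above, minus `-it` since no terminal is needed. It assumes the CLI exits with a non-zero status when the domain cannot be described; the function names are made up for illustration.

```python
import subprocess

# Host names of the two clusters on the docker network, as listed above.
CLUSTERS = ("cadence-cluster0", "cadence-cluster1")


def domain_desc_command(cluster: str, domain: str = "test-domain") -> list:
    """Build the `docker run` argument list for `domain desc` against one cluster."""
    return [
        "docker", "run", "--rm",
        "--network", "services-network",
        "ubercadence/cli:master",
        "--address", f"{cluster}:7833",
        "--transport", "grpc",
        "--domain", domain,
        "domain", "desc",
    ]


def domain_exists(cluster: str, domain: str = "test-domain") -> bool:
    """Run the CLI and treat exit code 0 as success (an assumption about the CLI)."""
    result = subprocess.run(
        domain_desc_command(cluster, domain), capture_output=True, text=True
    )
    return result.returncode == 0
```

With the simulation running, `all(domain_exists(c) for c in CLUSTERS)` should be true once the domain has been registered and replicated to both clusters.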

What is next?

  • Implement config-driven simulation logic that runs workflows and performs failover(s) at specified times.
  • As we did in the matching simulation, additional structured logs will be emitted from various services. Those structured logs will be parsed to generate the final simulation output. Example information to capture: domain failover flow, replication state of individual tasks, etc.
  • Add more scenarios as needed to validate and measure improvements. For example, a scenario for the adaptive replication batch size feature would validate that replication throughput and latency improve, and a scenario for the active-active mode feature would validate workflow states on each cluster.
  • Add the basic/default scenario as a CI check that runs along with the other integration tests.
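
As a rough illustration of the log-parsing step, the sketch below counts simulation events per cluster from JSON log lines. The log format is not defined in this PR, so the `sim_event` and `cluster` keys and the event names in the usage note are hypothetical placeholders.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Count (cluster, event) pairs found in JSON-formatted log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not structured logs
        if not isinstance(entry, dict):
            continue
        event = entry.get("sim_event")  # hypothetical key marking simulation events
        if event:
            counts[(entry.get("cluster", "unknown"), event)] += 1
    return counts
```

Feeding it the lines of a log file, for instance `summarize_events(open("test.log"))`, would yield counts keyed by pairs such as `("cluster0", "domain_failover")`, which a final report could then print per cluster.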

@taylanisikdemir taylanisikdemir marked this pull request as ready for review January 16, 2025 17:09
@taylanisikdemir taylanisikdemir changed the title Taylan/replication sim skeleton Replication/failover simulation skeletong Jan 16, 2025
@taylanisikdemir taylanisikdemir changed the title Replication/failover simulation skeletong Replication/failover simulation skeleton Jan 16, 2025
@taylanisikdemir taylanisikdemir merged commit 6f0a746 into cadence-workflow:master Jan 16, 2025
22 checks passed
@taylanisikdemir taylanisikdemir deleted the taylan/replication_sim_skeleton branch January 16, 2025 20:55
2 participants