Replication/failover simulation skeleton #6627

taylanisikdemir · 2025-01-16T05:11:34Z

What changed?

Adds the skeleton of the replication/failover simulation tests. The actual simulation logic will be added in a follow-up PR.

The simulation runs 2 Cadence clusters. Both clusters are configured via environment variables in the docker-compose-local-replication-simulation.yml file. The environment variables are passed to dockerize, which materializes the config template during container startup.
Cluster endpoints:
- cluster0 listens on cadence-cluster0:7833 grpc endpoint
- cluster1 listens on cadence-cluster1:7833 grpc endpoint
Web endpoints:
- http://localhost:8088
- http://localhost:8089
There is a single Cassandra instance. Each cluster uses its own Cassandra keyspace, so each has its own isolated tables.

There is a single Prometheus/Grafana pair for metrics. The scrape config adds a cluster label (cluster0 or cluster1) to metrics based on the endpoint.

http://localhost:9090/query?g0.expr=cadence_requests
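The cluster label can also be inspected from the command line. A minimal sketch, assuming Prometheus is reachable on localhost:9090 as in the URL above and serves its standard HTTP query API; the `prom_query` helper name and the `DRY_RUN` switch are illustrative and not part of this PR:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run an instant query against
# the Prometheus HTTP API. Set DRY_RUN=1 to print the curl command instead
# of executing it.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
	local expr="$1"
	# -G sends the url-encoded data as a GET query string.
	${DRY_RUN:+echo} curl -sG "${PROM_URL}/api/v1/query" \
		--data-urlencode "query=${expr}"
}
```

For example, `prom_query 'count by (cluster) (cadence_requests)'` should return one series per cluster label value, provided the scrape config labels both clusters as described.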

Black Box Simulation Approach

In addition to the containers above, a separate container runs the simulation code. It communicates with both clusters via the APIs exposed on the endpoints mentioned above. This differs from the existing integration tests and simulations, which use internal clients to make Cadence requests and therefore have to initialize (or mimic) the full set of Cadence service components in test code. The existing approach is useful when you want to mock server behavior, which is possible because the test code and the Cadence services share the same runtime.
The new approach interacts with Cadence only through its exposed APIs. Each simulation scenario configuration includes a dynamic config override file and a description of what to do during the simulation.

Why do we need replication/failover simulation?

The simulation should make it quick to validate various edge cases locally, which in turn makes it easier to iterate on replication improvements and active-active mode development.

How did you test it?

Run ./scripts/run_replication_simulator.sh
Check the full test logs in test.log
Check that test-domain exists. The simulation code is supposed to create it via the gRPC endpoints.

cluster="cadence-cluster0"  # or "cadence-cluster1"
docker run -it --rm \
	--network services-network \
	ubercadence/cli:master \
	--address $cluster:7833 \
	--transport grpc \
	--domain test-domain \
	domain desc

Also check the same via the UI for both clusters:
- http://localhost:8088/domains/test-domain/settings
- http://localhost:8089/domains/test-domain/settings
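The CLI check above can be repeated against both clusters in one go. A minimal sketch that reuses the same image, network, and flags as the manual command; the `describe_test_domain` and `check_all_clusters` names and the `DRY_RUN` switch are illustrative and not part of this PR, and `-it` is dropped on the assumption that the check runs non-interactively:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): describe test-domain on one
# cluster using the same CLI image and flags as the manual check.
# Set DRY_RUN=1 to print the docker command instead of running it.
describe_test_domain() {
	local cluster="$1"
	${DRY_RUN:+echo} docker run --rm \
		--network services-network \
		ubercadence/cli:master \
		--address "${cluster}:7833" \
		--transport grpc \
		--domain test-domain \
		domain desc
}

# Check every cluster; return non-zero if any describe call fails.
check_all_clusters() {
	local cluster status=0
	for cluster in cadence-cluster0 cadence-cluster1; do
		echo "== ${cluster} =="
		describe_test_domain "${cluster}" || status=1
	done
	return "${status}"
}
```

Calling `check_all_clusters` after the simulator has finished should print the domain description once per cluster. With `DRY_RUN=1` it only prints the two docker commands.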

What is next?

Implement config-driven simulation logic that runs workflows at various times and performs failover(s) at specified times.
As in the matching simulation, additional structured logs will be emitted from various services and parsed to generate the final simulation output. Example information to capture: domain failover flow, replication state of individual tasks, etc.
Add more scenarios as needed to validate and measure improvements. For example, a scenario for the adaptive replication batch size feature would check that replication throughput/latency improves, and a scenario for the active-active mode feature would validate workflow states on each cluster.
Add the basic/default scenario as a CI check that runs along with the other integration tests.

taylanisikdemir added 3 commits January 14, 2025 15:53

Replication/failover simulation skeleton

b60a92b

fixes

b8e76ec

final fixes

adf7e1c

taylanisikdemir marked this pull request as ready for review January 16, 2025 17:09

taylanisikdemir requested review from Shaddoll, neil-xie, davidporter-id-au, Groxx, shijiesheng, jakobht, 3vilhamster, sankari165, dkrotx and demirkayaender as code owners January 16, 2025 17:09

taylanisikdemir changed the title ~~Taylan/replication sim skeleton~~ Replication/failover simulation skeletong Jan 16, 2025

taylanisikdemir changed the title ~~Replication/failover simulation skeletong~~ Replication/failover simulation skeleton Jan 16, 2025

import order fix

3172f87

Shaddoll approved these changes Jan 16, 2025

taylanisikdemir merged commit 6f0a746 into cadence-workflow:master Jan 16, 2025
22 checks passed

taylanisikdemir deleted the taylan/replication_sim_skeleton branch January 16, 2025 20:55

taylanisikdemir mentioned this pull request Jan 28, 2025

Replication/failover simulation continued #6645

Merged

taylanisikdemir mentioned this pull request Feb 6, 2025

Replication/failover simulation operations: startWorkflow, failover and validate #6655

Merged
