Hypothesis: Partition one Broker with Gateway doesn't affect other partitions #29

Open
Zelldon opened this issue Jun 26, 2020 · 1 comment
Labels
Contribution: Availability This issue will contribute to build up confidence in reliability. Hypothesis A thing which worries us and is ready for exploration. Impact: Medium The issue has an medium impact on the system. Likelihood: High The likelihood of this issue is really high!

Comments

Member

Zelldon commented Jun 26, 2020

Hypothesis

We believe that when we isolate one Broker (the leader of a partition) from the Gateway, other partitions are not affected.

Expected during the experiment:

  • the topology stays the same, since the gateway can ping indirectly (whether this is ideal is debatable)
  • when Broker 0 is the leader for a partition, processing for that partition stops, but other partitions should not be affected
  • we can determine from the metrics that the Gateway and the Broker can't connect to each other
  • after reconnecting, the affected partition should recover
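
The second expectation can be written down as a small check. This is only a minimal sketch: it assumes we can sample some "last processed position" per partition before and during the isolation. The function names and the sample numbers are hypothetical and do not refer to actual Zeebe metric names.

```python
from typing import Dict, Set


def stalled_partitions(before: Dict[int, int], during: Dict[int, int]) -> Set[int]:
    """Return the partitions whose processed position did not advance.

    Both arguments map a partition id to a hypothetical "last processed
    position" sample; a partition missing from `during` counts as stalled.
    """
    return {p for p, pos in before.items() if during.get(p, pos) <= pos}


def hypothesis_holds(before: Dict[int, int], during: Dict[int, int],
                     isolated: Set[int]) -> bool:
    """True if exactly the partitions led by the isolated Broker stalled."""
    return stalled_partitions(before, during) == isolated


# Made-up samples: partition 2 is led by the isolated Broker.
before = {1: 100, 2: 200, 3: 300}
during = {1: 150, 2: 200, 3: 360}
print(hypothesis_holds(before, during, isolated={2}))
```

Comparing against a set of partitions, rather than a single id, covers the case where the isolated Broker leads more than one partition.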
@Zelldon Zelldon added Hypothesis A thing which worries us and is ready for exploration. Impact: Medium The issue has an medium impact on the system. Likelihood: High The likelihood of this issue is really high! Contribution: Availability This issue will contribute to build up confidence in reliability. labels Jun 26, 2020
Member Author

Zelldon commented Jun 26, 2020

Yesterday we ran a chaos experiment to verify this.

Observations:

As expected, we saw no difference in the topology. All commands sent to that partition timed out. Other partitions were not affected 👍 From the metrics we have, we saw that there is no progress in the partition, that the partition is still healthy (which makes sense), and that a lot of timeouts are happening.

Unfortunately, we need to correlate multiple metrics to conclude that the cause might be a connectivity issue. I think we can improve here. For example, it is not directly visible that one partition stopped processing. For that, @pihme had a good idea: we will add a new panel that directly shows the current record processing stats. I think this is also useful for exporting, to see directly whether we currently have exporting problems.

[Images: reduce2, reduce3]

What else is missing on the metrics side from my point of view:

  • a panel which shows me that all requests to a specific partition currently time out.
  • metrics for the transport between gateway and broker, to better analyze problems like this. It would be nice to have "Introduce gateway-broker transport metrics" (camunda/camunda#4487).
  • Liveness and Health stats of the Gateway in the metrics. I think this is currently not supported?
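
To make the first point more concrete, here is a rough sketch of what such a panel would compute. It assumes we had per-partition counts of total and timed-out requests for a time window, which is exactly the part that is missing today. The function name, the input shape and the numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def partitions_fully_timing_out(counts: Dict[int, Tuple[int, int]],
                                min_requests: int = 1) -> List[int]:
    """Return the partitions where every request in the window timed out.

    `counts` maps a partition id to (total requests, timed-out requests)
    for one time window. Partitions with fewer than `min_requests`
    requests are skipped, so an idle partition is not flagged.
    """
    return sorted(
        p for p, (total, timed_out) in counts.items()
        if total >= min_requests and timed_out == total
    )


# Made-up window: all requests to partition 2 timed out.
window = {1: (120, 0), 2: (80, 80), 3: (95, 1)}
print(partitions_fully_timing_out(window))
```

The `min_requests` threshold is a design choice for the sketch: without it, a partition that received no requests at all would look the same as one where everything times out.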

After reconnecting the nodes, we saw that the related partition started to process again. Interestingly, it seems that some traffic had piled up, and after reconnecting we saw a burst against partition one (partition 2 was disconnected), but this caused no issues.

I think this was a good and interesting experiment again, and it gave us a bit more insight into what else we need.
