
Fix logic for reconnecting to new nodes #309

Closed

Conversation

peixian commented Jun 27, 2023

This is a WIP PR to fix a bug we've spotted in https://github.com/blitzstudios/triton, where nodes that are marked as down are immediately readmitted as up.

What

For example:

iex(10)>     :os.cmd('ccm node1 stop')

17:01:30.742 [debug] Received event: %Xandra.Cluster.StatusChange{effect: "DOWN", address: {127, 0, 0, 1}, port: 9042}

17:01:30.742 [debug] Socket closed

17:01:30.742 [debug] Opening new connection

17:01:30.742 [debug] Host marked as DOWN: 127.0.0.1:9042

17:01:30.743 [debug] Restarted pool to: 127.0.0.3:9042

17:01:30.767 [debug] Supported options: %{"COMPRESSION" => ["lz4"], "CQL_VERSION" => ["3.4.4"], "PROTOCOL_VERSIONS" => ["3/v3", "4/v4", "5/v5-beta"]}

17:01:30.767 [debug] Sending STARTUP frame with protocol Xandra.Protocol.V4 and requested options: %{"CQL_VERSION" => "3.4.4"}

17:01:30.767 [debug] Connected successfully, using protocol Xandra.Protocol.V4

17:01:30.767 [debug] Supported options: %{"COMPRESSION" => ["lz4"], "CQL_VERSION" => ["3.4.4"], "PROTOCOL_VERSIONS" => ["3/v3", "4/v4", "5/v5-beta"]}

17:01:30.767 [debug] Sending STARTUP frame with protocol Xandra.Protocol.V4 and requested options: %{"CQL_VERSION" => "3.4.4"}

17:01:30.768 [debug] Received READY frame

17:01:30.768 [debug] Received READY frame

17:01:30.779 [debug] Established control connection (protocol Xandra.Protocol.V4)

17:01:30.779 [debug] Host reported as UP: 127.0.0.1:9042

In this example, the driver marks 127.0.0.1 as DOWN and then immediately readmits it as UP. This is reproducible with ccm by just connecting to the cluster and stopping a node.

Currently, the logic in https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L486 automatically assumes any peers discovered are up, even though the previous state might be down. In addition, https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L544C9-L544C51 assumes that all nodes within the peers table are alive and up, which is not the case, as the table can lag behind newly connected nodes.

What basically happens is:

  1. A node goes down; the driver notices and marks it as DOWN.
  2. The driver immediately reattempts to connect to a new node.
  3. Connecting to the new node causes refresh_topology to be called with data from the peers table, which is always assumed to be up to date, even though it can lag behind.
  4. The node that went down is then marked as UP again, even though it is no longer responding.

How

This PR adds an extra check when peers are added after a reconnect, attempting to make sure a node is actually reachable before marking it as up again. It also adds a last_seen_at value to help with debugging.
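To make the idea concrete, here is a minimal sketch of such a guard. This is not the actual diff from this PR: the module name, the shape of the peer map, and the reachable?/1 helper are illustrative assumptions, and the probe shown is a bare TCP connect rather than a full native-protocol handshake.

# Illustrative sketch only; names and data shapes are assumptions, not the PR's code.
defmodule ReadmissionSketch do
  require Logger

  # Readmit a previously-DOWN peer as up only after verifying it is reachable.
  def readmit_peer(peer, now \\ System.monotonic_time(:millisecond))

  def readmit_peer(%{status: :down} = peer, now) do
    if reachable?(peer) do
      Logger.debug("Host re-verified as UP: #{format(peer)}")
      %{peer | status: :up, last_seen_at: now}
    else
      Logger.debug("Host still unreachable, keeping it DOWN: #{format(peer)}")
      peer
    end
  end

  def readmit_peer(peer, now), do: %{peer | status: :up, last_seen_at: now}

  # Cheapest possible reachability check: a bare TCP connect with a short timeout.
  defp reachable?(%{address: address, port: port}) do
    case :gen_tcp.connect(address, port, [:binary, active: false], 1_000) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end

  defp format(%{address: address, port: port}), do: "#{:inet.ntoa(address)}:#{port}"
end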

There's a new testing environment introduced with ccm. The Docker image has everything that's needed; build it with docker build -f ./test_ccm/Dockerfile . and start it with docker run. For example:

docker run -it $(docker build -q -f ./test_ccm/Dockerfile .) mix test ./test_ccm/test_ccm.exs

should build the image and run the whole ccm test suite inside the container.


peixian commented Jun 27, 2023

@whatyouhide I'm putting this up in case you want to give any early feedback; otherwise I'll just continue fleshing out the additional ccm tests.

@peixian force-pushed the peixian/fix-clustering-logic branch from 6cb9b64 to d66b766 on June 28, 2023 02:08
@peixian marked this pull request as ready for review on June 28, 2023 02:08
whatyouhide (Owner) commented

@peixian I don't have time right now to go through this, but I will likely have some time next week. Let me know when this is ready to review!


peixian commented Jun 28, 2023

@whatyouhide no worries, this should be good to review now

whatyouhide (Owner) left a comment

Thanks for the awesome work @peixian! 🙏

I think this has a fundamental issue: I don't think Xandra should attempt to connect to a node just to check whether it's up. The current code doesn't even close the connection once established (in attempt_to_connect_to_node/2), which means we'll have zombie connections lying around.

We already introduced a :connected state in #306. The three states of a node from Xandra's perspective are now:

  • :down - the node is reported as down by C*.
  • :up - the node is reported as up by C*, but Xandra doesn't really know. This is for nodes that are available to connect to according to C*.
  • :connected - Xandra actually established a connection to this node, so it's definitely up, regardless of lags and stuff.

I think these states should be enough to encode information about the nodes?

We can keep some of the logic in this new PR, such as checking if a node was recently downed, but we need to integrate that with the load-balancing strategies. In general, I think we need to think about how we can "score" nodes to connect to. For example, a node that recently went down should have a lower reconnection score, so that we reconnect to it later on.
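Just to make the "score" idea concrete, below is one possible shape for it. None of these names exist in Xandra today, and the exact formula is only an assumption: connected nodes rank highest, unverified :up nodes next, and a node that went DOWN ranks lower the more recently it went down.

# Hypothetical reconnection scoring; these names are not Xandra APIs.
defmodule ReconnectionScore do
  # Higher scores are tried first.
  def score(%{state: :connected}), do: 3.0
  def score(%{state: :up}), do: 2.0

  def score(%{state: :down, downed_at: downed_at}) do
    seconds_down = System.system_time(:second) - downed_at
    # Approaches 1.0 the longer the node has been down, so freshly-downed
    # nodes end up being retried last.
    1.0 - 1.0 / (1.0 + seconds_down / 30)
  end

  def sort_candidates(nodes), do: Enum.sort_by(nodes, &score/1, :desc)
end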

Thoughts?

lib/xandra/cluster/control_connection.ex (two outdated review threads, resolved)
Comment on lines 416 to 418
res = {:ok, %ConnectedNode{connected_node | host: local_host}, peers}
Logger.debug("#{inspect(res)}")
res
whatyouhide (Owner):

I don't think this is something we should log. If we want to, we should log it with a friendly message, but I think we can skip the logging for now and go back to the previous code. Thoughts?

peixian (Author):

I added this section because between the previous DEBUG log line and the next one, there were quite a few places where state was being changed that couldn't be debugged without it. I'll add an additional message to it.

@@ -137,6 +137,7 @@ defmodule Xandra.Connection.Utils do
# TODO: handle :error frames for things like :protocol_violation.
case frame do
%Frame{kind: :ready, body: <<>>} ->
Logger.debug("FRAME READY #{inspect(frame)}, #{inspect(rest)}, #{inspect(socket)}}")
whatyouhide (Owner):

This is already logged one line below, so I don't think we should log this line as well.


peixian commented Jul 5, 2023

This PR originally held state in a downed_nodes map: 3d88dcf#diff-7238ddf7dd6f8f3e8da221c8a8319d85d5f0f6a656bb4866d2569bff02e9b0dcR191, but I removed this because it required tracking additional state on top of the possibly stale state within the system.local (https://github.com/lexhide/xandra/blob/83394f01b8531eff14a04bec075c9a280457e40c/lib/xandra/cluster/control_connection.ex#L523-L524).

The additional complexity of keeping track of cluster state seemed not worth it, so I just opted for the simple solution.

I agree the extra connection is not great; I didn't particularly want to add it, but connecting to the node seemed like the only way to know whether a node is alive and functioning after a DOWN event. Per https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec, it seems the only way to get a READY response is to connect to the node.
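For what it's worth, a slightly lighter variant of that check is sketched below: it still opens a TCP connection, but only exchanges an OPTIONS/SUPPORTED pair instead of completing the full STARTUP/READY handshake. The frame layout follows native_protocol_v4.spec; the module and function names are made up, and this is not Xandra code.

# Sketch of a protocol-level liveness probe; names are hypothetical, not Xandra code.
defmodule OptionsProbe do
  @options_opcode 0x05
  @supported_opcode 0x06

  def alive?(address, port, timeout \\ 1_000) do
    case :gen_tcp.connect(address, port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        result = probe(socket, timeout)
        :gen_tcp.close(socket)
        result

      {:error, _reason} ->
        false
    end
  end

  defp probe(socket, timeout) do
    # Request header per native_protocol_v4.spec: version 0x04, flags 0x00,
    # stream id 0 (2 bytes), opcode OPTIONS (0x05), body length 0.
    with :ok <- :gen_tcp.send(socket, <<0x04, 0x00, 0::16, @options_opcode, 0::32>>),
         {:ok, <<_version, _flags, _stream::16, opcode, _length::32, _rest::binary>>} <-
           :gen_tcp.recv(socket, 0, timeout) do
      # A live node answers OPTIONS with a SUPPORTED frame (opcode 0x06).
      opcode == @supported_opcode
    else
      _error -> false
    end
  end
end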


If the idea is that an :up node may actually be down, then the write logic needs to change, because writes should not go to an :up node without connecting to it first. Currently, writes blindly assume that :up nodes are usable.
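As a rough illustration of what I mean (purely hypothetical names, not Xandra's actual load-balancing API), write routing would first filter down to connected hosts:

# Hypothetical host selection for writes; not Xandra's actual API.
defmodule WriteRouting do
  def pick_write_host(hosts) do
    case Enum.filter(hosts, &(&1.state == :connected)) do
      [] -> {:error, :no_connected_hosts}
      connected -> {:ok, Enum.random(connected)}
    end
  end
end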

I'm open to whatever fixes this bug, @whatyouhide. The problem is that currently the library is not resilient to node restarts (for maintenance, chaos monkey, etc.). Restarting a node can trigger data loss and write failures in certain situations: if you are connected to a node and it goes down, writes fail during the interim period.


peixian commented Jul 17, 2023

@whatyouhide hey, checking in: did you get a chance to take a look at this?

@whatyouhide changed the title from "Fix Logic for Reconnecting to New Nodes" to "Fix logic for reconnecting to new nodes\" on Aug 2, 2023
@whatyouhide changed the title from "Fix logic for reconnecting to new nodes\" to "Fix logic for reconnecting to new nodes" on Aug 2, 2023
whatyouhide (Owner) commented

Hi @peixian, thanks for this, and sorry for the delay! I'm working on a few Xandra changes, so I'll close this out and incorporate your fixes in those changes. Thank you 💟

@whatyouhide closed this on Aug 3, 2023

peixian commented Aug 3, 2023

@whatyouhide sounds good. Any idea when this change will make it in? We'd prefer sooner, since this affects us in production.

whatyouhide (Owner) commented

@peixian no, no idea. Could be a few weeks. If you folks want to speed things up, there's a sponsor button on the homepage of this repo, and we can figure something out! 🙃


peixian commented Aug 4, 2023 via email
