
Fix logic for reconnecting to new nodes #309

Closed

Conversation

peixian commented Jun 27, 2023

This is a WIP PR to fix a bug we've spotted in https://github.com/blitzstudios/triton, where nodes that are marked as down are immediately readmitted as up.

What

For example:

iex(10)>     :os.cmd('ccm node1 stop')

17:01:30.742 [debug] Received event: %Xandra.Cluster.StatusChange{effect: "DOWN", address: {127, 0, 0, 1}, port: 9042}

17:01:30.742 [debug] Socket closed

17:01:30.742 [debug] Opening new connection

17:01:30.742 [debug] Host marked as DOWN: 127.0.0.1:9042

17:01:30.743 [debug] Restarted pool to: 127.0.0.3:9042

17:01:30.767 [debug] Supported options: %{"COMPRESSION" => ["lz4"], "CQL_VERSION" => ["3.4.4"], "PROTOCOL_VERSIONS" => ["3/v3", "4/v4", "5/v5-beta"]}

17:01:30.767 [debug] Sending STARTUP frame with protocol Xandra.Protocol.V4 and requested options: %{"CQL_VERSION" => "3.4.4"}

17:01:30.767 [debug] Connected successfully, using protocol Xandra.Protocol.V4

17:01:30.767 [debug] Supported options: %{"COMPRESSION" => ["lz4"], "CQL_VERSION" => ["3.4.4"], "PROTOCOL_VERSIONS" => ["3/v3", "4/v4", "5/v5-beta"]}

17:01:30.767 [debug] Sending STARTUP frame with protocol Xandra.Protocol.V4 and requested options: %{"CQL_VERSION" => "3.4.4"}

17:01:30.768 [debug] Received READY frame

17:01:30.768 [debug] Received READY frame

17:01:30.779 [debug] Established control connection (protocol Xandra.Protocol.V4)

17:01:30.779 [debug] Host reported as UP: 127.0.0.1:9042

In this example, the driver marks 127.0.0.1 as DOWN and then immediately readmits it as UP. This is reproducible with ccm by just connecting to the cluster and stopping a node.

Currently, the logic in https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L486 automatically assumes any peers discovered are up, even though the previous state might be down. In addition, https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L544C9-L544C51 assumes that all nodes within the peers table are alive and up, which is not the case, as the table can lag behind newly connected nodes.

What basically happens is:

  1. A node goes down; the driver notices and marks it as DOWN.
  2. The driver immediately reattempts to connect to a new node.
  3. Connecting to the new node causes refresh_topology to be called with data from the peers table, which is always assumed to be up to date, even though it can lag behind.
  4. The node that went down is then marked as UP again, even though it is no longer responding.

How

This PR adds an extra check when peers are added after a reconnect, attempting to make sure a node is actually reachable before marking it as up again. It also adds a last_seen_at value to help with debugging.
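To make the idea concrete, here is a minimal sketch of such a guard. This is not the actual diff from this PR: the module name, the shape of the peer map, and the reachable?/1 helper are illustrative assumptions, and the probe shown is a bare TCP connect rather than a full native-protocol handshake.

# Illustrative sketch only; names and data shapes are assumptions, not the PR's code.
defmodule ReadmissionSketch do
  require Logger

  # Readmit a previously-DOWN peer as up only after verifying it is reachable.
  def readmit_peer(peer, now \\ System.monotonic_time(:millisecond))

  def readmit_peer(%{status: :down} = peer, now) do
    if reachable?(peer) do
      Logger.debug("Host re-verified as UP: #{format(peer)}")
      %{peer | status: :up, last_seen_at: now}
    else
      Logger.debug("Host still unreachable, keeping it DOWN: #{format(peer)}")
      peer
    end
  end

  def readmit_peer(peer, now), do: %{peer | status: :up, last_seen_at: now}

  # Cheapest possible reachability check: a bare TCP connect with a short timeout.
  defp reachable?(%{address: address, port: port}) do
    case :gen_tcp.connect(address, port, [:binary, active: false], 1_000) do
      {:ok, socket} ->
        :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end

  defp format(%{address: address, port: port}), do: "#{:inet.ntoa(address)}:#{port}"
end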

There's a new testing environment introduced with ccm. The Docker image has everything that's needed; build it with docker build -f ./test_ccm/Dockerfile . and start it with docker run. For example:

docker run -it $(docker build -q -f ./test_ccm/Dockerfile .) mix test ./test_ccm/test_ccm.exs

should build the image and run the whole ccm test suite inside the container.


peixian commented Jun 27, 2023

@whatyouhide I'm putting this up in case you want to give any early feedback; otherwise I'll just continue fleshing out the additional ccm tests.

@peixian force-pushed the peixian/fix-clustering-logic branch from 6cb9b64 to d66b766 on June 28, 2023 02:08
@peixian marked this pull request as ready for review on June 28, 2023 02:08
whatyouhide (Owner) commented

@peixian I don't have time right now to go through this, but I will likely have some time next week. Let me know when this is ready to review!


peixian commented Jun 28, 2023

@whatyouhide no worries, this should be good to review now

whatyouhide (Owner) left a comment

Thanks for the awesome work @peixian! 🙏

I think this has a fundamental issue: I don't think Xandra should attempt to connect to a node just to check whether it's up. The current code doesn't even close the connection once established (in attempt_to_connect_to_node/2), which means we'll have zombie connections lying around.

We already introduced a :connected state in #306. The three states of a node from Xandra's perspective are now:

  • :down - the node is reported as down by C*.
  • :up - the node is reported as up by C*, but Xandra doesn't really know. This is for nodes that are available to connect to according to C*.
  • :connected - Xandra actually established a connection to this node, so it's definitely up, regardless of lags and stuff.

I think these states should be enough to encode information about the nodes?

We can keep some of the logic in this new PR, such as checking if a node was recently downed, but we need to integrate that with the load-balancing strategies. In general, I think we need to think about how we can "score" nodes to connect to. For example, a node that recently went down should have a lower reconnection score, so that we reconnect to it later on.
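Just to make the "score" idea concrete, below is one possible shape for it. None of these names exist in Xandra today, and the exact formula is only an assumption: connected nodes rank highest, unverified :up nodes next, and a node that went DOWN ranks lower the more recently it went down.

# Hypothetical reconnection scoring; these names are not Xandra APIs.
defmodule ReconnectionScore do
  # Higher scores are tried first.
  def score(%{state: :connected}), do: 3.0
  def score(%{state: :up}), do: 2.0

  def score(%{state: :down, downed_at: downed_at}) do
    seconds_down = System.system_time(:second) - downed_at
    # Approaches 1.0 the longer the node has been down, so freshly-downed
    # nodes end up being retried last.
    1.0 - 1.0 / (1.0 + seconds_down / 30)
  end

  def sort_candidates(nodes), do: Enum.sort_by(nodes, &score/1, :desc)
end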

Thoughts?

lib/xandra/cluster/control_connection.ex (two outdated review threads, resolved)
Comment on lines 416 to 418
res = {:ok, %ConnectedNode{connected_node | host: local_host}, peers}
Logger.debug("#{inspect(res)}")
res
whatyouhide (Owner):

I don't think this is something we should log. If we want to, we should log it with a friendly message, but I think we can skip the logging for now and go back to the previous code. Thoughts?

peixian (Author):

I added this section because between the previous DEBUG log line and the next one, there were quite a few places where state was being changed that couldn't be debugged without it. I'll add an additional message to it.

@@ -137,6 +137,7 @@ defmodule Xandra.Connection.Utils do
# TODO: handle :error frames for things like :protocol_violation.
case frame do
%Frame{kind: :ready, body: <<>>} ->
Logger.debug("FRAME READY #{inspect(frame)}, #{inspect(rest)}, #{inspect(socket)}}")
whatyouhide (Owner):

This is already logged one line below, so I don't think we should log this line as well.


peixian commented Jul 5, 2023

This PR originally held state in a downed_nodes map: 3d88dcf#diff-7238ddf7dd6f8f3e8da221c8a8319d85d5f0f6a656bb4866d2569bff02e9b0dcR191, but I removed this because it required tracking additional state on top of the possibly stale state within the system.local (https://github.com/lexhide/xandra/blob/83394f01b8531eff14a04bec075c9a280457e40c/lib/xandra/cluster/control_connection.ex#L523-L524).

The additional complexity of keeping track of cluster state seemed not worth it, so I just opted for the simple solution.

I agree the extra connection is not great; I didn't particularly want to add it, but connecting to the node seemed like the only way to know whether a node is alive and functioning after a DOWN event. Per https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec, it seems the only way to get a READY response is to connect to the node.
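For what it's worth, a slightly lighter variant of that check is sketched below: it still opens a TCP connection, but only exchanges an OPTIONS/SUPPORTED pair instead of completing the full STARTUP/READY handshake. The frame layout follows native_protocol_v4.spec; the module and function names are made up, and this is not Xandra code.

# Sketch of a protocol-level liveness probe; names are hypothetical, not Xandra code.
defmodule OptionsProbe do
  @options_opcode 0x05
  @supported_opcode 0x06

  def alive?(address, port, timeout \\ 1_000) do
    case :gen_tcp.connect(address, port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        result = probe(socket, timeout)
        :gen_tcp.close(socket)
        result

      {:error, _reason} ->
        false
    end
  end

  defp probe(socket, timeout) do
    # Request header per native_protocol_v4.spec: version 0x04, flags 0x00,
    # stream id 0 (2 bytes), opcode OPTIONS (0x05), body length 0.
    with :ok <- :gen_tcp.send(socket, <<0x04, 0x00, 0::16, @options_opcode, 0::32>>),
         {:ok, <<_version, _flags, _stream::16, opcode, _length::32, _rest::binary>>} <-
           :gen_tcp.recv(socket, 0, timeout) do
      # A live node answers OPTIONS with a SUPPORTED frame (opcode 0x06).
      opcode == @supported_opcode
    else
      _error -> false
    end
  end
end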


If the idea is that an :up node may actually be down, then the write logic needs to change, because writes should not go to an :up node without connecting to it first. Currently, writes blindly assume that :up nodes are usable.
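As a rough illustration of what I mean (purely hypothetical names, not Xandra's actual load-balancing API), write routing would first filter down to connected hosts:

# Hypothetical host selection for writes; not Xandra's actual API.
defmodule WriteRouting do
  def pick_write_host(hosts) do
    case Enum.filter(hosts, &(&1.state == :connected)) do
      [] -> {:error, :no_connected_hosts}
      connected -> {:ok, Enum.random(connected)}
    end
  end
end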

I'm open to whatever fixes this bug, @whatyouhide. The problem is that currently the library is not resilient to node restarts (for maintenance, chaos monkey, etc.). Restarting a node can trigger data loss and write failures in certain situations: if you are connected to a node and it goes down, writes fail during the interim period.


peixian commented Jul 17, 2023

@whatyouhide hey, checking in: did you get a chance to take a look at this?

@whatyouhide changed the title from "Fix Logic for Reconnecting to New Nodes" to "Fix logic for reconnecting to new nodes\" on Aug 2, 2023
@whatyouhide changed the title from "Fix logic for reconnecting to new nodes\" to "Fix logic for reconnecting to new nodes" on Aug 2, 2023
whatyouhide (Owner) commented

Hi @peixian, thanks for this, and sorry for the delay! I'm working on a few Xandra changes, so I'll close this out and incorporate your fixes in those changes. Thank you 💟

@whatyouhide closed this on Aug 3, 2023

peixian commented Aug 3, 2023

@whatyouhide sounds good. Any idea when this change will make it in? We'd prefer sooner, since this affects us in production.

whatyouhide (Owner) commented

@peixian no, no idea. Could be a few weeks. If you folks want to speed things up, there's a sponsor button on the homepage of this repo, and we can figure something out! 🙃


peixian commented Aug 4, 2023 via email
