Fix logic for reconnecting to new nodes #309
Conversation
@whatyouhide I'm putting this up in case you want to give any early feedback, otherwise I'll just continue fleshing out the additional ccm tests.
(Force-pushed from 6cb9b64 to d66b766.)
@peixian I don't have time just right now to go through this, but I will likely have some time next week. Let me know when this is ready to review!
@whatyouhide no worries, this should be good to review now.
Thanks for the awesome work @peixian! 🙏
I think this has a fundamental issue: I don't think Xandra should attempt to connect to a node just to check if it's up. The current code doesn't even close the connection once established (in `attempt_to_connect_to_node/2`), which means we'll have zombie connections lying around.
We already introduced a `:connected` state in #306. The three states of a node from Xandra's perspective are now:

- `:down` - the node is reported as down by C*.
- `:up` - the node is reported as up by C*, but Xandra doesn't really know. This is for nodes that are available to connect to according to C*.
- `:connected` - Xandra actually established a connection to this node, so it's definitely up, regardless of lags and stuff.

I think these states should be enough to encode information about the nodes?
We can keep some of the logic in this new PR, such as checking if a node was recently downed, but we need to integrate that with the load-balancing strategies. In general, I think we need to think about how we can "score" nodes to connect to. For example, a node that recently went down should have a lower reconnection score, so that we reconnect to it later on.
Thoughts?
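(To make the "score" idea above a bit more concrete, here is a minimal sketch; every name in it is hypothetical and not part of Xandra's API. The idea is simply that `:connected` beats `:up`, `:up` beats `:down`, and among `:down` nodes the ones that went down most recently sort last.)

```elixir
defmodule ReconnectionScoreSketch do
  # Hypothetical node representation: the three states discussed above, plus a
  # timestamp (in seconds) recording when the node was last seen going down.
  # score/1 returns a number; higher means "try this node sooner".
  def score(%{state: :connected}), do: 2
  def score(%{state: :up}), do: 1

  def score(%{state: :down, downed_at: downed_at}) do
    seconds_down = System.system_time(:second) - downed_at
    # Negative, approaching 0 as the outage gets older: a node that just went
    # down ends up with the lowest score and is retried last.
    -1 / max(seconds_down, 1)
  end

  # A load-balancing strategy could then sort candidates by score, descending.
  def sort_candidates(nodes), do: Enum.sort_by(nodes, &score/1, :desc)
end
```

With something like this, a node that went down a second ago sorts to the very end of the candidate list, while one that has been down for an hour drifts back toward the `:up` and `:connected` nodes.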
res = {:ok, %ConnectedNode{connected_node | host: local_host}, peers}
Logger.debug("#{inspect(res)}")
res
I don't think this is something we should log. If we want to, we should log it with a friendly message, but I think we can skip the logging for now and go back to the previous code. Thoughts?
I added this section because between the last DEBUG log line and the next DEBUG log line there were quite a few places where state was being changed that couldn't be debugged without this. I'll add an additional message to it.
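(For what it's worth, a "friendlier" version of that debug line might look something like the snippet below. This is only a sketch: it reuses the `local_host` and `peers` variables from the diff above, and the wording is made up.)

```elixir
# Hypothetical friendlier message instead of dumping the raw tuple.
Logger.debug(
  "Refreshed topology from control connection: local host #{inspect(local_host)}, " <>
    "peers #{inspect(peers)}"
)
```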
lib/xandra/connection/utils.ex (outdated)
@@ -137,6 +137,7 @@ defmodule Xandra.Connection.Utils do
    # TODO: handle :error frames for things like :protocol_violation.
    case frame do
      %Frame{kind: :ready, body: <<>>} ->
        Logger.debug("FRAME READY #{inspect(frame)}, #{inspect(rest)}, #{inspect(socket)}}")
This is already logged one line below, so I don't think we should log this line as well.
Co-authored-by: Andrea Leopardi <[email protected]>
This PR originally held state in a […]; the additional complexity of keeping track of cluster state seemed not worth it, so I just opted for the simple solution. I agree the connection is not great, and I didn't particularly want to add it, but it seemed like connecting to the node was the only way to know whether a node is alive and functioning after a DOWN event. Per https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec, it seems the only way to get a […].

If the idea is that […], I'm open to whatever fixes this bug @whatyouhide; the problem is that *currently* the library is not resistant to node restarts (for maintenance, chaos monkey, etc.). Restarting a node currently triggers data loss and write failures in certain situations: if you are connected to a node and it goes down, writes will fail during the interim period.
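(A liveness probe doesn't necessarily have to keep the connection around, though. As a sketch only - the module, function name, default port, and timeout below are all made up, and it checks TCP reachability rather than full C* health - something like this would avoid the zombie-connection problem by closing the socket immediately:)

```elixir
defmodule ReachabilitySketch do
  # Try to open a TCP connection to the node's native-protocol port and close
  # it right away. Returns true only if the connect succeeded within `timeout`.
  def reachable?(address, port \\ 9042, timeout \\ 1_000) do
    case :gen_tcp.connect(address, port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        :ok = :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end
end

# Example usage: ReachabilitySketch.reachable?({127, 0, 0, 1}) or, for a
# hostname, ReachabilitySketch.reachable?(~c"cassandra-node-1").
```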
@whatyouhide hey, checking in: did you get a chance to take a look at this guy?
Hi @peixian, thanks for this, and sorry for the delay! I'm working on a few Xandra changes, so I'll close this out and incorporate your fixes in those changes. Thank you 💟
@whatyouhide sounds good, any idea when this change will make it in? We'd prefer sooner, since this affects us in production.
@peixian no, no idea. Could be a few weeks. If you folks want to speed things up, there's a sponsor button on the homepage of this repo, and we can figure something out! 🙃
I'll ask about sponsoring. Currently it's a lot easier for me to spend work time on this, so if you want help maintaining, I can devote some time to helping out and upstreaming/testing changes.
This is a WIP PR to fix a bug we've spotted in https://github.com/blitzstudios/triton, where nodes that are marked as down are immediately readmitted as up.
What
For example, a newly connected cluster marks `.01` down, but the driver immediately readmits it as up after marking it down. This is reproducible with `ccm` by just connecting to the cluster.

Currently, the logic in https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L486 automatically assumes any peers discovered are up, even though their previous state might be down. In addition, https://github.com/lexhide/xandra/blob/main/lib/xandra/cluster/control_connection.ex#L544C9-L544C51 assumes that all nodes within the `peers` table are alive and up, which is not the case, as the table can lag behind newly connected nodes.

What basically happens is that `refresh_topology` ends up being called with data from the `peers` table, which is always assumed to be up to date, even when it can lag behind.

How
This PR adds an extra check when peers are added after a reconnect, and attempts to make sure the node is reachable before marking it as up again. It also adds a `last_seen_at` value to help with debugging.

There's a new testing environment introduced with `ccm`; the Docker image has everything that's needed and can be run with `docker build -f ./test_ccm/Dockerfile .` followed by `docker run`. For example, `docker run -it $(docker build -f ./test_ccm/Dockerfile .) mix test ./test_ccm/test_ccm.exs` should build and run the entire container.