Connection doesn't propagate information about being closed to Cluster #345

Open
Lorak-mmk opened this issue Jul 15, 2024 · 2 comments
Labels: bug, triage

@Lorak-mmk

Discovered when investigating https://github.com/scylladb/scylla-dtest/issues/4364

When a node goes down, it closes its client connections (probably not always: if it dies unexpectedly, it has no chance to), and the driver's connections notice this. The logs look like this:

18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180404560) 127.0.10.1:9042> closed by server
18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180404560) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185696976) 127.0.10.1:9042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185696976) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185158224) 127.0.10.1:19042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185158224) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180402832) 127.0.10.1:19042> closed by server
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180402832) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042

The problem is that the information about these connections closing is not propagated anywhere: the driver still thinks it has a fully functioning connection pool. If the dead node was the one the driver had its control connection open to, the driver also still thinks it has a functioning control connection and keeps waiting for events.
The driver notices that those connections are dead only when it tries to use them: to send a heartbeat, run a CQL query, refresh the schema, etc.

This is a problem in the following scenario (this is done in https://github.com/scylladb/scylla-dtest/issues/4364):

  • the cluster consists of 2 nodes (though I think the issue applies to any number of nodes)
  • the driver has its control connection open to node 1
  • node 1 is restarted - the driver doesn't notice
  • node 2 is stopped
  • now the driver has no working pools and no control connection (but doesn't know it)
  • when a query is executed, it fails: for node 2 because it is down, and for node 1 because only then does the driver notice that the connection is closed
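
The scenario above can be reproduced with a toy model. This is only an illustrative sketch: `ToyConnection`, `ToyCluster`, `on_closed_by_server` and `believed_up` are hypothetical names made up for this example, not the driver's real classes or attributes. The point it shows is that the connection records the close locally, while the cluster-level view changes only when a request is attempted.

```python
# Toy model only: every class and attribute here is hypothetical
# and does not correspond to the driver's real API.


class ToyConnection:
    def __init__(self, host):
        self.host = host
        self.is_closed = False

    def on_closed_by_server(self):
        # The connection itself learns that the server closed it...
        self.is_closed = True
        # ...but nothing is reported to the pool or the cluster.

    def send(self, query):
        if self.is_closed:
            raise ConnectionError(f"connection to {self.host} is closed")
        return f"ok from {self.host}"


class ToyCluster:
    def __init__(self, hosts):
        self.connections = {h: ToyConnection(h) for h in hosts}
        # Hosts the cluster believes are usable; updated only when a request fails.
        self.believed_up = set(hosts)

    def execute(self, query):
        errors = {}
        # sorted() makes a copy, so discarding from the set below is safe.
        for host in sorted(self.believed_up):
            try:
                return self.connections[host].send(query)
            except ConnectionError as exc:
                errors[host] = str(exc)
                self.believed_up.discard(host)
        raise RuntimeError(f"no host available: {errors}")


cluster = ToyCluster(["node1", "node2"])
cluster.connections["node1"].on_closed_by_server()  # node 1 restarted
cluster.connections["node2"].on_closed_by_server()  # node 2 stopped
# At this point cluster.believed_up still contains both hosts;
# the next cluster.execute(...) call raises RuntimeError.
```

In this model node 1 is back up after its restart, yet the query still fails, because nothing triggered a reconnect when the old connection was closed.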

What the driver should do is propagate the information from a single connection upwards, and then reopen the connections or mark the host as down.
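
One possible shape for such propagation is a close callback that the owner passes to each connection. This is a minimal sketch under stated assumptions, not a proposal for the actual patch: all names (`ToyConnection`, `ToyCluster`, `can_reach`, `down_hosts`) are hypothetical, and `can_reach` merely stands in for a real reconnection attempt.

```python
# Hypothetical sketch: none of these names come from the driver.
from typing import Callable, Optional


class ToyConnection:
    def __init__(self, host: str,
                 on_close: Optional[Callable[["ToyConnection"], None]] = None):
        self.host = host
        self.is_closed = False
        self._on_close = on_close

    def close(self) -> None:
        if self.is_closed:
            return  # closing twice must not notify twice
        self.is_closed = True
        if self._on_close is not None:
            # Propagate the close upwards instead of keeping it local.
            self._on_close(self)


class ToyCluster:
    def __init__(self, hosts, can_reach: Callable[[str], bool]):
        # can_reach is an assumption standing in for a reconnection attempt.
        self._can_reach = can_reach
        self.down_hosts = set()
        self.connections = {
            h: ToyConnection(h, self._connection_closed) for h in hosts
        }

    def _connection_closed(self, conn: ToyConnection) -> None:
        if self._can_reach(conn.host):
            # Host is reachable (e.g. it was restarted): reopen right away.
            self.connections[conn.host] = ToyConnection(
                conn.host, self._connection_closed
            )
        else:
            # Host is unreachable: drop the connection and mark the host down.
            self.connections.pop(conn.host, None)
            self.down_hosts.add(conn.host)
```

With this shape the cluster reacts at the moment the connection is closed, rather than at the next heartbeat or query.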

Lorak-mmk added the bug label Jul 15, 2024
@mykaul

mykaul commented Jul 16, 2024

We should really use TCP keep-alive everywhere, just like GoCQL, which now uses it by default.
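
For reference, this is roughly what enabling TCP keep-alive looks like on a plain Python socket. It is a generic standard-library sketch, not a description of how the driver configures its sockets; the helper name `enable_keepalive` and the timing values are arbitrary assumptions, and the tuning options are applied only where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keep-alive for sock (hypothetical helper, values are examples)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs are platform-specific, so set only those that exist.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```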

@Lorak-mmk
Author

TCP keep-alive is not the solution here. The connection itself (and by connection I mean an instance of the Connection class) was closed gracefully, and the connection knows that it was closed.
The issue is that the connection doesn't propagate this information to the Cluster object.
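
To illustrate why keep-alive is beside the point for a graceful close: when the peer closes a socket cleanly, the local side sees end-of-stream on its next read, with no keep-alive probes involved. The sketch below uses only the standard library; `detect_graceful_close` is a hypothetical helper, and `socket.socketpair()` stands in for a real client/server connection.

```python
import socket


def detect_graceful_close():
    """Return True if a clean close by the peer is visible as end-of-stream."""
    local, remote = socket.socketpair()
    try:
        remote.close()            # the "server" closes its side gracefully
        data = local.recv(1024)   # returns at once with b"" (end of stream)
        return data == b""
    finally:
        local.close()
```

Detection is therefore already there at the socket level; what is missing is handing that fact on to whoever owns the connection.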
