Connection doesn't propagate information about being closed to Cluster #345

Lorak-mmk · 2024-07-15T17:04:59Z

Discovered when investigating https://github.com/scylladb/scylla-dtest/issues/4364

When a node goes down, it closes its client connections (probably not always; I guess if it dies unexpectedly it has no way to), and the connections in the driver notice this. The logs look like this:

18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180404560) 127.0.10.1:9042> closed by server
18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180404560) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185696976) 127.0.10.1:9042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185696976) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185158224) 127.0.10.1:19042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185158224) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180402832) 127.0.10.1:19042> closed by server
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180402832) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042

The problem is that the information about those connections closing is not propagated anywhere: the driver still thinks it has a fully functioning connection pool. If the dead node was the one the driver had its control connection open to, the driver also still thinks it has a functioning control connection and keeps waiting for events.
The driver will notice that those connections are dead only when it tries to use them: to send a heartbeat, run a CQL query, refresh the schema, etc.

This is a problem in the following scenario (this is done in https://github.com/scylladb/scylla-dtest/issues/4364):

The cluster consists of 2 nodes (but I think the issue applies to any number of nodes).
The driver has its control connection to node 1.
Node 1 is restarted. The driver doesn't notice it.
Node 2 is stopped.
Now the driver has no working pools and no control connection (but doesn't know it).
When a query is executed, it will fail: for node 2 because it is down, and for node 1 because the driver will only then notice that the connection is closed.

What the driver should do is propagate the information from a single connection upwards and reopen connections / mark the host as down.
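A minimal sketch of the idea, using toy classes rather than the driver's real ones. `ToyConnection`, `ToyCluster`, the `on_close` hook, and `server_closes` are all hypothetical names made up for illustration; the real driver's internals are not shown here and may be structured quite differently. The point is only the contrast: without a hook the owner keeps believing the host is up, with a hook it learns about the close immediately.

```python
from typing import Callable, Dict, List, Optional


class ToyConnection:
    """Stand-in for a driver connection (NOT the driver's real Connection class)."""

    def __init__(self, host: str,
                 on_close: Optional[Callable[["ToyConnection"], None]] = None):
        self.host = host
        self.is_closed = False
        self._on_close = on_close  # hypothetical hook used to notify the owner

    def close(self) -> None:
        if self.is_closed:
            return
        self.is_closed = True
        # Propagate the close upwards, if anyone registered interest.
        if self._on_close is not None:
            self._on_close(self)


class ToyCluster:
    """Stand-in for the Cluster object; tracks which hosts it believes are up."""

    def __init__(self, hosts: List[str], propagate: bool):
        self.host_up: Dict[str, bool] = {h: True for h in hosts}
        hook = self._connection_closed if propagate else None
        self.connections = [ToyConnection(h, on_close=hook) for h in hosts]

    def _connection_closed(self, conn: ToyConnection) -> None:
        # React right away; a real driver would presumably also
        # schedule reconnection here.
        self.host_up[conn.host] = False


def server_closes(cluster: ToyCluster, host: str) -> None:
    """Simulate the server closing every connection to `host`."""
    for conn in cluster.connections:
        if conn.host == host:
            conn.close()


silent = ToyCluster(["node1", "node2"], propagate=False)
server_closes(silent, "node1")
aware = ToyCluster(["node1", "node2"], propagate=True)
server_closes(aware, "node1")

print(silent.host_up)  # node1 is still believed to be up
print(aware.host_up)   # node1 has been marked down
```

In the `silent` case the connection object itself knows it is closed (`is_closed` is set), but nothing above it does, which mirrors the behaviour described in this issue.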


mykaul · 2024-07-16T08:35:24Z

We should really use TCP keep-alive everywhere, just like GoCQL now does by default.

Lorak-mmk · 2024-07-16T08:37:12Z

TCP keep-alive is not the solution here. The connection itself (and by connection I mean an instance of the Connection class) was closed gracefully, and the connection knows that it was closed.
The issue is that the connection doesn't propagate this information to the Cluster object.
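To illustrate why keep-alive is beside the point for a graceful close, here is a small sketch with plain standard-library sockets, independent of the driver. The helper name `peer_closed_gracefully` is made up for this example. When the peer closes its end, the next `recv` on the other end returns an empty bytes object (end of stream), so the local side learns about the close without any keep-alive probes.

```python
import socket


def peer_closed_gracefully() -> bool:
    """Return True if a graceful close by the peer is visible as EOF on recv."""
    client, server = socket.socketpair()
    try:
        server.close()            # the "server" side closes gracefully
        data = client.recv(1024)  # returns b"" because the stream has ended
        return data == b""
    finally:
        client.close()


detected = peer_closed_gracefully()
```

Keep-alive helps with the other case mentioned at the top of the issue, where the node dies without getting a chance to close its connections.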

Lorak-mmk added the bug label Jul 15, 2024

roydahan added the triage label Jul 30, 2024

roydahan assigned Lorak-mmk Jul 30, 2024
