Connection pool renewal after concurrent node bootstraps causes double statement execution #317

kbr-scylla · 2024-04-23T09:55:01Z

We boot 2 Scylla nodes concurrently into an existing cluster.
The Python driver receives two on_add notifications, one for each node.
Each notification calls add_or_renew_pool, which creates a connection pool to the corresponding node.

But then, for some reason, one of the on_adds may cause another add_or_renew_pool to be called for the other server. This happens from _finalize_add -> update_created_pools.
This may cause a second pool to be created for the other server and the initially established pool to that server to be closed.

There could be a statement running on the initially established pool. The statement may have already been executed on the Scylla side, but the driver hasn't received a response yet.
The pool is closed before the response arrives. This causes the driver to retry the statement on the new pool, leading to double execution.

In our tests, we observe this as a "CREATE KEYSPACE" statement failing with an "already exists" error message (scylladb/scylladb#17654).
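To make the sequence above concrete, here is a minimal toy model of it. It is only a sketch: FakeServer, FakePool and run_with_pool_renewal are hypothetical names invented for illustration and do not come from the driver. The model assumes the statement is applied on the server, the old pool is shut down before the response is read, and the client then retries on the new pool.

```python
class FakeServer:
    """Hypothetical stand-in for a Scylla node; remembers existing keyspaces."""

    def __init__(self):
        self.keyspaces = set()
        self.executions = 0

    def create_keyspace(self, name):
        self.executions += 1
        if name in self.keyspaces:
            raise RuntimeError(f"Keyspace '{name}' already exists")
        self.keyspaces.add(name)


class FakePool:
    """Hypothetical stand-in for a connection pool to one server."""

    def __init__(self, server):
        self.server = server
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


def run_with_pool_renewal(server, name):
    """Toy model: execute, lose the response to a pool renewal, then retry."""
    old_pool = FakePool(server)
    old_pool.server.create_keyspace(name)  # statement is applied server-side
    new_pool = FakePool(server)            # a racing renewal creates a second pool
    old_pool.shutdown()                    # the response on the old pool is lost
    if old_pool.is_shutdown:
        new_pool.server.create_keyspace(name)  # retry: second execution


server = FakeServer()
try:
    run_with_pool_renewal(server, "ks")
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome, server.executions)
```

The retry is what turns a lost response into a visible failure: the first execution succeeded, so the second one hits the "already exists" check.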

Reproducer:

Python driver branch with sleep + logging added: https://github.com/kbr-scylla/python-driver/tree/debug-double-execution

I added a tactical sleep there:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index 8ed0647b..e79daf7e 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3320,6 +3320,8 @@ class Session(object):
                         self._lock.acquire()
                         return False
                     self._lock.acquire()
+                if previous:
+                    time.sleep(2)
                 self._pools[host] = new_pool
 
             log.debug("Added pool for host %s to session", host)

ScyllaDB branch with sleep + logging added before "create keyspace" statement returns: https://github.com/kbr-scylla/scylladb/tree/debug-double-execution

This is just a coroutinization of create_keyspace_statement::execute, with a sleep and logging added:

diff --git a/cql3/statements/create_keyspace_statement.cc b/cql3/statements/create_keyspace_statement.cc
index e66779ac0d..f8b3b1f766 100644
--- a/cql3/statements/create_keyspace_statement.cc
+++ b/cql3/statements/create_keyspace_statement.cc
@@ -267,13 +267,15 @@ std::vector<sstring> check_against_restricted_replication_strategies(
 future<::shared_ptr<messages::result_message>>
 create_keyspace_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
     std::vector<sstring> warnings = check_against_restricted_replication_strategies(qp, keyspace(), *_attrs, qp.get_cql_stats());
-        return schema_altering_statement::execute(qp, state, options, std::move(guard)).then([warnings = std::move(warnings)] (::shared_ptr<messages::result_message> msg) {
-        for (const auto& warning : warnings) {
-            msg->add_warning(warning);
-            mylogger.warn("{}", warning);
-        }
-        return msg;
-    });
+    auto msg = co_await schema_altering_statement::execute(qp, state, options, std::move(guard));
+    for (const auto& warning : warnings) {
+        msg->add_warning(warning);
+        mylogger.warn("{}", warning);
+    }
+    mylogger.info("sleep before returning create keyspace message");
+    co_await seastar::sleep(std::chrono::seconds{2});
+    mylogger.info("return create keyspace message");
+    co_return std::move(msg);
 }

Test (included in the above branch):

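import asyncio
import logging

import pytest

from test.pylib.manager_client import ManagerClient


@pytest.mark.asyncio
async def test_double_execution(request, manager: ManagerClient):
    # Start one node, then bootstrap two more concurrently.
    await manager.server_add()
    await manager.servers_add(2)

    logging.info('SLEEP 1')
    await asyncio.sleep(1)

    cql = manager.get_cql()
    hosts = cql.cluster.metadata.all_hosts()
    logging.info(f"hosts: {hosts}")

    # This statement may be retried on a renewed pool and fail with AlreadyExists.
    logging.info('create ks')
    await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")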

I run it like this:

PYTHONPATH=$PYTHONPATH:/home/kbr/dev/python-driver ./test.py --mode dev test_double_execution --repeat 4

Here /home/kbr/dev/python-driver is where I have the above Python driver branch checked out.

Logs from example run:
scylla-10.log
scylla-9.log
scylla-3.log
topology_custom.test_double_execution.3.log

Here are the relevant excerpts cut out from the test log (those are messages I added):

11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.9:9042
11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.10:9042
11:29:55.414 INFO> SLEEP 1
11:29:55.417 INFO> finalize_add update_created_pools 127.58.145.10:9042
11:29:55.417 INFO> update_created_pools add_or_renew_pool 127.58.145.9:9042
11:29:55.418 INFO> finalize_add update_created_pools 127.58.145.9:9042
11:29:56.415 INFO> create ks
11:29:57.421 DEBUG> set new pool 127.58.145.9:9042 previous True
11:29:57.421 DEBUG> Shutting down connections to 127.58.145.9:9042

What happened is that finalize_add for host 127.58.145.10:9042 (scylla-10) triggered an update_created_pools call, which called add_or_renew_pool for host 127.58.145.9:9042 (scylla-9), even though the pool for scylla-9 was already established. We start running "create keyspace" on scylla-9. In the meantime, add_or_renew_pool establishes a new pool and drops the old one, causing "create keyspace" to be retried on the new pool, which leads to double execution:

>       await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
E       cassandra.AlreadyExists: Keyspace 'ks' already exists
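The same pattern can be spotted mechanically in a test log. The snippet below is a small sketch that counts add_or_renew_pool calls per host. It assumes the log contains the debug messages added on the reproducer branch, in the format of the excerpt above; the helper name pool_creations is made up for this example.

```python
import re
from collections import Counter

# Lines copied from the log excerpt above (messages added on the debug branch).
LOG = """\
11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.9:9042
11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.10:9042
11:29:55.417 INFO> finalize_add update_created_pools 127.58.145.10:9042
11:29:55.417 INFO> update_created_pools add_or_renew_pool 127.58.145.9:9042
11:29:55.418 INFO> finalize_add update_created_pools 127.58.145.9:9042
"""

# Matches only lines that report an add_or_renew_pool call and captures the host.
PATTERN = re.compile(r"add_or_renew_pool (\S+)$")


def pool_creations(log_text):
    """Count add_or_renew_pool calls per host in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = pool_creations(LOG)
renewed = sorted(host for host, n in counts.items() if n > 1)
print(renewed)
```

A host that appears more than once had its pool renewed while the first pool may still have been in use, which is the condition that precedes the retry.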

kbr-scylla · 2024-04-23T09:55:20Z

This can be easily worked around in tests, so it's low priority.

Due to Python driver's unexpected behavior, "CREATE KEYSPACE" statement may sometimes get executed twice (scylladb/python-driver#317), leading to "Keyspace ... already exists" error in our tests (scylladb#17654). Work around this by using "IF NOT EXISTS". Fixes: scylladb#17654
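As an illustration of this workaround, the sketch below builds the statement with "IF NOT EXISTS" so that a retried execution is harmless. The helper create_keyspace_cql is a hypothetical name for this example and is not taken from the test library; the replication settings mirror the reproducer above.

```python
def create_keyspace_cql(name, replication_factor, if_not_exists=True):
    """Build a CREATE KEYSPACE statement that tolerates being executed twice."""
    clause = "IF NOT EXISTS " if if_not_exists else ""
    return (
        f"CREATE KEYSPACE {clause}{name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', "
        f"'replication_factor': {replication_factor}}}"
    )


statement = create_keyspace_cql("ks", 3)
print(statement)
```

With the clause in place, a second execution caused by a driver retry becomes a no-op on the server instead of an "already exists" error.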

sylwiaszunejko · 2024-05-13T09:17:58Z

I managed to reproduce the issue, but interestingly, it doesn't reproduce when using the consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

kbr-scylla · 2024-05-15T15:12:27Z

> I managed to reproduce the issue, but interestingly, it doesn't reproduce when using the consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

Huh? The reproducer I posted above:

> ScyllaDB branch with sleep + logging added before "create keyspace" statement returns: https://github.com/kbr-scylla/scylladb/tree/debug-double-execution

runs in consistent-topology-changes mode.

In fact it explicitly depends on consistent-topology-changes, because it bootstraps 2 nodes concurrently (it uses manager.servers_add(2)). The idea behind the reproducer is that notifications about the two concurrently booting nodes arrive at the driver at roughly the same time and race with each other. I did not manage to reproduce the issue by booting nodes sequentially - then the notifications are arriving sequentially and the problem does not seem to happen.

How did you reproduce the issue outside consistent-topology-changes -- did you use some different test case? Are you sure you reproduced this exact issue and not something different (maybe similar)?

sylwiaszunejko · 2024-05-15T15:49:58Z

Maybe I misunderstood something. I used exactly what is in your fork, but after I added cfg = {'experimental_features': ['consistent-topology-changes'],} to the config option of servers_add, it stopped reproducing. Probably some stupid mistake on my end; if so, sorry for that.

kbr-scylla · 2024-05-15T17:09:54Z

When was that? cfg = {'experimental_features': ['consistent-topology-changes']} doesn't do anything after scylladb/scylladb@d8313dd, and before scylladb/scylladb@d8313dd it was the default in topology tests.

So I suspect that your change didn't actually change anything, but you were unlucky with your runs and failed to reproduce it, used too small a sample, and then concluded that the change was the reason.

kbr-scylla · 2024-05-15T17:12:49Z

Hm, actually cfg = {'experimental_features': ['consistent-topology-changes']} could do something: it disables some other experimental features. The usual set of experimental features on that branch was:

         'experimental_features': ['udf',
                                   'alternator-streams',
                                   'consistent-topology-changes',
                                   'broadcast-tables',
                                   'keyspace-storage-options'],

and you reduced it to simply ['consistent-topology-changes'].

But none of these other features seems likely to be related, so I would recheck with a larger sample of runs.

(Preferably -- just rebase my reproducer branch on latest master, where consistent-topology-changes is no longer experimental, and check there)

kostja · 2024-06-04T13:03:43Z

Ping on this one, the connected issue scylladb/scylladb#16219 is impacting our ability to run sct tests.

kbr-scylla · 2024-06-04T13:21:26Z

@kostja, modifying the Python driver will not help with Java driver failures.

sylwiaszunejko · 2024-06-04T13:27:02Z

> This can be easily worked around in tests, so it's low priority.

We thought that this issue was low priority, so it was not planned for this sprint. Should we reconsider that?

We want to update the pool only if the previous one does not exist or is shut down. This commit adds additional validation to add_or_renew_pool to make sure this condition is met. Fixes: scylladb#317

sylwiaszunejko · 2024-10-01T14:45:35Z

@kbr-scylla I have submitted a PR with one possible solution: #380

In some cases we want to update the pool only if the previous one does not exist or is shut down. This commit adds additional validation to add_or_renew_pool to make sure this condition is met when needed. Fixes: scylladb#317
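The condition described in these commit messages can be sketched as follows. This is not the code from the PR: ToyPool, ToySession and the keep_live_pool flag are hypothetical names chosen for illustration, and the real add_or_renew_pool in the driver is more involved. The sketch only shows the guard itself, namely that an existing pool is replaced only when it is missing or already shut down.

```python
import threading


class ToyPool:
    """Hypothetical stand-in for a connection pool."""

    def __init__(self, host):
        self.host = host
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class ToySession:
    """Hypothetical session holding at most one pool per host."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pools = {}

    def add_or_renew_pool(self, host, keep_live_pool=True):
        with self._lock:
            previous = self._pools.get(host)
            # Guard: a live pool may still have requests in flight, so keep it.
            if keep_live_pool and previous is not None and not previous.is_shutdown:
                return previous
            new_pool = ToyPool(host)
            self._pools[host] = new_pool
        # Only reached when the previous pool was missing, dead, or the guard is off.
        if previous is not None:
            previous.shutdown()
        return new_pool


session = ToySession()
first = session.add_or_renew_pool("127.58.145.9:9042")
second = session.add_or_renew_pool("127.58.145.9:9042")  # racing second call
print(second is first, first.is_shutdown)

first.shutdown()  # simulate the pool being closed for another reason
third = session.add_or_renew_pool("127.58.145.9:9042")
print(third is first)
```

Passing keep_live_pool=False restores the unconditional replacement, which corresponds to the "when needed" wording: some callers may still want a forced renewal.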

The testcase is flaky due to a known python driver issue: scylladb/python-driver#317. This issue causes the `CREATE KEYSPACE` statement to be sometimes executed twice in a row, and the 2nd CREATE statement causes the test to fail. In order to work around it, it's enough to add `if not exists` when creating a ks. Fixes: scylladb#21034 Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

kbr-scylla assigned avelanarius Apr 23, 2024

This was referenced Apr 23, 2024

[dev, aarch64] topology_experimental_raft.test_fencing failed with Keyspace 'test_1709718295500' already exists scylladb/scylladb#17654

Closed

Avoid failing requests when re-establishing connections to the cluster #273

Open

kbr-scylla mentioned this issue Apr 23, 2024

test/pylib: random_tables: use IF NOT EXISTS when creating keyspace scylladb/scylladb#18368

Closed

kostja mentioned this issue Apr 24, 2024

[dtest]: nodetool_additional_test.TestNodetool.test_disablebinary_and_disablegossip failed on timeout (c-s workload timing out) scylladb/scylladb#16219

Open

avelanarius assigned sylwiaszunejko Apr 25, 2024

roydahan unassigned avelanarius Jun 3, 2024

Lorak-mmk added the bug Something isn't working label Jun 18, 2024

sylwiaszunejko mentioned this issue Oct 1, 2024

Fix double execution after concurrent node bootstraps #380

Closed

ptrsmrn mentioned this issue Oct 14, 2024

test: fix flaky test_multidc_alter_tablets_rf scylladb/scylladb#21056

Closed

sylwiaszunejko mentioned this issue Oct 14, 2024

Fix pool management #382

Open

This was referenced Oct 14, 2024

[Backport 6.1] test: fix flaky test_multidc_alter_tablets_rf scylladb/scylladb#21106

Closed

[Backport 6.2] test: fix flaky test_multidc_alter_tablets_rf scylladb/scylladb#21107

Closed

mergify bot mentioned this issue Oct 16, 2024

[Backport 6.0] test: fix flaky test_multidc_alter_tablets_rf scylladb/scylladb#21134

Closed

ptrsmrn mentioned this issue Oct 31, 2024

alternator_tests.test_slow_query_logging is flaky in debug mode ('Table DKHKS6GZ6G already exists' error) scylladb/scylladb#15456

Open
