Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection pool renewal after concurrent node bootstraps causes double statement execution #317

Open
kbr-scylla opened this issue Apr 23, 2024 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@kbr-scylla
Copy link

We boot 2 Scylla nodes concurrently into existing cluster.
Python driver obtains two on_add notifications, one for each node.
Each notification calls add_or_renew_pool, which creates connection pool to each node.

But then, for some reason, one of the on_adds may cause another add_or_renew_pool to be called for the other server. This happens from _finalize_add -> update_created_pools.
This may cause a second pool to be created for the other server and the initially established pool to that server to be closed.

There could be a statement running on the initially established pool. The statement may have already been executed on Scylla side, but the driver didn't get a response yet.
The pool is closed before response arrives. This causes driver to retry the statement on the new pool, leading to double execution.

In our tests, we observe this by "CREATE KEYSPACE" statement failing with "already exists" error message (scylladb/scylladb#17654)


Reproducer:

I added a tactical sleep there:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index 8ed0647b..e79daf7e 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3320,6 +3320,8 @@ class Session(object):
                         self._lock.acquire()
                         return False
                     self._lock.acquire()
+                if previous:
+                    time.sleep(2)
                 self._pools[host] = new_pool
 
             log.debug("Added pool for host %s to session", host)

this is just coroutinization of create_keyspace_statement::execute, then sleep + logging added:

diff --git a/cql3/statements/create_keyspace_statement.cc b/cql3/statements/create_keyspace_statement.cc
index e66779ac0d..f8b3b1f766 100644
--- a/cql3/statements/create_keyspace_statement.cc
+++ b/cql3/statements/create_keyspace_statement.cc
@@ -267,13 +267,15 @@ std::vector<sstring> check_against_restricted_replication_strategies(
 future<::shared_ptr<messages::result_message>>
 create_keyspace_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
     std::vector<sstring> warnings = check_against_restricted_replication_strategies(qp, keyspace(), *_attrs, qp.get_cql_stats());
-        return schema_altering_statement::execute(qp, state, options, std::move(guard)).then([warnings = std::move(warnings)] (::shared_ptr<messages::result_message> msg) {
-        for (const auto& warning : warnings) {
-            msg->add_warning(warning);
-            mylogger.warn("{}", warning);
-        }
-        return msg;
-    });
+    auto msg = co_await schema_altering_statement::execute(qp, state, options, std::move(guard));
+    for (const auto& warning : warnings) {
+        msg->add_warning(warning);
+        mylogger.warn("{}", warning);
+    }
+    mylogger.info("sleep before returning create keyspace message");
+    co_await seastar::sleep(std::chrono::seconds{2});
+    mylogger.info("return create keyspace message");
+    co_return std::move(msg);
 }

Test (included in the above branch):

@pytest.mark.asyncio
async def test_double_execution(request, manager: ManagerClient):
   await manager.server_add()
   await manager.servers_add(2)

   logging.info(f'SLEEP 1')
   await asyncio.sleep(1)

   cql = manager.get_cql()
   hosts = cql.cluster.metadata.all_hosts()
   logging.info(f"hosts: {hosts}")

   logging.info(f'create ks')
   await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")

I run it like this:

PYTHONPATH=$PYTHONPATH:/home/kbr/dev/python-driver ./test.py --mode dev test_double_execution --repeat 4

in /home/kbr/dev/python-driver I have the above Python driver branch checked out.


Logs from example run:
scylla-10.log
scylla-9.log
scylla-3.log
topology_custom.test_double_execution.3.log

Here are the relevant excerpts cut out from the test log (those are messages I added):

11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.9:9042
11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.10:9042
11:29:55.414 INFO> SLEEP 1
11:29:55.417 INFO> finalize_add update_created_pools 127.58.145.10:9042
11:29:55.417 INFO> update_created_pools add_or_renew_pool 127.58.145.9:9042
11:29:55.418 INFO> finalize_add update_created_pools 127.58.145.9:9042
11:29:56.415 INFO> create ks
11:29:57.421 DEBUG> set new pool 127.58.145.9:9042 previous True
11:29:57.421 DEBUG> Shutting down connections to 127.58.145.9:9042

What happened is that finalize_add for host 127.58.145.10:9042 (scylla-10) caused update_created_pools call, which called add_or_renew_pool for host 127.58.145.9:9042 (scylla-9). But pool for scylla-9 was already established. We start running "create keyspace" on scylla-9. In the meantime, add_or_renew_pool establishes a new pool and drops the old one, causing "create keyspace" to be retried on the new pool, leading to double execution:

>       await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
E       cassandra.AlreadyExists: Keyspace 'ks' already exists
@kbr-scylla
Copy link
Author

This can be easily worked around in tests, so it's low priority.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 23, 2024
Due to Python driver's unexpected behavior, "CREATE KEYSPACE" statement
may sometimes get executed twice (scylladb/python-driver#317), leading
to "Keyspace ... already exists" error in our tests
(scylladb#17654). Work around this by using "IF NOT EXISTS".

Fixes: scylladb#17654
denesb pushed a commit to scylladb/scylladb that referenced this issue Apr 24, 2024
Due to Python driver's unexpected behavior, "CREATE KEYSPACE" statement
may sometimes get executed twice (scylladb/python-driver#317), leading
to "Keyspace ... already exists" error in our tests
(#17654). Work around this by using "IF NOT EXISTS".

Fixes: #17654

Closes #18368
Jadw1 pushed a commit to Jadw1/scylla that referenced this issue Apr 26, 2024
Due to Python driver's unexpected behavior, "CREATE KEYSPACE" statement
may sometimes get executed twice (scylladb/python-driver#317), leading
to "Keyspace ... already exists" error in our tests
(scylladb#17654). Work around this by using "IF NOT EXISTS".

Fixes: scylladb#17654

Closes scylladb#18368
@sylwiaszunejko
Copy link
Collaborator

I managed to reproduce the issue, but what's interesting it doesn't reproduce when using consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

@kbr-scylla
Copy link
Author

I managed to reproduce the issue, but what's interesting it doesn't reproduce when using consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

Huh? The reproducer I posted above:

ScyllaDB branch with sleep + logging added before "create keyspace" statement returns: https://github.com/kbr-scylla/scylladb/tree/debug-double-execution

runs in consistent-topology-changes mode.

In fact it explicitly depends on consistent-topology-changes, because it bootstraps 2 nodes concurrently (it uses manager.servers_add(2)). The idea behind the reproducer is that notifications about the two concurrently booting nodes arrive at the driver at roughly the same time and race with each other. I did not manage to reproduce the issue by booting nodes sequentially - then the notifications are arriving sequentially and the problem does not seem to happen.

How did you reproduce the issue outside consistent-topology-changes -- did you use some different test case? Are you sure you reproduced this exact issue and not something different (maybe similar)?

@sylwiaszunejko
Copy link
Collaborator

sylwiaszunejko commented May 15, 2024

Maybe I misunderstood something, I used exactly what is in your fork, but after I added cfg = {'experimental_features': ['consistent-topology-changes'],} in config option of servers_add it stopped reproducing. Probably some stupid mistake on my end, if so sorry for that.

@kbr-scylla
Copy link
Author

kbr-scylla commented May 15, 2024

When was that? cfg = {'experimental_features': ['consistent-topology-changes']} doesn't do anything after scylladb/scylladb@d8313dd, and before scylladb/scylladb@d8313dd it was the default in topology tests.

So I suspect that your change didn't actually change anything, but then you were unlucky with your runs and failed to reproduce it, used a too small sample, and then concluded that the change was the reason.

@kbr-scylla
Copy link
Author

Hm, actually cfg = {'experimental_features': ['consistent-topology-changes']} could do something, it disables some other experimental features. The usual set of experimental features on that branch was:

         'experimental_features': ['udf',
                                   'alternator-streams',
                                   'consistent-topology-changes',
                                   'broadcast-tables',
                                   'keyspace-storage-options'],

and you reduced it to simply ['consistent-topology-changes']

But neither of these other features seems likely to be related, so I would again recheck with higher sample of runs

(Preferably -- just rebase my reproducer branch on latest master, where consistent-topology-changes is no longer experimental, and check there)

@kostja
Copy link

kostja commented Jun 4, 2024

Ping on this one, the connected issue scylladb/scylladb#16219 is impacting our ability to run sct tests.

@kbr-scylla
Copy link
Author

@kostja , modifying python driver will not help with java driver failures.

@sylwiaszunejko
Copy link
Collaborator

This can be easily worked around in tests, so it's low priority.

We though that this issue is a low priority one so it was not planned for this sprint, should we reconsider that?

@Lorak-mmk Lorak-mmk added the bug Something isn't working label Jun 18, 2024
sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 1, 2024
We want to only update the pool if previous do not exist
or is shutdown. This commit adds additional validation to
add_or_renew_pool to make sure this condition is met.

Fixes: scylladb#317
sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 1, 2024
We want to only update the pool if previous do not exist
or is shutdown. This commit adds additional validation to
add_or_renew_pool to make sure this condition is met.

Fixes: scylladb#317
sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 1, 2024
We want to only update the pool if previous do not exist
or is shutdown. This commit adds additional validation to
add_or_renew_pool to make sure this condition is met.

Fixes: scylladb#317
@sylwiaszunejko
Copy link
Collaborator

@kbr-scylla I have submitted the PR with one possible solution #380

sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 8, 2024
We want to only update the pool if previous do not exist
or is shutdown. This commit adds additional validation to
add_or_renew_pool to make sure this condition is met.

Fixes: scylladb#317
sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 8, 2024
In some cases We want to only update the pool if previous
do not exist or is shutdown. This commit adds additional
validation to add_or_renew_pool to make sure this condition
is met when needed.

Fixes: scylladb#317
sylwiaszunejko added a commit to sylwiaszunejko/python-driver that referenced this issue Oct 8, 2024
In some cases We want to only update the pool if previous
do not exist or is shutdown. This commit adds additional
validation to add_or_renew_pool to make sure this condition
is met when needed.

Fixes: scylladb#317
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue Oct 14, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: scylladb#21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.
kbr-scylla pushed a commit to scylladb/scylladb that referenced this issue Oct 14, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

Closes #21056
mergify bot pushed a commit to scylladb/scylladb that referenced this issue Oct 14, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f847591)
mergify bot pushed a commit to scylladb/scylladb that referenced this issue Oct 14, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f847591)
denesb pushed a commit to scylladb/scylladb that referenced this issue Oct 15, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f847591)

Closes #21107
mergify bot pushed a commit to scylladb/scylladb that referenced this issue Oct 16, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f847591)
denesb pushed a commit to scylladb/scylladb that referenced this issue Oct 17, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit 3969ffb)

Closes #21106
denesb pushed a commit to scylladb/scylladb that referenced this issue Oct 17, 2024
The testcase is flaky due to a known python driver issue:
scylladb/python-driver#317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit 3969ffb)

Closes #21134
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants