Fix decoding multiple frames in a single envelope in native protocol v5 #368

Merged 5 commits on Jul 10, 2024

whatyouhide
Owner

whatyouhide commented Jun 11, 2024

@jvf @harunzengin this PR can be a collaborative attempt at figuring out what the flying horse crab hell is going on with #356.

At the time of opening it, the PR only adds a test to reproduce the timeouts (which does reproduce them ~90% of the time in my experience) and some additional logging.

Most of the time, Xandra does not receive the frame that times out. This is weird. I’m running this with a locally-running Dockerized Cassandra in case it helps.

Btw, I’m opening this because I won't have a ton of time to dedicate to this as I’m pretty busy at work, but I figured we can dig in together, especially after @jvf's fantastic reproducing steps and tests in #356 🙃

@jvf
Contributor

jvf commented Jun 13, 2024

We will have a look.

@jvf
Contributor

jvf commented Jun 27, 2024

I may have found something:

CASSANDRA_NATIVE_PROTOCOL=v3 mix test --only test:"test concurrent requests on a single connection"

and

CASSANDRA_NATIVE_PROTOCOL=v4 mix test --only test:"test concurrent requests on a single connection"

do not produce a failure, only

CASSANDRA_NATIVE_PROTOCOL=v5 mix test --only test:"test concurrent requests on a single connection"

does. Tested with up to max_requests = 100 (in test/xandra_test.exs:349). So this may be a problem with native protocol v5!

@whatyouhide
Owner Author

Wooooah fantastic find!!!!

@jvf
Contributor

jvf commented Jul 5, 2024

I started looking at the v5 implementation, but nothing jumped out at me. Since the workaround (forcing protocol_version: :v4) is sufficient for us, I did not get approval to investigate this further.

@whatyouhide
Owner Author

I've created https://issues.apache.org/jira/browse/CASSANDRA-19753 to see if this might be a C* issue.

@whatyouhide
Owner Author

With the help of Sam in the JIRA issue, we figured out that the issue was that I screwed up decoding multiple frames in a single envelope in native protocol v5 🤦 My bad. I pushed fixes into this PR.
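
For readers following along, the kind of bug described here can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not Xandra's actual Elixir code; the function name `decode_envelopes` and the dictionary layout are hypothetical. The native protocol v5 specification calls the outer, checksummed unit a frame and the messages inside its payload envelopes, so the sketch follows that naming (the PR title appears to use the two terms the other way around). It assumes an uncompressed payload that holds only complete envelopes, each with the standard 9-byte header (version, flags, stream id, opcode, body length). The point is the loop: a decoder that stops after the first envelope drops the rest.

```python
import struct

# Envelope header: version (1 byte), flags (1), stream id (2, signed),
# opcode (1), body length (4, signed). Big-endian, 9 bytes in total.
HEADER = struct.Struct(">BBhBi")


def decode_envelopes(payload: bytes) -> list:
    """Decode every envelope in one frame payload (hypothetical helper)."""
    envelopes = []
    offset = 0
    # Keep decoding until the payload is exhausted, not just once.
    while offset < len(payload):
        if len(payload) - offset < HEADER.size:
            raise ValueError("truncated envelope header")
        version, flags, stream_id, opcode, length = HEADER.unpack_from(payload, offset)
        start = offset + HEADER.size
        end = start + length
        if length < 0 or end > len(payload):
            raise ValueError("truncated envelope body")
        envelopes.append(
            {"stream_id": stream_id, "opcode": opcode, "body": payload[start:end]}
        )
        # Advance past this envelope so the next iteration sees the next one.
        offset = end
    return envelopes
```

If only the first envelope were returned, the responses for the remaining stream ids would never reach their callers, which would be consistent with the timeouts described at the top of this thread.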

whatyouhide changed the title from "Timeouts on a single connection" to "Fix decoding multiple frames in a single envelope in native protocol v5" on Jul 10, 2024
whatyouhide merged commit 379fcce into main on Jul 10, 2024
5 checks passed
whatyouhide deleted the al/timeouts-on-single-conn branch on July 10, 2024, 08:44