Fix bug where invalidation messages were getting sent to closing clients #1823

Merged
merged 3 commits into from
Mar 10, 2025

Conversation

madolson
Member

@madolson madolson commented Mar 6, 2025

So I think we were seeing these timeouts because QUIT behaves differently with and without IO threading. In both cases, QUIT is a close-after-reply command. Once the client's reply has been written out, the client is added to the queue that gets cleaned up at the end of the event loop. Normally this is fine, because by the time we circle around to the next event loop iteration the client has definitely been freed.

With IO threads, we need to process the pending IO commands before the client is added to the kill queue. This may not happen immediately, which means we might go on to process the SET command before we free the client that has supposedly already quit. This is very sensitive to timing, so it's not very likely, but still possible. Once the SET has been executed, the invariants in the test are off, since the closing client will get a regular invalidation.
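The ordering problem described above can be modelled with a small, deterministic sketch. This is not Valkey code: `Client`, `kill_queue`, `pending_io`, and `set_sees_live_redirect` are hypothetical names, and the model simply forces the worst-case ordering in which the SET is handled before the main thread notices the completed write.

```python
from dataclasses import dataclass


@dataclass
class Client:
    name: str
    freed: bool = False


def set_sees_live_redirect(io_threads):
    """Return True if SET runs while the client that sent QUIT is still allocated."""
    redirect = Client("redirect")
    kill_queue = []   # clients freed at the end of each loop iteration
    pending_io = []   # completed writes the main thread has not processed yet

    # Iteration 1: QUIT is executed and its reply is written out.
    if io_threads:
        pending_io.append(redirect)   # assumption: an IO thread did the write
    else:
        kill_queue.append(redirect)   # the main thread did the write itself
    for client in kill_queue:         # end-of-iteration cleanup
        client.freed = True
    kill_queue.clear()

    # Iteration 2 (worst case): SET is handled before the pending completion.
    seen_alive = not redirect.freed
    kill_queue.extend(pending_io)
    pending_io.clear()
    for client in kill_queue:
        client.freed = True
    return seen_alive


print(set_sees_live_redirect(io_threads=False))
print(set_sees_live_redirect(io_threads=True))
```

In this toy model the client is already freed when SET runs without IO threads, and still alive when the completion is processed late with IO threads, which is the window the description refers to.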

The fix is to also mark a client as broken if it's being closed.
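As a rough illustration of that idea (not the actual change to `src/tracking.c`, which is not shown in this conversation), the sketch below treats a redirect target as broken when it is missing or already flagged for closing. The flag names `close_asap` and `close_after_reply` follow the review comments in this thread; `Client`, `redirect_is_broken`, and `route_invalidation` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Client:
    id: int
    close_asap: bool = False
    close_after_reply: bool = False


def redirect_is_broken(target: Optional[Client]) -> bool:
    # A missing client was always considered broken; the extra checks
    # also cover a client that still exists but is on its way out.
    return target is None or target.close_asap or target.close_after_reply


def route_invalidation(clients: Dict[int, Client], redirect_id: int) -> str:
    """Return which kind of message the tracking logic would emit."""
    target = clients.get(redirect_id)
    if redirect_is_broken(target):
        return "tracking-redir-broken"
    return "invalidate"


clients = {7: Client(7)}
print(route_invalidation(clients, 7))    # live target
clients[7].close_after_reply = True      # target has sent QUIT
print(route_invalidation(clients, 7))    # closing target
print(route_invalidation(clients, 99))   # unknown target
```

With this check the outcome no longer depends on whether the closing client has been freed yet, so the timing window stops affecting which message is sent.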

The test was also hanging because of a test issue: the conditional lsearch check was returning 1 or 0 strings, which are both valid exit criteria for the wait_for.

./runtest --io-threads --accurate --verbose --tags network --dump-logs --single unit/tracking --loops 500 --clients 25

Fixes #1647 (I believe this now!)

@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 6, 2025
@madolson madolson requested a review from ranshid March 6, 2025 07:50
codecov bot commented Mar 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.02%. Comparing base (0cc0bf7) to head (8565165).
Report is 8 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1823      +/-   ##
============================================
+ Coverage     70.87%   71.02%   +0.14%     
============================================
  Files           123      123              
  Lines         65651    65665      +14     
============================================
+ Hits          46529    46636     +107     
+ Misses        19122    19029      -93     
Files with missing lines Coverage Δ
src/tracking.c 99.04% <100.00%> (ø)

... and 14 files with indirect coverage changes

@rjd15372
Member

rjd15372 commented Mar 6, 2025

Fixes #1647

Member

@enjoy-binbin enjoy-binbin left a comment

Great, it is new to me...

Member

@ranshid ranshid left a comment

I will say, that doesn't really explain to me why it was hanging on the final read.

I am not sure how we can thus say this "Fixes" #1647?

If I understand correctly, you claim that because sync IO is used, we might still find the client in sendTrackingMessage, so we will not send the tracking-redir-broken?

I agree this does not seem like the root cause for this specific error.

I would also ask why we are satisfied with only checking that the redirection client exists in sendTrackingMessage, and do not also check that its flags are not close_asap or close_after_reply? That seems like a bug.

@madolson
Member Author

madolson commented Mar 6, 2025

I am not sure how we can thus say this "Fixes" #1647?

I'm not confident there aren't more edge cases. However, it ran a little over 17 million times overnight on my laptop and never hung (although some other tests failed that I've never seen fail), whereas it consistently hung after ~100 iterations before the change. So, at the very least, this is more stable.

I would also ask why we are satisfied with only checking that the redirection client exists in sendTrackingMessage, and do not also check that its flags are not close_asap or close_after_reply? That seems like a bug.

I'm not exactly sure what you mean; we're not sending the tracking message to the client that is closing. EDIT: I understand now, let me try this.

@madolson
Member Author

madolson commented Mar 6, 2025

Oh, I figured out why it's hanging: the test is not actually checking for the invalidation.

Signed-off-by: Madelyn Olson <[email protected]>
@madolson madolson changed the title from "Make it so tracking tests use kill instead of quit" to "Fix bug where invalidation messages were getting sent to closing clients" Mar 6, 2025
@madolson madolson requested review from enjoy-binbin and ranshid March 6, 2025 17:08
@madolson madolson added the release-notes This issue should get a line item in the release notes label Mar 6, 2025
Member

@ranshid ranshid left a comment

LGTM

@madolson madolson merged commit 8221a15 into valkey-io:unstable Mar 10, 2025
58 checks passed
Labels
release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Development

Successfully merging this pull request may close these issues.

[test-failure] Test timeout for RESP3 client redirection for tracked key
4 participants