Fix bug where invalidation messages were getting sent to closing clients #1823

madolson · 2025-03-06T07:50:26Z

So I think we were seeing these timeouts because QUIT behaves differently with and without IO threading. In both cases, QUIT is a close-after-reply command. Once the client has written out the results, it gets added to the queue that gets cleaned up at the end of the event loop. Normally this is fine, because the client is definitely killed before we circle around to the next event loop.

For IO threads, we need to process the pending IO commands to add the client to the kill queue. This may not happen immediately, which means we might go on to process that SET command before we free the client that has supposedly already quit. This is very sensitive to timing, so it's not very likely, but still possible. Once the SET has been executed, the invariants in the tests are off since it will get a correct invalidation.
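To make the ordering problem concrete, here is a minimal sketch of the race, assuming a simplified model in which the only check is whether the redirect client still exists. All names (`model_client`, `replay_quit_then_set`, the `MODEL_MSG_*` values) are hypothetical and are not taken from the Valkey source.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the race; not the Valkey implementation. */
typedef enum { MODEL_MSG_INVALIDATE, MODEL_MSG_REDIR_BROKEN } model_msg;

typedef struct {
    bool close_after_reply; /* set once QUIT has been processed */
    bool freed;             /* set once the kill queue has been drained */
} model_client;

/* Pre-fix behaviour in this model: only the existence of the redirect
 * client is checked when a tracked key is written. */
static model_msg model_on_tracked_write(const model_client *redirect) {
    return redirect->freed ? MODEL_MSG_REDIR_BROKEN : MODEL_MSG_INVALIDATE;
}

/* Replays "QUIT on the redirect client, then SET on a tracked key".
 * set_runs_before_free models the IO-thread timing where the SET is
 * processed before the closing client reaches the kill queue. */
static model_msg replay_quit_then_set(bool set_runs_before_free) {
    model_client redirect = {false, false};
    redirect.close_after_reply = true; /* QUIT reply written */
    if (!set_runs_before_free) {
        redirect.freed = true; /* cleaned up at the end of the event loop */
    }
    return model_on_tracked_write(&redirect);
}
```

In this model, `replay_quit_then_set(false)` yields the broken-redirect result the test expects, while `replay_quit_then_set(true)` yields a regular invalidation to a client that has already quit.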

The fix is to also mark a client as broken if it's being closed.
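As a rough illustration of that idea, the sketch below treats a redirect target as broken when it is missing or when it is flagged for closing. The flag names, the struct, and `demo_redirect_usable` are assumptions made for this example; the real check in src/tracking.c may look different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits and client struct, for illustration only. */
#define DEMO_CLOSE_AFTER_REPLY (1u << 0)
#define DEMO_CLOSE_ASAP        (1u << 1)

typedef struct {
    uint32_t flags;
} demo_client;

/* Returns true when invalidations may still be redirected to `target`.
 * A missing target (NULL) and a target that is already on its way out
 * are both treated as a broken redirect. */
static bool demo_redirect_usable(const demo_client *target) {
    if (target == NULL) return false;
    if (target->flags & (DEMO_CLOSE_AFTER_REPLY | DEMO_CLOSE_ASAP)) return false;
    return true;
}
```

With a check like this, the outcome no longer depends on whether the closing client has been freed by the time the SET runs.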

The test was also hanging because of a test issue: the conditional lsearch check was returning 1 or 0 strings, which are both valid exit criteria for the wait_for.

./runtest --io-threads --accurate --verbose --tags network --dump-logs --single unit/tracking --loops 500 --clients 25

Fixes #1647 (I believe this now!)

Signed-off-by: Madelyn Olson <[email protected]>

codecov · 2025-03-06T08:06:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.02%. Comparing base (0cc0bf7) to head (8565165).
Report is 8 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1823      +/-   ##
============================================
+ Coverage     70.87%   71.02%   +0.14%     
============================================
  Files           123      123              
  Lines         65651    65665      +14     
============================================
+ Hits          46529    46636     +107     
+ Misses        19122    19029      -93

Files with missing lines	Coverage Δ
src/tracking.c	`99.04% <100.00%> (ø)`

... and 14 files with indirect coverage changes

rjd15372 · 2025-03-06T09:15:21Z

Fixes #1647

enjoy-binbin

Great, it is new to me...

ranshid

I will say, that doesn't really explain to me why it was hanging on the final read.

I am not sure how we can thus say this "Fixes" #1647?

If I understand correctly, you claim that because sync IO is used, we might still be able to find the client in sendTrackingMessage, so we will not send the tracking-redir-broken?

I agree this does not seem like the root cause for this specific error.

I would also ask why we are satisfied with only checking that the redirection client exists in sendTrackingMessage, and do not also check that its flags are not close_asap or close_after_reply? Seems like a bug.

madolson · 2025-03-06T16:16:18Z

I am not sure how we can thus say this "Fixes" #1647?

I'm not confident there aren't more edge cases. However, it ran a little over 17 million times overnight on my laptop and never hung (although some other tests failed that I've never seen fail), whereas it consistently hung after ~100 iterations before the change. So, at the very least, this is more stable.

I would also ask why we are satisfied with only checking that the redirection client exists in sendTrackingMessage, and do not also check that its flags are not close_asap or close_after_reply? Seems like a bug.

I'm not exactly sure what you mean; we're not sending the tracking message to the client that is closing. EDIT: I understand now, let me try this.

Signed-off-by: Madelyn Olson <[email protected]>

madolson · 2025-03-06T16:58:30Z

Oh, I figured out why it's hanging: the test is not actually checking for the invalidation.

Signed-off-by: Madelyn Olson <[email protected]>

ranshid

LGTM

Make it so tracking tests use kill instead of quit

7944e09

Signed-off-by: Madelyn Olson <[email protected]>

madolson added the run-extra-tests label (Run extra tests on this PR; runs all tests from daily except valgrind and RESP) Mar 6, 2025

madolson requested a review from ranshid March 6, 2025 07:50

rjd15372 mentioned this pull request Mar 6, 2025

[test-failure] Test timeout for RESP3 client redirection for tracked key #1647

Closed

enjoy-binbin approved these changes Mar 6, 2025

ranshid reviewed Mar 6, 2025

Change to properly invalidate client

48d6e99

Signed-off-by: Madelyn Olson <[email protected]>

Actually consume invalidation

8565165

Signed-off-by: Madelyn Olson <[email protected]>

madolson changed the title ~~Make it so tracking tests use kill instead of quit~~ Fix bug where invalidation messages were getting sent to closing clients Mar 6, 2025

madolson requested review from enjoy-binbin and ranshid March 6, 2025 17:08

madolson added the release-notes label (This issue should get a line item in the release notes) Mar 6, 2025

enjoy-binbin approved these changes Mar 7, 2025

ranshid approved these changes Mar 7, 2025

madolson merged commit 8221a15 into valkey-io:unstable Mar 10, 2025
58 checks passed
