Fix initialization of DataStorm samples after session recovery #3294
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,7 +57,7 @@ void ::Reader::run(int argc, char* argv[]) | |
auto connection = node.getSessionConnection(sample.getSession()); | ||
while (!connection) | ||
{ | ||
- this_thread::sleep_for(chrono::milliseconds(200)); | ||
+ this_thread::sleep_for(chrono::milliseconds(10)); | ||
connection = node.getSessionConnection(sample.getSession()); | ||
} | ||
connection->close().get(); | ||
|
@@ -68,6 +68,71 @@ void ::Reader::run(int argc, char* argv[]) | |
writer.update(0); | ||
writer.waitForNoReaders(); | ||
} | ||
|
||
{ | ||
Topic<string, int> topic(node, "int2"); | ||
auto reader = makeSingleKeyReader(topic, "element", "", config); | ||
string session; | ||
|
||
// Read 100 samples from the "element" key and close the connection. | ||
for (int i = 0; i < 100; ++i) | ||
{ | ||
auto sample = reader.getNextUnread(); | ||
if (sample.getValue() != i) | ||
{ | ||
cerr << "unexpected sample: " << sample.getValue() << " expected:" << i << endl; | ||
test(false); | ||
} | ||
session = sample.getSession(); | ||
|
||
} | ||
|
||
auto connection = node.getSessionConnection(session); | ||
while (!connection) | ||
{ | ||
this_thread::sleep_for(chrono::milliseconds(10)); | ||
connection = node.getSessionConnection(session); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This functions sometimes returns There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it returns nullptr when session is disconnected. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the while loop required? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we might be able to remove it. The idea was that the session might be recovering from a previous close connection. But here seems there is always a connection. |
||
} | ||
connection->close().get(); | ||
|
||
// Send a sample to the writer on "reader_barrier" to let it know that the connection was closed. | ||
// The writer will read it after the session is reestablished. | ||
auto writerB = makeSingleKeyWriter(topic, "reader_barrier"); | ||
writerB.waitForReaders(); | ||
writerB.update(0); | ||
|
||
// Wait for the writer to acknowledge the sample sent on "reader_barrier" and close the connection again. | ||
auto readerB = makeSingleKeyReader(topic, "writer_barrier"); | ||
[[maybe_unused]] auto _ = readerB.getNextUnread(); | ||
|
||
// Session was reestablished; close it again. | ||
connection = node.getSessionConnection(session); | ||
while (!connection) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same question here. Shouldn't the connection exist? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes fixed |
||
{ | ||
this_thread::sleep_for(chrono::milliseconds(10)); | ||
connection = node.getSessionConnection(session); | ||
} | ||
connection->close().get(); | ||
|
||
// Let the writer know the connection was closed again, and that it can proceed with the second batch of | ||
// samples. | ||
writerB.update(0); | ||
|
||
for (int i = 0; i < 100; ++i) | ||
{ | ||
auto sample = reader.getNextUnread(); | ||
if (sample.getValue() != i + 100) | ||
{ | ||
cerr << "unexpected sample: " << sample.getValue() << " expected:" << (i + 100) << endl; | ||
test(false); | ||
} | ||
session = sample.getSession(); | ||
|
||
} | ||
|
||
// Let the writer know we have processed all samples. | ||
writerB.waitForReaders(); | ||
writerB.update(0); | ||
writerB.waitForNoReaders(); | ||
} | ||
} | ||
|
||
DEFINE_TEST(::Reader) |
In the original code, when samples.empty(), we set elementSubscriber->lastId to 0.
It's not immediately clear why we don't need that. Is this lastId already 0 for some other reason?
I added a comment to explain the logic of lastId. lastId is default-initialized to 0 in Session.h.
Ok, and the fix in this PR is exactly that: to not set lastId to 0 when samples is empty?
The fix is to not reset lastId to 0 when the subscriber is initialized after recovery.
The subscriber receives some samples and lastId is updated accordingly.
Then the session is lost. When it reconnects, subscriberInitialized is called again.
If the new call sent no samples, because there were no new samples since the recovery, the previous code was resetting lastId to 0 (that is the bug).
Now if the session is lost again, the next recovery would tell the peer that the last ID it saw is 0, and the peer would send all queued elements. That is what was happening with the test failure.
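To make the sequence above concrete, here is a simplified model of the behavior described. It is a sketch under stated assumptions, not the DataStorm implementation: `Sample`, `Subscriber`, `initializeBuggy`, `initializeFixed`, and `replayAfter` are hypothetical names, and the real logic lives in `subscriberInitialized` and `elementSubscriber->lastId`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; not the actual DataStorm types.
struct Sample
{
    std::int64_t id;
    int value;
};

struct Subscriber
{
    std::int64_t lastId = 0; // starts at 0, as the thread notes for Session.h
};

// Behavior described as the bug: an empty initialization batch resets lastId.
void initializeBuggy(Subscriber& subscriber, const std::vector<Sample>& samples)
{
    subscriber.lastId = samples.empty() ? 0 : samples.back().id;
}

// Behavior described as the fix: an empty batch leaves lastId untouched.
void initializeFixed(Subscriber& subscriber, const std::vector<Sample>& samples)
{
    if (!samples.empty())
    {
        subscriber.lastId = samples.back().id;
    }
}

// Peer side of a recovery: resend every queued sample newer than the
// last ID the subscriber reports.
std::vector<Sample> replayAfter(const std::vector<Sample>& queue, std::int64_t lastId)
{
    std::vector<Sample> replay;
    for (const auto& sample : queue)
    {
        if (sample.id > lastId)
        {
            replay.push_back(sample);
        }
    }
    return replay;
}
```

With a queue holding IDs 1 to 3, a first initialization sets `lastId` to 3 in both variants. A second initialization with an empty batch leaves it at 3 in the fixed variant but drops it to 0 in the buggy one, so the next `replayAfter` call returns the whole queue again, which matches the duplicate samples described in the test failure.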
Pushed an additional test that allows reproducing the initial issue.