Fix initialization of DataStorm samples after session recovery #3294

Merged
merged 7 commits into from
Dec 30, 2024
Changes from 5 commits
55 changes: 36 additions & 19 deletions cpp/src/DataStorm/SessionI.cpp
@@ -1134,28 +1134,45 @@ SessionI::subscriberInitialized(
out << _id << ": initialized '" << element << "' from 'e" << elementId << '@' << topicId << "'";
}
elementSubscriber->initialized = true;
elementSubscriber->lastId = samples.empty() ? 0 : samples.back().id;

vector<shared_ptr<Sample>> samplesI;
samplesI.reserve(samples.size());
auto sampleFactory = element->getTopic()->getSampleFactory();
auto keyFactory = element->getTopic()->getKeyFactory();
for (const auto& sample : samples)
// If the samples collection is empty, the element subscriber's lastId remains unchanged:
// - If no samples have been received, lastId is 0.
// - If the element subscriber has been initialized before, lastId represents the ID of the latest received sample.
//
// If the samples collection is not empty:
// - It contains samples queued in the peer writer for the element that are valid according to the element's
// configuration.
// - These samples have not yet been processed by the element subscriber, according to the subscriber's lastId.
if (samples.empty())
{
assert((!key && !sample.keyValue.empty()) || key == subscriber.keys[sample.keyId].first);

samplesI.push_back(sampleFactory->create(
_id,
elementSubscribers->name,
sample.id,
sample.event,
key ? key : keyFactory->decode(_instance->getCommunicator(), sample.keyValue),
subscriber.tags[sample.tag],
sample.value,
sample.timestamp));
assert(samplesI.back()->key);
return {};
Member

In the original code, when samples.empty(), we set elementSubscriber->lastId to 0.

It's not immediately clear why we don't need that. Is this lastId already 0 for some other reason?

Member Author

I added a comment to explain the logic of lastId.

lastId is default initialized to 0 in Session.h

// The ID of the last processed sample.
std::int64_t lastId{0};

Member

Ok, and the fix in this PR is exactly that: to not set lastId to 0 when samples is empty?

Member Author

@pepone Dec 20, 2024

The fix is to not reset to 0 when the subscriber is initialized after recovery.

The subscriber received some samples and lastId is updated accordingly.

Then the session is lost; when it reconnects, subscriberInitialized is called again.

If the new call sends no samples because there were no new samples since the recovery, the previous code reset lastId to 0 (that is the bug).

Now, if the session is lost again, the next recovery tells the peer that the last ID it saw is 0, and the peer sends all queued elements. That is what was happening with the test failure.
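
The sequence described above can be modeled in a few lines. This is a minimal sketch, not the DataStorm implementation: `ModelSample`, `ModelSubscriber`, `initializedOld`, `initializedNew`, `resend`, and `duplicatesAfterTwoRecoveries` are hypothetical names, and the peer is assumed to resend every queued sample whose ID is greater than the reported lastId.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the DataStorm types; not the real API.
struct ModelSample
{
    std::int64_t id;
};

struct ModelSubscriber
{
    // The ID of the last processed sample (0 means none received yet).
    std::int64_t lastId{0};
};

// Pre-fix behavior: an empty batch resets lastId to 0.
inline void initializedOld(ModelSubscriber& s, const std::vector<ModelSample>& samples)
{
    s.lastId = samples.empty() ? 0 : samples.back().id;
}

// Post-fix behavior: an empty batch leaves lastId unchanged.
inline void initializedNew(ModelSubscriber& s, const std::vector<ModelSample>& samples)
{
    if (samples.empty())
    {
        return;
    }
    assert(samples.front().id > s.lastId);
    s.lastId = samples.back().id;
}

// Assumed peer behavior on recovery: resend every queued sample newer than lastId.
inline std::vector<ModelSample> resend(const std::vector<ModelSample>& queue, std::int64_t lastId)
{
    std::vector<ModelSample> out;
    for (const auto& sample : queue)
    {
        if (sample.id > lastId)
        {
            out.push_back(sample);
        }
    }
    return out;
}

// Runs the initial initialization and the first recovery, then returns how many
// samples the peer would resend on the second recovery.
inline std::size_t duplicatesAfterTwoRecoveries(bool useFix)
{
    const std::vector<ModelSample> queue{{1}, {2}, {3}};
    ModelSubscriber s;
    auto init = useFix ? initializedNew : initializedOld;
    init(s, resend(queue, s.lastId));      // initial: receives 1..3
    init(s, resend(queue, s.lastId));      // first recovery: nothing new
    return resend(queue, s.lastId).size(); // second recovery
}
```

Under these assumptions, `duplicatesAfterTwoRecoveries(false)` returns 3 (the whole queue is resent) and `duplicatesAfterTwoRecoveries(true)` returns 0.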

Member Author

Pushed an additional test that allows reproducing the initial issue.

}
else
{
assert(samples.front().id > elementSubscriber->lastId);
elementSubscriber->lastId = samples.back().id;

vector<shared_ptr<Sample>> samplesI;
samplesI.reserve(samples.size());
auto sampleFactory = element->getTopic()->getSampleFactory();
auto keyFactory = element->getTopic()->getKeyFactory();
for (const auto& sample : samples)
{
assert((!key && !sample.keyValue.empty()) || key == subscriber.keys[sample.keyId].first);

samplesI.push_back(sampleFactory->create(
_id,
elementSubscribers->name,
sample.id,
sample.event,
key ? key : keyFactory->decode(_instance->getCommunicator(), sample.keyValue),
subscriber.tags[sample.tag],
sample.value,
sample.timestamp));
assert(samplesI.back()->key);
}
return samplesI;
}
return samplesI;
}

void
67 changes: 66 additions & 1 deletion cpp/test/DataStorm/reliability/Reader.cpp
@@ -57,7 +57,7 @@ void ::Reader::run(int argc, char* argv[])
auto connection = node.getSessionConnection(sample.getSession());
while (!connection)
{
this_thread::sleep_for(chrono::milliseconds(200));
this_thread::sleep_for(chrono::milliseconds(10));
connection = node.getSessionConnection(sample.getSession());
}
connection->close().get();
@@ -68,6 +68,71 @@ void ::Reader::run(int argc, char* argv[])
writer.update(0);
writer.waitForNoReaders();
}

{
Topic<string, int> topic(node, "int2");
auto reader = makeSingleKeyReader(topic, "element", "", config);
string session;

// Read 100 samples from the "element" key and close the connection.
for (int i = 0; i < 100; ++i)
{
auto sample = reader.getNextUnread();
if (sample.getValue() != i)
{
cerr << "unexpected sample: " << sample.getValue() << " expected:" << i << endl;
test(false);
}
session = sample.getSession();
}

auto connection = node.getSessionConnection(session);
while (!connection)
{
this_thread::sleep_for(chrono::milliseconds(10));
connection = node.getSessionConnection(session);
Member

This function sometimes returns nullptr?

Member Author

Yes, it returns nullptr when the session is disconnected.

Member

Why is the while loop required?

Member Author

I think we might be able to remove it. The idea was that the session might still be recovering from a previous connection closure. But here it seems there is always a connection.
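
If the polling loop is kept, one option is to bound it with a timeout so that a session that never reconnects fails the test instead of hanging. A minimal sketch, assuming a hypothetical helper `pollUntilAvailable` (not part of DataStorm or the test suite) and a lookup that returns a pointer-like value that is null while the session is disconnected:

```cpp
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical helper: calls `lookup` until it returns a non-null, pointer-like
// value or until `timeout` elapses. Returns the last result, which is null on timeout.
template<typename Lookup>
auto pollUntilAvailable(Lookup lookup, std::chrono::milliseconds interval, std::chrono::milliseconds timeout)
    -> decltype(lookup())
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    auto result = lookup();
    while (!result && std::chrono::steady_clock::now() < deadline)
    {
        std::this_thread::sleep_for(interval);
        result = lookup();
    }
    return result;
}
```

In the reader test, the lookup could be a lambda wrapping `node.getSessionConnection(session)`, with the caller checking the result for null before calling `close()`. Whether that suits the test depends on what `getSessionConnection` actually returns, which is not shown here.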

}
connection->close().get();

// Send a sample to the writer on "reader_barrier" to let it know that the connection was closed.
// The writer will read it after the session is reestablished.
auto writerB = makeSingleKeyWriter(topic, "reader_barrier");
writerB.waitForReaders();
writerB.update(0);

// Wait for the writer to acknowledge the sample sent on "reader_barrier" and close the connection again.
auto readerB = makeSingleKeyReader(topic, "writer_barrier");
[[maybe_unused]] auto _ = readerB.getNextUnread();

// Session was reestablished; close it again.
connection = node.getSessionConnection(session);
while (!connection)
Member

Same question here. Shouldn't the connection exist?

Member Author

Yes, fixed.

{
this_thread::sleep_for(chrono::milliseconds(10));
connection = node.getSessionConnection(session);
}
connection->close().get();

// Let the writer know the connection was closed again, and that it can proceed with the second batch of
// samples.
writerB.update(0);

for (int i = 0; i < 100; ++i)
{
auto sample = reader.getNextUnread();
if (sample.getValue() != i + 100)
{
cerr << "unexpected sample: " << sample.getValue() << " expected:" << (i + 100) << endl;
test(false);
}
session = sample.getSession();
}

// Let the writer know we have processed all samples.
writerB.waitForReaders();
writerB.update(0);
writerB.waitForNoReaders();
}
}

DEFINE_TEST(::Reader)
37 changes: 37 additions & 0 deletions cpp/test/DataStorm/reliability/Writer.cpp
@@ -54,6 +54,43 @@ void ::Writer::run(int argc, char* argv[])
[[maybe_unused]] auto _ = makeSingleKeyReader(topic, "barrier").getNextUnread();
}
cout << "ok" << endl;

// Publish a batch of samples to a topic's key, followed by two consecutive session recovery events without writer
// activity on the given key.
// Then send a second batch of samples to the same topic's key and ensure the reader continues reading from where it
// left off.
cout << "testing reader multiple connection closure without writer activity... " << flush;
{
Topic<string, int> topic(node, "int2");
auto writer = makeSingleKeyWriter(topic, "element", "", config);
writer.waitForReaders();
for (int i = 0; i < 100; ++i)
{
writer.update(i);
}

auto readerB = makeSingleKeyReader(topic, "reader_barrier");

// A control sample sent by the reader to let the writer know the connection was closed. The writer processes this
// sample after the first session reestablishment.
auto sample = readerB.getNextUnread();

// Send a control sample to let the reader know session was reestablished.
auto writerB = makeSingleKeyWriter(topic, "writer_barrier");
writerB.update(0);

// Wait for a second control sample from the reader indicating the second session closure. The writer processes
// this sample after the second session reestablishment.
sample = readerB.getNextUnread();

// Session has been reestablished twice without activity on the "element" key. Send the second batch of samples.
for (int i = 0; i < 100; ++i)
{
writer.update(i + 100);
}
sample = readerB.getNextUnread();
}
cout << "ok" << endl;
}

DEFINE_TEST(::Writer)