stdrc changed the title from "Tracking: correct the semantics of next_offset in source connectors" to "Tracking: correct the semantics of offset in source connectors" on Sep 26, 2024.
Motivation
This project originates from bug fix #16046, where I found that the concept of "offset", especially "start offset", is very confusing in our source connectors, easily leading to potential bugs.
Background
(Note: `update_with_offset(start_offset)` was renamed to `update_offset(last_read_offset)` in #16256.)

For `KafkaSplit` and some other such `XxxSplit`s, the `start_offset` field has two different semantics in the following two situations:

1. When a split struct is created by the `Enumerator`, `start_offset` means the offset of the first message we should read.
2. When `update_offset` is called (by `SourceExecutor`), `start_offset` is replaced with a "last seen offset", which is obtained from the stream chunk returned by the `Reader`.

This semantic mismatch is the key problem.
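The two conflicting meanings can be sketched as follows. This is a hypothetical, simplified struct (not RisingWave's actual `KafkaSplit` type); the names `new_from_enumerator` and the field layout are illustrative only:

```rust
/// Hypothetical sketch of a split whose `start_offset` field carries two
/// different semantics depending on who last wrote it.
#[derive(Debug, Clone)]
struct KafkaSplitLike {
    partition: i32,
    /// Semantics 1 (written by the enumerator): offset of the FIRST message to read.
    /// Semantics 2 (written by `update_offset`): offset of the LAST message seen.
    start_offset: i64,
}

impl KafkaSplitLike {
    /// Semantics 1: the enumerator creates the split pointing at the first
    /// message that should be read.
    fn new_from_enumerator(partition: i32, first_offset_to_read: i64) -> Self {
        Self { partition, start_offset: first_offset_to_read }
    }

    /// Semantics 2: the executor overwrites the same field with the offset of
    /// the last message already seen in a stream chunk.
    fn update_offset(&mut self, last_seen_offset: i64) {
        self.start_offset = last_seen_offset;
    }
}

fn main() {
    let mut split = KafkaSplitLike::new_from_enumerator(0, 100);
    // Here `start_offset` means "read message 100 next".
    assert_eq!(split.start_offset, 100);

    // The reader consumes messages 100..=104 and reports the last one.
    split.update_offset(104);
    // Now the SAME field means "message 104 was already read": a reader
    // recovering from this state must add 1, or it re-reads message 104.
    assert_eq!(split.start_offset, 104);
    println!("partition {}: start_offset = {}", split.partition, split.start_offset);
}
```

A reader that interprets the field under the wrong semantics either re-reads or skips one message, which is exactly the class of bug this issue tracks.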
Our Kafka source does work today, because we made the semantics of the first situation align with the second: we subtract 1 from the offset when creating a `Split` and add 1 to the offset when creating/recovering a `Reader`. However, this solution is very counter-intuitive. Going by its name, `start_offset` should store the "start offset" or "next offset", but it actually stores the "last seen offset" or "previous offset". This also caused other source implementations to inherit the error, because new sources simply copied the design from old ones; the file source in particular behaves completely incorrectly.

I think a more reasonable and intuitive solution is to align the second semantics with the first, which is exactly how Flink, Kafka, and plenty of other industrial products handle offsets. That is, to be clear: always store the "next offset" we should read next time. This is also the only way that can work correctly for the file source.
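The proposed "always store the next offset" convention can be sketched like this. Again a hypothetical type, not RisingWave's actual implementation; `advance_past` and `resume_from` are illustrative names:

```rust
/// Hypothetical sketch of split state under the proposed semantics: the field
/// always holds the NEXT offset to read, matching what the enumerator writes,
/// so no +1/-1 fixups are scattered across the code.
#[derive(Debug, Clone)]
struct SplitState {
    /// Always "the offset we should read next time", never "last seen".
    next_offset: i64,
}

impl SplitState {
    /// The enumerator and the executor now agree on the meaning of the field.
    fn new(first_offset_to_read: i64) -> Self {
        Self { next_offset: first_offset_to_read }
    }

    /// After emitting a chunk whose last message had offset `last_seen`,
    /// advance past it. The +1 lives in exactly one place.
    fn advance_past(&mut self, last_seen: i64) {
        self.next_offset = last_seen + 1;
    }

    /// On recovery, a reader seeks directly to `next_offset` with no
    /// adjustment; the same code path works for a fresh split.
    fn resume_from(&self) -> i64 {
        self.next_offset
    }
}

fn main() {
    let mut state = SplitState::new(100);
    assert_eq!(state.resume_from(), 100); // fresh split: start reading at 100

    state.advance_past(104); // read messages 100..=104
    assert_eq!(state.resume_from(), 105); // recovery resumes at 105, no +1 needed
}
```

This convention also matters for sources without dense integer offsets: for a file source, "the position to resume reading from" is always well-defined, whereas "last seen offset plus one" only makes sense for Kafka-style sequential message numbers.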
Progress
- `scan.startup.mode=latest` of NATS with others (#18733)
- `KinesisOffset` and `KinesisSplit` to make everything explicit (#18704)
- PubSub
- MQTT