stdrc changed the title from "Tracking: correct the semantics of next_offset in source connectors" to "Tracking: correct the semantics of offset in source connectors" on Sep 26, 2024.
Motivation
This project originates from bug fix #16046, where I found that the concept of "offset", especially "start offset", is very confusing in our source connectors, easily leading to potential bugs.
Background
(Note: `update_with_offset(start_offset)` was renamed to `update_offset(last_read_offset)` in #16256.)

For `KafkaSplit` and some other such `XxxSplit`s, the `start_offset` field has two different semantics in the following two situations:

1. When a split struct is created by the `Enumerator`, `start_offset` means the offset of the first message we should read.
2. When `update_offset` is called (by `SourceExecutor`), `start_offset` is replaced with a "last seen offset", which is obtained from the stream chunk returned by the `Reader`.

This semantic mismatch is the key problem.
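The two conflicting meanings can be sketched as follows. This is a hypothetical, simplified struct (not RisingWave's actual `KafkaSplit` type); the names `new_from_enumerator` and the field layout are illustrative only:

```rust
/// Hypothetical sketch of a split whose `start_offset` field carries two
/// different semantics depending on who last wrote it.
#[derive(Debug, Clone)]
struct KafkaSplitLike {
    partition: i32,
    /// Semantics 1 (written by the enumerator): offset of the FIRST message to read.
    /// Semantics 2 (written by `update_offset`): offset of the LAST message seen.
    start_offset: i64,
}

impl KafkaSplitLike {
    /// Semantics 1: the enumerator creates the split pointing at the first
    /// message that should be read.
    fn new_from_enumerator(partition: i32, first_offset_to_read: i64) -> Self {
        Self { partition, start_offset: first_offset_to_read }
    }

    /// Semantics 2: the executor overwrites the same field with the offset of
    /// the last message already seen in a stream chunk.
    fn update_offset(&mut self, last_seen_offset: i64) {
        self.start_offset = last_seen_offset;
    }
}

fn main() {
    let mut split = KafkaSplitLike::new_from_enumerator(0, 100);
    // Here `start_offset` means "read message 100 next".
    assert_eq!(split.start_offset, 100);

    // The reader consumes messages 100..=104 and reports the last one.
    split.update_offset(104);
    // Now the SAME field means "message 104 was already read": a reader
    // recovering from this state must add 1, or it re-reads message 104.
    assert_eq!(split.start_offset, 104);
    println!("partition {}: start_offset = {}", split.partition, split.start_offset);
}
```

A reader that interprets the field under the wrong semantics either re-reads or skips one message, which is exactly the class of bug this issue tracks.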
Our Kafka source does work today, because we made the semantics of the first situation align with the second: we subtract 1 from the offset when creating a `Split` and add 1 to the offset when creating/recovering a `Reader`. However, this solution is very counter-intuitive. Going by its name, `start_offset` should store the "start offset" or "next offset", but it actually stores the "last seen offset" or "previous offset". This also caused other source implementations to inherit the error, because new sources simply copied the design from old ones; the file source in particular behaves completely incorrectly.

I think a more reasonable and intuitive solution is to align the second semantics with the first, which is exactly how Flink, Kafka, and plenty of other industrial products handle offsets. That is, to be clear: always store the "next offset" we should read next time. This is also the only way that can work correctly for the file source.
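The proposed "always store the next offset" convention can be sketched like this. Again a hypothetical type, not RisingWave's actual implementation; `advance_past` and `resume_from` are illustrative names:

```rust
/// Hypothetical sketch of split state under the proposed semantics: the field
/// always holds the NEXT offset to read, matching what the enumerator writes,
/// so no +1/-1 fixups are scattered across the code.
#[derive(Debug, Clone)]
struct SplitState {
    /// Always "the offset we should read next time", never "last seen".
    next_offset: i64,
}

impl SplitState {
    /// The enumerator and the executor now agree on the meaning of the field.
    fn new(first_offset_to_read: i64) -> Self {
        Self { next_offset: first_offset_to_read }
    }

    /// After emitting a chunk whose last message had offset `last_seen`,
    /// advance past it. The +1 lives in exactly one place.
    fn advance_past(&mut self, last_seen: i64) {
        self.next_offset = last_seen + 1;
    }

    /// On recovery, a reader seeks directly to `next_offset` with no
    /// adjustment; the same code path works for a fresh split.
    fn resume_from(&self) -> i64 {
        self.next_offset
    }
}

fn main() {
    let mut state = SplitState::new(100);
    assert_eq!(state.resume_from(), 100); // fresh split: start reading at 100

    state.advance_past(104); // read messages 100..=104
    assert_eq!(state.resume_from(), 105); // recovery resumes at 105, no +1 needed
}
```

This convention also matters for sources without dense integer offsets: for a file source, "the position to resume reading from" is always well-defined, whereas "last seen offset plus one" only makes sense for Kafka-style sequential message numbers.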
Progress
- `scan.startup.mode=latest` of NATS with others (#18733)
- `KinesisOffset` and `KinesisSplit` to make everything explicit (#18704)
- PubSub
- MQTT