Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking: correct the semantics of offset in source connectors #18697

Open
3 of 9 tasks
stdrc opened this issue Sep 25, 2024 · 0 comments
Open
3 of 9 tasks

Tracking: correct the semantics of offset in source connectors #18697

stdrc opened this issue Sep 25, 2024 · 0 comments
Assignees
Labels
type/fix Bug fix type/tracking Tracking issue.
Milestone

Comments

@stdrc
Copy link
Contributor

stdrc commented Sep 25, 2024

Motivation

This project originates from a bug fix #16046, where I found the concept of “offset” especially “start offset” is very confusing in our source connectors, easily leading to potential bugs.

Background

For KafkaSplit and also some other such XxxSplits, the start_offset field has two different semantics in the following two situations:

  1. When a split struct is created by Enumerator, the start_offset means the offset of the first message we should read.
  2. When update_offset is called (by SourceExecutor), the start_offset is replaced with a “last seen offset” which is got from the stream chunk returned by Reader.

This semantic mismatch is the key problem.

Now our Kafka source does work, because we made the semantics in the 1st situation align with the 2nd one, by subtracting 1 from the offset when creating a Split and adding 1 to the offset when creating/recovering a Reader. However the solution is very counter-intuitive. By its name, the start_offset should’ve store the “start offset” or “next offset”, but it actually stores “last seen offset” or “previous offset”. And that also caused other source implementations to follow the error, because new sources just copied the design from old ones, especially file source, which is acting totally wrong.

I think a more reasonable and intuitive solution should be align the 2nd semantics with the 1st one, which is exactly the way how Flink, Kafka and plenty of other industrial products handle offsets. That is, to be more clear, always storing the “next offset” we should read next time. And this is also the only way that can correctly work for file source.

Progress

@stdrc stdrc added type/fix Bug fix type/tracking Tracking issue. labels Sep 25, 2024
@stdrc stdrc self-assigned this Sep 25, 2024
@github-actions github-actions bot added this to the release-2.1 milestone Sep 25, 2024
@stdrc stdrc changed the title Tracking: correct the semantics of next_offset in source connectors Tracking: correct the semantics of offset in source connectors Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/fix Bug fix type/tracking Tracking issue.
Projects
None yet
Development

No branches or pull requests

1 participant