Load CDK: Stream Manager Support for Id-Based Checkpointing #53646

Open
wants to merge 1 commit into
base: master

@johnny-schmidt johnny-schmidt commented Feb 12, 2025

What

This is a nonfunctional change, as nothing is wired into the CDK pipeline yet.

This adds support for id-based checkpointing.

Currently we use range-based checkpointing, which works like this:

  • each record has an associated index
  • the indexes are aggregated into ranges for batches, then into rangesets in the stream manager
  • when all contiguous rangesets up to record X have been persisted, state for up to that index can be evicted
  • when all have been completed (and all the records have been read) then the stream can be closed
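
To make the bookkeeping concrete, here is a minimal Python sketch of range-based tracking. It is an illustration only, not the CDK's code: `RangeTracker`, `add`, and `contiguous_through` are hypothetical names, and half-open `(start, end)` index ranges starting at 0 are an assumption.

```python
class RangeTracker:
    """Hypothetical tracker: persisted record indexes as sorted, disjoint,
    half-open (start, end) ranges."""

    def __init__(self):
        self.ranges = []

    def add(self, start, end):
        """Merge [start, end) into the set, coalescing overlapping or
        adjacent ranges."""
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                # No overlap and not adjacent: keep the range unchanged.
                merged.append((s, e))
            else:
                # Overlapping or touching: absorb it into the new range.
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self.ranges = sorted(merged)

    def contiguous_through(self, index):
        """True if every index in [0, index] has been added."""
        if not self.ranges:
            return False
        first_start, first_end = self.ranges[0]
        return first_start == 0 and first_end > index


tracker = RangeTracker()
tracker.add(0, 100)
tracker.add(200, 300)   # leaves a gap at [100, 200)
tracker.add(100, 200)   # fills the gap; the three ranges collapse into one
```

Every `add` walks and rebuilds the range list, which is the kind of per-batch work the id-based approach avoids.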

This causes some issues:

  • managing rangesets is inefficient
  • managing rangesets is error-prone

Changes for the new interface make bookkeeping a little more generic, which raises the cost of both issues enough that they start to affect performance (I saw a 25% perf hit hacking around it in my POC PR).

This adds partial support for checkpointing by checkpoint id:

  • each time we checkpoint a stream, we increment a monotonic id
  • all records up to the checkpoint are associated with that id
  • when we persist, we just count the persisted records against that id
  • when we complete, we count completed records against that id

The sufficiency checks are then just:

  • persisted for id N: sum of all persisted counts up to id N == sum of all read counts up to id N
  • completed: sum of all completed counts == total records read, and we've seen EOS
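
The counting scheme and both checks can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the stream manager's implementation: `CheckpointCounts` and all of its method names are hypothetical, and the requirement that an id be closed before it can count as persisted is an added assumption.

```python
from collections import defaultdict


class CheckpointCounts:
    """Hypothetical per-stream counters keyed by checkpoint id."""

    def __init__(self):
        self.next_id = 0                    # id that new records are counted against
        self.read = defaultdict(int)        # checkpoint id -> records read
        self.persisted = defaultdict(int)   # checkpoint id -> records persisted
        self.completed = defaultdict(int)   # checkpoint id -> records completed
        self.saw_end_of_stream = False

    def record_read(self):
        self.read[self.next_id] += 1

    def checkpoint(self):
        """Close the current id and return it; later records use the next id."""
        closed = self.next_id
        self.next_id += 1
        return closed

    def count_persisted(self, checkpoint_id, n):
        self.persisted[checkpoint_id] += n

    def count_completed(self, checkpoint_id, n):
        self.completed[checkpoint_id] += n

    def mark_end_of_stream(self):
        self.saw_end_of_stream = True

    def is_persisted_through(self, checkpoint_id):
        """Persisted counts up to the id equal read counts up to the id."""
        if checkpoint_id >= self.next_id:
            return False  # assumption: an id that is still open cannot be done
        read = sum(c for i, c in self.read.items() if i <= checkpoint_id)
        done = sum(c for i, c in self.persisted.items() if i <= checkpoint_id)
        return read == done

    def is_complete(self):
        """All completed counts equal all read counts, and EOS was seen."""
        return (self.saw_end_of_stream
                and sum(self.completed.values()) == sum(self.read.values()))
```

Persisting out of order is handled by the prefix sums: if a later id's records are persisted first, `is_persisted_through` for that id stays false until every earlier id has caught up.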

This is way easier than trying to manage rangesets, though we lose idempotence.
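
One reading of the idempotence tradeoff, sketched in Python with hypothetical helper names: merging the same range twice leaves the covered set unchanged, while counting the same batch twice inflates the total.

```python
def covered_by_ranges(ranges):
    """Distinct indexes covered by half-open (start, end) ranges."""
    covered = set()
    for start, end in ranges:
        covered.update(range(start, end))
    return len(covered)


def covered_by_counts(counts):
    """Total records according to plain per-batch counts."""
    return sum(counts)


# The same 100-record batch reported twice, e.g. after a retry.
reports = [(0, 100), (0, 100)]

range_total = covered_by_ranges(reports)                                # 100
count_total = covered_by_counts(end - start for start, end in reports)  # 200
```

With counts, the caller has to guarantee that each batch is reported exactly once.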

@johnny-schmidt johnny-schmidt requested a review from a team as a code owner February 12, 2025 00:54

Labels
CDK Connector Development Kit