feat: Add in-commit timestamp support for change data feed #617

Open
wants to merge 1 commit into
base: main

OussamaSaoudi-db
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 30, 2024

What changes are proposed in this pull request?

This adds support for in-commit timestamps (ICT) when reading the change data feed. When a commit contains a commitInfo action with inCommitTimestamp, that timestamp is used for all changed rows in the commit.
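
As a rough illustration of the behavior described here, the sketch below stamps every changed row of a commit with the ICT when one is present and falls back to the file modification time otherwise. The `Commit` struct and `row_timestamps` function are hypothetical names made up for this example; they are not the kernel's actual types or API.

```rust
/// Hypothetical, simplified view of a commit; not the kernel's actual types.
struct Commit {
    /// Modification time of the commit file, in milliseconds since the epoch.
    file_modification_time: i64,
    /// `commitInfo.inCommitTimestamp`, if the commit carries one.
    in_commit_timestamp: Option<i64>,
    /// Number of changed rows produced by this commit.
    changed_rows: usize,
}

/// Returns one timestamp per changed row: the ICT when present,
/// otherwise the file modification time.
fn row_timestamps(commit: &Commit) -> Vec<i64> {
    let ts = commit
        .in_commit_timestamp
        .unwrap_or(commit.file_modification_time);
    vec![ts; commit.changed_rows]
}
```

The point of the sketch is only that a single timestamp is chosen per commit and shared by all of its rows.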

Depends on #581

Please only review these commits.

How was this change tested?

Added tests checking that the timestamp extracted from commits containing in-commit timestamps is the ICT rather than the file modification time.


codecov bot commented Dec 30, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.15%. Comparing base (06d8dbb) to head (d295ffc).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| kernel/src/table_changes/log_replay.rs | 86.66% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #617   +/-   ##
=======================================
  Coverage   84.14%   84.15%           
=======================================
  Files          77       77           
  Lines       17710    17801   +91     
  Branches    17710    17801   +91     
=======================================
+ Hits        14902    14980   +78     
- Misses       2096     2106   +10     
- Partials      712      715    +3     


@OussamaSaoudi-db OussamaSaoudi-db changed the title from "feat: Add in-commit timestamp support for change data fede" to "feat: Add in-commit timestamp support for change data feed" on Jan 2, 2025
Collaborator

@scovich scovich left a comment

LGTM, one nit

Collaborator

@zachschuermann zachschuermann left a comment

A few things, but looks good tho!

    for actions in action_iter {
        let actions = actions?;

        let mut visitor = PreparePhaseVisitor {
            add_paths: &mut add_paths,
            remove_dvs: &mut remove_dvs,
            has_cdc_action: &mut has_cdc_action,
            commit_timestamp: &mut timestamp,
Collaborator

would this be clearer?

Suggested change
- commit_timestamp: &mut timestamp,
+ in_commit_timestamp: &mut timestamp,

Collaborator Author

We initialize this field with the file modification timestamp, so it would be inaccurate to call it that. I do like the update you made below tho, where we actually read the ICT from a CommitInfo.
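
To make the naming argument concrete, here is a minimal sketch of the seeding pattern, assuming a visitor that writes through a mutable reference. `ActionRow`, `TimestampVisitor`, and `commit_timestamp` are hypothetical stand-ins invented for this example, not the kernel's `PreparePhaseVisitor`.

```rust
/// Hypothetical stand-in for an action row; only the field relevant here.
struct ActionRow {
    in_commit_timestamp: Option<i64>,
}

/// Simplified visitor that writes through a mutable reference, similar in
/// spirit to a `commit_timestamp: &mut timestamp` field.
struct TimestampVisitor<'a> {
    commit_timestamp: &'a mut i64,
}

impl TimestampVisitor<'_> {
    fn visit(&mut self, row: &ActionRow) {
        // Overwrite the seeded value only when the row carries an ICT.
        if let Some(ict) = row.in_commit_timestamp {
            *self.commit_timestamp = ict;
        }
    }
}

/// Seeds the timestamp with the file modification time, then lets the
/// visitor replace it if any row provides an in-commit timestamp.
fn commit_timestamp(file_modification_time: i64, rows: &[ActionRow]) -> i64 {
    let mut timestamp = file_modification_time;
    let mut visitor = TimestampVisitor { commit_timestamp: &mut timestamp };
    for row in rows {
        visitor.visit(row);
    }
    timestamp
}
```

Because the referenced value may still hold the file modification time after the visit, a neutral field name describes it more accurately than one that mentions ICT.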

@@ -136,15 +137,14 @@ impl LogReplayScanner {
/// 2. Construct a map from path to deletion vector of remove actions that share the same path
/// as an add action.
/// 3. Perform validation on each protocol and metadata action in the commit.
/// 4. Extract the in-commit timestamp from [`CommitInfo`] if it is present.
Collaborator

I can't comment on L130 above but I think we need to do some comment updates?

Collaborator Author

Good catch! I went through every mention of ICT and I think I got them all.

Comment on lines +622 to +626
    Action::CommitInfo(CommitInfo {
        in_commit_timestamp: Some(timestamp),
        ..Default::default()
    }),
Collaborator

what happens if commit info isn't first? do we still read it? I know the protocol says it must be first with ICT enabled but I wonder what the expected behavior is when it isn't first? do we do the right thing?

Collaborator

(but probably don't solve here)

Collaborator Author

Discussed a little here:
#581 (comment)

I'm still quite certain that delta-spark doesn't care about the ordering, because it goes through all the actions in the commit looking for CommitInfo:

        var commitInfo: Option[CommitInfo] = None
        actions.foreach {
          case c: AddCDCFile =>
            cdcActions.append(c)
            totalFiles += 1L
            totalBytes += c.size
          case a: AddFile =>
            totalFiles += 1L
            totalBytes += a.size
          case r: RemoveFile =>
            totalFiles += 1L
            totalBytes += r.size.getOrElse(0L)
          case i: CommitInfo => commitInfo = Some(i)
          case _ => // do nothing
        }

I've added a check that only uses the ICT if CommitInfo is the first action in the commit, but that raises a question: should we fail if it isn't the first action?
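
The two behaviors under discussion can be contrasted with a small sketch. The `Action` enum and both function names below are hypothetical and heavily simplified; they only illustrate the difference between accepting a CommitInfo anywhere in the commit and accepting it only in first position.

```rust
/// Hypothetical, pared-down action type; not the kernel's `Action` enum.
#[allow(dead_code)]
enum Action {
    Add,
    Remove,
    CommitInfo { in_commit_timestamp: Option<i64> },
}

/// Lenient: accepts an ICT from a CommitInfo at any position
/// (the first CommitInfo that carries an ICT wins).
fn ict_any_position(actions: &[Action]) -> Option<i64> {
    actions.iter().find_map(|action| match action {
        Action::CommitInfo { in_commit_timestamp } => *in_commit_timestamp,
        _ => None,
    })
}

/// Strict: only honours the ICT when CommitInfo is the first action.
fn ict_first_only(actions: &[Action]) -> Option<i64> {
    match actions.first() {
        Some(Action::CommitInfo { in_commit_timestamp }) => *in_commit_timestamp,
        _ => None,
    }
}
```

With the strict variant, a misplaced CommitInfo silently yields `None`, so the caller would fall back to the file modification time. Returning an error instead is the alternative the question above is asking about.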

Collaborator Author

I can also revert the check that CommitInfo is first and revisit that in a future PR.

    Arc::new(StructType::new(vec![
        Option::<Add>::get_struct_field(ADD_NAME),
        Option::<Remove>::get_struct_field(REMOVE_NAME),
        Option::<Cdc>::get_struct_field(CDC_NAME),
        Option::<Metadata>::get_struct_field(METADATA_NAME),
        Option::<Protocol>::get_struct_field(PROTOCOL_NAME),
        StructField::new("commitInfo", StructType::new([ict_type]), true),
Collaborator

though i wonder if we can do something similar to above like Option<CommitInfo>::get_struct_field(COMMIT_INFO_NAME) and get struct field inCommitTimestamp of that?

but for now at least can use COMMIT_INFO_NAME?

Suggested change
- StructField::new("commitInfo", StructType::new([ict_type]), true),
+ StructField::new(COMMIT_INFO_NAME, StructType::new([ict_type]), true),

Collaborator Author

> i wonder if we can do something similar to above like Option<CommitInfo>::get_struct_field(COMMIT_INFO_NAME) and get struct field inCommitTimestamp of that?

We would get a StructField of type CommitInfo, for which we'd have to 1) get the data type, 2) cast it to a struct, and 3) get the ICT field. So I'll stick with your suggested change 👍
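
For readers unfamiliar with why the derived field is more work, the three steps can be sketched on a toy schema model. `DataType`, `StructField`, and `nested_field` below are hypothetical miniature types written for this example; they are not the kernel's real schema definitions or helper functions.

```rust
/// Hypothetical miniature schema types; not the kernel's real definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Long,
    Struct(Vec<StructField>),
}

#[derive(Debug, Clone, PartialEq)]
struct StructField {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Pulls a nested field out of a struct-typed field:
/// 1) get the data type, 2) check that it is a struct, 3) find the child by name.
fn nested_field<'a>(parent: &'a StructField, child: &str) -> Option<&'a StructField> {
    match &parent.data_type {
        DataType::Struct(fields) => fields.iter().find(|f| f.name == child),
        _ => None,
    }
}
```

Building the one-field `commitInfo` struct directly avoids this unwrapping, which is the trade-off being accepted in the reply above.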

    Action::Cdc(cdc.clone()),
    Action::CommitInfo(commit_info.clone()),
Collaborator

are these ordered? should commit info be first?

Collaborator Author

swapped ordering

@OussamaSaoudi-db OussamaSaoudi-db force-pushed the cdf_ict_impl branch 2 times, most recently from 7242904 to f50c202 Compare January 10, 2025 00:13
@OussamaSaoudi OussamaSaoudi added the merge hold Don't allow the PR to merge label Jan 10, 2025
Update docs for ICT

Assert selection vector

wip ict impl

add tmp ict fixup for writes

remove non_ict schema

fix commit

Remove cdf changes for ict

remove unused imports

Add clarifying comment for inCommitTimestamp

Add documentation for ICT

Revert "Remove cdf changes for ict"

This reverts commit e2e38cb.

Fix ict reading

Address nits

make ICT only work if it is the first row in a commit

Rename and patch comments

Fix naming referring to CommitInfo

Patch up docs
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 4, 2025
Labels
breaking-change: Change that will require a version bump
merge hold: Don't allow the PR to merge
4 participants