feat: enable file merging by last modification time using preserve-insertion-order #3157

Open
wants to merge 2 commits into
base: main

esarili
Contributor

@esarili esarili commented Jan 24, 2025

This change leverages the previously unused preserve-insertion-order configuration to enable merging files sorted by their last modification time during compaction. This is particularly beneficial for append-only workloads, improving data locality after optimize runs by merging files that were created around similar times.
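For illustration only, here is a minimal sketch of the idea: sort candidate files by modification time, then fill merge bins greedily up to a target size. The `FileMeta` struct, the `plan_bins` function, and the ascending sort direction are assumptions made for this example; they are not the delta-rs types or the exact logic in `optimize.rs`.

```rust
/// Hypothetical stand-in for a data file entry; not the delta-rs type.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    /// File size in bytes.
    size: u64,
    /// Last modification time in milliseconds since the Unix epoch.
    modification_time: i64,
}

/// Groups files into merge bins of roughly `target_size` bytes.
/// When `preserve_insertion_order` is true, files are first sorted by
/// modification time so each bin holds files written close together.
fn plan_bins(
    mut files: Vec<FileMeta>,
    target_size: u64,
    preserve_insertion_order: bool,
) -> Vec<Vec<FileMeta>> {
    if preserve_insertion_order {
        // Stable sort: files with equal timestamps keep their input order.
        files.sort_by_key(|f| f.modification_time);
    }

    let mut bins: Vec<Vec<FileMeta>> = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0u64;

    for file in files {
        // Close the current bin once adding this file would exceed the target.
        if !current.is_empty() && current_size + file.size > target_size {
            bins.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size;
        current.push(file);
    }
    if !current.is_empty() {
        bins.push(current);
    }
    bins
}
```

With the flag set, each bin covers a contiguous window of write times, which is where the locality benefit for append-only tables would come from. A file larger than `target_size` simply ends up alone in its own bin in this sketch.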

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 24, 2025

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@esarili esarili changed the title Enable file merging by last modification time using preserve-insertion-order feat: enable file merging by last modification time using preserve-insertion-order Jan 24, 2025
…sertion-order

This change leverages the previously unused `preserve-insertion-order`
configuration to enable merging files sorted by their last modification
time during compaction. This is particularly beneficial for append-only
workloads, improving data locality after optimize runs by merging files
that were created around similar times.

Signed-off-by: esarili <[email protected]>
@hntd187
Collaborator

hntd187 commented Jan 24, 2025

@esarili can you sign off the commits? There should be directions in the failing CI check below. https://github.com/delta-io/delta-rs/pull/3157/checks?check_run_id=36132013214

@esarili esarili force-pushed the preserve-insertion-order branch from a03f097 to f4cfa9f Compare January 24, 2025 16:08

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 54.54545% with 5 lines in your changes missing coverage. Please review.

Project coverage is 71.75%. Comparing base (f67e828) to head (f4cfa9f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/operations/optimize.rs | 54.54% | 2 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3157   +/-   ##
=======================================
  Coverage   71.74%   71.75%           
=======================================
  Files         138      138           
  Lines       44362    44369    +7     
  Branches    44362    44369    +7     
=======================================
+ Hits        31828    31836    +8     
- Misses      10496    10497    +1     
+ Partials     2038     2036    -2     


Comment on lines 880 to +881
writer_properties: WriterProperties,
preserve_insertion_order: bool,
Member

I'm curious what you think about this going into WriterProperties rather than all these functions growing an additional argument.

Does this only benefit call paths for the create_merge_plan flow?

Collaborator

The writer properties are used in other operations, so we should only add it there if it makes sense for those ops as well.

@roeap
Collaborator

roeap commented Jan 24, 2025

As it so happens, @hntd187 and I just had a quick discussion about using the file updated times more generally. It turns out there are at least some scenarios where this may not do what one would expect.

For the use case at hand, I wonder if this could also be done using "single value z-order" which should just degrade to sorting for a single value while also targeting a specific file size.

This of course assumes that some value in the data correlates with when the data was inserted. If that were not the case, there would not be much value in preserving the order either, as query engines could not leverage this information.
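To make the "single value z-order" point concrete, here is a small sketch of bit interleaving, the usual way a Z-order key is built. The `z_order_key` function is hypothetical and written only for this example; the z-order implementation in delta-rs works differently in detail. It shows that with a single column the key equals the value itself, so ordering by the key is plain sorting.

```rust
/// Interleaves the bits of each column value into one Z-order key.
/// Bit `i` of column `j` lands at bit `i * k + j` of the key,
/// where `k` is the number of columns.
fn z_order_key(columns: &[u16]) -> u64 {
    let k = columns.len();
    // 4 columns of 16 bits each is the most that fits in a u64 key.
    assert!(k >= 1 && k <= 4, "sketch supports 1 to 4 columns");

    let mut key = 0u64;
    for (j, &value) in columns.iter().enumerate() {
        for i in 0..16 {
            let bit = ((value >> i) & 1) as u64;
            key |= bit << (i * k + j);
        }
    }
    key
}
```

For one column, `k` is 1 and `j` is always 0, so every bit stays in its original position. With two or more columns the bits alternate, which is what spreads locality across dimensions.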

@esarili
Contributor Author

esarili commented Jan 25, 2025

> as it so happens, @hntd187 and I just had a quick discussion around using the file updated times more generally. Turns out there are at least some scenarios where this may not do what one would expect it to.
>
> For the use case at hand, I wonder if this could also be done using "single value z-order" which should just degrade to sorting for a single value while also targeting a specific file size.
>
> This of course assumes that there is some value in the data that somehow correlates when the data was inserted. IF that were not the case, there would also not be much value though in preserving the order as query engines could not leverage this information.

We run the optimize command at regular intervals (every two hours, on the last two partitions, where data is partitioned by day). AFAIU, z-order runs on the entire partition and sorts data at the record level, which might be resource intensive and unnecessary for append-only workflows. By using file update times, we hope to avoid sorting the entire partition record by record while still retaining some locality after optimize runs.

Labels
binding/rust Issues for the Rust crate

5 participants