feat: enable file merging by last modification time using preserve-insertion-order #3157
base: main
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
…sertion-order

This change leverages the previously unused `preserve-insertion-order` configuration to enable merging files sorted by their last modification time during compaction. This is particularly beneficial for append-only workloads, improving data locality after optimize runs by merging files that were created around similar times.

Signed-off-by: esarili <[email protected]>
@esarili can you sign off the commits? There should be directions in the failing CI check below. https://github.com/delta-io/delta-rs/pull/3157/checks?check_run_id=36132013214
a03f097 to f4cfa9f
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3157   +/-   ##
=======================================
  Coverage   71.74%   71.75%
=======================================
  Files         138      138
  Lines       44362    44369    +7
  Branches    44362    44369    +7
=======================================
+ Hits        31828    31836    +8
- Misses      10496    10497    +1
+ Partials     2038     2036    -2
writer_properties: WriterProperties,
preserve_insertion_order: bool,
I'm curious what you think about this going into WriterProperties rather than all these functions growing an additional argument.
Does this only benefit call paths for the create_merge_plan flow?
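To make the trade-off in this question concrete, here is a minimal sketch of the two shapes being compared. Every name in it is hypothetical: `SharedWriterOptions` is only a stand-in for a properties type shared by several operations, not the real `WriterProperties`, and the functions are not delta-rs functions.

```rust
/// Hypothetical stand-in for a properties struct that several operations share.
/// This is NOT the real `WriterProperties` type.
#[derive(Debug, Clone, Default, PartialEq)]
struct SharedWriterOptions {
    preserve_insertion_order: bool,
}

/// Shape A: the flag is threaded through as an extra argument.
/// Only the call paths that need it grow a parameter.
fn order_for_merge_with_arg(mut mtimes: Vec<i64>, preserve_insertion_order: bool) -> Vec<i64> {
    if preserve_insertion_order {
        // Oldest file first.
        mtimes.sort();
    }
    mtimes
}

/// Shape B: the flag rides along in the shared options struct.
/// Signatures stay stable, but every operation that takes the struct now sees the flag.
fn order_for_merge_with_options(mtimes: Vec<i64>, options: &SharedWriterOptions) -> Vec<i64> {
    order_for_merge_with_arg(mtimes, options.preserve_insertion_order)
}
```

Shape B keeps function signatures unchanged, at the cost of exposing a compaction-specific flag to every operation that accepts the shared struct.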
The writer properties are used in other operations, so we should only add it there if it makes sense for those ops as well.
As it so happens, @hntd187 and I just had a quick discussion about using the file updated times more generally. It turns out there are at least some scenarios where this may not do what one would expect. For the use case at hand, I wonder if this could also be done using "single value z-order", which should just degrade to sorting for a single value while also targeting a specific file size. This of course assumes that there is some value in the data that correlates with when the data was inserted. If that were not the case, there would not be much value in preserving the order either, as query engines could not leverage this information.
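As a rough illustration of what "degrading to sorting" means here, the sketch below sorts every record by a single column and then cuts the sorted run into files. It is not the delta-rs z-order implementation. The `Record` type, the `event_time` column, and the use of a row count in place of a target byte size are all assumptions made for the example.

```rust
/// Hypothetical record type; `event_time` stands in for a column that
/// correlates with when the row was inserted.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    event_time: i64,
    payload: String,
}

/// Sorts all records by one column, then cuts the sorted run into
/// files of at most `rows_per_file` rows (a stand-in for a target file size).
fn sort_and_split(mut records: Vec<Record>, rows_per_file: usize) -> Vec<Vec<Record>> {
    assert!(rows_per_file > 0, "rows_per_file must be positive");
    // Record-level sort: the cost scales with the number of rows in the partition.
    records.sort_by_key(|r| r.event_time);
    records
        .chunks(rows_per_file)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Each output file then covers a contiguous range of `event_time`, which is the property a query engine can use for file skipping.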
We run the optimize command at regular intervals (every two hours, on the last two partitions, where data is partitioned by day). AFAIU, z-order runs on the entire partition and sorts data at the record level, which might be resource intensive and might not be necessary for append-only workflows. By using file update times, we are hoping to avoid sorting the entire partition record by record while still keeping some locality after optimize runs.
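The file-level alternative described in this comment can be sketched as follows. This is a simplified illustration, not the code in this PR: `FileMeta`, `plan_bins`, and the greedy bin-filling rule are assumptions, and the real merge plan may group files differently.

```rust
/// Hypothetical stand-in for a file entry in the table log (not the delta-rs type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    size: u64,
    /// Last modification time, e.g. milliseconds since the Unix epoch.
    modification_time: i64,
}

/// Groups files into merge bins of at most `target_size` bytes.
/// With `preserve_insertion_order`, files are first ordered by modification
/// time, so each bin holds files that were written close together in time.
fn plan_bins(
    mut files: Vec<FileMeta>,
    target_size: u64,
    preserve_insertion_order: bool,
) -> Vec<Vec<FileMeta>> {
    if preserve_insertion_order {
        // File-level sort only: the cost scales with the number of files, not rows.
        files.sort_by_key(|f| f.modification_time);
    }
    let mut bins: Vec<Vec<FileMeta>> = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size: u64 = 0;
    for file in files {
        // Close the current bin when adding this file would exceed the target.
        if !current.is_empty() && current_size + file.size > target_size {
            bins.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size;
        current.push(file);
    }
    if !current.is_empty() {
        bins.push(current);
    }
    bins
}
```

No rows are reordered inside a bin. The locality gained is only as good as the correlation between a file's modification time and the values it contains, which is the caveat raised earlier in the thread.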
This change leverages the previously unused `preserve-insertion-order` configuration to enable merging files sorted by their last modification time during compaction. This is particularly beneficial for append-only workloads, improving data locality after optimize runs by merging files that were created around similar times.