-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: remove unnecessary metadata action when overwriting partition #2923
base: main
Are you sure you want to change the base?
fix: remove unnecessary metadata action when overwriting partition #2923
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add a test as follow up.
@@ -1075,7 +1075,7 @@ impl std::future::IntoFuture for WriteBuilder { | |||
actions.push(protocol.into()) | |||
} | |||
|
|||
if schema != table_schema { | |||
if try_cast_batch(schema.fields(), table_schema.fields()).is_err() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try cast batch is too lenient, you can already see it in the test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple issues that are worth digging into:
- can_cast_types seems like the wrong function to use in
try_cast_batch
considering it allows incompatible type downcasting, i.e. casting from i64 to i8. - The failing rust test
test_issue_2105
has theid
column type defined asPrimitiveType::Integer
, which based on delta spec should map to 4 bytes int (i32). But the test expects the query result to have the typeArrowDataType::Int64
. Is this expected? The original MR that introduced this test has the expected type defined asArrowDataType::Int32
, which seems more reasonable, but it got changed to i64 in later schema casting MR. - The failed python test
test_parse_stats_with_new_schema
is definitely a valid test failure that we should fix @PeterKeDer . Although it might be related to the same use ofcan_cast_types
intry_cast_batch
issue mentioned in my first point.
Description
Fixes spurious
metadata
action whenwrite_deltalake
is called with modeoverwrite
, using a predicate and with a string partition column. This is undesirable because all concurrent writes will fail and need to be retried due to thismetadata
action.This is caused by the
schema != table_schema
check.table_schema
is from callingtable.input_schema()
which converts string partition columns to dictionary and causes it to be different fromschema
.We fix this issue by comparing it using
try_cast_batch
, so the behavior becomes identical to writing withmode='append'
.To replicate (on
deltalake==0.20.1
):If we look at the latter 2 transaction JSONs, they will have a
metadata
action indicating a schema change, even though the schema is identical.