Explicitly set Delta table props to accommodate for different defaults [databricks] #11970

Merged: 11 commits, Jan 31, 2025
17 changes: 10 additions & 7 deletions integration_tests/src/main/python/delta_lake_utils.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2023, NVIDIA CORPORATION.
+# Copyright (c) 2023-2025, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -157,12 +157,15 @@ def setup_delta_dest_table(spark, path, dest_table_func, use_cdf, partition_columns
     dest_df = dest_table_func(spark)
     writer = dest_df.write.format("delta")
     ddl = schema_to_ddl(spark, dest_df.schema)
-    table_properties = {}
-    if use_cdf:
-        table_properties['delta.enableChangeDataFeed'] = 'true'
-    if enable_deletion_vectors:
-        table_properties['delta.enableDeletionVectors'] = 'true'
-    if len(table_properties) > 0:
+    table_properties = {
+        'delta.enableChangeDataFeed': str(use_cdf).lower(),
+    }

Collaborator: This is much cleaner.

+    # prevent on 11.3: Unknown configuration was specified: delta.enableDeletionVectors
+    # TODO OSS Delta 2.3+
+    if is_databricks122_or_later():
+        table_properties['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()

gerashegalov marked this conversation as resolved.
+    if use_cdf or enable_deletion_vectors:
         # if any table properties are specified then we need to use SQL to define the table

Collaborator: What happens if we enable deletion vectors but use_cdf is False and we are not on Databricks 12.2 or later? I know this did not really change from the previous code; it is just confusing to me, and I would like to understand better what is happening.

nit: Also, why is writer.mode("append") in a separate if clause right after this one?

Collaborator (Author): We cannot enable deletion vectors before DBR 12.2, hence all tests where enable_deletion_vectors may be True are marked:

@pytest.mark.skipif(not supports_delta_lake_deletion_vectors(), reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")

re: write.mode("append"): I did not want to introduce changes beyond the absolute minimum.
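
For illustration, a minimal sketch of how that marker gates a test. The import path and test body are assumptions made for the sketch; only the skipif line mirrors the one quoted above.

import pytest

# Assumption: the helper is importable from the integration tests' session utilities.
from spark_session import supports_delta_lake_deletion_vectors

# Hypothetical test showing the gating pattern; real tests take fixtures and
# parameters not shown here.
@pytest.mark.skipif(not supports_delta_lake_deletion_vectors(),
                    reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")
def test_delta_write_with_deletion_vectors():
    # Would exercise setup_delta_dest_table(..., enable_deletion_vectors=True).
    ...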

         sql_text = "CREATE TABLE delta.`{path}` ({ddl}) USING DELTA".format(path=path, ddl=ddl)
         if partition_columns:
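
For context, a plausible sketch of how the table_properties dict is folded into the CREATE TABLE statement. The hunk is truncated before that point, so the exact string formatting below is an assumption:

# Assumed continuation of setup_delta_dest_table: render the dict into a
# TBLPROPERTIES clause so the explicit values override version-dependent defaults.
props = ", ".join("'{}' = '{}'".format(k, v) for k, v in table_properties.items())
sql_text += " TBLPROPERTIES ({})".format(props)
# Example result with use_cdf=False and enable_deletion_vectors=False on DBR 12.2+:
#   CREATE TABLE delta.`/tmp/dest` (a INT) USING DELTA
#   TBLPROPERTIES ('delta.enableChangeDataFeed' = 'false', 'delta.enableDeletionVectors' = 'false')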