Explicitly set Delta table props to accommodate for different defaults [databricks] #11970
Conversation
if is_databricks122_or_later():
    table_properties['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()

if use_cdf or enable_deletion_vectors:
What happens if we enable deletion_vectors and use_cdf is False, but we are not on Databricks 12.2 or later? I know that this didn't really change from the previous code; it is just confusing to me and I would like to understand better what is happening.
nit: Also, why is writer.mode("append") in a separate if clause right after this one?
We cannot enable deletion vectors before 12.2, hence all tests where enable_deletion_vectors may be True are marked
@pytest.mark.skipif(not supports_delta_lake_deletion_vectors(), reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")
re: writer.mode("append"): I did not want to introduce changes beyond the absolute minimum.
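As a rough sketch, the gating looks like this (a minimal illustration: supports_delta_lake_deletion_vectors is the helper quoted above, spark_tmp_path follows the integration-test fixture conventions, and the test body is a placeholder):

    import pytest

    @pytest.mark.skipif(not supports_delta_lake_deletion_vectors(),
                        reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")
    @pytest.mark.parametrize("enable_deletion_vectors", [True, False])
    @pytest.mark.parametrize("use_cdf", [True, False])
    def test_delta_write_table_properties(spark_tmp_path, use_cdf, enable_deletion_vectors):
        # On runtimes older than DBR 12.2 the whole test is skipped, so
        # enable_deletion_vectors=True never reaches a version that cannot honor it.
        ...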
OK I understand it now.
table_properties = {
    'delta.enableChangeDataFeed': str(use_cdf).lower(),
This is much cleaner.
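For reference, a sketch of the explicit construction under discussion, assembled from the hunks quoted above (not the verbatim PR code; is_databricks122_or_later and the test parameters are the names that appear in the diff):

    # Set the properties explicitly instead of relying on version-specific defaults.
    table_properties = {
        'delta.enableChangeDataFeed': str(use_cdf).lower(),
    }
    # 'delta.enableDeletionVectors' is only understood by newer runtimes,
    # so it is added only where it exists (DBR 12.2+ in the diff above).
    if is_databricks122_or_later():
        table_properties['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()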
Fixing DBR 12.2 failures
This appears to be related to the parametrization for use_cdf when the value is
I believe it's happening because you are still saving the dataframe right after you use SQL to create the table.
writer = writer.mode("append")
properties = ', '.join(key + ' = ' + value for key, value in table_properties.items())
sql_text += " TBLPROPERTIES ({})".format(properties)
spark.sql(sql_text)
writer.save(path)
Saving right after you have already created a table will cause a DELTA_PATH_EXISTS error to be thrown.
This is handled by the append mode
Never mind, I see that you are setting the mode to append.
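For illustration, a minimal sketch of the interaction (spark, df, path, and the schema are stand-ins, not the test's actual code): the SQL statement creates the Delta table at the path; saving with the default error-if-exists mode would then hit the DELTA_PATH_EXISTS error mentioned above, while mode("append") writes the rows into the table that was just created.

    # Create the table via SQL first, then write data into it with the DataFrame writer.
    spark.sql("CREATE TABLE delta.`{}` (a INT) USING DELTA".format(path))
    df = spark.range(10).selectExpr("CAST(id AS INT) AS a")
    # df.write.format("delta").save(path)                # default mode: fails, a Delta table already exists at path
    df.write.format("delta").mode("append").save(path)   # appends into the table created above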
Can you please coordinate with @razajafri on how we want to do this? #12048 is doing something similar, but very differently.
I think I prefer this method because it is using the front door for setting up the table properties instead of using a config that is not well documented.
But at the same time I think we need what @razajafri has with the parametrization of the tests so that we don't run with deletion vectors in situations where we don't support them and/or where Delta Lake does not support them. Unless we are planning to have multiple distinct tests for each case:
- run with deletion vectors if they are supported by the plugin and Delta Lake, and verify the result is correct
- run with deletion vectors if Delta Lake supports them but we do not, and verify that we fall back to the CPU
- run with deletion vectors disabled and verify that the result is correct
It is more complex when the test involves multiple Delta Lake operations, i.e. read + write.
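A rough sketch of that matrix as a single parametrized test (illustrative only: plugin_supports_deletion_vectors is a hypothetical predicate for what the plugin can accelerate, and the placeholder bodies stand in for the usual GPU/CPU comparison and fallback assertions):

    import pytest

    @pytest.mark.parametrize("enable_deletion_vectors", [
        False,                                   # deletion vectors disabled: verify the result is correct
        pytest.param(True, marks=pytest.mark.skipif(
            not supports_delta_lake_deletion_vectors(),
            reason="Delta Lake itself must support deletion vectors")),
    ])
    def test_delta_write_deletion_vector_matrix(spark_tmp_path, enable_deletion_vectors):
        if enable_deletion_vectors and not plugin_supports_deletion_vectors():
            # Delta Lake supports them but the plugin does not: expect a CPU fallback.
            ...
        else:
            # Either disabled, or supported end to end: run on the GPU and verify the results.
            ...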
Never mind about coordinating with @razajafri; I see you two have already done it and I am the one out of sync.
Addresses #11541
Table properties should be set unconditionally to accommodate diverging defaults in different Databricks versions
Standardize table creation to be via SQL
Improves
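As an illustration of the standardized SQL path, the statement the tests end up issuing carries the properties explicitly instead of relying on the runtime's defaults (table path, schema, and property values here are made up):

    spark.sql(
        "CREATE TABLE delta.`/tmp/delta_test_table` (a INT, b STRING) USING DELTA "
        "TBLPROPERTIES (delta.enableChangeDataFeed = false, delta.enableDeletionVectors = false)")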