
Explicitly set Delta table props to accommodate for different defaults [databricks] #11970

Merged · 11 commits · Jan 31, 2025

Conversation


@gerashegalov gerashegalov commented Jan 16, 2025

Addresses #11541

Table properties should be set unconditionally to accommodate diverging defaults across Databricks versions.

Standardize table creation via SQL.

Improves the following test invocation:

env TEST_PARALLEL=0 \
    TEST_MODE=DELTA_LAKE_ONLY TESTS=delta_lake_update_test.py \
    PYSP_TEST_spark_rapids_sql_detectDeltaLogQueries=false \
    PYSP_TEST_spark_rapids_sql_format_parquet_reader_type=PERFILE \
./jenkins/databricks/test.sh 
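The idea of pinning the properties explicitly can be sketched in plain Python (a minimal sketch with a hypothetical helper name and signature; the PR's actual test utilities differ):

```python
def delta_table_sql(table_name, path, schema, use_cdf, enable_deletion_vectors=None):
    """Build a CREATE TABLE statement that pins Delta defaults explicitly.

    Hypothetical sketch: the properties are always emitted, so a
    platform-specific default can never leak into the created table.
    """
    props = {
        # Set unconditionally, regardless of the runtime's default.
        'delta.enableChangeDataFeed': str(use_cdf).lower(),
    }
    if enable_deletion_vectors is not None:
        # Only meaningful on DBR 12.2+ / Spark 3.4+.
        props['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()
    tblproperties = ', '.join("'{}' = '{}'".format(k, v) for k, v in props.items())
    return ("CREATE TABLE {} ({}) USING DELTA LOCATION '{}' "
            "TBLPROPERTIES ({})").format(table_name, schema, path, tblproperties)
```

The resulting string would be passed to `spark.sql(...)`, which is the "standardize table creation via SQL" part of the change.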

@gerashegalov self-assigned this Jan 16, 2025
@gerashegalov added the "test (Only impacts tests)" label Jan 16, 2025
@gerashegalov changed the title from "Always set props for different defaults" to "Always set props for different defaults [databricks]" Jan 16, 2025
@gerashegalov changed the title from "Always set props for different defaults [databricks]" to "Explicitly set table props to accommodate for different defaults [databricks]" Jan 16, 2025
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov requested a review from a team as a code owner January 16, 2025 08:24
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Collaborator Author

build

@gerashegalov marked this pull request as draft January 17, 2025 18:32
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov marked this pull request as ready for review January 18, 2025 08:20
@gerashegalov
Collaborator Author

build

@gerashegalov changed the title from "Explicitly set table props to accommodate for different defaults [databricks]" to "Explicitly set Delta table props to accommodate for different defaults [databricks]" Jan 21, 2025
    if is_databricks122_or_later():
        table_properties['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()

    if use_cdf or enable_deletion_vectors:
Collaborator

What happens if we enable deletion_vectors and use_cdf is False, but we are not on Databricks 12.2 or later? I know this didn't really change from the previous code; it is just confusing to me and I would like to understand better what is happening.

nit: Also, why is writer.mode("append") in a separate if clause right after this one?

Collaborator Author

We cannot enable deletion vectors before 12.2; hence all tests where enable_deletion_vectors may be True are marked

    @pytest.mark.skipif(not supports_delta_lake_deletion_vectors(), reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")

re: writer.mode("append"): I did not want to introduce changes beyond the absolute minimum.
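The interplay between the skipif marker and the version check can be summarized in a small pure-Python sketch (the function and flag names here are illustrative, not the repository's actual helpers):

```python
def build_table_properties(use_cdf, enable_deletion_vectors, on_dbr_122_or_later):
    """Sketch of how the deletion-vector property is gated.

    On pre-12.2 runtimes the property is never emitted; the pytest
    skipif marker guarantees enable_deletion_vectors is False there,
    so the two conditions can never disagree in practice.
    """
    props = {'delta.enableChangeDataFeed': str(use_cdf).lower()}
    if on_dbr_122_or_later:
        props['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()
    return props
```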

@gerashegalov requested a review from revans2 January 21, 2025 23:19
revans2 previously approved these changes Jan 22, 2025
Collaborator

@revans2 left a comment

OK I understand it now.

@gerashegalov
Collaborator Author

build

@gerashegalov
Collaborator Author

build

revans2 previously approved these changes Jan 22, 2025
Comment on lines 160 to 161:

    table_properties = {
        'delta.enableChangeDataFeed': str(use_cdf).lower(),
Collaborator

This is much cleaner.

mythrocks previously approved these changes Jan 22, 2025
razajafri previously approved these changes Jan 23, 2025
@gerashegalov
Collaborator Author

gerashegalov commented Jan 23, 2025

Fixing DBR 12.2 failures in test_delta_delete_partitions:

    pyspark.errors.exceptions.AnalysisException: Cannot write to already existent path file:/tmp/pyspark_tests/0122-204345-nozez35d-10-59-175-195-gw1-24259-1671166402/DELTA_DATA/CPU without setting OVERWRITE = 'true'.

This appears to be related to the parametrization of use_cdf when its value is False. It indicates an issue with the test on 12.2, where the default is probably true, so prior to this PR the test was executing the same code paths for either parameter value. This PR fixes exactly this kind of problem.

@gerashegalov
Collaborator Author

build

@gerashegalov dismissed stale reviews from revans2, mythrocks, and razajafri via 9bed5d0 January 29, 2025 01:17
@razajafri
Collaborator

> Fixing DBR 12.2 failures in test_delta_delete_partitions:
>
>     pyspark.errors.exceptions.AnalysisException: Cannot write to already existent path file:/tmp/pyspark_tests/0122-204345-nozez35d-10-59-175-195-gw1-24259-1671166402/DELTA_DATA/CPU without setting OVERWRITE = 'true'.
>
> This appears to be related to the parametrization of use_cdf when its value is False.

I believe it's happening because you are still saving the DataFrame right after using SQL to create the table.

    writer = writer.mode("append")
    properties = ', '.join(key + ' = ' + value for key, value in table_properties.items())
    sql_text += " TBLPROPERTIES ({})".format(properties)
    spark.sql(sql_text)
    writer.save(path)
Collaborator

Saving right after you have already created a table will cause a DELTA_PATH_EXISTS error to be thrown

Collaborator Author

This is handled by the append mode

Collaborator

Nevermind, I see that you are setting the mode to append

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Collaborator Author

build

Collaborator

@revans2 left a comment

Can you please coordinate with @razajafri on how we want to do this? #12048 is doing something similar, but very differently.

I think I prefer this method because it uses the front door for setting the table properties instead of a config that is not well documented.

But at the same time I think we need what @razajafri has with the parametrization of the tests, so that we don't run with deletion vectors in situations where we don't support them and/or where Delta does not support them. Unless we are planning to have multiple distinct tests for each case:

  1. run with deletion vectors if they are supported by the plugin and Delta Lake, and verify the result is correct
  2. run with deletion vectors if Delta Lake supports them but we do not, and verify that we fall back to the CPU
  3. run with deletion vectors disabled and verify that the result is correct

It is more complex when the test involves multiple Delta Lake operations, i.e. read + write.
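The three cases above could be sketched as a small decision table (a pure-Python sketch with hypothetical flag names, not the repository's actual test code):

```python
def expected_behavior(plugin_supports_dv, delta_supports_dv, dv_enabled):
    """Hypothetical mapping from support flags to the expected test outcome.

    Mirrors the three cases in the review comment; 'skip' covers runtimes
    where such a table cannot even be created.
    """
    if not dv_enabled:
        return 'run_on_gpu'        # case 3: DV disabled, verify plain GPU result
    if not delta_supports_dv:
        return 'skip'              # Delta cannot create a DV-enabled table
    if not plugin_supports_dv:
        return 'fallback_to_cpu'   # case 2: verify the CPU fallback
    return 'run_on_gpu'            # case 1: verify the GPU result
```

A pytest parametrization over all (plugin, delta, enabled) combinations would then drive the appropriate assertion per case.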

Collaborator

@revans2 left a comment

Never mind about coordinating with @razajafri; I see you two have already done it and I am the one out of sync.

@gerashegalov merged commit 73765bd into NVIDIA:branch-25.02 Jan 31, 2025
51 of 52 checks passed
@gerashegalov deleted the deta_props branch January 31, 2025 18:01