
Explicitly set Delta table props to accommodate for different defaults [databricks] #11970

Merged · 11 commits · Jan 31, 2025

Conversation


@gerashegalov gerashegalov commented Jan 16, 2025

Addresses #11541

Table properties should be set unconditionally to accommodate diverging defaults across Databricks versions.

Standardize table creation via SQL.

Improves the following test invocation:

env TEST_PARALLEL=0 \
    TEST_MODE=DELTA_LAKE_ONLY TESTS=delta_lake_update_test.py \
    PYSP_TEST_spark_rapids_sql_detectDeltaLogQueries=false \
    PYSP_TEST_spark_rapids_sql_format_parquet_reader_type=PERFILE \
./jenkins/databricks/test.sh 
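The idea of pinning the properties explicitly can be sketched in plain Python (a minimal sketch with a hypothetical helper name and signature; the PR's actual test utilities differ):

```python
def delta_table_sql(table_name, path, schema, use_cdf, enable_deletion_vectors=None):
    """Build a CREATE TABLE statement that pins Delta defaults explicitly.

    Hypothetical sketch: the properties are always emitted, so a
    platform-specific default can never leak into the created table.
    """
    props = {
        # Set unconditionally, regardless of the runtime's default.
        'delta.enableChangeDataFeed': str(use_cdf).lower(),
    }
    if enable_deletion_vectors is not None:
        # Only meaningful on DBR 12.2+ / Spark 3.4+.
        props['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()
    tblproperties = ', '.join("'{}' = '{}'".format(k, v) for k, v in props.items())
    return ("CREATE TABLE {} ({}) USING DELTA LOCATION '{}' "
            "TBLPROPERTIES ({})").format(table_name, schema, path, tblproperties)
```

The resulting string would be passed to `spark.sql(...)`, which is the "standardize table creation via SQL" part of the change.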

@gerashegalov self-assigned this Jan 16, 2025
@gerashegalov added the "test (Only impacts tests)" label Jan 16, 2025
@gerashegalov changed the title from "Always set props for different defaults" to "Always set props for different defaults [databricks]" Jan 16, 2025
@gerashegalov changed the title from "Always set props for different defaults [databricks]" to "Explicitly set table props to accommodate for different defaults [databricks]" Jan 16, 2025
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov requested a review from a team as a code owner January 16, 2025 08:24
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Collaborator Author

build

@gerashegalov marked this pull request as draft January 17, 2025 18:32
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov marked this pull request as ready for review January 18, 2025 08:20
@gerashegalov
Collaborator Author

build

@gerashegalov changed the title from "Explicitly set table props to accommodate for different defaults [databricks]" to "Explicitly set Delta table props to accommodate for different defaults [databricks]" Jan 21, 2025
    if is_databricks122_or_later():
        table_properties['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()

    if use_cdf or enable_deletion_vectors:
Collaborator

What happens if we enable deletion_vectors and use_cdf is False, but we are not on Databricks 12.2 or later? I know this didn't really change from the previous code; it is just confusing to me and I would like to understand better what is happening.

nit: Also, why is writer.mode("append") in a separate if clause right after this one?

Collaborator Author

We cannot enable deletion vectors before 12.2; hence all tests where enable_deletion_vectors may be True are marked

    @pytest.mark.skipif(not supports_delta_lake_deletion_vectors(), reason="Deletion vectors are new in Spark 3.4.0 / DBR 12.2")

re: writer.mode("append"): I did not want to introduce changes beyond the absolute minimum.
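The interplay between the skipif marker and the version check can be summarized in a small pure-Python sketch (the function and flag names here are illustrative, not the repository's actual helpers):

```python
def build_table_properties(use_cdf, enable_deletion_vectors, on_dbr_122_or_later):
    """Sketch of how the deletion-vector property is gated.

    On pre-12.2 runtimes the property is never emitted; the pytest
    skipif marker guarantees enable_deletion_vectors is False there,
    so the two conditions can never disagree in practice.
    """
    props = {'delta.enableChangeDataFeed': str(use_cdf).lower()}
    if on_dbr_122_or_later:
        props['delta.enableDeletionVectors'] = str(enable_deletion_vectors).lower()
    return props
```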

@gerashegalov requested a review from revans2 January 21, 2025 23:19
revans2 previously approved these changes Jan 22, 2025
Collaborator

@revans2 left a comment

OK I understand it now.

@gerashegalov
Collaborator Author

build

@gerashegalov
Collaborator Author

build

revans2 previously approved these changes Jan 22, 2025
Comment on lines 160 to 161:

    table_properties = {
        'delta.enableChangeDataFeed': str(use_cdf).lower(),
Collaborator

This is much cleaner.

mythrocks previously approved these changes Jan 22, 2025
razajafri previously approved these changes Jan 23, 2025
@gerashegalov
Collaborator Author

gerashegalov commented Jan 23, 2025

Fixing DBR 12.2 failures in test_delta_delete_partitions:

    pyspark.errors.exceptions.AnalysisException: Cannot write to already existent path file:/tmp/pyspark_tests/0122-204345-nozez35d-10-59-175-195-gw1-24259-1671166402/DELTA_DATA/CPU without setting OVERWRITE = 'true'.

This appears to be related to the parametrization of use_cdf when its value is False. It indicates an issue with the test on 12.2, where the default is probably true, so prior to this PR the test was executing the same code paths for either parameter value. This PR fixes exactly this kind of problem.

@gerashegalov
Collaborator Author

build

@gerashegalov dismissed stale reviews from revans2, mythrocks, and razajafri via 9bed5d0 January 29, 2025 01:17
@razajafri
Collaborator

> Fixing DBR 12.2 failures in test_delta_delete_partitions:
>
>     pyspark.errors.exceptions.AnalysisException: Cannot write to already existent path file:/tmp/pyspark_tests/0122-204345-nozez35d-10-59-175-195-gw1-24259-1671166402/DELTA_DATA/CPU without setting OVERWRITE = 'true'.
>
> This appears to be related to the parametrization of use_cdf when its value is False.

I believe it's happening because you are still saving the DataFrame right after using SQL to create the table.

    writer = writer.mode("append")
    properties = ', '.join(key + ' = ' + value for key, value in table_properties.items())
    sql_text += " TBLPROPERTIES ({})".format(properties)
    spark.sql(sql_text)
    writer.save(path)
Collaborator

Saving right after you have already created a table will cause a DELTA_PATH_EXISTS error to be thrown

Collaborator Author

This is handled by the append mode

Collaborator

Nevermind, I see that you are setting the mode to append

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Collaborator Author

build

Collaborator

@revans2 left a comment

Can you please coordinate with @razajafri on how we want to do this? #12048 is doing something similar, but very differently.

I think I prefer this method because it uses the front door for setting the table properties instead of a config that is not well documented.

But at the same time I think we need what @razajafri has with the parametrization of the tests, so that we don't run with deletion vectors in situations where we don't support them and/or where Delta does not support them. Unless we are planning to have multiple distinct tests for each case:

  1. run with deletion vectors if they are supported by the plugin and Delta Lake, and verify the result is correct
  2. run with deletion vectors if Delta Lake supports them but we do not, and verify that we fall back to the CPU
  3. run with deletion vectors disabled and verify that the result is correct

It is more complex when the test involves multiple Delta Lake operations, i.e. read + write.
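The three cases above could be sketched as a small decision table (a pure-Python sketch with hypothetical flag names, not the repository's actual test code):

```python
def expected_behavior(plugin_supports_dv, delta_supports_dv, dv_enabled):
    """Hypothetical mapping from support flags to the expected test outcome.

    Mirrors the three cases in the review comment; 'skip' covers runtimes
    where such a table cannot even be created.
    """
    if not dv_enabled:
        return 'run_on_gpu'        # case 3: DV disabled, verify plain GPU result
    if not delta_supports_dv:
        return 'skip'              # Delta cannot create a DV-enabled table
    if not plugin_supports_dv:
        return 'fallback_to_cpu'   # case 2: verify the CPU fallback
    return 'run_on_gpu'            # case 1: verify the GPU result
```

A pytest parametrization over all (plugin, delta, enabled) combinations would then drive the appropriate assertion per case.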

Collaborator

@revans2 left a comment

Never mind about coordinating with @razajafri; I see you two have already done it and I am the one out of sync.

@gerashegalov merged commit 73765bd into NVIDIA:branch-25.02 Jan 31, 2025
51 of 52 checks passed
@gerashegalov deleted the deta_props branch January 31, 2025 18:01