Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK] Support CTAS from a table with a deleted default value #4142

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

olaky
Copy link
Contributor

@olaky olaky commented Feb 11, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Default column values are part of the schema values, also after the default value has actually been removed because otherwise data would need to be backfilled.
Default values for removed defaults are stored in the schema metadata field EXISTS_DEFAULT_COLUMN_METADATA_KEY.
In this PR we remove this field in handleCreateTableAsSelect, because CTAS will actually materialize and hence backfill all the values, so we do not need it any more on the new table.

How was this patch tested?

Added new tests

Does this PR introduce any user-facing changes?

Previously it was not possible to CTAS from a table with a removed default without setting the delta.feature.allowColumnDefaults even though no default value was actually part of the new table. this is now possible.

Comment on lines +2420 to +2421
val deltaLog = DeltaLog.forTable(spark, TableIdentifier(tableName))
val protocol = deltaLog.unsafeVolatileSnapshot.protocol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid the unsafeVolatileSnapshot use

Suggested change
val deltaLog = DeltaLog.forTable(spark, TableIdentifier(tableName))
val protocol = deltaLog.unsafeVolatileSnapshot.protocol
val (_, snapshot) = DeltaLog.forTableWithSnapshot(spark, TableIdentifier(tableName))
val protocol = snapshot.protocol

}

assert(defaultsTableFeatureEnabled("test_table"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so this EXISTS_DEFAULT_COLUMN_METADATA_KEY is applied retroactively on data IIUC. So it's not stored in the parquet files.

The semantics of CTAS are such that we want "a target table produced from reading the data of the source" which may not necessarily have the same target schema. I think we should have some data in this table to test that by removing the metadata where we are we still end up with a consistent target table.

We should also test when the column is added after data already exist via alter table, since it seems to go through a different logic. We should assert on the metadata we expect in the schema (CURRENT_DEFAULT_COLUMN_METADATA_KEY, EXISTS_DEFAULT_COLUMN_METADATA_KEY).

schema.exists(
_.metadata.contains(ResolveDefaultColumnsUtils.EXISTS_DEFAULT_COLUMN_METADATA_KEY))
}
val metadataForOp =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic could use a comment to explain why we're doing it this way. IIUC we need to preserve the metadata on the source because the defaults are applied on read retroactively in the "exists default" case? So we keep the metadata on the source and we remove it from the target table, which makes it ineligible for DeltaTableUtils.removeInternalWriterMetadata and ColumnWithDefaultExprUtils.removeDefaultExpressions.

val metadataForOp =
if (schemaContainsExistsColumnMetadata(txn.metadata.schema)) {
val newSchema = SchemaMergingUtils.transformColumns(schema) {
case (_, field: StructField, _) if field.metadata.contains(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment that default is only available as a top level field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants