Skip to content

Delta Lake 3.3.0

Latest
Compare
Choose a tag to compare
@allisonport-db allisonport-db released this 06 Jan 22:04
· 52 commits to master since this release

We are excited to announce the release of Delta Lake 3.3.0! This release includes several exciting new features.

Highlights

Details by each component.

Delta Spark

Delta Spark 3.3.0 is built on Apache Spark™ 3.5.3. Similarly to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

The key features of this release are:

  • Support for Identity Column: Delta Lake identity columns are a type of generated column that automatically assigns unique values to each record inserted into a table. Users do not need to explicitly provide values for these columns during data insertion. They offer a straightforward and efficient mechanism to generate unique keys for table rows, combining ease of use with high performance. See the documentation for more information.
  • Support VACUUM LITE to deliver faster VACUUM for periodically run VACUUM commands. When running VACUUM in LITE mode, instead of finding all files in the table directory, VACUUM LITE uses the Delta transaction log to identify and remove files no longer referenced by any table versions within the retention duration.
  • Support for Row Tracking Backfill:Row Tracking feature can now be used on existing Delta Lake tables to track row-level lineage in Delta Spark, previously it was only possible for new tables. Users can now use ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = true) syntax to alter an existing table to enable Row Tracking. When enabled, users can identify rows across multiple versions of the table and can access this tracking information using the two metadata fields _metadata.row_id and _metadata.row_commit_version. Refer to the documentation on Row Tracking for more information and examples.
  • Delta Lake now generates version checksums for each table commit, providing stronger consistency guarantees and improved debugging capabilities. It tracks detailed table metrics including file counts, table size, data distribution histograms, etc. This enables automatic detection of potential state inconsistencies and helps maintain table integrity in distributed environments. The state validation is performed on every checkpoint. The Checksum is also used to bypass the initial Spark query that retrieves the Protocol and Metadata actions, resulting in a decreased snapshot initialization latency.
  • Liquid clustering updates:
    • Support OPTIMIZE FULL to fully recluster a Liquid table. This command optimizes all records in a table that uses liquid clustering, including data that might have previously been clustered.
    • Support enabling liquid clustering on an existing unpartitioned Delta table using ALTER TABLE <table_name> CLUSTER BY (<clustering_columns>). Previously, liquid clustering could only be enabled upon table creation.
    • Support creating clustered table from an external location
  • The In-Commit Timestamp table feature is no longer in preview When enabled, this feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. With this, time travel queries yield consistent results, even if the table directory is relocated. This feature was available as a preview feature in Delta 3.2 and is now generally available in Delta 3.3. See the documentation for more information.

Other notable changes include:

  • Protocol upgrade/downgrade improvements
    • Support dropping table features for columnMapping, vacuumProtocolCheck, and checkConstraints.
    • Improve table protocol transitions to simplify the CUJ when altering the table protocol.
    • Support protocol version downgrades when the existing table features exist in the lower protocol version.
    • Update protocol upgrades behavior such that when enabling a legacy feature via a table property (e.g. setting delta.enableChangeDataFeed=true) the protocol is upgraded to (1,7) and only the legacy feature is enabled. Previously the minimum protocol version would be selected and all preceding legacy features enabled.
    • Support enabling a table feature on a table using the Python DeltaTable API with deltaTable.addFeatureSupport(...).
  • Type-widening improvements
    • Support automatic type widening in Delta Sink when type widening is enabled on the table and schema evolution is enabled on the sink.
    • Support type widening on nested fields when other nested fields in the same struct are referenced by check constraints or generated column expressions.
    • Fix type-widening operation validation for map, array or struct columns used in generated column expressions or check constraints.
    • Fix to directly read the file schema from the parquet footers when identifying the files to be rewritten when dropping the type widening table feature.
    • Fix using type widening on a table containing a char/varchar column.
  • Liquid clustering improvements
    • Fix liquid clustering to automatically fall back to Z-order clustering when clustering on a single column. Previously, any attempts to optimize the table would fail.
    • Support RESTORE on clustered tables. Previously, RESTORE operations would not restore clustering metadata.
    • Support SHOW TBLPROPERTIES for clustered tables.
    • Support for partition-like data skipping filters (preview): When enabled by setting spark.databricks.delta.skipping.partitionLikeFilters.enabled, applies arbitrary data skipping filters referencing Liquid clustering columns to files with the same min and max values on clustering columns. This may decrease the files scanned for selective queries on large Liquid tables.
  • Performance improvements
    • Improve the performance of finding the last complete checkpoint with more efficient file listing.
    • Pushdown query filters when reading CDF so the filters can be used for partition pruning and row group skipping.
    • Fix streaming CDF queries to not read log entries beyond endOffset for reduced processing time.
    • Optimize REORG TABLE to remove dropped columns from the parquet files.
    • Optimize MERGE resolution by batching the resolution of target column names and the resolution of assignment expressions to improve analysis performance on wide tables.
  • Miscellaneous enhancements
    • Support the CLONE operation in the Scala DeltaTable API.
    • Support running optimize operations in batches by configuring the config spark.databricks.delta.optimize.batchSize.
    • Support specifying a struct column that contains nested columns of unsupported stats types in delta.dataSkippingStatsColumns. Previously this would throw an exception.
    • Support schema evolution for map or array root types.
    • Support schema evolution for struct types nested within a map.
    • Support non-deterministic expressions in the updated/inserted clauses of MERGE.
    • Support reading the _commit_timestamp column using In Commit Timestamps in CDF reads when ICT is enabled on a table.
    • Write timestamp partition values in UTC. This prevents ambiguous behavior when reading timestamp partition columns that do not store timezone information.
  • Miscellaneous bug fixes
    • Support CREATE TABLE LIKE with user provided properties. Previously any properties that were provided in the SQL command were ignored and only the properties from the source table were used.
    • Fix a bug where providing a query filter that compares two Literal expressions would cause an infinite loop when constructing data skipping filters.
    • Fix In-Commit Timestamps to use clock.currentTimeMillis() instead of System.nanoTime() for large commits since some systems return a very small number when System.nanoTime() is called.
    • Support concurrent CREATE TABLE IF NOT EXISTS commands. Previously the second command would fail with “TABLE_ALREADY_EXISTS” error.
    • Fix to correctly read the _commit_timestamp column in Change Data Feed when reading from timezones other than UTC.
    • Improve the error messages when a by-path Delta table cannot be accessed by forwarding encountered error messages instead of always throwing a “missing delta table” exception.
    • Fix inconsistencies with special characters in the min/max file statistics.
    • Fix the Scala/Python DeltaTable merge API to execute using the Spark DataFrame API instead of completely bypassing Spark’s analysis. This allows the merge operation to be visible in the QueryExecutionListener.
    • Fix an issue with CDF queries failing when using Spark Connect.
    • Fix a bug in the conflict resolution for OPTIMIZE when a concurrent transaction adds deletion vectors.

Delta Universal Format (UniForm)

You can now enable UniForm Iceberg on existing Delta tables without rewriting data files. You can then seamlessly read the table downstream in Iceberg clients such as Spark and Snowflake. See Enable by altering an existing table.

Other notable changes include:

  • Support Timestamp-type partition columns for UniForm Iceberg.
  • Support automatically running expireSnapshot on the UniForm Iceberg table to cleanup old manifests whenever OPTIMIZE is run on the Delta table.
  • Support retrying for the Delta UniForm Iceberg conversion.
  • Support list and map data types for UniForm Hudi.
  • Miscellaneous bug fixes
    • Fix a race condition in UniForm Iceberg.
    • Fix UniForm Iceberg to skip committing for commits without data change.
    • Fix UniForm Hudi to work for tables with struct columns.
    • Fix a bug in the Delta to Hudi conversion when the number of actions to convert is greater than the action batch size.

Delta Kernel

The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write to Delta tables without the need to understand the Delta protocol details.

This release of Delta Kernel Java contains the following changes:

  • Delta Kernel Java and Rust now support reading Delta tables that have Type Widening enabled. The default ParquetHandlers provided by both Delta kernel implementations include support for reading tables that had any of the type changes covered by the feature applied.
  • Support cleaning up expired log files as part of checkpointing.
  • Support data skipping on timestamp and timestamp_ntz type columns.
  • Support writing to tables with the inCommitTimestamp table feature enabled.
  • Other notable read-side changes
    • Support reading tables with schemas with long type field metadata.
    • Support pushing down predicates IS NULL and IS NOT NULL into the default parquet reader to prune the row groups read.
    • Fix a bug with reading tables with spaces in the table path.
    • Add a new utility function to determine if a given partition exists (i.e. actually contains data) in a snapshot.
    • Support retrieving the timestamp of the latest commit in a snapshot using Snapshot.getTimestamp.
    • Support getting the partition columns of the table for a snapshot.
    • Support legacy Parquet schemas for map type and array type in the DefaultParquetHandler.
    • Support reading timestamp_ntz columns that are stored in parquet as INT96.
  • Other notable write-side changes
    • Write timestamps using the INT64 physical format in Parquet in the DefaultParquetHandler. Previously they were written as INT96 which is an outdated and deprecated format for timestamps.
    • Fix a bug when writing decimal type data as binary in the DefaultParquetHandler.
    • Fix to correctly escape special characters in the partition values when constructing the partition paths during writes.
  • Expression support changes and fixes
    • Lazily evaluate comparator expressions in the DefaultExpressionHandler. Previously expressions would be eagerly evaluated for every row in the underlying vectors.
    • Support SQL expression LIKE for comparing strings.
    • Support SQL expression IS NOT DISTINCT FROM (aka null safe equals).
    • Fix string comparison in the DefaultExpressionHandler to use UTF8.
    • Fix binary comparisons to use unsigned comparison in the DefaultExpressionHandler.
  • Developer-facing improvements

Other projects

Delta Sharing Spark

Delta Storage

  • Fix an issue where the S3DynamoDBLogStore (used for safe, concurrent multi-cluster writes to S3) would make extraneous GET calls to DynamoDB during Delta VACUUM operations, impacting performance.

Delta Flink

  • Support partition columns of Date type in the Delta Sink.

Credits

Abhishek Radhakrishnan, Adam Binford, Alden Lau, Aleksei Shishkin, Alexey Shishkin, Allison Portis, Ami Oka, Amogh Jahagirdar, Andreas Chatzistergiou, Andrew Xue, Anish, Annie Wang, Avril Aysha, Bart Samwel, Burak Yavuz, Carmen Kwan, Charlene Lyu, ChengJi-db, Chirag Singh, Christos Stavrakakis, Cuong Nguyen, Dhruv Arya, Eduard Tudenhoefner, Felipe Pessoto, Fokko Driesprong, Fred Storage Liu, Hao Jiang, Hyukjin Kwon, Jacek Laskowski, Jackie Zhang, Jade Wang, James DeLoye, Jiaheng Tang, Jintao Shen, Johan Lasperas, Juliusz Sompolski, Jun, Jungtaek Lim, Kaiqi Jin, Kam Cheung Ting, Krishnan Paranji Ravi, Lars Kroll, Leon Windheuser, Lin Zhou, Liwen Sun, Lukas Rupprecht, Marko Ilić, Matt Braymer-Hayes, Maxim Gekk, Michael Zhang, Ming DAI, Mingkang Li, Nils Andre, Ole Sasse, Paddy Xu, Prakhar Jain, Qianru Lao, Qiyuan Dong, Rahul Shivu Mahadev, Rajesh Parangi, Rakesh Veeramacheneni, Richard Chen, Richard-code-gig, Robert Dillitz, Robin Moffatt, Ryan Johnson, Sabir Akhadov, Scott Sandre, Sergiu Pocol, Shawn Chang, Shixiong Zhu, Sumeet Varma, Tai Le Manh, Taiga Matsumoto, Tathagata Das, Thang Long Vu, Tom van Bussel, Tulio Cavalcanti, Venki Korukanti, Vishwas Modhera, Wenchen Fan, Yan Zhao, YotillaAntoni, Yumingxuan Guo, Yuya Ebihara, Zhipeng Mao, Zihao Xu, zzl-7