I saw this example and thought it might be a better way to load data.
Right now it looks like you rely on being able to read the Parquet files directly and load the partition values from Hive-style directories. This isn't robust in two ways:

- Hive-style directories aren't guaranteed in the Delta Lake format. The Delta protocol states that "This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log."[^1]
- Deletion vectors and column mapping mean reading the Parquet files as-is won't give you the correct data, once we start supporting reader protocols 2 and 3.
In the future, it would be best not to rely on reading from the file URIs and instead read from the dataset fragments, which will provide the correct data as the Delta Protocol continues to evolve.
[^1]: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#data-files