I saw this example and thought it might be a better way to load data.
Right now it looks like you rely on being able to read the Parquet files directly and load the partition values from Hive-style directories. This isn't robust in two ways:

- Hive-style directories aren't guaranteed in the Delta Lake format. The Delta protocol states that "This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log."[^1]
- Deletion vectors and column mapping mean reading the Parquet files as-is won't give you the correct data, once we start supporting reader protocols 2 and 3.
In the future, it would be best not to rely on reading from the file URIs and instead read from the dataset fragments, which will provide the correct data as the Delta Protocol continues to evolve.
[^1]: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#data-files