
datafusion-python integration #3334

Open

westonpace opened this issue Jan 3, 2025 · 5 comments

Comments

@westonpace (Contributor)

The datafusion-python project recently added support for "foreign table providers" in apache/datafusion-python#921.

We should be able to use this to create a foreign table provider from Lance. That would make it easy to query Lance datasets from Python, comparable to our DuckDB integration.

@chenkovsky (Contributor)

I have a question: how do we expose _rowid and _rowaddr? It seems the DataFusion API and DuckDB don't support these pseudo-columns.

@westonpace (Contributor, Author)

For the DuckDB integration you can create a dataset with default scan options. Unfortunately you can't filter on the column yet, because pyarrow and DataFusion interpret unsigned integers slightly differently in the filtering language (Substrait), so a DataFusion change is needed first.

import duckdb
import lance
import pyarrow as pa

def test_duckdb_filter_on_rowid(tmp_path):
    tab = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
    lance.write_dataset(tab, str(tmp_path))
    # Reopen with _rowid exposed by default so DuckDB can see it
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    row_ids = ds.scanner(columns=[], with_row_id=True).to_table().column(0).to_pylist()
    # Note: no WHERE clause -- filtering on _rowid doesn't work yet (see above)
    actual = duckdb.query("SELECT _rowid FROM ds").fetch_arrow_table()
    assert actual.column(0).to_pylist() == row_ids
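As an aside on the Substrait mismatch mentioned above: this is a generic illustration of why signed/unsigned disagreement breaks filtering, not the exact DataFusion/pyarrow bug. `_rowid` is an unsigned 64-bit value; if one side of the pipeline reinterprets its bits as a signed i64, large row ids flip negative and comparisons go wrong:

```python
# Illustration only (not the exact bug): reinterpreting a u64 row id's
# bits as a signed i64 (two's complement) flips large values negative,
# so range comparisons on the column no longer agree across systems.
def as_signed_i64(u: int) -> int:
    """Reinterpret an unsigned 64-bit integer's bit pattern as signed."""
    return u - (1 << 64) if u >= (1 << 63) else u

large_rowid = (1 << 63) + 5           # a valid u64 row id
misread = as_signed_i64(large_rowid)  # negative when misread as i64
```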

For datafusion you choose whether you want these columns to appear when you create the table provider:

impl LanceTableProvider {
    fn new(dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool) -> Self {
        ...
    }
    ...
}
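A minimal sketch (pure Python, hypothetical names and types) of what those constructor flags would imply for the schema the provider reports to DataFusion:

```python
# Sketch (hypothetical): how with_row_id / with_row_addr flags could
# extend the schema a Lance table provider exposes. Base columns match
# the earlier example table; both pseudo-columns are u64 in Lance.
BASE_SCHEMA = [("a", "int64"), ("b", "int64")]

def provider_schema(with_row_id=False, with_row_addr=False):
    """Return the (name, type) pairs the provider would advertise."""
    schema = list(BASE_SCHEMA)
    if with_row_id:
        schema.append(("_rowid", "uint64"))
    if with_row_addr:
        schema.append(("_rowaddr", "uint64"))
    return schema
```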

@chenkovsky (Contributor) commented Jan 4, 2025

You can't filter on the column

Can't filter on _rowid specifically, or on any column?

I tested the following unit test:

def test_duckdb_rowid(tmp_path):
    duckdb = pytest.importorskip("duckdb")
    tbl = create_table_for_duckdb()
    ds = lance.write_dataset(tbl, str(tmp_path))
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    duckdb.query("SELECT id, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, id, meta, price FROM ds").to_df() # error
    duckdb.query("SELECT id, meta, price FROM ds").to_df() # OK

@chenkovsky (Contributor)

impl LanceTableProvider {

Yes, the with_row_id / with_row_addr flags will always work, but I think Spark's SupportsMetadataColumns interface is a much better design.
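To make the comparison concrete: under Spark's SupportsMetadataColumns, metadata columns are hidden from `SELECT *` but resolvable when named explicitly, with no table-creation flag needed. A pure-Python sketch of that resolution rule (column names are hypothetical):

```python
# Sketch of the SupportsMetadataColumns idea: '*' expands to regular
# columns only, while metadata (pseudo-) columns appear only when the
# query names them explicitly.
REGULAR_COLUMNS = ["id", "meta", "price"]
METADATA_COLUMNS = ["_rowid", "_rowaddr"]

def resolve_projection(requested):
    """Expand '*' to regular columns; metadata columns must be named."""
    resolved = []
    for col in requested:
        if col == "*":
            resolved.extend(REGULAR_COLUMNS)
        elif col in REGULAR_COLUMNS or col in METADATA_COLUMNS:
            resolved.append(col)
        else:
            raise KeyError(f"unknown column {col!r}")
    return resolved
```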

@chenkovsky (Contributor)

I created a PR against DataFusion to illustrate my idea for _rowid support: apache/datafusion#14057
