
datafusion-python integration #3334

Open

westonpace opened this issue Jan 3, 2025 · 5 comments

Comments

@westonpace (Contributor)

The datafusion-python project recently added support for "foreign table providers" in apache/datafusion-python#921.

We should be able to use this to create a foreign table provider from Lance. That would make it easy to query Lance datasets from Python, comparable to our DuckDB integration.

@chenkovsky (Contributor)

I have a question: how do we expose _rowid and _rowaddr? It seems the DataFusion API and DuckDB don't support these pseudo-columns.

@westonpace (Contributor, Author)

For the DuckDB integration you can create a dataset with default scan options. Unfortunately you can't filter on the column yet, because pyarrow and DataFusion interpret unsigned integers slightly differently in the filtering language (Substrait), so a DataFusion change is needed first.

import duckdb
import lance
import pyarrow as pa

def test_duckdb_filter_on_rowid(tmp_path):
    tab = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
    lance.write_dataset(tab, str(tmp_path))
    # Reopen with _rowid exposed by default so DuckDB can see it
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    row_ids = ds.scanner(columns=[], with_row_id=True).to_table().column(0).to_pylist()
    # Note: no WHERE clause -- filtering on _rowid doesn't work yet (see above)
    actual = duckdb.query("SELECT _rowid FROM ds").fetch_arrow_table()
    assert actual.column(0).to_pylist() == row_ids
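As an aside on the Substrait mismatch mentioned above: this is a generic illustration of why signed/unsigned disagreement breaks filtering, not the exact DataFusion/pyarrow bug. `_rowid` is an unsigned 64-bit value; if one side of the pipeline reinterprets its bits as a signed i64, large row ids flip negative and comparisons go wrong:

```python
# Illustration only (not the exact bug): reinterpreting a u64 row id's
# bits as a signed i64 (two's complement) flips large values negative,
# so range comparisons on the column no longer agree across systems.
def as_signed_i64(u: int) -> int:
    """Reinterpret an unsigned 64-bit integer's bit pattern as signed."""
    return u - (1 << 64) if u >= (1 << 63) else u

large_rowid = (1 << 63) + 5           # a valid u64 row id
misread = as_signed_i64(large_rowid)  # negative when misread as i64
```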

For datafusion you choose whether you want these columns to appear when you create the table provider:

impl LanceTableProvider {
    fn new(dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool) -> Self {
        ...
    }
    ...
}
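A minimal sketch (pure Python, hypothetical names and types) of what those constructor flags would imply for the schema the provider reports to DataFusion:

```python
# Sketch (hypothetical): how with_row_id / with_row_addr flags could
# extend the schema a Lance table provider exposes. Base columns match
# the earlier example table; both pseudo-columns are u64 in Lance.
BASE_SCHEMA = [("a", "int64"), ("b", "int64")]

def provider_schema(with_row_id=False, with_row_addr=False):
    """Return the (name, type) pairs the provider would advertise."""
    schema = list(BASE_SCHEMA)
    if with_row_id:
        schema.append(("_rowid", "uint64"))
    if with_row_addr:
        schema.append(("_rowaddr", "uint64"))
    return schema
```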

@chenkovsky (Contributor) commented Jan 4, 2025

You can't filter on the column

Can't filter on _rowid specifically, or on any column?

I tested the following unit test:

def test_duckdb_rowid(tmp_path):
    duckdb = pytest.importorskip("duckdb")
    tbl = create_table_for_duckdb()
    ds = lance.write_dataset(tbl, str(tmp_path))
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    duckdb.query("SELECT id, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, id, meta, price FROM ds").to_df() # error
    duckdb.query("SELECT id, meta, price FROM ds").to_df() # OK

@chenkovsky (Contributor)

impl LanceTableProvider {

Yes, the with_row_id / with_row_addr flags will always work, but I think Spark's SupportsMetadataColumns interface is a much better design.
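To make the comparison concrete: under Spark's SupportsMetadataColumns, metadata columns are hidden from `SELECT *` but resolvable when named explicitly, with no table-creation flag needed. A pure-Python sketch of that resolution rule (column names are hypothetical):

```python
# Sketch of the SupportsMetadataColumns idea: '*' expands to regular
# columns only, while metadata (pseudo-) columns appear only when the
# query names them explicitly.
REGULAR_COLUMNS = ["id", "meta", "price"]
METADATA_COLUMNS = ["_rowid", "_rowaddr"]

def resolve_projection(requested):
    """Expand '*' to regular columns; metadata columns must be named."""
    resolved = []
    for col in requested:
        if col == "*":
            resolved.extend(REGULAR_COLUMNS)
        elif col in REGULAR_COLUMNS or col in METADATA_COLUMNS:
            resolved.append(col)
        else:
            raise KeyError(f"unknown column {col!r}")
    return resolved
```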

@chenkovsky (Contributor)

I created a PR against DataFusion to illustrate my idea for _rowid support: apache/datafusion#14057
