Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polars count(*) + filter projects too many columns. #20902

Closed
2 tasks done
kszlim opened this issue Jan 24, 2025 · 2 comments · Fixed by #20923
Closed
2 tasks done

Polars count(*) + filter projects too many columns. #20902

kszlim opened this issue Jan 24, 2025 · 2 comments · Fixed by #20923
Assignees
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@kszlim
Copy link
Contributor

kszlim commented Jan 24, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import numpy as np

lineitem = pd.DataFrame({
    'l_partkey': np.random.randint(1, 1000001, size=100000000),
    'l_quantity': np.random.uniform(1, 50, size=100000000),
    'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})
lineitem.to_parquet('lineitem.parquet', index=False)


import polars as pl
import time
import datafusion as dfn

ctx = dfn.SessionContext()

start = time.time()
ctx.register_parquet("lineitem", "lineitem.parquet")
ctx.sql("select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20").show()
print(f"Took {time.time() - start}s for datafusion without filter pushdown")

ctx = dfn.SessionContext(dfn.SessionConfig({"datafusion.execution.parquet.pushdown_filters": "true"}))

start = time.time()
ctx.register_parquet("lineitem", "lineitem.parquet")
ctx.sql("select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20").show()
print(f"Took {time.time() - start}s for datafusion with filter pushdown")

start = time.time()
lineitem_ldf = pl.scan_parquet("lineitem.parquet")
print(lineitem_ldf.filter(pl.col("l_partkey").is_between(10, 20)).select(pl.len()).collect()) # Tried rewriting the filter as & of two conds without any difference
print(f"Took {time.time() - start}s for polars")

Log output

DataFrame()
+----------+
| count(*) |
+----------+
| 1143     |
+----------+
Took 0.10332036018371582s for datafusion without filter pushdown
DataFrame()
+----------+
| count(*) |
+----------+
| 1143     |
+----------+
Took 0.09206247329711914s for datafusion with filter pushdown
shape: (1, 1)
┌──────┐
│ len  │
│ ---  │
│ u32  │
╞══════╡
│ 1143 │
└──────┘
Took 0.2900676727294922s for polars

Issue description

Using a simple filter, polars is about 2-3x slower for me on a simple count of the number of rows.

Make sure you uv pip install -U polars numpy pyarrow pandas datafusion

Expected behavior

Should be roughly as fast

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.10.230-202.885.x86_64-x86_64-with-glibc2.26
Python:              3.11.7 (main, Dec  5 2023, 22:00:36) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.36.6
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         0.16.1
matplotlib           3.10.0
numpy                2.2.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@kszlim kszlim added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 24, 2025
@kszlim kszlim changed the title Polars parquet reading ~ 2-3x slower than datafusion with a simple filter Polars parquet reading ~2-3x slower than datafusion with a simple filter Jan 24, 2025
@coastalwhite coastalwhite changed the title Polars parquet reading ~2-3x slower than datafusion with a simple filter Polars parquet reading row count is ~2-3x slower than datafusion with a simple filter Jan 25, 2025
@ritchie46
Copy link
Member

ritchie46 commented Jan 25, 2025

I think this is related to the predicate pushdown in combination with the count.

The Polars "count star" equivalent: print(lineitem_ldf.select(pl.len()).collect()) is just as fast as datafusions "select count(*)" for me.

I see that we we project two columns:

' SELECT [len()] FROM\n  Parquet SCAN [lineitem.parquet]\n  PROJECT 2/3 COLUMNS\n  SELECTION: col("l_partkey").is_between([10, 20])'

This is due to the fact that projection pushdown selects the first column to be able to compute the pl.len(). This is wasteful as it should have projected the column that was used by the predicate.

Got an idea to improve the optimizer here.

@ritchie46 ritchie46 changed the title Polars parquet reading row count is ~2-3x slower than datafusion with a simple filter Polars count(*) + filter projects too many columns. Jan 25, 2025
@coastalwhite
Copy link
Collaborator

coastalwhite commented Jan 25, 2025

This can be made very fast when Parquet expressions properly lands, but indeed until that time we should try to reduce the amount of columns projected.

@ritchie46 ritchie46 self-assigned this Jan 25, 2025
@c-peters c-peters added the accepted Ready for implementation label Jan 27, 2025
@c-peters c-peters added this to Backlog Jan 27, 2025
@c-peters c-peters moved this to Done in Backlog Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants