Polars `count(*) + filter` projects too many columns. #20902

kszlim · 2025-01-24T22:48:24Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import numpy as np

lineitem = pd.DataFrame({
    'l_partkey': np.random.randint(1, 1000001, size=100000000),
    'l_quantity': np.random.uniform(1, 50, size=100000000),
    'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})
lineitem.to_parquet('lineitem.parquet', index=False)


import polars as pl
import time
import datafusion as dfn

ctx = dfn.SessionContext()

start = time.time()
ctx.register_parquet("lineitem", "lineitem.parquet")
ctx.sql("select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20").show()
print(f"Took {time.time() - start}s for datafusion without filter pushdown")

ctx = dfn.SessionContext(dfn.SessionConfig({"datafusion.execution.parquet.pushdown_filters": "true"}))

start = time.time()
ctx.register_parquet("lineitem", "lineitem.parquet")
ctx.sql("select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20").show()
print(f"Took {time.time() - start}s for datafusion with filter pushdown")

start = time.time()
lineitem_ldf = pl.scan_parquet("lineitem.parquet")
print(lineitem_ldf.filter(pl.col("l_partkey").is_between(10, 20)).select(pl.len()).collect()) # Tried rewriting the filter as & of two conds without any difference
print(f"Took {time.time() - start}s for polars")

Log output

DataFrame()
+----------+
| count(*) |
+----------+
| 1143     |
+----------+
Took 0.10332036018371582s for datafusion without filter pushdown
DataFrame()
+----------+
| count(*) |
+----------+
| 1143     |
+----------+
Took 0.09206247329711914s for datafusion with filter pushdown
shape: (1, 1)
┌──────┐
│ len  │
│ ---  │
│ u32  │
╞══════╡
│ 1143 │
└──────┘
Took 0.2900676727294922s for polars

Issue description

With a simple filter, Polars is about 2-3x slower than DataFusion for me when counting the number of rows.

To reproduce, first run `uv pip install -U polars numpy pyarrow pandas datafusion`.

Expected behavior

Polars should be roughly as fast as DataFusion.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.10.230-202.885.x86_64-x86_64-with-glibc2.26
Python:              3.11.7 (main, Dec  5 2023, 22:00:36) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.36.6
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         0.16.1
matplotlib           3.10.0
numpy                2.2.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


ritchie46 · 2025-01-25T11:59:24Z

I think this is related to the predicate pushdown in combination with the count.

The Polars "count star" equivalent, print(lineitem_ldf.select(pl.len()).collect()), is just as fast as DataFusion's "select count(*)" for me.

I see that we project two columns:

' SELECT [len()] FROM\n  Parquet SCAN [lineitem.parquet]\n  PROJECT 2/3 COLUMNS\n  SELECTION: col("l_partkey").is_between([10, 20])'

This happens because projection pushdown selects the first column so that it can compute pl.len(). That is wasteful: it should have projected the column that is already used by the predicate.
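The difference between the two projection strategies can be sketched with a toy model. This is not Polars' optimizer code: the function names `naive_projection` and `minimal_projection` and the schema `["a", "b", "c"]` are hypothetical, chosen only to illustrate why a count with a filter can end up reading one column more than it needs.

```python
def naive_projection(schema, predicate_columns):
    """Toy model: always add the first column as a placeholder for the row
    count, then add whatever the predicate needs."""
    needed = {schema[0]} | set(predicate_columns)
    # Preserve schema order in the result.
    return [name for name in schema if name in needed]


def minimal_projection(schema, predicate_columns):
    """Toy model: reuse the predicate's columns for the row count and fall
    back to a placeholder column only when there is no predicate."""
    needed = set(predicate_columns) or {schema[0]}
    return [name for name in schema if name in needed]


# Hypothetical three-column schema with a filter on the second column.
schema = ["a", "b", "c"]
print(naive_projection(schema, ["b"]))    # ['a', 'b']
print(minimal_projection(schema, ["b"]))  # ['b']
print(minimal_projection(schema, []))     # ['a']
```

In this model the naive strategy reads two of three columns for a filtered count, while the minimal strategy reads only the predicate column. A plain count with no filter still needs one placeholder column.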

Got an idea to improve the optimizer here.

coastalwhite · 2025-01-25T13:44:22Z

This can be made very fast once Parquet expressions properly land, but until then we should indeed try to reduce the number of columns projected.

kszlim added the bug, needs triage, and python labels Jan 24, 2025

kszlim changed the title ~~Polars parquet reading ~ 2-3x slower than datafusion with a simple filter~~ Polars parquet reading ~2-3x slower than datafusion with a simple filter Jan 24, 2025

coastalwhite changed the title ~~Polars parquet reading ~2-3x slower than datafusion with a simple filter~~ Polars parquet reading row count is ~2-3x slower than datafusion with a simple filter Jan 25, 2025

ritchie46 changed the title ~~Polars parquet reading row count is ~2-3x slower than datafusion with a simple filter~~ Polars count(*) + filter projects too many columns. Jan 25, 2025

ritchie46 self-assigned this Jan 25, 2025

ritchie46 mentioned this issue Jan 26, 2025

perf: Ensure count query select minimal columns #20923

Merged

ritchie46 closed this as completed in #20923 Jan 27, 2025

c-peters added the accepted label Jan 27, 2025

c-peters added this to Backlog Jan 27, 2025

c-peters moved this to Done in Backlog Jan 27, 2025
