slice giving incorrect empty result #9887

Closed
2 tasks done
jonashaag opened this issue Jul 14, 2023 · 10 comments · Fixed by #12239
Labels
bug Something isn't working python Related to Python Polars

Comments

@jonashaag
Contributor

jonashaag commented Jul 14, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Unfortunately I don't have a minimal example right now. This happens in a memory-constrained environment.

# Written on macOS host
>>> N = 1_000_000
>>> df = pl.DataFrame({"idx": range(N), **{f"col{i}": [f"{i}abc{j}" for j in range(N)] for i in range(50)}})
>>> df.with_columns(idx=df["idx"].shuffle()).write_parquet("df", row_group_size=10_000)

# Executed in Docker Linux guest (df is baked into Docker image)
>>> print(len(pl.scan_parquet("df").slice(0,5).collect()))
0
>>> print(len(pl.scan_parquet("df").limit(5).collect()))
0
>>> print(len(pl.scan_parquet("df").fetch()))
500

Issue description

See example

Any ideas how to debug this further?

Plan:

  Parquet SCAN df
  PROJECT */51 COLUMNS
  N_ROWS: 5

Expected behavior

`slice(0, 5)` and `limit(5)` should each return 5 rows, not 0, since the file is not empty (`fetch()` returns 500 rows).

Installed versions


--------Version info---------
Polars:              0.18.7
Index type:          UInt32
Platform:            Linux-6.3.11-200.fc38.aarch64-aarch64-with-glibc2.36
Python:              3.11.4 (main, Jul  4 2023, 21:57:59) [GCC 12.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              <not installed>
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@jonashaag jonashaag added bug Something isn't working python Related to Python Polars labels Jul 14, 2023
@jonashaag
Contributor Author

With streaming=True it works correctly
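
A minimal sketch of how the two code paths could be compared side by side. The helper name `slice_row_counts` is hypothetical, and it assumes a Polars release contemporary with this thread (0.18.x), where `LazyFrame.collect` accepts `streaming=True`:

```python
def slice_row_counts(lf, length=5):
    """Collect the same slice with and without the streaming engine.

    `lf` is expected to be a polars.LazyFrame. `collect(streaming=True)`
    is assumed to be available, as in the 0.18.x releases discussed here.
    """
    sliced = lf.slice(0, length)
    return {
        "default": len(sliced.collect()),
        "streaming": len(sliced.collect(streaming=True)),
    }
```

Calling `slice_row_counts(pl.scan_parquet("df"))` on an affected build would presumably show a mismatch between the two counts; on an unaffected build both should be equal.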

@ritchie46
Member

I cannot reproduce? Anyone else?

@jonashaag
Contributor Author

jonashaag commented Jul 27, 2023

I can reproduce with Polars 0.18.8

Instructions

Use Podman on an M2 Mac, with a Podman machine configured with 2048 MB of memory and 4 CPUs. I didn't check whether this is reproducible without Podman or on non-aarch64 platforms.

❯ podman run -it python bash
root@deefc3eda5f9:/# uname -a
Linux deefc3eda5f9 6.3.11-200.fc38.aarch64 #1 SMP PREEMPT_DYNAMIC Sun Jul  2 13:39:47 UTC 2023 aarch64 GNU/Linux

root@deefc3eda5f9:/# pip install polars
...
Successfully installed polars-0.18.8

root@deefc3eda5f9:/# python
Python 3.11.4 (main, Jul  4 2023, 21:57:59) [GCC 12.2.0] on linux
>>> import polars as pl
>>> N = 300_000
>>> df = pl.DataFrame({"idx": range(N), **{f"col{i}": [f"{i}abc{j}" for j in range(N)] for i in range(50)}})
>>> df.with_columns(idx=df["idx"].shuffle()).write_parquet("df", row_group_size=10_000)

root@deefc3eda5f9:/# python
>>> import polars as pl
>>> print(len(pl.scan_parquet("df").slice(0,5).collect()))
0
>>> print(len(pl.scan_parquet("df").limit(5).collect()))
0
>>> print(len(pl.scan_parquet("df").fetch()))
500
>>>

@kgutwin

kgutwin commented Aug 24, 2023

I can reproduce this, only on aarch64 though. You can reproduce it on non-aarch64 systems by using QEMU through Docker:

% docker run -ti --rm --platform linux/arm64 python bash
root@8ee8c6605318:/# uname -m
aarch64
root@8ee8c6605318:/# pip install polars
Collecting polars
  Downloading polars-0.18.15-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (17.4 MB)
Installing collected packages: polars
Successfully installed polars-0.18.15
root@8ee8c6605318:/# cat >/tmp/test.csv <<EOF
> a,b
> 1,10
> 2,20
> 3,30
> 4,40
> 5,50
> EOF
root@8ee8c6605318:/# python
Python 3.11.4 (main, Aug 16 2023, 07:34:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
>>> lf = pl.scan_csv('/tmp/test.csv')
>>> lf.slice(0,100).collect()
shape: (0, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
└─────┴─────┘
>>> lf.collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
│ 4   ┆ 40  │
│ 5   ┆ 50  │
└─────┴─────┘

This seems to affect LazyFrames created by scan_csv or scan_parquet at least. A DataFrame that is converted to a LazyFrame is not affected.

>>> df = pl.read_csv('/tmp/test.csv')
>>> df.lazy().slice(0,100).collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
│ 2   ┆ 20  │
│ 3   ┆ 30  │
│ 4   ┆ 40  │
│ 5   ┆ 50  │
└─────┴─────┘

@kgutwin

kgutwin commented Aug 24, 2023

I did a quick bisect search to try to find which polars release started failing. I was able to determine that polars 0.16.18 shows the bug, but polars 0.16.17 behaves as expected. The release notes for 0.16.18 don't show anything obvious; there don't seem to be a lot of changes in that release. Unfortunately, I wasn't able to build polars from source inside my container, but someone who can reproduce the bug and can git bisect between 1a7103f and 69516a7 may be able to track it down.

Reproduction test script (make sure a CSV file with at least one data row exists at /tmp/test.csv):

#!/usr/bin/env python

import sys
import polars as pl

n = len(pl.scan_csv('/tmp/test.csv').slice(0, 100).collect())

if n == 0:
    print('FAIL')
    sys.exit(1)

print('ok')
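
The release-level bisect described above can be sketched as a binary search over an ordered list of versions. This is only an illustration: `first_bad_version` and `repro_fails` are hypothetical names, the script path `repro.py` is an assumption, and the search assumes that every release before the first failing one passes and every release after it fails.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def first_bad_version(
    versions: Sequence[str], is_bad: Callable[[str], bool]
) -> Optional[str]:
    """Return the earliest version for which is_bad() is true, or None.

    Assumes `versions` is sorted oldest to newest and that all good
    versions precede all bad ones.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # bug present: the first bad version is at mid or earlier
        else:
            lo = mid + 1  # bug absent: the first bad version is later
    return versions[lo] if lo < len(versions) else None


def repro_fails(version: str, script: str = "repro.py") -> bool:
    """Hypothetical predicate: install one Polars release, then run the repro script."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"polars=={version}"],
        check=True,
    )
    # The reproduction script exits with status 1 when the slice comes back empty.
    return subprocess.run([sys.executable, script]).returncode != 0
```

With the reproduction script saved as `repro.py`, `first_bad_version(versions, repro_fails)` needs only about log2(len(versions)) installs, which matters when each install runs under emulation.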

@jonashaag
Contributor Author

I can't reproduce with your example. Can you share the CSV file?

@kgutwin

kgutwin commented Sep 16, 2023

The CSV file I was using was really simple:

a,b
1,10
2,20
3,30
4,40
5,50

@jonashaag
Contributor Author

jonashaag commented Sep 16, 2023

Aha, it is only reproducible with the release build.

@jonashaag
Contributor Author

jonashaag commented Sep 16, 2023

Bug introduced in #7940 and fixed in #10467

@ritchie46 is this something we should be concerned about?

Also, should we add a test? It's kind of specific, but the test would be very simple:

import polars as pl

def test_scan_csv(tmp_path):
    # One header line ("a") plus one data row ("a"): expect exactly one row back.
    (tmp_path / "a.csv").write_text("a\na")
    assert len(pl.scan_csv(tmp_path / "a.csv").slice(0).collect()) == 1

@jonashaag
Contributor Author

@stinodego I'm pinging you randomly, hoping that you are less busy than Ritchie :) Any opinion on my last comment?
