
read_csv_batched and scan_csv always load the entire file into memory when the separator parameter is set to a non-default value #13655

Open
2 tasks done
karond-is-me opened this issue Jan 12, 2024 · 2 comments
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@karond-is-me

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.read_csv_batched("access - 副本 (2).log", batch_size=100).next_batches(1)  # OK
pl.read_csv_batched("access - 副本 (2).log", separator=" ", batch_size=100).next_batches(1)  # NG
pl.scan_csv("access - 副本 (2).log").lazy().fetch(1)  # OK
pl.scan_csv("access - 副本 (2).log", separator=" ").lazy().fetch(1)  # NG
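Since the spike reportedly does not show up in Task Manager, one way to compare the two calls is to record the process's peak resident set size with the stdlib `resource` module. This is a minimal measurement sketch (Linux/macOS only), not part of the original report:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

# Run once after each read_csv_batched / scan_csv variant and compare.
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Running the script twice, once with and once without `separator`, and comparing the printed peaks would confirm or rule out the reported blow-up.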

Log output

No response

Issue description

When reading a large (18 GB) CSV file with the streaming or batched APIs, setting the separator parameter to a non-default value appears to load the entire file into memory, even though the spike is not visible in the Windows Task Manager.
I suspect that bug #9266 may be related to this issue.
[screenshot]

Expected behavior

Streaming and batched reads should process the file in chunks rather than loading its entire contents into memory, regardless of the separator setting.

Installed versions

Polars: 0.20.3
Index type: UInt32
Platform: Windows-10-10.0.18363-SP0
Python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.2.1
connectorx:
deltalake:
fsspec: 2023.4.0
gevent:
hvplot: 0.8.4
matplotlib: 3.7.2
numpy: 1.24.3
openpyxl: 3.0.10
pandas: 2.0.3
pyarrow: 11.0.0
pydantic: 1.10.8
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.39
xlsx2csv:
xlsxwriter:

@karond-is-me karond-is-me added bug Something isn't working python Related to Python Polars labels Jan 12, 2024
@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@itamarst
Contributor

I tried this on Linux, using /usr/bin/time -v python script.py to measure maximum resident memory, with the version on main from Jan 17 2024. I was unable to see any difference in memory usage between runs with and without separator, albeit with a different file than the one the reporter used.

@stinodego stinodego added the A-io-csv Area: reading/writing CSV files label Jan 21, 2024
@karond-is-me
Author

After updating Polars to version 0.20.5, I see no discernible change on my Windows machine.
[screenshot]
