
read_csv_batched and scan_csv always load the entire file into memory when the separator parameter is set to a non-default value #13655

Open
2 tasks done
karond-is-me opened this issue Jan 12, 2024 · 2 comments
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@karond-is-me

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.read_csv_batched("access - 副本 (2).log", batch_size=100).next_batches(1)  # OK
pl.read_csv_batched("access - 副本 (2).log", separator=" ", batch_size=100).next_batches(1)  # NG
pl.scan_csv("access - 副本 (2).log").lazy().fetch(1)  # OK
pl.scan_csv("access - 副本 (2).log", separator=" ").lazy().fetch(1)  # NG
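Since the spike reportedly does not show up in Task Manager, one way to compare the two calls is to record the process's peak resident set size with the stdlib `resource` module. This is a minimal measurement sketch (Linux/macOS only), not part of the original report:

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

# Run once after each read_csv_batched / scan_csv variant and compare.
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Running the script twice, once with and once without `separator`, and comparing the printed peaks would confirm or rule out the reported blow-up.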

Log output

No response

Issue description

When reading a large (18 GB) CSV file with the streaming or batched APIs, setting the separator parameter to a non-default value appears to load the entire file into memory, even though the spike is not visible in the Windows Task Manager.
I suspect that bug #9266 may be related to this issue.
[screenshot]

Expected behavior

Streaming and batched reads should process the file in chunks rather than loading its entire contents into memory, regardless of the separator setting.

Installed versions

Polars: 0.20.3
Index type: UInt32
Platform: Windows-10-10.0.18363-SP0
Python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.2.1
connectorx:
deltalake:
fsspec: 2023.4.0
gevent:
hvplot: 0.8.4
matplotlib: 3.7.2
numpy: 1.24.3
openpyxl: 3.0.10
pandas: 2.0.3
pyarrow: 11.0.0
pydantic: 1.10.8
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.39
xlsx2csv:
xlsxwriter:

@karond-is-me karond-is-me added bug Something isn't working python Related to Python Polars labels Jan 12, 2024
@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@itamarst
Contributor

I tried this on Linux, using /usr/bin/time -v python script.py to measure maximum resident memory, with the version on main from Jan 17 2024. I was unable to see any difference in memory usage between runs with and without separator, albeit with a different file than the one the reporter used.

@stinodego stinodego added the A-io-csv Area: reading/writing CSV files label Jan 21, 2024
@karond-is-me
Author

After updating Polars to version 0.20.5, I see no discernible change on my Windows machine.
[screenshot]
