read_csv reads entire BytesIO stream into memory #9266

Open
corneliusroemer opened this issue Jun 6, 2023 · 10 comments
Labels
A-io (Area: reading and writing data) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@corneliusroemer commented Jun 6, 2023

Question about Polars

from xopen import xopen
import polars as pl

with xopen("data/sc2/metadata.tsv.zst", "rb") as f:
    df = pl.read_csv(f, separator="\t", columns=["strain"])

I'm surprised that pl.read_csv seems to read the entire file into memory before it starts parsing. In my case, with a 1.3GB (compressed) / 20GB (uncompressed) file, this is really slow even on a 32GB Mac because a lot of swapping happens.

I will try lazy mode, but it could be good to warn that read_csv requires memory on the order of the file being read even if one uses only a subset of columns: the columns argument of read_csv does not behave like pandas' usecols, which was my expectation.

Update: Lazy mode doesn't support BytesIO, so the xopen approach doesn't work there. One needs to decompress to disk first.

You can obtain the file to replicate the observed behaviour here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

Related to #7287
Polars version 0.18.15

@corneliusroemer (Author)

Setting low_memory=True doesn't seem to help at all; it still reads the entire file into memory first.

@ritchie46 (Member)

Decompression is still done completely in memory. We plan to handle this in the batched CSV reader in the future, but we haven't worked on that yet.

In the meantime it is advised to decompress first and then scan.
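
For illustration, a minimal sketch of that decompress-then-scan workaround (assuming the zstandard package is available; file and column names follow the original report):

import zstandard
import polars as pl

# Decompress the .zst stream to disk once, then scan the plain TSV lazily
# so only the selected column is materialized.
with open("data/sc2/metadata.tsv.zst", "rb") as src, open("data/sc2/metadata.tsv", "wb") as dst:
    zstandard.ZstdDecompressor().copy_stream(src, dst)

df = pl.scan_csv("data/sc2/metadata.tsv", separator="\t").select("strain").collect()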

@corneliusroemer (Author)

I don't quite see how this is related to decompression, as decompression is handled by xopen in my example; Polars just gets a BytesIO stream. Do you only read it entirely into memory when it's a BytesIO? It would be good to mention this behaviour in the read_csv documentation.

@corneliusroemer corneliusroemer changed the title read_csv surprisingly reads entire file into memory before passing read_csv surprisingly reads entire file into memory before parsing Jun 7, 2023
@ghuls (Collaborator) commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them:
#9283 (comment)
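
For reference, once the data is in Parquet, a lazy scan only materializes the selected column (a short sketch; the output file name is illustrative):

import polars as pl

df = pl.scan_parquet("metadata.parquet").select("strain").collect()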

@itamarst (Contributor) commented Aug 17, 2023

I tried to verify whether this is the case. Given a 300MB CSV file where I only read the first column:

  • Pandas: max RSS of 130MB.
  • Polars: max RSS of 440MB; low_memory doesn't make a difference.

However, this may just be because of mmap(): with enough memory available, the mapped pages stay resident after reading even though they don't need to be, so it's not necessarily an actual problem. And indeed, memory_map=True with Pandas makes it match Polars.
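
Roughly the comparison described above (a sketch; the file and column name are illustrative, and peak RSS was measured separately):

import pandas as pd
import polars as pl

df_pd = pd.read_csv("test.csv", usecols=["route_id"], memory_map=True)  # mmap-backed read
df_pl = pl.read_csv("test.csv", columns=["route_id"])  # Polars memory-maps file paths as well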

@itamarst (Contributor) commented Aug 17, 2023

When reading from BytesIO, where mmap() cannot be used, Polars does not read the entire file into memory before parsing: max memory usage is 473MB. So in this case the file's bytes are clearly not being copied a second time.

When passing open("test.csv", "rb"), max usage is 642MB. And it does warn against doing this, to be fair, so I'm not sure it's worth fixing the memory usage. Perhaps the warning could also mention that.

In any case, it's not clear to me if the originally reported bug is accurate.

This is Polars 0.18.15.

There is a bunch of memory allocation due to rechunking being on by default. So one thought is that the default rechunk value could be changed from True to None, indicating a heuristic, which (to begin with) would be something like:

if rechunk is None:
    rechunk = not low_memory

Update: I opened #12631.
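
In the meantime, the rechunking allocation can be skipped by opting out explicitly (a sketch; rechunk and low_memory are existing read_csv parameters, and the file and column names are illustrative):

import polars as pl
from io import BytesIO

with open("test2.csv", "rb") as fh:
    f = BytesIO(fh.read())

df = pl.read_csv(f, columns=["route_id"], rechunk=False, low_memory=True)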

@corneliusroemer (Author)

@itamarst: I tried to verify if this is the case: given a 300MB CSV file, where I only read the first column: [...] When reading from BytesIO, where mmap() cannot be used, Polars does not read the entire file into memory before parsing: max memory usage is 473MB.

Why do you say Polars does not read the entire file into memory before parsing? The file takes 300MB, so as long as peak memory is >300MB (which it is, as you state), it is possible the whole thing is in memory before being parsed in chunks.

@itamarst So in this case clearly the file's bytes are not being copied a second time.

I never said bytes would be copied a second time. Instead, I said that the whole file is read into memory once, before being parsed. My expectation was that the BytesIO stream would be parsed as soon as possible and not kept entirely in memory.

@itamarst In any case, it's not clear to me if the originally reported bug is accurate.

What about it is not accurate? To rephrase: when passing BytesIO, read_csv appears to read the whole stream into memory before parsing. This is surprising because there is no inherent reason why the CSV cannot be parsed in chunks (once the schema has been determined/inferred), unless one does schema inference on the whole file.

@itamarst When passing open("test.csv", "rb"), max usage is 642MB. And it does warn against doing this, to be fair, so I'm not sure it's worth fixing the memory usage. Perhaps the error could also mention that.

Please read the bug report carefully and note that I don't use open but xopen, and operate on test.csv.zst, not test.csv. The warning/suggestion "Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance." is a) inaccurate (it found a BytesIO stream, not a filename) and b) not actionable, as .zst decompression is not supported by Polars at the time of writing.

Can you try to replicate the issue with something closer to the initial report, with a stream on the order of your machine's RAM, to see the limiting behaviour I mean?

You can download a large metadata.tsv.zst (~1GB compressed, ~20GB uncompressed) here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

@corneliusroemer corneliusroemer changed the title read_csv surprisingly reads entire file into memory before parsing read_csv reads entire BytesIO stream into memory Aug 18, 2023
@stinodego stinodego added bug Something isn't working python Related to Python Polars and removed question labels Aug 22, 2023
@itamarst (Contributor)

@corneliusroemer: Why do you say Polars does not read the entire file into memory before parsing? The file takes 300MB, so as long as peak memory is >300MB (which it is, as you state), it is possible the whole thing is in memory before being parsed in chunks.

BytesIO means there's already a copy in memory. So given a 300MB file, you will have 300MB just from the BytesIO, nothing to do with Polars. So if max memory is 473MB, that means Polars allocated at most 173MB (presumably less; there's various overhead from importing modules and Python itself running). That means Polars didn't read the whole file into memory; if it had, you'd expect memory usage to be 600MB at the absolute minimum.

@itamarst (Contributor) commented Nov 22, 2023

Consider:

def print_max_resident_memory(where):
    from resource import getrusage, RUSAGE_SELF
    print(where, getrusage(RUSAGE_SELF).ru_maxrss / 1024, "MB")

print_max_resident_memory("Startup")

from io import BytesIO
f = BytesIO(open("test2.csv", "rb").read())
print_max_resident_memory("Loaded to BytesIO")

import polars as pl
print_max_resident_memory("Imported Polars")
df = pl.read_csv(f, columns=["route_id"], null_values=["NA"], dtypes={"route_id": pl.Utf8})
print_max_resident_memory("read_csv() just one column")

Output:

$ python demo.py 
Startup 8.75 MB
Loaded to BytesIO 314.375 MB
Imported Polars 351.046875 MB
read_csv() just one column 448.546875 MB

Based on that I would think Polars is not reading the whole contents of BytesIO into memory at the same time, but rather reading and processing in chunks. Otherwise max memory usage would be 650MB minimum. I am currently reading through the code, though; perhaps I'm wrong.

One thing I did notice is that for BytesIO specifically there's an optimization one could do which Polars doesn't: using getbuffer() instead of read(). But that has safety concerns, since the underlying buffer could be mutated behind our back.
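
To illustrate the difference (plain Python, not Polars code):

from io import BytesIO

buf = BytesIO(b"a,b\n1,2\n")
copy = buf.getvalue()   # returns bytes: a full copy of the buffer's contents
view = buf.getbuffer()  # returns a memoryview over the existing buffer, no copy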

@itamarst (Contributor)

OK, found the relevant code, and it does actually call getvalue() (in py-polars/src/file.rs):

        if let Ok(bytes) = py_f.call_method0("getvalue") {
            let bytes = bytes.downcast::<PyBytes>()?;
            Ok(Box::new(Cursor::new(bytes.as_bytes())))
        }

It's possible just deleting these lines would solve the issue specifically for BytesIO, because it would fall back to read()ing, which does not, I think, load everything into memory at once.
