read_csv reads entire BytesIO stream into memory #9266

Open
corneliusroemer opened this issue Jun 6, 2023 · 10 comments
Labels
A-io (Area: reading and writing data) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@corneliusroemer commented Jun 6, 2023

Question about Polars

from xopen import xopen
import polars as pl

with xopen("data/sc2/metadata.tsv.zst", "rb") as f:
    df = pl.read_csv(f, separator="\t", columns=["strain"])

I'm surprised that pl.read_csv seems to read the entire file into memory before it starts parsing. In my case, with a 1.3GB (compressed) / 20GB (uncompressed) file, this is really slow even on a 32GB Mac because a lot of swapping happens.

I will try lazy mode, but it could be good to warn that read_csv requires memory on the order of the file being read even if one uses only a subset of columns: the columns argument of read_csv does not behave like pandas' usecols, which was my expectation.

Update: Lazy mode doesn't support BytesIO, so the xopen approach doesn't work there. One needs to decompress to disk first.

You can obtain the file to replicate the observed behaviour here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

Related to #7287
Polars version 0.18.15

@corneliusroemer (Author)

Setting low_memory=True doesn't seem to help at all; it still reads the entire file into memory first.

@ritchie46 (Member)

Decompression is still done completely in memory. We plan to handle this in the batched CSV reader in the future, but we haven't worked on that yet.

In the meantime it is advised to decompress first and then scan.
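
For illustration, a minimal sketch of that decompress-then-scan workaround (assuming the zstandard package is available; file and column names follow the original report):

import zstandard
import polars as pl

# Decompress the .zst stream to disk once, then scan the plain TSV lazily
# so only the selected column is materialized.
with open("data/sc2/metadata.tsv.zst", "rb") as src, open("data/sc2/metadata.tsv", "wb") as dst:
    zstandard.ZstdDecompressor().copy_stream(src, dst)

df = pl.scan_csv("data/sc2/metadata.tsv", separator="\t").select("strain").collect()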

@corneliusroemer (Author)

I don't quite see how this is related to decompression, as decompression is handled by xopen in my example; Polars just gets a BytesIO stream. Do you only read it entirely into memory when it's a BytesIO? It would be good to mention this behaviour in the read_csv documentation.

@corneliusroemer corneliusroemer changed the title read_csv surprisingly reads entire file into memory before passing read_csv surprisingly reads entire file into memory before parsing Jun 7, 2023
@ghuls (Collaborator) commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them:
#9283 (comment)
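
For reference, once the data is in Parquet, a lazy scan only materializes the selected column (a short sketch; the output file name is illustrative):

import polars as pl

df = pl.scan_parquet("metadata.parquet").select("strain").collect()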

@itamarst (Contributor) commented Aug 17, 2023

I tried to verify whether this is the case. Given a 300MB CSV file where I only read the first column:

  • Pandas: max RSS of 130MB.
  • Polars: max RSS of 440MB; low_memory doesn't make a difference.

However, this may just be because of mmap(): with enough memory available, the mapped pages stay resident after reading even though they don't need to be, so it's not necessarily an actual problem. And indeed, memory_map=True with Pandas makes it match Polars.
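
Roughly the comparison described above (a sketch; the file and column name are illustrative, and peak RSS was measured separately):

import pandas as pd
import polars as pl

df_pd = pd.read_csv("test.csv", usecols=["route_id"], memory_map=True)  # mmap-backed read
df_pl = pl.read_csv("test.csv", columns=["route_id"])  # Polars memory-maps file paths as well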

@itamarst (Contributor) commented Aug 17, 2023

When reading from BytesIO, where mmap() cannot be used, Polars does not read the entire file into memory before parsing: max memory usage is 473MB. So in this case the file's bytes are clearly not being copied a second time.

When passing open("test.csv", "rb"), max usage is 642MB. And it does warn against doing this, to be fair, so I'm not sure it's worth fixing the memory usage. Perhaps the warning could also mention that.

In any case, it's not clear to me if the originally reported bug is accurate.

This is Polars 0.18.15.

There is a bunch of memory allocation due to rechunking being on by default. So one thought is that the default rechunk value could be changed from True to None, indicating a heuristic, which (to begin with) would be something like:

if rechunk is None:
    rechunk = not low_memory

Update: I opened #12631.
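
In the meantime, the rechunking allocation can be skipped by opting out explicitly (a sketch; rechunk and low_memory are existing read_csv parameters, and the file and column names are illustrative):

import polars as pl
from io import BytesIO

with open("test2.csv", "rb") as fh:
    f = BytesIO(fh.read())

df = pl.read_csv(f, columns=["route_id"], rechunk=False, low_memory=True)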

@corneliusroemer (Author)

@itamarst: I tried to verify if this is the case: given a 300MB CSV file, where I only read the first column: [...] When reading from BytesIO, where mmap() cannot be used, Polars does not read the entire file into memory before parsing: max memory usage is 473MB.

Why do you say Polars does not read the entire file into memory before parsing? The file takes 300MB, so as long as peak memory is >300MB (which it is, as you state), it is possible the whole thing is in memory before being parsed in chunks.

@itamarst So in this case clearly the file's bytes are not being copied a second time.

I never said bytes would be copied a second time. Instead, I said that the whole file is read into memory once, before being parsed. My expectation was that the BytesIO stream would be parsed as soon as possible and not kept entirely in memory.

@itamarst In any case, it's not clear to me if the originally reported bug is accurate.

What about it is not accurate? To rephrase: when passing BytesIO, read_csv appears to read the whole stream into memory before parsing. This is surprising because there is no inherent reason why the CSV cannot be parsed in chunks (once the schema has been determined/inferred), unless one does schema inference on the whole file.

@itamarst When passing open("test.csv", "rb"), max usage is 642MB. And it does warn against doing this, to be fair, so I'm not sure it's worth fixing the memory usage. Perhaps the error could also mention that.

Please read the bug report carefully and note that I don't use open but xopen, and operate on test.csv.zst, not test.csv. The warning/suggestion "Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance." is a) inaccurate (it found a BytesIO stream, not a filename) and b) not actionable, as .zst decompression is not supported by Polars at the time of writing.

Can you try to replicate the issue with something closer to the initial report, with a stream on the order of your machine's RAM, to see the limiting behaviour I mean?

You can download a large metadata.tsv.zst (~1GB compressed, ~20GB uncompressed) here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst

@corneliusroemer corneliusroemer changed the title read_csv surprisingly reads entire file into memory before parsing read_csv reads entire BytesIO stream into memory Aug 18, 2023
@stinodego stinodego added bug Something isn't working python Related to Python Polars and removed question labels Aug 22, 2023
@itamarst (Contributor)

@corneliusroemer: Why do you say Polars does not read the entire file into memory before parsing? The file takes 300MB, so as long as peak memory is >300MB (which it is, as you state), it is possible the whole thing is in memory before being parsed in chunks.

BytesIO means there's already a copy in memory. So given a 300MB file, you will have 300MB just from the BytesIO, nothing to do with Polars. So if max memory is 473MB, that means Polars allocated at most 173MB (presumably less; there's various overhead from importing modules and Python itself running). That means Polars didn't read the whole file into memory; if it had, you'd expect memory usage to be 600MB at the absolute minimum.

@itamarst (Contributor) commented Nov 22, 2023

Consider:

def print_max_resident_memory(where):
    from resource import getrusage, RUSAGE_SELF
    print(where, getrusage(RUSAGE_SELF).ru_maxrss / 1024, "MB")

print_max_resident_memory("Startup")

from io import BytesIO
f = BytesIO(open("test2.csv", "rb").read())
print_max_resident_memory("Loaded to BytesIO")

import polars as pl
print_max_resident_memory("Imported Polars")
df = pl.read_csv(f, columns=["route_id"], null_values=["NA"], dtypes={"route_id": pl.Utf8})
print_max_resident_memory("read_csv() just one column")

Output:

$ python demo.py 
Startup 8.75 MB
Loaded to BytesIO 314.375 MB
Imported Polars 351.046875 MB
read_csv() just one column 448.546875 MB

Based on that I would think Polars is not reading the whole contents of BytesIO into memory at the same time, but rather reading and processing in chunks. Otherwise max memory usage would be 650MB minimum. I am currently reading through the code, though; perhaps I'm wrong.

One thing I did notice is that for BytesIO specifically there's an optimization one could do which Polars doesn't: using getbuffer() instead of read(). But that has safety concerns, since the underlying buffer could be mutated behind our back.
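
To illustrate the difference (plain Python, not Polars code):

from io import BytesIO

buf = BytesIO(b"a,b\n1,2\n")
copy = buf.getvalue()   # returns bytes: a full copy of the buffer's contents
view = buf.getbuffer()  # returns a memoryview over the existing buffer, no copy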

@itamarst (Contributor)

OK, found the relevant code, and it does actually call getvalue() (in py-polars/src/file.rs):

        if let Ok(bytes) = py_f.call_method0("getvalue") {
            let bytes = bytes.downcast::<PyBytes>()?;
            Ok(Box::new(Cursor::new(bytes.as_bytes())))
        }

It's possible just deleting these lines would solve the issue specifically for BytesIO, because it would fall back to read()ing, which does not, I think, load everything into memory at once.
