read_csv reads entire BytesIO stream into memory #9266
Comments
Setting …
Decompression is still done completely in memory. We plan to do this in the batched CSV reader in the future, but we haven't worked on that yet. In the meantime it is advised to decompress first, and then scan.
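A minimal sketch of that workaround, assuming the `xopen` package (mentioned later in the thread) for zstd decompression and placeholder file/column names: decompress to disk once, then scan lazily so only the selected columns are materialized.

```python
# Sketch of "decompress first, then scan": write the decompressed TSV to disk,
# then use the lazy scanner so only the selected column is read into memory.
# File and column names are placeholders.
import shutil

from xopen import xopen  # decompresses .zst transparently

import polars as pl

with xopen("metadata.tsv.zst", "rb") as src, open("metadata.tsv", "wb") as dst:
    shutil.copyfileobj(src, dst)

df = pl.scan_csv("metadata.tsv", separator="\t").select("strain").collect()
```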
I don't quite see how this is related to decompression, as decompression is handled by xopen in my example. When using …
You can use …
I tried to verify if this is the case: given a 300MB CSV file, where I only read the first column:
However, this may just be because of using mmap(), and having enough memory available that it doesn't get swapped out after reading, in which case it's resident but doesn't need to be, so not really an actual problem. And indeed …
When reading from BytesIO, where … When passing …

In any case, it's not clear to me if the originally reported bug is accurate. This is Polars 0.18.15. There is a bunch of memory allocation due to rechunking being on by default, so one thought is that maybe the default rechunking behavior should be changed:

```python
if rechunk is None:
    rechunk = not low_memory
```

Update: I opened #12631.
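Related to that, a hedged sketch of how rechunking interacts with read_csv memory use (parameter names are from the 0.18 Python API; the file and column names are placeholders): passing `rechunk=False` skips the final copy that concatenates the parsed chunks into one contiguous DataFrame.

```python
# Sketch: with rechunk=False the parsed chunks are kept as-is instead of being
# copied into a single contiguous chunk, lowering peak memory at the end of parsing.
import polars as pl

df = pl.read_csv(
    "test2.csv",
    columns=["route_id"],
    rechunk=False,    # skip the concatenating copy
    low_memory=True,  # trade speed for a lower peak
)
print(df.n_chunks())
```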
Why do you say Polars does not read the entire file into memory before parsing? The file takes 300 MB, so as long as peak memory is >300 MB (which it is, as you state), it is possible the whole thing is in memory before being parsed in chunks.
I never said bytes would be copied a second time. Instead, I said that the whole file is read into memory once, before being parsed. My expectation was that the BytesIO stream would be parsed as soon as possible and not kept entirely in memory.
What about it is not accurate? To rephrase: when passing BytesIO, read_csv appears to read the whole stream into memory before parsing. This is surprising because there is no inherent reason why the CSV cannot be parsed in chunks (once the schema has been determined/inferred), unless one does schema inference on the whole file.
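For reference (not something raised in the thread), Polars already exposes a chunked reader for files on disk, `pl.read_csv_batched`; it doesn't accept a BytesIO, so it doesn't cover the case reported here. A rough sketch, with a hypothetical file and batch size:

```python
# Rough sketch: read_csv_batched parses the CSV from a path in chunks.
import polars as pl

reader = pl.read_csv_batched("test2.csv", batch_size=50_000)
batches = reader.next_batches(10)
while batches:
    for chunk in batches:
        ...  # process each DataFrame chunk here
    batches = reader.next_batches(10)
```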
Please read the bug report carefully and note that I don't … Can you try to replicate the issue with something closer to the initial report, with a stream on the order of your machine's RAM, to see the memory-limiting behaviour I mean? You can download a large metadata.tsv.zst (~1GB compressed, ~20GB uncompressed) here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst
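A hedged sketch of such a replication, reusing the peak-RSS reporting style of the snippet further down (the column name is an assumption; xopen handles the zstd decompression; ru_maxrss is in KB on Linux and bytes on macOS):

```python
# Replication sketch: stream the ~20GB-uncompressed TSV into read_csv on a
# machine with less RAM than that, then report peak resident memory.
from resource import RUSAGE_SELF, getrusage

from xopen import xopen

import polars as pl

with xopen("metadata.tsv.zst", "rb") as f:
    df = pl.read_csv(f, separator="\t", columns=["strain"])

print("peak RSS:", getrusage(RUSAGE_SELF).ru_maxrss / 1024, "MB")  # KB -> MB on Linux
```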
Consider:

```python
def print_max_resident_memory(where):
    from resource import getrusage, RUSAGE_SELF
    print(where, getrusage(RUSAGE_SELF).ru_maxrss / 1024, "MB")

print_max_resident_memory("Startup")

from io import BytesIO
f = BytesIO(open("test2.csv", "rb").read())
print_max_resident_memory("Loaded to BytesIO")

import polars as pl
print_max_resident_memory("Imported Polars")

df = pl.read_csv(f, columns=["route_id"], null_values=["NA"], dtypes={"route_id": pl.Utf8})
print_max_resident_memory("read_csv() just one column")
```

Output:
Based on that I would think Polars is not reading the whole contents of the BytesIO into memory at the same time, but rather reading and processing in chunks; otherwise max memory usage would be 650MB minimum. I am currently reading through the code, though, so perhaps I'm wrong. One thing I did notice is that for BytesIO specifically there's an optimization one could do which Polars doesn't, using …
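The comment is cut off here, so the intended optimization isn't preserved; one guess (purely an assumption) is `BytesIO.getbuffer()`, which exposes the underlying buffer without the copy that `getvalue()` makes:

```python
# Assumption / illustration only: getvalue() returns a new bytes object (a full
# copy of the buffer), while getbuffer() returns a zero-copy memoryview over it.
from io import BytesIO

f = BytesIO(b"a,b\n1,2\n")
copied = f.getvalue()   # full copy
view = f.getbuffer()    # zero-copy view
print(len(copied), len(view))
```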
OK, found the relevant code, and it does actually call `getvalue()`:

```rust
if let Ok(bytes) = py_f.call_method0("getvalue") {
    let bytes = bytes.downcast::<PyBytes>()?;
    Ok(Box::new(Cursor::new(bytes.as_bytes())))
}
```

It's possible just deleting these lines would solve the issue specifically for …
Question about Polars
I'm surprised that `pl.read_csv` seems to read the entire file into memory before starting to parse. In my case, when using a 1.3GB (compressed)/20GB (uncompressed) file this is really slow even on a 32GB Mac, as a lot of swapping happens.

I will try to use lazy mode, but it could be good to warn that read_csv requires memory on the order of the file read even if one uses only a subset of columns: `columns` for read_csv does not behave like pandas' `use_columns`, which was my expectation.

Update: Lazy mode doesn't support BytesIO, hence the xopen solution doesn't work in that case. One needs to uncompress to disk first.
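To make the `columns`/`use_columns` point above concrete, a hedged side-by-side (pandas' actual parameter is `usecols`; the file and column names are placeholders):

```python
# The expectation was that selecting columns limits memory the way pandas'
# usecols does; per this report, polars still needed memory on the order of
# the whole file even with columns=[...].
import pandas as pd
import polars as pl

pdf = pd.read_csv("test2.csv", usecols=["route_id"])    # pandas
pldf = pl.read_csv("test2.csv", columns=["route_id"])   # polars
```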
You can obtain the file for replicating the observed behaviour here: https://data.nextstrain.org/files/ncov/open/metadata.tsv.zst
Related to #7287
Polars version 0.18.15