How to read a Parquet dataset of memory size > 30% of available memory? #9888
Comments
(a) Parquet files are compressed, so the on-disk size is typically much smaller than the in-memory size; I don't think this is exactly what you are asking about, but it's good to keep in mind. I was just looking at a file that is 70 MB on disk and 600 MB in memory, so the compression can be extremely effective. (b) If you have a dataset that will never fit in memory, you will have to restrict yourself to the streaming engine and finish with a …
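A quick way to see this gap for your own file (a minimal sketch; the path `data.parquet` is a placeholder, and `DataFrame.estimated_size` is the Polars method for the in-memory estimate):

```python
import os
import polars as pl

path = "data.parquet"  # hypothetical file, substitute your own

on_disk_mb = os.path.getsize(path) / 1e6   # compressed size on disk
df = pl.read_parquet(path)
in_memory_mb = df.estimated_size("mb")     # estimated size once decoded in memory

print(f"on disk: {on_disk_mb:.1f} MB, in memory: {in_memory_mb:.1f} MB")
```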
Thanks, I clarified the title; I'm looking to read a dataset that fits in memory, but reading it goes OOM.
It is not very useful if your dataset takes up all of the memory; we need auxiliary memory to do operations. Try the streaming API, which can spill to disk if needed.
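For reference, a minimal sketch of the streaming route suggested here (the file name and filter are placeholders; `scan_parquet`, `collect(streaming=True)`, and `sink_parquet` are the relevant Polars APIs):

```python
import polars as pl

lf = pl.scan_parquet("data.parquet")  # lazy scan; nothing is read yet

# Option 1: run the query with the streaming engine and collect the
# (smaller) result into memory.
result = lf.filter(pl.col("value") > 0).collect(streaming=True)

# Option 2: never materialise the full result; stream it straight back to disk.
lf.filter(pl.col("value") > 0).sink_parquet("filtered.parquet")
```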
My example uses the streaming API and still goes OOM; the most I can fit with the streaming API is about 30% of memory.
Note that it's a reduced example. In my real-world problem the dataset is many GB in memory and I'm doing more filtering beforehand.
Closing this in favor of the linked bug report. |
Research
- I have searched the above Polars tags on Stack Overflow for similar questions.
- I have asked my usage-related question on Stack Overflow.
Link to question on Stack Overflow
No response
Question about Polars
I wonder how to read a Parquet dataset whose final Polars frame will occupy most of the available memory. So far I'm unable to read anything larger than 30% of memory.
To reproduce: same toy dataset as in #9887.
I want to read parts of this in an environment constrained to 512 MiB of memory.
I can do this:
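(The snippet that followed is not preserved in this text; as an illustration only, a read of this kind might look like the following, with a hypothetical file name and row count, using the streaming engine.)

```python
import polars as pl

n_rows = 200_000  # hypothetical row count; adjust to your memory budget

# Read only the first n_rows of the dataset with the streaming engine.
df = (
    pl.scan_parquet("data.parquet")
    .head(n_rows)
    .collect(streaming=True)
)
```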
Trying to read 210_000 rows will go OOM, although the final size would be only 10% larger. I understand there is a need to hold more than the final size in memory during Parquet reading. What are some levers to optimise the memory consumption in this case?
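Some levers that can lower peak memory during a Parquet read, sketched under assumptions (the file path and column names are placeholders; `low_memory`, `rechunk`, and `parallel` are existing `scan_parquet` parameters, and whether they are enough for a given dataset depends on its row-group layout):

```python
import polars as pl

lf = pl.scan_parquet(
    "data.parquet",   # placeholder path
    low_memory=True,  # trade speed for a lower memory footprint
    rechunk=False,    # skip the post-read rechunk, which needs an extra copy
    parallel="none",  # decode row groups sequentially instead of all at once
)

# Project only the columns you need and run the query with the streaming
# engine, so chunks can be processed (and dropped) as they are read.
df = (
    lf.select(["id", "value"])      # placeholder column names
      .filter(pl.col("value") > 0)
      .collect(streaming=True)
)
```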