
How to read a Parquet dataset of memory size > 30% of available memory? #9888

Closed
jonashaag opened this issue Jul 14, 2023 · 6 comments

jonashaag (Contributor) commented Jul 14, 2023

Research

  • I have searched the above Polars tags on Stack Overflow for similar questions.

  • I have asked my usage-related question on Stack Overflow.

Link to question on Stack Overflow

No response

Question about Polars

I wonder how to read a Parquet dataset whose final Polars frame will occupy most of the available memory. So far I'm unable to read anything larger than about 30% of available memory.

To reproduce: same toy dataset as in #9887

>>> import polars as pl
>>> N = 1_000_000
>>> df = pl.DataFrame({"idx": range(N), **{f"col{i}": [f"{i}abc{j}" for j in range(N)] for i in range(50)}})
>>> df.with_columns(idx=df["idx"].shuffle()).write_parquet("df", row_group_size=10_000)
>>> df.estimated_size()/1e6
942.4449

I want to read parts of this in an environment constrained to 512 MiB of memory.

I can do this:

>>> # Note that streaming=True is required because of #9887
>>> df = pl.scan_parquet("df").limit(200_000).collect(streaming=True)
>>> df.estimated_size()/1e6
184.0685

Trying to read 210_000 rows goes OOM, although the final size would be only about 5% larger.

I understand there is a need to hold more than the final size in memory during Parquet reading. What are some levers to optimise the memory consumption in this case?

mishpat (Contributor) commented Jul 14, 2023

(a) Parquet files are compressed, so the on-disk size is typically much smaller than the in-memory size. I don't think this is exactly what you are asking about, but it's good to keep in mind: I was just looking at a file that's 70 MB on disk and 600 MB in memory, so the compression can be extremely effective.
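
For illustration, a minimal way to compare the two sizes, assuming the toy "df" file written above (it loads the full file just to measure, so the numbers are only indicative):

import os
import polars as pl

# Compressed size on disk vs. decompressed size once loaded into memory.
on_disk_mb = os.path.getsize("df") / 1e6
in_memory_mb = pl.read_parquet("df").estimated_size() / 1e6
print(f"on disk: {on_disk_mb:.1f} MB, in memory: {in_memory_mb:.1f} MB")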

(b) If you have a dataset that will never fit in memory, you will have to restrict yourself to the streaming engine and finish with an ldf.sink_* call (sink_parquet or sink_ipc), as sketched below.
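
A minimal sketch of that pattern, assuming the toy "df" file from above; the filter is just an example predicate and the output path is arbitrary:

import polars as pl

# Stream the scan and write the result straight back to disk,
# so the full frame is never materialized in memory.
(
    pl.scan_parquet("df")
    .filter(pl.col("idx") < 200_000)
    .sink_parquet("subset.parquet")
)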

jonashaag changed the title from "How to read a Parquet dataset of size > 30% of available memory?" to "How to read a Parquet dataset of memory size > 30% of available memory?" on Jul 15, 2023
jonashaag (Contributor, Author) commented:

Thanks, I clarified the title. I'm looking to read a dataset that fits in memory, but reading it goes OOM.

ritchie46 (Member) commented:

> Thanks, I clarified the title. I'm looking to read a dataset that fits in memory, but reading it goes OOM.

It is not very useful if your dataset takes up all memory; we need auxiliary memory to do operations. Try the streaming API, which can spill to disk if needed.

jonashaag (Contributor, Author) commented:

My example uses the streaming API and still goes OOM; the most I can load with the streaming API is about 30% of memory.

jonashaag (Contributor, Author) commented:

Note that this is a reduced example. In my real-world problem the dataset is many GB in memory and I'm doing more filtering before collect(). The filtered dataset easily fits into memory, but loading goes OOM even though I'm using the streaming engine.
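
Roughly this shape, with a placeholder predicate standing in for the real filters:

import polars as pl

# The real dataset is many GB; the filter (placeholder here) selects a subset
# that comfortably fits in memory, yet the streaming collect still goes OOM.
df = (
    pl.scan_parquet("df")
    .filter(pl.col("idx") < 200_000)
    .collect(streaming=True)
)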

stinodego (Member) commented:

Closing this in favor of the linked bug report.
