
How to read a Parquet dataset of memory size > 30% of available memory? #9888

Closed
jonashaag opened this issue Jul 14, 2023 · 6 comments

jonashaag (Contributor) commented Jul 14, 2023

Research

  • I have searched the above Polars tags on Stack Overflow for similar questions.

  • I have asked my usage-related question on Stack Overflow.

Link to question on Stack Overflow

No response

Question about Polars

I wonder how to read a Parquet dataset whose final Polars frame will occupy most of the available memory. So far I'm unable to read anything larger than about 30% of available memory.

To reproduce: same toy dataset as in #9887

>>> import polars as pl
>>> N = 1_000_000
>>> df = pl.DataFrame({"idx": range(N), **{f"col{i}": [f"{i}abc{j}" for j in range(N)] for i in range(50)}})
>>> df.with_columns(idx=df["idx"].shuffle()).write_parquet("df", row_group_size=10_000)
>>> df.estimated_size()/1e6
942.4449

I want to read parts of this in an environment constrained to 512 MiB of memory.

I can do this:

>>> # Note that streaming=True is required because of #9887
>>> df = pl.scan_parquet("df").limit(200_000).collect(streaming=True)
>>> df.estimated_size()/1e6
184.0685

Trying to read 210_000 rows goes OOM, although the final size would be only about 5% larger.

I understand there is a need to hold more than the final size in memory during Parquet reading. What are some levers to optimise the memory consumption in this case?

mishpat (Contributor) commented Jul 14, 2023

(a) Parquet files are compressed, so the on-disk size is typically much smaller than the in-memory size. I don't think this is exactly what you are asking about, but it's good to keep in mind: I was just looking at a file that's 70 MB on disk and 600 MB in memory, so the compression can be extremely effective.
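
For illustration, a minimal way to compare the two sizes, assuming the toy "df" file written above (it loads the full file just to measure, so the numbers are only indicative):

import os
import polars as pl

# Compressed size on disk vs. decompressed size once loaded into memory.
on_disk_mb = os.path.getsize("df") / 1e6
in_memory_mb = pl.read_parquet("df").estimated_size() / 1e6
print(f"on disk: {on_disk_mb:.1f} MB, in memory: {in_memory_mb:.1f} MB")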

(b) If you have a dataset that will never fit in memory, you will have to restrict yourself to the streaming engine and finish with an ldf.sink_* call (sink_parquet or sink_ipc), as sketched below.
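
A minimal sketch of that pattern, assuming the toy "df" file from above; the filter is just an example predicate and the output path is arbitrary:

import polars as pl

# Stream the scan and write the result straight back to disk,
# so the full frame is never materialized in memory.
(
    pl.scan_parquet("df")
    .filter(pl.col("idx") < 200_000)
    .sink_parquet("subset.parquet")
)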

jonashaag changed the title from "How to read a Parquet dataset of size > 30% of available memory?" to "How to read a Parquet dataset of memory size > 30% of available memory?" on Jul 15, 2023
jonashaag (Contributor, Author) commented:

Thanks, I clarified the title. I'm looking to read a dataset that fits in memory, but reading it goes OOM.

ritchie46 (Member) commented:

> Thanks, I clarified the title. I'm looking to read a dataset that fits in memory, but reading it goes OOM.

It is not very useful if your dataset takes up all memory; we need auxiliary memory to do operations. Try the streaming API, which can spill to disk if needed.

jonashaag (Contributor, Author) commented:

My example uses the streaming API and still goes OOM; the most I can load with the streaming API is about 30% of memory.

jonashaag (Contributor, Author) commented:

Note that this is a reduced example. In my real-world problem the dataset is many GB in memory and I'm doing more filtering before collect(). The filtered dataset easily fits into memory, but loading goes OOM even though I'm using the streaming engine.
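
Roughly this shape, with a placeholder predicate standing in for the real filters:

import polars as pl

# The real dataset is many GB; the filter (placeholder here) selects a subset
# that comfortably fits in memory, yet the streaming collect still goes OOM.
df = (
    pl.scan_parquet("df")
    .filter(pl.col("idx") < 200_000)
    .collect(streaming=True)
)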

stinodego (Member) commented:

Closing this in favor of the linked bug report.
