How to best optimize reading from S3? #278
Comments
Hi @stevbear, optimizing reads from S3 has been on my list for a while and I just hadn't gotten around to it. It didn't get prioritized because no one had filed any issues about it, so thank you for filing this!
This is likely because the default buffer size is 16KB; have you tried increasing the BufferSize? I have a few ideas to further optimize via pre-buffering (which all have different trade-offs), so can you give me a bit more context to make sure your use case would be helped and to identify which idea would work best for you? Specifically:

- If reading an entire column for a single row group gives you OOM, you either have a significantly large row group, or I'm guessing it's string data with a lot of large strings?
- That leads to the question of what you're doing with the column data after you read it from the row group. If you can't hold the entire column from a single row group in memory, are you streaming the data somewhere?
- Are you reading only a single column at a time, or multiple columns from the row group?
- Can you give me more of an idea of the column / row group sizes in the file and the memory limitations of your system?
- Is the issue the copy that happens when decoding/decompressing the column data?

The more information the better, so we can figure out a good solution here. It also gives me the opportunity to improve the memory usage of the parquet package like I've been wanting to! 😄
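For reference, a minimal sketch of bumping the buffer size while keeping buffered-stream mode on (assuming the Go Arrow parquet packages; the module path/version below is illustrative, and a local file stands in for whatever S3-backed reader you use):

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v14/arrow/memory"
	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/file"
)

func main() {
	f, err := os.Open("example.parquet") // any ReaderAt/Seeker works, e.g. one backed by S3
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Keep buffered-stream mode on so column chunks aren't materialized
	// all at once, but raise the buffer well above the 16KB default so
	// each underlying read covers many pages.
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	props.BufferedStreamEnabled = true
	props.BufferSize = 8 * 1024 * 1024 // 8MB per read; tune to your page sizes

	rdr, err := file.NewParquetReader(f, file.WithReadProps(props))
	if err != nil {
		panic(err)
	}
	defer rdr.Close()
	_ = rdr // read row groups / columns from rdr as usual
}
```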
Hi @zeroshade, thanks for getting back to me.
The BufferSize matters because it controls how much data the underlying buffered reader pulls from the source per read: with BufferedStreamEnabled set to true, each read fetches up to BufferSize bytes, so a buffer larger than your pages means several pages come back in a single request rather than one request per page.

Assuming the BufferSize is set comfortably above your page sizes, you should already be getting multiple pages per read without holding an entire column chunk in memory. Does that make sense?
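To make the streaming point concrete, here is a hedged sketch of reading one column of one row group in bounded batches. It assumes a *file.Reader (rdr) opened as in the earlier snippet, the same imports plus "fmt", an INT64 physical column at index 0, and a caller-supplied process function; none of those specifics come from this thread:

```go
// streamColumn reads column 0 of row group 0 in fixed-size batches so
// memory stays bounded; the buffered stream decides how much is fetched
// from storage per underlying read.
func streamColumn(rdr *file.Reader, process func([]int64)) error {
	rg := rdr.RowGroup(0)
	col, err := rg.Column(0)
	if err != nil {
		return err
	}
	int64Col, ok := col.(*file.Int64ColumnChunkReader) // assumes INT64 physical type
	if !ok {
		return fmt.Errorf("column 0 is not INT64")
	}

	values := make([]int64, 64*1024) // bounded scratch buffers, reused per batch
	defLvls := make([]int16, 64*1024)
	for int64Col.HasNext() {
		_, n, err := int64Col.ReadBatch(int64(len(values)), values, defLvls, nil)
		if err != nil {
			return err
		}
		process(values[:n]) // hand each bounded batch to the caller's sink
	}
	return nil
}
```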
Thanks! That makes complete sense!
That's where the current problem is. As it currently stands, there isn't a good way to actually use the column and offset indexes when reading pages: there's no way to tell the page reader to skip an entire page, or to use the page location / offset information to jump to a specific page. It would be helpful if you had an example of your use case scenario, and we could potentially work together to figure out what a good new API would look like for leveraging the indexes to skip pages, etc.
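Purely as a strawman for that API discussion, here is one shape such an interface could take. This is hypothetical; none of these methods exist in the library today:

```go
// Hypothetical only: a possible shape for an offset/column-index aware
// page reader. No method below exists in the parquet package today.
type IndexedPageReader interface {
	// SeekToRow would use the offset index to jump directly to the page
	// containing rowIdx, issuing one ranged read for that page.
	SeekToRow(rowIdx int64) error
	// SkipPages would advance past n whole pages without reading or
	// decompressing their data.
	SkipPages(n int) error
	// Next would behave like the existing page iteration, returning
	// false when the column chunk is exhausted.
	Next() bool
}
```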
I see. Thanks for that explanation.
I started implementing this a bit and realized that when you start dealing with repeated columns, a record can span page boundaries, which complicates skipping whole pages at the low-level column reader. It makes more sense with the Arrow column and record readers in the pqarrow package.
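For the record-oriented path mentioned here, a hedged sketch of reading selected columns and row groups through pqarrow (assuming a *file.Reader named rdr, imports of context, arrow, memory, file, and parquet/pqarrow from the same module as above, and placeholder column/row-group indices):

```go
// readSelected streams columns 0 and 2 of row group 3 as Arrow records
// in bounded batches; the record readers handle repeated columns whose
// records span page boundaries.
func readSelected(ctx context.Context, rdr *file.Reader, consume func(arrow.Record)) error {
	arrowRdr, err := pqarrow.NewFileReader(rdr,
		pqarrow.ArrowReadProperties{BatchSize: 64 * 1024}, memory.DefaultAllocator)
	if err != nil {
		return err
	}

	recRdr, err := arrowRdr.GetRecordReader(ctx, []int{0, 2}, []int{3})
	if err != nil {
		return err
	}
	defer recRdr.Release()

	for recRdr.Next() {
		consume(recRdr.Record()) // record is only valid until the next call to Next()
	}
	return nil
}
```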
Describe the usage question you have. Please include as many useful details as possible.
Hi!
I have a use case of reading certain row groups from S3.
I see that there is an option BufferedStreamEnabled.
When I set BufferedStreamEnabled to false, it seems to try to read all of the data of a column for a row group at once, which will, unfortunately, result in OOM for us.
When I set BufferedStreamEnabled to true, the library seems to be reading the row group page by page, which is not optimal for cloud usage.
How can I improve this? I imagine the best approach would be to read multiple pages in a single read() syscall?
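For reference, the reads in question end up as byte-range requests against the object. A minimal sketch of the kind of S3-backed reader involved (assuming the AWS SDK for Go v2; the type name, fields, and error handling are illustrative, not exact code from our setup):

```go
package s3io

import (
	"context"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// s3Reader maps each ReadAt call to a single ranged GetObject request,
// so a large reader-side buffer turns into one S3 request that covers
// many pages instead of one request per page.
type s3Reader struct {
	client      *s3.Client
	bucket, key string
	size        int64 // object size, e.g. from a prior HeadObject call
	pos         int64
}

func (r *s3Reader) ReadAt(p []byte, off int64) (int, error) {
	out, err := r.client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String(r.bucket),
		Key:    aws.String(r.key),
		Range:  aws.String(fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1)),
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()
	n, err := io.ReadFull(out.Body, p)
	if err == io.ErrUnexpectedEOF { // short read at the end of the object
		err = io.EOF
	}
	return n, err
}

func (r *s3Reader) Read(p []byte) (int, error) {
	n, err := r.ReadAt(p, r.pos)
	r.pos += int64(n)
	return n, err
}

func (r *s3Reader) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		r.pos = offset
	case io.SeekCurrent:
		r.pos += offset
	case io.SeekEnd:
		r.pos = r.size + offset
	}
	return r.pos, nil
}
```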
Component(s)
Parquet