warc2text tries to read large warc records into memory, causing OOM #20

Open
jelmervdl opened this issue Jan 20, 2021 · 1 comment

@jelmervdl
Member

When trying to run warc2text on this massive (29 GB) file on cirrus, /beegfs/paracrawl/data/ia/wide00015-warcs/WIDE-20170107025349-crawl808/WIDE-20170107025349-02068.warc.gz, the process is killed by the OOM killer.

I know this is a design choice, and it's not really high priority, but it would be nice if there were a way to deal with this without skipping the whole WARC file. For example:

  1. Partially read a record so we can parse the header and decide whether it is of interest; if not, skip over the rest of the record instead of trying to store it in memory.
  2. Or, more easily, have WARCReader::getRecord skip a record if, while reading and decompressing, it starts to become too large (see the sketch below).
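
A minimal sketch of the second option, assuming per-record gzip compression and using zlib directly; the function name readRecordCapped and the constant kMaxRecordSize are illustrative, not part of the actual warc2text code:

```cpp
// Illustrative sketch only: inflate a gzipped WARC record but abort as soon as
// the uncompressed output exceeds a cap, so oversized records are skipped
// instead of being buffered whole in memory.
#include <zlib.h>
#include <cstring>
#include <string>

static const size_t kMaxRecordSize = 20 * 1024 * 1024;  // hypothetical 20 MB cap

// Returns false if the record was skipped (too large) or the stream was invalid.
bool readRecordCapped(const char* compressed, size_t compressedSize, std::string& out) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK)   // 16 + MAX_WBITS: expect gzip wrapper
        return false;

    strm.next_in = reinterpret_cast<Bytef*>(const_cast<char*>(compressed));
    strm.avail_in = static_cast<uInt>(compressedSize);

    char buf[64 * 1024];
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        strm.next_out = reinterpret_cast<Bytef*>(buf);
        strm.avail_out = sizeof(buf);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            break;                                    // corrupt or truncated stream
        out.append(buf, sizeof(buf) - strm.avail_out);
        if (out.size() > kMaxRecordSize) {            // record grew past the cap: skip it
            inflateEnd(&strm);
            out.clear();
            return false;
        }
    }
    inflateEnd(&strm);
    return ret == Z_STREAM_END;
}
```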
zuny26 added a commit that referenced this issue Jan 21, 2021
@zuny26
Collaborator

zuny26 commented Jan 21, 2021

I have implemented the easy fix for now, but it is definitely a good idea to refactor in the future and parse the WARC header before reading the entire body.
I'm not sure what the maximum record size should be (I set it to 20 MB), so feel free to change that value.
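
For the longer-term refactor (parsing the WARC header before buffering the body), a rough sketch of the idea could look like this. It only inspects the header block and extracts the declared Content-Length so a caller can skip the body when the record is too large or not of interest; declaredContentLength, kMaxRecordSize, and the helper names in the pseudocode comment are hypothetical, not the existing warc2text API:

```cpp
// Illustrative sketch only: read the WARC header block (everything up to the
// blank line) and extract the declared body size, so the caller can decide
// to skip the body without ever buffering it in memory.
#include <cstdlib>
#include <sstream>
#include <string>

static const size_t kMaxRecordSize = 20 * 1024 * 1024;  // hypothetical 20 MB cap

// Return the Content-Length declared in the header block, or 0 if absent.
// (WARC field names are case-insensitive; this simple check assumes the usual casing.)
size_t declaredContentLength(const std::string& headerBlock) {
    std::istringstream in(headerBlock);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("Content-Length:", 0) == 0)
            return std::strtoull(line.c_str() + 15, nullptr, 10);
    }
    return 0;
}

// Caller-side idea (pseudocode): read only the header, then either skip the
// declared number of body bytes or read them, instead of always buffering:
//
//   std::string header = readUntilBlankLine(input);
//   size_t len = declaredContentLength(header);
//   if (len > kMaxRecordSize || !isInterestingRecordType(header))
//       skipBytes(input, len);          // discard body without storing it
//   else
//       record.body = readBytes(input, len);
```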
