warc2text tries to read large warc records into memory, causing OOM #20

Open
jelmervdl opened this issue Jan 20, 2021 · 1 comment

@jelmervdl
Member

When trying to run warc2text on this massive (29 GB) file on cirrus, /beegfs/paracrawl/data/ia/wide00015-warcs/WIDE-20170107025349-crawl808/WIDE-20170107025349-02068.warc.gz, the process is killed by the OOM killer.

I know this is a design choice, and it's not really high priority, but it would be nice if there were a way to deal with this without skipping the whole WARC file. For example:

  1. Partially read a record so we can parse the header and decide whether it is of interest; if not, skip over the rest of the record instead of trying to store it in memory.
  2. Or, more easily, have WARCReader::getRecord skip a record if, while reading and decompressing, it starts to become too large (see the sketch below).
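
A minimal sketch of the second option, assuming per-record gzip compression and using zlib directly; the function name readRecordCapped and the constant kMaxRecordSize are illustrative, not part of the actual warc2text code:

```cpp
// Illustrative sketch only: inflate a gzipped WARC record but abort as soon as
// the uncompressed output exceeds a cap, so oversized records are skipped
// instead of being buffered whole in memory.
#include <zlib.h>
#include <cstring>
#include <string>

static const size_t kMaxRecordSize = 20 * 1024 * 1024;  // hypothetical 20 MB cap

// Returns false if the record was skipped (too large) or the stream was invalid.
bool readRecordCapped(const char* compressed, size_t compressedSize, std::string& out) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK)   // 16 + MAX_WBITS: expect gzip wrapper
        return false;

    strm.next_in = reinterpret_cast<Bytef*>(const_cast<char*>(compressed));
    strm.avail_in = static_cast<uInt>(compressedSize);

    char buf[64 * 1024];
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        strm.next_out = reinterpret_cast<Bytef*>(buf);
        strm.avail_out = sizeof(buf);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END)
            break;                                    // corrupt or truncated stream
        out.append(buf, sizeof(buf) - strm.avail_out);
        if (out.size() > kMaxRecordSize) {            // record grew past the cap: skip it
            inflateEnd(&strm);
            out.clear();
            return false;
        }
    }
    inflateEnd(&strm);
    return ret == Z_STREAM_END;
}
```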
zuny26 added a commit that referenced this issue Jan 21, 2021
@zuny26
Collaborator

zuny26 commented Jan 21, 2021

I have implemented the easy fix for now, but it is definitely a good idea to refactor in the future and parse the WARC header before reading the entire body.
I'm not sure what the maximum record size should be (I set it to 20 MB), so feel free to change that value.
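
For the longer-term refactor (parsing the WARC header before buffering the body), a rough sketch of the idea could look like this. It only inspects the header block and extracts the declared Content-Length so a caller can skip the body when the record is too large or not of interest; declaredContentLength, kMaxRecordSize, and the helper names in the pseudocode comment are hypothetical, not the existing warc2text API:

```cpp
// Illustrative sketch only: read the WARC header block (everything up to the
// blank line) and extract the declared body size, so the caller can decide
// to skip the body without ever buffering it in memory.
#include <cstdlib>
#include <sstream>
#include <string>

static const size_t kMaxRecordSize = 20 * 1024 * 1024;  // hypothetical 20 MB cap

// Return the Content-Length declared in the header block, or 0 if absent.
// (WARC field names are case-insensitive; this simple check assumes the usual casing.)
size_t declaredContentLength(const std::string& headerBlock) {
    std::istringstream in(headerBlock);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("Content-Length:", 0) == 0)
            return std::strtoull(line.c_str() + 15, nullptr, 10);
    }
    return 0;
}

// Caller-side idea (pseudocode): read only the header, then either skip the
// declared number of body bytes or read them, instead of always buffering:
//
//   std::string header = readUntilBlankLine(input);
//   size_t len = declaredContentLength(header);
//   if (len > kMaxRecordSize || !isInterestingRecordType(header))
//       skipBytes(input, len);          // discard body without storing it
//   else
//       record.body = readBytes(input, len);
```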
