Parallel decompression of FASTQ files generated by seqkit #443
Comments
Hi Divon, I'm not familiar with the gzip format, actually, but it looks simple at first glance. First, I'd like to continue using pgzip because it's fast. You said pgzip writes blocks of 1 MB, and that seems true. If it does write blocks of gzip data, we might just fork it and add the Extra field to include the block size, right?
It might be even simpler than that - I think the pgzip API allows specifying the "Extra" field. The catch is that the compressed block length is only known after compression, so the gzip header needs to be finalized after compression but before writing the block to disk. I'm not sure whether that is possible with the current pgzip API or whether it would require some kind of hack. In addition to the block length itself, I recommend starting the Extra data with a ~4-byte "magic" number that would allow identifying that this is seqkit data.
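For illustration, here is a minimal sketch of what such an Extra subfield could look like at the byte level, following the RFC 1952 layout (SI1, SI2, 2-byte little-endian LEN, then LEN bytes of data). The subfield IDs, magic value, and helper name below are made up, and the compressed length would have to be filled in only after compression finishes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildExtraSubfield is a hypothetical helper (not part of pgzip): it lays out
// one RFC 1952 Extra subfield whose payload is a 4-byte magic number followed
// by the compressed block length as a little-endian uint32.
func buildExtraSubfield(si1, si2 byte, magic [4]byte, compressedLen uint32) []byte {
	payload := make([]byte, 0, 8)
	payload = append(payload, magic[:]...)
	payload = binary.LittleEndian.AppendUint32(payload, compressedLen)

	sub := make([]byte, 0, 4+len(payload))
	sub = append(sub, si1, si2)                                       // subfield IDs
	sub = binary.LittleEndian.AppendUint16(sub, uint16(len(payload))) // LEN
	return append(sub, payload...)
}

func main() {
	// Placeholder IDs and magic; real values would need to avoid clashing with
	// registered subfield IDs such as BGZF's 'B','C'.
	extra := buildExtraSubfield('S', 'K', [4]byte{'s', 'q', 'k', '1'}, 123456)
	fmt.Printf("% x\n", extra) // bytes that would go into the gzip header's Extra field
}
```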
After reading pgzip's code, I'm sure it only writes a standard gzip file, not a concatenation of multiple gzip blocks. 1 MB is just the default block size for compression.
It would need to buffer all the data in RAM.
pgzip uses concatenated deflate blocks. Blocks back-reference previous blocks, which is why there is no practical compression loss. Therefore it is not possible to decompress these blocks independently even if you know the offsets. There is an ancient bgzf implementation. In principle you can fork it and replace the imports to use the faster ones. I haven't really experimented with it for a while - but that seems like a possible solution.
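For what it's worth, writing through a Go bgzf package would look roughly like this (untested sketch; I'm assuming the biogo/hts/bgzf API here, and a fork with faster compression imports swapped in should keep the same shape):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/biogo/hts/bgzf"
)

func main() {
	// Re-compress stdin to BGZF on stdout, using 4 compression goroutines.
	// The output is a series of small gzip blocks whose Extra field records
	// each block's compressed size, so readers can parallelize decompression.
	w := bgzf.NewWriter(os.Stdout, 4)
	if _, err := io.Copy(w, os.Stdin); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```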
Personally, I would advocate for bgzf, as it would be compatible with virtually all bioinformatics tools that are capable of multi-threading, while decompression of current seqkit-generated files cannot be parallelized.
Greetings,
I am the author of Genozip (www.genozip.com) - a software package for compressing FASTQ / BAM / VCF.
When compressing a FASTQ file, Genozip first inflates the gzip compression before re-compressing with genozip. Many FASTQ files are compressed with BGZF (bgzip) - BGZF is essentially a concatenation of 64 KB gzip blocks, with the crucial property of populating the gzip optional Extra field with the length of the compressed block. This allows downstream applications, such as htslib, aligners, and Genozip, to fan out the gzip blocks to multiple threads for parallel decompression, since it is possible to know the length of a gzip block without decompressing it.
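To make that concrete, here is a small sketch (my own function name and error handling) of how a reader learns the compressed size of a BGZF block from the 'BC' Extra subfield alone, without inflating anything, which is what makes fanning blocks out to worker threads possible:

```go
package main

import (
	"encoding/binary"
	"errors"
)

// bgzfBlockSize returns the total compressed size of the BGZF block whose gzip
// header starts at hdr, by locating the 'BC' subfield in the Extra data and
// reading BSIZE (total block size minus one), as described in the SAM spec.
func bgzfBlockSize(hdr []byte) (int, error) {
	// 12 fixed gzip header bytes; FLG must have FEXTRA (0x04) set.
	if len(hdr) < 12 || hdr[0] != 0x1f || hdr[1] != 0x8b || hdr[3]&0x04 == 0 {
		return 0, errors.New("not a gzip member with an Extra field")
	}
	xlen := int(binary.LittleEndian.Uint16(hdr[10:12]))
	if len(hdr) < 12+xlen {
		return 0, errors.New("truncated header")
	}
	extra := hdr[12 : 12+xlen]
	for len(extra) >= 4 {
		slen := int(binary.LittleEndian.Uint16(extra[2:4]))
		if 4+slen > len(extra) {
			break
		}
		if extra[0] == 'B' && extra[1] == 'C' && slen == 2 {
			return int(binary.LittleEndian.Uint16(extra[4:6])) + 1, nil
		}
		extra = extra[4+slen:] // skip other subfields
	}
	return 0, errors.New("no BC subfield: block length unknown without decompressing")
}
```

With the block size in hand, a decompressor can seek directly to the next block and hand each block to a separate thread.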
I have recently encountered a FASTQ file that I think was generated by seqkit - it appears to be a concatenation of gzip blocks, each containing, when uncompressed, 1 MB of FASTQ data - the hallmark of pgzip compression. Unfortunately, since there is no information about the length of a compressed block, neither Genozip nor any other application can parallelize decompressing it.
To solve this issue, I would like to suggest that you add an Extra subfield containing the compressed length of the block. As a reference, on page 13 of the SAM specification you can see how this is done in BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf. An alternative solution would be switching from pgzip to bgzf.
Thanks,
-divon