
Parallel decompression of FASTQ files generated by seqkit #443

Open
divonlan opened this issue Mar 9, 2024 · 5 comments
@divonlan

divonlan commented Mar 9, 2024

Greetings,

I am the author of Genozip (www.genozip.com) - a software package for compressing FASTQ / BAM / VCF.

When compressing a FASTQ file, Genozip first inflates the existing gzip compression before re-compressing the data with Genozip. Many FASTQ files are compressed with BGZF (bgzip) - BGZF is essentially a concatenation of 64 KB gzip blocks, with the crucial property of populating the gzip optional Extra field with the length of the compressed block. This allows downstream applications - htslib, aligners, and also Genozip - to fan out the gzip blocks to multiple threads for parallel decompression, since the length of a gzip block is known without decompressing it.
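For reference, here is a minimal Go sketch (not taken from Genozip or htslib) of how a reader can recover the compressed block length from a BGZF header's Extra field, following the layout described in the SAM spec:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bgzfBlockSize scans a gzip header's Extra payload (as exposed by
// compress/gzip's Header.Extra) for the BGZF "BC" subfield and returns
// the total compressed size of the block. Per the SAM spec: SI1='B',
// SI2='C', SLEN=2, and BSIZE holds (block length - 1) as a uint16 LE.
func bgzfBlockSize(extra []byte) (int, bool) {
	for len(extra) >= 4 {
		si1, si2 := extra[0], extra[1]
		slen := int(binary.LittleEndian.Uint16(extra[2:4]))
		if len(extra) < 4+slen {
			break
		}
		if si1 == 'B' && si2 == 'C' && slen == 2 {
			bsize := binary.LittleEndian.Uint16(extra[4:6])
			return int(bsize) + 1, true // BSIZE stores (block length - 1)
		}
		extra = extra[4+slen:] // skip to the next subfield
	}
	return 0, false
}

func main() {
	// Extra payload of a BGZF block whose total compressed size is 16384 bytes.
	extra := []byte{'B', 'C', 2, 0, 0xFF, 0x3F}
	if n, ok := bgzfBlockSize(extra); ok {
		fmt.Printf("compressed block length: %d bytes\n", n)
	}
}
```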

I have recently encountered a FASTQ file that I think was generated by seqkit - it appears to be a concatenation of gzip blocks, each containing, when uncompressed, 1 MB of FASTQ data - the hallmark of pgzip compression. Unfortunately, since there is no information about the length of each compressed block, Genozip - as well as any other application - cannot parallelize its decompression.

To solve this issue, I would like to suggest that you add an Extra subfield containing the compressed length of the block. As a reference, on page 13 of the SAM specification you can see how this is done in BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf. An alternative solution would be switching from pgzip to bgzf.
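Concretely, such a subfield could mirror BGZF's 6-byte layout. A sketch - the 'S','K' subfield ID and the exact encoding are hypothetical choices, not anything seqkit currently writes:

```go
package sketch

import "encoding/binary"

// buildExtra builds a gzip Extra payload carrying the compressed length
// of one block, modeled on BGZF's "BC" subfield. The 'S','K' identifier
// is hypothetical; BGZF itself uses 'B','C' and stores (length - 1).
func buildExtra(compressedLen int) []byte {
	extra := make([]byte, 6)
	extra[0], extra[1] = 'S', 'K'                // subfield identifier (hypothetical)
	binary.LittleEndian.PutUint16(extra[2:4], 2) // SLEN: 2 payload bytes follow
	binary.LittleEndian.PutUint16(extra[4:6], uint16(compressedLen-1))
	return extra
}
```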

Thanks,

-divon

@shenwei356
Owner

Hi Divon, I'm not actually familiar with the gzip format, but it looks simple at first glance. I'd like to continue using pgzip, because it's fast. You said pgzip writes blocks of 1 MB, and that seems to be true. If it does write blocks of gzip data, we might just fork it and add the Extra field to include the block size, right?
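For context, this is roughly what writing with pgzip looks like (a sketch, not seqkit's actual code); the 1 MB block size shown is pgzip's default, set explicitly here via SetConcurrency:

```go
package main

import (
	"os"
	"runtime"

	"github.com/klauspost/pgzip"
)

func main() {
	out, err := os.Create("reads.fastq.gz") // hypothetical output path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := pgzip.NewWriter(out)
	// pgzip splits the input into blocks (1 MB of uncompressed data by
	// default) and compresses them on parallel goroutines.
	if err := w.SetConcurrency(1<<20, runtime.GOMAXPROCS(0)); err != nil {
		panic(err)
	}
	defer w.Close()

	if _, err := w.Write([]byte("@read1\nACGT\n+\nFFFF\n")); err != nil {
		panic(err)
	}
}
```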

@divonlan
Author

divonlan commented Mar 9, 2024

It might be even simpler than that - I think the pgzip API allows specifying the "Extra" field. The catch is that the compressed block length is only known after compression, so the gzip header needs to be finalized after compression but before writing the block to disk. I'm not sure whether that is possible with the current pgzip API or whether it would require some kind of hack. In addition to the block length itself, I recommend starting the Extra data with a ~4-byte "magic" number that would allow identifying that this is seqkit data.
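A sketch of where the field plugs in, assuming pgzip mirrors compress/gzip's Header.Extra (it appears to). The 4-byte magic is hypothetical, and note the catch above: the header, including Extra, is emitted on the first Write, i.e. before any compressed length is known, so per-block lengths would need a fork that finalizes each block header after compressing that block:

```go
package main

import (
	"os"

	"github.com/klauspost/pgzip"
)

func main() {
	out, err := os.Create("reads.fastq.gz") // hypothetical output path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := pgzip.NewWriter(out)
	defer w.Close()

	// Hypothetical 4-byte magic identifying seqkit output; must be set
	// before the first Write, since that is when the header is written.
	w.Header.Extra = []byte{'S', 'K', 0x01, 0x00}

	if _, err := w.Write([]byte("@read1\nACGT\n+\nFFFF\n")); err != nil {
		panic(err)
	}
}
```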

@shenwei356
Owner

> I have recently encountered a FASTQ file that I think was generated by seqkit - it appears to be a concatenation of gzip blocks, each containing, when uncompressed, 1 MB of FASTQ data - the hallmark of pgzip compression.

After reading pgzip's code, I'm sure it only writes a standard gzip file, not a concatenation of multiple gzip blocks. 1 MB is just the default block size used internally during compression.
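One quick way to check this from the outside (a sketch, assuming a pgzip-written file at a hypothetical path): the standard library's gzip reader, told not to cross member boundaries, still decompresses the whole file if it is a single standard gzip stream, whereas a BGZF-style concatenation would stop after the first ~64 KB block:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("reads.fastq.gz") // hypothetical pgzip-written file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r, err := gzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	// Stop at the end of the first gzip member instead of silently
	// continuing into concatenated members.
	r.Multistream(false)

	n, err := io.Copy(io.Discard, r)
	if err != nil {
		panic(err)
	}
	fmt.Printf("decompressed %d bytes from the first gzip member\n", n)
}
```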

> I'm not sure whether that is possible with the current pgzip API or whether it would require some kind of hack.

It would need to buffer all the data in RAM.

@klauspost

pgzip uses concatenated deflate blocks. Blocks back-reference previous blocks, which is why there is no practical compression loss. Therefore it is not possible to decompress these blocks independently, even if you know their offsets.

There is an ancient bgzf implementation. In principle you can fork it and replace the imports to use the faster ones. I haven't really experimented with it for a while - but that seems like a possible solution.
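If bgzf is the route taken, one long-standing Go implementation is github.com/biogo/hts/bgzf (possibly, though not certainly, the one meant above). A minimal writer sketch:

```go
package main

import (
	"os"

	"github.com/biogo/hts/bgzf"
)

func main() {
	out, err := os.Create("reads.fastq.gz") // hypothetical output path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Second argument is the number of concurrent compression workers.
	// BGZF writes ~64 KB gzip blocks whose compressed length is recorded
	// in each header's Extra field, so readers can decompress in parallel.
	w := bgzf.NewWriter(out, 4)
	defer w.Close()

	if _, err := w.Write([]byte("@read1\nACGT\n+\nFFFF\n")); err != nil {
		panic(err)
	}
}
```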

@divonlan
Author

Personally, I would advocate for bgzf, as it would be compatible with virtually all bioinformatics tools that are capable of multi-threading, while decompression of current seqkit-generated files cannot be parallelized.
