Parallel decompression of FASTQ files generated by seqkit #443
Comments
Hi Divon, I'm not familiar with the gzip format, actually, but it looks simple at first glance. First, I'd like to continue using pgzip because it's fast. You said pgzip writes blocks of 1 MB, and that seems true. If it does write blocks of gzip data, we might just fork it and add the Extra field to include the block size, right?
It might be even simpler than that - I think the pgzip API allows specifying the "Extra" field. The catch is that the compressed block length is only known after compression, so the gzip header needs to be finalized after compression but before writing the block to disk. I'm not sure whether that is possible with the current pgzip API or whether it would require some kind of hack. In addition to the block length itself, I recommend starting the Extra data with a ~4-byte "magic" number that would allow identifying that this is seqkit data.
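For illustration, here is a minimal sketch of what such an Extra subfield could look like at the byte level, following the RFC 1952 layout (SI1, SI2, 2-byte little-endian LEN, then LEN bytes of data). The subfield IDs, magic value, and helper name below are made up, and the compressed length would have to be filled in only after compression finishes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// buildExtraSubfield is a hypothetical helper (not part of pgzip): it lays out
// one RFC 1952 Extra subfield whose payload is a 4-byte magic number followed
// by the compressed block length as a little-endian uint32.
func buildExtraSubfield(si1, si2 byte, magic [4]byte, compressedLen uint32) []byte {
	payload := make([]byte, 0, 8)
	payload = append(payload, magic[:]...)
	payload = binary.LittleEndian.AppendUint32(payload, compressedLen)

	sub := make([]byte, 0, 4+len(payload))
	sub = append(sub, si1, si2)                                       // subfield IDs
	sub = binary.LittleEndian.AppendUint16(sub, uint16(len(payload))) // LEN
	return append(sub, payload...)
}

func main() {
	// Placeholder IDs and magic; real values would need to avoid clashing with
	// registered subfield IDs such as BGZF's 'B','C'.
	extra := buildExtraSubfield('S', 'K', [4]byte{'s', 'q', 'k', '1'}, 123456)
	fmt.Printf("% x\n", extra) // bytes that would go into the gzip header's Extra field
}
```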
After reading pgzip's code, I'm sure it only writes a standard gzip file, not a concatenation of multiple gzip blocks. 1 MB is just the default block size for compression.
It would need to buffer all the data in RAM.
pgzip uses concatenated deflate blocks. Blocks back-reference previous blocks, which is why there is no practical compression loss. Therefore it is not possible to decompress these blocks independently even if you know the offsets. There is an ancient bgzf implementation. In principle you can fork it and replace the imports to use the faster ones. I haven't really experimented with it for a while - but that seems like a possible solution.
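For what it's worth, writing through a Go bgzf package would look roughly like this (untested sketch; I'm assuming the biogo/hts/bgzf API here, and a fork with faster compression imports swapped in should keep the same shape):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/biogo/hts/bgzf"
)

func main() {
	// Re-compress stdin to BGZF on stdout, using 4 compression goroutines.
	// The output is a series of small gzip blocks whose Extra field records
	// each block's compressed size, so readers can parallelize decompression.
	w := bgzf.NewWriter(os.Stdout, 4)
	if _, err := io.Copy(w, os.Stdin); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```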
Personally, I would advocate for bgzf, as it would be compatible with virtually all bioinformatics tools that are capable of multi-threading, while decompression of current seqkit-generated files cannot be parallelized.
Greetings,
I am the author of Genozip (www.genozip.com) - a software package for compressing FASTQ / BAM / VCF.
When compressing a FASTQ file, Genozip first inflates the gzip compression before re-compressing with genozip. Many FASTQ files are compressed with BGZF (bgzip) - BGZF is essentially a concatenation of 64 KB gzip blocks, with the crucial property of populating the gzip optional Extra field with the length of the compressed block. This allows downstream applications, such as htslib, aligners, and Genozip, to fan out the gzip blocks to multiple threads for parallel decompression, since it is possible to know the length of a gzip block without decompressing it.
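To make that concrete, here is a small sketch (my own function name and error handling) of how a reader learns the compressed size of a BGZF block from the 'BC' Extra subfield alone, without inflating anything, which is what makes fanning blocks out to worker threads possible:

```go
package main

import (
	"encoding/binary"
	"errors"
)

// bgzfBlockSize returns the total compressed size of the BGZF block whose gzip
// header starts at hdr, by locating the 'BC' subfield in the Extra data and
// reading BSIZE (total block size minus one), as described in the SAM spec.
func bgzfBlockSize(hdr []byte) (int, error) {
	// 12 fixed gzip header bytes; FLG must have FEXTRA (0x04) set.
	if len(hdr) < 12 || hdr[0] != 0x1f || hdr[1] != 0x8b || hdr[3]&0x04 == 0 {
		return 0, errors.New("not a gzip member with an Extra field")
	}
	xlen := int(binary.LittleEndian.Uint16(hdr[10:12]))
	if len(hdr) < 12+xlen {
		return 0, errors.New("truncated header")
	}
	extra := hdr[12 : 12+xlen]
	for len(extra) >= 4 {
		slen := int(binary.LittleEndian.Uint16(extra[2:4]))
		if 4+slen > len(extra) {
			break
		}
		if extra[0] == 'B' && extra[1] == 'C' && slen == 2 {
			return int(binary.LittleEndian.Uint16(extra[4:6])) + 1, nil
		}
		extra = extra[4+slen:] // skip other subfields
	}
	return 0, errors.New("no BC subfield: block length unknown without decompressing")
}
```

With the block size in hand, a decompressor can seek directly to the next block and hand each block to a separate thread.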
I have recently encountered a FASTQ file that I think was generated by seqkit - it appears to be a concatenation of gzip blocks, each containing, when uncompressed, 1 MB of FASTQ data - the hallmark of pgzip compression. Unfortunately, since there is no information about the length of a compressed block, neither Genozip nor any other application can parallelize decompressing it.
To solve this issue, I would like to suggest that you add an Extra subfield containing the compressed length of the block. As a reference, on page 13 of the SAM specification you can see how this is done in BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf. An alternative solution would be switching from pgzip to bgzf.
Thanks,
-divon