Commit
Use biobear to read ZSTD-compressed .fasta files (#89)
* Use biobear to read .zst-compressed .fasta files

  Biobear has a built-in method that returns the contents of a .fasta file as an Arrow batch reader object, which in turn can be parsed into a Polars DataFrame, so we can filter the .fasta and write the filtered results as chunks (instead of reading/filtering/writing line by line).

* Update dependencies and add biobear

* Use the biobear package when filtering .fasta sequences compressed with ZSTD

  Biobear can read .fasta files as batches that can be slurped into a Polars DataFrame. This method provides a significant performance improvement over Cladetime's prior method of using biopython to process a .fasta file line by line.

  Related issue: #82

* Make the GitHub workflow checkout action more secure

  https://yossarian.net/til/post/actions-checkout-can-leak-github-credentials/

* Fix up some type errors reported by mypy

* Add a test for missing clade assignments in line report

  This test verifies that the final line report returned by assign_clades (i.e., the clade assignments merged with the sequence metadata) contains valid clade assignments. The test currently fails due to a bug in the biobear feature (that fix will be the next commit).

* Add description field to fasta records written by SeqIO

  The biobear fasta reader does not bring in a description when processing sequence records, which resulted in an "unknown" description when writing records back out using SeqIO. That "unknown" string eventually found its way into the "strain" field of the nextclade cli output, which resulted in null values for "clade" after the nextclade output was joined to the metadata.

* Add sequence and assigned sequence counts to assign_clade metadata

  This changeset also adds a test for clade assignments when not all sequences in the input list get an assignment.

* Remove warning about assigning large volumes of clades

  We get this warning all the time when using Cladetime in variant-nowcast-hub because we're always assigning more than 30 days' worth of sequences. The 30-day trigger for the warning was always arbitrary, and it's not serving us well.

* Remove breakpoint 🙃

* Revert "Remove warning about assigning large volumes of clades"

  This reverts commit c1f8ffc.

* Re-word the warning about many sequence assignments

* Update src/cladetime/cladetime.py

  Fix typo

  Co-authored-by: Evan Ray <[email protected]>

---------

Co-authored-by: Evan Ray <[email protected]>
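
The batch-oriented filtering described in the commit message can be illustrated with a minimal standard-library sketch. This is not Cladetime's code and it does not call biobear or Polars; the function names (`read_fasta`, `batched`, `filter_fasta`), the `batch_size` parameter, and the sample records are all hypothetical. It only shows the general idea: read records in batches, filter each batch against a set of wanted sequence IDs, write each filtered batch in one call, and keep counts of records seen and written.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List, Set, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(handle: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    parts: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def filter_fasta(src: Iterable[str], dst: TextIO, keep_ids: Set[str],
                 batch_size: int = 1000) -> Tuple[int, int]:
    """Copy records whose ID is in keep_ids; return (seen, written) counts."""
    seen = written = 0
    for batch in batched(read_fasta(src), batch_size):
        seen += len(batch)
        # The sequence ID is the first whitespace-delimited token of the header.
        kept = [(h, s) for h, s in batch if (h.split() or [""])[0] in keep_ids]
        # One write per batch rather than one per record.
        dst.write("".join(f">{h}\n{s}\n" for h, s in kept))
        written += len(kept)
    return seen, written


# Hypothetical sample data, for illustration only.
sample = io.StringIO(">seq1 example\nACGT\nAC\n>seq2\nGG\n>seq3\nTT\n")
out = io.StringIO()
counts = filter_fasta(sample, out, {"seq1", "seq3"}, batch_size=2)
print(counts)
print(out.getvalue(), end="")
```

Returning both counts makes it easy to notice when fewer sequences come out than went in, which is the kind of check the sequence-count metadata in this changeset appears intended to support.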
Showing 10 changed files with 140 additions and 32 deletions.