Use biobear to read ZSTD-compressed .fasta files #89
Conversation
All the actual changes related to this PR look good, with some minor questions below. I remember seeing somewhere that you were running checks to make sure that, in a realistic example, the same number of sequences ended up getting saved by sequence.filter in pathways where biobear is or isn't used. I guess those all worked out well?
src/cladetime/cladetime.py
# if there are many sequences in the filtered metadata, warn that clade assignment will
# take a long time and require a lot of resources
if sequence_count > self._config.clade_assignment_warning_threshold:
    msg = (
        f"Sequence count is {sequence_count}: clade assignment will run longer than usual. "
        "You may want to run clade assignments on smaller subsets of sequences."
    )
    warnings.warn(
        msg,
        category=CladeTimeSequenceWarning,
    )
In the main PR comment, you wrote "The bottleneck now appears to be clade assignment itself." This makes me think it could still be helpful to include this warning?
I agree that a warning could be useful, but the wording we had was ambiguous/arbitrary, e.g., "run longer than usual".
What about keeping the somewhat arbitrary warning trigger (which is in the config, so it's easy to change) but changing the warning text to something like:
"About to assign clades to {sequence_count} sequences. The assignment process is resource-intensive, and depending on the limitations of your machine, you may want to use a smaller subset of sequences."
That sounds good to me.
Added the warning back!
Yes, I didn't explain it super well in the PR comment! I ran the variant-nowcast-hub's run-post-submission-jobs workflow manually, pointing it to a branch that uses this biobear version of cladetime. The expected outcome was for that workflow to run successfully and create exactly the same versions of the target data that the automated, non-biobear run created. The full test workflow run is here, but the relevant bit is this output: the run didn't create a new PR because nothing had changed.
Biobear has a built-in method that returns the contents of a .fasta file as an Arrow batch reader object, which, in turn, can be parsed into a Polars dataframe so we can filter the .fasta and write the filtered results as chunks (instead of reading/filtering/writing line by line).
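To make the batch-based approach in this commit concrete, here is a minimal sketch of reading a FASTA file in Arrow batches and filtering it with Polars. It assumes biobear's session/SQL API (bb.connect() and fasta_scan, per the biobear README), a FASTA column schema of id/description/sequence, and zstd handling via file-extension detection; the file names and filter list are hypothetical, and cladetime's actual code may differ:

import biobear as bb
import polars as pl

# Assumed API per the biobear README; cladetime's actual calls may differ.
session = bb.connect()
reader = session.sql(
    "SELECT * FROM fasta_scan('sequences.fasta.zst')"
).to_arrow_record_batch_reader()

wanted_ids = ["hCoV-19/example-1", "hCoV-19/example-2"]  # hypothetical filter list

with open("filtered.fasta", "w") as out:
    for batch in reader:  # each Arrow batch becomes one Polars dataframe
        df = pl.from_arrow(batch).filter(pl.col("id").is_in(wanted_ids))
        for row in df.iter_rows(named=True):
            # description may be empty or missing (see the bug note below)
            desc = f" {row['description']}" if row.get("description") else ""
            out.write(f">{row['id']}{desc}\n{row['sequence']}\n")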
…th ZSTD Biobear has the ability to read .fasta files as batches that can be slurped into a Polars dataframe. This method provides a significant performance improvement over Cladetime's prior approach of using biopython to process a .fasta file line by line. Related issue: #82
This test verifies that the final line report returned by assign_clades (i.e., the clade assignments merged with the sequence metadata) contains valid clade assignments. The test currently fails due to a bug in the biobear feature (that fix will be the next commit).
The biobear fasta reader does not bring in a description when processing sequence records, which resulted in an "unknown" description when writing records back out using SeqIO. That "unknown" string eventually found its way into the "strain" field of the nextclade CLI output, which resulted in null values for "clade" after the nextclade output was joined to the metadata.
This changeset also adds a test for clade assignments when not all sequences in the input list get an assignment. A small sketch of the failure mode follows below.
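Illustrative sketch of the failure mode described above (not cladetime's actual fix): Biopython's SeqRecord defaults its description to "<unknown description>", so records written without an explicit description carry that text into the FASTA header. Passing an explicit (even empty) description avoids it:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Default description is "<unknown description>", which leaks into the header:
buggy = SeqRecord(Seq("ACGT"), id="hCoV-19/example")
# An explicit empty description keeps the header to just the id:
fixed = SeqRecord(Seq("ACGT"), id="hCoV-19/example", description="")

with open("out.fasta", "w") as handle:
    SeqIO.write([fixed], handle, "fasta")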
We get this warning all the time when using Cladetime in variant-nowcast-hub because we're always assigning more than 30 days' worth of sequences. The 30-day trigger for the warning was always arbitrary, and it's not serving us well.
This reverts commit c1f8ffc.
(Branch force-pushed from e45a8c2 to 4625e63.)
LGTM, one minor change to the error message.
Fix typo
Co-authored-by: Evan Ray <[email protected]>
Closes #82
Background
TL;DR: try to improve the performance of filtering the sequence file prior to clade assignment
The ticket linked above has more details, as well as a reference to the corresponding RFC.
Testing
Running post-submission jobs against that branch generated target data and did not create a pull request, because the target data files match those created against the main branch earlier this week. Output from that run: https://github.com/reichlab/variant-nowcast-hub/actions/runs/12837193998/job/35800364997
Timing
The above GitHub Actions run with biobear reduced the time it takes to filter the sequence file by about 35%. However, the overall run time of create-target-data was reduced by only about 12%. Of course, these statistics are from a single run.
Specs for the standard runner are 4 processors, 16 GB memory, and 14 GB storage, which explains why the GitHub Actions run didn't yield the performance gains that I saw locally.
The bottleneck now appears to be clade assignment itself.