
Use biobear to read ZSTD-compressed .fasta files #89

Merged: 13 commits, Jan 24, 2025
Conversation

bsweger (Collaborator) commented Jan 17, 2025

Closes #82

Background

TL;DR: try to improve the performance of filtering the sequence file prior to clade assignment
The ticket linked above has more details, as well as a reference to the corresponding RFC.

Testing

  • The PR contains an additional test for end-to-end clade assignment (though admittedly the test data could be more...robust).
  • There's a branch on variant-nowcast-hub that generates target data using this branch of Cladetime. A manual run of the Run post-submission jobs workflow against that branch generated target data and did not create a pull request, because the target data files match those that were created against the main branch earlier this week. Output from that run: https://github.com/reichlab/variant-nowcast-hub/actions/runs/12837193998/job/35800364997

Timing

The above GitHub Actions run with biobear reduced the time it takes to filter the sequence file by about 35%. However, the overall run time of create-target-data was reduced by only about 12%. Of course, these statistics are from a single run.

The standard runner's specs are 4 processors, 16 GB of memory, and 14 GB of storage, which explains why the GitHub Actions run didn't yield the performance gains I saw locally.
The bottleneck now appears to be clade assignment itself.
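The filtering approach this PR adopts can be sketched in pure Python. This is a simplified illustration of the idea, not Cladetime's actual implementation: the `parse_fasta`/`filter_fasta` helpers and the `wanted_ids` set are assumptions made for the example. The point is that records are parsed and filtered in bulk, then written back out as one batch, rather than read/filtered/written line by line.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (header, sequence) records."""
    records = []
    for chunk in text.split(">")[1:]:
        header, _, seq = chunk.partition("\n")
        records.append((header.strip(), seq.replace("\n", "")))
    return records


def filter_fasta(text: str, wanted_ids: set[str]) -> str:
    """Keep only records whose first header token is in wanted_ids."""
    kept = [
        f">{header}\n{seq}\n"
        for header, seq in parse_fasta(text)
        if header.split()[0] in wanted_ids
    ]
    # The filtered results are joined and written as one batch,
    # instead of deciding record-by-record while streaming output.
    return "".join(kept)


fasta = ">seq1 USA/2024\nACGT\n>seq2 UK/2024\nGGCC\n"
print(filter_fasta(fasta, {"seq1"}))  # keeps only seq1's record
```

In the real code, biobear does the heavy lifting of parsing the (ZSTD-compressed) .fasta into batches, and Polars does the filtering; the batch-oriented shape of the work is the same.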

bsweger requested a review from elray1 on January 21, 2025
elray1 (Collaborator) left a comment


All the actual changes related to this PR look good, with some minor questions below. I remember seeing somewhere that you were doing some checks to make sure that in a realistic example, the same number of sequences ended up getting saved by sequence.filter in pathways where biobear is or isn't used. I guess those all worked out well?

Comment on lines 315 to 328
# if there are many sequences in the filtered metadata, warn that clade assignment will
# take a long time and require a lot of resources
if sequence_count > self._config.clade_assignment_warning_threshold:
    msg = (
        f"Sequence count is {sequence_count}: clade assignment will run longer than usual. "
        "You may want to run clade assignments on smaller subsets of sequences."
    )
    warnings.warn(
        msg,
        category=CladeTimeSequenceWarning,
    )

elray1 (Collaborator):

In the main PR comment, you wrote "The bottleneck now appears to be clade assignment itself." This makes me think it could still be helpful to include this warning?

bsweger (Collaborator, Author):

I agree that a warning could be useful, but the wording we had was ambiguous and arbitrary, e.g., "run longer than usual".

What about keeping the somewhat arbitrary warning trigger (which is in the config, so easy to change), but changing the warning text to something like:

"About to assign clades to {sequence_count} sequences. The assignment process is resource-intensive, and depending on the limitations of your machine, you may want to use a smaller subset of sequences."
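The proposal could look like the following sketch. The function name `check_sequence_count` and the default threshold value are illustrative assumptions; `CladeTimeSequenceWarning` and the configurable threshold mirror the snippet quoted above.

```python
import warnings


class CladeTimeSequenceWarning(UserWarning):
    """Warns that a sequence operation may be resource-intensive."""


def check_sequence_count(sequence_count: int, warning_threshold: int = 10_000) -> None:
    # The threshold is arbitrary and configurable; the message no longer
    # claims a specific runtime, just that the operation is resource-intensive.
    if sequence_count > warning_threshold:
        msg = (
            f"About to assign clades to {sequence_count} sequences. "
            "The assignment process is resource-intensive, and depending on the "
            "limitations of your machine, you may want to use a smaller subset of sequences."
        )
        warnings.warn(msg, category=CladeTimeSequenceWarning)
```

Keeping the trigger in the config lets hub operators who expect large sequence counts raise the threshold instead of suppressing the warning category.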

elray1 (Collaborator):

That sounds good to me.

bsweger (Collaborator, Author):

Added the warning back!

src/cladetime/cladetime.py: review thread resolved
src/cladetime/sequence.py: review thread resolved
bsweger (Collaborator, Author) commented Jan 23, 2025

I remember seeing somewhere that you were doing some checks to make sure that in a realistic example, the same number of sequences ended up getting saved by sequence.filter in pathways where biobear is or isn't used. I guess those all worked out well?

Yes, I didn't explain it very well in the PR comment! I ran the variant-nowcast-hub's run-post-submission-jobs workflow manually, pointing it to a branch that uses this biobear version of Cladetime.

The expected outcome was for that workflow to run successfully and create exactly the same versions of the target data that the automated, non-biobear run created.

The full test workflow run is here, but the relevant bit is this output. The run didn't create a new PR because nothing had changed:

nothing to commit, working tree clean
Switched to a new branch '2024-10-16-target-data_2025-01-17_21-47-02'
No changes to commit in target-data/

Biobear has a built-in method that returns the contents
of a .fasta file as an Arrow batch reader object, which, in turn,
can be parsed into a Polars dataframe so we can filter
the .fasta file and write the filtered results as chunks
(instead of reading/filtering/writing line by line)
…th ZSTD

Biobear has the ability to read .fasta files as batches that can be
slurped into a Polars DataFrame. This method provides a significant
performance improvement over Cladetime's prior method of using biopython
to process a .fasta file line by line.

Related issue:
#82
This test verifies that the final report returned by assign_clades
(i.e., the clade assignments merged with the sequence metadata) contains
valid clade assignments.

The test currently fails due to a bug in the biobear feature (the fix will be
in the next commit)
The biobear fasta reader does not bring in a description when processing
sequence records, which resulted in an "unknown" description when writing
records back out using SeqIO. That "unknown" string eventually found its way
into the "strain" field of the nextclade CLI output, which resulted in null
values for "clade" after the nextclade output was joined to the metadata.
This changeset also adds a test for clade assignments when not all
sequences in the input list get an assignment
We get this warning all the time when using Cladetime in variant-nowcast-hub
because we're always assigning more than 30 days' worth of sequences. The
30-day trigger for the warning was always arbitrary, and it's not serving us well.
elray1 (Collaborator) left a comment

lgtm, one minor change to the error message

src/cladetime/cladetime.py: review thread resolved (outdated)
Fix typo

Co-authored-by: Evan Ray <[email protected]>
bsweger merged commit f8bbd50 into main on Jan 24, 2025
3 checks passed
bsweger deleted the bs/add-biobear/82 branch on January 24, 2025
Successfully merging this pull request may close these issues:

Use biobear to read sequence .fasta files