
Use biobear to read ZSTD-compressed .fasta files #89

Merged: 13 commits, Jan 24, 2025
Conversation

bsweger (Collaborator) commented Jan 17, 2025

Closes #82

Background

TL;DR: try to improve the performance of filtering the sequence file prior to clade assignment
The ticket linked above has more details, as well as a reference to the corresponding RFC.

Testing

  • The PR contains an additional test for end-to-end clade assignment (though admittedly the test data could be more...robust).
  • There's a branch on variant-nowcast-hub that generates target data using this branch of Cladetime. A manual run of the Run post-submission jobs workflow against that branch generated target data and did not create a pull request, because the target data files match those that were created against the main branch earlier this week. Output from that run: https://github.com/reichlab/variant-nowcast-hub/actions/runs/12837193998/job/35800364997

Timing

The above GitHub Actions run with biobear reduced the time it takes to filter the sequence file by about 35%. However, the overall run time of create-target-data was reduced by only about 12%. Of course, these statistics are from a single run.

The standard runner's specs are 4 processors, 16 GB of memory, and 14 GB of storage, which explains why the GitHub Actions run didn't yield the performance gains I saw locally.
The bottleneck now appears to be clade assignment itself.
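The filtering approach this PR adopts can be sketched in pure Python. This is a simplified illustration of the idea, not Cladetime's actual implementation: the `parse_fasta`/`filter_fasta` helpers and the `wanted_ids` set are assumptions made for the example. The point is that records are parsed and filtered in bulk, then written back out as one batch, rather than read/filtered/written line by line.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (header, sequence) records."""
    records = []
    for chunk in text.split(">")[1:]:
        header, _, seq = chunk.partition("\n")
        records.append((header.strip(), seq.replace("\n", "")))
    return records


def filter_fasta(text: str, wanted_ids: set[str]) -> str:
    """Keep only records whose first header token is in wanted_ids."""
    kept = [
        f">{header}\n{seq}\n"
        for header, seq in parse_fasta(text)
        if header.split()[0] in wanted_ids
    ]
    # The filtered results are joined and written as one batch,
    # instead of deciding record-by-record while streaming output.
    return "".join(kept)


fasta = ">seq1 USA/2024\nACGT\n>seq2 UK/2024\nGGCC\n"
print(filter_fasta(fasta, {"seq1"}))  # keeps only seq1's record
```

In the real code, biobear does the heavy lifting of parsing the (ZSTD-compressed) .fasta into batches, and Polars does the filtering; the batch-oriented shape of the work is the same.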

bsweger requested a review from elray1 on January 21, 2025
elray1 (Collaborator) left a comment


All the actual changes related to this PR look good, with some minor questions below. I remember seeing somewhere that you were doing some checks to make sure that in a realistic example, the same number of sequences ended up getting saved by sequence.filter in pathways where biobear is or isn't used. I guess those all worked out well?

Comment on lines 315 to 328
# if there are many sequences in the filtered metadata, warn that clade assignment will
# take a long time and require a lot of resources
if sequence_count > self._config.clade_assignment_warning_threshold:
    msg = (
        f"Sequence count is {sequence_count}: clade assignment will run longer than usual. "
        "You may want to run clade assignments on smaller subsets of sequences."
    )
    warnings.warn(
        msg,
        category=CladeTimeSequenceWarning,
    )

elray1 (Collaborator):

In the main PR comment, you wrote "The bottleneck now appears to be clade assignment itself." This makes me think it could still be helpful to include this warning?

bsweger (Collaborator, Author):

I agree that a warning could be useful, but the wording we had was ambiguous and arbitrary, e.g., "run longer than usual".

What about keeping the somewhat arbitrary warning trigger (which is in the config, so easy to change), but changing the warning text to something like:

"About to assign clades to {sequence_count} sequences. The assignment process is resource-intensive, and depending on the limitations of your machine, you may want to use a smaller subset of sequences."
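The proposal could look like the following sketch. The function name `check_sequence_count` and the default threshold value are illustrative assumptions; `CladeTimeSequenceWarning` and the configurable threshold mirror the snippet quoted above.

```python
import warnings


class CladeTimeSequenceWarning(UserWarning):
    """Warns that a sequence operation may be resource-intensive."""


def check_sequence_count(sequence_count: int, warning_threshold: int = 10_000) -> None:
    # The threshold is arbitrary and configurable; the message no longer
    # claims a specific runtime, just that the operation is resource-intensive.
    if sequence_count > warning_threshold:
        msg = (
            f"About to assign clades to {sequence_count} sequences. "
            "The assignment process is resource-intensive, and depending on the "
            "limitations of your machine, you may want to use a smaller subset of sequences."
        )
        warnings.warn(msg, category=CladeTimeSequenceWarning)
```

Keeping the trigger in the config lets hub operators who expect large sequence counts raise the threshold instead of suppressing the warning category.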

elray1 (Collaborator):

That sounds good to me.

bsweger (Collaborator, Author):

Added the warning back!

src/cladetime/cladetime.py: review thread resolved
src/cladetime/sequence.py: review thread resolved
bsweger (Collaborator, Author) commented Jan 23, 2025

I remember seeing somewhere that you were doing some checks to make sure that in a realistic example, the same number of sequences ended up getting saved by sequence.filter in pathways where biobear is or isn't used. I guess those all worked out well?

Yes, I didn't explain it very well in the PR comment! I ran the variant-nowcast-hub's run-post-submission-jobs workflow manually, pointing it to a branch that uses this biobear version of Cladetime.

The expected outcome was for that workflow to run successfully and create exactly the same versions of the target data that the automated, non-biobear run created.

The full test workflow run is here, but the relevant bit is this output. The run didn't create a new PR because nothing had changed:

nothing to commit, working tree clean
Switched to a new branch '2024-10-16-target-data_2025-01-17_21-47-02'
No changes to commit in target-data/

Biobear has a built-in method that returns the contents
of a .fasta file as an Arrow batch reader object, which, in turn,
can be parsed into a Polars dataframe so we can filter
the .fasta file and write the filtered results as chunks
(instead of reading/filtering/writing line by line)
…th ZSTD

Biobear has the ability to read .fasta files as batches that can be
slurped into a Polars DataFrame. This method provides a significant
performance improvement over Cladetime's prior method of using biopython
to process a .fasta file line by line.

Related issue:
#82
This test verifies that the final report returned by assign_clades
(i.e., the clade assignments merged with the sequence metadata) contains
valid clade assignments.

The test currently fails due to a bug in the biobear feature (the fix will be
in the next commit)
The biobear fasta reader does not bring in a description when processing
sequence records, which resulted in an "unknown" description when writing
records back out using SeqIO. That "unknown" string eventually found its way
into the "strain" field of the nextclade CLI output, which resulted in null
values for "clade" after the nextclade output was joined to the metadata.
This changeset also adds a test for clade assignments when not all
sequences in the input list get an assignment
We get this warning all the time when using Cladetime in variant-nowcast-hub
because we're always assigning more than 30 days' worth of sequences. The
30-day trigger for the warning was always arbitrary, and it's not serving us well.
elray1 (Collaborator) left a comment

lgtm, one minor change to the error message

src/cladetime/cladetime.py: review thread resolved (outdated)
Fix typo

Co-authored-by: Evan Ray <[email protected]>
bsweger merged commit f8bbd50 into main on Jan 24, 2025
3 checks passed
bsweger deleted the bs/add-biobear/82 branch on January 24, 2025
Successfully merging this pull request may close these issues:

Use biobear to read sequence .fasta files