This PR adds a simple tool (currently filed in `exploration`) to visualize lineages over time from the Nextstrain data without modeling. It can make plots like the one below, where we focus on a particular time range (here 2022) and filter out lineages never seen above some percent (here 10%).
I've eschewed stacked charts so it's a bit easier to see what happens to any particular lineage, because the point of this is for us to choose parts of the life cycle of a lineage to model.
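For reviewers who want the filtering rule spelled out, here is a minimal sketch of it, not the code in this PR. It assumes records of `(date, lineage, count)`, computes each lineage's share per date, and keeps lineages whose share exceeds the threshold at least once in the window. The function name `lineages_above_threshold` and its parameters are hypothetical, and it uses plain Python rather than polars to stay dependency-free.

```python
from collections import defaultdict
from datetime import date


def lineages_above_threshold(records, start, end, min_share=0.10):
    """Return lineages whose share of sequences exceeds `min_share`
    on at least one date within [start, end].

    `records` is an iterable of (date, lineage, count) tuples.
    This is an illustrative sketch, not the PR's implementation.
    """
    totals = defaultdict(int)  # total sequences per date
    counts = defaultdict(int)  # sequences per (date, lineage)
    for day, lineage, n in records:
        if start <= day <= end:
            totals[day] += n
            counts[(day, lineage)] += n

    keep = set()
    for (day, lineage), n in counts.items():
        if totals[day] > 0 and n / totals[day] > min_share:
            keep.add(lineage)
    return keep


# Toy data: 21J never exceeds 10% inside 2022, so it is dropped.
records = [
    (date(2021, 12, 1), "21J", 100),  # outside the window, ignored
    (date(2022, 1, 1), "21K", 92),
    (date(2022, 1, 1), "21J", 8),
    (date(2022, 1, 8), "21K", 95),
    (date(2022, 1, 8), "21J", 5),
]
kept = lineages_above_threshold(records, date(2022, 1, 1), date(2022, 12, 31))
```

The share here is computed per date in the records; binning by week or month first would change which lineages clear the threshold.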
Out-of-scope additions
In making this, I found that some sequences are assigned impossible clades, such as a sequence from 2020 being assigned to 24A. Among all data with valid dates and valid clades, this appears to happen <1% of the time.
I have thus added `linmod.data.with_bad_ns_assign()` as a function to add a column `impossible` to a polars dataframe that says whether a lineage assignment is impossible or not. I have also plugged this into our filtering in `linmod.data.main`.
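To illustrate the kind of check involved, here is a rough sketch of one possible rule. It is an assumption, not a description of how `with_bad_ns_assign()` decides. It relies on Nextstrain clade names starting with a two-digit year (e.g. `24A`) and flags a sequence collected in an earlier year than that. The helper `is_impossible` and its `slack_years` parameter are hypothetical. The slack is there because a clade can circulate before the year in its name.

```python
import re
from datetime import date

# Nextstrain clade names begin with a two-digit year and a letter, e.g. "24A".
CLADE_RE = re.compile(r"^(\d{2})[A-Z]")


def is_impossible(collection_date, clade, slack_years=0):
    """Flag a clade assignment whose name year is later than the
    collection year by more than `slack_years`.

    Hypothetical rule for illustration only; the PR's actual
    criterion may differ.
    """
    match = CLADE_RE.match(clade)
    if match is None:
        # Names that do not follow the year-letter scheme cannot be judged.
        return False
    clade_year = 2000 + int(match.group(1))
    return collection_date.year + slack_years < clade_year


flag_2020_as_24a = is_impossible(date(2020, 6, 1), "24A")
flag_2024_as_24a = is_impossible(date(2024, 3, 1), "24A")
```

In a dataframe, the same rule would be applied row by row to fill the `impossible` column.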