Filter which lineages are modeled #29

thanasibakis · 2024-08-14T16:45:01Z

One assumption that we've been hard-coding and need to make configurable is that we want to model all lineages. (... for the July data I'm playing with, the hierarchical model shows pretty clearly that most of the lineages have negligible proportions.)

Originally posted by @afmagee42 in #25 (comment)

afmagee42 · 2024-08-19T14:36:46Z

While we're at it, we should probably also filter on whether the assigned lineage is valid. Right now I don't think we're doing any filtering. But in the whole-US, all-time data, I'm seeing:

['23B', '20F', '21K', '20I', 'recombinant', '20E', '21E', '22C', '23G', '23C', '21I', '20H', '21H', '22D', '20G', '22E', '20B', '23H', '21G', '20A', '22F', '23D', None, '20D', '21J', '23A', '22A', '24A', '21B', '21M', '24B', '20J', '23E', '23F', '21L', '21C', '22B', '21D', '21F', '23I', '20C', '19B', '19A', '24C', '21A']

24C hasn't been put into the tree of clades yet, sadly.

None should be removed.

"Recombinant" I'm still not entirely sure what we want to do with, but it's probably best dealt with on a weekly basis.

(NB: added none-removal in #32)
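A minimal sketch of the validity filtering discussed above, not the code from #32. The function name `filter_valid_lineages` and the `KNOWN_CLADES` set are hypothetical, and the set is a tiny illustrative stand-in for whatever the clade tree actually contains. It drops `None` and only flags unrecognized labels (such as 24C or "recombinant") rather than removing them, since the thread leaves their handling open.

```python
from typing import Iterable, Optional

# Hypothetical stand-in for the clades present in the clade tree.
KNOWN_CLADES = {"23A", "23B", "24A", "24B"}


def filter_valid_lineages(
    lineages: Iterable[Optional[str]],
    known: set = KNOWN_CLADES,
):
    """Drop missing lineage calls and collect labels that are not in `known`.

    Unrecognized labels are kept in the output and reported separately,
    so the caller can decide what to do with them.
    """
    kept = []
    unrecognized = set()
    for lineage in lineages:
        if lineage is None:
            # Missing assignments carry no information and are removed.
            continue
        if lineage not in known:
            unrecognized.add(lineage)
        kept.append(lineage)
    return kept, unrecognized


kept, unrecognized = filter_valid_lineages(["23A", None, "24C", "recombinant", "24A"])
```

Returning the unrecognized labels separately, instead of silently dropping them, keeps decisions like the weekly handling of "recombinant" in the caller's hands.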

This PR identifies two sources of difficulty in fitting the model in early 2022 (end of Delta, start of Omicron).

1. Data filtering now allows removing trivial lineages and grouping them into "other," resolving #29 and greatly reducing the computational burden when many negligible lineages are floating around.
2. The hierarchical model appears to have been a bit too flexible, even with the changes in #41. Here we remove one layer of the hierarchy, fixing `sigma_beta_1` instead of inferring it.

The combined result is that the hierarchical model now works (MCMC is believable) and produces reasonable-seeming results for a 2022-01-01 forecast date.

Late addition, mostly out of scope but worth including: This PR also removes the filtering based on comparing sequence date to clade name. The intent was to avoid clearly incorrect calls like 23A in April 2020. However, it was causing problems for lineages like 24A, which takes off in late 2023 and starts 2024 at high prevalence. As the percentage of all instances where (clade year) > (sample year) is small, and as many of those instances are clearly valid, leaving the remainder in the dataset is the lesser evil, especially since (1) should sweep those into "other," minimizing issues.

---------

Co-authored-by: Thanasi Bakis <[email protected]>
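To illustrate the idea of grouping trivial lineages into "other," here is a rough sketch. It is not the implementation from #45: the function name `group_negligible_lineages`, the `min_proportion` threshold, and the rule of thresholding on the overall observed proportion are all assumptions, as the description does not say how "trivial" is defined.

```python
from collections import Counter
from typing import List, Sequence


def group_negligible_lineages(
    lineages: Sequence[str],
    min_proportion: float = 0.01,  # assumed cutoff, not taken from the PR
    other_label: str = "other",
) -> List[str]:
    """Relabel lineages whose overall share is below `min_proportion`."""
    counts = Counter(lineages)
    total = sum(counts.values())
    if total == 0:
        return []
    # Lineages that make up at least `min_proportion` of all samples are kept.
    keep = {name for name, n in counts.items() if n / total >= min_proportion}
    return [name if name in keep else other_label for name in lineages]


samples = ["21K"] * 60 + ["21J"] * 38 + ["20I"] * 2
grouped = group_negligible_lineages(samples, min_proportion=0.05)
```

With fewer distinct categories, the model has fewer lineage-level parameters to fit, which is where the reduced computational burden would come from.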

afmagee42 · 2024-08-27T14:47:01Z

Fixed in #45

afmagee42 mentioned this issue Aug 23, 2024

Enable fitting to Omicron #45

Merged

afmagee42 closed this as completed Aug 27, 2024
