Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8 write create trends ensemble function #11

Merged
merged 11 commits into from
Nov 14, 2024

Conversation

lshandross
Copy link
Collaborator

No description provided.

@lshandross lshandross linked an issue Nov 14, 2024 that may be closed by this pull request
@lshandross
Copy link
Collaborator Author

lshandross commented Nov 14, 2024

The lint check is only failing due to get_baseline_predictions() having a cyclomatic complexity of 16 after I fixed a validation. I don't feel like it's a great use of my time to refactor that function given that flu forecasting starts on Wednesday, so I filed an issue (#12) about refactoring and would like to merge this PR once everything else looks good.

Copy link
Contributor

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this and it basically looks good. Made 1 or 2 minor suggestions. Two bigger comments:

  1. I had some questions around the temporal aggregation, but I thought we had discussed (or maybe I imagined discussing or planned to discuss and didn't follow through) that we were going to ditch the stuff in here that's related to temporal aggregation because the available data going forward in the near future will only be for the weekly scale. So I propose to essentially go ahead with what's here, regardless of any questions on this front. We just need to get something that handles a single temporal resolution of "weekly" in place.
  2. My main question is actually how we're going to deal with the sampling. Two sub-questions on this:
    1. I think that we should add an n_sim argument to this top level function to allow us to separate the concepts of (a) how many samples are generated from the predictive distribution for subsequent summarizing into predictive quantiles; and (b) how many samples are returned if we have a sample output type
    2. Because hubEnsembles::linear_pool only handles the simplest case where the number of samples for the ensemble is unrestricted, we need to figure out a way to deal with the requirement of getting to 100 samples for the ensemble that we submit. I see two options: (a) update hubEnsembles::linear_pool to allow for specification of a target number of samples for the ensemble, doing sampling if necessary. (b) within this function, pick the number of samples to output from each baseline model so that in total you end up with 100 samples. In our setting with a target of 100 samples for the ensemble and 8 baseline models, we would have 4 baselines generate 12 samples and 4 generate 13. I'm ok with going with option (b) in the short term if it's easier, but note that we will want to do option (a) in the near future as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, it would be good to have some tests for this function. However, I am going to file an issue for that as something we could do later because: (a) it probably works! (b) I feel like we don't even know what our data will look like and if we will use this.

R/create_trends_ensemble.R Outdated Show resolved Hide resolved
R/create_trends_ensemble.R Show resolved Hide resolved
quantile_levels = c(.1, .5, .9),
n_samples = NULL,
return_baseline_predictions = FALSE) |>
expect_error(regex = "Currently `component_variations` may only contain one unique temporal resolution value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low priority -- is this true? I thought I saw stuff about splitting by the temporal resolution up above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the current ensembling for samples makes it so that having multiple temporal resolution values results in different numbers of samples per model, but I wanted to put the check closer to the top to avoid unnecessary calculations. This validation will be removed later on once support is added, but I figured I would just put it in until then

@lshandross
Copy link
Collaborator Author

lshandross commented Nov 14, 2024

  1. I'm not sure if we discussed the topic of ditching all temporal aggregation, but I thought of implementing it for the final ensemble function so that we would have a way to regenerate the trends ensemble from past seasons. Open to having further discussion about this later.
  2. Took care of adding the n_sim arg to all functions instead of just the lowest level. As for the other point, I may do option b for this week if I run out of time before Wednesday (currently working on a script for generating the trends ensemble and have some other functions for this package drafted), but I will put option a as my next highest priority with a hope of getting it implemented within the first few weeks of FluSight submission.

Copy link
Contributor

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@lshandross lshandross merged commit 8c94339 into main Nov 14, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write create_trends_ensemble function
2 participants