
Generate a list of sequence collections by date and location #92

Merged (12 commits) Oct 9, 2024

Conversation

@bsweger (Collaborator) commented Oct 3, 2024

DO NOT MERGE UNTIL virus-clade-utils rename has been merged. (Update: Done!)

Closes #50

WIP so that the people using this information when scoring can take a look at the output. Will add some annotations inline.

Remaining work before the PR comes out of draft:

  1. Review and merge this PR from virus-clade-utils
  2. Respond to feedback about contents and format of the "sequence counts by location and date" file (see inline note)
  3. Add the scheduled GitHub action that will run on Wednesdays after round closing

By default, the dates included will be the 31 days prior to the date
of the round that just closed (the script will be scheduled to
run on Wednesday evenings).
@bsweger (Collaborator, Author) commented:

This is the output of the script that counts sequence collections by date and location.

@nickreich would love your thoughts on the notes below.

  • I didn't filter out the zeroes, because it seemed easier for a user to work with a complete dataset than have to infer from missing data which states and days are eligible for scoring.
  • The data originates from a Polars DataFrame, so we can write it out as JSON, Parquet, etc. It doesn't have to be a .csv.

@nickreich (Member) commented:

  1. I agree with not filtering out zeroes. I also like that we are actually storing the counts themselves, as this might allow us later on to compute what fraction of counts were observed at the forecast due date.
  2. I don't have a strong feeling about .csv vs. .parquet. I might suggest not JSON, since tabular formats are simpler to read into R/Python.
  3. The tidier way to store the data would be to have a column for location, a column for date, and a column for count. Given that all of the evaluation/model-output will work this way, I suggest reformatting into this "longer" version of the data. I would suggest naming the columns location, target_date and count, so that it could be merged into a model-output or target-data file at some point down the road as-is. Would making the data "tidy" increase the size of the files? I think even if it would, it might be worth doing.

@bsweger (Collaborator, Author) commented:

Thanks for the feedback @nickreich. Can definitely put the data into 3rd normal form--it's so small and will always be limited to 31 days by definition, so am not worried about size. I think the current format is what @elray1 and I discussed, so hopefully he can chime in (it's also entirely possible that I'm misremembering things!)

A collaborator commented:

I think Nick's suggestions make sense. Very minor -- what do you think about a folder name of unscored-location-dates? (unscored rather than unscore)

…t time

This change will make the script more precise and more accurate
should we need to re-run it. Rather than using the version of
genome sequence metadata at the time the script is run, this
update explicitly uses the sequence metadata as it existed at
the time the most recent round closed.

The update generated a slightly different version of the resulting
data, which is included in the commit.
Although the documentation considers streaming to be "unstable",
this changeset adds it as a hedge against out-of-memory errors.
Introduce another GitHub workflow that runs on Wednesdays after
a round closes. It generates a count of SARS-CoV-2 sequences
collected over the past 31 days (grouped by location and day).
uv is a drop-in replacement for the setup-python action,
so we don't need the latter.
on:
schedule:
# Not precise, but GitHub Actions don't support time zones
- cron: "0 02 * 11-12,1-3 3" # 2 AM UTC every Wed (Nov-Dec, Jan-Mar)
@bsweger (Collaborator, Author) commented:

Timezones 🙀

Rounds close at 8 PM US Eastern, and GitHub Actions run in UTC, so this schedule comes as close as possible while accounting for daylight saving time (not exact, obviously).
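The approximation can be checked with Python's stdlib zoneinfo: 8 PM Eastern on a Tuesday lands at 00:00 UTC Wednesday under daylight time but 01:00 UTC under standard time, so a fixed 02:00 UTC Wednesday cron fires one to two hours after the round closes (dates below are illustrative Tuesdays):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
UTC = ZoneInfo("UTC")

# A round close (8 PM Eastern, Tuesday) under daylight and standard time
edt_close = datetime(2024, 10, 29, 20, 0, tzinfo=EASTERN)
est_close = datetime(2024, 12, 3, 20, 0, tzinfo=EASTERN)

print(edt_close.astimezone(UTC))  # 2024-10-30 00:00:00+00:00
print(est_close.astimezone(UTC))  # 2024-12-04 01:00:00+00:00
```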

@bsweger bsweger marked this pull request as ready for review October 7, 2024 16:00
@bsweger bsweger requested a review from elray1 October 7, 2024 16:00
@bsweger bsweger force-pushed the bsweger/sequence-by-state-date/50 branch from bbe02fb to 2a5dec3 on October 8, 2024 17:03
@bsweger bsweger force-pushed the bsweger/sequence-by-state-date/50 branch from c58b239 to 0e6f05f on October 8, 2024 17:38
The virus-clade-utils repo is now named cladetime. The
virus_clade_utils Python package is now named cladetime.
More information:
reichlab/cladetime#31
@elray1 (Collaborator) left a comment:

Approving this PR, but am noting that we are careful to provide the correct round closing datetime when instantiating a CladeTime object, but as of now I believe this is still being truncated over in CladeTime. So for now, I believe that our data may still be one day off? (not sure, may depend on timezone conversion...)

@bsweger (Collaborator, Author) commented Oct 9, 2024:

> Approving this PR, but am noting that we are careful to provide the correct round closing datetime when instantiating a CladeTime object, but as of now I believe this is still being truncated over in CladeTime. So for now, I believe that our data may still be one day off? (not sure, may depend on timezone conversion...)

Good note! The current state of things is a bit confusing. You're right: cladetime's download_covid_genome_metadata does truncate datetimes that are passed to it. There's still an open issue to fix that.

However, the code in this PR doesn't use that download method, because it works directly against the sequence metadata file on S3 (i.e., no download required). The function used to get the S3 URL does use a full timestamp.

Now that we're using the CladeTime class as our interface, we'll probably end up making this less confusing by updating the download_covid_genome_metadata to have a URL passed to it, instead of having it call _get_s3_object_url directly. Thanks for being patient as we make the transition!

@bsweger bsweger merged commit a9bc303 into main Oct 9, 2024
@bsweger bsweger deleted the bsweger/sequence-by-state-date/50 branch October 9, 2024 16:34
Successfully merging this pull request may close these issues.

Save a list of state/date combinations not to score for a specific variant-nowcast-hub round
3 participants