
Generate a list of sequence collections by date and location #92

Merged (12 commits) Oct 9, 2024

Conversation

@bsweger (Collaborator) commented Oct 3, 2024

DO NOT MERGE UNTIL virus-clade-utils rename has been merged. (Update: Done!)

Closes #50

WIP so that the people using this information when scoring can take a look at the output. Will add some annotations inline.

Remaining work before the PR comes out of draft:

  1. Review and merge this PR from virus-clade-utils
  2. Respond to feedback about contents and format of the "sequence counts by location and date" file (see inline note)
  3. Add the scheduled GitHub action that will run on Wednesdays after round closing

By default, the dates included will be the 31 days prior to the date
of the round that just closed (the script will be scheduled to
run on Wednesday evenings).
@bsweger (Collaborator, Author) commented:

This is the output of the script that counts sequence collections by date and location.

@nickreich would love your thoughts on the notes below.

  • I didn't filter out the zeroes, because it seemed easier for a user to work with a complete dataset than have to infer from missing data which states and days are eligible for scoring.
  • The data originates from a Polars DataFrame, so we can write it out as JSON, Parquet, etc. It doesn't have to be a .csv.

@nickreich (Member) commented:

  1. I agree with not filtering out zeroes. I also like that we are actually storing the counts themselves, as this might allow us later on to compute what fraction of counts were observed at the forecast due date.
  2. I don't have a strong feeling about .csv vs. .parquet. I might suggest not JSON, since tabular formats are simpler to read into R/Python.
  3. The tidier way to store the data would be to have a column for location, a column for date, and a column for count. Given that all of the evaluation/model-output will work this way, I suggest reformatting into this "longer" version of the data. I would suggest naming the columns location, target_date and count, so that it could be merged into a model-output or target-data file at some point down the road as-is. Would making the data "tidy" increase the size of the files? I think even if it would, it might be worth doing.

@bsweger (Collaborator, Author) commented:

Thanks for the feedback @nickreich. Can definitely put the data into 3rd normal form--it's so small and will always be limited to 31 days by definition, so am not worried about size. I think the current format is what @elray1 and I discussed, so hopefully he can chime in (it's also entirely possible that I'm misremembering things!)

A collaborator commented:

I think Nick's suggestions make sense. Very minor -- what do you think about a folder name of unscored-location-dates? (unscored rather than unscore)

…t time

This change will make the script more precise and more accurate
should we need to re-run it. Rather than using the version of
genome sequence metadata at the time the script is run, this
update explicitly uses the sequence metadata as it existed at
the time the most recent round closed.

The update generated a slightly different version of the resulting
data, which is included in the commit.
Although the documentation considers streaming to be "unstable",
this changeset adds it as a hedge against out-of-memory errors.
Introduce another GitHub workflow that runs on Wednesdays after
a round closes. It generates a count of SARS-CoV-2 sequences
collected over the past 31 days (grouped by location and day).
uv is a drop-in replacement for the setup-python action,
so we don't need the latter.
on:
schedule:
# Not precise, but GitHub Actions don't support time zones
- cron: "0 02 * 11-12,1-3 3" # 2 AM UTC every Wed (Nov-Dec, Jan-Mar)
@bsweger (Collaborator, Author) commented:

Timezones 🙀

Rounds close at 8 PM US Eastern, and GitHub Actions run in UTC, so this schedule comes as close as possible while accounting for daylight saving time (not exact, obviously).
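The approximation can be checked with Python's stdlib zoneinfo: 8 PM Eastern on a Tuesday lands at 00:00 UTC Wednesday under daylight time but 01:00 UTC under standard time, so a fixed 02:00 UTC Wednesday cron fires one to two hours after the round closes (dates below are illustrative Tuesdays):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
UTC = ZoneInfo("UTC")

# A round close (8 PM Eastern, Tuesday) under daylight and standard time
edt_close = datetime(2024, 10, 29, 20, 0, tzinfo=EASTERN)
est_close = datetime(2024, 12, 3, 20, 0, tzinfo=EASTERN)

print(edt_close.astimezone(UTC))  # 2024-10-30 00:00:00+00:00
print(est_close.astimezone(UTC))  # 2024-12-04 01:00:00+00:00
```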

@bsweger bsweger marked this pull request as ready for review October 7, 2024 16:00
@bsweger bsweger requested a review from elray1 October 7, 2024 16:00
@bsweger bsweger force-pushed the bsweger/sequence-by-state-date/50 branch from bbe02fb to 2a5dec3 on October 8, 2024 17:03
@bsweger bsweger force-pushed the bsweger/sequence-by-state-date/50 branch from c58b239 to 0e6f05f on October 8, 2024 17:38
The virus-clade-utils repo is now named cladetime. The
virus_clade_utils Python package is now named cladetime.
More information:
reichlab/cladetime#31
@elray1 (Collaborator) left a comment:

Approving this PR, but am noting that we are careful to provide the correct round closing datetime when instantiating a CladeTime object, but as of now I believe this is still being truncated over in CladeTime. So for now, I believe that our data may still be one day off? (not sure, may depend on timezone conversion...)

@bsweger (Collaborator, Author) commented Oct 9, 2024:

> Approving this PR, but am noting that we are careful to provide the correct round closing datetime when instantiating a CladeTime object, but as of now I believe this is still being truncated over in CladeTime. So for now, I believe that our data may still be one day off? (not sure, may depend on timezone conversion...)

Good note! The current state of things is a bit confusing. You're right: cladetime's download_covid_genome_metadata does truncate datetimes that are passed to it. There's still an open issue to fix that.

However, the code in this PR doesn't use that download method, because it works directly against the sequence metadata file on S3 (i.e., no download required). The function used to get the S3 URL does use a full timestamp.

Now that we're using the CladeTime class as our interface, we'll probably end up making this less confusing by updating the download_covid_genome_metadata to have a URL passed to it, instead of having it call _get_s3_object_url directly. Thanks for being patient as we make the transition!

@bsweger bsweger merged commit a9bc303 into main Oct 9, 2024
@bsweger bsweger deleted the bsweger/sequence-by-state-date/50 branch October 9, 2024 16:34
Successfully merging this pull request may close these issues.

Save a list of state/date combinations not to score for a specific variant-nowcast-hub round
3 participants