Generate a list of sequence collections by date and location #92
Conversation
By default, the dates included will be 31 days prior to the date of the round that just closed (the script will be scheduled to run on Wednesday evenings)
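For concreteness, a minimal sketch of that windowing logic in Python; the close date is hypothetical, and whether the window includes the close date itself is an assumption:

```python
from datetime import date, timedelta

# Hypothetical round close date (a Wednesday); the real script derives
# this from the round that just closed.
round_close = date(2024, 11, 13)

# The 31 days prior to the close date (the close date itself is excluded
# here; adjust the range if it should be included).
target_dates = [round_close - timedelta(days=d) for d in range(31, 0, -1)]

print(target_dates[0], "->", target_dates[-1])  # 2024-10-13 -> 2024-11-12
```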
This is the output of the script that counts sequence collections by date and location.
@nickreich would love your thoughts on the notes below.
- I didn't filter out the zeroes, because it seemed easier for a user to work with a complete dataset than have to infer from missing data which states and days are eligible for scoring.
- The data originates from a Polars DataFrame, so we can write it out as JSON, Parquet, etc. It doesn't have to be a .csv (see the sketch below).
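A minimal Polars sketch of the count-and-write step; the column names, sample values, and file names here are illustrative assumptions, not the script's actual identifiers:

```python
import polars as pl

# Hypothetical input: one row per sequence, with US state and collection date.
sequences = pl.DataFrame(
    {
        "location": ["MA", "MA", "NY"],
        "date": ["2024-11-01", "2024-11-01", "2024-11-02"],
    }
)

# Count collections per location/date. Zero-count rows would come from
# joining against the full location x date grid (not shown here).
counts = sequences.group_by(["location", "date"]).agg(pl.len().alias("count"))

# A Polars DataFrame can be written out in any of these formats.
counts.write_csv("counts.csv")
counts.write_parquet("counts.parquet")
counts.write_json("counts.json")
```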
- I agree with not filtering out zeroes. And I like that we are storing the counts themselves, as this might allow us later on to compute what fraction of counts had been observed at the forecast due date.
- I don't have a strong feeling about .csv vs. .parquet. I'd suggest against JSON, though, for simpler tabular readability in R/Python.
- The more tidy-data way to store this would be to have a column for location, a column for date, and a column for count. Given that all of the evaluation/model-output will work in this way, I suggest reformatting into this "longer" version of the data (see the sketch below). I would suggest naming the columns `location`, `target_date`, and `count`, so that it could be merged into a model-output or target-data file as-is at some point down the road. Would making the data "tidy" increase the size of the files? I think even if it would, it might be worth doing.
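To make the suggestion concrete, a sketch of the wide-to-long reshape in Polars, assuming the current format has one row per location and one column per date (the sample data is hypothetical):

```python
import polars as pl

# Hypothetical "wide" layout: one row per location, one column per date.
wide = pl.DataFrame(
    {
        "location": ["MA", "NY"],
        "2024-11-01": [2, 0],
        "2024-11-02": [0, 1],
    }
)

# Reshape to the tidy "long" layout with the suggested column names.
# (unpivot is the Polars >= 1.0 name; older versions call this melt.)
tidy = wide.unpivot(
    index="location",
    variable_name="target_date",
    value_name="count",
)
print(tidy)  # columns: location, target_date, count -- one row per pair
```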
Thanks for the feedback @nickreich. Can definitely put the data into 3rd normal form--it's so small and will always be limited to 31 days by definition, so am not worried about size. I think the current format is what @elray1 and I discussed, so hopefully he can chime in (it's also entirely possible that I'm misremembering things!)
I think Nick's suggestions make sense. Very minor -- what do you think about a folder name of unscored-location-dates? (unscored rather than unscore)
…t time
This change will make the script more precise, and more accurate should we need to re-run it. Rather than using the version of the genome sequence metadata at the time the script is run, this update explicitly uses the sequence metadata as it existed when the most recent round closed. The update generated a slightly different version of the resulting data, which is included in the commit.
Although the documentation considers streaming to be "unstable", this changeset adds it as a hedge against out-of-memory errors (a sketch below).
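For reference, the streaming hedge in Polars looks roughly like this; the file name and column names are placeholders:

```python
import polars as pl

# Scan the metadata lazily instead of reading it all into memory.
lazy = pl.scan_parquet("sequence_metadata.parquet")

counts = (
    lazy.group_by(["location", "date"])
    .agg(pl.len().alias("count"))
    # streaming=True executes the query in batches so it can handle data
    # that would not fit in RAM; Polars marks this engine as unstable,
    # hence "hedge" rather than guarantee.
    .collect(streaming=True)
)
```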
Introduce another GitHub workflow that runs on Wednesdays after a round closes. It generates a count of SARS-CoV-2 sequences collected over the past 31 days (grouped by location and day).
uv is a drop-in replacement for the setup-python action, so we don't need the latter
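A rough sketch of what the job steps might look like with astral-sh/setup-uv standing in for setup-python; the job name, script path, and Python version are placeholders, not the workflow's actual contents:

```yaml
jobs:
  count-sequences:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - run: uv python install 3.12  # uv installs Python, so no setup-python
      - run: uv run scripts/count_sequences.py  # placeholder path
```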
on:
  schedule:
    # Not precise, but GitHub Actions don't support time zones
    - cron: "0 02 * 11-12,1-3 3" # 2 AM UTC every Wed (Nov-Dec, Jan-Mar)
Timezones 🙀
Rounds close at 8 PM US Eastern, and GitHub Actions run on UTC, so I tried to come as close as possible while accounting for daylight saving time (not exact, obviously)
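A small worked example of the conversion, using Python's zoneinfo (the dates are hypothetical Wednesdays):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")

# 8 PM Eastern on two Wednesdays: one under EST, one under EDT.
for d in (datetime(2024, 11, 13, 20, tzinfo=eastern),
          datetime(2025, 3, 12, 20, tzinfo=eastern)):
    print(d, "->", d.astimezone(timezone.utc))
# EST: Wed 8 PM -> Thu 01:00 UTC; EDT: Wed 8 PM -> Thu 00:00 UTC,
# so no single fixed UTC cron time matches 8 PM Eastern year-round.
```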
Force-pushed from bbe02fb to 2a5dec3
Force-pushed from c58b239 to 0e6f05f
The virus-clade-utils repo is now named cladetime. The virus_clade_utils Python package is now named cladetime. More information: reichlab/cladetime#31
Approving this PR, but am noting that we are careful to provide the correct round closing datetime when instantiating a CladeTime object, but as of now I believe this is still being truncated over in CladeTime. So for now, I believe that our data may still be one day off? (not sure, may depend on timezone conversion...)
Good note! The current state of things is a bit confusing. You're right: cladetime's download method still truncates the timestamp. However, the code in this PR doesn't use that download method, because it works directly against the sequence metadata file on S3 (i.e., no download required). The function used to get the S3 URL does use a full timestamp. Now that we're using the CladeTime class as our interface, we'll probably end up making this less confusing by updating the download method as well.
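To illustrate why truncation can leave the data a day off, a plain-Python sketch; the truncation behavior attributed to cladetime here is the one described in this thread, not verified against its code:

```python
from datetime import datetime, timezone

# Round close: Wednesday 8 PM US Eastern (EST) is 01:00 UTC Thursday.
round_close = datetime(2024, 11, 14, 1, 0, tzinfo=timezone.utc)

# Truncating to date granularity resets the as-of point to midnight,
# an hour before the round actually closed.
truncated = round_close.replace(hour=0, minute=0, second=0, microsecond=0)

print(round_close)  # 2024-11-14 01:00:00+00:00
print(truncated)    # 2024-11-14 00:00:00+00:00
# A metadata snapshot published between these two instants is covered by
# the full timestamp but missed by the truncated one (or vice versa,
# depending on how the truncation rounds), shifting the selected
# metadata version by up to a day.
```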
DO NOT MERGE UNTIL virus-clade-utils rename has been merged. Done! Closes #50
WIP so that the people using this information when scoring can take a look at the output. Will add some annotations inline.
Remaining work before the PR comes out of draft:
- virus-clade-utils