-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prepare assign_clades CLI to use functionality in CladeTime and Tree classes #42
Conversation
This functionality of this file is specific to the variant-nowcast-hub, so it's been moved there: reichlab/variant-nowcast-hub#123
The assign_clades CLI will no longer get its sequence data from the NCBI API. Instead, it will use methods from the newer CladeTime class to retrieve sequence data and metadata from Nextstrain. This changeset cleans up the NCBI compoments from the library and mocks out a new process for the assign_clades CLI. Once the changes are merged, the assign_clades CLI won't actually do anything, but the scaffolding will be in place to build out a newer version based on the building blocks that we're adding to the CladeTime and Tree classes.
8bd9623
to
3661b93
Compare
] | ||
) | ||
|
||
logger.info("Assigned sequences to clades via Nextclade CLI", output_file=config.assignment_no_metadata_file) | ||
logger.info("Assigned sequences to clades via Nextclade CLI", output_file="some path stuff") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A temporary placeholder so the logger doesn't error when referencing a config attribute that no longer exists.
merged_data = merge_metadata(config) | ||
merged_data.write_csv(config.assignment_file) | ||
|
||
with tempfile.TemporaryDirectory() as tmpdir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike prior versions of assign_clade
, this new thinking assumes that we can save the intermediate files required for clade assignment in a temporary directory and only worry about saving the final output(s) to disk.
data_path_root: str | None, | ||
): | ||
if data_path_root: | ||
self.data_path = AnyPath(data_path_root) | ||
self.data_path = Path(data_path_root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AnyPath wasn't really helping us but was responsible for a lot of annoying type errors, so switched back to good old Path
session: Session, bucket: str, key: str, data_path: Path, as_of: str | None = None, use_existing: bool = False | ||
) -> Path: | ||
"""Download the latest GenBank genome metadata data from Nextstrain.""" | ||
def _download_from_url(session: Session, url: str, data_path: Path) -> Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that CladeTime
provides the S3 URLs that point to the correct versions of sequence data and metadata, we can replace download_covid_genome_metadata
with a more generic "download data using a given URL" function.
@@ -67,31 +65,6 @@ def test_get_covid_genome_metadata_url(s3_setup, test_file_path, metadata_file): | |||
assert isinstance(metadata, pl.LazyFrame) | |||
|
|||
|
|||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these tests because the function is gone and was been replaced by something that doesn't have any logic (it just downloads data using the requests library)
@@ -2,7 +2,11 @@ | |||
from unittest.mock import MagicMock, patch | |||
|
|||
import pytest | |||
from cladetime.get_clade_list import main | |||
|
|||
pytest.importorskip( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this parameterized test for checking our "get_clade_list" logic. However, that script now (rightfully) lives in the variant nowcast hub.
Need to do some more thinking about how to incorporate these tests in the other repo, where we don't necessary have a python/pytest environment set up.
In the meantime, added this code to skip the test file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved
Closes #19
Background
This PR lays the groundwork for upcoming "needed for eval" cladetime issues.
Even though it will be possible to write a "get target data script" using cladetime's new CladeTime and Tree classes, @elray1 and I decided to retain the small
assign_clades
CLI for usability.Specific changes
There's also a few one-off tidying steps like removing the
get_clade_list
script, which has moved to variant-nowcast-hub and marking some of the functions as "private"