Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare assign_clades CLI to use functionality in CladeTime and Tree classes #42

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Oct 24, 2024

Closes #19

Background

This PR lays the groundwork for upcoming "needed for eval" cladetime issues.

Even though it will be possible to write a "get target data script" using cladetime's new CladeTime and Tree classes, @elray1 and I decided to retain the small assign_clades CLI for usability.

Specific changes

  • Remove NCBI-related code
  • Clean up the Config class
  • For assign_clades steps that already have a CladeTime counterpart, switch to using CladeTime object
  • Create stubs for assign_clades steps that require not-yet-implemented-or-merged CladeTime/Tree features

There's also a few one-off tidying steps like removing the get_clade_list script, which has moved to variant-nowcast-hub and marking some of the functions as "private"

This functionality of this file is specific to the
variant-nowcast-hub, so it's been moved there:
reichlab/variant-nowcast-hub#123
The assign_clades CLI will no longer get its sequence data
from the NCBI API. Instead, it will use methods from the
newer CladeTime class to retrieve sequence data and metadata
from Nextstrain.

This changeset cleans up the NCBI compoments from the library
and mocks out a new process for the assign_clades CLI. Once
the changes are merged, the assign_clades CLI won't actually
do anything, but the scaffolding will be in place to build
out a newer version based on the building blocks that we're
adding to the CladeTime and Tree classes.
@bsweger bsweger force-pushed the bsweger/new-clade-assign-scaffolding/6 branch from 8bd9623 to 3661b93 Compare October 24, 2024 14:12
]
)

logger.info("Assigned sequences to clades via Nextclade CLI", output_file=config.assignment_no_metadata_file)
logger.info("Assigned sequences to clades via Nextclade CLI", output_file="some path stuff")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A temporary placeholder so the logger doesn't error when referencing a config attribute that no longer exists.

merged_data = merge_metadata(config)
merged_data.write_csv(config.assignment_file)

with tempfile.TemporaryDirectory() as tmpdir:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike prior versions of assign_clade, this new thinking assumes that we can save the intermediate files required for clade assignment in a temporary directory and only worry about saving the final output(s) to disk.

data_path_root: str | None,
):
if data_path_root:
self.data_path = AnyPath(data_path_root)
self.data_path = Path(data_path_root)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnyPath wasn't really helping us but was responsible for a lot of annoying type errors, so switched back to good old Path

session: Session, bucket: str, key: str, data_path: Path, as_of: str | None = None, use_existing: bool = False
) -> Path:
"""Download the latest GenBank genome metadata data from Nextstrain."""
def _download_from_url(session: Session, url: str, data_path: Path) -> Path:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that CladeTime provides the S3 URLs that point to the correct versions of sequence data and metadata, we can replace download_covid_genome_metadata with a more generic "download data using a given URL" function.

@@ -67,31 +65,6 @@ def test_get_covid_genome_metadata_url(s3_setup, test_file_path, metadata_file):
assert isinstance(metadata, pl.LazyFrame)


@pytest.mark.parametrize(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these tests because the function is gone and was been replaced by something that doesn't have any logic (it just downloads data using the requests library)

@@ -2,7 +2,11 @@
from unittest.mock import MagicMock, patch

import pytest
from cladetime.get_clade_list import main

pytest.importorskip(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this parameterized test for checking our "get_clade_list" logic. However, that script now (rightfully) lives in the variant nowcast hub.

Need to do some more thinking about how to incorporate these tests in the other repo, where we don't necessary have a python/pytest environment set up.

In the meantime, added this code to skip the test file.

@bsweger bsweger requested a review from elray1 October 24, 2024 14:51
Copy link
Collaborator

@elray1 elray1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved

@bsweger bsweger merged commit 31b982f into main Oct 24, 2024
2 checks passed
@bsweger bsweger deleted the bsweger/new-clade-assign-scaffolding/6 branch October 24, 2024 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Get genome sequence and its metadata by downloading Nextstrain files instead of via the NCBI API
2 participants