feat!(ingest, prepro): replace fasta filter by a specific metadata filter and add ANY aligned option for segmented organisms #3512

Open
wants to merge 7 commits into
base: main

anna-parker
Contributor

@anna-parker anna-parker commented Jan 12, 2025

resolves #

preview URL: https://ingest-nextclade-sort.loculus.org/

BREAKING CHANGES

This only affects segmented organisms, which typically use nextclade_sort or nextclade_align for segment assignment. The arguments for each of these two options must now be placed in a dictionary under the corresponding heading, e.g.

nextclade_dataset_name: nextstrain/cchfv/linked
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output

becomes

nextclade_align:
  nextclade_dataset_name: nextstrain/cchfv/linked
  nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
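
For anyone updating an existing config by script, the move described above can be sketched roughly as follows. This is only an illustration working on a plain Python dict: the helper name `nest_nextclade_args` and the `NEXTCLADE_KEYS` tuple are hypothetical and not part of this PR, and the key list is limited to the two keys shown in the example.

```python
from copy import deepcopy

# Only the two keys shown in the example above; a real config may hold more.
NEXTCLADE_KEYS = ("nextclade_dataset_name", "nextclade_dataset_server")


def nest_nextclade_args(config: dict, heading: str = "nextclade_align") -> dict:
    """Return a copy of `config` with the flat nextclade keys moved under `heading`.

    Hypothetical helper for illustration; `heading` would be either
    "nextclade_align" or "nextclade_sort", depending on the organism.
    """
    migrated = deepcopy(config)
    # Start from any values already nested under the heading.
    nested = dict(migrated.get(heading) or {})
    for key in NEXTCLADE_KEYS:
        if key in migrated:
            # Remove the flat key; an already nested value takes precedence.
            nested.setdefault(key, migrated.pop(key))
    if nested:
        migrated[heading] = nested
    return migrated
```

Calling `nest_nextclade_args(old_config)` leaves `old_config` untouched and returns the nested form; pass `heading="nextclade_sort"` for organisms that use sorting instead.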

Summary

  1. This PR replaces the previous filter rule, which filtered sequences by FASTA header, with a more exact filter that is applied to the metadata fields after grouping.
  2. This PR also updates preprocessing to accept multi-segmented sequences where any one of the segments has aligned. Currently we only accept sequences where all uploaded segments have aligned. For example, in an L, M, S segmented organism, if I upload L and M, both must align for the submission to succeed. With this change only one segment needs to align; a segment that did not align is still shown on the webpage, without an alignment. This matters most for 8-segmented organisms such as influenza: an 8-segmented sequence could error even if only one segment failed to align, which meant that we previously had to perform alignment twice, once during ingest and once during prepro.
  3. This makes the ingest config clearer by grouping the nextclade_align and nextclade_sort parameters together, each in its own dictionary.
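
The difference between the old and new acceptance rule in point 2 can be illustrated with a small sketch. This is not the prepro implementation: the function name `submission_passes`, the `require` parameter, and the "ALL" mode name are assumptions made for the example (only "ANY" appears in the PR title), and the input is a simple mapping from uploaded segment name to whether it aligned.

```python
def submission_passes(aligned_by_segment: dict[str, bool], require: str = "ANY") -> bool:
    """Decide whether a segmented submission is accepted.

    `aligned_by_segment` maps each uploaded segment (e.g. "L", "M") to
    True if it aligned. Hypothetical helper, for illustration only.
    """
    results = list(aligned_by_segment.values())
    if not results:
        # Nothing was uploaded, so there is nothing to accept.
        return False
    if require == "ALL":
        # Previous behaviour: every uploaded segment must align.
        return all(results)
    if require == "ANY":
        # New behaviour: a single aligned segment is enough.
        return any(results)
    raise ValueError(f"unknown mode: {require!r}")
```

With L aligned and M not aligned, `submission_passes({"L": True, "M": False})` accepts the submission, while the same input with `require="ALL"` rejects it.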

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jan 12, 2025
@anna-parker anna-parker changed the title feat(ingest): use Ingest nextclade sort for subclades feat(ingest): replace fasta filter by a specific metadata filter - applied after grouping sequences Jan 12, 2025
@anna-parker anna-parker force-pushed the ingest_nextclade_sort branch 2 times, most recently from dde0e45 to c7bec4a Compare January 20, 2025 09:46
@anna-parker anna-parker changed the base branch from main to ingest_memory_fixes January 20, 2025 09:46
@anna-parker anna-parker changed the base branch from ingest_memory_fixes to group_using_assemblies January 20, 2025 09:51
@anna-parker anna-parker force-pushed the group_using_assemblies branch 2 times, most recently from 0aeed39 to 985ea7f Compare January 30, 2025 09:42
@anna-parker anna-parker force-pushed the ingest_nextclade_sort branch 2 times, most recently from 74d5eb2 to 75bad34 Compare January 30, 2025 13:09
@anna-parker anna-parker changed the title feat(ingest): replace fasta filter by a specific metadata filter - applied after grouping sequences feat!(ingest, prepro): replace fasta filter by a specific metadata filter and add ANY aligned option for segmented organisms Jan 30, 2025
@anna-parker anna-parker force-pushed the group_using_assemblies branch from d5b8e1f to 7ba6f61 Compare January 30, 2025 14:55
@anna-parker anna-parker force-pushed the ingest_nextclade_sort branch from 180ac9b to 617b985 Compare January 30, 2025 15:00
@anna-parker anna-parker requested review from corneliusroemer and fhennig and removed request for corneliusroemer January 30, 2025 15:14
@anna-parker anna-parker marked this pull request as ready for review January 30, 2025 15:14
@anna-parker anna-parker requested review from corneliusroemer and removed request for fhennig January 30, 2025 15:14
Base automatically changed from group_using_assemblies to main January 30, 2025 16:35
@anna-parker anna-parker force-pushed the ingest_nextclade_sort branch from 617b985 to 9d46001 Compare January 30, 2025 18:14
Labels
preview Triggers a deployment to argocd