Translate & Align V-Pipe Reads #53

Open
gordonkoehn opened this issue Nov 29, 2024 · 12 comments · May be fixed by #78

@gordonkoehn
Collaborator

gordonkoehn commented Nov 29, 2024

Currently, we use Nextclade to translate from nucleotides to amino acids.

We also use it to realign the nucleotides and amino acids, even though our nucleotides are already aligned.

This seemed fine for small test data, yet realigning the reads again may be infeasible, or simply a waste of resources, at full scale.

So consider rewriting the translate functions.

@gordonkoehn
Collaborator Author

gordonkoehn commented Nov 29, 2024

Potential solution: call the functions within Nextclade that run after the alignment.

  • Reach out to the Nextclade devs

@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 2, 2024

See their response:

In particular from rneher

A simple script could translate those already aligned reads. You'd need to figure out which ORFs your read falls into, the reading frame, and then use something like Bio.Sequence.translate from the biopython package to translate the sequence.
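A minimal sketch of what such a script could look like, assuming each read is already a string with one character per reference position (`-` for deletions) and that the ORF coordinates are known. The function name `translate_aligned_read` and its parameters are hypothetical; the codon table is written out so the block has no dependencies, and Biopython's `Bio.Seq.translate` could take its place. Only forward-strand ORFs are handled and insertions are ignored.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_aligned_read(read, read_start, orf_start, orf_end):
    """Translate the part of an aligned read that overlaps one ORF.

    read       -- aligned nucleotides, one character per reference position
    read_start -- 0-based reference position of read[0]
    orf_start, orf_end -- 0-based, half-open ORF coordinates (forward strand)

    Returns (index of the first translated codon within the ORF, amino acids),
    or (None, "") if the read does not cover a complete codon of the ORF.
    """
    start = max(read_start, orf_start)
    end = min(read_start + len(read), orf_end)
    # Move forward to the next codon boundary of the ORF (the reading frame).
    start += (orf_start - start) % 3
    # Drop a trailing partial codon.
    end -= (end - start) % 3
    if end <= start:
        return None, ""

    amino_acids = []
    for pos in range(start, end, 3):
        offset = pos - read_start
        codon = read[offset:offset + 3].upper()
        if codon == "---":
            amino_acids.append("-")  # whole codon deleted
        else:
            # Ambiguous bases or partially deleted codons become X.
            amino_acids.append(CODON_TABLE.get(codon, "X"))
    return (start - orf_start) // 3, "".join(amino_acids)


print(translate_aligned_read("CCATGAAA---TGGCC", 1, 3, 15))
```

Codons that are only partially deleted are reported as `X` here; whether that is the right call for frameshifting deletions is an open question this sketch does not answer.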

@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 2, 2024

So, we probably need a custom tool, yet I am unsure how to get the new amino acid positions.

Plus, in general, I don't know how to deal with the insertions.

Can I just translate the nucleotide insertions into amino-acid insertions?
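As a rough way to frame the question, the sketch below only looks at the reading frame: an insertion whose length is not a multiple of three shifts the frame, one that is a multiple of three and sits on a codon boundary maps cleanly to inserted amino acids, and one that sits inside a codon also changes that codon. The function name `classify_insertion` and its labels are hypothetical, and this is frame arithmetic only, not a recommendation on how insertions should be handled biologically.

```python
def classify_insertion(ins_pos, ins_seq, orf_start):
    """Classify a nucleotide insertion relative to an ORF's reading frame.

    ins_pos   -- 0-based reference position before which the bases are inserted
    ins_seq   -- the inserted nucleotides
    orf_start -- 0-based reference position of the ORF's first base
    """
    if len(ins_seq) % 3 != 0:
        # Everything downstream is read in a different frame.
        return "frameshift"
    if (ins_pos - orf_start) % 3 == 0:
        # Whole codons inserted between two existing codons.
        return "in_frame_between_codons"
    # Whole number of codons, but the interrupted codon is altered as well.
    return "in_frame_within_codon"


print(classify_insertion(9, "GGA", 3))
print(classify_insertion(10, "GGA", 3))
print(classify_insertion(9, "GG", 3))
```

Only the `in_frame_between_codons` case can be translated codon by codon without touching the neighbouring amino acids.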

@gordonkoehn gordonkoehn added the bug Something isn't working label Dec 2, 2024
@gordonkoehn gordonkoehn changed the title Remove Realignment by NextClade, Scale Performance to Fullsize BAM Remove Realignment by NextClade, Write Custom Amino Acid Translation Dec 2, 2024
@gordonkoehn gordonkoehn changed the title Remove Realignment by NextClade, Write Custom Amino Acid Translation Remove Realignment by NextClade, Write Custom Amino Acid Translation/Insertion Dec 2, 2024
@gordonkoehn
Collaborator Author

@DrYak, for now, I am ignoring insertions completely. There are too many biological unknowns for me here. I'd cherish your advice here sometime.

@gordonkoehn
Collaborator Author

The issue:

Image

@gordonkoehn
Collaborator Author

Need some input before I can continue.

@gordonkoehn gordonkoehn added the help wanted Extra attention is needed label Dec 3, 2024
@gordonkoehn gordonkoehn self-assigned this Dec 3, 2024
@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 9, 2024

Take-aways from chat with @LaraFuhrmann

  • BIG CAVEAT: uploading the .bam reads to SILO discards all of V-Pipe's mutation calling efforts, so the database will contain sequencing artefacts.
  • Yet anyone will still be able to query the data quickly, which is useful in urgent situations.
  • As a rough estimate, V-Pipe's computational effort is 15 % preprocessing / 25 % nucleotide alignment / 60 % mutation calling.
  • The current plan is hence to take all .bam files, align the single reads with Nextclade and, once done, merge the paired reads.
  • Basically, we only use V-Pipe's preprocessing and then Nextclade's alignment; at the end, we again have V-Pipe's post-processing.

So actionable:

  • Implement running Nextclade on batches of reads, so that it still runs in a small Docker container, just for longer.
  • this means sr2silo is a computationally expensive step
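The batching idea could be sketched as below: a generator that yields fixed-size chunks so that only one batch is held in memory at a time. The name `iter_batches` and the batch size are hypothetical, and the actual Nextclade call per batch is deliberately left out.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Each batch would be written out and handed to the aligner in turn.
for batch in iter_batches(["read1", "read2", "read3", "read4", "read5"], 2):
    print(len(batch), batch)
```

Memory use is then bounded by the batch size rather than by the size of the .bam file, at the cost of a longer total runtime.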

@DrYak FYI. Let's also chat about this before I do it.

@gordonkoehn gordonkoehn changed the title Remove Realignment by NextClade, Write Custom Amino Acid Translation/Insertion Translate & Align V-Pipe Reads Dec 9, 2024
@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 11, 2024

Take-aways from chat with @DrYak:

There exists, indeed, no code for this as of now. This is because most of the time, people will use mutations called on nucleotides and only translate these mutations, not entire alignments.

There are two options:

    1. build it from the ground up (with pysam and proper logic to handle all coding frames and corner cases)
    2. hack Nextclade and hook into the process after the alignment they do

In both cases, the steps would be to take a .sam of the single reads and make a .sam with paired reads, i.e. using Micha's tool.

Once that exists, translate it into an amino acid alignment with either option.

Option 1) would probably take me 1-2 months to handle properly (hard to estimate, given my lack of experience).

Option 2) could be quick and handle all corner cases.

@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 17, 2024

This is a tool that Niko shared

It appears to do the alignment itself, so it is not what we want, but it reads as though the translation comes with some difficulties. That supports my point about using a well-supported tool: Nextclade is then the better choice if I hack something.

@gordonkoehn
Collaborator Author

gordonkoehn commented Dec 17, 2024

  • How does V-Pipe align? Is that better in any way than Nextclade?

Ivan mentioned some probabilistic work in the alignment and theorised that Nextclade might do something more simplistic.

@gordonkoehn gordonkoehn linked a pull request Jan 16, 2025 that will close this issue
@gordonkoehn
Collaborator Author

Managed to get a clear-text alignment, i.e. of equal length to the reference, as well as the insertions per read.

These strings are now ready to be passed over to Nextclade.
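For illustration, one way such a reference-length string and the per-read insertions can be derived from a read's sequence, CIGAR string and start position is sketched below. This is not the code from the linked pull request; the function name `pad_alignment`, the `fill` character for uncovered positions and the `(position, bases)` insertion format are all assumptions.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def pad_alignment(seq, cigar, pos, ref_length, fill="N"):
    """Expand one read to reference length and collect its insertions.

    seq        -- read bases as stored in the SAM/BAM record
    cigar      -- CIGAR string, e.g. "2S3M1I2D2M"
    pos        -- 0-based leftmost reference position (SAM's POS field is 1-based)
    ref_length -- length of the reference sequence
    fill       -- character for reference positions the read does not cover
    """
    aligned = []
    insertions = []
    ref_pos = pos
    seq_pos = 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":  # consumes read and reference
            aligned.append(seq[seq_pos:seq_pos + length])
            seq_pos += length
            ref_pos += length
        elif op == "I":  # consumes read only: record it separately
            insertions.append((ref_pos, seq[seq_pos:seq_pos + length]))
            seq_pos += length
        elif op in "DN":  # consumes reference only: gap in the read
            aligned.append("-" * length)
            ref_pos += length
        elif op == "S":  # soft clip: bases present in seq but not aligned
            seq_pos += length
        # H and P consume neither read nor reference.
    body = "".join(aligned)
    padded = fill * pos + body + fill * (ref_length - pos - len(body))
    return padded, insertions


print(pad_alignment("ACGTTTGA", "2S3M1I2D2M", 2, 12))
```

Keeping the insertions out of the padded string is what keeps every read at exactly the reference length.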

@gordonkoehn
Collaborator Author

Progress update from Nextclade side

  • Nextclade is parallelized; it would be hard to hack its CLI with limited Rust knowledge
  • stripped contains the input I can provide from the sr2silo side. See here
  • translation and alignment of amino acids !! happens here

Realignment has to happen if there are frameshifts; this was not obvious to me at first.

It appears best to import Nextclade as a crate and run these functions as a mini script?
