-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
padded_alignment() produces invalid output #40
Comments
Thanks for submitting this. I've been able to replicate the issue with the sample code you have provided, and will investigate further. |
Hi @rishidev, any news on that bug by any chance? Solving it would really help dealing with long, e.g. Nanopore, reads. |
Hi @AlexanderDilthey, apologies, I have moved on to a different project now, but this is the last thing I have to fix. The SAM and CRAM behaviour are quite different in the code. I think I've identified where the issue is occurring, but need further clarification to see what the fix is supposed to be. This seems to be around the CIGAR value. The SAM/BAM contain the sequence themselves so don't have the issue. The CRAM reads reference a CIGAR string with a P in it which is 'a silent deletion from the padded reference'. The padded_alignment call looks like it is returning a padded value but the string is being compared to something that hasn't analysed the P when it's chomping out letters, but I want to go through the spec to confirm that making it do so is the correct decision. |
Unfortunately, I'm not able to replicate this, I get (v2.9 with htslib 1.7): [E::cram_get_ref] Failed to populate reference for id 20 with 2.7 and/or htslib 1.5 is the same. @AlexanderDilthey, @rishidev can you confirm you still get the same invalid output and warnings? |
@avullo Definitely still relevant - I think the pairwise alignment feature is one of the most useful of all! I won't be able to check whether it reproduces with 1.7 immediately, but it's on my list! |
Thanks @AlexanderDilthey for letting me know. I'm gonna try with 1.5, but can you tell me your set up (OS/Bio::DB::HTS/htslib version)? Forgot to mention, #54 is resolved. |
From the module's README: This is probably an indication that the CRAM Reference archive is unavailable. |
@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl. http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES |
@keiranmraine thanks for the trust but you shouldn't be sure: I'm not (yet) well versed with these tools; I was asked to contribute since I have XS experience ;) |
ENA you mean @keiranmraine. Ensembl doesn't host the CRAM reference registry
… On 7 Mar 2018, at 12:08, Keiran Raine ***@***.***> wrote:
@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.
http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Sorry yes, ENA at EBI. |
Hi @AlexanderDilthey, little but not negligible progress so far. I and @rishidev managed to isolate the problem and we think fixing would require a reimplementation of the way the reference sequence is reconstructed using the CIGAR and MD strings together (borrowed from the Bio::DB::Sam ancestor module). This is not entirely trivial so bear with me until I properly attack the problem. But you have a workaround: just instantiate the HTS instance passing the -force_refseq option. This way, it will use the indexed fasta file so you get the correct reference properly aligned to the read. Let me know if that suits you for your immediate needs. I'm going to address the problem anyway. |
Hi @avullo, It seems that I now have the opposite problem when using -force_refseq. I instantiate with my $sam = Bio::DB::HTS->new(-fasta => $reference, -bam => $BAM, -force_refseq => 1); ... and then later retrieve the alignment via: my ($ref, $matches, $query) = $inputAlignment->padded_alignment; ... then $query is string of Ns. |
I'm noticing the same issue discussed here by the OP, @AlexanderDilthey This function doesn't work always. It's particularly a problem with secondary/supplementary alignments (1) Has this issue been resolved? I haven't tried the code posted above in April 2017, but it appears to be a reproducible example---does this code still not work? (2) It's somewhat urgent---if you're willing to walk me through the issue at hand, I could try to write up a patch. If it's easier, we could discuss this via e-mail. Thanks, Evan |
HI Evan, it's been a pretty intense period doing other things under deadlines. The newest code doesn't address this issue, so it's not resolved but is on the spotlight. Before walking through this, I need to recollect thoughts that I had when I first looked at it. I will try to allocate some time next week. When would you absolutely need it solved? cheers Alessandro |
Hi @avullo I'm working on using this code now. If it's easier, please collect your notes on this issue, and we can discuss this via e-mail. I will CC @AlexanderDilthey I'm very happy to help patching this up. Thanks, Evan |
from what I vaguely recollect, the problem is that the code doesn't correctly reconstruct the sequence in the absence of the reference, that's why the problem occurs for CRAM but not in the other cases. |
Hi @avullo I've reproduced the example from @AlexanderDilthey as follows:
It is true that the issue exists for the CRAM @AlexanderDilthey provided, i.e. here's the error:
However, the bug is not CRAM-dependent; it exists for BAM as well. |
That's the error I was getting when initially trying to reproduce the error, but then I've learned you need to set REF_PATH env variable, as pointed out by @keiranmraine, see http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES. |
Thanks for the reply. I'm a bit confused: in order to resolve the issue raised above, you set the environment variable that is
|
Hi @avullo Apologies if you're busy. Just to clarify, by this statement
do you mean that if you set the |
Not a problem @evanbiederstedt I meant, you don't get the error: but you do get the warning originally reported by @AlexanderDilthey, which point to the existence of bugs. |
Hi @avullo I see what you mean now. Just to clarify for future reference: When the variable
When you do set the variable, here is the output:
|
It's the other way around for me, but I use this: Do you still get the error on not CRAM? |
Hi @avullo Thanks for the help. Perhaps this is an elementary question (and somewhat outside this github issue), but I'm still deeply confused how MD5 checksums solve this issue. Is this is fundamental property of the CRAM format? I'm used to MD5 checksums in order to check whether something has downloaded correctly... |
The md5 of each chr seq is used to store a reference to it in the header of the BAM/CRAM file. This ensures that the sequence being used to reconstruct the read seq is actually the correct one, and not a different build or something similar. The http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES |
Thanks for the help! I appreciate the explanation, and I think I understand now how this works. |
Hi,
The padded_alignment() function sometimes produces invalid output, in that the strings that specify the pairwise alignment are not of equal length.
Interestingly, this only happens (on the same alignment) when using CRAM format -- BAM and SAM are fine.
I also get these warnings:
substr outside of string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.
Files to reproduce this error:
Reference genome GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa: https://gembox.cbcb.umd.edu/shared/graph_hackathon/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
CRAM/SAM with one test alignment: https://www.dropbox.com/s/2dbb413ulfrhpma/testCase.zip?dl=0
Code:
use strict;
use warnings;
use Data::Dumper;
use Bio::DB::HTS;
my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa';
my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');
my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq)
{
my ($ref,$matches,$query) = $alignment->padded_alignment;
unless(length($ref) == length($query))
{
warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name);
next;
}
}
The text was updated successfully, but these errors were encountered: