Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

padded_alignment() produces invalid output #40

Open
AlexanderDilthey opened this issue Apr 24, 2017 · 27 comments
Open

padded_alignment() produces invalid output #40

AlexanderDilthey opened this issue Apr 24, 2017 · 27 comments
Assignees
Labels

Comments

@AlexanderDilthey
Copy link

Hi,

The padded_alignment() function sometimes produces invalid output, in that the strings that specify the pairwise alignment are not of equal length.

Interestingly, this only happens (on the same alignment) when using CRAM format -- BAM and SAM are fine.

I also get these warnings:

substr outside of string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /home/diltheyat/perl5/lib/perl5/x86_64-linux-thread-multi/Bio/DB/HTS/AlignWrapper.pm line 307.

Files to reproduce this error:

Reference genome GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa: https://gembox.cbcb.umd.edu/shared/graph_hackathon/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa

CRAM/SAM with one test alignment: https://www.dropbox.com/s/2dbb413ulfrhpma/testCase.zip?dl=0

Code:

use strict;
use warnings;
use Data::Dumper;
use Bio::DB::HTS;

my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa';
my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');

my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq)
{
my ($ref,$matches,$query) = $alignment->padded_alignment;
unless(length($ref) == length($query))
{
warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name);
next;
}
}

@rishidev
Copy link
Collaborator

Thanks for submitting this. I've been able to replicate the issue with the sample code you have provided, and will investigate further.

@rishidev rishidev self-assigned this Jun 17, 2017
@AlexanderDilthey
Copy link
Author

Hi @rishidev, any news on that bug by any chance? Solving it would really help dealing with long, e.g. Nanopore, reads.

@rishidev
Copy link
Collaborator

Hi @AlexanderDilthey, apologies, I have moved on to a different project now, but this is the last thing I have to fix. The SAM and CRAM behaviour are quite different in the code. I think I've identified where the issue is occurring, but need further clarification to see what the fix is supposed to be.

This seems to be around the CIGAR value. The SAM/BAM contain the sequence themselves so don't have the issue. The CRAM reads reference a CIGAR string with a P in it which is 'a silent deletion from the padded reference'. The padded_alignment call looks like it is returning a padded value but the string is being compared to something that hasn't analysed the P when it's chomping out letters, but I want to go through the spec to confirm that making it do so is the correct decision.

@avullo
Copy link

avullo commented Mar 6, 2018

Unfortunately, I'm not able to replicate this, I get (v2.9 with htslib 1.7):

[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice

with 2.7 and/or htslib 1.5 is the same.

@AlexanderDilthey, @rishidev can you confirm you still get the same invalid output and warnings?
@AlexanderDilthey is this still relevant for you?

@AlexanderDilthey
Copy link
Author

@avullo Definitely still relevant - I think the pairwise alignment feature is one of the most useful of all! I won't be able to check whether it reproduces with 1.7 immediately, but it's on my list!

@avullo
Copy link

avullo commented Mar 7, 2018

Thanks @AlexanderDilthey for letting me know. I'm gonna try with 1.5, but can you tell me your set up (OS/Bio::DB::HTS/htslib version)?

Forgot to mention, #54 is resolved.

@avullo
Copy link

avullo commented Mar 7, 2018

From the module's README:
4. "Failed to populate reference for id X" error message

This is probably an indication that the CRAM Reference archive is unavailable.

@keiranmraine
Copy link
Contributor

@avullo, I'm sure you're already aware of this but check REF_PATH and REF_CACHE env values as well as any proxy settings. If they are correct the last thing it could be is the ref server is down at ensembl.

http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

@avullo
Copy link

avullo commented Mar 7, 2018

@keiranmraine thanks for the trust but you shouldn't be sure: I'm not (yet) well versed with these tools; I was asked to contribute since I have XS experience ;)

@andrewyatz
Copy link
Member

andrewyatz commented Mar 7, 2018 via email

@keiranmraine
Copy link
Contributor

Sorry yes, ENA at EBI.

@avullo avullo self-assigned this Mar 12, 2018
@avullo avullo added the bug label Mar 12, 2018
@avullo
Copy link

avullo commented Mar 15, 2018

Hi @AlexanderDilthey, little but not negligible progress so far.

I and @rishidev managed to isolate the problem and we think fixing would require a reimplementation of the way the reference sequence is reconstructed using the CIGAR and MD strings together (borrowed from the Bio::DB::Sam ancestor module). This is not entirely trivial so bear with me until I properly attack the problem.

But you have a workaround: just instantiate the HTS instance passing the -force_refseq option. This way, it will use the indexed fasta file so you get the correct reference properly aligned to the read.

Let me know if that suits you for your immediate needs. I'm going to address the problem anyway.

@AlexanderDilthey
Copy link
Author

Hi @avullo,

It seems that I now have the opposite problem when using -force_refseq.

I instantiate with

my $sam = Bio::DB::HTS->new(-fasta => $reference, -bam => $BAM, -force_refseq => 1);

... and then later retrieve the alignment via:

my ($ref, $matches, $query) = $inputAlignment->padded_alignment;

... then $query is string of Ns.

@evanbiederstedt
Copy link

Hello @avullo, @rishidev

I'm noticing the same issue discussed here by the OP, @AlexanderDilthey

This function doesn't work always. It's particularly a problem with secondary/supplementary alignments

(1) Has this issue been resolved? I haven't tried the code posted above in April 2017, but it appears to be a reproducible example---does this code still not work?

(2) It's somewhat urgent---if you're willing to walk me through the issue at hand, I could try to write up a patch.

If it's easier, we could discuss this via e-mail. Thanks, Evan

@avullo
Copy link

avullo commented Aug 23, 2018

HI Evan,

it's been a pretty intense period doing other things under deadlines. The newest code doesn't address this issue, so it's not resolved but is on the spotlight.

Before walking through this, I need to recollect thoughts that I had when I first looked at it. I will try to allocate some time next week.

When would you absolutely need it solved?

cheers

Alessandro

@evanbiederstedt
Copy link

Hi @avullo

I'm working on using this code now. If it's easier, please collect your notes on this issue, and we can discuss this via e-mail. I will CC @AlexanderDilthey

I'm very happy to help patching this up.

Thanks, Evan

@avullo
Copy link

avullo commented Aug 23, 2018

Ok @evanbiederstedt

from what I vaguely recollect, the problem is that the code doesn't correctly reconstruct the sequence in the absence of the reference, that's why the problem occurs for CRAM but not in the other cases.

@evanbiederstedt
Copy link

Hi @avullo

I've reproduced the example from @AlexanderDilthey as follows:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Bio::DB::HTS;

my $referenceFasta = 'GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa';
my $sam = Bio::DB::HTS->new(-fasta => $referenceFasta, -bam => 'testCase.cram');

my $alignment_iterator = $sam->features(-iterator => 1);
while(my $alignment = $alignment_iterator->next_seq)
{
	my ($ref,$matches,$query) = $alignment->padded_alignment;
	unless(length($ref) == length($query))
	{
		warn Dumper("Alignment length mismatch", length($ref), length($query), $alignment->query->name);
		next;
	}
}

It is true that the issue exists for the CRAM @AlexanderDilthey provided, i.e. here's the error:

[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice

However, the bug is not CRAM-dependent; it exists for BAM as well.

@avullo
Copy link

avullo commented Aug 24, 2018

That's the error I was getting when initially trying to reproduce the error, but then I've learned you need to set REF_PATH env variable, as pointed out by @keiranmraine, see http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES.
When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.

@evanbiederstedt
Copy link

@avullo

Thanks for the reply.

I'm a bit confused: in order to resolve the issue raised above, you set the environment variable REF_PATH to the reference FASTA?

that is

export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa

@evanbiederstedt
Copy link

Hi @avullo

Apologies if you're busy. Just to clarify, by this statement

When you properly set the above, you'll get the warnings on the CRAM file, but not on BAM.

do you mean that if you set the REF_PATH variable, the operation on the CRAM throws warnings but works correctly, or throws warnings and doesn't work?

@avullo
Copy link

avullo commented Aug 28, 2018

Not a problem @evanbiederstedt

I meant, you don't get the error:
[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice

but you do get the warning originally reported by @AlexanderDilthey, which point to the existence of bugs.

@evanbiederstedt
Copy link

Hi @avullo
CC @AlexanderDilthey

I see what you mean now. Just to clarify for future reference:

When the variable REF_PATH is not defined, here is the output I see, using the script testCram.pl shown here: #40 (comment)

$ perl testCram.pl
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
substr outside of string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
Use of uninitialized value in concatenation (.) or string at /opt/common/CentOS_6-dev/perl/perl-5.22.0/lib/perl5/x86_64-linux/Bio/DB/HTS/AlignWrapper.pm line 307.
$VAR1 = 'Alignment length mismatch';
$VAR2 = 3071526;
$VAR3 = 3074241;
$VAR4 = 'NA19240.gi|939191008|gb|LKPB01001610.1|';

When you do set the variable, here is the output:

$ export REF_PATH=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ echo $REF_PATH
## GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
$ perl testCram.pl
[E::cram_get_ref] Failed to populate reference for id 20
[E::cram_decode_slice] Unable to fetch reference #20 41031278..42297761
[E::cram_next_slice] Failure to decode slice

@avullo
Copy link

avullo commented Aug 29, 2018

It's the other way around for me, but I use this:
$ echo $REF_PATH
/home/avullo/.cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s

Do you still get the error on not CRAM?

@evanbiederstedt
Copy link

Hi @avullo

Thanks for the help. Perhaps this is an elementary question (and somewhat outside this github issue), but I'm still deeply confused how MD5 checksums solve this issue. Is this is fundamental property of the CRAM format?

I'm used to MD5 checksums in order to check whether something has downloaded correctly...

@keiranmraine
Copy link
Contributor

The md5 of each chr seq is used to store a reference to it in the header of the BAM/CRAM file. This ensures that the sequence being used to reconstruct the read seq is actually the correct one, and not a different build or something similar.

The /.../.cache/%2s/%2s/%s version is a pre-populated cache. My understanding is that you cannot use a file for REF_PATH/REF_CACHE. A file has to be provided as a parameter to the hts library.

http://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

@evanbiederstedt
Copy link

Hi @keiranmraine

Thanks for the help! I appreciate the explanation, and I think I understand now how this works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants