Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling coverage on a BAM re-fetches the header every time #56

Open
ah19 opened this issue Nov 20, 2017 · 4 comments
Open

Calling coverage on a BAM re-fetches the header every time #56

ah19 opened this issue Nov 20, 2017 · 4 comments
Assignees

Comments

@ah19
Copy link

ah19 commented Nov 20, 2017

A possible performance issue rather than a bug:
https://github.com/Ensembl/Bio-DB-HTS/blob/master/lib/Bio/DB/HTS.pm#L1951

...
    my $header = $self->{hts_file}->header_read;
    my ( $id, $s, $e ) = $header->parse_region($region);
    return unless defined $id;
    # parse_region may return a very high value if no end specified
    $end = $e >= 1 << 29 ? $header->target_len->[$id] : $e;
...

Calling $self->header uses the cached version

@andrewyatz
Copy link
Member

@ah19 does this mean we just need to change that header_read call into a header call? Not sure there's a reason why we can't do that

@ah19
Copy link
Author

ah19 commented Feb 7, 2018

I believe so -- I verified it was making multiple requests by turning the htslib (and maybe curl) logging level up, and after making the change to header saw that it stopped.

I did only checked this was working against a single bam file, and I wasn't sure if there was a reason why it used header_read instead

@andrewyatz
Copy link
Member

Fair point. @rishidev is the best bet for knowing the answer if there was an intention to use header_read over header. Hopefully @'ing him will ping off the Rishi-signal

@rishidev
Copy link
Collaborator

rishidev commented Feb 7, 2018

Thanks for submitting and bumping this issue. It does raise an area which could be clarified. Firstly, header_read() over header was a definite decision that was made.

The underlying issue is to do with BAM vs CRAM header objects being different HTSlib objects requiring different handling via a common mechanism. I believe the functions here need a fresh header object to work correctly. In HTS.xs, header_read() rewinds the header for BAM files (it is a file pointer, and the header is at the start of the file, so the structure is available by knowing the right place) but does a complete new header fetch for the CRAM files (it is an actual C structure).

IIRC Using the cached version for the header caused issues for one, or both, of BAM and CRAM, and they needed resetting for correct functionality. The header_read() in fact used to be just header() - the name change was to reflect there was a matching header_write() and that in the CRAM case a definite read was made from the file. I'll try and go back over notes and provide further clarification. There is certainly scope for investigation here to make this work neater.

@avullo avullo self-assigned this Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants