Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base-depth become extremely slow when calculating highly covered site. #59

Open
y9c opened this issue Nov 16, 2022 · 8 comments
Open

base-depth become extremely slow when calculating highly covered site. #59

y9c opened this issue Nov 16, 2022 · 8 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@y9c
Copy link
Contributor

y9c commented Nov 16, 2022

Position 21:8215302 on GRCh38 (Ensembl ref) is used for debugging. When calculating this single site, whose depth is ~10k ,the command took more than 10min and still not finish.

perbase base-depth -m -t 16 -r Homo_sapiens.GRCh38.genome.fa debug.bam

debug.tar.gz

@sstadick sstadick added the bug Something isn't working label Nov 16, 2022
@sstadick
Copy link
Owner

Thanks for making an issue! I'll check it out in the next few days. Could you please share the version of perbase that you are using? (I think it's in the output of perbse base-depth --help).

It's been a while since I dug into the perbase code, but it's not impossible that that a pileup of that depth would be slow. A good comparison would be to do a pileup with samtools as a sanity check, since perbase is using htslib under the hood.

@y9c
Copy link
Contributor Author

y9c commented Nov 16, 2022

perbase-base-depth 0.8.5

@y9c
Copy link
Contributor Author

y9c commented Nov 16, 2022

I did test samtools mpileup and my customized pipeup script using rust-htslib. The speed is not fast but acceptable. It took 0.5s for samtools mpileup.

samtools mpileup -f  debug.bam -r 21:8215302-8215302 > /dev/null  0.52s user 0.05s system 99% cpu 0.573 total

@y9c
Copy link
Contributor Author

y9c commented Nov 17, 2022

I waited more than 30 mins, but perbase is still running. I use all the threads, and there is no IO or memory bound.

@sstadick
Copy link
Owner

So, I think the issue is that you need to specify the region for perbase. It's spinning away doing nothing looking every possible based from the BAM header. If you give it a region it will be faster (seconds):

perbase base-depth --mate-fix --threads 16 -r /data/misc/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b <(printf "21\t8215302\t8215303\n") ~/Downloads/debug.bam

I do think this is unintuitive and still counts as a bug. Work could also be done to avoid having perbase do the setup / teardown for every TID when it should be able to avoid that from the start. So I'll leave this open for any future work.

@sstadick sstadick added the enhancement New feature or request label Nov 17, 2022
@y9c
Copy link
Contributor Author

y9c commented Nov 17, 2022

Thanks. I remember I also test with bed file input, and also wait for more than 20min. I will post my test results later.

@y9c
Copy link
Contributor Author

y9c commented Nov 18, 2022

Seem that there is some problem with my previous test. I might forget to specify the bed file. I try to run the analysis with a bed file just now, and it finish in ~1min, it is quicker than before but is still too slow.
I try to parse this single site with rust-hislib, and drop mate overlap by read name. It took <1s. Is it because some filtering step in perbase slow down this analysis?
BTW, bed file is 0-based, so the record should be 21\t8215301\t8215302, correct?

@sstadick
Copy link
Owner

I'm pretty sure it's slow because it checks every contig in the bam header, loads up that that contigs reference, spins up resources to process it, then discovers no reads for that contig.

Basically, it's not at all optimized for single query work. I'm totally open to PRs on this! And I'll leave this open in the event of future work on Perbase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants