Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hpc_hmmsearch #8

Open
apcamargo opened this issue Jul 7, 2021 · 6 comments
Open

Use hpc_hmmsearch #8

apcamargo opened this issue Jul 7, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@apcamargo
Copy link

This might be outside of the scope of the project, but I thought it could be a nice addition. There's a modified version of hmmseach (https://github.com/Larofeticus/hpc_hmmsearch) that runs way faster in parallel than the regular command. Maybe pyhmmer.hmmsearch could use it instead, or having it available through a separate function.

More technical details: https://docs.nersc.gov/performance/case-studies/hmmer3/

@althonos
Copy link
Owner

althonos commented Jul 7, 2021

Hi @apcamargo , thanks for the heads-up!

The current version of pyhmmer.hmmsearch is already kind of its own command, it's not actually using hmmsearch under the hood but reimplementing the whole hmmsearch command using the objects exposed in pyhmmer.plan7. I could have a look at the version you provided, and if they have a more efficient way to run I could try to replicate that.

@althonos althonos added the enhancement New feature or request label Jul 7, 2021
@apcamargo
Copy link
Author

Thanks! I did some quick benchmarks here and I found hpc_hmmsearch to be faster than pyhmmer.hmmsearch when using several threads. Might be worth to take a look a it.

@althonos
Copy link
Owner

althonos commented Jul 16, 2021

Hi @apcamargo ,

I was able to trace down a big performance penalty in the Pipeline search loop, which was causing parallel code to become less and less efficient with more threads (basically, when too many threads were invoked, they would spend more time waiting for the GIL than actually processing the HMMs/sequence pairs). By adding an extra restriction on the input reference, i was able to rewrite the search loop to only reacquire the GIL when the Pipeline is completely done comparing the HMM to all the reference sequences.

On a consumer PC, with a small number of threads, this didn't affect performance that much, but on larger PCs (e.g. our lab's workstation) it is making quite a difference with a higher number of threads.

Next step will be to try to rewrite the hmmsearch implementation using OpenMP instead of the Python threading, but this is going to be a bit more complicated. In the meantime, it would be great if you could test the new implementation to see if it makes any difference (installable distribution attached):
pyhmmer-0.4.4-openmp.tar.gz

@apcamargo
Copy link
Author

Thanks! I'll give it a try later today.

Do you expect the OpenMP implementation to perform similarly to hpc_hmmsearch?

@apcamargo
Copy link
Author

I did get some improvements with this branch, although I didn't have the chance to test it in a machine with lots of threads. Very excited about this progress! Thanks!

@apcamargo
Copy link
Author

A bit more than two years later, I did some benchmarks. In a very common scenario (hmmsearch against Pfam-A), hpc_hmmsearch was about 5% faster than pyhmmer (see script below). This speedup was very consistent across several sizes of input FASTA files. I used version 0.10.2 for this benchmark.

#!/usr/bin/env python

import pyhmmer

with pyhmmer.easel.SequenceFile("test_proteins.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()

with pyhmmer.plan7.HMMFile("Pfam-A.h3m") as hmms, open("hmmer_output.txt", "w") as fout:
    for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs, bit_cutoffs="gathering"):
        for hit in hits:
            fout.write(f"{hits[0].name.decode()}\t{hits.query_accession.decode()}\t{hits.query_name.decode()}\t{hits[0].evalue:.2e}\t{hits[0].score:.2f}\n")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants