Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cannot reproduce example for pyhmmer.hmmer.hmmsearch #58

Open
erfanshekarriz opened this issue Dec 9, 2023 · 1 comment
Open

cannot reproduce example for pyhmmer.hmmer.hmmsearch #58

erfanshekarriz opened this issue Dec 9, 2023 · 1 comment
Labels
question Further information is requested

Comments

@erfanshekarriz
Copy link

erfanshekarriz commented Dec 9, 2023

Hello and hope all is well. I'm running into issues reproducing this code (https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Search-and-scan):

import time
import pyhmmer

t1 = time.time()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")

# pre-fetching targets - fast, but needs the whole target database in memory
t1 = time.time()
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
    print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")

My code is:

import os
import sys
import psutil
import argparse

import pyhmmer
from pyhmmer.easel import SequenceFile
from pyhmmer.plan7 import HMMFile
from pyhmmer.hmmer import hmmsearch
from pyhmmer.hmmer import hmmscan


def run_pyhmmer(proteins, hmmdb, tblout, domtblout, bitcutoff, cores_n):

        available_memory = psutil.virtual_memory().available
        proteins_size = os.stat(proteins).st_size
        database_size = os.stat(proteins).st_size
        input_size = proteins_size + database_size

        with HMMFile(hmmdb) as hmm_file:
                hmms = list(hmm_file)
                Z_val=len(hmms)

                with SequenceFile(proteins, digital=True) as seq_file:
                        if input_size < available_memory * 0.1:
                                print("Enough available memory!")
                                print("Pre-fetching targets into memory...")
                                seqs = seq_file.read_block()

                        else:
                                seqs = seq_file

                        print("Performing pyhmmer hmmsearch...")
                        all_hits = hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
                        print(type(all_hits)) # This output prints <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>

The documentation states that hmmsearch yields a TopHits object, but I seem to be getting <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>.

I also wanted to ask if I can run hmmsearch(hmms, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) instead of hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) if my system has enough memory to store the db.

My hmm database is a large database of 31,150 HMM profiles (~3 GB), and my protein is a FAA output from prodigal with 208,993 proteins (76 MB) .

@erfanshekarriz
Copy link
Author

erfanshekarriz commented Dec 9, 2023

I've noticed that putting the function into a list returns a list of TopHits objects:

tophits_list = list(hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff))

How would I then be able to write all the top hits into a single "domains" file?
I've done this but seems like it's not the most efficient way:

with open(domtblout, "ab") as f:=
  all_hits_list[0].write(f, format="domains", header=True)
  for hits in all_hits_list[1:]:
      hits.write(f, format="domains", header=False)

I've read all over the documentation and can't seem to see any example of many HMMs run against manny proteins through pyhmmer.hmmer.hmmsearch()..

Would be nice to do a walk-through of what I'm trying to achieve for anyone planning to do the same!

Once I figure it out happy to contribute to the walkthrough.

Best,

Erfan

@althonos althonos added the question Further information is requested label Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants