Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possibility to filter alignment results on coverage between HMM and target #27

Open
jpjarnoux opened this issue Nov 28, 2022 · 6 comments
Labels
question Further information is requested

Comments

@jpjarnoux
Copy link

Hi !

I was searching if that's possible to report the alignment coverage on the HMM and the target. I'm using the PADLOC-DB as HMM database and I notice that they include hmm.coverage.threshold and target.coverage.threshold to filter results in there metadata file.

So I was searching if these values were in the HIT object, but they're not, and I don't find them in the documentation. Did I miss something ?

Thanks

@althonos
Copy link
Owner

Hi @jpjarnoux

If I'm not mistaken, this option doesn't exist in the original HMMER either, because of the difficulty to compute coverage for an alignment which can contain more than one domain. I like this example (from the documentation examples):

image

@althonos althonos added the question Further information is requested label Nov 30, 2022
@jpjarnoux
Copy link
Author

Hi,
Thanks, I will try to catch up why there is this value in the metadata and how they're using it.

@jpjarnoux
Copy link
Author

Hi,
I come back to this question. I have some new information. To compute the target and hmm coverage, they choose the best domain. I tried to reproduce, but unfortunately the length of the query and the target (HMM and sequence) are not saved and accessible in the HIT. I would to take it when I digitalized my sequences and when I'm reading my HMM, but the length is not an attribute in HMM object and DigitalSequence. For the sequence I can easily get the length by another way, but for the HMM, it could be really easier if the length was read and add as an attribute (in HMM file there is LENG key).
Do you think you could add those lengths as attribute in Sequence, HMM and Hit or Alignment object ?

@apcamargo
Copy link

apcamargo commented Sep 11, 2023

Hi @jpjarnoux

You can get the length of HMMs from HMM objects:

hmm_lengths = {}
with pyhmmer.plan7.HMMFile("Pfam-A.h3m") as hmm_file:
    for hmm in hmm_file:
        hmm_lengths[hmm.name] = len(hmm.consensus)

Then:

n_aligned_positions = len(
    hit.best_domain.alignment.hmm_sequence
) - hit.best_domain.alignment.hmm_sequence.count(".")
hmm_coverage = (
    n_aligned_positions / hmm_lengths[hit.best_domain.alignment.hmm_name]
)

As @althonos said, this is an oversimplification because you can have multiple domains in the hit. But it can be pretty useful for HMMs of full-length proteins.

@jpjarnoux
Copy link
Author

Sorry, I forget to reply.
That works for me, thank you.

@apcamargo
Copy link

apcamargo commented Apr 7, 2024

A less dumb approach:

def get_hmm_coverage(domain):
    n_aligned_positions = domain.alignment.hmm_to - domain.alignment.hmm_from + 1
    return n_aligned_positions / domain.alignment.hmm_length

with pyhmmer.plan7.HMMFile("Pfam-A.h3m") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
        for hit in hits:
            for domain in hit.domains.included:
                hmm_coverage = get_hmm_coverage(domain)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants