Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] What criteria are used to determine significant HMMER hits? #32

Open
jolespin opened this issue Jul 19, 2023 · 1 comment
Open

Comments

@jolespin
Copy link

I'm looking at the code here but not very familiar w/ Ruby:

module KofamScan
  class Result
    extend Autoload

    autoload :WithEvalueThreshold
    autoload :WithThresholdScale
    autoload :WithThresholdScaleAndEvalueThreshold

    def self.create(query_list, threshold_scale: nil, e_value_threshold: nil)
      if threshold_scale && e_value_threshold
        WithThresholdScaleAndEvalueThreshold.new(query_list, threshold_scale, e_value_threshold)
      elsif e_value_threshold
        WithEvalueThreshold.new(query_list, e_value_threshold)
      elsif threshold_scale
        WithThresholdScale.new(query_list, threshold_scale)
      else
        Result.new(query_list)
      end
    end

Here's what ko_list looks like:

knum	threshold	score_type	profile_type	F-measure	nseq	nseq_used	alen	mlen	eff_nseq	re/pos	definition
K00001	357.90	domain	all	0.256163	2367	1915	1975	464	14.05	0.590	alcohol dehydrogenase [EC:1.1.1.1]
K00002	443.17	full	all	0.430391	2376	2273	5878	503	7.51	0.590	alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
K00003	286.37	domain	all	0.945500	6268	5369	3257	782	7.03	0.590	homoserine dehydrogenase [EC:1.1.1.3]
K00004	369.60	domain	trim	0.809403	1572	1320	1364	436	5.41	0.590	(R,R)-butanediol dehydrogenase / meso-butanediol dehydrogenase / diacetyl reductase [EC:1.1.1.4 1.1.1.- 1.1.1.303]
K00005	320.93	full	all	0.984022	1449	1051	682	366	1.99	0.590	glycerol dehydrogenase [EC:1.1.1.6]
K00006	316.47	full	all	0.899202	2118	1971	3274	549	4.64	0.590	glycerol-3-phosphate dehydrogenase (NAD+) [EC:1.1.1.8]
K00007	520.80	full	all	0.998236	358	283	834	469	1.01	0.589	D-arabinitol 4-dehydrogenase [EC:1.1.1.11]
K00008	420.17	full	all	0.500025	3737	3281	3597	524	8.29	0.590	L-iditol 2-dehydrogenase [EC:1.1.1.14]
K00009	147.27	full	trim	0.997067	1556	1192	1443	459	3.28	0.590	mannitol-1-phosphate 5-dehydrogenase [EC:1.1.1.17]

I'm seeing a score threshold but not e-value threshold. Can you elaborate on the scheme used for determining significant hits?

@PfisterMaxJLU
Copy link

With the threshold values from the "ko_list" our group build a little helper script for the formatting and enforcement of KO cut-off thresholds in hmmsearch output files.

We discard the rare (≈1%) secondary (lower score) K assignments so keep an eye out for that. Depending on your use case, this might not be desirable.

Disregarding the dual assignments, the resulting output is functionally identical (>99.999% for our data/settings) to Kofamscan.

I uploaded a code example, which you can modify for your case. I hope this helps.

https://github.com/PfisterMaxJLU/kofamscan_cutoff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants