Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make-probes genbank file key error #131

Open
arnoldbain opened this issue Jun 11, 2021 · 7 comments
Open

make-probes genbank file key error #131

arnoldbain opened this issue Jun 11, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@arnoldbain
Copy link

Mykrobe make-probes doesnot seem to like protein locus tag (e.g. Rv1979c) but ok with the name tags like rpoB; below is the error;
mykrobe variants make-probes -t gene.txt -g H37Rv.gbk rH37Rv.fasta > probes.gene.fa

[mykrobe 2021-06-11T13:44:34 WARNING] Could not connect to database. Continuing without using genetic backgrounds
Traceback (most recent call last):
File "/home/arnold/miniconda3/bin/mykrobe", line 11, in
sys.exit(main())
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/cli.py", line 8, in main
args.func(parser, args)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/parser.py", line 55, in run_subtool
run(parser, args)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/cmds/makeprobes.py", line 69, in run
for var_name in aa2dna.get_variant_names(
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/annotation/genes/models.py", line 181, in get_variant_names
gene = self.get_gene(gene)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/annotation/genes/models.py", line 237, in get_gene
return self.genbank[gene]
KeyError: 'Rv1979c'

@mbhall88
Copy link
Member

The problem is that Rv1979c is not in the genbank file we use. It was added to the annotation recently so we need to update the genbank file.

In our email correspondence I asked @arnoldbain to download the latest genbank file (which has this locus) and still get's the same issue. This locus does not have a "gene" name and the way mykrobe currently indexes genbank files relies on there being a gene name. For instance, this is the entry for the locus @arnoldbain is looking at

     gene            complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /db_xref="GeneID:885819"

So it seems we need update our genbank indexing method @martinghunt

@mbhall88 mbhall88 added the bug Something isn't working label Jun 15, 2021
@mbhall88 mbhall88 changed the title make-probes error #Protein_tag make-probes genbank file key error Jun 15, 2021
@mbhall88
Copy link
Member

As a work-around in the meantime @arnoldbain you could try adding a gene name to the annotation for Rv1979c in the latest genbank file. You should find it on line 41310 and 41313 in the latest genbank file.

So it looks something like

     gene            complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /gene="Rv1979c"
                     /db_xref="GeneID:885819"
     CDS             complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /gene="Rv1979c"
                     /inference="protein motif:PROSITE:PS00599"

@martinghunt
Copy link
Member

I expect this is non-trivial to fix, and needs careful thought.

If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

For now, I recommend using @mbhall88's workaround.

@arnoldbain
Copy link
Author

Working fine with @mbhall88 suggestion, thanks guys

@mbhall88
Copy link
Member

I expect this is non-trivial to fix, and needs careful thought.

If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

We could just use a sort of hierarchy when we first try and use the gene tag and of that isn't present try and use a locus tag and if that isn't present, move on? That way we would always use rpoB instead of Rv0667

@martinghunt
Copy link
Member

I expect this is non-trivial to fix, and needs careful thought.
If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

We could just use a sort of hierarchy when we first try and use the gene tag and of that isn't present try and use a locus tag and if that isn't present, move on? That way we would always use rpoB instead of Rv0667

Need to handle when someone provides two variants in the same gene, but one of them specified by the gene name and the other specified by the locus tag. There are likely more edge cases.

@mbhall88
Copy link
Member

Yes, but that seems to be a corner case we could very easily say we don't handle yet (or may never handle as it seems insanse)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants