strange (incorrect) mappings of pathogens in APS #50

Closed
austinmeier opened this issue Mar 6, 2018 · 7 comments

Comments

@austinmeier
Collaborator

We have found a few oddities in the name matchings for the APS scrape. Some of the pathogens are being identified as "EST" from NCBI. An example will help illustrate this:

mapped pathogen: NCBITaxon:1585532 (Beta vulgaris/Cercospora beticola mixed EST library)

The verbatim name: Beet curly top virus (BCTV)

The correct pathogen: NCBITaxon:10840 (Beet curly top virus)

It appears to me that the algorithm is being greedy in some way: it stops after recognizing "Beet" and maps to beet (Beta vulgaris). But I have no idea why it maps to the mixed EST library instead of mapping to just plain beet.

Here is the offending line from the scrape:

Curly top Beet curly top virus (BCTV) Beet curly top virus (BCTV) NCBITaxon:1585532 pathogen of http://purl.obolibrary.org/obo/RO_0002556 Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others) Citrullus spp., Cucumis spp., Cucurbita spp., and s NCBITaxon:3653 R. D. Martyn, M. E. Miller and B. D. Bruton, primary collators (last update 2/19/93). Diseases of Cucurbits (Citrullus spp., Cucumis spp., Cucurbita spp., and others). The American Phytopathological Society. Accessed on 2016-09-07 at http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx http://www.apsnet.org/publications/commonnames/Pages/Curcubits.aspx 2016-09-07

@jhpoelen
Collaborator

jhpoelen commented Mar 6, 2018

Thanks for sharing this specific example.

You are correct that GloBI's (old) taxon mapping algorithm was greedy: it chopped off terms until a single word remained (the matching is now a bit less greedy). This caused Beet curly top virus (BCTV) to be truncated to Beet, which was then mapped to "Beta". Resolving "Beta" via https://resolver.globalnames.org results in an exact canonical match to NCBI taxon 1585532 (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1585532). This dubious NCBI matching is known to occur via globalnames.org, and I've reported an example of it via GlobalNamesArchitecture/gni#48, which led to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1585532 .
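
For readers unfamiliar with this failure mode, here is a minimal Python sketch of greedy truncation matching. It is not GloBI's actual code: the function `greedy_match`, the `min_words` guard, and the toy resolver dictionaries are hypothetical, and the toy resolver maps "beet" straight to the EST taxon, skipping the intermediate "Beta" step described above.

```python
# Hypothetical toy resolvers standing in for a real name-resolution service.
RESOLVER_WITHOUT_VIRUS = {
    "beet": "NCBITaxon:1585532",  # the dubious EST-library match from this issue
}
RESOLVER_WITH_VIRUS = {
    "beet": "NCBITaxon:1585532",
    "beet curly top virus": "NCBITaxon:10840",
}


def greedy_match(name, resolver, min_words=1):
    """Drop trailing words until the remaining prefix resolves.

    Returns (matched_prefix, taxon_id), or (None, None) if nothing resolves
    before the prefix gets shorter than min_words.
    """
    words = name.split()
    while words and len(words) >= min_words:
        candidate = " ".join(words).lower()
        if candidate in resolver:
            return candidate, resolver[candidate]
        words.pop()  # chop off the last term and try again
    return None, None


verbatim = "Beet curly top virus (BCTV)"

# Fully greedy: the name degrades all the way down to "beet".
print(greedy_match(verbatim, RESOLVER_WITHOUT_VIRUS))

# Less greedy: refuse to match on a single remaining word.
print(greedy_match(verbatim, RESOLVER_WITHOUT_VIRUS, min_words=2))

# When the resolver knows the virus, the longest prefix wins.
print(greedy_match(verbatim, RESOLVER_WITH_VIRUS))
```

Requiring at least two words is only one possible guard; it trades some recall for fewer single-word false matches.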

jhpoelen pushed a commit that referenced this issue Mar 6, 2018
@jhpoelen
Collaborator

jhpoelen commented Mar 6, 2018

@austinmeier I've added correct mappings for the terms in 0a26375. Please review. These should be applied on the next APS scrape. Leaving the issue open until we can verify this is actually the case.

@jhpoelen
Collaborator

jhpoelen commented Mar 7, 2018

I just ran an apsnet scrape and found that the mappings added via 0a26375 are now being applied.

@austinmeier @marieALaporte please continue to add mapping corrections to the files in https://github.com/jhpoelen/samara/tree/master/src/main/resources/org/planteome/samara/apsnet . This includes PO/taxon mappings.

@jhpoelen jhpoelen closed this as completed Mar 7, 2018
@austinmeier
Collaborator Author

So I think there might be an issue with a blanket remapping using the taxonmap.tsv...

There are two places an NCBITaxon:ID shows up: HOST or PATHOGEN. When a host is assigned an "EST" taxon, the fix is as simple as using the plant portion of the EST. There are actually only 3 examples, and the host_label is in 1:1 agreement with the 'host' ID:

NCBITaxon:331356 - Triticum spp. L.
NCBITaxon:910407 - Brassica
NCBITaxon:69324 - Gossypium spp.

However, among the pathogen mappings from these EST taxa, we see more that are simply incorrect. Most importantly, the mappings are not 1:1. Here is an example of a single NCBITaxon:ID being mapped to multiple pathogen_labels:

NCBITaxon:176297 - Maize chlorotic dwarf virus (MCDV)
NCBITaxon:176297 - Maize chlorotic mottle virus (MCMV)
NCBITaxon:176297 - Maize leaf fleck virus (MLFV)
NCBITaxon:176297 - Maize line virus (MLV)
NCBITaxon:176297 - Maize mosaic virus (MMV)
NCBITaxon:176297 - Maize pellucid ringspot virus (MPRV)
NCBITaxon:176297 - Maize rayado fino virus (MRFV)
NCBITaxon:176297 - Maize red stripe virus (MRSV)
NCBITaxon:176297 - Maize ring mottle virus (MRMV)
NCBITaxon:176297 - Maize rough dwarf virus (MRDV)
NCBITaxon:176297 - Maize sterile stunt virus (strains of barley yellow striate virus)
NCBITaxon:176297 - Maize streak virus (MSV)
NCBITaxon:176297 - Maize tassel abortion virus (MTAV)
NCBITaxon:176297 - Maize vein enation virus (MVEV)
NCBITaxon:176297 - Maize wallaby ear virus (MWEV)
NCBITaxon:176297 - Maize white leaf virus
NCBITaxon:176297 - Maize white line mosaic virus (MWLMV)

You can see that the pathogens are all different viruses, but they are all being mapped to "Zea mays/Colletotrichum graminicola mixed EST library".
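
A check like this could be automated. The following is a rough Python sketch, assuming the scrape has already been reduced to (taxon ID, pathogen_label) pairs; the function name `ids_with_multiple_labels` and the sample rows are illustrative and not part of the scraper.

```python
from collections import defaultdict


def ids_with_multiple_labels(pairs):
    """Return {taxon_id: sorted labels} for IDs attached to more than one label."""
    labels_by_id = defaultdict(set)
    for taxon_id, label in pairs:
        labels_by_id[taxon_id].add(label)
    return {
        taxon_id: sorted(labels)
        for taxon_id, labels in labels_by_id.items()
        if len(labels) > 1
    }


# A few rows taken from the listing above, plus one correctly mapped pathogen.
sample = [
    ("NCBITaxon:176297", "Maize chlorotic dwarf virus (MCDV)"),
    ("NCBITaxon:176297", "Maize streak virus (MSV)"),
    ("NCBITaxon:176297", "Maize white leaf virus"),
    ("NCBITaxon:10840", "Beet curly top virus (BCTV)"),
]

for taxon_id, labels in ids_with_multiple_labels(sample).items():
    print(taxon_id, "->", labels)
```

An ID shared by several distinct labels is not always wrong (synonyms exist), so the output is a list of candidates for manual review rather than a list of confirmed errors.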

@jhpoelen
Collaborator

jhpoelen commented Mar 7, 2018

Thanks for sharing. In the most recent scrape at https://build.berkeleybop.org/job/extract-apsnet-diseases/92/artifact/apsnet.tsv , I found a single entry using NCBITaxon:176297, namely Maize rough dwarf virus (MRDV). Can you confirm?

@austinmeier
Collaborator Author

Confirmed, although even this strikes me as strange. The name of that row is "Maize rough dwarf", with a pathogen_label of "maize rough dwarf virus (MRDV)". This exact name shows up in several other rows (e.g., row-10076), and in those rows the pathogen is correctly mapped to NCBITaxon:10989 (Maize rough dwarf virus).
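
The inverse check (one label, several IDs) can be sketched the same way. This is an illustrative Python example, not part of the scraper: `labels_with_multiple_ids` and the sample rows are hypothetical, and folding case and whitespace before comparing is an assumption about how labels should be treated as "the same name".

```python
from collections import defaultdict


def labels_with_multiple_ids(rows):
    """Return {normalized_label: sorted IDs} for labels mapped to more than one ID."""
    ids_by_label = defaultdict(set)
    for label, taxon_id in rows:
        # Assumption: labels differing only in case or spacing are the same name.
        key = " ".join(label.lower().split())
        ids_by_label[key].add(taxon_id)
    return {
        label: sorted(ids)
        for label, ids in ids_by_label.items()
        if len(ids) > 1
    }


rows = [
    ("maize rough dwarf virus (MRDV)", "NCBITaxon:176297"),
    ("Maize rough dwarf virus (MRDV)", "NCBITaxon:10989"),
    ("Beet curly top virus (BCTV)", "NCBITaxon:10840"),
]

for label, ids in labels_with_multiple_ids(rows).items():
    print(label, "->", ids)
```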

@jhpoelen
Collaborator

jhpoelen commented Mar 7, 2018

Sounds like we need an additional mapping from maize rough dwarf virus (MRDV) to NCBITaxon:10989. Would you like me to add that?
