Handling of ontology version mismatches #51
HPO does provide stable monthly releases... well, possibly stable... http://compbio.charite.de/hudson/job/hpo.annotations.monthly/
Doesn't look very monthly to me, more like when they get to it. What I was trying to push for in Miami was some sort of disciplined version and release process (for example, the UMLS releases in the spring and in the fall, with AA and AB releases for the year of release). The issue is that in the space of about 18 months after PhenoDB was mapped to HPO, there were about 20 changes to the HPO IDs we linked to (about 1800 in total, I think). Some were easy, where one ID was folded into another; others were nastier, where the definition itself changed. When I talked to Peter about this in Dublin he did not seem to grasp the issue he was creating. I suppose we could use the build number as the version. But how do I reconcile across builds? Say I am using build 83 and I get a request in build 79? Or the reverse?
I would propose:
This is obviously not completely "correct", and can present challenges. However, I don't see any way we can reasonably reconcile, and I don't think we should even try. The number of terms affected is small, the changes are likely small, and we really can't do any better. For example, a patient might be classified as having microcephaly. If the term definition changes from < -3SD to < -4SD, there's no way we can properly reconcile this across builds, because we don't know the actual measurement. If a new term is created instead, we lose all of the historical information about the term and its frequency. There's no right solution here, but I think the one of least resistance and least "badness" is to assume the definitions are close enough that we map any terms we can on whatever HPO version we have.
Alternatively, if it's obsoleted, we could bump it up the ontology until we find a non-obsolete term. :)
"Alternatively, if it's obsoleted, we could bump it up the ontology until we find a non-obsolete term. :)" I was thinking this could be a solution; however, it would require the requester to do the bumping, as the receiver wouldn't know how to bump the term. I met someone at the GA4GH UK meeting on Friday (24th April) who claimed to have solved this problem, and is looking at making the solution open source. I will open up further discussions with him to assess whether it's a viable solution.
Fair point. I was thinking that the node was flagged as obsoleted but the links were still there. It turns out this is not the case. These obsolete terms always seem to be annotated with 'replaced_by', where it's clear what we should do, or with 'consider', where it isn't, and we may not be able to use such terms for now, until we have a better solution.
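For reference, the 'replaced_by'/'consider' distinction can be detected mechanically when parsing the obo file. A minimal sketch in Python, stdlib only; the OBO sample and term IDs below are invented for illustration, not taken from a real HPO release:

```python
# Classify each obsolete term by whether it carries a 'replaced_by' tag
# (safe to remap automatically) or only a 'consider' tag (needs review).
# The sample below is a hand-made OBO fragment; IDs are made up.

OBO_SAMPLE = """\
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0001234
name: Old term A
is_obsolete: true
replaced_by: HP:0005678

[Term]
id: HP:0009999
name: Old term B
is_obsolete: true
consider: HP:0000001
"""

def classify_obsolete_terms(obo_text):
    """Return {term_id: ('replaced_by'|'consider'|'unresolvable', targets)}."""
    result = {}
    for stanza in obo_text.split("[Term]"):
        fields = {}
        for line in stanza.strip().splitlines():
            if ": " in line:
                key, value = line.strip().split(": ", 1)
                fields.setdefault(key, []).append(value)
        if "id" not in fields or "is_obsolete" not in fields:
            continue  # not a term, or not obsolete
        term_id = fields["id"][0]
        if "replaced_by" in fields:
            result[term_id] = ("replaced_by", fields["replaced_by"])
        elif "consider" in fields:
            result[term_id] = ("consider", fields["consider"])
        else:
            result[term_id] = ("unresolvable", [])
    return result

print(classify_obsolete_terms(OBO_SAMPLE))
```

A real implementation would parse the full hpo.obo the same way (or use an OBO parsing library) and treat the 'unresolvable' and 'consider' buckets as the ones requiring human attention.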
Oh gosh... that's not especially helpful. It seems HPO is not really designed to be used programmatically without human interaction... =/ A potential solution, which I've considered but dislike, is that each system exposes an endpoint that returns the ontologies it supports, and at what version. The system could then check whether its version is older or newer, fetch a copy of the other version, digest it, and then, where terms are obsoleted or not present in the searching system, ask the user to make adjustments. I don't really think this is a viable option, but it may spur some further thought. I have emailed the person I met at the GA4GH UK day, so am waiting for him to get back to me. Ontologies are feeling more and more like they cause as many problems as they solve.
GeneMatcher is in a strange place with this because we use PhenoDB IDs internally. There are 3,646 features, of which 2,857 are mapped to HPO. This mapping is done manually and is checked/updated every 12-18 months. So when we get an MME request we have to map the feature HPO IDs to PhenoDB IDs. To do that we take the HPO ID and walk up the HPO tree until we hit a mapping to PhenoDB. This is made a little tricky because HPO features can have multiple parents, so we pick the 'closest' one. Obviously this mapping is done ahead of time for efficiency (one lookup to get the corresponding ICHPT ID, if any), and we update it every week with a current copy of the hpo.obo file. There are going to be HPO IDs that cannot be resolved, and we just drop them; we also map alternative/obsolete HPO IDs. I am not sure if this helps with this issue, but it is a practical approach that offers a pretty strong guarantee that an MME request won't contain an HPO ID we have not yet seen.
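The walk-up-the-tree remapping described above can be sketched roughly as follows. The HPO fragment and PhenoDB IDs here are toy data, and a breadth-first walk is one way to make 'closest' mean 'fewest hops up' when a term has multiple parents; GeneMatcher's actual tie-breaking rule may differ:

```python
from collections import deque

# Toy ontology fragment (IDs and the PhenoDB mapping are invented).
PARENTS = {
    "HP:C": ["HP:A", "HP:B"],   # multiple parents, as HPO allows
    "HP:A": ["HP:ROOT"],
    "HP:B": ["HP:ROOT"],
    "HP:ROOT": [],
}
PHENODB_MAP = {"HP:B": "PDB:42", "HP:ROOT": "PDB:1"}

def remap(hpo_id, parents=PARENTS, mapping=PHENODB_MAP):
    """Breadth-first walk up the ontology; the first mapped ancestor
    reached is the 'closest' one. Returns None if no ancestor on any
    path is mapped (such IDs would simply be dropped)."""
    seen = set()
    queue = deque([hpo_id])
    while queue:
        term = queue.popleft()
        if term in seen:
            continue
        seen.add(term)
        if term in mapping:
            return mapping[term]
        queue.extend(parents.get(term, []))
    return None

print(remap("HP:C"))  # HP:B is one hop away, HP:ROOT is two
```

In practice this walk would be precomputed for every HPO ID (including alt_ids and obsoleted IDs) each time the weekly hpo.obo copy is refreshed, so request-time handling is a single dictionary lookup.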
It may be worth resurrecting the proposal of passing (mandatory?) ICHPT term ancestor(s) for each feature, once it's available.
I am not sure that helps, and I think just adds another layer of complexity to the protocol. I described what GeneMatcher was doing as one possible approach which will accept any version of HPO and remap 'old' IDs to 'new' IDs. It does put the onus on GeneMatcher to stay current but that is not onerous. The one issue is that I can't do anything if Peter changes the definition for an ID, but when that happens everyone is screwed. As I said it is not perfect but it works without having to deal with HPO versions.
Fair points, François. I think we should come up with a solution that avoids the chaos of trying to recreate older versions of the HPO. I think the only gap in the HPO right now is that only some obsolete terms have 'replaced_by' annotations. For those that don't, we'll have to ignore them, when in reality we could map them to more general terms that the obsolete term implies. This is probably the best we can do. I've sent an email to some Monarch people to see if they have any suggestions.
I have some good news on this! I spoke to someone from the EBI yesterday. They are working on a fresh, updated version of the EBI Ontology Lookup Service (OLS). There will be a publicly usable API which will allow terms to be looked up. Say, for example, there's a new term that I don't have in my version of HPO. I would be able to query the OLS and get all of the ancestor terms back. This would allow me to bump the term up. If you don't want to rely on an API, tools are currently provided that let you create a Solr or Mongo instance containing an ontology. The tools can take OWL files as input and save directly to Solr. I'll be investigating this soon. Our plan would be to update the ontologies nightly. Pushing the ontology data outside of our database would make it not reliant on our release cycle.
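For what it's worth, the OLS addresses a term by its IRI, which has to be URL-encoded twice because the IRI itself sits inside the request path. A small sketch of building such an ancestor-lookup URL; the endpoint shape here is an assumption based on the OLS REST API and should be checked against the OLS documentation before use (the sketch deliberately stops short of firing the request):

```python
from urllib.parse import quote

# Assumed OLS REST base URL and endpoint layout; verify against OLS docs.
OLS_BASE = "https://www.ebi.ac.uk/ols/api/ontologies"

def ancestors_url(ontology, term_iri):
    """Build an OLS URL for a term's ancestors. The term IRI is
    URL-encoded twice because it is embedded in the URL path."""
    encoded = quote(quote(term_iri, safe=""), safe="")
    return f"{OLS_BASE}/{ontology}/terms/{encoded}/ancestors"

# HP_0001250 (Seizures) used purely as an example identifier.
url = ancestors_url("hp", "http://purl.obolibrary.org/obo/HP_0001250")
print(url)
```

A system that finds an unknown term in an incoming request could call such an endpoint, take the returned ancestors, and then apply the same walk-up remapping discussed earlier against its local ontology copy.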
Taking a step back here, is this an issue we need to solve in MME, or is this something we should leave up to individual sites to deal with, much like we are doing with feature matching and ranking? I think taking a similar approach would be a good option, namely publishing our different approaches here for transparency and pushing new members to do likewise.
Agreed. I think we're no longer looking to solve this as part of the MME API itself, but rather to provide a documented solution that we can suggest or recommend but isn't mandatory. It's obviously an issue we, and others involved, should be aware of. I suggest we rename this issue to something like "How could we handle ontology version mismatches?"
I agree with both of you. We can mention the OLS as one possible solution for handling ontology version mismatches, but leave each site to handle it in their own way. From the MME API side, I think we just need to:
I'm not sure point 1 is relevant any longer. I mean, even if the receiving system knows it has an older version, it won't know which items from the ontology it doesn't know about until it looks them up, making sending the version number redundant, doesn't it? I can see your reasoning for point 2, but I'm not sure how this could be displayed to the end user. Any thoughts?
Re point 2, this expands to the other fields in the match request: how to represent what matched, how that match was made (including any remapping), etc... This turns into a bit of a rat's nest of complexity. I am not sure we want to address that.
I guess it's possible that there could be "additional notes" containing human-readable text which explains any transcoding or changes made to the incoming request data. I think we should push this to the back burner, though, at least until we can handle different ontology versions reliably.
It could be useful to create a summary of this issue and add that to the first or second post.
Discussed and approved at the Miami meeting day 1.