+We rely on our readers to correct our materials and add to them - the hope is to centralise all the usual teaching materials for OBO ontology professionals in one place. Feel free to:
+#obo-training
channel) to ask any questions (you can request access on the issue tracker)
Wednesday, September 15, 2021
+The goal of this tutorial is to provide a flavor of the OBO landscape, from the OBO Foundry organization to the ontology curators and OBO engineers that are doing the daily ontology development.
Time CEST | Presenter | Topic
---|---|---
4:00 - 4:10pm | James Overton | Workshop overview
4:10 - 4:20pm | James Overton | OBO Foundry Overview
4:20 - 4:30pm | Nicole Vasilevsky | Controlled Vocabularies and Ontologies
4:30 - 4:50pm | Nicole Vasilevsky | Using and Reusing Ontology Terms
4:50 - 5:25pm | Nicole Vasilevsky | A day in the life of an Ontology Curator
5:25 - 5:30pm | | Break
5:30 - 5:40pm | Nico Matentzoglu | Ontology 201 Overview
5:40 - 6:15pm | James Overton | ROBOT Tutorial
6:15 - 6:35pm | Nico Matentzoglu | ODK presentation
6:35 - 6:55pm | Nico Matentzoglu | A brief introduction into ontology QC using the OBO dashboard
6:55 - 7:00pm | James Overton | Wrap up
September 26, 2022, 9:00 am – 12:30 pm ET
+We'd love any feedback on this tutorial via this short survey.
+ +The Open Biological and Biomedical Ontologies (OBO) community includes hundreds of open source scientific ontology projects, committed to shared principles and practices for interoperability and FAIR data. An OBO tutorial has been a regular feature of ICBO for a decade, introducing new and experienced ontology users and developers to ontologies in general, and to current OBO tools and techniques specifically. While ICBO attracts many ontology experts, it also includes an audience of ontology beginners, and of ontology users looking to become ontology developers or to further refine their skills. Our OBO tutorial will help beginner and intermediate ontology users with a combination of theory and hands-on practice.
+For ICBO 2022 we will host a half-day OBO tutorial consisting of two parts, with a unifying theme of ontology term reuse.
+The first part of our tutorial will be introductory, aimed at an audience that is new to ontologies and to the OBO Foundry. We will introduce OBO, its community, principles, resources, and best practices. We will finish the first part with a hands-on lesson in basic tools: ontology browsers, how to contribute to ontologies via GitHub (creating issues and making Pull Requests), and the Protege ontology editor.
+The second part will build on the first, addressing an audience that is familiar with ontologies and OBO, and wants to make better use of OBO workflows and tools in their own projects. The focus will be on making best use of OBO community open source software. We will introduce ROBOT, the command-line tool and library for automating ontology development tasks. We will show how the Ontology Development Kit (ODK) is used to standardize ontology projects with a wide range of best practices. The special emphasis of this year's tutorial will be ontology reuse, and specifically on how ROBOT and ODK can be used to manage imports from other ontologies and overcome a number of challenges to term reuse.
The material for this year's OBO Tutorial will build on the content here in the OBO Academy. The OBO Academy offers free, open, online resources with self-paced learning materials covering various aspects of ontology development, curation, and OBO. Participants are encouraged to continue their learning using this OBO Academy website, and to contribute to improving the OBO documentation.
+As an outcome of this workshop, we expect that new ontologists will have a clearer understanding of why we need and use ontologies, how to find ontology terms and contribute to ontologies and make basic edits using Protege. Our more advanced participants should be able to apply OBO tools and workflows to their own ontology development practices.
Time | Topic | Presenter
---|---|---
09:00 am ET | Introduction to OBO, its community, principles, resources, and best practices | James Overton
09:20 am ET | Hands-on lesson in basic tools: see details below | Nicole Vasilevsky
10:15 am ET | Coffee break |
10:30 am ET | Introduction to ROBOT | Becky Jackson
11:30 am ET | Introduction to the Ontology Development Kit (ODK) and Core Workflows | Nico Matentzoglu
12:15 pm ET | How to be an open science ontologist (Slides are here) | Nico Matentzoglu
Instructor: Nicole Vasilevsky
+Example: We will work on this ticket.
Conference website: https://icbo-conference.github.io/icbo2023/
+ICBO Workshops details: https://www.icbo2023.ncor-brasil.org/program.html#workshops
+Date: August 28, 2023 13:30-15:00 (Part 1) and 15:30-15:45 (Part 2)
+The Open Biological and Biomedical Ontologies (OBO) community includes hundreds of open source scientific ontology projects, +committed to shared principles and practices for interoperability and FAIR data. An OBO tutorial has been a regular feature of ICBO for a decade, introducing new and experienced ontology users and developers to ontologies in general, and to current OBO tools and techniques specifically. While ICBO attracts many ontology experts, it also includes an audience of ontology beginners, and of ontology users looking to become ontology developers or to further refine their skills. Our OBO tutorial will help beginner and intermediate ontology users with a combination of theory and hands-on practice.
+For ICBO 2023 we will host a half-day OBO tutorial consisting of two parts.
+The first part of our tutorial will be introductory, aimed at an audience that is new to ontologies and to the OBO Foundry. +We will introduce OBO, its community, principles, resources, and best practices. +We will finish the first part with a hands-on lesson in basic tools: ontology browsers, how to contribute to ontologies via +GitHub (creating issues and making Pull Requests), and the Protege ontology editor.
+The second part will build on the first, addressing an audience that is familiar with ontologies and OBO, and wants to make better use of +OBO workflows and tools in their own projects.
The material for this year's OBO Tutorial will build on the content here in the OBO Academy. The OBO Academy offers free, open, online resources with self-paced learning materials covering various aspects of ontology development, curation, and OBO. Participants are encouraged to continue their learning using this OBO Academy website, and to contribute to improving the OBO documentation.
The tutorial is designed to be in a 'show and tell' format, but you are welcome to install the following software on your machine in advance if you'd like to follow along in real time:
Time | Topic | Presenter | Duration
---|---|---|---
13:30 ET | Welcome | Tiago Lubiana and Nico Matentzoglu | 5 min
13:35 ET | Introduction to OBO, its community, principles, resources, and best practices | Darren Natale | 20 min
13:55 ET | Hands-on lesson in basic tools | Sabrina Toro | 35 min
14:30 ET | Protege updates and new features | Damien Goutte-Gattat | 15 min
14:45 ET | Overview of OBO Dashboard | Anita Caron | 15 min
15:00 ET | Break | | 15 min
15:15 ET | Introduction to ROBOT | Becky Jackson | 30 min
15:45 ET | Role of ChatGPT in OBO Ontology Development | Sierra Moxon | 15 min
16:00 ET | How to be an Open Science Engineer | Nico Matentzoglu | 15 min
16:15 ET | Discussion and Wrap up | Tiago Lubiana | 30 min
16:45 ET | Adjourn | |
The goal of this course is to provide ongoing training for the OBO community. As with previous tutorials, we follow the flipped classroom concept: as organisers, we provide you with materials to look at, and you will work through the materials on your own. During our biweekly meetings, we will answer your questions, provide you with additional demonstrations where needed and go into depth wherever you as a student are curious to learn more. This means that this course can only work if you actually put in the time to prepare the materials. That said, we nevertheless welcome anyone to just lurk or ask related questions.
+Note: this is tentative and subject to change
+Date | +Lesson | +Notes | +Recordings | +
---|---|---|---|
2023/10/03 | +Units modelling in and around OBO | +James Overton | ++ |
2023/09/19 | +Improving ontology interoperability with Biomappings | +Charlie Hoyt | ++ |
2023/09/05 | +Modern prefix management with Bioregistry and curies |
+Charlie Hoyt | ++ |
2023/08/22 | +How to determine if two entities are the same? | +Nico | +(subject open for debate) | +
2023/08/08 | +OMOP Vocabulary Management - how does it work and how does it differ from the OBO way of thinking? | +Nico | +(subject open for debate) | +
July 2023 | +Cancelled: Summer break | ++ | + |
2023/06/27 | +Cancelled | ++ | + |
2023/06/13 | +Modelling with Subclass and Equivalent class statements | +Tutorial by Henriette Harmse | +slides | +
2023/05/30 | +First steps with ChatGPT for semantic engineers and curators | +Led by Sierra Moxon and Nico Matentzoglu | +N/A | +
2023/05/16 | +Cancelled (Monarch/C-Path workshop) | ++ | + |
2023/05/02 | +Cancelled (No meeting week) | ++ | + |
2023/04/18 | +Overview of Protege 5.6 - the latest features | +Tutorial by Damien Goutte-Gattat (slides) | +Here | +
2023/04/04 | +Introduction to Exomiser | +Tutorial by Valentina, Yasemin and Carlo from QMUL. | +Here | +
2023/03/21 | +Introduction to Wikidata | +Tutorial by experts in the field Andra Waagmeester and Tiago Lubiana | +Here | +
2023/03/07 | +OAK for the Ontology Engineering community | +Tutorial by Chris Mungall | +Here | +
2023/02/21 | +OBO Academy Clinic | +Bring your ontology issues and questions to discuss with Sabrina and Nico! Attend the Ontology Summit Seminars instead! | ++ |
2023/02/07 | +Querying the Monarch KG using Neo4J | +Tutorial by Kevin Schaper | +Here | +
2023/01/24 | +OBO Academy Clinic | +Bring your ontology issues and questions to discuss with Sabrina and Nico! | ++ |
2023/01/10 | +Modeling with taxon constraints | +Tutorial by Jim Balhoff | +Here | +
2022/12/27 | +No Meeting | +Enjoy the Holidays! | ++ |
2022/12/13 | +Introduction to Semantic Entity Matching | +Slides | +Here | +
2022/11/29 | +OBO Academy hackathon | +Work on open tickets together. | ++ |
2022/11/15 | +Contributing to OBO ontologies - Part 2 | ++ | Here | +
2022/11/01 | +Contributing to OBO ontologies - Part 1 | ++ | Here | +
2022/10/18 | +Introduction to Medical Action Ontology (MAxO) | ++ | Here | +
2022/10/04 | +No meeting - ISB virtual conference: register here | ++ | + |
2022/09/20 | +How to be an open science ontologist | ++ | Here | +
2022/09/06 | +Pull Requests: Part 2 | ++ | Here | +
2022/07/26 | +Pull Requests: Part 1 | ++ | Here | +
2022/07/12 | +Basic introduction to the CLI: Part 2 | +Due to intermittent connection issues, the first few minutes of this recording are not included. Refer to the Tutorial link for the initial directions. | +Here | +
2022/06/28 | +Basic introduction to the CLI: Part 1 | ++ | Here | +
2022/06/14 | +Application/project ontologies | ++ | Here | +
2022/05/31 | +Contributing to ontologies: annotation properties | ++ | Here | +
2022/05/17 | +Introduction to managing mappings with SSSOM | ++ | Here | +
2022/05/03 | +No meeting | ++ | + |
2022/04/19 | +Disjointness and Unsatisfiability | ++ | Here | +
2022/04/05 | +No meeting | ++ | + |
2022/03/22 | +Creating an ontology from scratch | ++ | Here | +
2022/03/08 | +Obsoletions in OBO ontologies | +Review Obsoleting an Existing Ontology Term and Merging Ontology Terms. Slides are here. | +Here | +
2022/02/22 | +SPARQL for OBO ontology development | ++ | Here | +
2022/02/07 | +ODK/DOSDPs | ++ | Here | +
2022/01/25 | +Contributing to OBO ontologies | +This is not new content but we'll start at the beginning again with our previous lessons. | +Here | +
2022/01/11 | +Office hours with Nicole and Sabrina - no formal lesson | +Bring any open questions. | ++ |
2021/12/14 | +Lessons learned from troubleshooting ROBOT | +Open discussion, no advance preparation is needed. | ++ |
2021/11/30 | +Semantics of object properties (including Relations Ontology) | ++ | + |
2021/11/16 | +SPARQL for OBO ontology development | ++ | Here | +
2021/11/02 | +Templating: DOSDPs and ROBOT | ++ | + |
2021/10/19 | +Ontology Design | ++ | + |
2021/10/05 | +Cancelled due to overlap with ISB conference | ++ | + |
2021/09/21 | +Ontology Pipelines with ROBOT 2 | ++ | + |
2021/09/08 | +Migrating legacy ontology systems to ODK | ++ | + |
2021/09/07 | +Ontology Pipelines with ROBOT | ++ | + |
2021/09/01 | +Manage dynamic imports the ODK | ++ | + |
2021/08/25 | +Ontology Release Management with the ODK | ++ | Here | +
2021/08/24 | +Contributing to OBO ontologies 2 | ++ | Here | +
2021/08/17 | +Contributing to OBO ontologies | ++ | + |
Most of the materials used in this course were developed by James Overton, Becky Jackson, Nicole Vasilevsky and Nico Matentzoglu as part of a project with the Critical Path Institute (see here). The materials are improved as part of an internal training program (onboarding and CPD) for the Phenomics First project (NIH / NHGRI #1RM1HG010860-01).
+Thanks to Sarah Gehrke for her help with project management.
This course unit only covers the OBO part of the Ontology Summit 2023; for a full overview, see https://ontologforum.org/index.php/OntologySummit2023.
+Giving a broad overview of the key OBO methodologies and tools to the general ontology community.
Date | Lesson | Tutors | Notes
---|---|---|---
2023/01/25 | Introduction to COB | Chris Mungall | Slides
2023/02/01 | Introduction to ROBOT and OAK | James Overton and Chris Mungall |
2023/02/08 | Managing the Ontology Life Cycle with the Ontology Development Kit | Anita Caron, Damien Goutte-Gattat, Philip Stroemert, Nicolas Matentzoglu |
2023/02/15 | Using Dashboards to monitor OBO ontologies | Charlie Hoyt, Nicolas Matentzoglu, Anita Caron |
2023/02/02 | Using OBO Ontologies: Ubergraph and other applications | Jim Balhoff |
Editors: Sabrina Toro (@sabrinatoro), Nicolas Matentzoglu (@matentzn)
+Examples with images can be found here.
An entity such as an individual, a class, or a property can have annotations, such as labels, synonyms and definitions. An annotation property is used to link the entity to a value, which in turn can be anything from a literal (a string, number, date etc) to another entity (such as, another class).
+Here are some examples of frequently used annotation properties: (every element in bold is an annotation property)
+http://purl.obolibrary.org/obo/MONDO_0004975
+NCIT:C2866
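To make this concrete, the sketch below (OWL Manchester syntax) shows how a few common annotation properties attach values to a term, using the MONDO and NCIT identifiers mentioned above. The label and definition strings here are only illustrative placeholders, not the actual MONDO annotation values.

```
Prefix: obo: <http://purl.obolibrary.org/obo/>
Prefix: oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>

AnnotationProperty: rdfs:label
AnnotationProperty: obo:IAO_0000115
AnnotationProperty: oboInOwl:hasDbXref

Class: obo:MONDO_0004975
    Annotations:
        rdfs:label "Alzheimer disease",
        obo:IAO_0000115 "A placeholder textual definition of the term.",
        oboInOwl:hasDbXref "NCIT:C2866"
```

Here IAO:0000115 is the 'definition' annotation property and oboInOwl:hasDbXref records a database cross-reference.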
Annotation properties have their own IRIs, just like classes and individuals. For example, the IRI of the RDFS built in label property is http://www.w3.org/2000/01/rdf-schema#label. Other examples:
Annotation properties are just like other entities (classes, individuals) and can have their own annotations. For example, the annotation property http://purl.obolibrary.org/obo/IAO_0000232 has an rdfs:label ('curator note') and a human readable definition (IAO:0000115): 'An administrative note of use for a curator but of no use for a user'.
+Annotation properties can be organised in a hierarchical structure.
+For example, the annotation property 'synonym_type_property' (http://www.geneontology.org/formats/oboInOwl#SynonymTypeProperty) is the parent property of other, more specific ones (such as "abbreviation").
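As a minimal Manchester syntax sketch of such a hierarchy (only the SynonymTypeProperty IRI below is the real one mentioned above; the ex:abbreviation IRI is made up for illustration):

```
Prefix: ex: <http://example.org/>
Prefix: oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

AnnotationProperty: oboInOwl:SynonymTypeProperty

AnnotationProperty: ex:abbreviation
    SubPropertyOf: oboInOwl:SynonymTypeProperty
```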
Annotation properties are (usually) used with specific types of annotation values.
+*
Note: the type of annotation value required for an annotation property can be defined by adding a Range ("select datatype") in the Annotation Property's Description view,
e.g., 'scheduled for obsoletion on or after' (http://purl.obolibrary.org/obo/IAO_0006012).
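For example, a range can be declared on an annotation property directly in Manchester syntax. The sketch below assumes xsd:dateTime is the intended datatype for this particular property; check the ontology itself for the actual range.

```
Prefix: obo: <http://purl.obolibrary.org/obo/>
Prefix: xsd: <http://www.w3.org/2001/XMLSchema#>

AnnotationProperty: obo:IAO_0006012
    Range: xsd:dateTime
```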
Some annotation properties look like data properties (connecting an entity to a literal value) and others look like object properties (connecting an entity to another entity). Other than the fact that statements involving data and object properties look very different in RDF, the key difference from a user perspective is that OWL Reasoners entirely ignore triples involving annotation properties. Data and Object Properties are taken into account by the reasoner.
Object properties differ from annotation properties in that they are used in logical axioms (for example, existential restrictions linking one entity to another) and are taken into account by the reasoner.
Data properties differ from annotation properties in that they link an entity to a literal value (a number, string, date, etc.) that the reasoner takes into account. For example, we could define the class Boomer as all people born between 1946 and 1964 (see the sketch below). If an individual is asserted to be a Boomer but was born before 1946, the reasoner will report an inconsistency.
Note: before creating a new annotation property, it is always a good idea to check for an existing annotation property first.
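A minimal sketch of the Boomer example in Manchester syntax (the class and property names here are hypothetical, invented for illustration):

```
Prefix: xsd: <http://www.w3.org/2001/XMLSchema#>

DataProperty: birthYear
    Characteristics: Functional

Class: Person

Class: Boomer
    EquivalentTo: Person and (birthYear some xsd:integer[>= 1946, <= 1964])

Individual: Jo
    Types: Boomer
    Facts: birthYear 1940
```

Because birthYear is declared functional, the reasoner flags Jo as inconsistent: the asserted value 1940 cannot also satisfy the 1946-1964 restriction. If birthYear were an annotation property instead, the reasoner would ignore it entirely.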
+Detailed explanations for adding a new annotation property can be found here
+The word "annotation" is used in different contexts to mean different things. For instance, "annotation in owl" (ie annotations to an ontology term) is different from "annotation in the biocuration sense" (ie gene-to-disease, gene-to-phenotype, gene-to-function annotations). It is therefore crucial to give context when using the word "annotation".
+ + + + + + +Given
+ObjectProperty: r
+Class: D
+ EquivalentTo: r some C
+Class: C
+
+the semantics of
+r some C
is the set of individuals such that for each
+individual x
there is at least 1 individual y
of type C
that is linked to
+x
via the object property r
.
Based on this semantics, a possible world adhering to our initial equivalence axiom may be:
+ +In this Venn diagram we assume individuals are black dots.
+Thus, our world consists of 7 individuals,
+with only 2 classes, namely C
and D
, as well as 2 object properties, namely r
and q
.
+In this world, D
and thus the class r some C
, consist of only 2 individuals. D
and
+r some C
consist of only 2 individuals because these are the only individuals linked
+via object property r
to at least 1 individual respectively in C
.
In the following we define a pet owner as someone that owns at least 1 pet.
+ObjectProperty: owns
+Class: PetOwner
+ EquivalentTo: owns some Pet
+Class: Pet
+
+If we want to introduce the class DogOwner
, assuming we can only use the class Pet
+and the object property owns
(assuming we have not defined PetOwner
), we could say
+that a dog owner is a subset of pet owners:
ObjectProperty: owns
+Class: DogOwner
+ SubClassOf: owns some Pet
+Class: Pet
+
+In this case we use SubClassOf
instead of EquivalentTo
because not every pet owner
+necessarily owns a dog. This is equivalent to stating:
ObjectProperty: owns
+Class: PetOwner
+ EquivalentTo: owns some Pet
+Class: Pet
+Class: DogOwner
+ SubClassOf: PetOwner
+
+In the previous section we modeled a PetOwner
as owns some Pet
. In the expression
+owns some Pet
Pet
is referred to as the filler of owns
and more specifically
+we say Pet
is the owns
-filler.
The axiom PetOwner EquivalentTo: owns some Pet states that pet owners are those individuals that own a pet, and it ignores all other owns
-fillers that are not pets. How can we define
+arbitrary ownership?
ObjectProperty: owns
+Class: Owner
+ EquivalentTo: owns some owl:Thing
+
+We can base restrictions on having a relation to a specific named individual, +i.e.:
+Individual: UK
+ObjectProperty: citizenOf
+Class: UKCitizen
+ EquivalentTo: citizenOf hasValue UK
+
So far we have only considered existential restrictions based on object properties, but it is also possible to define existential restrictions based on data properties. As an example, we all expect that persons have at least 1 name. This could be expressed as follows:
+DataProperty: name
+Class: Person
+ SubClassOf: name some xsd:string
+
+In our example of Person SubClassOf: name some xsd:string
, why did we use SubClassOf
+rather than EquivalentTo
? That is, why did we not use
+Person EquivalentTo: name some xsd:string
? With using the EquivalentTo
axiom, any
+individual that has a name, will be inferred to be an instance of Person
. However,
+there are many things in the world that have names that are not persons. Some examples are pets,
+places, regions, etc:
Compare this with, for example, DogOwner
:
ObjectProperty: owns
+Class: Dog
+Class: DogOwner
+ EquivalentTo: owns some Dog
+
+
+ Based on CL editors training by David Osumi-Sutherland
We face an ever-increasing deluge of biological data and analyses. Ensuring that these data and analyses are Findable, Accessible, Interoperable, and Re-usable (FAIR) is a major challenge. Findability, Interoperability, and Reusability can all be enhanced by standardising metadata. Well-standardised metadata can make it easy to find data and analyses despite variations in terminology ('Clara cell' vs 'nonciliated bronchiolar secretory cell' vs 'club cell') and precision ('bronchial epithelial cell' vs 'club cell'). Understanding which entities are referred to in metadata and how they relate to the annotated material can help users work out if the data or analysis they have found is of interest to them and can aid in its re-use and interoperability with other data and analyses. For example, does an annotation of sample data with a term for breast cancer refer to the health status of the patient from which the sample was derived, or that the sample itself comes from a breast cancer tumor?
+Given variation in terminology and precision, annotation with free text alone is not sufficient for findability. One very lightweight solution to this problem is to rely on user-generated keyword systems, combined with some method of allowing users to choose from previously used keywords. This can produce some degree of annotation alignment but also results in fragmented annotation and varying levels of precision with no clear way to relate annotations.
+For example, trying to refer to feces, in NCBI BioSample:
Query | Records
---|---
Feces | 22,592
Faeces | 1,750
Ordure | 2
Dung | 19
Manure | 154
Excreta | 153
Stool | 22,756
Stool NOT faeces | 21,798
Stool NOT feces | 18,314
Terminology alone can be ambiguous. The same term may be used for completely unrelated or vaguely analogous structures. An insect femur and a mammalian femur are neither evolutionarily related nor structurally similar. Biologists often like to use abbreviations to annotate data, but these can be extremely ambiguous. Drosophila biologists use DA1 to refer to structures in the tracheal system, musculature and nervous system. Outside of Drosophila biology it is used to refer to many other things, including a rare disease and a neuron type in C. elegans.
Some extreme examples of this ambiguity come from terminological drift in fields with a long history. For example, in the male genitalia of a gasteruptiid wasp, these 5 different structures here have each been labeled "paramere" by different people, each studying different hymenopteran lineages. How do we know what "paramere" means when it is referred to?
+ +This striking example shows that even precise context is not always sufficient for disambiguation.
Rather than rely on users to generate lists of re-usable keywords, we can instead pre-specify a set of terms to use in annotation. This is usually referred to as a controlled vocabulary, or CV.
A hierarchical CV is any controlled vocabulary that is arranged in a hierarchy.
A taxonomy is a hierarchical CV in which the hierarchy encodes classification. E.g., 'Merlot' is classified as a 'Red' (wine). Not all hierarchical CVs are classifications. For example, anatomical atlases often have hierarchical CVs representing "parthood". The femur is a part of the leg, but it is not 'a leg'.
The use of a hierarchical CV in which general terms group more specific terms allows for varying precision (glial cell vs some specific subtype) and simple grouping of annotated content.
+For example:
+ +Hierarchical CVs tend to increase in complexity in particular ways:
+To support findability, terms in hierarchical CVs often need to be associated with synonyms, or cross-referenced to closely related terms inside the CV.
CV content is often driven by requests from annotators, so expansion is not guided by any unified vision or scheme. This often leads to pressure for hierarchies to support terms having multiple parents, either reflecting multiple relationship types, or multiple types of classification. For example, an anatomical CV could reasonably put 'retinal bipolar cell' under 'retina' based on location and, at the same time, under 'bipolar neuron' and 'glutamatergic neuron' based on cell type classification.
Developers of hierarchical CVs often come to realise that multiple relationship types are represented in the hierarchy and that it can be useful to name these relationships to distinguish them. For example, a heart glial cell is a 'type of' glial cell, but is 'part of' the heart.
+ +Definitions of ontologies can be controversial. Rather than attempting a comprehensive definition, this tutorial will emphasise ontologies as:
+Terms are arranged in a classification hierarchy
+Terms are defined
+Terms are richly annotated:
+Relationships between terms are defined, allowing logical inference and sophisticated queries as well as graph representations.
+Expressed in a knowledge representation language such as RDFS, OBO, or OWL
+Terminology can be ambiguous, so text definitions, references, synonyms and images are key to helping users understand the intended meaning of a term.
+ +Identifiers that do not hold any inherent meaning are important to ontologies. If you ever need to change the names of your terms, you're going to need identifiers that stay the same when the term name changes.
+For example:
+A microglial cell is also known as: hortega cell, microglia, microgliocyte and brain resident macrophage.
+In the cell ontology, it is however referred to by a unique identifier: CL:0000129
These identifiers are short ways of referring to IRIs (e.g., CL:0000129 = http://purl.obolibrary.org/obo/CL_0000129)
+This IRI is a unique, resolvable identifier on the web.
A group of ontologies, loosely co-ordinated through the OBO Foundry, have standardised their IRIs (e.g. http://purl.obolibrary.org/obo/CL_0000129 - a term in the cell ontology; http://purl.obolibrary.org/obo/cl.owl - the cell ontology)
OBO ontologies are mostly written in OWL2 or OBO syntax. The latter is a legacy format that maps completely to OWL.
+For a more in-depth explanation of formats (OWL, OBO, RDF etc.) refer to explainer on OWL format variants. +In the examples below we will use OWL Manchester syntax, which allows us to express formal logic in English-like sentences.
+Ontology terms refer to classes of things in the world. For example, the class of all wings.
+Below you will see a classification of parts of the insect and how it is represented in a simple ontology.
We use SubClassOf (or is_a in OBO format) to represent that one class fully subsumes another. For example:
OWL: hindwing SubClassOf wing
OBO: hindwing is_a wing
+In English we might say: "a hindwing is a type of wing" or more specifically, "all instances of hindwing are instances of wing." 'Instance' here refers to a single wing of an individual fly.
+ +In the previous section, we talked about different types of relationships. In OWL we can define specific relations (known as object properties). One of the commonest is 'part of' which you can see used below.
English: all (insect) legs are part of a thoracic segment
OWL: 'leg' SubClassOf part_of some 'thoracic segment'
OBO: 'leg'; relationship: part_of thoracic segment
+It might seem odd at first that OWL uses subClassOf here too. The key to understanding this is the concept of an anonymous class - in OWL, we can refer to classes without giving them names. In this case, the anonymous class is the class of all things that are 'part of' (some) 'thoracic segment' (in insects). A vast array of different anatomical structures are subclasses of this anonymous class, some of which, such as wings, legs, and spiracles, are visible in the diagram.
+Note the existential quantifier some
in OWL format -- it is interpreted as "there exists", "there is at least one", or "some".
The quantifier is important to the direction of relations.
+subClassOf:
+'wing' SubClassOf part_of some 'thoracic segment'
is correct
+'thoracic segment' SubClassOf has_part some 'wing'
is incorrect as it implies all thoracic segments have wings as a part.
Similarly:
+'claw' SubClassOf connected_to some 'tarsal segment'
is correct
+'tarsal segment' SubClassOf connected_to some 'claw'
is incorrect as it implies all tarsal segments are connected to claws (for example, some tarsal segments are connected to other tarsal segments)
These relationships store knowledge in a queryable format. For more information about querying, please refer to guide on DL queries and SPARQL queries.
+There are many ways to classify things. For example, a neuron can be classified by structure, electrophysiology, neurotransmitter, lineage, etc. Manually maintaining these multiple inheritances (that occur through multiple classifications) does not scale.
+ +Problems with maintaining multiple inheritance classifications by hand
+Doesn’t scale
+When adding a new class, how are human editors to know
+all of the relevant classifications to add?
+how to rearrange the existing class hierarchy?
+It is bad for consistency
+Reasons for existing classifications often opaque
+Hard to check for consistency with distant superclasses
+Doesn’t allow for querying
+The knowledge an ontology contains can be used to automate classification. For example:
English: Any sensory organ that functions in the detection of smell is an olfactory sensory organ
OWL:
'olfactory sensory organ'
    EquivalentTo 'sensory organ'
    that capable_of some 'detection of smell'
+
If we then have an entity nose
that is subClassOf sensory organ
and capable_of some detection of smell
, it will be automatically classified as an olfactory sensory organ.
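The same example written out in Manchester syntax (a schematic sketch using the labels above, not the exact axioms of any particular ontology):

```
ObjectProperty: capable_of

Class: 'sensory organ'
Class: 'detection of smell'

Class: 'olfactory sensory organ'
    EquivalentTo: 'sensory organ' and (capable_of some 'detection of smell')

Class: nose
    SubClassOf: 'sensory organ'
    SubClassOf: capable_of some 'detection of smell'
```

Running a reasoner (e.g. ELK or HermiT in Protégé) infers nose SubClassOf 'olfactory sensory organ' without that relationship ever being asserted.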
Many classes, especially in the domains of disease and phenotype, describe combinations of multiple classes - but it is very important to carefully distinguish whether this combination follows "disjunctive" logic ("or") or "conjunctive" logic ("and"). Both mean something entirely different. Usually where a class has 'and' in the label, such as 'neonatal inflammatory skin and bowel disease' (MONDO:0017411), the class follows a conjunctive logic (as expected), and should be interpreted in a way that someone who presents with this disease has both neonatal inflammatory skin disease and bowel disease at once. This class should be classified as a child of 'bowel disease' and 'neonatal inflammatory skin disease'. Note, however, that naming in many ontologies is not consistent with this logic, and you need to be careful to distinguish whether the interpretation is supposed to be conjunctive or disjunctive (i.e. "and" could actually mean "or", which is especially often the case for clinical terminologies).
+Having asserted multiple SubClassOf axioms means that an instance of the class is a combination of all the SubClass Of statements (conjunctive interpretation, see above). For example, if 'neonatal inflammatory skin and bowel disease' is a subclass of both 'bowel disease' and 'neonatal inflammatory skin disease', then an individual with this disease has 'bowel disease' and 'neonatal inflammatory skin disease'.
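In Manchester syntax, the conjunctive reading corresponds simply to two SubClassOf axioms (a sketch using the labels above):

```
Class: 'bowel disease'
Class: 'neonatal inflammatory skin disease'

Class: 'neonatal inflammatory skin and bowel disease'
    SubClassOf: 'bowel disease'
    SubClassOf: 'neonatal inflammatory skin disease'
```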
+ +If there were a class 'neonatal inflammatory skin or bowel disease', the intention is usually that this class follows disjunctive logic. A class following this logic would be interpreted in a way that an individual with this disease has either bowel disease or neonatal inflammatory skin disease or both. It would not be accurate to classify this class as a child of bowel disease and neonatal inflammatory skin disease. This type of class is often called a "grouping class", and is used to aggregate related diseases in a way useful to users, like "disease" and "sequelae of disease".
+ +This explainer requires understanding of ontology classifications. Please see "an ontology as a classification" section of the introduction to ontologies documentation if you are unfamiliar with these concepts.
+You can watch this video about an introduction to Logical Description.
+Logical axioms are relational information about classes that are primarily aimed at machines. This is opposed to annotations like textual definitions which are primarily aimed at humans. These logical axioms allow reasoners to assist in and verify classification, lessening the development burden and enabling expressive queries.
+Ideally, everything in the definition should be axiomatized when possible. For example, if we consider the cell type oxytocin receptor sst GABAergic cortical interneuron
, which has the textual definition:
"An interneuron located in the cerebral cortex that expresses the oxytocin receptor. These interneurons also express somatostatin."
+The logical axioms should then follow accordingly:
+SubClassOf:
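A plausible reconstruction of these axioms in Manchester syntax, derived directly from the textual definition above (the exact relations and fillers used in the source ontology may differ):

```
Class: 'oxytocin receptor sst GABAergic cortical interneuron'
    SubClassOf: interneuron
    SubClassOf: 'has soma location' some 'cerebral cortex'
    SubClassOf: expresses some 'oxytocin receptor'
    SubClassOf: expresses some somatostatin
```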
+These logical axioms allow a reasoner to automatically classify the term. For example, through the logical axioms, we can infer that oxytocin receptor sst GABAergic cortical interneuron
is a cerebral cortex GABAergic interneuron
.
Axiomatizing definitions well will also allow for accurate querying. For example, if I wanted to find a neuron that expresses oxytocin receptor, having the SubClassOf axioms of interneuron
and expresses some 'oxytocin receptor'
will allow me to do so on DL query (see tutorial on DL query for more information about DL queries).
Everything in the logical axioms must be true, (do not axiomatize things that are true to only part of the entity)
+For example, the cell type chandelier pvalb GABAergic cortical interneuron
is found in upper L2/3 and deep L5 of the cerebral cortex.
+We do not make logical axioms for has soma location
some layer 2/3 and layer 5.
Axioms with both layers would mean that a cell of that type must be in both layer 2/3 and layer 5, which is an impossibility (a cell cannot be in two separate locations at once!). Instead we axiomatize a more general location: 'has soma location' some 'cerebral cortex'
An equivalent class axiom is an axiom that defines the class: it states necessary and sufficient conditions for membership of that class. It means that if a class B fulfils all the criteria/restrictions in the equivalent class axiom of class A, class B is by definition a subclass of class A. Equivalent class axioms allow the reasoner to automatically classify entities.
+For example:
+chandelier cell
has the equivalent class axiom interneuron and ('has characteristic' some 'chandelier cell morphology')
chandelier pvalb GABAergic cortical interneuron
has the subclass axioms 'has characteristic' some 'chandelier cell morphology'
and interneuron
chandelier pvalb GABAergic cortical interneuron
is therefore a subclass of chandelier cell
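The same example as a Manchester syntax sketch (labels as in the bullets above):

```
ObjectProperty: 'has characteristic'

Class: interneuron
Class: 'chandelier cell morphology'

Class: 'chandelier cell'
    EquivalentTo: interneuron and ('has characteristic' some 'chandelier cell morphology')

Class: 'chandelier pvalb GABAergic cortical interneuron'
    SubClassOf: interneuron
    SubClassOf: 'has characteristic' some 'chandelier cell morphology'
```

Because the second class satisfies every condition in the equivalent class axiom, the reasoner classifies it under 'chandelier cell' without that parent ever being asserted.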
Equivalent class axioms classification can be very powerful as it takes into consideration complex layers of axioms.
+For example:
+primary motor cortex pyramidal cell
has the equivalent class axiom 'pyramidal neuron' and ('has soma location' some 'primary motor cortex')
.Betz cell
has the axioms 'has characteristic' some 'standard pyramidal morphology'
and 'has soma location' some 'primary motor cortex layer 5'
Betz cell
is inferred to be a primary motor cortex pyramidal cell
through the following chain (you can see this in Protégé by pressing the ? button on inferred class):The ability of the reasoner to infer complex classes helps identify classifications that might have been missed if done manually. However, when creating an equivalent class axiom, you must be sure that it is not overly constrictive (in which case, classes that should be classified under it gets missed) nor too loose (in which case, classes will get wrongly classified under it).
+Example of both overly constrictive and overly loose equivalent class axiom:
+neuron equivalent to cell and (part_of some 'central nervous system')
In such cases, sometimes not having an equivalent class axioms is better (like in the case of neuron), and asserting is the best way to classify a child.
+Each ontology has certain styles and conventions in how they axiomatize. This style guide is specific to OBO ontologies. We will also give reasons as to why we choose to axiomatize in the way we do. However, be aware of your local ontology's practices.
+It is important to note that ontologies have specific axiomatization styles and may apply to, for example, selecting a preferred relation. This usually reflects their use cases. For example, the Cell Ontology has a guide for what relations to use. An example of an agreement in the community is that while anatomical locations of cells are recorded using part of
, neurons should be recorded with has soma location
. This is to accommodate for the fact that many neurons have long reaching axons that cover multiple anatomical locations making them difficult to axiomatize using part of
.
For example, Betz cell
, a well known cell type which defines layer V of the primary motor cortex, synapses lower motor neurons or spinal interneurons (cell types that reside outside the brain). Having the axiom 'Betz cell' part_of 'cortical layer V'
is wrong. In this case has soma location
is used. Because of cases like these that are common in neurons, all neurons in CL should use has soma location
.
Do not add axioms that are not required. If a parent class already has the axiom, it should not be added to the child class too. +For example:
+retinal bipolar neuron
is a child of bipolar neuron
bipolar neuron
has the axiom 'has characteristic' some 'cortical bipolar morphology'
'has characteristic' some 'cortical bipolar morphology'
to retinal bipolar neuron
Axioms add lines to the ontology, resulting in larger ontologies that are harder to use. They also add redundancy, making the ontology hard to maintain as a single change in classification might require multiple edits.
+Asserted is_a parents do not need to be retained as entries in the 'SubClass of' section of the Description window in Protégé if the logical definition for a term results in their inference.
+For example, cerebral cortex GABAergic interneuron
has the following logical axioms:
Equivalent_To
+ 'GABAergic interneuron' and
+ ('has soma location' some 'cerebral cortex')
+
We do not need to assert that it is a cerebral cortex neuron
, CNS interneuron
, or neuron of the forebrain
as the reasoner automatically does that.
We avoid having asserted subclass axioms as these are redundant lines in the ontology which can result in a larger ontology, making them harder to use.
+Good practice to let the reasoner do the work:
+1) If you create a logical definition for your term, you should delete all redundant, asserted is_a parent relations by clicking on the X to the right of the term.
+2) If an existing term contains a logical definition and still shows an asserted is_a parent in the 'SubClass of' section, you may delete that asserted parent. Just make sure to run the Reasoner to check that the asserted parent is now replaced with the correct reasoned parent(s).
+3) Once you synchronize the Reasoner, you will see the reasoned classification of your new term, including the inferred is_a parent(s).
+4) If the inferred classification does not contain the correct parentage, or doesn't make sense, then you will need to modify the logical definition.
+
10 min overview of Jérôme Euzenat and Pavel Shvaiko ground-breaking Ontology Matching.
+ + + + + + + +Here we briefly list the building blocks that are used in OWL that enable reasoning.
+OWL | +Semantics | +Example | +
---|---|---|
instance or individual | +A member of a set. | +A person called Mary or a dog called Fido . |
+
class | +A set of in dividuals. | +The Person class consisting of persons or the Dog class consisting of dogs. |
+
object property | +A set of pairs of individuals. | +The owns object property can link a pet and its owner: Mary owns Fido . |
+
data property | +A set of pairs where each pair consists of an individual linked to a data value. | +The data property hasAge can link a number representing an age to an individual: hasAge(Mary, 10) . |
+
For reference of the more technical aspects of release artefacts, please see documentation on Release Artefacts
Ontologies come in different serialisations, formalisms, and variants. For example, there are a full 9 (!) different release files associated with an ontology released using the default settings of the Ontology Development Kit, which causes a lot of confusion for current and prospective users.
Note: In the OBO Foundry pages, "variant" is currently referred to as "products", i.e. the way we use "variant" here is synonymous with the notion of "product".
Some people also like to list SHACL and ShEx as ontology languages and formalisms. Formalisms define syntax (e.g. grammar rules) and semantics (what does a given expression mean?). The analogue in the real world would be natural languages, like English or Greek.
+git diff
, i.e. changes to ontologies in functional syntax are much easier to review. RDF/XML is not suitable for manual review, due to its verbosity and complexity.
+src/ontology/cl-edit.owl
.subClassOf
and partOf
. Some users that require the ontology to correspond to acyclic graphs, or deliberately want to focus only on a set of core relations, will want to use this variant, see docs). The formal definition of the basic variant can be found here.owl:imports
statements - these are easily ignored by your users and make the intended "content" of the ontology quite none-transparent.SubClassOf
vs EquivalentTo
¶This lesson assumes you have basic knowledge wrt ontologies and OWL as explained in:
+ +SubClassOf
¶In this section we explain the semantics of SubClassOf
, give an example of using SubClassOf
and provide guidance for when not to use SubClassOf
.
If we have
+Class: C
+ SubClassOf: D
+Class: D
+
+the semantics of it is given by the following Venn diagram:
+ +Thus, the semantics is given by the subset relationship, stating the C
is a subset of D
. This means every individual
+of C
is necessarily an individual of D
, but not every individual of D
is necessarily an individual of C
.
Class: Dog
+ SubClassOf: Pet
+Class: Pet
+
+which as a Venn diagram will look as follows:
+ +There are at least 2 scenarios which at first glance may seem like C SubClassOf D
holds, but it does not hold, or
+using C EquivalentTo D
may be a better option.
C
has many individuals that are in D
, but there is at least 1 individual of C
that is
+not in D
. The following Venn diagram is an example. Thus, to check whether you may be dealing with this scenario, you
+can ask the following question: Is there any individual in C
that is not in D
? If 'yes', you are dealing with this
+scanario and you should not be using C SubClassOf D
. C
in D
, but also every individual in D
is in C
. This means C
and D
are equivalent. In the case you rather want
+to make use of EquivalentTo
.EquivalentTo
¶If we have
+Class: C
+ EquivalentTo: D
+Class: D
+
+this means the sets C
and D
fit perfectly on each other, as shown in the next Venn diagram:
Note that C EquivalentTo D
is shorthand for
Class: C
+ SubClassOf: D
+Class: D
+ SubClassOf: C
+
+though, in general it is better to use EquivalentTo
rather than the 2 SubClassOf
axioms when C
and D
are equivalent.
We all probably think of humans and persons as the exact same set of individuals.
+Class: Person
+ EquivalentTo: Human
+Class: Human
+
+and as a Venn diagram:
+ +When do you not want to use EquivalentTo
?
C
that is not in D
.D
that is not in C
.Taxon restrictions (or, "taxon constraints") are a formalised way to record what species a term applies to—something crucial in multi-species ontologies.
+Even species neutral ontologies (e.g., GO) have classes that have implicit taxon restriction.
+GO:0007595 ! Lactation - defined as “The secretion of milk by the mammary gland.”
+
Finding inconsistencies. Taxon restrictions use terms from the NCBI Taxonomy Ontology, which asserts pairwise disjointness between sibling taxa (e.g., nothing can be both an insect and a rodent). When terms have taxon restrictions, a reasoner can check for inconsistencies.
+When GO implemented taxon restrictions, they found 5874 errors!
+Defining taxon-specific subclasses. You can define a taxon-specific subclass of a broader concept, e.g., 'human clavicle'. This allows you, for example, to assert relationships for the new term that don't apply to all instances of the broader concept:
+'human clavicle' EquivalentTo ('clavicle bone' and ('in taxon' some 'Homo sapiens'))
+'human clavicle' SubClassOf ('connected to' some sternum)
+
Creating SLIMs. Use a reasoner to generate ontology subsets containing only those terms that are logically allowed within a given taxon.
+Querying. Facet terms by taxon. E.g., in Brain Data Standards, in_taxon axioms allow faceting cell types by species. (note: there are limitations on this and may be incomplete).
+There are, in essence, three categories of taxon-specific knowledge we use across OBO ontologies. Given a class C
, which could be anything from an anatomical entity to a biological process, we have the following categories:
C
are in some instance of taxon T
"C SubClassOf (in_taxon some T)
+
C
.C
are in taxon T
"C SubClassOf (not (in_taxon some T))`
+
C DisjointWith (in_taxon some T)
C SubClassOf (in_taxon some (not T))
C never_in_taxon T
# Editors use thisnever_in_taxon
annotations, the taxon should be as broad as possible for the maximum utility, but it must be the case that a C
is never found in any subclass of that taxon.C
and in_taxon some T
"¶C
is in taxon T
".IND:a Type (C and (in_taxon some T))`
+
C_in_T SubClassOf (C and (in_taxon some T)
(C_in_T
will be unsatisifiable if violates taxon constraints)C present_in_taxon T
# Editors use thisPlease see how-to guide on adding taxon restrictions
+As stated above, one of the major applications for taxon restrictions in OBO is for quality control (QC), by finding logical inconsistencies. Many OBO ontologies consist of a complex web of term relationships, often crossing ontology boundaries (e.g., GO biological process terms referencing Uberon anatomical structures or CHEBI chemical entities). If particular terms are only defined to apply to certain taxa, it is critical to know that a chain of logic implies that the term must exist in some other taxon which should be impossible. Propagating taxon restrictions via logical relationships greatly expands their effectiveness (the GO term above may acquire a taxon restriction via the type of anatomical structure in which it occurs).
+It can be helpful to think informally about how taxon restrictions propagate over the class hierarchy. It's different for all three types:
+in_taxon
) include all superclasses of the taxon, and all subclasses of the subject term:
+ never_in_taxon
) include all subclasses of the taxon, and all subclasses of the subject term:
+ present_in_taxon
) include all superclasses of the taxon, and all superclasses of the subject term:
+ The Relation Ontology defines number of property chains for the in_taxon
property. This allows taxon restrictions to propagate over other relationships. For example, the part_of o in_taxon -> in_taxon
chain implies that if a muscle is part of a whisker, then the muscle must be in a mammal, but not in a human, since we know both of these things about whiskers:
Property chains are the most common way in which taxon restrictions propagate across ontology boundaries. For example, Gene Ontology uses various subproperties of results in developmental progression of to connect biological processes to Uberon anatomical entities. Any taxonomic restrictions which hold for the anatomical entity will propagate to the biological process via this property.
+The graph depictions in the preceding illustrations are informal; in practice never_in_taxon
and present_in_taxon
annotations are translated into more complex logical constructions using the in_taxon
object property, described in the next section. These logical constructs allow the OWL reasoner to determine that a class is unsatisfiable when there are conflicts between taxon restriction inferences.
The OWL axioms required to derive the desired entailments for taxon restrictions are somewhat more complicated than one might expect. Much of the complication is the result of workarounds to limitations dictated by the OWL EL profile. Because of the size and complexity of many of the ontologies in the OBO Library, particularly those heavily using taxon restrictions, we primarily rely on the ELK reasoner, which is fast and scalable since it implements OWL EL rather than the complete OWL language. In the following we discuss the particular kinds of axioms required in order for taxon restrictions to work with ELK, with some comments about how it could work with HermiT (which implements the complete OWL language but is much less scalable). We will focus on this example ontology:
+There are three classes outlined in red which were created mistakenly; the asserted taxon for each of these conflicts with taxon restrictions in the rest of the ontology:
+part_of o in_taxon -> in_taxon
. This conflicts with its asserted in_taxon 'Homo sapiens', a subclass of 'Hominidae'.We can start by modeling the two taxon restrictions in the ontology like so:
+'hair' SubClassOf (in_taxon some 'Mammalia')
'whisker' SubClassOf (not (in_taxon some 'Hominidae'))
Both HermiT and ELK can derive that 'whisker in human' is unsatisfiable. This is the explanation:
+'human whisker' EquivalentTo ('whisker' and (in_taxon some 'Homo sapiens'))
'Homo sapiens' SubClassOf 'Hominidae'
'whisker' SubClassOf (not ('in_taxon' some 'Hominidae'))
Unfortunately, neither reasoner detects the other two problems. We'll address the 'whisker in catfish' first. The reasoner infers that this class is in_taxon
both 'Mammalia' and 'Siluriformes'. While these are disjoint classes (all sibling taxa are asserted to be disjoint in the taxonomy ontology), there is nothing in the ontology stating that something can only be in one taxon at a time. The most intuitive solution to this problem would be to assert that in_taxon
is a functional property. However, due to limitations of OWL, functional properties can't be used in combination with property chains. Furthermore, functional properties aren't part of OWL EL. There is one solution that works for HermiT, but not ELK. We could add an axiom like the following to every "always in taxon" restriction:
'hair' SubClassOf (in_taxon only 'Mammalia')
This would be sufficient for HermiT to detect the unsatisfiability of 'whisker in catfish' (assuming taxon sibling disjointness). Unfortunately, only
restrictions are not part of OWL EL. Instead of adding the only
restrictions, we can generate an extra disjointness axiom for every taxon disjointness in the taxonomy ontology, e.g.:
(in_taxon some 'Tetrapoda') DisjointWith (in_taxon some 'Teleostei')
The addition of axioms like that is sufficient to detect the unsatisfiability of 'whisker in catfish' in both HermiT and ELK. This is the explanation:
+'whisker in catfish' EquivalentTo ('whisker' and (in_taxon some 'Siluriformes'))
'whisker' SubClassOf 'hair'
'hair' SubClassOf (in_taxon some 'Mammalia')
'Mammalia' SubClassOf 'Tetrapoda'
'Siluriformes' SubClassOf 'Teleostei'
(in_taxon some 'Teleostei') DisjointWith (in_taxon some 'Tetrapoda')
While we can now detect two of the unsatisfiable classes, sadly neither HermiT nor ELK yet finds 'whisker muscle in human' to be unsatisfiable, which requires handling the interaction of a "never" assertion with a property chain. If we were able to make in_taxon
a functional property, HermiT should be able to detect the problem; but as we said before, OWL doesn't allow us to combine functional properties with property chains. The solution is to add even more generated disjointness axioms, one for each taxon (in combination with the extra disjointness we added in the previous case), e.g.,:
(in_taxon some Hominidae) DisjointWith (in_taxon some (not Hominidae))
While that is sufficient for HermiT, for ELK we also need to add another axiom to the translation of each never_in_taxon assertion, e.g.,:
+'whisker' SubClassOf (in_taxon some (not 'Hominidae'))
Now both HermiT and ELK can find 'whisker muscle in human' to be unsatisfiable. This is the explanation from ELK:
+'whisker muscle in human' EquivalentTo ('whisker muscle' and (in_taxon some 'Homo sapiens'))
'Homo sapiens' SubClassOf 'Hominidae'
'whisker muscle' SubClassOf (part_of some 'whisker')
'whisker' SubClassOf (in_taxon some ('not 'Hominidae'))
part_of o in_taxon SubPropertyOf in_taxon
(in_taxon some 'Hominidae') DisjointWith (in_taxon some (not 'Hominidae'))
The above example didn't incorporate any present_in_taxon (SOME-IN) assertions. These work much the same as ALL-IN in_taxon assertions. However, instead of stating that all instances of a given class are in a taxon (C SubClassOf (in_taxon some X)
), we either state that there exists an individual of that class in that taxon, or that there is some subclass of that class whose instances are in that taxon:
<generated individual IRI> Type (C and (in_taxon some X))
— violations involving this assertion will make the ontology logically inconsistent.
or
+<generated class IRI> SubClassOf (C and (in_taxon some X))
— violations involving this assertion will make the ontology logically incoherent, i.e., a named class is unsatisfiable (here, <generated class IRI>
).
Incoherency is easier to debug than inconsistency, so option 2 is the default expansion for present_in_taxon
.
In summary, the following constructs are all needed for QC using taxon restrictions:
+in_taxon
property chains for relations which should propagate in_taxon
inferencesX DisjointWith Y
for all sibling taxa X
and Y
(in_taxon some X) DisjointWith (in_taxon some Y)
for all sibling taxa X
and Y
(in_taxon some X) DisjointWith (in_taxon some (not X))
for every taxon X
C in_taxon X
C SubClassOf (in_taxon some X)
C never_in_taxon X
C SubClassOf (not (in_taxon some X))
C SubClassOf (in_taxon some (not X))
C present_in_taxon X
)<generated class IRI> SubClassOf (C and (in_taxon some X))
If you are checking an ontology for coherency in a QC pipeline (such as by running ROBOT within the ODK), you will need to have the required constructs from the previous section present in your import chain:
+http://purl.obolibrary.org/obo/ncbitaxon.owl
)http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim-disjoint-over-in-taxon.owl
(or implement a way to generate the needed disjointness axioms)(in_taxon some X) DisjointWith (in_taxon some (not X))
. You may need to implement a way to generate the needed disjointness axioms until this is corrected.never_in_taxon
and present_in_taxon
shortcut annotation properties, you can expand these into the logical forms using robot expand
.present_in_taxon
expansions add named classes to your ontology, you will probably want to organize your pipeline in such a way that this expansion only happens in a QC check, and the output is not included in your published ontology.Using the DL Query panel and a running reasoner, it is straightforward to check whether a particular taxon restriction holds for a term (such as when someone has requested one be added to your ontology). Given some term of interest, e.g., 'whisker', submit a DL Query such as 'whisker' and (in_taxon some Mammalia)
. Check the query results:
Equivalent classes
includes owl:Nothing
, then a never_in_taxon is implied for that taxon.Equivalent classes
includes the term of interest itself (and not owl:Nothing
), then an in_taxon is implied for that taxon.Superclasses
includes the term of interest (and the query isn't equivalent to owl:Nothing
), then there is no particular taxon restriction involving that taxon.To quickly see exactly which taxon restrictions are in effect for a selected term, install the OBO taxon constraints plugin for Protégé. Once you have the plugin installed, you can add it to your Protégé window by going to the menu Window > Views > OBO views > Taxon constraints
, and then clicking the location to place the panel. The plugin will show the taxon constraints in effect for the selected OWL class. When a reasoner is running, any inferred taxon constraints will be shown along with directly asserted ones. The plugin executes many reasoner queries behind the scenes, so there may be a delay before the user interface is updated.
Comments are annotations that may be added to ontology terms to further explain their intended usage, or include information that is useful but does not fit in areas like definition.
+Some examples of comments, and possible standard language for their usage, are:
+WARNING: THESE EXAMPLES ARE NOT UNIVERSALLY USED AND CAN BE CONTROVERSIAL IN SOME ONTOLOGIES! PLEASE CHECK WITH THE CONVENTIONS OF YOUR ONTOLOGY BEFORE DOING THIS!
+This term should not be used for direct annotation. It should be possible to make a more specific annotation to one of the children of this term.
+Example: +GO:0006810 transport
+Note that this term should not be used for direct annotation. It should be possible to make a more specific annotation to one of the children of this term, for e.g. transmembrane transport, microtubule-based transport, vesicle-mediated transport, etc.
+This term should not be used for direct manual annotation. It should be possible to make a more specific manual annotation to one of the children of this term.
+Example: +GO:0000910 cytokinesis
+Note that this term should not be used for direct annotation. When annotating eukaryotic species, mitotic or meiotic cytokinesis should always be specified for manual annotation and for prokaryotic species use 'FtsZ-dependent cytokinesis; GO:0043093' or 'Cdv-dependent cytokinesis; GO:0061639'. Also, note that cytokinesis does not necessarily result in physical separation and detachment of the two daughter cells from each other.
Information about the term that does not belong in the definition or gloss but is useful for users or editors. This might include information that is adjacent to the class but pertinent to its usage, extended information about the class (e.g. extended notes about a characteristic of a cell type) that might be useful but does not belong in the definition, or important notes on why certain choices were made in the curation of the term (e.g. why certain logical axioms were excluded or included the way they are). (Note: depending on the ontology, some of these might belong in editors_notes, etc.)
Standard language is not given for these, as it varies depending on usage.
As a rule of thumb, for every single problem/term/use case, you will have 3-6 options to choose from, in some cases even more. The criteria for selecting a good ontology are very much dependent on your particular use case, but some concerns are generally relevant. A good first pass is to apply the "10 simple rules for selecting a Bio-ontology" by Malone et al, but I would further recommend asking yourself the following:
Aside from aspects of your analysis, there is one more thing you should consider carefully: the openness of the ontology in question. As a user, you have quite a bit of power over the future trajectory of the domain, and you should therefore seek to endorse and promote open standards as much as possible (for egotistic reasons as well: you don't want to suddenly have to pay for the ontologies that drive your semantic analyses). It is true that ontologies such as SNOMED have some great content and, even more compellingly, some really great coverage. In fact, I would probably compare SNOMED not with any particular disease ontology, but with the OBO Foundry as a whole, and if you do that, it is (a) cleaner and (b) better integrated. But this comes at a cost. SNOMED is a commercial product - millions are paid every year in license fees, and the more millions come in, the better SNOMED will become - and the more drastic the consequences of lock-in will be if one day you are forced to use SNOMED because OBO has fallen too far behind. Right now, the sum of all OBO ontologies is probably still richer and more valuable, given their use in many of the central biological databases (such as the ones hosted by the EBI) - but as SNOMED is now seeping into all aspects of genomics (for example, it will soon be featured on OLS!) it will become increasingly important to actively promote the use of open biomedical ontologies - by contributing to them as well as by using them.
+ + + + + + +Based on Intro to GitHub (GO-Centric) with credit to Nomi Harris and Chris Mungall
+Writing a good ticket (or issue) is crucial to good management of a repo. In this explainer, we will discuss some good practices in writing a ticket and show examples of what not to do.
+The OBOOK is trying to centralise all OBO documentation in one place. It is, and will be, a big construction site, for years to come. The goal is to iterate and make things better.
+We follow two philosophies:
+There are three main consequences to this:
+We just introduced a new concept to OBOOK called pathways
. The idea is that we provide a linear guide through the materials for each of the 6 roles mentioned on the getting started page. This will also help us complete the materials and provide a good path for reviewing them regularly.
A step-by-step guide to complete a well-defined mini-project. Examples: ROBOT template tutorial. DOSDP template tutorial. Protege tutorial on using the reasoner.
A collection of materials (tutorials, explanations and how-to guides) that together seek to teach a well-defined concept. Examples: Contributing to OBO ontologies; An Introduction to templates in OBO; An Introduction to OBO Application development. While the distinction from a "tutorial" is often fuzzy, the main distinguishing feature should be that a lesson conveys a general concept independent of some concrete technology stack. While we use concrete examples in lessons, we always seek to generalise to the problem space.
A convenience content type that allows us to assemble materials from OBOOK for a specific taught unit, such as the yearly ICBO tutorials, the ongoing Monarch Ontology Tutorials and others. Course pages serve as go-to pages for course participants and link to all the relevant materials in the documentation. Courses usually comprise lessons, tutorials and how-to guides.
A pathway is a kind of course, but without the expectation that it is ever taught in a concrete setting. A pathway pertains to a single concrete role (Ontology Curator, Pipeline Developer, etc.). It is a collection of materials (lessons, tutorials, how-to guides) ordered in a linear fashion for the convenience of the student. For example, we are developing a pathway for ontology pipeline developers that starts by teaching common concepts, such as how to make term requests, and then goes into depth on ROBOT pipelines, ODK and Make.
+Before you start with the lessons of this course, keep the following in mind:
+There are a wide variety of entry points into the OBO world, for example:
+make
and ROBOT
Of course, many of you will occupy more than one of the above "hats" or roles. While they all require specialised training, many shared skill requirements exist. This course is being developed to:
+See Daily Curator Workflow for creating branches and basic Protégé instructions.
+In the main Protégé window, click on the "Entities" tab. Below that, click the "Annotation properties" tab.
+Select the subset_property
annotation property.
Click on the "Add sub property" button.
+In the pop-up window, add the name of the new slim. The IRI will automatically populate according to settings in the user's "New entities" settings. Click OK.
+With the newly created annotation property selected, click on "Refactor > Rename entity..." in the menu.
+In the pop-up window, select the "Show full IRI" checkbox. The IRI will appear. +Edit the IRI to fit the following standard:
+http://purl.obolibrary.org/obo/{ontology_abbreviation}#{label_of_subset}
+For example, in CL, the original IRI will appear as:
+http://purl.obolibrary.org/obo/CL_1234567
+If the subset was labeled "kidney_slim", the IRI should be updated to:
+http://purl.obolibrary.org/obo/cl#kidney_slim
In the "Annotations" window, click the +
next to "Annotations".
In the pop-up window, select the rdfs:comment
annotation property. Under "Value" enter a brief descripton for the slim. Under "Datatype" select xsd:string
. Click OK.
See Daily Curator Workflow section for commit, push and merge instructions.
+See Daily Curator Workflow for creating branches and basic Protégé instructions.
+In the main Protégé window, click on the "Entities" tab. Select the class that is to be added to a subset (slim).
In the "Annotations" window, click the +
next to "Annotations".
In the pop-up window, select the in_subset
annotation property.
Click on the ‘Entity IRI’ tab.
+Search for the slim label under "Entity IRI". In the pop-up that appears, double-click on the desired slim. Ensure that a sub property of subset_property
is selected. Click OK.
See Daily Curator Workflow section for commit, push and merge instructions.
Before adding taxon restrictions, please review the types of taxon restrictions documentation.
See Daily Workflow for creating branches and basic Protégé instructions.
- in taxon relations are added as Subclasses: click the + next to 'SubClass Of' and enter the class expression (e.g. 'in taxon' some Viridiplantae).
- never in taxon or present in taxon relations are added as Annotations: click the + next to 'Annotations' and use never_in_taxon or present_in_taxon as appropriate.
See Daily Workflow section for commit, push and merge instructions.
+ + + + + + +Warning: You should only use this method if the files you are editing are reasonably small (less than 1 MB).
+This method only works if the file you want to edit has already been editing as part of the pull request.
+...
, and then "Edit file".If this option is greyed out, it means that - you don't have edit rights on the repository - the edit was made from a different fork, and the person that created the pull request did not activate the "Allow maintainers to make edits" option when submitting the PR - the pull request has already been merged
+In GitHub Desktop, click the branch switcher button and paste in branch name (or you can type it in). +
+Now you are on the branch, you can open the files to be edited and make your intended changes and push via the usual workflow.
+If a user forked the repository and created a branch, you can find that branch by going to the branch switcher button in GitHub Desktop, click on Pull Requests (next to Branches) and looking for that pull request +
+Select that pull request and edit the appropriate files as needed and push via the usual workflow.
+Prerequisite: Install Github Desktop +Github Desktop can be downloaded here
+For the purpose of going through this how-to guide, we will use Mondo as an example. However, all obo onotlogies can be cloned in a similar way.
+mondo
can be replaced with any ontology that is setup using the ODK as their architecture should be the same.If this all works okay, you are all set to start editing!
+ + + + + + +To create a new term, the 'Asserted view' must be active (not the 'Inferred view').
+In the Class hierarchy window, click on the 'Add subclass' button at the upper left of the window.
++
next to Annotations 2. Add Definition References
+ 1. Click on the circle with the ‘@’ in it next to definition and in the resulting pop-up click on the ```+``` to add a new ref, making sure they are properly formatted with a database abbreviation followed by a colon, followed by the text string or ID. Examples: ```PMID:27450630```.
+ 2. Click OK.
+ 3. Add each definition reference separately by clicking on the ```+``` sign.
+
+3. Add synonyms and dbxrefs following the same procedure if they are required for the term.
++
sign in the appropriate section (usually SubClass Of) and typing it in, using Tab
to autocomplete terms.Converting to Equivalent To axioms:
+If you want to convert your SubClassOf axioms to EquivalentTo axioms, you can select the appropriate rows and right click, selecting "Convert selected rows to defined class"
+
In some cases, logical axioms reuiqre external ontologies (eg in the above example, the newly added CL term has_soma_location in the cerebellar cortex which is an uberon term), it might be necessary to import the term in. For instructions on how to do this, please see the import managment section of your local ontology documentation (an example of that in CL can be found here: https://obophenotype.github.io/cell-ontology/odk-workflows/UpdateImports/)
+When you have finished adding the term, run the reasoner to ensure that nothing is problematic with the axioms you have added (if there is an issue, you will see it being asserted under owl:Nothing)
+Save the file on protege and review the changes you have made in your Github Desktop (or use git diff
in your terminal if you do not use Github Desktop)
See Daily Workflow section for commit, push and merge instructions.
+Editors:
+Summary:
+This is a guide to build an OBO ontology from scratch. We will focus on the kind of thought processes you want to go through, and providing the following:
+Before reading on, there are three simple rules for when NOT to build an ontology everyone interested in ontologies should master, like a mantra:
+Do not build a new ontology if:
+Scope is one of the hardest and most debated subjects in the OBO Foundry operation calls. There are essentially two aspects to scope:
+phenotype
, disease
, anatomical entity
, assay
, environmental exposure
, biological process
, chemical entity
. Before setting out to build an ontology, you should get a rough sense of what kind of entities you need to describe your domain. However, this is an iterative process and more entities will be revealed later on.Alzheimer's Disease
, which will need many different kinds of biological entities (like anatomical entity
and disease
classes).As a rule of thumb, you should NOT create a term if another OBO ontology has a branch of for entities of the same kind
. For example, if you have to add terms for assays, you should work with the Ontology for Biomedical Investigations to add these to their assay branch.
Remember, the vision of OBO is to build a semantically coherent ontology for all of biology, and the individual ontologies in the OBO Foundry should be considered "modules" of this super ontology. You will find that while collaboration is always hard the only way for our community to be sustainable and compete with commercial solutions is to take that hard route and work together.
+There are many kinds of semantic artefacts that can work for your use case:
+Think of it in terms of cost. Building a simple vocabulary with minimal axiomatisation is 10x cheaper than building a full fledged domain model in OWL, and helps solving your use case just the same. Do not start building an ontology unless you have some understanding of these alternatives first.
+Do not build an ontology because someone tells you to or because you "think it might be useful". Write out a proper use case description, for example in the form of an agile user story, convince two or three colleagues this is worthwhile and only then get to work. Many ontologies are created for very vague use cases, and not only do they cost you time to build, they also cost the rest of the community time - time it takes them to figure out that they do not want to use your ontology. Often, someone you trust tells you to build one and you believe they know what they are doing - do not do that. Question the use of building the ontology until you are convinced it is the right thing to do. If you do not care about reasoning (either for validation or for your application), do not build an ontology.
+ +Depending on your specific starting points, the way you start will be slightly different, but some common principles apply.
+workflow
system, i.e. some way to run commands like release
or test
, as you will run these repeatedly. A typical system to achieve this is make, and many projects choose to encode their workflows as make
targets (ODK, OBI Makfile).Note: Later in the process, you also want to think about the following:
+There are many different starting points for building an ontology:
+There are two fundamentally different kinds of ontologies which need to be distinguished:
+Some things to consider:
+It is imperative that it is clear which of the two you are building. Project ontologies sold as domain ontologies are a very common practice and they cause a lot of harm for open biomedical data integration.
+ +We will re-iterate some of the steps taken to develop the Vertebrate Breed Ontology. At the time of this writing, the VBO is still in early stages, but it nicely illustrates all the points above.
+See here. Initial interactions with the OMIA team further determined more long term goals such as phenotypic similarity and reasoning.
+Similar ontologies. While there is no ontology OBO ontology related to breeds, the Livestock Breed Ontology (LBO) served as an inspiration (much different scale). NCBI taxonomy is a more general ontology about existing taxa as they occur in the wild.
+Our starting point was the raw OMIA data.
+species
represents the same concept as ‘species’ in NCBI, the ontology should be built ‘on top of’ NCBI terms to avoid confusion of concepts and to avoid conflation of terms with the same conceptWarnings based on our experience:
+For us this was using Google Sheets, ROBOT & ODK.
+At first, we chose to name the ontology "Unified Breed Ontology" (UBO). Which meant that for everything from ODK setup to creating identifiers for our terms, we used the UBO
prefix. Later in the process, we decided to change the name to "Vertebrate Breed Ontology". Migrating all the terms and the ODK setup from ubo
to vbo
required some expert knowledge on the workings of the ODK, and created an unnecessary cost. We should have finalised the choice of name first.
Thank you to Melanie Courtot, Sierra Moxon, John Graybeal, Chris Stoeckert, Lars Vogt and Nomi Harris for their helpful comments on this how-to.
+ + + + + + +Navigate to the ontology directory of go-ontology: cd repos/MY-ONTOLOGY/src/ontology
.
If the terminal window is not configured to display the branch name, type: git status
. You will see:
On branch [master] [or the name of the branch you are on] + Your branch is up-to-date with 'origin/master'.
+If you’re not in the master branch, type: git checkout master
.
From the master branch, type: git pull
. This will update your master branch, and all working branches, with the files that are most current on GitHub, bringing in and merging any changes that were made since you last pulled the repository using the command git pull
. You will see something like this:
~/repos/MY-ONTOLOGY(master) $ git pull
+remote: Counting objects: 26, done.
+remote: Compressing objects: 100% (26/26), done.
+remote: Total 26 (delta 12), reused 0 (delta 0), pack-reused 0
+Unpacking objects: 100% (26/26), done.
+From https://github.com/geneontology/go-ontology
+ 580c01d..7225e89 master -> origin/master
+ * [new branch] issue#13029 -> origin/issue#13029
+Updating 580c01d..7225e89
+Fast-forward
+ src/ontology/go-edit.obo | 39 ++++++++++++++++++++++++---------------
+ 1 file changed, 24 insertions(+), 15 deletions(-)
+~/repos/MY-ONTOLOGY(master) $
+
When starting to work on a ticket, you should create a new branch of the repository to edit the ontology file.
+Make sure you are on the master branch before creating a new branch. If the terminal window is not configured to display the branch name, type: git status
to check which is the active branch. If necessary, go to master by typing git checkout master
.
To create a new branch, type: git checkout -b issue-NNNNN
in the terminal window. For naming branches, we recommend using the string 'issue-' followed by the issue number. For instance, for this issue in the tracker: https://github.com/geneontology/go-ontology/issues/13390, you would create this branch: git checkout -b issue-13390
. Typing this command will automatically put you in the new branch. You will see this message in your terminal window:
~/repos/MY-ONTOLOGY/src/ontology(master) $ git checkout -b issue-13390
+Switched to a new branch 'issue-13390'
+~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $
+
If you are continuing to do work on an existing branch, in addition to updating master, go to your branch by typing git checkout [branch name]
. Note that you can view the existing local branches by typing git branch -l
.
OPTIONAL: To update the working branch with respect to the current version of the ontology, type git pull origin master
.
+ This step is optional because it is not necessary to work on the current version of the ontology; all changes will be synchronized when git merge is performed.
Before launching Protégé, make sure you are in the correct branch. To check the active branch, type git status
.
Click on the 'File' pulldown. Open the file: go-edit.obo. The first time, you will have to navigate to repos/MY-ONTOLOGY/src/ontology
. Once you have worked on the file, it will show up in the menu under 'Open'/'Recent'.
Click on the 'Classes' tab.
+Searching: Use the search box on the upper right to search for a term in the ontology. Wait for autocomplete to work in the pop-up window.
+Viewing a term: Double-click on the term. This will reveal the term in the 'Class hierarchy' window after a few seconds.
+Launching the reasoner: To see the term in the 'Class hierarchy' (inferred) window, you will need to run the 'ELK reasoner'. 'Reasoner' > select ELK 0.4.3, then click 'Start reasoner'. Close the various pop-up warnings about the ELK reasoner. You will now see the terms in the inferred hierarchy.
+After modification of the ontology, synchronize the reasoner. Go to menu: 'Reasoner' > ' Synchronize reasoner'.
+NOTE: The only changes that the reasoner will detect are those impacting the ontology structure: changes in equivalence axioms, subclasses, merges, obsoletions, new terms.
+TIP: When adding new relations/axioms, 'Synchronize' the reasoner. When deleting relations/axioms, it is more reliable to 'Stop' and 'Start' the reasoner again.
+Use File > Save to save your changes.
+Review: Changes made to the ontology can be viewed by typing git diff
in the terminal window. If there are changes that have already been committed, the changes in the active branch relative to master can be viewed by typing git diff master
.
Commit: Changes can be committed by typing: git commit -m ‘Meaningful message Fixes #ticketnumber’ go-edit.obo
.
For example:
+ git commit -m ‘hepatic stellate cell migration and contraction and regulation terms. Fixes #13390’ go-edit.obo
+
+This will save the changes to the go-edit.obo file. The terminal window will show something like:
+ ~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $ git commit -m 'Added hepatic stellate cell migration and contraction and regulation terms. Fixes #13390' go-edit.obo
+ [issue-13390 dec9df0] Added hepatic stellate cell migration and contraction and regulation terms. Fixes #13390
+ 1 file changed, 79 insertions(+)
+ ~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $
+
+Committer: Kimberly Van Auken vanauken@kimberlukensmbp.dhcp.lbnl.us + Your name and email address were configured automatically based on your username and hostname. Please check that they are accurate.
+Push: To incorporate the changes into the remote repository, type: git push origin mynewbranch
.
Example:
+ git push origin issue-13390
+
+Pull
+geneontology/go-ontology/code
. You will see your commit listed at the top of the page in a light yellow box. If you don’t see it, click on the 'Branches' link to reveal it in the list, and click on it.Merge If the Travis checks are succesful and if you are done working on that branch, merge the pull request. Confirming the merge will close the ticket if you have used the word 'fixes' in your commit comment. + NOTE: Merge the branches only when the work is completed. If there is related work to be done as a follow up to the original request, create a new GitHub ticket and start the process from the beginning.
+Delete your branch on the repository using the button on the right of the successful merge message.
+You may also delete the working branch on your local copy. Note that this step is optional. However, if you wish to delete branches on your local machine, in your terminal window:
+git checkout master
.git pull origin master
git branch -d workingbranchname
.
+ Example: git branch -d issue-13390
Dealing with very large ontologies, such as the Protein Ontology (PR), NCBI Taxonomy (NCBITaxon), Gene Ontology (GO) and the CHEBI Ontology is a big challenge when developing ontologies, especially if we want to import and re-use terms from them. There are two major problems:
+There are a few strategies we can employ to deal with the problem of memory consumption:
+To deal with file size, we:
+All four strategies will be discussed in the following. We will then look a bit
+The default recipe for creating a module looks something like this:
+imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt
+ if [ $(IMP) = true ]; then $(ROBOT) query -i $< --update ../sparql/preprocess-module.ru \
+ extract -T imports/$*_terms_combined.txt --force true --copy-ontology-annotations true --individuals exclude --method BOT \
+ query --update ../sparql/inject-subset-declaration.ru --update ../sparql/postprocess-module.ru \
+ annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/%_import.owl
+
(Note: This snippet was copied here on 10 February 2021 and may be out of date by the time you read this.)
+As you can see, a lot of stuff is going on here: first we run some preprocessing (which is really costly in ROBOT, as we need to load the ontology into Jena, and then back into the OWL API – so basically the ontology is loaded three times in total), then extract a module, then run more SPARQL queries etc, etc. Costly. For small ontologies, this is fine. All of these processes are important to mitigate some of the shortcomings of module extraction techniques, but even if they could be sorted in ROBOT, it may still not be enough.
+So what we can do now is this. In your ont.Makefile
(for example, go.Makefile
, NOT Makefile
), located in src/ontology
, you can add a snippet like this:
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
+ if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
+ annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/pr_import.owl
+
Note that all the %
variables and uses of $*
are replaced by the ontology ID in question. Adding this to your ont.Makefile
will overwrite the default ODK behaviour in favour of this new recipe.
The ODK supports this reduced module out of the box. To activate it, do this:
+import_group:
+ products:
+ - id: pr
+ use_gzipped: TRUE
+ is_large: TRUE
+
This will (a) ensure that PR is pulled from a gzipped location (you have to check whether it exists though. It must correspond to the PURL, followed by the extension .gz
, for example http://purl.obolibrary.org/obo/pr.owl.gz
) and (b) that it is considered large, so the default handling of large imports is activated for pr
, and you don't need to paste anything into ont.Makefile
.
If you prefer to do it yourself, in the following sections you can find a few snippets that work for three large ontologies. Just copy and paste them into ont.Makefile
, and adjust them however you wish.
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
+ if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
+ annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/pr_import.owl
+
imports/ncbitaxon_import.owl: mirror/ncbitaxon.owl imports/ncbitaxon_terms_combined.txt
+ if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/ncbitaxon_terms_combined.txt --force true --method BOT \
+ annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/ncbitaxon_import.owl
+
imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt
+ if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \
+ annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/chebi_import.owl
+
Feel free to use an even cheaper approach, even one that does not use ROBOT, as long as it produces the target of the goal (e.g. imports/chebi_import.owl
).
For some ontologies, you can find slims that are much smaller than full ontology. For example, NCBITaxon maintains a slim for OBO here: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.owl, which smaller than the 1 or 2 GB of the full version. Many ontologies maintain such slims, and if not, probably should. (I would really like to see an OBO slim for Protein Ontology!)
+(note the .obo file is even smaller but currently robot has issues getting obo files from the web)
+You can also add your favourite taxa to the NCBITaxon slim by simply making a pull request on here: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt
+You can use those slims simply like this:
+import_group:
+ products:
+ - id: ncbitaxon
+ mirror_from: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo
+
This is a real hack – and we want to strongly discourage it – but sometimes, importing an ontology just to import a single term is total overkill. What we do in these cases is to maintain a simple template to "import" minimal information. I can't stress enough that we want to avoid this, as such information will necessarily go out of date, but here is a pattern you can use to handle it in a sensible way:
+Add this to your src/ontology/ont-odk.yaml
:
import_group:
+ products:
+ - id: my_ncbitaxon
+
Then add this to src/ontology/ont.Makefile
:
mirror/my_ncbitaxon.owl:
+ echo "No mirror for $@"
+
+imports/my_ncbitaxon_import.owl: imports/my_ncbitaxon_import.tsv
+ if [ $(IMP) = true ]; then $(ROBOT) template --template $< \
+ --ontology-iri "$(ONTBASE)/$@" --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/my_ncbitaxon_import.owl
+
Now you can manage your import manually in the template, and the ODK will not include your manually-curated import in your base release. But again, avoid this pattern for anything except the most trivial case (e.g. you need one term from a huge ontology).
+Remember that ontologies are text files. While this makes them easy to read in your browser, it also makes them huge: from 500 MB (CHEBI) to 2 GB (NCBITaxon), which is an enormous amount.
+Thankfully, ROBOT can automatically read gzipped ontologies without the need of unpacking. To avoid long runtimes and network timeouts, we can do the following two things (with the new ODK 1.2.26):
+import_group:
+ products:
+ - id: pr
+ use_gzipped: TRUE
+
This will try to append .gz
to the default download location (http://purl.obolibrary.org/obo/pr.owl → http://purl.obolibrary.org/obo/pr.owl.gz). Note that you must make sure that this file actually exists. It does for CHEBI and the Protein Ontology, but not for many others.
If the file exists, but is located elsewhere, you can do this:
+import_group:
+ products:
+ - id: pr
+ mirror_from: http://purl.obolibrary.org/obo/pr.owl.gz
+
You can put any URL in mirror_from
(including non-OBO ones!)
We developed a completely automated variant of the Custom OBO Dashboard Workflow, which does not require any local installation.
+ +dashboard-config.yml
file, in particular the ontologies
section:mirror_from
field.profile
section to overwrite the custom robot report profile and add custom checks!yaml
+ profile:
+ baseprofile: "https://raw.githubusercontent.com/ontodev/robot/master/robot-core/src/main/resources/report_profile.txt"
+ custom:
+ - "WARN\tfile:./sparql/missing_xrefs.sparql"
Click on Settings
> Pages
to configure the GitHub pages
. Set the Source
to deploy from branch, and Branch
to build from main
(or master
if you are still using the old default) and /(root)
as directory. Hit Save
.
Click on the Actions
tab in your repo. On the left, select the Run dashboard
workflow and click on the Run workflow
button. This action will rebuild the dashboard and make a pull request with the changes.
Visit site
and you should find your new shiny dashboard page!Failed: make dashboard ROBOT_JAR=/tools/robot.jar ROBOT=robot -B with return code 2
There is a known bug at the moment requiring at least one ontology with a warning, error, info and pass, see https://github.com/OBOFoundry/OBO-Dashboard/issues/85.
+dashboard-config.yml
, add a temporary ontology we created to make this work. This is already in the Dashboard template repository. ontologies:
+ custom:
+ - id: tmp
+ mirror_from: "https://raw.githubusercontent.com/monarch-ebi-dev/robot_tests/master/custom-dashboard.owl"
+
remote: Permission to <name of the user or organization>/<name of the repository>.git denied to github-actions[bot].
You need to update the workflow permission for the repository.
+Settings
, then Actions
on the left menu, then General
.Error: GitHub Actions is not permitted to create or approve pull requests.
You need to enable GitHub Actions to create pull requests.
+Settings
, then Actions
on the left menu, then General
.Contributed by @XinsongDu
, edited by @matentzn
.gitignore
from the obo-nor.github.io
repo is also copied to your new repo (it is frequently skipped or hidden from the user in Finder
or when using the cp
command) and push to everything to GitHub.docker pull obolibrary/odkfull
+
dashboard-config.yml
file, in particular the ontologies
section:mirror_from
field.#
before pip install networkx==2.6.2
to ensure the correct network x version is installed.sh run-dash.sh
(make sure dashboard folder is empty before running, e.g. rm -rf dashboard/*
).Before you start:
+Using Protégé you can add annotations such as labels, definitions, synonyms, database cross references (dbxrefs) to any OWL entity. The panel on the right, named Annotations, is where these annotations are added. OBO Foundry ontologies includes a pre-declared set of annotation properties. The most commonly used annotations are below.
+Note: OBO ontologies allow only one rdfs:label, definition, and comment.
+Note, most of these are bold in the annotation property list:
+ +Use this panel to add a definition to the class you created. Select the + button to add an annotation to the selected entity. Click on the annotation 'definition' on the left and copy and paste in the definition to the white editing box on the right. Click OK.
+Example (based on MONDO):
+Definition: A disorder characterized by episodes of swelling under the skin (angioedema) and an elevated number of the white blood cells known as eosinophils (eosinophilia). During these episodes, symptoms of hives (urticaria), fever, swelling, weight gain and eosinophilia may occur. Symptoms usually appear every 3-4 weeks and resolve on their own within several days. Other cells may be elevated during the episodes, such as neutrophils and lymphocytes. Although the syndrome is often considered a subtype of the idiopathic hypereosinophilic syndromes, it does not typically have organ involvement or lead to other health concerns.
+ + +Definitions in OBO ontologies should have a 'database cross reference' (dbxref), which is a reference to the definition source, such as a paper from the primary literature or another database. For references to papers, we cross reference the PubMed Identifier in the format, PMID:XXXXXXXX. (Note, no space)
+To add a dbxref to the definition:
+To add a synonym:
+database_cross_reference
on the left panel and add your reference to the Literal tab on the right hand sideWe have seen how to add sub/superclasses and annotate the class hierarchy. Another way to do the same thing is via the Class description view. When an OWL class is selected in the entities view, the right-hand side of the tab shows the class description panel. If we select the 'vertebral column disease' class, we see in the class description view that this class is a "SubClass Of" (= has a SuperClass) the 'musculoskeletal system disease' class. Using the (+) button beside "SubClass Of" we could add another superclass to the 'skeletal system disease' class.
+Note the Anonymous Ancestors. These are superclasses that are inherited from the parents. If you hover over the Subclass Of (Anonymous Ancestor) you can see the parent that the class inherited the superclass from.
+ +When you press the '+' button to add a SubClass of
axiom, you will notice a few ways you can add a term. The easiest of this is to use the Class expression editor. This allows you to type in the expression utilizing autocomplete. As you start typing, you can press the 'TAB' or '->|' button on your keyboard, and protege will suggest terms. You will also note that the term you enter is not in the ontology, protege will not allow you add it, with the box being highlighted red, and the term underlined red.
This guide explains how to embed a YouTube video into a page in this OBO Academy material. Example, see the videos on the Contributing to OBO Ontologies page.
+The content should look something like this: <iframe width="560" height="315" src="https://www.youtube.com/embed/_z8-KGDzZ6U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
The embedded video should look like this:
+ + + + + + + +Let's say you want to remove some lines from a large text file programmatically. For example, you want to remove every line that contains certain IDs, but you want to keep the rest of the lines intact.
+You can use the command line utility grep
with option -v
to find all the lines in the file that do NOT contain your search term(s). You can make a file with a list of several search terms and use that file with grep
using the -f
option as follows:
grep -v -f your_list.txt target_file.tsv | tee out_file.tsv
+
csv
, tsv
, obo
etc. For example, you wish to filter a file with these lines:keep this 1 + this line is undesired 2, so you do not wish to keep it + keep this 3 + keep this 4 + keep this 5 + keep this 6 + something undesired 2 + this line is undesired 1 + keep this 7
+your_list.txt
is a text file with your list of search terms. Format: one search term per line. For example:undesired 1 + undesired 2
+The utility tee
will redirect the standard output to both the terminal and write it out to a file.
You expect the out_file.tsv
to contain lines:
keep this 1 + keep this 3 + keep this 4 + keep this 5 + keep this 6 + keep this 7
+You can also do a one-step filter-update when you are confident that your filtering works as expected, or if you have a backup copy of your target_file.tsv
.
+Use cat
and pipe the contents of your text file as the input for grep
. Redirect the results to both your terminal and overwrite your original file so it will contain only the filtered lines.
cat target_file.tsv | grep -v -f your_list.txt | tee target_file.tsv
+
This video illustrates an example of fixing a merge conflict in the Mondo Disease Ontology.
+Instructions:
+If a merge conflict error appears in your Github.com pull request after committing a change, open GitHub Desktop and select the corresponding repository from the "Current Repository" button. If the conflict emerged after editing the ontology outside of Protégé 5.5.0, see Ad hoc Reserialisation below.
+With the repository selected, click the "Fetch origin" button to fetch the most up-to-date version of the repository.
+Click the "Current Branch" button and select the branch with the merge conflict.
+From the menu bar, select Branch > "Update from master".
+A message indicating the file with a conflict should appear along with the option to open the file (owl or obo file) in a text/code editor, such as Sublime Text. Click the button to open the file.
+Search the file for conflict markings ( <<<<<<< ======= >>>>>>> ).
+Make edits to resolve the conflict, e.g., arrange terms in the correct order.
+Remove the conflict markings.
+Save the file.
+Open the file in Protégé. If prompted, do not reload any previously opened file. Open as a new file.
+Check that the terms involved in the conflict appear OK, i.e., have no obvious errors.
+Save the file in Protégé using File > 'Save as...' from the menu bar and replace the ontology edit file, e.g., mondo-edit.obo
+Return to GitHub Desktop and confirm the conflicts are now resolved. Click the "Continue Merge" button and then the "Push origin" button.
+Return to Github.com and allow the QC queries to rerun.
+The conflict should be resolved and the branch allowed to be merged.
+Ad hoc Reserialisation
+If the owl or obo file involved in the merge conflict was edited using Protégé 5.5.0, the above instructions should be sufficient. If edited in any other way, such as fixing a conflict in a text editor, the serialisation order may need to be fixed. This can be done as follows:
+Reserialise the master file using the Ontology Development Kit (ODK). This requires setting up Docker and ODK. If not already set up, follow the instructions here.
+Open Docker.
+At the line command (PC) or Terminal (Mac), use the cd (change directory) command to navigate to the repository's src/ontology/ directory. + For example,
+cd PATH_TO_ONTOLOGY/src/ontology/
Replace "PATH_TO_ONTOLOGY" with the actual file path to the ontology. If you need to orient yourself, use the pwd
(present working directory) or ls
(list) line commands.
sh run.sh make normalize_src
If you are resolving a conflict in an .obo file, run:
+sh run.sh make normalize_obo_src
In some ontologies (such as the Cell ontology (CL)), edits may result in creating a large amount of unintended differences involving ^^xsd:string. If you see these differences after running the command above, they can be resolved by following the instructions here.
+Continue by going to step 1 under the main Instructions above.
+The command line tool Robot has a diff tool that compares two ontology files and can print the differences between them in multiple formats, among them markdown.
+We can use this tool and GitHub actions to automatically post a comment when a Pull Request to master is created, with the differences between the two ontologies.
+To create a new GitHub action, create a folder in your ontology project root folder called .github
. Then create a yaml file in a subfolder called workflows
, e.g. .github/workflows/diff.yml
. This file contains code that will be executed in GitHub when certain conditions are meant, in this case, when a PR to master is submitted. The comments in this file from FYPO will help you write an action for your own repository.
The comment will look something like this.
+ + + + + + +Note: Creating a fork allows you to create your copy GitHub repository. This example provides instructions on forking the Mondo GitHub reposiitory. You can't break any of the Mondo files by editing your forked copy.
+Clone your forked repo:
+If you have GitHub Desktop installed - click Code -> Open with GitHub Desktop
+How are you planning to use this fork? To contribute to parent project
+In GitHub Desktop, create a new branch:
+Click Current Branch - > New Branch
+Give your branch a name, like c-path-training-1
+You will make changes to the Mondo on the branch of your local copy.
+A Git repo consists of a set of branches each with a complete history of all changes ever made to the files and directories. This is true for a local copy you check out to your computer from GitHub or for a copy (fork) you make on GitHub.
+ +A Git repo typically has a master or main branch that is not directly edited. Changes are made by creating a branch from Master (complete copy of the Master + its history) (either a direct branch or via a fork).
+You can copy (fork) any GitHub repo to some other location on GitHub without having to ask permission from the owners. If you modify some files in that repo, e.g. to fix a bug in some code, or a typo in a document, you can then suggest to the owners (via a Pull Request) that they adopt (merge) you your changes back into their repo. See the Appendix for instructions on how to make a fork.
+If you have permission from the owners, you can instead make a new branch.
+A Pull Request (PR) is an event in Git where a contributor (you!) asks a maintainer of a Git repository to review changes (e.g. edits to an ontology file) they want to merge into a project (e.g. the owl file) (see reference). Create a pull request to propose and collaborate on changes to a repository. These changes are proposed in a branch, which ensures that the default branch only contains finished and approved work. See more details here.
+See these instructions on cloning an ontology repo and creating a branch using GitHub Dekstop.
+Review: Once changes are made to the ontology file, they can be viewed in GitHub Desktop.
+Before committing, check the diff. An example diff from the Cell Ontology (CL) is pasted below. Large diffs are a sign that something went wrong. In this case, do not commit the changes and consider asking the ontology editor team for help instead.
+Example 1 (Cell Ontology):
+ +Example 2 (Mondo):
+ +Commit message: Before Committing, you must add a commit message. In GitHub Desktop in the Commit field in the lower left, there is a subject line and a description.
+Give a very descriptive title: Add a descriptive title in the subject line. For example: add new class ONTOLOGY:ID [term name] (e.g. add new class MONDO:0000006 heart disease)
+Write a great summary of what the change is in the Description box, referring to the issue. The sentence should clearly state how the issue is addressed.
+To link the issue, you can use the word 'fixes' or 'closes' in the description of the commit message, followed by the corresponding ticket number (in the format #1234) - these are magic words in GitHub; when used in combination with the ticket number, it will automatically close the ticket. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
+Note: 'Fixes' and "Closes' are case-insensitive.
+If you don't want to close the ticket, just refer to the ticket # without the word 'Fixes' or use 'Addresses'. The commit will be associated with the correct ticket but the ticket will remain open. 7.NOTE: It is also possible to type a longer message than allowed when using the '-m' argument; to do this, skip the -m, and a vi window (on mac) will open in which an unlimited description may be typed.
+Click Commit to [branch]. This will save the changes to the ontology edit file.
+Push: To incorporate the changes into the remote repository, click Publish branch.
+Click: Create Pull Request in GitHub Desktop
+This will automatically open GitHub Desktop
+Click the green button 'Create pull request'
+You may now add comments to your pull request.
+The CL editors team will review your PR and either ask for changes or merge it.
+The changes will be available in the next release.
+Curators and projects are assigned specific ID ranges within the prefix for your ontology. See the README-editors.md for your ontology
+An example: go-idranges.owl
+NOTE: You should only use IDs within your range.
+If you have only just set up this repository, modify the idranges file and add yourself or other editors.
+Once you have your assigned ID range, you need to configure Protege so that your ID range is recorded in the Preferences menu. Protege does not read the idranges file.
+In the Protege menu, select Preferences.
+In the resulting pop-up window, click on the New Entities tab and set the values as follows.
+In the Entity IRI box:
+Start with: Specified IRI: http://purl.obolibrary.org/obo
+Followed by: /
End with: Auto-generated ID
Same as label renderer: IRI: http://www.w3.org/2000/01/rdf-schema#label
+In the Auto-generated ID section:
+Numeric
+Prefix GO_
Suffix: leave this blank
+Digit Count 7
Start: see go-idranges.owl. Only paste the number after the GO:
prefix. Also, note that when you paste in your GO ID range, the number will automatically be converted to a standard number, e.g. pasting 0110001 will be converted to 110,001.)
End: see go-idranges.owl
+Remember last ID between Protege sessions: ALWAYS CHECK THIS
+(Note: You want the ID to be remembered to prevent clashes when working in parallel on branches.)
+ + + + + + +.asc
extension to verify the integrity of the downloaded Protégé version..zip
or .tar.gz
file with tools appropriate for your operating system.Follow the steps as needed by your operating system to install the Protégé application.
+For example, on macOS: drag and drop Protégé.app
to the Applications
folder and replace any older versions of the software.
+You may need to right click Protégé.app
and then choose Open
from the menu to authorise the programme to run on your machine.
+Alternatively, go to Preferences -> Security -> General
.
+You need to open the little lock, then click Mac stopped an application from Running (Protégé)
-> Open anyways
.
Adjust memory settings if necessary. +Memory settings can now be adjusted in a jvm.conf configuration file that can be located either in the .protege/conf directory under your home directory, or in the conf directory within the application bundle itself. +For example, to set the maximum amount of memory available for Protégé to, say, 12GB, put the following in the jvm.conf file: +
max_heap_size=12G
+
/Applications/Protégé.app/Contents/conf/jvm.conf
+
Edit this part:
+# Uncomment the line below to set the maximal heap size to 8G
+#max_heap_size=8G
+
Click here to get the latest Protege Plugin latest build (this is available on the bottom of ELK pages. This will download a zipped file.)
+When downloaded, unzip and copy puli and elk jars (two .jar files) in the unpacked directory.
+Remove old org.semanticweb.elk.jar
+Install ELK plugin on Mac:
+This can be done via one of two ways:
+Approach 1
+open ~/.Protege, then click on plugins
Approach 2
+~/.Protege
and a directory called plugins
does not exist in this folder, you can create it.Important: it seems Elk 0.5. Does not work with all versions of Protege, in particular, 5.2 and below. These instructions were only tested with Protege 5.5.
+NOTE This documentation is incomplete, for now you may be better consulting the GO Editor Docs
+For instructions on obsoleting terms (without merging/replacing with a new term, see obsoletion how to guide.)
+See Daily Workflow for creating branches and basic Protégé instructions.
+Note Before performing a merge, make sure that you know all of the consequences that the merge will cause. In particular, be sure to look at child terms and any other terms that refer to the ‘obsoleted’ term. In many cases a simple merge of two terms is not sufficient because it will result in equivalent classes for child terms. For example if obsoleted term X is going to be merged into target term Y and ‘regulation of X’ and ‘regulation of Y’ terms exist, then you will need to merge the regulation terms in addition to the primary terms. You will also need to edit any terms that refer to the obsoleted term to be sure that the names and definitions are consistent.
+Duplicate class
then OK in the pop up window. This should create a class with the exact same name.Change IRI (Rename)
_
in the identifier instead of the colon :
, for example: GO_1234567
. Make sure that the 'change all entities with this URI' box is checked.o
to change the label of the obsoleted term.has_broad_synonym
has_exact_synonym
has_narrow_synonym
has_related_synonym
(if unsure, this is the safest choice)x
on the right.x
on the right.x
on the right.rdfs:comment
that states that term was duplicated and to refer to the new new.term replaced by
annotations as per the instructions and add the winning merged term.≡
in the class hierarchy view on the left hand panel.See Daily Workflow section for commit, push and merge instructions.
+To use owltools will need to have Docker installed and running (see instructions here).
+This is the workflow that is used in Mondo.
+owltools --use-catalog mondo-edit.obo --obsolete-replace [CURIE 1] [CURIE 2] -o -f obo mondo-edit.obo
CURIE 1 = term to be obsoleted
+CURIE 2 = replacement term (ie term to be merged with)
For example: +If to merge MONDO:0023052 ectrodactyly polydactyly with MONDO:0009156 ectrodactyly-polydactyly syndrome, the command is:
+owltools --use-catalog mondo-edit.obo --obsolete-replace MONDO:0023052 MONDO:0009156 -o -f obo mondo-edit.obo
TROUBLESHOOTING: Travis/Jenkins errors
+:: ERROR: ID-mentioned-twice:: GO:0030722
+ :: ERROR: ID-mentioned-twice:: GO:0048126
+ GO:0030722 :: ERROR: has-definition: missing definition for id
The cause of this error is that Term A (GO:0048126) was obsoleted and had replace by Term B (GO:0030722). The GO editor tried to merge Term B into a third term term C (GO:0007312). The Jenkins checkk failed because 'Term A replaced by' was an alternative_id rather than by a main_id. +Solution: In the ontology, go to the obsolete term A and replace the Term B by term C to have a primary ID as the replace_by.
+ + + + + + +See Daily Workflow for creating branches and basic Protégé instructions.
+Warning: Every ontology has their procedures on how they obsolete terms (eg notice periods, notification emails, to_be_obsolete tags, etc.), this how-to guide only serves as a guide on how obsolete a term directly on protege.
+For instructions on how to merge terms (i.e., replace a term with another term in the ontology), see instructions here.
+Check if the term (or any of its children) is being used for annotation:
+Go to your ontology browser of choice, search for the term, either by label or ID
+Notify affected groups (usually by adding an issue in their tracker)
+Check if the term is used elsewhere in the ontology
+Warning: some ontologies give advance notice on terms that will be obsoleted through the annotation 'scheduled for obsoletion on or after' instead of directly obsoleting the term. Please check with the conventions of your ontology before obsoleting a term.
+Examples of additional annotations to add:
+IAO:0000233 term tracker item (type xsd:anyURI) - link to GitHub issue
+has_obsolence_reason
+add ‘OBSOLETE.’ to the term definition: In the 'Description' window, click on the o
on the right-hand side of the definition entry. In the resulting window, in the Literal tab, at the beginning of the definition, type: OBSOLETE.
if the obsoleted term was not replaced by another term in the ontology, but there are existing terms that might be appropriate for annotation, add those term IDs in the 'consider' tag: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select consider
and enter the ID of the replacement term.
++NOTE: Here you have to add the ID of the entity as an
+xsd:string
, e.g. GO:0005819, not the term label.
Add a statement about why the term was made obsolete: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select rdfs:comment
and select Type: Xsd:string
.
+ Consult the wiki documentation for suggestions on standard comments:
- [http://wiki.geneontology.org/index.php/Curator_Guide:_Obsoletion](http://wiki.geneontology.org/index.php/Curator_Guide:_Obsoletion)
+
+ - [http://wiki.geneontology.org/index.php/Obsoleting_GO_Terms](http://wiki.geneontology.org/index.php/Obsoleting_GO_Terms)
+
+ - [http://wiki.geneontology.org/index.php/Editor_Guide](http://wiki.geneontology.org/index.php/Editor_Guide)
+
+If the obsoleted term was replaced by another term in the ontology: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select term replaced by
and enter the ID of the replacement term.
If the obsoleted term was not replaced by another term in the ontology, but there are existing terms that might be appropriate for annotation, add those term IDs in the 'consider' tag: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select consider
and enter the ID of the replacement term.
++NOTE: Here you have to add the ID of the entity as an
+xsd:string
, e.g. GO:0005819, not the term label.
Add any additional annotations needed - this is specific to ontologies and you should consult the conventions of the ontology you are working on.
+Examples of additional annotations to add:
+See Daily Workflow section for commit, push and merge instructions.
+ + + + + + +The Open Researcher and Contributor Identifier (ORCID) is a global, unambiguous way to identify a researcher. +ORCID URIs (e.g., https://orcid.org/0000-0003-4423-4370) can therefore be used to unambigously and actionably attribute various aspects of ontology terms in combination with DC Terms or IAO predicates. However, URIs themselves are opaque and it is difficult to disambiguate to which person an ORCID corresponds when browsing an ontology (e.g., in Protégé).
+ORCIDIO is an ontology that declares ORCID URIs as named individuals and attaches basic metadata (e.g., name, description) to each of them, so that tools like Protégé can display a human-readable label rather than the raw URI, as in the following example.
+ +In this guide, we discuss how to add ORCIDIO to your ODK setup.
+In your ODK configuration (e.g. src/ontology/myont-odk.yaml
), add the following to the import_group
:
import_group:
+ annotation_properties:
+ - rdfs:label
+ - dc:description
+ - dc:source
+ - IAO:0000115
+ products:
+ - id: orcidio
+ mirror_from: https://w3id.org/orcidio/orcidio.owl
+ module_type: filter
+ base_iris:
+ - https://orcid.org/
+
The list of annotation properties, in particular dc:source
, is important for the filter
module to work (ORCIDIO relies heavily on axiom annotations for provenance).
+Add the following statement to your XML catalog (src/ontology/catalog-v001.xml) so that the orcidio import IRI resolves to the locally generated module:
<uri name="http://purl.obolibrary.org/obo/ro/imports/orcidio_import.owl" uri="imports/orcidio_import.owl"/>
+
+Next, add an imports declaration for the orcidio module to your edit file (src/ontology/myont-edit.owl):
Import(<http://purl.obolibrary.org/obo/ro/imports/orcidio_import.owl>)
+
+Note: This imports declaration is not necessary when your ontology uses the base merging strategy (see the section on the base module approach later in this document).
Add a new SPARQL query: src/sparql/orcids.sparql
. This is used to query for all ORCIDs used in your ontology.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+prefix owl: <http://www.w3.org/2002/07/owl#>
+SELECT DISTINCT ?orcid
+WHERE {
+ VALUES ?property {
+ <http://purl.org/dc/elements/1.1/creator>
+ <http://purl.org/dc/elements/1.1/contributor>
+ <http://purl.org/dc/terms/creator>
+ <http://purl.org/dc/terms/contributor>
+ }
+ ?term ?property ?orcid .
+ FILTER(isIRI(?term))
+}
+
Next, overwrite your ORCID seed generation to use this query by adding the following to your src/ontology/myont.Makefile (not the top-level Makefile!):
$(IMPORTDIR)/orcidio_terms_combined.txt: $(SRCMERGED)
+ $(ROBOT) query -f csv -i $< --query ../sparql/orcids.sparql $@.tmp &&\
+ cat $@.tmp | sort | uniq > $@
+
For your specific use-case, it may be necessary to tweak this SPARQL query, for example if your ORCIDs are used on axiom annotation level rather than entity annotation level.
+Now run the following to apply your ODK changes:
+sh run.sh make update_repo
+
This will update a number of files in your project, such as the autogenerated Makefile
.
Lastly, update your ORCIDIO import to apply the changes:
+sh run.sh make refresh-orcidio
+
Commit all the changes to a branch, wait for continuous integration to finish, and enjoy your new ORCIDIO import module.
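If you are unsure about the git part of this step, a typical sequence looks roughly like the following; the branch name and commit message are only examples:
git checkout -b add-orcidio-import
git add src/ontology/
git commit -m "Add ORCIDIO import module"
git push -u origin add-orcidio-import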
+These are instructions on how to create an ontology repository in GitHub. This will only need to be done once per project. You may need assistance from someone with basic Unix knowledge in following the instructions here.
+We will walk you through the steps of making a new ontology project.
+docker ps
in your terminal or command line (CMD). If all is ok, you should be seeing something like:CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+
.gitconfig
file in your user directory!docker pull obolibrary/odkfull
NOTE The very first time you run this it may be slow, while docker downloads necessary images. Don't worry, subsequent runs should be much faster!
+NOTE Windows users, occasionally it has been reported that files downloaded on a Windows machine get a wrong file ending, for example seed-via-docker.bat.txt
instead of seed-via-docker.bat
, or, as we will see later, project.yaml.txt
instead of project.yaml
. If you have problems, double check your files are named correctly after the download!
You can either pass in a configuration file in YAML format that specifies your ontology project setup, or you can pass arguments on the command line. You can use dir
in your command line on PC to ensure that your wrapper script, .gitconfig, and project.yaml (if you so choose) are all in the correct directory before running the wrapper script.
Passing arguments on the command line:
+./seed-via-docker.sh -d po -d ro -d pato -u cmungall -t "Triffid Behavior ontology" triffo
+
+Using a the predefined project.yaml file:
+./seed-via-docker.sh -C examples/triffo/project.yaml
+
+Passing arguments on the command line:
+seed-via-docker.bat -d po -d ro -d pato -u cmungall -t "Triffid Behavior ontology" triffo
+
+Using a the predefined project.yaml config file:
+seed-via-docker.bat -C project.yaml
+
+Instead of -u cmungall you should use your own username (e.g. -u nico), i.e. your GitHub or GitLab user name.
stands for clean
or "clean up previous attempts before running again" and -C
stands for "the next parameter is the relative path to my config file".command+s
on Mac or ctrl+s
on Windows to save it in the same directory as your seed-via-docker
script.
+ Then you can open the file with a text editor like Notepad++, Atom, Sublime or even nano, and adapt it to your project. Other more comprehensive examples can be found here.This will create your starter files in
+target/triffid-behavior-ontology
. It will also prepare an initial
+release and initialize a local repository (not yet pushed to your Git host site such as GitHub or GitLab).
There are three frequently encountered problems at this stage:
+.gitconfig
in user directory.gitconfig
in user directory¶The seed-via-docker script requires a .gitconfig
file in your user directory. If your .gitconfig
is in a different directory, you need to change the path in the downloaded seed-via-docker
script. For example on Windows (look at seed-via-docker.bat
):
docker run -v %userprofile%/.gitconfig:/root/.gitconfig -v %cd%:/work -w /work --rm -ti obolibrary/odkfull /tools/odk.py seed %*
+
%userprofile%/.gitconfig
should be changed to the correct path of your local .gitconfig
file.
We have had reports of users having trouble if their paths (say, D:\data) contain a space, like D:/Dropbox (Personal) or similar. In this case, we recommend finding a directory to work in whose path does not contain any spaces.
+You can customize at this stage, but we recommend first pushing the changes to your Git hosting site (see next steps).
+Windows users, occasionally it has been reported that files downloaded on a Windows machine get a wrong file ending,
+for example seed-via-docker.bat.txt
instead of seed-via-docker.bat
, or, as we will see later, project.yaml.txt
+instead of project.yaml
. If you have problems, double check your files are named correctly after the download!
The development kit will automatically initialize a git project, add all files and commit.
+You will need to create a project on you Git hosting site.
+For GitHub:
+-u
option. The name MUST be the one you set with -t
, just with lower case letters and dashes instead of spaces. In our example above, the name "Triffid Behavior Ontology" translates to triffid-behavior-ontology
.For GitLab:
+-u
option. The name MUST be the one you set with -t
.Follow the instructions there. E.g. (make sure the location of your remote is exactly correct!).
+cd target/triffo
+git remote add origin https://github.com/matentzn/triffid-behavior-ontology.git
+git branch -M main
+git push -u origin main
+
Note: you can now mv target/triffid-behavior-ontology
to anywhere you like in your home directory. Or you can do a fresh checkout from github.
I generally feel it's easier and less error-prone to deviate from the standard instructions above. I keep having problems with git, passwords, typos etc., so I tend to do it, unofficially, as follows:
+target/triffo
).In your repo you will see a README-editors.md file that has been customized for your project. Follow these instructions.
+The assumption here is that you are adhering to OBO principles and +want to eventually submit to OBO. Your repo will contain stub metadata +files to help you do this.
+You can create pull requests for your ontology on the OBO Foundry. See the src/metadata
file for more details.
For more documentation, see http://obofoundry.org
+You will want to also:
+See the README-editors.md file that has been generated for your project.
+ + + + + + +docker pull obolibrary/odkfull
. This will download the ODK (will take a few minutes, depending on you internet connection).Raw
, and then, when the file is open in your browser, CTRL+S to save it. Ideally, you save this file in your project directory, the directory you will be using for your exercises, as it will only allow you to edit files in that very same directory (or one of its sub-directories).docker ps
in your terminal or command line (CMD). If all is ok, you should be seeing something like:CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+
docker pull obolibrary/odkfull
on your command line to install the ODK. This will take while.sh odk.sh robot --version
to see whether it works.sh odk.sh bash
(to leave the ODK container again, simply run exit
from within the container). On Windows, use run.bat bash
instead.
+ However, for many of the ontologies we develop, we already ship an ODK wrapper script in the ontology repo, so we dont need the odk.sh or odk.bat file.
+ That file is usually called run.sh
or run.bat
and can be found in your ontology repo in the src/ontology
directory
+ and can be used in the exact same way.One of the most frequent problems with running the ODK for the first time is failure because of lack of memory. +There are two potential causes for out-of-memory errors:
+JAVA
inside the ODK docker container. This memory is set as part of the ODK wrapper files, i.e. src/ontology/run.bat
or src/ontology/run.sh
, usually with ODK_JAVA_OPTS
.Out-of-memory errors can take many forms, like a Java OutOfMemory exception,
+but more often than not it will appear as something like an Error 137
.
There are two places you need to consider to set your memory:
+robot_java_args: '-Xmx8G'
to your src/ontology/cl-odk.yaml file, see for example here.robot_java_args
variable. You can manage your memory settings
+ by right-clicking on the docker whale in your system bar-->Preferences-->Resources-->Advanced, see picture below.If your problem is that you do not have enough memory on your machine, the only solution is to try to engineer the pipelines a bit more intelligently, but even that has limits: large ontologies require a lot of memory to process when using ROBOT. For example, handling ncbitaxon as an import in any meaningful way easily consumes up to 12GB alone. Here are some tricks you may want to contemplate to reduce memory:
+robot query
uses an entirely different framework for representing the ontology, which means that whenever you use ROBOT query, for at least a short moment, you will have the entire ontology in memory twice. Sometimes you can optimse memory by seperating query
and other robot
commands into seperate commands (i.e. not chained in the same robot
command).The robot reason
command consumes a lot of memory. reduce
and materialise
potentially even more. Use these only ever in the last possible moment in a pipeline.
`
+A new version of the Ontology Development Kit (ODK) is out? This is what you should be doing:
+docker pull obolibrary/odkfull
+
src/ontology
directory.cd myrepo/src/ontology
+
Now run the update command TWICE (the first time it may fail, as the update command needs to update itself).
+sh run.sh make update_repo
+sh run.sh make update_repo
+
.github/workflows/qc.yml
(from the top level of your repository) and make sure that it is using the latest version of the ODK.For example, container: obolibrary/odkfull:v1.3.0
, if v1.3.0
. Is the latest version. If you are unsure what the latest version is, you can find that information here: https://hub.docker.com/r/obolibrary/odkfull/tags
OPTIONAL: if you have any other GitHub actions you would like to update to the latest ODK, now is the time! All of your GitHub actions can be found in the .github/workflows/
directory from the top level of your repo.
Review all the changes and commit them, and make a PR the usual way. 100% wait for the PR to pass QC - ODK updates can be significant!
+Send a reminder to all other ontology developers of your repo and tell them to install the latest version of ODK (step 1 only).
+This 'how to' guide provides a template for an Ontology Overview for your ontology. Please create a markdown file using this template and share it in your ontology repository, either as part of your ReadMe file or as a separate document in your documentation. The Ontology Overview should include the following three sections:
+Describe the ontology level curation, ie how to add terms. For example, terms are added to the ontology via:
+Note: There is no need for details about QC, ODK unless it is related to curation (ie pipeline that automatically generates mappings, include that)
+Include 1-3 actual use cases. Please provide concrete examples.
+For example:
+Contributors:
+ +Status: This is a working document! Feel free to add more content!
+++The Open Science Engineer contributes to the collection and standardisation of publicly available scientific knowledge through curation, community-building and data, ontology and software engineering.
+
Open Science and all its sub-divisions, including Open Data and Open Ontologies, are central to tackling global challenges from rare disease to climate change. Open licenses are only part of the answer - the really tough part is the standardisation of data (including the unification of ontologies, the FAIRification of data and adoption of common semantic data models) and the organisation of a global, fully decentralised community of Open Science engineers. Here, we will discuss some basic principles on how we can maximise our impact as members of a global community combating the issues of our time:
+We discuss how to best utilise social workflows to achieve positive impact. We will try to convince you that building a close collaborative international community by spending time on submitting and answering issues on GitHub, helping on Stack Overflow and other online platforms, or just reaching out and donating small amounts of time to other open science efforts can make a huge difference.
+For a quick 10 minute overview:
+ + +How to be an open science ontologist
+ + + +The heart and soul of a successful Open Science culture is collaboration. The relative isolation into which many projects are forced due to limitations imposed by certain kinds of funding makes it even more important to develop effective social, collaborative workflows. This involve effective online communication, vocal appreciation (likes, upvotes, comments), documentation and open-ness.
+<details>
tag: <details><summary>[click arrow to expand]</summary>
. See example hereMaximising impact of your changes is by far the best way you can benefit society as an Open Science Engineer. Open Science projects are a web of mutually dependent efforts, for example:
+The key to maximising your impact is to push any fixes as far upstream as possible. Consider the following projects and the way they depend on each other (note that this is a gross simplification for illustration; in reality the number of dependencies is much higher):
+ +Let's think of the following (entirely fabricated) scenario based on the image above.
+It is, therefore, possible that:
+Imagine a user of Open Targets that sees this evidence, and reports it to Open Targets as a bug. Open Targets could take the easy way out: remove the erroneous record from the database permanently. This means that the IMPC (itself with hundreds of dependent users and tools), Monarch (again with many dependents), uPheno and HPO (with probably thousands of dependents) would still carry forward that (tiny) mistake. +This is the basic idea of maximising impact through Upstream Fixing: The higher up-stream (up the dependency graph) an error is fixed, the more cumulative benefit there is to a huge ecosystem of tools and services.
+An even better fix would be to have each fix to the ontology result in a new, shared quality control test. For example, some errors (duplicate labels, missing definition, etc) can be caught by automated testing. Here is a cool story.
+@vasvir
(GitHub name), a member of the global community reached out to us on Uberon: https://github.com/obophenotype/uberon/issues/2424.
+
+https://github.com/obophenotype/uberon/pull/2640Gasserian ganglion
and gasserian ganglion
where previously considered distinct). Note: before the PRs, @vasvir did not speak any SPARQL.
+ Instead of simply deleting the synonyms for his NLP projects, @vasvir
instead decided to report the issues straight to the source. This way, hundreds, if not thousands of projects will directly or indirectly benefit from him!
Example 1: While curating Mondo, Nicole identified issues relevant to Orphanet and created this issue.
+ +Example 2: There is overlap between Mondo and Human Phenotype Ontology and the Mondo and HPO curators tag each other on relevant tickets.
+ +Example 3: In Mondo, if new classifications are made, Mondo curators report this back to the source ontology to see if they would like to follow our classification.
+ +Have you ever wondered how much impact changing a synonym from exact
to related
could have? Or the addition of a precise mapping? The fixing of a typo in a label? It can be huge. And this does not only relate to ontologies, this goes for tool development as well. We tend to work around bugs when we are building software. Instead, or at least in addition to, we should always report the bug at the source to make sure it gets fixed eventually.
Many of the resources we develop are financed by grants. Grants are financed in the end by the taxpayer. While it is occasionally appropriate to protect open work with creative licenses, it rarely makes sense to restrict access to Open Ontologies work - neither to commercial nor research exploitation (we may want to insist on appropriate attribution to satisfy our grant developers).
+On the other side there is always the risk of well-funded commercial endeavours simply "absorbing" our work - and then tying stakeholders into their closed, commercial ecosystem. However, this is not our concern. We cannot really call it stealing if it is not really ours to begin with! Instead of trying to prevent unwanted commercialisation and closing, it is better to work with corporations in pre-competitive schemes such as Pistoia Alliance or Allotrope Foundation and lobby for more openness. (Also, grant authorities should probably not allow linking scientific data to less than totally open controlled vocabularies.)
+Here, we invite you to embrace the idea that ontologies and many of the tools we develop are actually community-driven, with no particular "owners" and "decision makers". While we are not yet there (we don't have sufficiently mature governance workflows for full fledged onto-communism), and most ontologies are still "owned" by an organisation that provides a major source of funding, we invite you to think of this as a preliminary state. It is better to embrace the idea of "No-ownership" and figure out social workflows and governance processes that can handle the problems of decision making.
+Feel empowered to nudge reviewers or experts to help. Get that issue answered and PR merged whatever it takes!
+Example: After waiting for the PR to be reviewed, Meghan kindly asked Nicole if she should find a different reviewer. + +1. Find review buddies. For every ontology you seek to contribute to pair up with someone who will review your pull requests and you will review their pull requests. Sometimes, it is very difficult to get anyone to review your pull request. Reach out to people directly, and form an alliance for review. It is fun, and you learn new things (and get to know new people!). +1. Be proactive
+Prettier standardizes the representation and formatting of Markdown. More information is available at https://prettier.io/. Note, these instructions are for a Mac.
+If you do not have npm installed, this can be installed using homebrew (if you have homebrew installed).
+brew install node
npm install --save-dev --save-exact prettier
npx prettier --write .
Note: Windows users should open Protege using run.bat +Note: For the purpose of this how-to, we will be using MONDO as the ontology
+Note: Windows users should open Protege using run.bat. Note: For the purpose of this how-to, we will be using MONDO as the ontology.
+To customize your view, go to the Window tab on the toolbar and select Views. Here you can customize which panels you see in each tab. In the tabs view, you can select which tabs you will see. You will commonly want to see the Entities tab, which has the Classes tab and the Object Properties tab.
+ +Note: if you open a new ontology while viewing your current ontology, Protégé will ask you if you'd like to open it in a new window. For most normal usage you should answer no. This will open in a new window.
+The panel in the center is the ontology annotations panel. This panel contains basic metadata about the ontology, such as the authors, a short description and license information.
+ +Before browsing or searching an ontology, it is useful to run an OWL reasoner first. This ensures that you can view the full, intended classification and allows you to run queries. Navigate to the query menu, and run the ELK reasoner:
+ +You will see various tabs along the top of the screen. Each tab provides a different perspective on the ontology. +For the purposes of this tutorial, we care mostly about the Entities tab, the DL query tab and the search tool. OWL Entities include Classes (which we are focussed on editing in this tutorial), relations (OWL Object Properties) and Annotation Properties (terms like, 'definition' and 'label' which we use to annotate OWL entities. +Select the Entities tab and then the Classes sub-tab. Now choose the inferred view (as shown below).
+ +The Entities tab is split into two halves. The left-hand side provides a suite of panels for selecting various entities in your ontology. When a particular entity is selected the panels on the right-hand side display information about that entity. The entities panel is context specific, so if you have a class selected (like Thing) then the panels on the right are aimed at editing classes. The panels on the right are customizable. Based on prior use you may see new panes or alternate arrangements. +You should see the class OWL:Thing. You could start browsing from here, but the upper level view of the ontology is too abstract for our purposes. To find something more interesting to look at we need to search or query.
+You can search for any entity using the search bar on the right:
+ +The search window will open on top of your Protege pane, we recommend resizing it and moving it to the side of the main window so you can view together.
+ +Here's an example search for 'COVID-19': +
+It shows results found in display names, definitions, synonyms and more. The default results list is truncated. To see full results check the 'Show all results option'. You may need to resize the box to show all results. +Double clicking on a result, displays details about it in the entities tab, e.g.
+ +In the Entities, tab, you can browse related types, opening/closing branches and clicking on terms to see details on the right. In the default layout, annotations on a term are displayed in the top panel and logical assertions in the 'Description' panel at the bottom.
+Try to find these specific classes:
+Note - a cool feature in the search tool in Protege is you can search on partial string matching. For example, if you want to search for ‘down syndrome’, you could search on a partial string: ‘do synd’.
+Note - if the search is slow, you can uncheck the box ‘Search in annotation values. Try this and search for a term and note if the search is faster. Then search for ‘shingles’ again and note what results you get.
+ + + + + + +You need to have a GitHub account GitHub and download GitHub Desktop
+You need to have a GitHub account and to download GitHub Desktop.
+One key aspect of reviewing pull requests (aka code review or ontology change review) is that the purpose is not just to improve the quality of +the proposed change. It is also about building shared coding habits and practices and improving those practices for all engineers (ontology and software) across a whole organisation (effectively building the breadth of project knowledge of the developers and reducing the amount of hard-to-understand code).
+Reviewing is an important aspect of open science and engineering culture that needs to be learned and developed. In the long term, this habit will have an effect on the growth and impact of our tools and ontologies comparable to the engineering itself.
+It is central to open science work that we review other people's work outside our immediate team. We recommend choosing a few people with whom to mutually review your work, whether you develpo ontologies, code or both. It is of great importance that pull requests are addressed in a timely manner, ideally within 24 hours of the request. The requestor is likely in the headspace of being receptive to changes and working hard to get the code fixed when they ask for a code review.
+Understand the Context: First, read the description of the pull request (PR). It should explain what changes have been made and why. Understand the linked issue or task related to this PR. This will help you understand the context of the changes.
+Check the Size: A good PR should not be too large, as this makes it difficult to understand the full impact of the changes. If the PR is very large, it may be a good idea to ask the author to split it into smaller, more manageable PRs.
+Review the Code: Go through the code changes line by line. Check the code for clarity, performance, and maintainability. Make sure the code follows the style guide and best practices of your project. Look out for any potential issues such as bugs, security vulnerabilities, or performance bottlenecks.
+Check the Tests: The PR should include tests that cover the new functionality or changes. Make sure the tests are meaningful, and they pass. If the project has a continuous integration (CI) system, all tests should pass in the CI environment. In some cases, manual testing may be helpful (see below).
+Check the Documentation: If the PR introduces new functionality, it should also update the documentation accordingly. Even for smaller changes, make sure that comments in the code are updated.
+Give Feedback: Provide constructive feedback on the changes. If you suggest changes, explain why you think they are necessary. Be clear, respectful, and concise. Remember, your goal is to help improve the quality of the code.
+Follow Up: After you have provided feedback, check back to see if the author of the PR has made the suggested changes. You might need to have a discussion or explain your points further.
+Approve/Request Changes: If you are satisfied with the changes and all your comments have been addressed, approve the PR. If not, request changes and explain what should be done before the PR can be approved.
+Merge the PR: Once the PR is approved and all CI checks pass, it can be merged into the main branch. If your project uses a specific merge strategy (like squash and merge or rebase and merge), make sure it's followed.
+xsd:string
declarations), request before doing a review to reduce the changes to only the changes pertaining to the specific issue at hand.In many cases, we may not have the time to perform a proper code review. In that case, try at least to achieve this:
+The instructions below describe how to capture a screenshot of your screen, either your entire screen or a partial screenshot. These can be pasted into GitHub issues, pull requests or any markdown file.
+Different keyboards have different keys. One of the following options should work:
+(This was adopted from the Gene Ontology editors guide and Mondo documentation). Updated 2023-03-13 by Nicole Vasilevsky
+These instructions are for Mac OS
+As of February 2023, OBO ontology editors are using Protege version 5.6.1.
+Protege needs at least 4G of RAM to cope with large ontologie like Mondo, ideally use 12G or 16G if your machine can handle it. Edits to the Protege configuration files will not take effect until Protege is restarted.
+<string>-Xss16M</string>
<string>-Xmx12G</string>
Some Mac users might find that the edits need to be applied to /Applications/Protégé.app/Contents/Info.plist
.
Taken in part from Memory Management with Protégé by Michael DeBellis. Updated by Nicole Vasilevsky.
+The following instructions will probably not work if Protégé was installed from the platform independent version, which does not include the Java Runtime Environment or a Windows .exe launcher.
+Protege-<version>-win.zip
Protege.l4j.ini
in the same directory as Protege.exe
. Opening large ontologies like MONDO will require an increase to Protege's default maximum Java heap size, which is symbolized as -Xmx<size>
. 4GB is usually adequate for opening MONDO, as long as 4GB of free memory is really available on your system before you launch Protégé! Allocating even more memory will improve some tasks, like reasoning. You can check your available memory by launching the Windows Task Manager, clicking on the More details button on the bottom of the window and then checking the Performance tab at the top of the window.Protege.l4j.ini
before editingOpen Protege.l4j.ini
with a lightweight text editor like Atom or Sublime. Using notepad.exe instead might work, but may change character encodings or the character(s) used to represent End of Line.
After increasing the memory available to Protégé, Protege.l4j.ini
might look like this.
-Xms200M
+-Xmx4G
+-Xss16M
+
Note that there is no whitespace between -Xmx
, the numerical amount of memory, and the Megabytes/Gigabytes suffix. Don't forget to save.
Taking advantage of the memory increase requires that Protégé is shut down and relaunched, if applicable. The methods discussed here may not apply if Protégé is launched through any method other than double clicking Protege.exe
from the folder where the edited Protege.l4j.ini
resides.
If you have issues opening Protege, then reduce the memory, try 10G (or lower) instead.
+See instructions here. Note: Protege 5.6.1 has the ELK reasoner installed.
+See instructions here.
+User name
Click Use supplied user name:
add your name (ie nicolevasilevsky)Use Git user name when available
ORCID
. Add the ID number only, do not include https://, ie 0000-0001-5208-3432Preferences
> New Entities Metadata
tabAnnotate new entities with creator (user)
boxCreator property
Add http://purl.org/dc/terms/contributorCreator value
Select Use ORCIDDate property
Add http://purl.org/dc/terms/dateDate value format
Select ISO-8601This plugin enables some extra functionality, such as the option to obsolete entities from the menu. To install it:
+File > Check for plugins...
.OBO Annotations Editor
and click on Install
.Edit > Make entity obsolete
.Preferences > Plugins
.docker pull obolibrary/odkfull
. This will download the ODK (will take a few minutes, depending on you internet connection).When you edit an ontology, you need to make sure you are using the correct prefix and your assigned ID range for that on ontology. Protege (unfortunately) does +not remember the last prefix or ID range that you used when you switch between ontologies. Therefore we need to manually update this each time we switch ontologies.
+src/ontology/[ontology-name]-idranges.owl
. (For example, src/ontology/mondo-idranges.owl.)You need to have a GitHub account to make term requests. Sign up for a free GitHub account.
+This guide on How to select and request terms from ontologies by Chris Mungall provides some helpful background and tips for making term requests.
+Onologies are under constant development and are continuously expanded and iterated upon. You may discover that a term you need is not available in your preferred ontology. In this case, please make a new term request to the ontology.
+In the following text below, we describe best practices for making a term request to an ontology. In general, requests for new terms are make on the ontology GitHub issue tracker. For example, this is the GitHub issue tracker for the Uberon Anatomy onology.
+Note: These are suggestions and not strict rules. We appreciate your contributions to extending and improving ontologies. Following best guidelines is appreciated by the curators and developers, and assists them in addressing your issue more quickly. However, we understand if you are not always able to follow these best practices. Please add as much information as possible, and if there are any questions, the ontology developer may follow up with you for further clarification.
+This page discusses how to update the contents of your imports using the ODK, like adding or removing terms.
+Note: This is a specialised how-to for ODK managed ontologies and is replicated from ODK docs to consolidate workflows in the obook. Not all ontologies use ODKs and many ontologies have their own workflows for imports, please also check with your local ontology documents and/or developers.
+Note: The extract function in ROBOT can also be used to extract subsets from onotlogies for modular imports without the use of the ODK. For details on that, please refer to the ROBOT documentation
+Note: some ontologies now use a merged-import system to manage dynamic imports, for these please follow instructions in the section title "Using the Base Module approach".
+Importing a new term is split into two sub-phases:
+There are three ways to declare terms that are to be imported from an external ontology. Choose the appropriate one for your particular scenario (all three can be used in parallel if need be):
+This workflow is to be avoided, but may be appropriate if the editor does not have access to the ODK docker container. +This approach also applies to ontologies that use base module import approach.
+Now you can use this term for example to construct logical definitions. The next time the imports are refreshed (see how to refresh here), the metadata (labels, definitions, etc) for this term are imported from the respective external source ontology and becomes visible in your ontology.
+Every import has, by default a term file associated with it, which can be found in the imports directory. For example, if you have a GO import in src/ontology/go_import.owl
, you will also have an associated term file src/ontology/go_terms.txt
. You can add terms in there simply as a list:
GO:0008150
+GO:0008151
+
Now you can run the refresh imports workflow) and the two terms will be imported.
+This workflow is appropriate if:
+To enable this workflow, you add the following to your ODK config file (src/ontology/cl-odk.yaml
), and update the repository (using sh run.sh make update_repo
):
use_custom_import_module: TRUE
+
Now you can manage your imported terms directly in the custom external terms template, which is located at src/templates/external_import.owl
. Note that this file is a ROBOT template, and can, in principle, be extended to include any axioms you like. Before extending the template, however, read the following carefully.
The main purpose of the custom import template is to enable the management off all terms to be imported in a centralised place. To enable that, you do not have to do anything other than maintaining the template. So if you, say current import APOLLO_SV:00000480
, and you wish to import APOLLO_SV:00000532
, you simply add a row like this:
ID Entity Type
+ID TYPE
+APOLLO_SV:00000480 owl:Class
+APOLLO_SV:00000532 owl:Class
+
When the imports are refreshed see imports refresh workflow, the term(s) will simply be imported from the configured ontologies.
+Now, if you wish to extent the Makefile (which is beyond these instructions) and add, say, synonyms to the imported terms, you can do that, but you need to (a) preserve the ID
and ENTITY
columns and (b) ensure that the ROBOT template is valid otherwise, see here.
WARNING. Note that doing this is a widespread antipattern (see related issue). You should not change the axioms of terms that do not belong into your ontology unless necessary - such changes should always be pushed into the ontology where they belong. However, since people are doing it, whether the OBO Foundry likes it or not, at least using the custom imports module as described here localises the changes to a single simple template and ensures that none of the annotations added this way are merged into the base file (see format variant documentation for explanation on what base file is)
+If you want to refresh the import yourself (this may be necessary to pass the travis tests), and you have the ODK installed, you can do the following (using go as an example):
+First, you navigate in your terminal to the ontology directory (underneath src in your hpo root directory).
+cd src/ontology
+
Then, you regenerate the import that will now include any new terms you have added. Note: You must have docker installed.
+sh run.sh make PAT=false imports/go_import.owl -B
+
Since ODK 1.2.27, it is also possible to simply run the following, which is the same as the above:
+sh run.sh make refresh-go
+
Note that in case you changed the defaults, you need to add IMP=true
and/or MIR=true
to the command below:
sh run.sh make IMP=true MIR=true PAT=false imports/go_import.owl -B
+
If you wish to skip refreshing the mirror, i.e. skip downloading the latest version of the source ontology for your import (e.g. go.owl
for your go import) you can set MIR=false
instead, which will do the exact same thing as the above, but is easier to remember:
sh run.sh make IMP=true MIR=false PAT=false imports/go_import.owl -B
+
Since ODK 1.2.31, we support an entirely new approach to generate modules: Using base files. +The idea is to only import axioms from ontologies that actually belong to it. +A base file is a subset of the ontology that only contains those axioms that nominally +belong there. In other words, the base file does not contain any axioms that belong +to another ontology. An example would be this:
+Imagine this being the full Uberon ontology:
+Axiom 1: BFO:123 SubClassOf BFO:124
+Axiom 1: UBERON:123 SubClassOf BFO:123
+Axiom 1: UBERON:124 SubClassOf UBERON 123
+
The base file is the set of all axioms that are about UBERON terms:
+Axiom 1: UBERON:123 SubClassOf BFO:123
+Axiom 1: UBERON:124 SubClassOf UBERON 123
+
I.e.
+Axiom 1: BFO:123 SubClassOf BFO:124
+
Gets removed.
+The base file pipeline is a bit more complex then the normal pipelines, because +of the logical interactions between the imported ontologies. This is solved by _first +merging all mirrors into one huge file and then extracting one mega module from it.
+Example: Let's say we are importing terms from Uberon, GO and RO in our ontologies. +When we use the base pipelines, we
+imports/merged_import.owl
The first implementation of this pipeline is PATO, see https://github.com/pato-ontology/pato/blob/master/src/ontology/pato-odk.yaml.
+To check if your ontology uses this method, check src/ontology/cl-odk.yaml to see if use_base_merging: TRUE
is declared under import_group
If your ontology uses Base Module approach, please use the following steps:
+First, add the term to be imported to the term file associated with it (see above "Using term files" section if this is not clear to you)
+Next, you navigate in your terminal to the ontology directory (underneath src in your hpo root directory).
+cd src/ontology
+
Then refresh imports by running
+sh run.sh make imports/merged_import.owl
+
Note: if your mirrors are updated, you can run sh run.sh make no-mirror-refresh-merged
This requires quite a bit of memory on your local machine, so if you encounter an error, it might be a lack of memory on your computer. A solution would be to create a ticket in an issue tracker requesting for the term to be imported, and your one of the local devs should pick this up and run the import for you.
+Lastly, restart Protege, and the term should be imported in ready to be used.
+ + + + + + +There are two places you'll probaby want to use images in GitHub, in issue tracker and in markdown files, html etc. +The way you handle images in these contexts is quite different, but easy once you get the hang of it.
+There are two places you'll probably want to use images in GitHub: in the issue tracker and in markdown files, html etc. The way you handle images in these contexts is quite different, but easy once you get the hang of it.
+An identical file, named in two different ways is shown as an example below. +They render in the same way, but the source "code" looks ugly when spaces are used in file names.
+Eg.
+encoding needed | +no encoding needed | +
---|---|
![](github%20organizations%20teams%20repos.png |
+![](github-organizations-teams-repos.png) |
+
+ | + |
In this example, the filename is enough of a 'url' because this file (https://ohsu-library.github.io/github-tutorial/howto/images/index.md) and the images are in the same directory https://ohsu-library.github.io/github-tutorial/howto/images/.
+To reference/embed an image that is not in the same directory, a more careful approach is needed.
+Absolute path referencing | +Relative path referencing | +
---|---|
![](https://github.com/OHSU-Library/github-tutorial/raw/master/docs/other-images/owl.jpg) |
+![](other-images/owl.jpg) |
+
+ | + |
Each instance of ../
means 'go up one level' in the file tree.
It is also possible to reference an image using an external URL outside your control, in another github organization, or anywhere on the web, however this method can be fragile if the URL changes or could lead to unintended changes. Therefore make your own copies and reference those unless:
+For example, it is not clear for how long the image below will manage to persist at this EPA link, or sadly, for how long the image will even be an accurate reflection of the current situation in the arctic. https://www.epa.gov/sites/production/files/styles/microsite_banner/public/2016-12/epa-banner-images/science_banner_arctic.png
+ +Images that are embedded into issues can be dragged and dropped in the GitHub issues interface. +Once you've done so, it will look something like this with GitHub assigning an arbitrary URL (githubuserassets) for the image.
+![](screenshot-of-images-in-issues.png)
Ideally, a Markdown document is renderable in a variety of output formats and devices. In some cases, it may be desirable to create non-portable Markdown that uses HTML syntax to position images. This limits the longevity of the artifact, but may be necessary sometimes. We describe how to manage this below.
+In order to size images, use the native html syntax: width =
with the <img src=, as per below.
<img src="https://github.com/monarch-initiative/monarch-app/raw/master/image/Phenogrid3Compare.png" width="53">
Welcome to the OBOOK and our OBO Semantic Engineering Training!
+Documentation in the OBOOK is organised into 4 distinct sections based on the Diátaxis framework of documentation:
+To accommodate for the various training use cases we support, we added the following categories:
+Note: We are in the process of transforming the documentation accordingly, so please be patient if some of the documentation is not yet in the correct place. Feel free to create an issue if you find something that you suspect isn't in place.
+If you would like to contribute to this training, please find out more here.
+Critical Path Institute (CPI) is an independent, nonprofit organisation dedicated to bringing together experts from regulatory agencies, industry and academia to collaborate and improve the medical product development process.
+In April 2021, the CPI has commissioned the first version of this OBO course, contributing not only funding for the preparation and delivery of the materials, but also valuable feedback about the course contents and data for the practical exercises. We thank the CPI for contributing significantly to the OBO community and open science!
+https://c-path.org/
+ + + + + + +These materials are under construction and incomplete.
+In the following we will look a bit at the general Linked Data landscape, and name some of its flagship projects and standards. It is important to be clear that the Semantic Web field is a very heterogenous one:
+While these Semantic Web flagship projects are doubtlessly useful, it is sometimes hard to see how they can help for your biomedical research. We rarely make use of them in our day to day work as ontologists, but there are some notable exceptions:
+The OBO format is a very popular syntax for representing biomedical ontologies. A lot of tools have been built over the years to hack OBO ontologies on the basis of that format - I still work with it on a daily basis. Although it has semantically been proven to be a subset of OWL (i.e. there is a lossless mapping of OBO into OWL) and can be viewed as just another syntax, it is in many ways idiosyncratic. For starters, you wont find many, if any, IRIs in OBO ontologies. The format itself uses CURIEs which are mapped to the general OBO PURL namespace during transformation to OWL. For example, if you see MONDO:0003847 in an OBO file, and were to translate it to OWL, you will see this term being translated to http://purl.obolibrary.org/obo/MONDO_0003847. Secondly, you have a bunch of built-in properties like BROAD or ABBREVIATION that mapped to a vocabulary called oboInOwl (oio). These are pretty non-standard on the general Semantic Web, and often have to be manually mapped to the more popular counterparts in the Dublin Core or SKOS namespaces.
+Having URIs as identifiers is not generally popular in the life sciences. As discussed elsewhere, it is much more likely to encounter CURIEs such as MONDO:0003847 than URIs such as http://purl.obolibrary.org/obo/MONDO_0003847 in biomedical databases.
+ +Why does the biomedical research, and clinical, community care about the Semantic Web and Linked Data? There are endless lists of applications that try to apply semantic technologies to biomedical problems, but for this week, we only want to look at the broader picture. In our experience, the use cases where Semantic Web standards are applied successfully are:
+As a rule of thumb, for every single problem/term/use case, you will have 3-6 options to choose from, in some cases even more. The criteria for selecting a good ontology are very much dependent on your particular use case, but some concerns are generally relevant. A good first pass is to apply to "10 simple rules for selecting a Bio-ontology" by Malone et al, but I would further recommend to ask yourself the following:
+Aside from aspects of your analysis, there is one more thing you should consider carefully: the open-ness of your ontology in question. As a user, you have quite a bit of power on the future trajectory of the domain, and therefore should seek to endorse and promote open standards as much as possible (for egotistic reasons as well: you don't want to have to suddenly pay for the ontologies that drive your semantic analyses). It is true that ontologies such as SNOMED have some great content, and, even more compellingly, some really great coverage. In fact, I would probably compare SNOMED not with any particular disease ontology, but with the OBO Foundry as a whole, and if you do that, it is a) cleaner, b) better integrated. But this comes at a cost. SNOMED is a commercial product - millions are being payed every year in license fees, and the more millions come, the better SNOMED will become - and the more drastic consequences will the lock-in have if one day you are forced to use SNOMED because OBO has fallen too far behind. Right now, the sum of all OBO ontologies is probably still richer and more valuable, given their use in many of the central biological databases (such as the ones hosted by the EBI) - but as SNOMED is seeping into the all aspects of genomics now (for example, it will soon be featured on OLS!) it will become increasingly important to actively promote the use of open biomedical ontologies - by contributing to them as well as by using them.
+We will discuss ontologies in the medical, phenomics and genomics space in more detail in a later session of the course.
+In this section we will discuss the following:
+Note of caution: No two Semantic Web overviews will be equivalent to each other. Some people claim the Semantic Web as an idea is an utter failure, while others praise it as a great success (in the making) - in the end you will have to make up your own mind. In this section I focus on parts of the Semantic Web step particularly valuable to the biomedical domain, and I will omit many relevant topics in the wider Semantic Web area, such as Enterprise Knowledge Graphs, decentralisation and personalisation, and many more. Also, the reader is expected to be familiar with the basic notions of the Semantic Web, and should use this overview mainly to tie some of the ideas together.
+The goal of this section is to give the aspiring Semantic Data Engineer in the biomedical domain a rough idea of key concepts around Linked Data and the Semantic Web insofar as they relate to their data science and and data engineering problems. Even after 20 years of Semantic Web research (the seminal paper, conveniently and somewhat ironically behind a paywall, was published in May 2001), the area is still dominated by "academic types", although the advent of the Knowledge Graph is already changing that. As I already mentioned above, no two stories of what the Semantic Web is will sound the same. However, there are a few stories that are often told to illustrate why we need semantics. The OpenHPI course names a few:
+<span about="dbpedia:Jaguar">Jaguar</span>
, will make it easier for the search engine to understand what your site is about and link it to other relevant content. From this kind of mark-up, structured data can be extracted and integrate into a giant, worldwide database, and exposed through SPARQL endpoints, that can then be queried using a suitable query language.I am not entirely sure anymore that any of these ways (web of data, machine understanding, layered stack of matching standards) to motivate the Semantic Web are particularly effective for the average data scientists or engineer. +If I had to explain the Semantic Web stack to my junior self, just having finished my undergraduate, I would explain it as follows (no guarantee though it will help you).
+The Semantic Web / Linked Data stack comprises roughly four components that are useful for the aspiring Semantic (Biomedical) Data Engineer/Scientist to distinguish:
+You, as a scientist, might be using the term "gene" to refer to basic physical and functional unit of heredity, but me, as a German, prefer the term "Gen". In the Semantic Web, instead of natural language words, we prefer to use URIs to refer to things such as https://www.wikidata.org/wiki/Q7187: if you say something using the name https://www.wikidata.org/wiki/Q7187, both your German and Japanese colleagues will "understand" what you are referring to. More about that in the next chapter.
+For example, to express "a mutation of SHH in humans causes isolated microphthalmia with coloboma-5" you could say something like (http://purl.obolibrary.org/obo/MONDO_0012709 | "microphthalmia, isolated, with coloboma 5")--[http://purl.obolibrary.org/obo/RO_0004020 | "has basis in dysfunction of"]-->(https://identifiers.org/HGNC:10848 | "SSH (gene)"). Or you could say: (http://purl.obolibrary.org/obo/MONDO_0012709 | "microphthalmia, isolated, with coloboma 5")--[http://www.w3.org/2000/01/rdf-schema#subClassOf | "is a"]-->(http://purl.obolibrary.org/obo/MONDO_0003847 | "Mendelian Disease"). If we use the analogy of "language", then the URIs (above) are the words, and the statements are sentences in a language. Unfortunately, there are many languages in the Semantic Web, such as OWL, RDFS, SKOS, SWRL, SHACL, SHEX, and dialects (OWL 2 EL, OWL 2 RL) and a plethora of formats, or serialisations (you can store the exact same sentence in the same language such as RDF, or OWL, in many different ways)- more about that later. In here lies also one of the largest problems of the Semantic Web - lots of overlapping standards means, lots of incompatible data - which raises the bar for actually being able to seamlessly integrate "statements about things" across resources.
+Examples include:
+For example (as always, non exhaustive):
+This week will focus on 1 (identifiers) and 4 (applications) - 2 (languages and standards) and 3 (controlled vocabularies and ontologies) will be covered in depth in the following weeks.
+Note on the side: Its not always 100% clear what is meant by Linked Data in regular discourse. There are some supposedly "clear" definitions ("method for publishing structured data", "collection of interrelated datasets on the Web"), but when it comes down to the details, there is plenty of confusion (does an OWL ontology constitute Linked Data when it is published on the Web? Is it Linked Data if it does not use RDF? Is it Linked Data if it is less than 5-star - see below). In practice all these debates are academic and won't mean much to you and your daily work. There are entities, statements (context) being said about these entities using some standard (associated with the Semantic Web, such as OWL or RDFS) and tools that do something useful with the stuff being said.
+One of the top 5 features of the Semantic Web (at least in the context of biomedical sciences) is the fact that we can use URIs as a global identifier scheme that is unambiguous, independent of database implementations, independent of language concerns to refer to the entities in our domain.
+For example, if I want to refer to the concept of "Mendelian Disease", I simply refer to http://purl.obolibrary.org/obo/MONDO_0003847 - and everyone, in Japan, Germany, China or South Africa, will be able to "understand" or look up what I mean. I don't quite like the word "understanding" in this context as it is not actually trivial to explain to a human how a particular ID relates to a thing in the real world (semiotics). In my experience, this process is a bit rough in practice - it requires that there is a concept like "Mendelian Disease" in the mental model of the person, and it requires some way to link the ID http://purl.obolibrary.org/obo/MONDO_0003847 to that "mental" concept - not always as trivial as in this case (where there are standard textbook definitions). The latter is usually achieved (philosophers and linguists please stop reading) by using an annotation that somehow explains the term - either a label or some kind of formal definition - that a person can understand. In any case, not trivial, but thankfully not the worst problem in the biomedical domain where we do have quite a wide range of shared "mental models" (more so in Biology than Medical Science..). Using URIs allows us to facilitate this "understanding" process by leaving behind some kind of information at the location that is dereferenced by the URI (basically you click on the URI and see what comes up). Note that there is a huge deal of compromise already happening across communities. In the original Semantic Web community, the hope was somehow that dereferencing the URI (clicking on it, navigating to it) would reveal structured information about the entity in question that could used by machines to understand what the entity is all about. In my experience, this was rarely ever realised in the biomedical domain. Some services like Ontobee expose such machine readable data on request (using a technique called content negotiation), but most URIs simply refer to some website that allow humans to understand what it means - which is already a huge deal. For more on names and identifiers I refer the interested reader to James Overton's OBO tutorial here.
+Personal note: Some of my experienced friends in the bioinformatics world say that "IRIs have been more pain than benefit". It is clear that there is no single thing in the Semantic Web that is entirely uncontested - everything has its critics and proponents.
+In reality, few biological resources will contain a reference to http://purl.obolibrary.org/obo/MONDO_0003847. More often, you will find something like MONDO:0003847, which is called a CURIE. You will find CURIEs in many contexts, to make Semantic Web languages easier to read and manage. The premise is basically that your document contains a prefix declaration that says something like this:
PREFIX MONDO: <http://purl.obolibrary.org/obo/MONDO_>
+
which allows the interpreter to unfold the CURIE into the full IRI:
+MONDO:0003847 -> http://purl.obolibrary.org/obo/MONDO_0003847
+
In reality, the proliferation of CURIEs has become a big problem for data engineers and data scientists when analysing data. Databases rarely, if ever, ship with their data the CURIE maps required to understand what a prefix effectively stands for, leading to a lot of guess-work in the daily practice of the Semantic Data Engineer (if you have ever had to distinguish ICD:, ICD10:, ICD9:, UMLS: and UMLSCUI: without a prefix map, you will know what I am talking about). Efforts to bring order to this chaos exist, essentially globally agreed CURIE maps (e.g. prefixcommons) or ID management services such as identifiers.org, but right now there is no one solution - prepare to have to deal with this issue when working on data integration efforts in the biomedical sciences. More likely than not, your organisation will build its own CURIE map and maintain it for the duration of your project.
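+To make this concrete, here is a minimal sketch (in Posix shell, with a small hand-maintained prefix map; the prefixes listed are illustrative) of what expanding a CURIE against such a map looks like:
+#!/bin/sh
+# Expand a CURIE such as MONDO:0003847 into a full IRI using a hand-maintained prefix map.
+expand_curie() {
+    prefix="${1%%:*}"    # text before the first ':'
+    local_id="${1#*:}"   # text after the first ':'
+    case "$prefix" in
+        MONDO) base="http://purl.obolibrary.org/obo/MONDO_" ;;
+        HP)    base="http://purl.obolibrary.org/obo/HP_" ;;
+        *)     echo "Unknown prefix: $prefix" >&2; return 1 ;;
+    esac
+    echo "${base}${local_id}"
+}
+
+expand_curie "MONDO:0003847"   # prints http://purl.obolibrary.org/obo/MONDO_0003847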
+There are probably quite a few divergent opinions on this, but I would like to humbly list the following four use cases as among the most impactful applications of Semantic Web Technology in the biomedical domain.
+We can use hierarchical relations in an ontology to group data. For example, if I know that http://purl.obolibrary.org/obo/MONDO_0012709 ("microphthalmia, isolated, with coloboma 5") http://www.w3.org/2000/01/rdf-schema#subClassOf ("is a") http://purl.obolibrary.org/obo/MONDO_0003847 ("Mendelian Disease"), then a specialised Semantic Web tool called a reasoner will know that, if I ask for all genes associated with Mendelian diseases, I also want to get those associated with "microphthalmia, isolated, with coloboma 5" specifically (note that many query engines, such as SPARQL under the RDFS entailment regime, have simple reasoners embedded in them, but we would not call them "reasoners" - just query engines).
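+As a concrete sketch of this kind of grouping (assuming ROBOT is installed and mondo.owl is a local copy of the Mondo ontology; the file names are illustrative), you can collect all subclasses of "Mendelian disease" with a SPARQL property path and let ROBOT run the query:
+#!/bin/sh
+# Write a SPARQL query that walks the subClassOf hierarchy, then run it with ROBOT.
+cat > mendelian_subclasses.sparql <<'EOF'
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX obo: <http://purl.obolibrary.org/obo/>
+SELECT ?disease WHERE {
+  ?disease rdfs:subClassOf* obo:MONDO_0003847 .
+}
+EOF
+
+robot query --input mondo.owl --query mendelian_subclasses.sparql mendelian_subclasses.tsv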
+Ontologies are extremely hard to manage, and they profit from the sound logical foundation provided by the Web Ontology Language (OWL). We can logically define our classes in terms of other ontologies, and then use a reasoner to classify our ontology automatically. For example, we can define abnormal biological process phenotypes in terms of biological processes (Gene Ontology) and classify our phenotypes entirely using the classification of biological processes in the Gene Ontology (don't worry if you don't understand a thing - we will get to that in a later week).
+Refer to the same thing the same way. While this goal was never reached in total perfection, we have gotten quite close. In my experience, there are roughly 3-6 ways to refer to entities in the biomedical domain (like say, ENSEMBL, HGNC, Entrez for genes; or SNOMED, NCIT, DO, MONDO, UMLS for diseases). So while the "refer to the same thing the same way" did not truly happen, a combination of standard identifiers with terminological mappings, i.e. links between terms, can be used to integrate data across resources (more about Ontology Matching later). Again, many of my colleagues disagree - they don't like IRIs, and unfortunately, you will have to build your own position on that.
+Personal note: From an evolutionary perspective, I sometimes think that having 2 or 3 competing terminological systems is better than 1, as the competition also drives the improvements in quality, but there is a lot of disagreement on this.
+The OBO Foundry is a community-driven effort to coordinate the development of vocabularies and ontologies across the biomedical domain. It develops standards for the representation of terminological content (like standard properties), and ontological knowledge (shared design patterns) as well as shared systems for quality control. Flagship projects include:
+In the following, we will list some of the technologies you may find useful, or will be forced to use, as a Semantic Data Engineer. Most of these standards will be covered in the subsequent weeks of this course.
+Standard | +Purpose | +Use case | +
---|---|---|
Web Ontology Language (OWL) | +Representing Knowledge in Biomedical Ontologies | +All OBO ontologies must be provided in OWL as well. | +
Resource Description Framework (RDF) | +Model for data interchange. | +Triples, the fundamental unit of RDF, are ubiquitous on the Semantic Web | +
SPARQL Query Language for RDF | +A standard query language for RDF and RDFS. | +Primary query language to interrogate RDF/RDFS/Linked Data on the Web. | +
Simple Knowledge Organization System (SKOS) | +Another, more lightweight, knowledge organisation system in many ways competing with OWL. | +Not as widely used in the biomedical domain as OWL, but increasing uptake of "matching" vocabulary (skos:exactMatch, etc). | +
RDF-star | +A key shortcoming of RDF is that, while I can in principle say everything about everything, I cannot directly talk about edges, for example to attribute provenance: "microphthalmia, isolated, with coloboma 5 is a kind of Mendelian disease"--source: Wikipedia | +Use cases here. | +
JSON-LD | +A method for encoding linked data in JSON format. | +(Very useful to at least know about). | +
RDFa | +W3C Recommendation to embed rich semantic metadata in HTML (and XML). | +I have to admit - in 11 years of Semantic Web work I have not come across much use of RDFa in the biomedical domain. But @jamesaoverton is using it in his tools! | +
A thorough overview of all the key standards and tools can be found on the Awesome Semantic Web repo.
+For a rough sense of current research trends it is always good to look at the accepted papers at one of the major conferences in the area. I like ISWC (2020 papers), but for the aspiring Semantic Data Engineer in the biomedical sphere, it is probably a bit broad and theoretical. Other interesting specialised venues are the Journal of Biomedical Semantics and the International Conference on Biomedical Ontologies, but with the shift of focus in the whole community towards Knowledge Graphs, other journals and conferences are becoming relevant.
+Here are a few key research areas, which are, by no means (!), exhaustive.
+It is useful to get a picture of the typical tasks a Semantic Data Engineer faces when building ontologies and Knowledge Graphs. In my experience, it is unlikely that any particular set of tools will work in all cases - most likely you will have to try and assemble the right toolchain for your use case and refine it over the lifetime of your project. The following are just a few of the tasks I have regularly encountered - they may or may not overlap with the specific problems you will face.
+There are no simple answers here and it very heavily depends on your use cases. We are discussing some places to look for ontologies here, but it may also be useful to simply upload the terms you are interested in to a service like Zooma and see what the terms map to at a major database provider like EBI.
+This is still much harder than it should be. Scientific databases are scattered across institutions that often do not talk to each other. Prepare for some significant work in researching the appropriate databases that could benefit your work, using Google and the scientific literature.
+It is rare nowadays that you will have to develop an ontology entirely from scratch - most biomedical sub-domains will have some kind of reasonable ontology to build upon. However, there is often a great need to extend existing ontologies - usually because you need to represent certain concepts in much more detail, or because your specific problem has not been modelled yet - think, for example, of how disease ontologies needed to be extended during the Coronavirus Crisis. Extending ontologies usually has two major facets:
+Also sometimes more broadly referred to as "data integration", this problem involves a variety of tasks, such as:
+To make your data discoverable, it is often useful to extract a view from the ontologies you are using (for example, Gene Ontology, Disease Ontology) that only contains the terms and relationships of relevance to your data. We usually refer to this kind of ontology as an application ontology, or an ontology specific to your application, which will integrate subsets of other ontologies. This process will typically involve the following:
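+One common way to do the extraction itself is ROBOT's extract command. A minimal sketch (assuming ROBOT is installed; the ontology, term file and output names are illustrative):
+#!/bin/sh
+# Extract a module around just the GO terms we use (listed one per line in my_go_terms.txt).
+robot extract --method BOT \
+  --input go.owl \
+  --term-file my_go_terms.txt \
+  --output go_subset.owl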
+There are many ways your semantic data can be leveraged for data analysis, but in my experience, two are particularly central:
+The open courses of the Hasso Plattner Institute (HPI) offer introductions into the concepts around Linked Data, Semantic Web and Knowledge Engineering. There are three courses of relevance to this week's topics, all of which overlap significantly.
+These materials are under construction and incomplete.
+In this course, you will learn the basics of automation in and around the OBO ontology world - and beyond. The primary goal is to enable ontology pipeline developers to plan the automation of their ontology workflows and data pipelines, but some of the materials are very general and apply to scientific computing more widely. The course also serves as a prerequisite for advanced application ontology development.
+make
Please complete the following tutorials.
+By: James Overton
+Automation is part of the foundation of the modern world. +The key to using and building automation +is a certain way of thinking about processes, +how they can be divided into simple steps, +and how they operate on inputs and outputs +that must be exactly the same in some respects but different in others.
+In this article I want to make some basic points about automation +and how to think about it. +The focus is on automation with software and data, +but not on any particular software or data. +Some of these points may seem too basic, +especially for experienced programmers, +but in 20+ years of programming +I've never seen anybody lay out these basic points in quite this way. +I hope it's useful.
+++"automatos" from the Greek: "acting of itself"
+
Automation has two key aspects:
+The second part is more visible, +and tends to get more attention, +but the first part is at least as important. +While automation makes much of the modern world possible, +it is not new, +and there are serious pitfalls to avoid. +No system is completely automatic, +so it's best to think of automation on a spectrum, +and to start thinking about automation +at the beginning of a new project.
+To my mind, the word "automation" brings images of car factories, +with conveyor belts and robotic arms moving parts and welding them together. +Soon they might be assembling self-driving ("autonomous") cars. +Henry Ford is famous for making cars affordable +by building the first assembly lines, +long before there were any robots. +The essential steps for Ford were standardizing the inputs and the processes +to get from raw materials to a completed car. +The history of the 20th century is full of examples of automation +in factories of all sorts.
+Automation was essential to the Industrial Revolution, +but it didn't start then. +We can look to the printing press. +We can look to clocks, which regimented lives in monasteries and villages. +We can think of recipes, textiles, the logistics of armies, +advances in agriculture, banking, the administration of empires, +and so on. +The scientific revolution was built on repeatable experiments +published in letters and journal articles. +I think that the humble checklist is also an important relative of automation.
+Automation is not new, +but it's an increasingly important part +of our work and our lives.
+Software is almost always written as source code in text files +that are compiled and/or interpreted as machine code +for a specific set of hardware. +Software can drive machines of all sorts, +but a lot of software automation stays inside the computer, +working on data in files and databases, +and across networks. +We'll be focused on this kind of software automation, +transforming data into data.
+The interesting thing about this is that source code is a kind of data, +so there are software automation workflows +that operate on data that defines software. +The upshot is that you can have automation that modifies itself. +Doing this on a large scale introduces a lot of complexity, +but doing it on a small scale can be a clean solution to certain problems.
+Another interesting thing about software is that +once we've solved an automation problem +we can copy that solution and apply it again and again +for almost zero cost. +We don't need to build a new factory or a new threshing machine. +We can just download a program and run it. +Henry Ford could make an accurate estimate +of how long it would take to build a car on his assembly line, +but software development is not like working on the assembly line, +and estimating time and budget for software development is notoriously hard. +I think this is because software developers +aren't just executing automation, +they're building new automation for each new project.
+Although we talk about "bit rot", +and software does require maintenance of a sort, +software doesn't break down or wear out +in the same ways that physical machines do. +So while the Industrial Revolution eliminated many jobs, +it also created different jobs, +building and maintaining the machines. +It's not clear that software automation will work the same way.
+Software automation is special because it can operate on itself, +and once complete can be cheaply copied. +Software development is largely about building automated systems of various sorts, +usually out of many existing pieces. +We spend most of our time building new systems, +or modifying an existing system to handle new inputs, +or adapting existing software to a new use case.
+++To err is human; to really foul things up requires a computer.
+
An obvious danger of automation is that +machines are faster than humans, +so broken automation can often do more damage +more quickly than a human can. +A related problem is that humans usually have much more +context and depth of experience, +which we might call "common sense", +and a wider range of sensory inputs than most automated systems. +This makes humans much better at recognizing +that something has gone wrong with a process +and that it's time to stop.
+New programmers soon learn that a simple program +that performs perfectly when the input is in exactly the right format, +becomes a complex program once it's updated to handle +a wide range of error conditions. +In other words, it's almost always much harder +to build automation that can gracefully handle errors and problems +than it is to automate just the "happy path". +Old programmers have learned through bitter experience +that it's often practically impossible to predict +all the things that can go wrong with an automated system in practice.
+++I suppose it is tempting, if the only tool you have is a hammer, +to treat everything as if it were a nail. +-- Abraham Maslow
+
A less obvious danger of automation comes from the sameness requirement. +When you've built a great piece of automation, +perfectly suited to inputs of a certain type, +it's very tempting to apply that automation more generally. +You start paying too much attention to how things are the same, +and not enough attention to their differences. +You may begin to ignore important differences. +You may surrender your common sense and good judgment, +to save yourself the work of changing the automated system or making an exception.
+Bureaucracies are a form of automation. +Everyone has had a bad experience +filling out some form that ignores critical information, +and with some bureaucrat who would not apply common sense and make an exception.
+Keep all this in mind as you build automated systems: +a broken machine can do a lot of damage very quickly, +and a system built around bad assumptions +can do a lot of hidden damage.
+Let's consider a simple case of automation with software, +and build from the most basic sort of automation +to a full-fledged system.
+Say you have a bunch of text files in a directory, +each containing minutes from meetings that we had together over the years. +You can remember that I talked about a particular software package +that might solve a problem that you just discovered, +but you can't remember the name.
+The first thing you try is to just search the directory. +On a Mac you would open the Finder, +navigate to the directory, +and type "James" into the search bar. +Unfortunately that gives too many results: +all the files with the minutes for a meeting where I said something.
+The next thing to do is double-click some text files, +which would open them in the Text Edit program, +and skim them. +You might get lucky!
+You know that the meeting was in 2019, +so you can try and filter for files modified in that year. +Unfortunately the files have been updated at different times, +so the file dates aren't useful.
+Now if each file was named with a consistent pattern, +including the meeting date, +then it would be simple to filter for files with "2019" in the name. +This isn't automation, +but it's the first step in the right direction. +Consistent file names are one way to make inputs the same +so that you can process them in the same way.
+Let's say it works: +you filter for files from 2019 with "James" in them, +skim a few, +and find a note where I recommended using Pandoc +to convert between document formats. +Mission accomplished!
+Next week you need to do something very similar: +Becky mentioned a website where you can find an important dataset. +It's basically the same problem with different inputs. +If you remember exactly what you did last time, +then you can get the job done quickly. +As the job gets more complicated and more distant in time, +and as you find yourself doing similar tasks more often, +it's nice to have notes about what you did and how you did it.
+If I'm using a graphical user interface (GUI) +then for each step I'll note +the program I used, +and the menu item or button I clicked, +e.g. "Preferences > General > Font Size", +or "Search" or "Run". +If I'm using a command-line interface (CLI) +then I'll copy-paste the commands into my notes.
+I often keep informal notes like this in a text file +in the relevant directory. +I name the file "notes.txt". +A "README" file is similar. +It's used to describe the contents of a directory, +often saying which files are which, +or what the column headers for a given table mean.
+Often the task is more complicated +and requires one or more pieces of software that I don't use every day. +If there's relevant documentation, +I'll put a link to it in my notes, +and then a short summary of exactly what I did.
+In this example +I look in the directory of minutes and see my "notes.txt" file. +I read that and remember how I filtered on "2019" and searched for "James". +This time I filter on "2020" and search for "Becky", +and I find the website for the dataset quickly enough.
+As a rule of thumb, +it might take you three times longer +to find your notes file, +write down the steps you took, +and provide a short description, +than it would to just do the job without taking notes. +When you're just taking notes for yourself, +this often feels like a waste of time +(you'll remember, right?!), +and sometimes it is a bit of a waste. +If you end up using your notes +to help with similar tasks in the future, +then this will likely be time well spent.
+As a rule of thumb, +it might take three times longer +to write notes for a broader audience +than notes for just yourself. +This is because you need to take into account +the background knowledge of your reader, +including her skills and assumptions and context, +and especially the possible misunderstandings +that you can try to avoid with careful writing. +I often start with notes for just myself +and then expand them for a wider audience only when needed.
+When tasks get more complicated +or more important +then informal notes are not enough. +The next step on the spectrum of automation is the humble checklist.
+The most basic checklists are for making sure that each item has been handled. +Often the order isn't important, +but lists are naturally ordered from top to bottom, +and in many cases that order is useful. +For example, +my mother lays out her shopping lists +in the order of the aisles in her local grocery store, +making it easier to get each item and check it off +without skipping around and perhaps having to backtrack.
+I think of a checklist as a basic form of automation. +It's like a recipe. +It should lay out the things you need to start, +then proceed through the required steps +in enough detail that you can reproduce them. +In some sense, +by using the checklist you are becoming the "machine". +You are executing an algorithm +that should take you from the expected inputs to the expected output.
+Humble as the checklist is, +there's a reason that astronauts, pilots, and surgical teams +live by their checklists. +Even when the stakes are not so high, +it's often nice to "put your brain on autopilot" +and just work the checklist +without having to remember and reconsider the details of each step.
+A good checklist is more focused than a file full of notes. +A checklist has a goal at the end. +It has specific starting conditions. +The steps have been carefully considered, +so that they have the proper sequence, +and none are missing. +Perhaps most importantly, +a checklist helps you break a complex task down +into simple parts. +If one of the parts is still too complex, +then break it down again +into a nested checklist +(really a sort of tree structure).
+Checklists sometimes include another key element of automation: conditionals. +A shopping list might say +"if there's a sale on crackers, then buy three boxes". +If-then conditions let our automated systems adapt to circumstances. +The "then" part is just another step, +but the "if" part is a little different. +It's a test to determine whether a condition holds. +We almost always want the result of the test +to be a simple True or False. +Given a bunch of inputs, +some of which pass the test and some of which fail it, +we can think of the test as determining some way in which +all the things that pass are the same +and all the things that fail are the same. +Programmers will also be familiar with more complex conditionals +such as if-then-else, if-elseif-else, and "case", +which divide process execution across multiple "branches".
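+As a preview of the scripts we will meet below, such a conditional step might look like this in the shell (a sketch, reusing the meeting-minutes example):
+#!/bin/sh
+# A conditional step: report whether the search found anything.
+if grep -q "James" *2019*; then
+    echo "Found minutes from 2019 that mention James"
+else
+    echo "Nothing found for 2019 - try another year"
+fi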
+As a rule of thumb, +turning notes into a checklist will likely take +at least three times as long +as simply writing the notes. +If the checklist is for a wider audience, +expect it to take three times as long to write, +for the same reasons mentioned above for notes.
+If a task is simple +and I can hold all the steps in my head, +and I can finish it in one sitting without distractions, +then I won't bother with a checklist. +But more and more I find myself writing myself a checklist +before I begin any non-trivial tasks. +I use bullet points in my favourite text editor, +or sometimes the Notes app on my iPhone. +I lay out the steps in the expected order, +and I check them off as I go. +Sometimes I start making the checklist days before I need it, +so I have lots of time to think about it and improve it. +If there's a job that I'm worried about, +breaking it down into smaller pieces +usually helps to make the job feel more manageable. +Actually, I try to start every workday +by skimming my (long) To Do list, +picking the most important tasks, +and making a checklist for what I want to get done +by quitting time.
+"Checkscript" is a word that I think I made up, +based on insights from a couple of sources, +primarily this blog post on +"Do-nothing scripting: the key to gradual automation" +This is where "real" automation kicks in, +writing "real" code and stuff, +but hopefully you'll see that it's just one more step +on the spectrum of automation that I'm describing.
+The notes and checklists we've been discussing +are just text in your favourite text editor. +A checkscript is a program. +It can be written in whatever programming language you prefer. +I'll give examples in Posix Shell, +but that blog post uses Python, +and it really doesn't matter. +You start with a checklist +(in your mind at least). +The first version of your program +should just walk you through your checklist, +one step at a time. +That's it.
+Here's a checkscript based on the example above.
+It just prints the first step (echo), +waits for you to press Enter (read), +then prints the next step, and so on.
#!/bin/sh
+
+echo "1. Use Finder to filter for files with '2019' in the name"
+read -p "Press enter to continue"
+
+echo "2. Use finder to search file content for 'James'"
+read -p "Press enter to continue"
+
+echo "3. Open files in Text Edit and search for 'James'"
+read -p "Press enter to continue"
+
+echo "Done!"
+
So far this is just a more annoying way to use a checklist. +The magic happens once you break the steps down into small enough pieces +and realize that you know how to tell the computer +to do some of the steps +instead of doing them all yourself.
+For example,
+you know that the command-line tool grep
+is used for searching the contents of files,
+and that you can use "fileglob"s to select
+just the files that you want to search,
+and that you can send the output of grep
+to another file to read in your favourite text editor.
+Now you know how to automate the first two steps.
+The computer can just do that work without waiting for you:
#!/bin/sh
+
+grep "James" *2019* > search_results.txt
+
+echo "1. Open 'search_results.txt' in Text Edit and search for 'James'"
+read -p "Press enter to continue"
+
+echo "Done!"
+
Before, we were using the Finder,
+and it is possible to write code to tell the Finder
+to filter and search for files.
+The key advantage of grep
here
+is that we send the search results to another file
+that we can read now or save for later.
This is also a good time to mention the advantage of text files
+over word processor files.
+If the minutes were stored in Word files, for example,
+then Finder could probably search them
+and you could use Word to read them,
+but you wouldn't be able to use grep
+or easily output the results to another file.
+Unix tools such as grep
treat all text files the same,
+whether they're source code or meeting minutes,
+which means that these tools work pretty much the same on any text file.
+By keeping your data in Word
+you restrict yourself to a much smaller set of tools
+and make it harder to automate your work
+with simple scripts like this one.
Even if you can't get the computer +to run any of the steps for you automatically, +a checkscript can still be useful +by using variables instead of repeating yourself:
#!/bin/sh
+
+FILE_PATTERN="*2019*"
+FILE_CONTENTS="James"
+
+echo "1. Use Finder to filter for files with '${FILE_PATTERN}' in the name"
+read -p "Press enter to continue"
+
+echo "2. Use finder to search file content for '${FILE_CONTENTS}'"
+read -p "Press enter to continue"
+
+echo "3. Open files in Text Edit and search for '${FILE_CONTENTS}'"
+read -p "Press enter to continue"
+
+echo "Done!"
+
Now if I want to search for "Becky" +I can just change the FILE_CONTENTS variable in one place. +I find this especially useful for dates and version numbers.
+This is pretty simple for a checkscript, +with very few steps. +A more realistic example would be +if there were many directories containing the minutes of many meetings, +maybe in different file formats +and with different naming conventions. +In order to be sure that we're searching all of them +we might need a longer checkscript.
+Writing and using a checkscript instead of a checklist +will likely take (you guessed it) about three times as long. +But the magic of the checkscript +is in the title of the blog post I mentioned: +"gradual automation". +Once you have a checkscript, +you can run through it all manually, +but you can also automate bits and pieces of the task, +saving yourself time and effort next time.
+A "script" is a kind of program that's easy to edit and run. +There are technical distinctions to be made +between "compiled" programs and "interpreted" programs, +but they turn out to be more complicated and less helpful than they seem at first. +Technically, a checkscript is just a script +that waits for you to do the hard parts. +In this section I want to talk about "fully automated" or "standalone" scripts +that you just provide some input and execute.
+Most useful programs are useful +because they call other programs (in the right ways). +I like shell scripts because they're basically just +commands that are copied and pasted +from work I was doing on the command-line. +It's really easy to call other programs.
+To continue our example,
+say that our minutes were stored in Word files.
+There are Python libraries for this,
+such as python-docx.
+You can write a little script using this library
+that works like grep
+to search for specified text in selected files,
+and output the results to a search results file.
As you add more and more functionality to a script +it can become unwieldy. +Scripts work best when they have a simple "flow" +from beginning to end. +They may have some conditionals and some loops, +but once you start seeing nested conditionals and loops, +then your script is doing too much. +There are two main options to consider:
+The key difference between a checkscript and a "standalone" script +is handling problems. +A checkscript relies on you to supervise it. +A standalone script is expected to work properly without supervision. +So the script has to be designed to handle +a wider range of inputs +and fail gracefully when it gets into trouble. +This is a typical case of the "80% rule": +the last 20% takes 80% of the time. +As a rule of thumb, +expect it to take three times as long +to write a script that can run unsupervised +as it takes you to write a checkscript +that does "almost" the same thing.
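+As a small sketch of what "fail gracefully" can mean in practice (the script and file names are illustrative):
+#!/bin/sh
+# Exit on any error and treat unset variables as errors.
+set -eu
+
+# Require both arguments, printing a usage message if either is missing.
+PATTERN="${1:?Usage: search.sh PATTERN YEAR}"
+YEAR="${2:?Usage: search.sh PATTERN YEAR}"
+
+# Fail early, with a clear message, if there are no minutes for that year.
+ls *"${YEAR}"* >/dev/null 2>&1 || {
+    echo "No minutes found for ${YEAR}" >&2
+    exit 1
+}
+
+grep "${PATTERN}" *"${YEAR}"* > "search_results_${YEAR}.txt" \
+    || echo "No lines matched '${PATTERN}' in the ${YEAR} minutes" >&2
+echo "Any results are in search_results_${YEAR}.txt"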
+When your script needs nested conditionals and loops, +then it's probably time to reach for a programming language +that's designed to write code "in the large". +Some languages such as Python can make a pretty smooth transition +from a script in a single file +to a set of files in a module, +working together nicely. +You might also choose another language +that can provide better performance or efficiency.
+It's not just the size and the logical complexity of your script, +consider its purpose. +The specialized tools that I have in mind +have a clear purpose that helps guide their design. +This also makes them easier to reuse +across multiple projects.
+I often divide my specialized tools into two parts: +a library and a command-line interface. +The library can be used in other programs, +and contains the most distinctive and important functionality. +But the command-line interface is essential, +because it lets me use my specialized tool +in the shell and in scripts, +so I can build more automation on top of it.
+Writing a tool in Java or C++ or Rust +usually takes longer than a script in shell or Python +because there are more details to worry about +such as types and efficient memory management. +In return you usually get more reliability and efficiency. +But as a rule of thumb, +expect it to take three times as long +to write a specialized tool +as it would to "just" write the script. +On the other hand, +if you already have a script that does most of what you want, +and you're already familiar with the target you are moving to, +then it can be fairly straightforward to translate +from the script to the specialized tool. +That's why it's often most efficient +to write a prototype script first, +do lots of quick experiments to explore the design space, +and when you're happy with the design +then start on the "production" version.
+The last step in the spectrum of automation
+is to bring together all your scripts
+into a single "workflow".
+My favourite tool for this is the venerable
+Make.
+A Makefile
is essentially a bunch of small scripts
+with their input and output files carefully specified.
+When you ask Make to build a given output file,
+it will look at the whole tree of scripts,
+figure out which input files are required to build your requested output file,
+then which files are required to build those files,
+and so on until it has determined a sequence of steps.
+Make is also smart enough to check whether some of the dependencies
+are already up-to-date,
+and can skip those steps.
+Looking at a Makefile
you can see everything
+broken down into simple steps
+and organized into a tree,
+through which you can trace various paths.
+You can make changes at any point,
+and run Make again to update your project.
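+To make that concrete, here is a minimal sketch of a Makefile for the search example above (GNU Make syntax; the file names are illustrative, and recipe lines must start with a tab):
+# Rebuild the search results whenever any of the 2019 minutes change.
+search_results.txt: $(wildcard *2019*.txt)
+	grep "James" *2019*.txt > $@
+
+.PHONY: clean
+clean:
+	rm -f search_results.txt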
I've done this all so many times
+that now I often start with a Makefile
in an empty directory
+and build from there.
+I try experiments on the command line.
+I make notes.
+I break the larger task into parts with a checklist.
+I automate the easy parts first,
+and leave some parts as manual steps with instructions.
+I write little scripts in the Makefile
.
+I write larger scripts in the src/
directory.
+If these get too big or complex,
+I start thinking about building a specialized tool.
+(And of course, I store everything in version control.)
+It takes more time at the beginning,
+but I think that I usually save time later,
+because I have a nice place to put everything from the start.
In other words, I start thinking about automation +at the very beginning of the project, +assuming from the start that it will grow, +and that I'll need to go back and change things. +With a mindset for automation, +from the start I'm thinking about how +the inputs I care about are the same and different, +which similarities I can use for my tests and code, +and which differences are important or unimportant.
+In the end, my project isn't ever completely automated. +It doesn't "act of itself". +But by making everything clear and explicit +I'm telling the computer how to do a lot of the work +and other humans (or just my future self) +how to do the rest of it. +The final secret of automation, +especially when it comes to software and data, +is communication: +expressing things clearly for humans and machines +so they can see and do exactly what you did.
+By: James Overton
+By "scientific computing" we mean +using computers to help with key aspect of science +such as data collection, cleaning, interpretation, analysis, and visualization. +Some people use "scientific computing" to mean something more specific, +focusing on computational modelling or computationally intensive analysis. +We'll be focusing on more general and day-to-day topics: +how can a scientist make best use of a computer +to do their work well?
+These three things apply to lots of fields, +but are particularly important to scientists:
+It should be no surprise that +automation can help with all of these. +When working properly, computers make fewer mistakes than people, +and the mistakes they do make are more predictable. +If we're careful, our software systems can be easily reproduced, +which means that an entire data analysis pipeline can be copied +and run by another lab to confirm the results. +And scientific publications are increasingly including data and code +as part of the review and final publication process. +Clear code is one of the best ways to communicate detailed steps.
+Automation is critical to scientific instruments and experiments, +but we'll focus on the data processing and analysis side: +after the data has been generated, +how should you deal with it?
+Basic information management is always important:
+More advanced data management is part of this course:
+Some simple rules of thumb can help reduce complexity and confusion:
+When starting a new project, +make a nice clean new space for it. +Try for that "new project smell".
+It's not always clear when a project is really "new" +or just a new phase of an old project. +But try to clear some space to make a fresh start.
+A lot of data analysis starts with a reference data set. +It might be a genome or a proteome. +It might be a corpus. +It might be a set of papers or data from those papers.
+Start by finding that data
+and selecting a particular version of it.
+Write that down clearly in your notes.
+If possible, include a unique identifier such as a (persistent) URL or DOI.
+If that's not possible, write down the steps you took.
+If the data isn't too big,
+keep a copy of it in your fresh new project directory.
+If the data is a bit too big,
+keep a compressed copy in a zip
or gz
file.
+A lot of software is perfectly happy to read directly from compressed files,
+and you can compress or uncompress data using piped commands in your shell or script.
+If the data is really too big,
+then be extra careful to keep notes
+on exactly where you can find it again.
+Consider storing just the
+hashes
+of the big files,
+so you can confirm that they have exactly the same contents.
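+A minimal sketch of that last point (sha256sum ships with GNU coreutils; on a Mac you might use shasum -a 256 instead, and the file names here are illustrative):
+# Record checksums of the big reference files once, alongside your notes.
+sha256sum reference_genome.fa.gz proteome.fa.gz > reference_checksums.txt
+
+# Later, or on another machine, confirm the files have exactly the same contents.
+sha256sum -c reference_checksums.txt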
If you know from the start +that you will need to compare your results with someone else's, +make sure that you're using the same reference data +that they are. +This may require a conversation, +but trust me that it's better to have this conversation now than later.
+It's much easier to think about processes +that flow in one direction. +Branches are a little trickier, but usually fine. +The real trouble comes with loops. +Once a process loops back on itself +it's much more difficult to reason about what's happening. +Loops are powerful, +but with great power comes great responsibility. +Keep the systems you design +as simple as possible +(but no simpler).
+In practical terms:
+It's very tempting: +you could automate this step, +or you could just do it manually. +It might take three times as long to automate it, right? +So you can save yourself some precious time +by just opening Excel and "fixing" things by hand.
+Sometimes that bet will pay off, +but I lose that bet most of the time. +I tend to realize my mistake only at the last minute. +The submission deadline is tomorrow +but the core lab "fixed" something +and they have a new version of the dataset +that we need to use for the figures. +Now I really don't have time to automate, +so I'm up late clicking through Excel again +and hoping that I remembered to redo +all the changes that I made last time.
+Automating the process would have actually saved me time, +but more importantly it would have avoided a lot of stress. +By now I should know that the dataset +will almost certainly be revised at the last minute. +If I have the automation set up, +then I just update the data, +run the automation again, +and quickly check the results.
+Tests are another thing that take time to implement.
+One of the key benefits to tests is (again) communication. +When assessing or trying out some new piece of software +I often look to the test files to see examples +of how the code is really used, +and the shape of the inputs and outputs.
+There's a spectrum of tests +that apply to different parts of your system:
+Tests should be automated. +The test suite should either pass or fail, +and if it fails something needs to be fixed +before any more development is done. +The automated test suite should run +before each new version is committed to version control, +and ideally more often during development.
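+Here is a minimal sketch of what such an automated check might look like for a small data pipeline (every name here is illustrative; the point is only that the script exits non-zero on the first failure, so it can gate commits or a CI job):
+#!/bin/sh
+# Stop at the first failing check.
+set -e
+
+# 1. The pipeline should run end to end on a tiny example input.
+./run_pipeline.sh test/small_input.csv test/output.csv
+
+# 2. The output should match a previously reviewed "expected" file.
+diff test/expected_output.csv test/output.csv
+
+echo "All tests passed"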
+Tests come with costs:
+The first is obvious but the other two often more important. +A slow test suite is annoying to run, +and so it won't get run. +A test suite that's hard to update won't get updated, +and then failures will be ignored, +which defeats the entire purpose.
+I tend to forget how bad a memory I have. +In the moment, when I'm writing brilliant code +nothing could be more obvious than the perfect solution +that is pouring forth from my mind all over my keyboard. +But when I come back to that code weeks, months, or years later, +I often wonder what the heck I was thinking.
+We think about the documentation we write as being for other people, +but for a lot of small projects +it's really for your future self. +Be kind to your future self. +They may be even more tired, even more stressed than you are today.
+There's a range of different forms of documentation, +worth a whole discussion of its own. +I like this four-way distinction:
+You don't need all of these for your small project, +but consider a brief explanation of why it works the way it does +(aimed at a colleague who knows your field well), +and some brief notes on how-to do the stuff this project is for. +These could both go in the README of a small project.
+In this lesson, we will take a look at the generative capabilities of LLMs in general and ChatGPT in particular, to try and get a beginning sense of how to leverage them to enhance ontology curation workflows.
+The goal of the lesson is to give a mental model of what ChatGPT and LLMs are used for (ignoring details of how they work), contextualise the public discourse a bit, and then move on to looking at some concrete examples of their potential for improving curation activities.
+To achieve this we engaged in a dialog with ChatGPT to generate almost the entire content of the lesson. The lesson authors provided the general "structure" of the lesson to ChatGPT as a series of prompts, and got ChatGPT to provide the content. This content is obviously not as good as it could have been if it had been created by a human with infinite resources, but we hope it does get the following points across:
+We believe that, from a user perspective, prompt engineering will be the most important skill that needs to be learned when dealing with generative AI - not just with ChatGPT (which generates text), but also with tools that generate images from text, such as DALL-E or Midjourney - so this is what we will focus on. In the long term, applications like Monarch's OntoGPT will do some of the heavy lifting around writing perfect prompts, but it seems pretty clear that some basic knowledge of prompt engineering will be useful, or even necessary, for a long time to come.
+For a reference of effective ChatGPT prompts for ontology development see here.
+Note: +- ChatGPT is rapidly evolving. The moment we add an answer, it will probably be outdated. For example, I created the first version of this tutorial on April 17th 2023. On May 27th, almost all answers ChatGPT is giving are completely different from the ones given in the first round. This is also important to remember when building applications around ChatGPT. +- Note: https://open-assistant.io/chat is free and can be used to follow this tutorial instead of ChatGPT.
+Prompts
++ ++We use quote syntax with the prompt icon to indicate a concrete prompt for ChatGPT
+
Comments
++ ++We use quote syntax with the comment icon to indicate a comment by the author
+
Replies by ChatGPT
+Replies are given in normal text form. All text after the table of contents, apart from comments, prompts and the section on executable workflows are generated by ChatGPT.
++ ++None of the text in this section is generated with ChatGPT.
+
In essence, an LLM takes as an input a piece of text, and returns text as an output. A "prompt" is a piece of text that is written by an agent. This can be a human, or a software tool, or a combination of the two. In most cases, a human agent will pass the prompt to a specialised tool that pre-processes the prompt in certain ways (like translating it, adding examples, structuring it and more) before passing it to the large language model (LLM). For example, when a chatbot tool like ChatGPT receives a prompt, it processes the prompt in certain ways, then leverages the trained LLM to generate the text (which is probably post-processed) and passes it back to the human agent.
+ +There are an infinite number of possible tools you can imagine following this rough paradigm. Monarch's own ontogpt, for example, receives the prompt from the human agent, then augments the prompt in a certain way (by adding additional instructions to it) before passing the augmented prompt to an LLM like gpt3.5 (or lately even gpt4), which generates an instance of a curation schema. This is a great example of an LLM generating not only human-readable text, but structured text. Another example of this is to ask an LLM to generate, for example, a SPARQL query to obtain publications from Wikidata.
+Given the wide range of applications LLMs can serve, it is important to get a mental model of how these can be leveraged to improve our ontology and data curation workflows. It makes sense for our domain (semantic engineering and curation) to distinguish four basic models of interacting with LLMs (which are technically not much different):
+Using LLMs as advisors has a huge number of creative applications. An advisor in this sense is a machine that "knows a lot" and helps you with your own understanding of the world. +Large language models trained on a wide range of inputs are particularly interesting in this regard because of the immense breadth of their knowledge (rather than depth), which is something that can be difficult to get from human advisors. +For example, the authors of this article have used ChatGPT and other LLM-based chatbots to help with understanding different domains, and how they might relate to knowledge management and ontologies in order to give specific career advice or to prepare for scientific panel discussions. For ontology curators, LLMs can be used to generate arguments for certain classification decisions (like a disease classification) or even suggest a mapping.
+Using LLMs as assistants is probably the most important use of LLM-based tools at the moment, which includes aspects like summarising texts and generating sometimes boring, yet important, creative work (documentation pages, tutorials, blog posts etc). It is probably not a good idea, at least as of May 2023, to defer to LLM-based tools to classify a term in an ontology, for example because of their tendency to hallucinate. Despite many arguments to the contrary, LLMs are not databases. They are programs to generate text.
+Using LLMs to extract information is, similar to "LLMs as assistants", also about automating certain tasks, but the endpoint is not a human but a software program. It is the most important basic model of LLMs for us curators and software engineers to understand, because it is, in essence, the one that threatens our current work-life the most: what happens if LLMs become better at extracting structured knowledge from papers (or, similarly, at generating software code for user stories) than us? It is important that this thought is not ignored out of fear, but approached with a realistic and positive mindset.
+Training. Apart from the fact that LLMs take text as an input and return text as an output, it is important to be aware of how they are trained.
+ +The basic technique for training is called "next token prediction". In essence, tokens in the text +such as words or phrases are masked out by the training function and the LLM is trained to correctly predict these masked out phrases given the previous words in the sentence (or paragraph). +The corpus used for this training ranges from specialised databases all the way to a large chunk of the publicly accessible textual information on the internet. +Again, it is important to realise that text does not have to be composed of natural language sentences - learning how to generate programming code and structured data works much the same way.
+In the following, we will ask ChatGPT or Open Assistant a few questions that are relevant to this lesson, which allows us to make certain points. The main point is that a lot of the work for preparing a useful tutorial can be deferred to ChatGPT as an assistant! +Many of the questions explore how these tools could relate to us biocurators and semantic engineers in particular, but none of this should be viewed as written in stone. +The most important takeaway from this lesson is that you should play with these tools yourself!
++ ++Can you explain in 3 simple sentences what an LLM is and how it works?
+[Note this question was asked to [Open Assistant](https://open-assistant.io/chat/) instead of ChatGPT.]
An LLM (Langauge Modell) is a AI model that can be trained to perform specific tasks such as speech recognition, text generation or machine translation. It's called a "model" because it is being trained on large sets of data to predict the next step based on previously learned relationships between inputs and outputs. +LLMs are designed to recognize patterns in human language by processing input data through multiple layers of artificial neural networks that simulate the way neurons communicate in the brain. The training process involves feeding vast amounts of text into the model so that it can learn to identify underlying structures within the language itself. This allows the model to make predictions about likely next words or phrases given previous context. By continually adjusting its internal weights and biases using gradient descent techniques, the LLM becomes increasingly proficient at generating coherent text based on statistical probabilities derived from its knowledge of the language.
++ ++What is interesting here: the LLM actually made a typo, or rather, three:
+ +Langauge Modell
- missing the first L (large), Langauge instead of Language and Modell with double L (which is German). Note that both words are spellied correctly later in the text.What are the 5 core concepts relevant for end-users to understand LLMs?
+
+ ++Describe the idea of prompt engineering in simple terms for an audience of domain experts
+
Prompt engineering is the process of carefully crafting input queries, or "prompts," to guide the behavior of a large language model (LLM) like GPT-4, ensuring that it provides useful, relevant, and accurate responses. As domain experts, you understand the importance of asking the right questions to extract valuable information. Similarly, with prompt engineering, the goal is to design questions or statements in a way that maximizes the LLM's ability to generate meaningful, high-quality output.
+Essentially, prompt engineering serves as a "communication bridge" between humans and the LLM, allowing the model to better understand the intent behind the query and generate the desired response. This can involve refining the wording, providing context, or specifying the format of the desired answer, among other techniques.
+By focusing on the art and science of prompt engineering, researchers and practitioners can better harness the potential of LLMs and overcome some of the challenges related to biases, incomplete knowledge, and misleading responses.
+++Providing an effective prompt to ChatGPT is an art form, and as resources like learnprompting.org testify, one that requires cultivation and care. The better you become at writing prompts, the more value you will draw from ChatGPT for your daily ontology curation tasks. How effective our prompts are depends on the exact language models used in ChatGPT (at the moment 3.5 or 4), which means that this lesson will keep evolving.
+
++In the excellent 60 minute OpenAI, Deeplearning.ai course on Prompt Engineering, we learn that we should be (1) writing clear an specific instructions and (2) give ChatGPT "time to think".
+
+ ++What are some potential applications of ChatGPT for ontology development?
+
ChatGPT can be a valuable tool for ontology development in various ways. Some potential applications include:
+By leveraging the capabilities of ChatGPT, ontology developers can streamline the development process, enhance the quality of their ontologies, and facilitate their adoption and use across various domains.
+ ++ ++A normal workflow of ontology curation is this:
++
+- A concept is required by a biocurator to describe a piece of data, for example, a disease concept, or an anatomical entity.
+- The biocurator requests the new concept (or term) from the ontology curator, providing a label and perhaps a scientific reference
+- The ontology curator then adds that concept to the ontology using a tool like Protege, adding a human-readable definition, synonyms and provenance metadata. Most importantly, they define the concept logically, often using a formalism such as OWL specifying super-classes, and other logical relationships
+- Lastly, the ontology curator runs a number of quality checks to ensure that the added concepts is logically coherent (e.g. the reasoner places it in the intended part of of the class hierarchy).
+How can ChatGPT be used concretely to help with that process?
+
ChatGPT can be incorporated into various stages of the ontology curation workflow to assist biocurators and ontology curators in streamlining and improving the process. Here's how ChatGPT can be used concretely at each step:
+By integrating ChatGPT into the ontology curation workflow, biocurators and ontology curators can benefit from its natural language understanding and generation capabilities, making the process more efficient and accurate. However, it is essential to keep in mind that ChatGPT should be used as a complementary tool and not as a replacement for human expertise or specialized software designed for ontology development and curation.
+ ++ ++What about ontology mappings?
+
LLMs like ChatGPT can be used to identify equivalent or similar concepts across different ontologies by analyzing concept labels, definitions, and other metadata. They can understand and process natural language descriptions and compare them to find potential matches or alignments between concepts. Specifically, LLMs can:
+By leveraging these capabilities, LLMs can assist ontology curators in the challenging task of concept matching, ultimately enhancing the ontology mapping and alignment process.
++ ++Can you provide an example of mapping a concept from the Human Phenotype Ontology to the Mammalian Phenotype Ontology?
+
Let's consider the concept "Abnormality of the ear" in HPO.
+Our goal is to find a matching or similar concept in the Mammalian Phenotype Ontology (MPO).
+In this case, both concepts describe abnormalities of the ear in their respective domains (human and mammalian). The labels and definitions are similar, indicating that these concepts might be considered equivalent or closely related.
+Thus, we can propose the following mapping:
+++COMMENT. The 27th May version of ChatGPT is giving a different answer:
+
Sure, let's consider an example for this mapping:
+Human Phenotype Ontology (HPO) is designed to classify and annotate the phenotypic abnormalities encountered in human disease. Each term in HPO describes a specific phenotypic abnormality, such as 'HP:0004322 - Short stature' which describes an abnormality of body height that is less than -2 standard deviations compared to the appropriate reference population.
+Mammalian Phenotype Ontology (MPO), on the other hand, is used in the annotation of data from mammalian phenotyping experiments, typically in model organisms like mice. Its equivalent term for short stature might be 'MP:0001265 - decreased body size' which describes a mammal with an overall smaller body size.
+So, in this case, 'HP:0004322 - Short stature' in HPO could be mapped to 'MP:0001265 - decreased body size' in MPO.
+Please note that this is a simplified example and the actual mapping process could be more complex, due to the nuanced differences between human and mammalian phenotypes and the structure of these ontologies. For actual mappings, it is essential to ensure the descriptions and definitions align appropriately, and the context of use is correctly considered.
+Also, there may not always be a one-to-one correspondence between the terms in different ontologies. Sometimes, one term in an ontology may correspond to multiple terms in another, or there may not be an equivalent term at all. This is why ontological mapping is a complex task requiring domain knowledge and careful consideration.
++ ++Can you express the above mapping in SSSOM?
+
The Simple Standard for Sharing Ontology Mappings (SSSOM) is a standard that enables sharing and comparing ontology mappings using a simple tabular format (TSV). Here's the example mapping from Human Phenotype Ontology (HPO) to Mammalian Phenotype Ontology (MPO) expressed in SSSOM format:
+subject_id subject_label predicate_id object_id object_label match_type confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology Lexical 0.90
+
In this case, I have used skos:closeMatch as the predicate to represent the close relationship between the two concepts. The match_type is "Lexical" since the mapping is based on similar labels and definitions. The confidence score is set to 0.90 to reflect a high degree of similarity between the concepts, but this value may vary depending on the actual similarity estimation method used.
+Please note that the SSSOM format includes metadata and other required fields in a TSV header. The provided example only focuses on the core mapping information. For a complete SSSOM file, you would need to include the appropriate header and metadata.
++ ++The SSSOM example you gave is using an old version of the SSSOM standard. In particular, match_type now is mapping_justification. Can you try to rewrite the mapping using the new standard?
+
Apologies for the confusion. Here's the example mapping from Human Phenotype Ontology (HPO) to Mammalian Phenotype Ontology (MPO) expressed in the updated SSSOM format:
+subject_id subject_label predicate_id object_id object_label mapping_justification confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology Lexical 0.90
+
In this revised example, the match_type field has been replaced with mapping_justification, which is now set to "Lexical" to indicate that the mapping is based on similar labels and definitions. The remaining fields are unchanged from the previous example.
++ ++mapping_justification values should be populated from a controlled vocabulary, called the Semantic Mapping Vocabulary (SEMAPV). You can look at the vocabulary in RDF here: https://raw.githubusercontent.com/mapping-commons/semantic-mapping-vocabulary/main/semapv.owl. Given that vocabulary, can you revise the value for the mapping_justification?
+
My apologies for the oversight. Based on the Semantic Mapping Vocabulary (SEMAPV), I will revise the mapping justification for the example. Here's the updated SSSOM mapping using the SEMAPV:
+subject_id subject_label predicate_id object_id object_label mapping_justification confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatch 0.90
+
In this updated example, I have replaced the previous mapping_justification value with semapv:LexicalMatch from the Semantic Mapping Vocabulary (SEMAPV) to represent the lexical similarity between the two concepts based on their labels and definitions. The remaining fields are unchanged.
++ ++You were hallucinating earlier. Here is the semapv vocabulary as a tsv:
+
IRI skos:prefLabel skos:definition dc:source skos:example rdfs:comment altLabel Parent
+ID A skos:prefLabel A skos:definition AI dc:source A skos:example A rdfs:comment A skos:altLabel SC %
+semapv:MappingActivity mapping activity A process that relates to the creation, confirmation, rejection or curation of a mapping. Matching is a mapping activity that results in the creating of a mapping; mapping review is an activity that results in the confirmation of a mapping.
+semapv:Matching matching process An process that results in a mapping between a subject and an object entity. The label of a subject entity matches to an exact synonym of an object entity. matching operation|matching task semapv:MappingActivity
+semapv:Mapping mapping A triple <s,p,o> comprising a subject entity s, an object entity o and a mapping predicate p. The subject entity NCI:C9305 is mapped to the object entity ICD10:C80.9 using the skos:relatedMatch mapping predicate.
+semapv:LexicalMatching lexical matching process A matching process based on a lexical comparison between one or more syntactic features of the subject with one or more syntactic features of the object. The label of a subject entity matches to an exact synonym of an object entity. semapv:Matching
+semapv:LogicalReasoning logical reasoning process A matching process based on the inferences made by a logical reasoner. Two classes across ontologies are determined equivalent by an OWL reasoner such as ELK. semapv:Matching
+semapv:CompositeMatching composite matching process A matching process based on multiple, possibly intertwined, matching approaches. An ontology matching tool determines that a subject should be mapped to an object by employing a range of techniques, including lexical, semantic and structural. semapv:Matching
+semapv:UnspecifiedMatching unspecified matching process A matching process based on an unspecified comparison. A mapping between a subject and an object was established, but it is no longer clear how or why. semapv:Matching
+semapv:SemanticSimilarityThresholdMatching semantic similarity threshold-based matching process A matching process based on a minimum threshold of a score from a comparison based on a semantic similarity algorithm. A match between a subject and an object entity was established because they had a Jaccard score higher than 0.8 based on the set of (common) superclasses. semapv:Matching
+semapv:LexicalSimilarityThresholdMatching lexical similarity threshold-based matching process A lexical matching process based on a minimum threshold of a score from a comparison based on a lexical similarity algorithm. A match between a subject and an object was established because they had a Levenshtein score higher than 0.8. semapv:Matching
+semapv:StructuralMatching structural matching process https://doi.org/10.1016/j.websem.2009.11.002 A match between a subject and an object was established because of the similarity of their structural features, e.g., the number of direct property of a class. Structural matching does not involve looking at "values" of properties. semapv:Matching
+semapv:InstanceBasedMatching instance-based matching process A matching process based on individual representations (or instances). https://doi.org/10.1007/978-3-642-38721-0 A match between a subject A and an object B was established because they share the same instances. semapv:Matching
+semapv:BackgroundKnowledgeBasedMatching background knowledge-based matching process A matching process that exploits background knowledge from external resources, commonly referred to as background knowledge resources. This approach is also known as indirect matching, BK-based matching or context-based matching. https://doi.org/10.1016/j.websem.2018.04.001 A match between a subject A and an object B was established because they appear equivalent under consideration of externally provided background knowledge. semapv:Matching
+semapv:MappingChaining mapping chaining-based matching process A matching process based on the traversing of multiple mappings. A match between a subject A and an object B was established because A was mapped to C, C was mapped to D and D was mapped to B. semapv:Matching
+semapv:MappingReview mapping review A process that is concerned with determining if a mapping “candidate” (otherwise determined) is reasonable/correct. A match between a subject A and an object B was established elsewhere, and a human reviewer determined that the mapping is true (or false) based on an independent evaluation. semapv:MappingActivity
+semapv:ManualMappingCuration manual mapping curation An matching process that is performed by a human agent and is based on human judgement and domain knowledge. A human curator determines that a subject should be mapped to an object by virtue of their domain expertise. semapv:Matching
+semapv:MatchingPreprocessing matching process pre-processing A preliminary processing of inputs prior to performing matching. semapv:MappingActivity
+semapv:Stemming stemming The process of reducing a word to its word stem. semapv:Normalization
+semapv:Lemmatization lemmatization The process of reducing the different forms of a word to one single form. https://doi.org/10.1007/978-3-642-38721-0 semapv:Normalization
+semapv:StopWordRemoval stop-word removal A preprocessing method transforming a string by removing a list of stop words. https://doi.org/10.1007/978-3-642-38721-0 The stop-words “of, and, is, a” are removed from the label field of the subject entity in the mapping. semapv:Normalization
+semapv:RegexRemoval regex removal A preprocessing method transforming a string by matching a regular expression and then removing that match. The regex match “[ -]phenotype” is removed from the label field of the subject entity in the mapping. semapv:Normalization
+semapv:RegexReplacement regex replacement A preprocessing method transforming a string by matching a regular expression and then replacing that match with a specified substitution string. The regex match “[ -]phenotype” is replaced by “-disease” for the label field of the subject entity in the mapping. semapv:Normalization
+semapv:MatchingPostProcessing matching process post-processing A process applied to a set of mappings, usually with the intention of changing it. semapv:MappingActivity
+semapv:ThresholdFiltering threshold filtering semapv:MatchingPostProcessing
+semapv:StableMarriageFiltering stable marriage filtering semapv:MatchingPostProcessing
+semapv:CardinalityFiltering cardinality filtering semapv:MatchingPostProcessing
+semapv:LogicalConsistencyFiltering logical consistency filtering semapv:MatchingPostProcessing
+semapv:Normalization normalization A family of preprocessing techniques for reducing strings to be compared to a common format. https://doi.org/10.1007/978-3-642-38721-0 semapv:MatchingPreprocessing
+semapv:CaseNormalization case normalization A normalization technique converting each alphabetic character in a string to their lower case counterpart. https://doi.org/10.1007/978-3-642-38721-0 The string 'Ontology' is normalised to 'ontology' (lower case). semapv:Normalization
+semapv:DiacriticsSuppression diacritics suppression A normalization technique replacing diactritic signs (e.g. accents, hats) with their most frequent replacements. https://doi.org/10.1007/978-3-642-38721-0 The string 'Protégé' is normalised to 'Protege'. semapv:Normalization
+semapv:BlankNormalisation blank normalisation A normalization technique replacing all blank characters, such as space, tabulation, carriage return (or sequences of these) into a single blank character. https://doi.org/10.1007/978-3-642-38721-0 The string 'ontology matching' (two spaces between 'ontology' and 'matching) becomes 'ontology matching' (one space). semapv:Normalization
+semapv:LinkStripping link stripping A normalization technique replacing specific links between words, such as apostrophes, dashes, underscore, etc into dashes or blanks. https://doi.org/10.1007/978-3-642-38721-0 The string 'alzheimer's' becomes 'alzheimers'. semapv:Normalization
+semapv:DigitSuppression digit suppression A normalization technique removing all numbers in a string. https://doi.org/10.1007/978-3-642-38721-0 The string 'ontology1' becomes 'ontology'. semapv:Normalization
+
+ ++Can you use this to pick the right mapping justification?
+
Apologies for the confusion earlier. Based on the provided SEMAPV vocabulary as a TSV, I will revise the mapping justification for the example. Here's the updated SSSOM mapping using the correct SEMAPV:
+subject_id subject_label predicate_id object_id object_label mapping_justification confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatching 0.90
+In this updated example, I have replaced the previous mapping_justification value with semapv:LexicalMatching from the provided Semantic Mapping Vocabulary (SEMAPV) to represent the lexical similarity between the two concepts based on their labels and definitions. The remaining fields are unchanged.
+
Please remember that a complete SSSOM file would require the appropriate header and metadata in addition to the core mapping information provided in the example.
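+COMMENT. For reference, a complete SSSOM TSV carries its metadata as a commented YAML header above the table. A minimal sketch, in which the mapping_set_id and license values are illustrative placeholders, might look like this:
+# curie_map:
+#   HP: http://purl.obolibrary.org/obo/HP_
+#   MP: http://purl.obolibrary.org/obo/MP_
+#   semapv: https://w3id.org/semapv/vocab/
+#   skos: http://www.w3.org/2004/02/skos/core#
+# mapping_set_id: https://example.org/mappings/hp-mp-example.sssom.tsv
+# license: https://creativecommons.org/licenses/by/4.0/
+subject_id subject_label predicate_id object_id object_label mapping_justification confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatching 0.90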
++ ++Can you add 4 more examples to that SSSOM table
+
Sure, here are 4 more examples of mappings between the Human Phenotype Ontology (HPO) and the Mammalian Phenotype Ontology (MPO) using SSSOM format:
+subject_id subject_label predicate_id object_id object_label mapping_justification confidence
+HP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatching 0.90
+HP:0000365 Hearing impairment skos:closeMatch MP:0006325 hearing loss semapv:LexicalMatching 0.92
+HP:0001382 Joint hypermobility skos:closeMatch MP:0004927 increased joint flexibility semapv:LexicalMatching 0.85
+HP:0001392 Hepatomegaly skos:closeMatch MP:0002764 enlarged liver semapv:LexicalMatching 0.88
+HP:0001250 Seizures skos:closeMatch MP:0002064 seizures semapv:LexicalMatching 0.95
+
These examples illustrate mappings between HPO and MPO concepts based on lexical matching, where the similarity between the labels and definitions is considered. Note that the confidence values are provided as an example and may differ in real-world scenarios.
++ ++Can you print only the table using markdown table syntax in a way that I can copy paste it
+
subject_id | +subject_label | +predicate_id | +object_id | +object_label | +mapping_justification | +confidence | +
---|---|---|---|---|---|---|
HP:0000598 | +Abnormality of the ear | +skos:closeMatch | +MP:0002161 | +abnormal ear morphology | +semapv:LexicalMatching | +0.90 | +
HP:0000365 | +Hearing impairment | +skos:closeMatch | +MP:0006325 | +hearing loss | +semapv:LexicalMatching | +0.92 | +
HP:0001382 | +Joint hypermobility | +skos:closeMatch | +MP:0004927 | +increased joint flexibility | +semapv:LexicalMatching | +0.85 | +
HP:0001392 | +Hepatomegaly | +skos:closeMatch | +MP:0002764 | +enlarged liver | +semapv:LexicalMatching | +0.88 | +
HP:0001250 | +Seizures | +skos:closeMatch | +MP:0002064 | +seizures | +semapv:LexicalMatching | +0.95 | +
+ ++Can you add 100 more examples, this time only providing a markdown table that I can copy paste, with no further comments around it?
+
Creating 100 mappings with accuracy requires domain expertise and a comprehensive understanding of the ontology mappings. However, I can provide you with 20 examples to help you understand the format. Please note that these examples may not be completely accurate and should not be used for any critical applications.
+subject_id | +subject_label | +predicate_id | +object_id | +object_label | +mapping_justification | +confidence | +
---|---|---|---|---|---|---|
HP:0000010 | +Megabladder | +skos:closeMatch | +MP:0000547 | +enlarged bladder | +semapv:LexicalMatching | +0.85 | +
HP:0000014 | +Renal agenesis | +skos:closeMatch | +MP:0000548 | +absent kidney | +semapv:LexicalMatching | +0.82 | +
HP:0000015 | +Bladder diverticulum | +skos:closeMatch | +MP:0000556 | +bladder diverticula | +semapv:LexicalMatching | +0.88 | +
HP:0000016 | +Urethral atresia | +skos:closeMatch | +MP:0000565 | +closed urethra | +semapv:LexicalMatching | +0.90 | +
HP:0000020 | +Renal duplication | +skos:closeMatch | +MP:0000571 | +duplex kidney | +semapv:LexicalMatching | +0.84 | +
HP:0000021 | +Hypospadias | +skos:closeMatch | +MP:0000573 | +abnormal penis morphology | +semapv:LexicalMatching | +0.76 | +
HP:0000022 | +Urethral stricture | +skos:closeMatch | +MP:0000574 | +urethral obstruction | +semapv:LexicalMatching | +0.86 | +
HP:0000025 | +Polycystic kidney dysplasia | +skos:closeMatch | +MP:0000576 | +polycystic kidney | +semapv:LexicalMatching | +0.92 | +
HP:0000026 | +Hyperechogenic kidneys | +skos:closeMatch | +MP:0000581 | +abnormal kidney echogenicity | +semapv:LexicalMatching | +0.81 | +
HP:0000027 | +Renal cortical microcysts | +skos:closeMatch | +MP:0000584 | +renal cortical cysts | +semapv:LexicalMatching | +0.89 | +
HP:0000028 | +Renal tubular atrophy | +skos:closeMatch | +MP:0000585 | +atrophic renal tubules | +semapv:LexicalMatching | +0.87 | +
HP:0000031 | +Renal hypoplasia | +skos:closeMatch | +MP:0000591 | +small kidney | +semapv:LexicalMatching | +0.79 | +
HP:0000032 | +Renal hypoplasia/aplasia | +skos:closeMatch | +MP:0000592 | +renal hypoplasia | +semapv:LexicalMatching | +0.91 | +
HP:0000033 | +Proximal tubulopathy | +skos:closeMatch | +MP:0000593 | +abnormal proximal tubule | +semapv:LexicalMatching | +0.75 | +
AT THIS POINT CHATGPT JUST STOPPED GENERATING CONTENT. | ++ | + | + | + | + | + |
+ ++Here are a few slots in a schema: quality, anatomical_entity and chemical_entity
+
quality:
+description: The property being measured, or changes in this property, for example, amount, level, increased amount, decreased concentration
+annotations:
+ prompt.example: amount, level, increased amount, decreased concentration
+range: Quality
+
+anatomical_entity:
+description: The anatomical location that the chemical entity is measured in
+range: AnatomicalEntity
+annotations:
+ prompt.example: liver, heart, brain, finger
+
+chemical_entity:
+description: The chemical entity that is being measured
+range: ChemicalEntity
+annotations:
+ prompt.example: lysine, metabolite
+
+ ++Can you create a YAML file with those three elements as keys, and extract the contents of the string "increased blood glucose levels" into as values to these keys? Output should be just a simple yaml file, like:
+
quality: concentration
+anatomical_entity: liver
+chemical_entity: lysine
+
The above tutorial was a fun case study using ChatGPT with GPT-4; about 95% of the content provided was generated by ChatGPT. While certainly not as good as it could be, it took a solid ontology engineer (@matentzn
) about 90 minutes to write this lesson, which would usually have cost him more than 8 hours.
It is clear that learning how to talk to AI - the process we refer to as "prompt engineering" - is going to be absolutely essential for ontology curators moving forward, perhaps as important as ontology modelling itself as LLMs improve and understand even complex languages like OWL better. I don't think there is any doubt that engaging in a good amount of play and study on this subject is both fun and hugely beneficial.
+All that said, perceiving LLMs through the lens of a chat bot leaves a lot of potential unexplored. For example, if ChatGPT (or LLMs in general) can generate structured data, why not implement this directly into our curation tools (like Protege)? Tools like GitHub Copilot are already making good programmers a lot more effective, but so far these tools focus on development environments where the majority of the generated content is text (e.g. software code), and not so much on heavily UI-driven ones like Protege.
+A lot of blog posts circulating recently on Twitter and LinkedIn have explored the potential of LLMs to generate RDF and OWL directly. It is already clear that LLMs can and will do this moving forward. For ontology curation specifically, we will need to develop executable workflows that fit into our general ontology curation process. As a first pass, some members of our community have developed OntoGPT. We will explore how to use OntoGPT in a future lesson.
+Update 27 May 2023: It seems that, following complaints about hallucinations, the chat part of ChatGPT has become a bit more cautious with database-like queries:
+ +(Current) Limitations:
+ + + + + + + +Participants will need to have access to the following resources and tools prior to the training:
+Description: How to contribute terms to existing ontologies.
+GitHub - distributed version control (Git) + social media for geeks who like to build code/documentation collaboratively.
+A Git repo consists of a set of branches each with a complete history of all changes ever made to the files and directories. This is true for a local copy you check out to your computer from GitHub or for a copy (fork) you make on GitHub.
+A Git repo typically has a master or main branch that is not edited directly. Changes are made by creating a branch from the master (a complete copy of the master + its history).
+You can copy (fork) any GitHub repo to some other location on GitHub without having to ask permission from the owners. If you modify some files in that repo, e.g. to fix a bug in some code, or a typo in a document, you can then suggest to the owners (via a Pull Request) that they adopt (merge) your changes back into their repo.
+If you have permission from the owners, you can instead make a new branch. For this training, we gave you access to the repository. See the Appendix for instructions on how to make a fork.
+ +Tip: you can easily obtain term metadata like OBO ID, IRI, or the term label by clicking the three lines above the Annotations box (next to the term name) in Protege, see screenshot below. You can also copy the IRI in markdown, which is really convenient for pasting into GitHub.
+ +See this example video on creating a new term request to the Mondo Disease Ontology:
+ + + +A README is a text file that introduces and explains a project. It is intended for everyone, not just the software or ontology developers. Ideally, the README file will include detailed information about the ontology, how to get started with using any of the files, license information and other details. The README is usually on the front page of the GitHub repository.
+ +The steps below describe how to make changes to an ontology.
+training-NV
)The instructions below are using the Mondo Disease Ontology as an example, but this can be applied to any ontology.
+Note: Windows users should open Protege using run.bat
+The Protégé interface follows a basic paradigm of Tabs and Panels. By default, Protégé launches with the main tabs seen below. The layout of tabs and panels is configurable by the user. The Tab list will have slight differences from version to version, and depending on your configuration. It will also reflect your customizations.
+To customize your view, go to the Window tab on the toolbar and select Views. Here you can customize which panels you see in each tab. In the tabs view, you can select which tabs you will see. You will commonly want to see the Entities tab, which has the Classes tab and the Object Properties tab.
+ +Note: if you open a new ontology while viewing your current ontology, Protégé will ask you if you'd like to open it in a new window. For most normal usage you should answer no. This will open in a new window.
+The panel in the center is the ontology annotations panel. This panel contains basic metadata about the ontology, such as the authors, a short description and license information.
+Before browsing or searching an ontology, it is useful to run an OWL reasoner first. This ensures that you can view the full, intended classification and allows you to run queries. Navigate to the Reasoner menu and start the ELK reasoner:
+ +For more details on why it is important to have the reasoner on when using the editors version of an ontology, see the Reasoning reference guide. But for now, you don't need a deeper understanding, just be sure that you always have the reasoner on.
+You will see various tabs along the top of the screen. Each tab provides a different perspective on the ontology. +For the purposes of this tutorial, we care mostly about the Entities tab, the DL query tab and the search tool. OWL Entities include Classes (which we are focussed on editing in this tutorial), relations (OWL Object Properties) and Annotation Properties (terms like, 'definition' and 'label' which we use to annotate OWL entities. +Select the Entities tab and then the Classes sub-tab. Now choose the inferred view (as shown below).
+ +The Entities tab is split into two halves. The left-hand side provides a suite of panels for selecting various entities in your ontology. When a particular entity is selected the panels on the right-hand side display information about that entity. The entities panel is context specific, so if you have a class selected (like Thing) then the panels on the right are aimed at editing classes. The panels on the right are customizable. Based on prior use you may see new panes or alternate arrangements. +You should see the class OWL:Thing. You could start browsing from here, but the upper level view of the ontology is too abstract for our purposes. To find something more interesting to look at we need to search or query.
+You can search for any entity using the search bar on the right:
+ +The search window will open on top of your Protege pane, we recommend resizing it and moving it to the side of the main window so you can view together.
+ +Here's an example search for 'COVID-19': +
+It shows results found in display names, definitions, synonyms and more. The default results list is truncated. To see full results check the 'Show all results option'. You may need to resize the box to show all results. +Double clicking on a result, displays details about it in the entities tab, e.g.
+ +In the Entities, tab, you can browse related types, opening/closing branches and clicking on terms to see details on the right. In the default layout, annotations on a term are displayed in the top panel and logical assertions in the 'Description' panel at the bottom.
+Try to find these specific classes:
+Note - a cool feature in the search tool in Protege is you can search on partial string matching. For example, if you want to search for ‘down syndrome’, you could search on a partial string: ‘do synd’.
+Note - if the search is slow, you can uncheck the box ‘Search in annotation values. Try this and search for a term and note if the search is faster. Then search for ‘shingles’ again and note what results you get.
+ +Changes made to the ontology can be viewed in GitHub Desktop.
+Before committing, check the diff. Examples of a diff are pasted below. Large diffs are a sign that something went wrong. In this case, do not commit the changes and ask the ontology editors for help instead.
+Example 1:
+ +NOTE: You can use the word 'fixes' or 'closes' in the description of the commit message, followed by the corresponding ticket number (in the format #1234) - these are magic words in GitHub; when used in combination with the ticket number, it will automatically close the ticket. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
+Note: 'Fixes' and "Closes' are case-insensitive.
+If you don't want to close the ticket, just refer to the ticket # without the word 'Fixes' or use 'Adresses'. The commit will be associated with the correct ticket but the ticket will remain open. NOTE: It is also possible to type a longer message than allowed when using the '-m' argument; to do this, skip the -m, and a vi window (on mac) will open in which an unlimited description may be typed.
+Click Commit to [branch]. This will save the changes to the cl-edit.owl file.
+Push: To incorporate the changes into the remote repository, click Publish branch.
+The instructions below are using the Mondo Disease Ontology as an example, but this can be applied to any ontology.
+ +Ontology terms have separate names and IDs. The names are annotation values (labels) and the IDs are represented using IRIs. The OBO foundry has a policy on IRI (or ID) generation (http://www.obofoundry.org/principles/fp-003-uris.html). You can set an ID strategy using the "New Entities" tab under the Protégé Preferences -- on the top toolbar, click the "Protégé dropdown, then click Preferences.
+ +Set your new entity preferences precisely as in the following screenshot of the New Entities tab.
+Note - you have been assigned an ID range in the Mondo idranges file - you should be able to find your own range assigned there.
+DIY (only if you know what you are doing!)
+To add your own ID ranges:
+Go into src/ontology
+create a branch
+Find and edit mondo-idranges.owl by adding the following:
+Datatype: idrange:10 #update this to next following integer from previous
+
+ Annotations:
+ allocatedto: "Your Name" #change to your name
+
+ EquivalentTo:
+ xsd:integer[>= 0806000 , <= 0806999]. #add a range of 999 above the previous integer
+
+Be sure to change "Your Name" to your actual name! And note that this value should almost always be an individual, and not an organization or group.
+create a pull request and add matentzn or nicolevasilevsky as a reviewer
+proceed to setting up as below:
+ +Specified IRI: http://purl.obolibrary.org/obo/
+Note - if you edit more than one ontology in Protege, you will need to update your Preferences for each ontology before you edit.
+User name: click Use supplied user name and enter your username in the field below
+Click Use Git user name when available
+In the ORCID field, add your ORCID ID (in the format 0000-0000-0000-0000)
+ +The current recommendation of the OBO Foundry Technical Working Group is that an editor who creates a new term SHOULD add a http://purl.org/dc/terms/contributor
annotation, set to the ORCID or GitHub username of the editor, and a http://purl.org/dc/terms/date
annotation, set to the current date.
You can have Protégé automatically add those annotations by setting your preferences to match the screenshot below, in the New entities metadata tab (under preferences).
+If you do not have an ORCID, register for for free here: https://orcid.org/
+ + +Before you start:
+make sure you are working on a branch - see quick guide here.
+make sure you have the editor's file open in Protege as detailed here.
+New classes are created in the Class hierarchy panel on the left.
+There are three buttons at the top of the class hierarchy view. These allow you to add a subclass (L-shaped icon), add a sibling class (c-shaped icon), or delete a selected class (x'd circle).
+Practice adding a new term:
+We will work on these two tickets:
+Search for the parent term 'hypereosinophilic syndrome' (see search guide if you are unsure how to do this).
+When you are clicked on the term in the Class hierarchy pane, click the add subclass button to add a child class to 'hypereosinophilic syndrome'
+A dialog will popup. Name this new subclass: migratory muscle precursor. Click "OK" to add the class.
Using Protégé you can add annotations such as labels, definitions, synonyms, database cross references (dbxrefs) to any OWL entity. The panel on the right, named Annotations, is where these annotations are added. Mondo, like most OBO ontologies, includes a pre-declared set of annotation properties. The most commonly used annotations are below.
+Note, most of these are bold in the annotation property list:
+ +Use this panel to add a definition to the class you created. Select the + button to add an annotation to the selected entity. Click on the annotation 'definition' on the left and copy and paste in the definition to the white editing box on the right. Click OK.
+Definition: A disorder characterized by episodes of swelling under the skin (angioedema) and an elevated number of the white blood cells known as eosinophils (eosinophilia). During these episodes, symptoms of hives (urticaria), fever, swelling, weight gain and eosinophilia may occur. Symptoms usually appear every 3-4 weeks and resolve on their own within several days. Other cells may be elevated during the episodes, such as neutrophils and lymphocytes. Although the syndrome is often considered a subtype of the idiopathic hypereosinophilic syndromes, it does not typically have organ involvement or lead to other health concerns.
+ + +Definitions in Mondo should have a 'database cross reference' (dbxref), which is a reference to the definition source, such as a paper from the primary literature or another database. For references to papers, we cross reference the PubMed Identifier in the format, PMID:XXXXXXXX. (Note, no space)
+To add a dbxref to the definition:
+We have seen how to add sub/superclasses and annotate the class hierarchy. Another way to do the same thing is via the Class description view. When an OWL class is selected in the entities view, the right-hand side of the tab shows the class description panel. If we select the 'vertebral column disease' class, we see in the class description view that this class is a "SubClass Of" (= has a SuperClass) the 'musculoskeletal system disease' class. Using the (+) button beside "SubClass Of" we could add another superclass to the 'skeletal system disease' class.
+Note the Anonymous Ancestors. This is a difficult concept we will return to later, and the contents of this portion may seem confusing at first (some of these may be clearer after you complete the "Basics of OWL" section below). These are OWL expressions that are inherited from the parents. If you hover over the Subclass Of (Anonymous Ancestor) you can see the parent that the class inherited the expression from. For many ontologies, you will see some quite abstract expressions in here inherited from upper ontologies, but these can generally be ignored for most purposes.
+ +If you want to revise the superclass, click the 'o' symbol next to the superclass and replace the text. Try to revise 'musculoskeletal system disease' to 'disease by anatomical system'.
+If you want to delete a superclass, click the 'x' button next to the superclass. Delete the 'disease by anatomical system' superclass.
+Close this window without saving.
+Save your work.
+Click: Create Pull Request in GitHub Desktop
+This will automatically open GitHub Desktop
+Click the green button 'Create pull request'
+You may now add comments to your pull request.
+The ontology editors team will review your PR and either ask for changes or merge it.
+The changes will be available in the next release.
+Dead Simple Ontology Design Patterns (DOSDPs) are specifications, written in yaml format, that specify how ontology terms should be created (see article here). They can be used to:
+DOSDPs have some key features:
+Examples of design patterns are available here:
+ + +under development
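+To give a feel for the format, a heavily simplified sketch of a DOSDP pattern is shown below. It is illustrative only and not an actual Mondo pattern; the EXAMPLE: CURIE in particular is a placeholder:
+pattern_name: allergy_example
+classes:
+  disease: MONDO:0000001
+  material entity: BFO:0000040
+relations:
+  realized in response to stimulus: EXAMPLE:0000001
+vars:
+  allergen: "'material entity'"
+name:
+  text: "allergy to %s"
+  vars:
+    - allergen
+equivalentTo:
+  text: "'disease' and 'realized in response to stimulus' some %s"
+  vars:
+    - allergen
+Filling the allergen variable with a specific class would then generate both the term label and its equivalence axiom automatically.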
+ +BDK14_exercises
from your file systembasic-subclass/chromosome-parts.owl
in Protégé, then do the following exercises:basic-restriction/er-sec-complex.owl
in Protégé, then do the following exercise:
+basic-dl-query/cc.owl
in Protégé, then do the following exercises:owl:Nothing
is defined as the very bottom node of an ontology, therefore the DL query results will show owl:Nothing
as a subclass. This is expected and does not mean there is a problem with your ontology! It's only bad when something is a subclass of owl:Nothing
and therefore unsatisfiable (more on that below).basic-classification/ubiq-ligase-complex.owl
in Protégé, then do the following exercises:
+Below are exercises to demonstrate how to:
+These instructions will use the Mondo disease ontology as an example.
+New classes are created in the Class hierarchy panel on the left.
+There are three buttons at the top of the class hierarchy view. These allow you to add a subclass (L-shaped icon), add a sibling class (c-shaped icon), or delete a selected class (x'd circle).
+ +Equivalence axioms in Mondo are added according to Dead Simple Ontology Design Patterns (DOSDPs). You can view all of the design patterns in Mondo by going to code/src/patterns/dosdp-patterns/
+For this class, we want to follow the design pattern for allergy.
+As noted above, equivalence axioms in Mondo are added according to Dead Simple Ontology Design Patterns (DOSDPs). You can view all of the design patterns in Mondo by going to code/src/patterns/dosdp-patterns/
+For this class, we want to follow the design pattern for acquired.
+Develop skills to lead a new or existing OBO project, or reference ontology develoment.
+Please complete the following and then continue with this tutorial below:
+By the end of this session, you should be able to:
+robot merge
robot reason
robot annotate
Like software, official OBO Foundry ontologies have versioned releases. This is important because OBO Foundry ontologies are expected to be shared and reused. Since ontologies are bound to change over time as more terms are added and refined, other developers need stable versions to point to so that there are no surprises. OBO Foundry ontologies use GitHub releases to maintain these stable copies of older versions.
+Generally, OBO Foundry ontologies maintain an "edit" version of their file that changes without notice and should not be used by external ontology developers because of this. The edit file is used to create releases on a (hopefully) regular basis. The released version of an OBO Foundry ontology is generally a merged and reasoned version of the edit file. This means that all modules and imports are combined into one file, and that file has the inferred class hierarchy actually asserted. It also often has some extra metadata, including a version IRI. OBO Foundry defines the requirements for version IRIs here.
+robot template
robot merge
robot reason
robot annotate
Since we can turn these steps into a series of commands, we can create a Makefile
that stores these as "recipes" for our ontology release!
report
and query
convert
, extract
, and template
merge
, reason
, annotate
, and diff
These materials are under construction and incomplete.
+Description: Combining ontology subsets for use in a project.
+All across the biomedical domain, we refer to domain entities (such as chemicals or anatomical parts) using identifiers, often from controlled vocabularies.
+The decentralised evolution of scientific domains has led to the emergence of disparate "semantic spaces" with different annotation practices and reference vocabularies and formalisms.
+ +To bridge between these spaces, entity mappings have emerged, which link, for example, genes from HGNC to ENSEMBL, diseases between OMIM and Mondo and anatomical entities between FMA and Uberon.
+Entity matching is the process of establishing a link between an identifier in one semantic space to an identifier in another. There are many cultures of thought around entity matching, including Ontology Matching, Entity Resolution and Entity Linking.
+Concept | +Definition | +
---|---|
Semantic space | +A not widely used concept to denote a cluster of related data that can be interpreted using the same ontology. | +
Ontology matching | +The task of determining corresponding entities across ontologies. | +
Entity mapping | +Determining and documenting the correspondence of an entity in one semantic space to another. | +
Schema mapping | +Determining and documenting the translation rules for converting an entity from one semantic space to another. | +
Ontology alignment | +An ontology alignment is a set of term mappings that links all concepts in a source ontology to their appropriate correspondence in a target ontology, if any. | +
Knowledge graph matching | +More or less the same as ontology matching - for knowledge graphs | +
Thesaurus building | +Involves assigning natural language strings (synonym) to a code in a knowledge organisation system (like a taxonomy, terminology, or ontology) | +
Named Entity Recognition and Entity Linking | +Involve recognising entities (such as diseases) in text and linking them to some identifier. | +
Entity resolution/record linkage | +Involves determining if records from different data sources represent, in fact, the same entity | +
Schema matching | +Determines if two objects from different data models (schema elements, schema instances) are semantically related. | +
Value Set Mapping | +Determines and documents the correspondence of two Value Sets and their respective values (i.e. a 2-level mapping!). | +
The excellent OpenHPI course on Knowledge Engineering with Semantic Web Technologies gives a good overview:
+ + +Another gentle overview on Ontology Matching was taught as part of the Knowledge & Data course at Vrije Universiteit Amsterdam.
+ + + +In the following, we consider a entity a symbol that is intended to refer to a real world entity, for example:
+ +rdfs:label
"Friedreichs Ataxia".
+The label itself is not necessarily a term - it could change, for example to "Friedreichs Ataxia (disease)", and still retain the same meaning.Friedreich's Ataxia
" (example on the left) may be a term in my controlled vocabulary which I understand to correspond to that respective disease (not all controlled vocabularies have IDs for their terms).
+This happens for example in clinical data models that do not use formal identifiers to refer to the values of slots in their data model, like "MARRIED" in /datamodel/marital_status.In our experience, there are roughly four kinds of mappings:
+cheese sandwich (wikidata:Q2734068)
to sandwich (wikidata:Q111836983)
and cheese wikidata:Q10943
.
+ These are the rarest and most complicated kinds of mappings and are out of scope for this lesson.In some ways, these four kinds of mappings can be very different. We do believe, however, that there are enough important commonalities such as common features, widely overlapping use cases and overlapping toolkits to consider them together. In the following, we will discuss these in more detail, including important features of mappings and useful tools.
+Mappings have historically been neglected as second-class citizens in the medical terminology and ontology worlds - +the metadata is insufficient to allow for precise analyses and clinical decision support, they are frequently stale and out of date, etc. The question "Where can I find the canonical mappings between X and Y"? is often shrugged off and developers are pointed to aggregators such as OxO or UMLS which combine manually curated mappings with automated ones causing "mapping hairballs".
+There are many important metadata elements to consider, but the ones that are by far the most important to consider one way or another are:
+Whenever you handle mappings (either create, or re-use), make sure you are keenly aware of at least these three metrics, and capture them. You may even want to consider using a proper mapping model like the Simple Shared Standard for Ontology Mappings (SSSOM) which will make your mappings FAIR and reusable.
+String-string mappings are mappings that relate two strings. The task of matching two strings is ubiquitous for example in database search fields (where a user search string needs to be mapped to some strings in a database). Most, if not all effective ontology matching techniques will employ some form of string-string matching. For example, to match simple variations of labels such as "abnormal heart" and "heart abnormality", various techniques such as Stemming and bag of words can be employed effectively. Other techniques such as edit-distance or Levenshtein can be used to quantify the similarity of two strings, which can provide useful insights into mapping candidates.
+String-entity mappings relate a specific string or "label" to their corresponding term in a terminology or ontology. Here, we refer to these as "synonyms", but there may be other cases for string-entity mappings beyond synonymy.
+There are a lot of use cases for synonyms so we will name just a few here that are relevant to typical workflows of Semantic Engineers in the life sciences.
+Thesauri are reference tools for finding synonyms of terms. Modern ontologies often include very rich thesauri, with some ontologies like Mondo capturing more than 70,000 exact and 35,000 related synonyms. They can provide a huge boost to traditional NLP pipelines by providing synonyms that can be used for both Named Entity Recognition and Entity Resolution. Some insight on how, for example, Uberon was used to boost text mining can be found here.
+Entity-entity mappings relate a entity (or identifier), for example a class in an ontology, to another entity, usually from another ontology or database. The entity-entity case of mappings is what most people in the ontology domain would understand when they hear "ontology mappings". This is also what most people understand when they here "Entity Resolution" in the database world - the task of determining whether, in essence, two rows in a database correspond to the same thing (as an example of a tool doing ER see deepmatcher, or py-entitymatcher). For a list standard entity matching toolkit outside the ontology sphere see here.
+Mappings between terms/identifiers are typically collected in four ways:
+The main trade-off for mappings is very simple: +1. Automated mappings are very error prone (not only are they hugely incomplete, they are also often faulty). +1. Human curated mappings are very costly.
+--> The key for any given mapping project is to determine the highest acceptable error rate, and then distribute the workload between human and automated matching approaches. We will discuss all three ways of collecting mappings in the following.
+Aside from the main tradeoff above, there are other issues to keep in mind: +- Manually curated mappings are far from perfect. Most of the cost of mapping review lies in the decision how thorough a mapping should be reviewed. For example, a human reviewer may be tasked with reviewing 1000 mappings. If the acceptable error rate is quite high, the review may simply involve the comparison of labels (see here), which may take around 20 seconds. A tireless reviewer could possibly accept or dismiss 1000 mappings just based on the label in around 6 hours. Note that this is hardly better than what most automated approaches could do nowadays. +- Some use cases involve so much data that manual curation is nearly out of the question.
+It is important to remember that matching in its raw form should not be understood to result in semantic mappings. The process of matching, in particular lexical or fuzzy semantic matching is error prone and usually better treated as resulting in candidates for mappings. This means that when we calculate the effort of a mapping project, we should always factor in the often considerable effort required by a human to verify the correctness of a candidate mapping. There are many tools that can help with this process, for example by filtering out conflicting lower-confidence mappings, but in the end the reality is this: due to the fact that source and target do not share the same semantics, mappings will always be a bit wobbly. There are two important kinds of review which are very different:
+orange juice [wikidata:Q219059]
and orange juice (unpasteurized) [FOODON:00001277]
may not be considered as the same thing in the sense of skos:exactMatch
. oak lexmatch
this usually involves hacking labels and synonyms by removing or replacing words. More sophisticated matchers like Agreement Maker Light (AML) have many more tuning options, and it requires patience and expertise to find the right ones. One good approach here is to include semantically or lexically similar matches in the results, and review if generally consistent patterns of lexical variation can be spotted. For example: orange juice (liquid) [FOODON:00001001]
seems to be exactly what orange juice [wikidata:Q219059]
is supposed to mean. The labels are not the same, but lexically similar: a simple lexical distance metric like Levenshtein could have been used to identify these.Tip: always keep a clear visible list of unmapped classes around to sanity check how good your mapping has been so far.
+There are many (many) tools out there that have been developed for entity matching. A great overview can be found in Euzenats Ontology Matching. Most of the matchers apply a mix of lexical and semantic approaches.
+As a first pass, we usually rely on a heuristic that an exact match on the label is strong evidence that the two entities correspond to the same thing. Obviously, this cannot always be the case Apple
(the fruit) and Apple
(the company) are two entirely different things, yet a simple matching tool (like OAK lexmatch
) would return these as matching. The reason why this heuristic works in practice is because we usually match between already strongly related semantic spaces, such as two gene databases, two fruit ontologies or two disease terminologies. When the context is narrow, lexical heuristics have a much lower chance to generate excessively noisy mappings.
After lexical matchings are created, other techniques can be employed, including syntactic similarity (match all entities which have labels that are more than 80% similar and end with disease
) and semantic similarity (match all entities whose node(+graph)-embedding have a cosine similarity of more than 80%). Automated matching typically results in a large number of false positives that need to be filtered out using more sophisiticated approaches for mapping reconciliation.
The refinement step may involve automated approaches that are sensitive to the logical content of the sources involved (for example by ensuring that the result does not result in equivalence cliques, or unsatisfiable classes), but more often than not, human curators are employed to curate the mapping candidates generated by the various automated approaches.
+ +Mapping phenotypes across species holds great promise for leveraging the knowledge generated by Model Organism Database communities (MODs) for understanding human disease. There is a lot of work happening at the moment (2021) to provide standard mappings between species specific phenotype ontologies to drive translational research (example). Tools such as Exomiser leverage such mappings to perform clinical diagnostic tasks such as variant prioritisation. Another app you can try out that leverages cross-species mappings is the Monarch Initiatives Phenotype Profile Search.
+Medical terminology and ontology mapping is a huge deal in medical informatics (example). Mondo is a particularly rich source of well provenanced disease ontology mappings.
+Sign up for a free GitHub account
+No advance preparation is necessary.
+Optional: If you are unfamiliar with ontologies, this introduction to ontologies explanation may be helpful.
+Description: The purpose of this lesson is to train biomedical researchers on how to find a term, what to do if they find too many terms, how to decide on which term to use, and what to do if no term is found.
+This how to guide on How to be an Open Science Engineer - maximizing impact for a better world has a lot of details about the philosophy behind open science ontology engineering. Some key points are summarized below.
+See lesson on Using Ontologies and Ontology Terms
+See How to guide on Make term requests to existing ontologies
+In this lesson, we will give an intuition of how to work with object properties
in OBO ontologies, also referred to as "relations".
We will cover, in particular, the following subjects:
+We have worked with the University of Manchester to incorporate the Family History Knowledge Base Tutorial fully into OBO Academy.
+This is it: OBOAcademy: Family History - Modelling with Object Properties.
+In contrast to the Pizza tutorial, the Family history tutorial focuses on modelling with individuals. Chapters 4, 5, 8 and 9 are full of object property modelling, and are not only great to get a basic understanding of using them in your ontology, but also give good hints at where OWL and object properties fall short. We refer to the FHKB in the following and expect you to have completed at least chapter 5 before reading on.
+To remind ourselves, there are three different types of relations in OWL:
+For some example usage, run the following query in the ontobee OLS endpoint:
+http://www.ontobee.org/sparql
+prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+prefix owl: <http://www.w3.org/2002/07/owl#>
+SELECT distinct *
+WHERE {
+GRAPH ?graph_uri
+{ ?dp rdf:type owl:DatatypeProperty .
+ ?sub ?dp ?obj }
+}
+
Note that many uses of data properties across OBO are a bit questionable. For example, you never want to attach modification dates or similar metadata to your classes using data properties, as these fall under OWL semantics. This means that logically, if a superclass has a relation using a DatatypeProperty, then this relation holds for all subclasses of that class as well.
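+As a sketch of the kind of modelling this warns against (the data property name is made up for illustration):
+Class: 'heart disease'
+    SubClassOf: 'modification date' value "2023-05-27"^^xsd:date
+Because this is an OWL axiom, every subclass of 'heart disease' would also be inferred to carry that 'modification date' relation, which is almost never the intention; an annotation property is the right tool for this kind of metadata.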
+Annotation properties are similar to data properties, but they are outside of OWL semantics, i.e. OWL reasoners and reasoning do not care, in fact ignore, anything related to annotation properties. This makes them suitable for attaching metadata like labels etc to our classes and properties. We sometimes use annotation properties even to describe relationships between classes if we want reasoners to ignore them. The most typical example is IAO:replaced_by, which connects an obsolete term with its replacement. Widely used annotation properties in the OBO-sphere are standardised in the OBO Metadata Ontology (OMO).
+The main type of relation we use in OBO Foundry are object properties. Object properties relate two individuals or classes with each other, for example:
+OWLObjectPropertyAssertion(:part_of, :heart, :cardiovascular_system)
+
In the same way as annotation properties are maintained in OMO (see above), object properties are maintained in the Relation Ontology (RO).
+Object properties are of central importance to all ontological modelling in the OBO sphere, and understanding their semantics is critical for any put the most trivial ontologies. We assume the reader to have completed the Family History Tutorial mentioned above.
+ +In our experience, these are the most widely used characteristics we specify about object properties (OP):
+ecologically co-occurs with
in RO has the domain 'organism or virus or viroid'
, which means that whenever anything ecologically co-occurs with
something else, it will be inferred to be a 'organism or virus or viroid'
.produced by
has the domain material entity
. Note that in ontologies, ranges are slightly less powerful then domains: If we have a class Moderna Vaccine
which is SubClass of 'produced by' some 'Moderna'
we get that Moderna Vaccine
is a material entity
due to the domain constraint, but NOT that Moderna
is a material entity
due to the range constraint (explanation to this is a bit complicated, sorry).Other characteristics like functionality and symmetry are used across OBO ontologies, but not nearly to the same extend as the 5 described above.
+The Relation Ontology serves two main purposes in the OBO world:
+prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+prefix owl: <http://www.w3.org/2002/07/owl#>
+SELECT distinct ?graph_uri ?s
+WHERE {
+GRAPH ?graph_uri
+{ ?s rdf:type owl:ObjectProperty ;
+ rdfs:label "part of" . }
+}
+
Running this query on the OntoBee SPARQL endpoint still reveals a number of ontologies using non-standard part-of relations. In our experience, most of these are accidental due to past format conversions, but not all. This problem was much worse before RO came along, and our goal is to unify the representation of key properties like "part of" across all OBO ontologies. The OBO Dashboard checks for object properties that are not aligned with RO.
+To add a relationship we usually follow the following process. For details, please refer to the RO documentation.
+These materials are under construction and incomplete.
+Participants will need to have access to the following resources and tools prior to the training:
+Description: This course will cover reasoning with OWL.
+At the end of this lesson, you should know how to do:
+OpenHPI Course Content
+In OWL, we use object properties to describe binary relationships between two individuals (or instances). We can also use the properties to describe new classes (or sets of individuals) using restrictions. A restriction describes a class of individuals based on the relationships that members of the class participate in. In other words, a restriction is a kind of class, in the same way that a named class is a kind of class.
+For example, we can use a named class to capture all the individuals that are idiopathic diseases. But we could also describe the class of idiopathic disease as all the instances that are 'has modifier' idiopathic disease.
+In OWL, there are three main types of restrictions that can be placed on classes. These are quantifier restriction, cardinality restrictions, and hasValue restriction. In this tutorial, we will initially focus on quantifier restrictions.
+Quantifier restrictions are further categorized into two types, the existential and the universal restriction.
+idiopathic disease
class. In Protege, the keyword 'some' is used to denote existential restrictions.In this tutorial, we will deal exclusively with the existential (some) quantifier.
+Strictly speaking in OWL, you don't make relationships between classes, however, using OWL restrictions we essentially achieve the same thing.
+We wanted to capture the knowledge that the named class 'idiopathic achalasia' is an idiopathic disease. In OWL speak, we want to say that every instance of an ' idiopathic achalasia' is also an instance of the class of things that have at least one 'has modifier' relationship to an idiopathic disease. In OWL, we do this by creating an existential restriction on the idiopathic achalasia class.
+This example introduces equivalence axioms or defined classes (also called logical definitions) and automatic classification.
+The example involves the classification of Mendelian diseases that have a monogenic (single-gene) variation. These equivalence axioms are based on the Mondo Design Pattern disease_series_by_gene.
+Constructs:
+'cardioacrofacial dysplasia 1'
'cardioacrofacial dysplasia'
that has dysfunction in the PRKACA gene.'cardioacrofacial dysplasia' and ('disease has basis in dysfunction of' some PRKACA)
For teaching purposes, let's say we need a new class that is 'fungal allergy'.
+By default, OWL assumes that these classes can overlap, i.e. there are individuals who can be instances of more than one of these classes. We want to create a restriction on our ontology that states these classes are different and that no individual can be a member of more than one of these classes. We can say this in OWL by creating a disjoint classes axiom.
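+A disjointness axiom is a one-liner in OWL. The sketch below uses two illustrative sibling classes (the ex: IRIs, including ex:PollenAllergy, are hypothetical): once the axiom is in place, any class that a reasoner places under both siblings becomes unsatisfiable, i.e. a subclass of owl:Nothing.
+Prefix(ex:=<http://example.org/>)
+Ontology(<http://example.org/disjointness-demo>
+  Declaration(Class(ex:FungalAllergy))
+  Declaration(Class(ex:PollenAllergy))
+  # No individual can be an instance of both classes.
+  DisjointClasses(ex:FungalAllergy ex:PollenAllergy)
+)
+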
+Below we'll review an example of one class and how to fix it. Next you should review and fix another one on your own and create a pull request for Nicole or Nico to review. Note, fixing these may require a bit of review and subjective decision making and the fix described below may not necessarily apply to each case.
+Bickerstaff brainstem encephalitis
: To understand why this class appeared under owl:Nothing, first click the ? next to owl:Nothing in the Description box. (Note, this can take a few minutes).Guillain-Barre syndrome
, which is a child of syndromic disease
.Bickerstaff brainstem encephalitis
is an appropriate child of regional variant of Guillain-Barre syndrome
. Note, Mondo integrates several disease terminologies and ontologies, and brought in all the subclass hierarchies from these source ontologies. To see the source of this superclass assertion, click the @ next to the assertion.regional variant of Guillain-Barre syndrome
(see this paper and this paper. It seems a bit unclear what the relationship of BBE is to Guillain-Barre syndrome. This also brings into the question if a disease can be syndromic and an infectious disease - maybe this disjoint axiom is wrong, but let's not worry about this for the teaching purposes.)These materials are under construction and incomplete.
+These materials are under construction and incomplete.
+BDK14_exercises
from your file systembasic-subclass/chromosome-parts.owl
in Protégé, then do the following exercises:basic-restriction/er-sec-complex.owl
in Protégé, then do the following exercise:
+basic-dl-query/cc.owl
in Protégé, then do the following exercises:owl:Nothing
is defined as the very bottom node of an ontology, therefore the DL query results will show owl:Nothing
as a subclass. This is expected and does not mean there is a problem with your ontology! It's only bad when something is a subclass of owl:Nothing
and therefore unsatisfiable (more on that below).basic-classification/ubiq-ligase-complex.owl
in Protégé, then do the following exercises:
+Description: Learn the fundamentals of ontologies.
+robot convert
(Review; ~15 minutes)robot extract
(Review; ~15 minutes)robot template
(Review; ~15 minutes)These materials are under construction and may be incomplete.
+convert
, extract
and template
annotate
, merge
, reason
and diff
There are two basic ways to edit an ontology: +1. Manually, using tools such as Protege, or +2. Using computational tools such as ROBOT.
+Both have their advantages and disadvantages: manual curation is often more practical when the required ontology change follows a non-standard pattern, such as adding a textual definition or a synonym, while automated approaches are usually much more scalable (ensure that all axioms in the ontology are consistent, or that imported terms from external ontologies are up-to-date or that all labels start with a lower-case letter).
+Here, we will do a first dive into the "computational tools" side of the edit process. We strongly believe that the modern ontology curator should have a basic set of computational tools in their Semantic Engineering toolbox, and many of the lessons in this course should apply to this role of the modern ontology curator.
+ROBOT is one of the most important tools in the Semantic Engineering Toolbox. For a bit more background on the tool, please refer to the paper ROBOT: A Tool for Automating Ontology Workflows.
+We also recommend to get a basic familiarity with SPARQL, the query language of the semantic web, that can be a powerful combination with ROBOT to perform changes and quality control checks on your ontology.
+ + +These materials are under construction and may be incomplete.
+Description: Using ontology terms for annotations and structuring data.
+Ontologies provide a logical classification of information in a particular domain or subject area. Ontologies can be used for data annotations, structuring disparate data types, classifying information, inferencing and reasoning across data, and computational analyses.
+A terminology is a collection of terms; a term can have a definition and synonyms.
+An ontology contains a formal classification of terminology in a domain that provides textual and machine readable definitions, and defines the relationships between terms. An ontology is a terminology, but a terminology is not (necessarily) an ontology.
+ +Numerous ontologies exist. Some recommended sources to find community developed, high quality, and frequently used ontologies are listed below.
+example of usage
.The OBO Foundry is a community of ontology developers that are committed to developing a library of ontologies that are open, interoperable ontologies, logically well-formed and scientifically accurate. OBO Foundry participants follow and contribute to the development of an evolving set of principles including open use, collaborative development, non-overlapping and strictly-scoped content, and common syntax and relations, based on ontology models that work well, such as the Gene Ontology (GO).
+The OBO Foundry is overseen by an Operations Committee with Editorial, Technical and Outreach working groups.
+Various ontology browsers are available, we recommend using one of the ontology browsers listed below.
+Some considerations for determining which ontologies to use include the license and quality of the ontology.
+Licenses define how an ontology can legally be used or reused. One requirement for OBO Foundry Ontologies is that they are open, meaning that the ontologies are openly and freely available for use with acknowledgement and without alteration. OBO ontologies are required to be released under a Creative Commons CC-BY license version 3.0 or later, OR released into the public domain under CC0. The license should be clearly stated in the ontology file.
+Some criteria that can be applied to determine the quality of an ontology include:
+Data can be mapped to ontology terms manually, using spreadsheets, or via curation tools such as:
+The figure below by Chris Mungall on his blog post on How to select and request terms from ontologies + describes a workflow on searching for identifying missing terms from an ontology.
+ + +See separate lesson on Making term requests to existing ontologies.
+ +A uniform resource identifier (URI) is a string of characters used to identify a name or a resource.
+A URL is a URI that, in addition to identifying a network-homed resource, specifies the means of acting upon or obtaining the representation.
+A URL such as this one:
+https://github.com/obophenotype/uberon/blob/master/uberon_edit.obo
+has three main parts:
+The protocol tells you how to get the resource. Common protocols for web pages are http (HyperText Transfer Protocol) and https (HTTP Secure). +The host is the name of the server to contact (the where), which can be a numeric IP address, but is more often a domain name. +The path is the name of the resource on that server (the what), here the Uberon anatomy ontology file.
+A Internationalized Resource Identifiers (IRI) is an internet protocol standard that allows permitted characters from a wide range of scripts. While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the Universal Character Set (Unicode/ISO 10646), including Chinese or Japanese kanji, Korean, Cyrillic characters, and so forth. It is defined by RFC 3987.
+More information is available here.
+A Compact URI (CURIE) consists of a prefix and a suffix, where the prefix stands in place of a longer base IRI.
+By converting the prefix and appending the suffix we get back to full IRI. For example, if we define the obo prefix to stand in place of the IRI as: +http://purl.obolibrary.org/obo/, then the CURIE +obo:UBERON_0002280 +can be expanded to +http://purl.obolibrary.org/obo/UBERON_0002280, which is the UBERON Anatomy term for ‘otolith’. +Any file that contains CURIEs need to +define the prefixes in the file header.
+A label is the textual, human readable name that is given to a term, class property or instance in an ontology.
+ + + + + + +First Instructor: James Overton
+Second Instructor: Becky Jackson
These materials are under construction and incomplete.
+Modelling and querying data with RDF triples, and working with RDF using tables
+OpenHPI Linked Data Engineering (2016)
+These materials are under construction and incomplete.
+Description: Using ontology terms in a database.
+Ontologies are notoriously hard to edit. This makes it a very high burden to edit ontologies for anyone but a select few. However, many of the contents of ontologies are actually best edited by domain experts with often little or known ontological training - editing labels and synonyms, curating definitions, adding references to publications and many more. Furthermore, if we simply remove the burden of writing OWL axioms, editors with very little ontology training can actually curate even logical content: for example, if we want to describe that a class is restricted to a certain taxon (also known as taxon-restriction), the editor is often capable to select the appropriate taxon for a term (say, a "mouse heart" is restricted to the taxon of Mus musculus), but maybe they would not know how to "add that restriction to the ontology".
+Tables are great (for a deep dive into tables and triples see here). Scientists in particular love tables, and, even more importantly, can be trained easily to edit data in spreadsheet tools, such as Google Sheets or Microsoft Excel.
+Ontology templating systems, such as DOSDP templates, ROBOT templates and Reasonable Ontology Templates (OTTR) allow separating the raw data in the ontology (labels, synonyms, related ontological entities, descriptions, cross-references and other metadata) from the OWL language patterns that are used to manifest them in the ontology. There are three main ingredients to a templating system:
+In OBO we are currently mostly concerned with ROBOT templates and DOSDP templates. Before moving on, we recommend to complete a basic tutorial in both:
+ +Ontologies, especially in the biomedical domain, are complex and, while growing in size, increasingly hard to manage for their curators. In this section, we will look at some of the key differences of two popular templating systems in the OBO domain: Dead Simple Ontology Design Patterns (DOSDPs) and ROBOT templates. We will not cover the rationale for templates in general in much depth (the interested reader should check ontology design patterns and Reasonable Ontology Templates (OTTR): Motivation and Overview, which pertains to a different system, but applies none-the-less in general), and focus on making it easier for developers to pick the right templating approach for their particular use case. We will first discuss in detail representational differences, before we go through the functional ones and delineate use cases.
+DOSDP separates data and templates into two files: a yaml file which defines the template, and a TSV file which holds the data. Lets look at s example.
+The template: abnormalAnatomicalEntity
+pattern_name: abnormalAnatomicalEntity
+pattern_iri: http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml
+description: "Any unspecified abnormality of an anatomical entity."
+
+contributors:
+ - https://orcid.org/0000-0002-9900-7880
+ - https://orcid.org/0000-0001-9076-6015
+ - https://orcid.org/0000-0003-4148-4606
+ - https://orcid.org/0000-0002-3528-5267
+
+classes:
+ quality: PATO:0000001
+ abnormal: PATO:0000460
+ anatomical entity: UBERON:0001062
+
+relations:
+ inheres_in_part_of: RO:0002314
+ has_modifier: RO:0002573
+ has_part: BFO:0000051
+
+annotationProperties:
+ exact_synonym: oio:hasExactSynonym
+
+vars:
+ anatomical_entity: "'anatomical entity'"
+
+name:
+ text: "abnormal %s"
+ vars:
+ - anatomical_entity
+
+annotations:
+ - annotationProperty: exact_synonym
+ text: "abnormality of %s"
+ vars:
+ - anatomical_entity
+
+def:
+ text: "Abnormality of %s."
+ vars:
+ - anatomical_entity
+
+equivalentTo:
+ text: "'has_part' some ('quality' and ('inheres_in_part_of' some %s) and ('has_modifier' some 'abnormal'))"
+ vars:
+ - anatomical_entity
+
The data: abnormalAnatomicalEntity.tsv
+defined_class | +defined_class_label | +anatomical_entity | +anatomical_entity_label | +
---|---|---|---|
HP:0040286 | +Abnormal axial muscle morphology | +UBERON:0003897 | +axial muscle | +
HP:0011297 | +Abnormal digit morphology | +UBERON:0002544 | +digit | +
ROBOT encodes both the template and the data in the same TSV; after the table header, the second row basically encodes the entire template logic, and the data follows in table row 3.
+ID | +Label | +EQ | +Anatomy Label | +
---|---|---|---|
ID | +LABEL | +EC 'has_part' some ('quality' and ('inheres_in_part_of' some %) and ('has_modifier' some 'abnormal')) | ++ |
HP:0040286 | +Abnormal axial muscle morphology | +UBERON:0003897 | +axial muscle | +
HP:0011297 | +Abnormal digit morphology | +UBERON:0002544 | +digit | +
Note that for the Anatomy Label
we deliberately left the second row empty, which instructs the ROBOT template tool to completely ignore this column.
From an ontology engineering perspective, the essence of the difference between DOSDP and ROBOT templates could be captured as follows:
+DOSDP templates are more about generating annotations and axioms, while ROBOT templates are more about curating annotations and axioms.
+
Curating annotations and axioms
means that an editor, or ontology curator, manually enters the labels, synonyms, definitions and so forth into the spreadsheet.
Generating axioms
in the sense of this section means that we try to automatically generate labels, synonyms, definitions and so forth based on the related logical entities in the patterns. E.g., using the example template above, the label "abnormal kidney" would automatically be generated when the Uberon term for kidney is supplied.
+While both ROBOT and DOSDP can be used for "curation" of annotations and axioms, DOSDP seeks to apply generation rules to automatically generate synonyms, labels, definitions and so forth, while ROBOT templates seek to collect manually curated information in an easy-to-use table which is then compiled into OWL. In other words:
+However, there is another dimension in which both approaches differ widely: sharing and re-use. By far the most important feature of DOSDPs is that they allow a community of developers to rally around a modelling problem, debate it and establish consensus; for example, a pattern can be used to say: this is how we model abnormal anatomical entities. Consensus can be made explicit by "signing off" on the pattern (e.g. by adding your ORCID to the list of contributors), and due to the template/data separation, the template can simply be imported using its IRI (for example http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml) and re-used by everyone. Furthermore, additional metadata fields including textual descriptions, and more recently "examples", make DOSDP template files comparatively easy to understand, even by a less technically inclined editor.
+ROBOT templates, on the other hand, do not lend themselves to community debates in the same way; first of all, they are typically supplied with all of the data merged in; secondly, they do not provide additional metadata fields that could, for example, conveniently be used to represent a sign-off (you could, of course, add the ORCIDs into a non-functional column, or as a pipe-separated string into a cell in the first or second row, but it's obvious that this would be quite clunky) or a textual description. A YAML file is much easier for a human to read and understand than the header of a TSV file, especially when the template becomes quite large.
+However, there is a flipside to the strict separation of data and templates. One is that DOSDP templates are really hard to change. Once a particular variable name has been chosen, for example, renaming it will require an extensive community-wide action to rename columns in all associated spreadsheets - which requires them all to be known beforehand (which is not always the case). You don't have such a problem with ROBOT templates; if you change a column name, or a template string, everything will continue to work without any additional coordination.
+Both ROBOT templates and DOSDP templates are widely used. The author of this page uses both in most of the projects he is involved in, because of their different strengths and capabilities. You can use the following rules of thumb to inform your choice:
+Consider ROBOT templates if your emphasis is on
+Consider DOSDP templates if your emphasis is on
+There is a nice debate going on which questions the use of tables in ontology curation altogether. There are many nuances in this debate, but I want to stylise it here as two schools of thoughts (there are probably hundreds in between, but this makes it easier to follow): The one school (let's call them Tablosceptics) claims that using tables introduces a certain degree of fragility into the development process due to a number of factors, including:
+They prefer to use tools like Protege that show the curator immediately the consequences of their actions, like reasoning errors (unintended equivalent classes, unsatisfiable classes and other unintended inferences). The Tablophile school of thought responds to these accusations in essence with "tools"; they say that tables are essentially a convenient matrix to input the data (which in turns opens ontology curation to a much wider range of people), and it is up to the tools to ensure that QC is run, hierarchies are being presented for review and weird ID space clashes are flagged up. Furthermore, they say, having a controlled input matrix will actually decrease the number of faulty annotations or axioms (which is evidenced by the large number of wrongful annotation assertions across OBO foundry ontologies I see every day as part of my work). At first sight, both template systems are affected equally by the war of the Tablosceptics and the Tablophile. Indeed, in my on practice, the ID space issue is really problematic when we manage 100s and more templates, and so far, I have not seen a nice and clear solution that ensures that no ID used twice unless it is so intended and respects ID spaces which are often semi-formally assigned to individual curators of an ontology.
+Generally in this course we do not want to take a 100% stance. The author of this page believes that the advantage of using tables and involving many more people in the development process outweighs any concerns, but tooling is required that can provide more immediate feedback when such tables such as the ones presented here are curated at scale.
+ + + + + + +Description: An introduction to the landscape of disease and phenotype terminologies and ontologies, and how they can be used to add value to your analysis.
+A landscape analysis of major disease and phenotype ontologies that are currently available is here (also available in Zenodo here).
+ +Different ontologies are built for different purposes and were created for various reasons. For example, some ontologies are built for text mining purposes, some are built for annotating data and downstream computational analysis.
+The unified phenotype ontology (uPheno) aggregates species-specific phenotype ontologies into a unified resource. Several species-specific phenotype ontologies exist, such as the Human Phenotype Ontology, Mammalian Phenotype Ontology (http://www.informatics.jax.org/searches/MP_form.shtml), and many more.
+Similarly to the phenotype ontologies, there are many disease ontologies that exist that are specific to certain areas of diseases, such as infectious diseases (e.g. Infectious Disease Ontology), cancer (e.g. National Cancer Institute Thesaurus), rare diseases (e.g. Orphanet), etc.
+In addition, there are several more general disease ontologies, such as the Mondo Disease Ontology, the Human Disease Ontology (DO), SNOMED, etc.
+Different disease ontologies may be built for different purposes; for example, ontologies like Mondo and DO are intended to be used for classifying data, and downstream computational analyses. Some terminologies are used for indexing purposes, such as the International classification of Diseases (ICD). ICD-11 is intended for indexing medical encounters for the purposes of billing and coding. Some of the disease ontologies listed on the landscape contain terms that define diseases, such as Ontology for General Medical Sciences (OGMS) are upper-level ontologies and are intended for integration with other ontologies.
+When deciding on which phenotype or disease ontology to use, some things to consider:
+make
# Early Career Data Scientist
+Description: A collection of videos, tutorials, training materials, and exercises targeted towards any entry-level, early-career person interested in learning basic skills in data science.
+Description: Introductory data science ethics videos hosted by Northeat Big Data Innovation Hub.
+6 videos available here
+Note: for the tutorials below PC users need to install ODK (instructions are linked from the tutorial)
+Description: These guidelines are developed for anyone interested in contributing to ontologies to guide how to contribute to OBO Foundry ontologies.
+Ontologies are routinely used for data standardization and in analytical analysis, but the ontologies themselves are under constant revisions and iterative development. Building ontologies is a community effort, and we need expertise from different areas:
+The OBO foundry ontologies are open, meaning anyone can access them and contribute to them. The types of contributions may include reporting issues, identifying bugs, making requests for new terms or changes, and you can also contribute directly to the ontology itself- if you are familiar with ontology editing workflows, you can download our ontologies and make edits on a branch and make a pull request in GitHub.
+Community feedback is welcome for all open OBO Foundry ontologies. Feedback is often provided in the form of:
+Note: There is no one single accepted way of doing ontology curation in the OBO-World, see here. This guide reflects the practice of the GO-style ontology curation, as it is used by GO, Uberon, CL, PATO and others.
+Note: Work on this document is still in progress, items that are not linked are currently being worked on.
+This section is a non-ordered collection of how to documents that a curator might needs
+Note: There is no one single accepted way of doing ontology curation in the OBO-World, see here. This guide reflects the practice of the OBI-style ontology curation, as it is used by OBI, IAO and others.
+There is no one single accepted methodology for building ontologies in the OBO-World. We can distinguish at least two major schools of ontology curation
+Note that there are many more variants, probably as many as there are ontologies. Both schools differ only in how they curate their ontologies - the final product is always an ontology in accordance with OBO Principles. These are some of the main differences of the two schools:
++ | GO-style | +OBI-style | +
---|---|---|
Edit format | +Historically developed in OBO format | +Developed in an OWL format | +
Annotation properties | +Many annotation properties from the oboInOwl namespace, for example for synonyms and provenance. | +Many annotation properties from the IAO namespace. | +
Upper Ontology | +Hesitant alignment with BFO, often uncommitted. | +Strong alignment with BFO. | +
Logic | +Tend to be simple existential restrictions (some ), ontologies in OWL 2 EL. No class expression nesting. Simple logical definition patterns geared towards automating classification |
+Tend to use a lot more expressive logic, including only and not . Class expression nesting can be more complex. |
+
Examples | +GO, Uberon, Mondo, HPO, PATO, CL, BSPO | +OBI, IAO, OGMS | +
There are a lot of processes happening that are bringing these schools together, sharing best practices (GitHub, documentation) and reconciling metadata conventions and annotation properties in the OBO Metadata Ontology (OMO). The Upper Level alignment is now done by members of both schools through the Core Ontology for Biology and Biomedicine (COB). While these processes are ongoing, we decided to curate separate pathways for both schools:
+As an ontology engineer, it is useful to know how curators work; as such, it is useful to be familiar with all the concepts in the ontology curator pathways document. This pathway will, however, focus on the engineering side of things.
+This section is an unordered collection of how-to documents that an engineer might need (this includes everything from the curators' list, as it may be pertinent knowledge for an engineer).
+Pathways arrange materials from OBOOK in a linear fashion for the purpose of helping people in different roles find the materials relevant to their work more easily. To browse through the pathways, look under the "Pathways" menu item.
+ + + + + + +For a basic tutorial on how to leverage ChatGPT for ontology development see here.
+I want you to act as a REST API, which takes natural language searches as input and returns an SSSOM mapping in valid JSON in a codeblock, no comments, no additional text. An example of a valid mapping is
+{
+  "subject_id": "a:something",
+  "predicate_id": "rdfs:subClassOf",
+  "object_id": "b:something",
+  "mapping_justification": "semapv:LexicalMatching",
+  "subject_label": "XXXXX",
+  "subject_category": "biolink:AnatomicalEntity",
+  "object_label": "xxxxxx",
+  "object_category": "biolink:AnatomicalEntity",
+  "subject_source": "a:example",
+  "object_source": "b:example",
+  "mapping_tool": "rdf_matcher",
+  "confidence": 0.8,
+  "subject_match_field": [
+    "rdfs:label"
+  ],
+  "object_match_field": [
+    "rdfs:label"
+  ],
+  "match_string": [
+    "xxxxx"
+  ],
+  "comment": "mock data"
+}
+As a first task, I want you to return a suitable mapping for MONDO:0004975 in ICD 10 CM.
+ + + + + + +The new OBO Foundry guidelines encourage the annotation of ontologies with an appropriately formatted description, title and license. Here are some examples that can be used as a guide to implement those in your ontology.
+Note: these examples purposefully do not include version information; version information should not be added manually, but rather by ROBOT as part of a pipeline. An ontology set up with the ODK will take care of all of this for you.
+<?xml version="1.0"?>
+<rdf:RDF xmlns="http://purl.obolibrary.org/obo/license.owl#"
+ xml:base="http://purl.obolibrary.org/obo/license.owl"
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:owl="http://www.w3.org/2002/07/owl#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:xml="http://www.w3.org/XML/1998/namespace"
+ xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
+ xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
+ xmlns:terms="http://purl.org/dc/terms/">
+ <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/license.owl">
+ <dc:description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An integrated and fictional ontology for the description of abnormal tomato phenotypes.</dc:description>
+ <dc:title rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Tomato Phenotype Ontology (TPO)</dc:title>
+ <terms:license rdf:resource="https://creativecommons.org/licenses/by/3.0/"/>
+ </owl:Ontology>
+ <owl:AnnotationProperty rdf:about="http://purl.org/dc/elements/1.1/description"/>
+ <owl:AnnotationProperty rdf:about="http://purl.org/dc/elements/1.1/title"/>
+ <owl:AnnotationProperty rdf:about="http://purl.org/dc/terms/license"/>
+</rdf:RDF>
+
Prefix(:=<http://purl.obolibrary.org/obo/license.owl#>)
+Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
+Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
+Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
+Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
+Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)
+
+
+Ontology(<http://purl.obolibrary.org/obo/license.owl>
+Annotation(<http://purl.org/dc/elements/1.1/description> "An integrated and fictional ontology for the description of abnormal tomato phenotypes."^^xsd:string)
+Annotation(<http://purl.org/dc/elements/1.1/title> "Tomato Phenotype Ontology (TPO)"^^xsd:string)
+Annotation(<http://purl.org/dc/terms/license> <https://creativecommons.org/licenses/by/3.0/>)
+
+)
+
<?xml version="1.0"?>
+<Ontology xmlns="http://www.w3.org/2002/07/owl#"
+ xml:base="http://purl.obolibrary.org/obo/license.owl"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:xml="http://www.w3.org/XML/1998/namespace"
+ xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
+ xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
+ ontologyIRI="http://purl.obolibrary.org/obo/license.owl">
+ <Prefix name="" IRI="http://purl.obolibrary.org/obo/license.owl#"/>
+ <Prefix name="owl" IRI="http://www.w3.org/2002/07/owl#"/>
+ <Prefix name="rdf" IRI="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
+ <Prefix name="xml" IRI="http://www.w3.org/XML/1998/namespace"/>
+ <Prefix name="xsd" IRI="http://www.w3.org/2001/XMLSchema#"/>
+    <Prefix name="rdfs" IRI="http://www.w3.org/2000/01/rdf-schema#"/>
+    <Prefix name="terms" IRI="http://purl.org/dc/terms/"/>
+ <Annotation>
+ <AnnotationProperty IRI="http://purl.org/dc/elements/1.1/description"/>
+ <Literal>An integrated and fictional ontology for the description of abnormal tomato phenotypes.</Literal>
+ </Annotation>
+ <Annotation>
+ <AnnotationProperty IRI="http://purl.org/dc/elements/1.1/title"/>
+ <Literal>Tomato Phenotype Ontology (TPO)</Literal>
+ </Annotation>
+ <Annotation>
+ <AnnotationProperty abbreviatedIRI="terms:license"/>
+ <IRI>https://creativecommons.org/licenses/by/3.0/</IRI>
+ </Annotation>
+ <Declaration>
+ <AnnotationProperty IRI="http://purl.org/dc/elements/1.1/title"/>
+ </Declaration>
+ <Declaration>
+ <AnnotationProperty IRI="http://purl.org/dc/elements/1.1/description"/>
+ </Declaration>
+ <Declaration>
+ <AnnotationProperty IRI="http://purl.org/dc/terms/license"/>
+ </Declaration>
+</Ontology>
+
format-version: 1.2
+ontology: license
+property_value: http://purl.org/dc/elements/1.1/description "An integrated and fictional ontology for the description of abnormal tomato phenotypes." xsd:string
+property_value: http://purl.org/dc/elements/1.1/title "Tomato Phenotype Ontology (TPO)" xsd:string
+property_value: http://purl.org/dc/terms/license https://creativecommons.org/licenses/by/3.0/
+
sh run.sh make update_repo
+
sh run.sh make update_docs
+
sh run.sh make prepare_release
+
sh run.sh make refresh-%
+
Example:
+sh run.sh make refresh-chebi
+
sh run.sh make refresh-imports
+
sh run.sh make refresh-imports-excluding-large
+
sh run.sh make test
+
sh run.sh make odkversion
+
(of a specific file)
+sh run.sh make validate_profile_%
+
Example:
+sh run.sh make validate_profile_hp-edit.owl
+
Killed
: Running out of memory¶Running the same workflow several times simultaneously (e.g. if two PRs are submitted in a short time, and the second PR triggers the CI workflow while the CI workflow triggered by the first PR is still running) could lead to lack-of-memory situations because all concurrent workflows have to share a single memory limit.
+(Note: it isn't really clear with documentation of GitHub Actions on whether concurrent workflow runs share a single memory limit.)
+What could possibly be done is to forbid a given workflow from ever running as long as there is already a run of the same workflow ongoing, using the concurrency property.
+ + + + + + +This page aims to consolidate some tips and tricks that ontology editors have found useful in using git
. It is not meant to be a tutorial of git
, but rather as a page with tips that could help in certain specialised situations.
src/ontology
, in terminal use: git checkout master -- imports/uberon_import.owl
.git log
to list out the previous commits and copy the commit code of the commit you would like to revert to (example: see yellow string of text in screenshot below).git checkout ff18c9482035062bbbbb27aaeb50e658298fb635 -- imports/uberon_import.owl
using whichever commit code you want instead of the commit code in this example.For most of our training activities, we recommend using GitHub Desktop. It provides a very convenient way to push and pull changes, and inspect the "diff". It is, however, not mandatory if you are already familiar with other git workflows (such as command line, or Sourcetree).
+ + + + + + +A repository can consist of many files with several users simultaneously editing those files at any moment in time. In order to ensure conflicting edits between the users are not made and a history of the edits are tracked, software classified as a "distributed version control system" is used.
+All OBO repositories are managed by the Git version control system. This allows users to make their own local branch of the repository, i.e., making a mirror copy of the repository directories and files on their own computers, and make edits as desired. The edits can then be reviewed by other users before the changes are incorporated in the 'main' or 'master' branch of the repository. This process can be executed by running Git line commands and/or by using a web interface (Github.com) along with a desktop application (GitHub Desktop).
+Documentation, including an introduction to GitHub, can be found here: +Hello World.
+ + + + + + +This document is a list of terms that you might encounter in the ontology world. It is not an exhaustive list and will continue to evolve. Please create a ticket if there is a term you find missing or a term you encounter that you do not understand, and we will do our best to add them. This list is not arranged in any particular order. Please use the search function to find terms.
+Acknowledgement: Many terms are taken directly from OAK documentation with the permission of Chris Mungall. Many descriptions are also taken from https://www.w3.org/TR/owl2-syntax/.
+This term is frequently ambiguous. It can refer to Text Annotation, OWL Annotation, or Association.
+Annotation properties are OWL axioms that are used to place annotations on individuals, class names, property names, and ontology names. They do not affect the logical definition unless they are used as a "shortcut" that a pipeline expands to a logical axiom.
+An accumulation of all of the superclasses from ancestors of a class.
+If an individual is not expected to be used outside an ontology, one can use an anonymous individual, which is identified by a local node ID rather than a global IRI. Anonymous individuals are analogous to blank nodes in RDF.
+Application Programming Interface. An intermediary that allows two or more computer programs to communicate with each other. In ontologies, this usually means an Endpoint in which the ontology can be programmatically accessed.
+Usually refers to a Project Ontology.
+Axioms are statements that are asserted to be true in the domain being described. For example, using a subclass axiom, one can state that the class a:Student is a subclass of the class a:Person. (Note: in OWL, there are also annotation axioms which does not apply any logical descriptions)
+An Ontology Repository that is a comprehensive collection of multiple biologically relevant ontologies.
+Standardized and organized arrangements of words and phrases that provide a consistent way to describe data. A controlled vocabulary may or may not include definitions. Ontologies can be seen as a controlled vocabulary expressed in an ontological language which includes relations.
+An OWL entity that formally represents something that can be instantiated. For example, the class "heart".
+A CURIE is a compact URI. For example, CL:0000001
expands to http:purl.obolibrary.org/obo/CL_0000001. For more information, please see https://www.w3.org/TR/curie/.
An abstract model that organizes elements of data and standardizes how they relate to one another.
+dataProperty relate OWL entities to literal data (e.g., strings, numbers, datetimes, etc.) as opposed to ObjectProperty which relate individuals to other OWL entities. Unlike AnnotationProperty, dataProperty axioms fall on the logical side of OWL and are hence useable by reasoners.
+Datatypes are OWL entities that refer to sets of data values. Thus, datatypes are analogous to classes, the main difference being that the former contain data values such as strings and numbers, rather than individuals. Datatypes are a kind of data range, which allows them to be used in restrictions. For example, the datatype xsd:integer denotes the set of all integers, and can be used with the range of a dataProperty to state that the range of said dataProperty must be an integer.
+Description Logics (DL) are a family of formal knowledge representation languages. It provides a logical formalism for ontologies and is what OWL is based on. DL querying can be used to query ontologies in Protege.
+Domain, in reference to a dataProperty or ObjectProperty, refers to the restriction on the subject of a triple - if a given property has a given class in its domain this means that any individual that has a value for the property, will be inferred to be an instance of that domain class. For example, if John hasParent Mary
and Person
is listed in the domain of hasParent
, then John
will be inferred to be an instance of Person
.
Dead Simple Ontology Design Patterns. A templating system for ontologies with well-documented patterns and templates.
+A typed, directed link between Nodes in a knowledge graph. Translations of OWL into Knowledge graphs vary, but typically edges are generated for simple triples, relating two individuals or two classes via an AnnotationProperty or ObjectProperty and simple existential restrictions (A SubClassOf R some B), with the edge type corresponding to the property.
+Where an API interfaces with the ontology.
+A relationship between two classes, A R (some) B, that states that all individuals of class A stand in relation R to at least one individual of class B. For example, neuron has_part some dendrite
states that all instances of neuron have at least one individual of type dentrite as a part. In Manchester syntax, the keyword 'some' is used to denote existential restrictions and is interpreted as "there exists", "there is at least one", or "some". See documentation on classifications for more details.
An official syntax of OWL (others are RDF-XML and OWL-XML) in which each line represents and axiom (although things get a little more complex with axiom annotations, and axioms use prefix syntax (order = relation (subject, object)). This is in contrast to in-fix syntax (e.g. Manchester syntax) (order = subject relation object). Functional syntax is the preferred syntax for editor files maintained on GitHub, because it can be safely diff'd and (somewhat) human readable.
+Formally a graph is a data structure consisting of Nodes and Edges. There are different forms of graphs, but for our purposes an ontology graph has all Terms as nodes, and relationships connecting terms (is-a, part-of) as edges. Note the concept of an ontology graph and an RDF graph do not necessarily fully align - RDF graphs of OWL ontologies employ numerous blank nodes that obscure the ontology structure.
+An OWL entity that represents an instance of a class. For example, the instance "John" or "John's heart". Note that instances are not commonly represented in ontologies. For instance, "John" (an instance of person) or "John's heart" (an instance of heart).
+A measure of how informative an ontology concept is; broader concepts are less informative as they encompass many things, whereas more specific concepts are more unique. This is usually measured as -log2(Pr(term))
. The method of calculating the probability varies, depending on which predicates are taken into account (for many ontologies, it makes sense to use part-of as well as is-a), and whether the probability is the probability of observing a descendant term, or of an entity annotated using that term.
A programmatic abstraction that allows us to focus on what something should do rather than how it is done.
+A measures of the similarity between two sets of data to see which members are shared and distinct.
+Knowledge Graph Change Language (KGCL) is a data model for communicating desired changes to an ontology. It can also be used to communicate differences between two ontologies. See KGCL docs.
+A network of real-world entities (i.e., objects, events, situations, and concepts) that illustrates the relationships between them. Knowledge graphs (in relation to ontologies) are thought of as real data built using an ontology as a framework.
+Usually refers to a human-readable text string corresponding to the rdfs:label
predicate. Labels are typically unique per ontology. In OBO Format and in the bio-ontology literature, labels are sometimes called Names. Sometimes in the machine learning literature, and in databases such as Neo4J, "label" actually refers to a Category.
Lutra is the open source reference implementation of the OTTR templating language.
+A means of linking two resources (e.g. two ontologies, or an ontology and a database) together. Also see SSSOM
+The process of making inferred axioms explicit by asserting them.
+Usually synonymous with Label, but in the formal logic and OWL community, "Name" sometimes denotes an Identifier
+An Individual that is given an explicit name that can be used in any ontology to refer to the same object; named individuals get IRIs whereas anonymous individuals do not.
+The "right" side of a Triple.
+An owl entity that is used to related 2 individuals ('my left foot' part_of 'my left leg') or two classes ('foot' part_of some leg) or an individual and a class ('the neuron depicted in this image' (is) has_soma_location some 'primary motor cortex. More rarely it is used to define a class in terms of some individual (the class 'relatives of Shawn' related_to Value Shawn.
+Open Biological and Biomedical Ontology. This could refer to the OBO Foundry (e.g. OBO ontologies = ontologies that follow the standards of the OBO Foundry) or OBO Format
+A serialization format for ontologies designed for easy viewing, direct editing, and readable diffs. It is popular in bioinformatics, but not widely used or known outside the genomics sphere. OBO is mapped to OWL, but only expresses a subset, and provides some OWL abstractions in a more easy to understand fashion.
+Ontology Lookup Service. An Ontology Repository that is a curated collection of multiple biologically relevant ontologies, many from OBO. OLS can be accessed with this link
+A flexible concept loosely encompassing any collection of OWL entities and statements or relationships connecting them.
+Ontology Development Kit. A toolkit and docker image for managing ontologies.
+The systems or platform where various types of ontologies are stored from different sources and provide the ability to data providers and application developers to share and reuse the ontologies.
+A curated collection of ontologies.
+Reasonable Ontology Templates. A system for composable ontology templates and documentation.
+Web Ontology Language. An ontology language that uses constructs from Description Logic. OWL is not itself an ontology format, it can be serialized through different formats such as Functional Syntax, and it can be mapped to :RDF and serialized via an RDF format.
+In the context of OWL, the term Annotation means a piece of metadata that does not have a strict logical interpretation. Annotations can be on entities, for example, Label annotations, or annotations can be on Axioms.
+A java-based API to interact with OWL ontologies. Full documentation can be found at http://owlcs.github.io/owlapi/apidocs_5/index.html
+OWL Entities, such as classes, properties, and individuals, are identified by IRIs. They form the primitive terms of an ontology and constitute the basic elements of an ontology. For example, a class a:Person can be used to represent the set of all people. Similarly, the object property a:parentOf can be used to represent the parent-child relationship. Finally, the individual a:Peter can be used to represent a particular person called "Peter". +The following is a complete list of types of OWL Entities:
+ +An OWL entity that represents the type of a Relationship.
+Typically corresponds to an ObjectProperty in OWL, but this is not always true;
+in particular, the is-a relationship type is a builtin construct SubClassOf
in OWL
+Examples:
An ontology that is specific to a project and does not necessarily have interoperability with other ontologies in mind.
+An Ontology Library for parsing obo and owl files.
+An OWL entity that represents an attribute or a characteristic of an element. +In OWL, properties are divided into disjoint categories:
+ +A typical ontology development tool used by ontology developers in the OBO-sphere. Full documentation can be found at https://protege.stanford.edu/.
+Range, in reference to a dataProperty or ObjectProperty, refers to the restriction on the object of a triple - if a given property has a given class in its domain this means that any individual that has a value for the property (i.e. is the subject of a relation along the property), will be inferred to be an instance of that domain class. For example, if John hasParent Mary
and Person
is listed in the domain of hasParent
, then John
will be inferred to be an instance of Person
.
A datamodel consisting of simple Subject predicate Object Triples organized into an RDF Graph.
+A python library to interact with RDF data. Full documentation can be found at https://rdflib.readthedocs.io/en/stable/.
+An ontology tool that will perform inference over an ontology to yield new axioms (e.g. new Edges) or to determine if an ontology is logically coherent.
+A Relationship is a type connection between two OWL entities. The first element is called the subject, and the second one the Object, with the type of connection being the Relationship Type. Sometimes Relationships are equated with Triples in RDF but this can be confusing, because some relationships map to multiple triples when following the OWL RDF serialization. An example is the relationship "finger part-of hand", which in OWL is represented using a Existential Restriction that maps to 4 triples.
+See predicate
+A toolkit for transforming and interacting with ontologies. Full documentation can be found at http://robot.obolibrary.org/
+A means of measuring similarity between either pairs of ontology concepts, or between entities annotated using ontology concepts. There is a wide variety of different methods for calculating semantic similarity, for example Jaccard Similarity and Information Content based measures.
+Semantic SQL is a proposed standardized schema for representing any RDF/OWL ontology, plus a set of tools for building a database conforming to this schema from RDF/OWL files. See Semantic-SQL
+The standard query language and protocol for Linked Open Data on the web or for RDF triplestores - used to query ontologies.
+Simple Standard for Sharing Ontological Mappings (https://github.com/mapping-commons/sssom).
+The "left" side of a Triple.
+A named collection of elements, typically grouped for some purpose. In the ODK/OBO world, there is a standard annotation property and pattern for this, for more information, see the subset documentation.
+Usually used to mean Class and Individuals, however sometimes used to refer to wider OWL entities.
+The process of annotating spans of texts within a text document with references to ontology terms, or the result of this process. This is frequently done automatically. The Bioportal implementation provides text annotation services.
+A set of three entities that codifies a statement about semantic data in the form of Subject-predicate-Object expressions (e.g., "Bob is 35", or "Bob knows John"). Also see Relationship.
+A purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject–predicate–object, like "Bob is 35" or "Bob knows Fred".
+An integrated OBO ontology Triplestore and a Ontology Repository, with merged set of mutually referential OBO ontologies (see the ubergraph github for list of ontologies included), that allows for SPARQL querying of integrated OBO ontologies.
+A Uniform Resource Indicator, a generalization of URL. Most people think of URLs as being solely for addresses for web pages (or APIs) but in semantic web technologies, URLs can serve as actual identifiers for entities like OWL entities. Data models like OWL and RDF use URIs as identifiers. In OAK, URIs are mapped to CURIE
+ + + + + + +IMPORTANT NOTE TO EDITORS, MERGE THIS WITH glossary.md.
+
Term | +Definition | +Type | +Docs | +
---|---|---|---|
Ontology Development Kit (ODK) | +A toolkit and docker image for managing ontology releases. | +Tool | +docs | +
ROBOT | +A toolkit for transforming and interacting with ontologies. | +Tool | +docs | +
rdflib | +A python library to interact with RDF data | +Library | +docs | +
OWL API | +A java-based API to interact with OWL ontologies | +Library | +docs | +
Protege | +A typical ontology development tool used by ontology developers in the OBO-sphere | +Tool | +docs | +
ROBOT templates | +A templating system based on tables, where the templates are integrated in the same table as the data | +Standard | +docs | +
Dead Simple Ontology Design Patterns (DOSDP) | +A templating system for ontologies with well-documented patterns and templates. | +Standard | +docs | +
DOSDP tools | +DOSDP tools is the open source reference implementation of the DOSDP templating language. | +Tool | +docs |
Reasonable Ontology Templates (OTTR) | +A system for composable ontology templates and documentation | +Standard | +docs | +
Lutra | +Lutra is the open source reference implementation of the OTTR templating language. | +Tool | +docs | +
Note that while most of the practices documented here apply to all OBO ontologies this recommendation applies only to ontologies that are developed using GO-style curation workflows.
+Type | +Property to use | +Required | +Number/Limit | +Description | +Format | +Annotation | +Reference/Comments | +
---|---|---|---|---|---|---|---|
Label | +rdfs:label | +Y | +Max 1 * | +Full name of the term, must be unique. | +Free text | +None | +* some ontologies have multiple labels for different languages, in which case, there should maximum be one label per language | +
Definition | +IAO:0000115 | +Y | +Max 1 | +A textual definition of the term. In most ontologies, must be unique. | +Free text | +database_cross_reference: reference materials used and contributors (in ORCID ID link format) | +See this document for guide on writing definitions |
Contributor | +dcterms:contributor | +N (though highly recommended) | +No limit | +The ORCID ID of people who contributed to the creation of the term. | +ORCID ID (using full link) | +None | + |
Synonyms | +http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym, http://www.geneontology.org/formats/oboInOwl#hasNarrowSynonym, http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym | +N | +No limit | +Synonyms of the term. | +Free text | +database_cross_reference: reference material in which the synonym is used | +See synonyms documentation for guide on using synonyms |
Comments | +rdfs:comment | +N | +Max 1 | +Comments about the term, extended descriptions that might be useful, notes on modelling choices, other misc notes. | +Free text | +database_cross_reference: reference material relating to the comment | +See documentation on comments for more information about comments | +
Editor note | +IAO:0000116 | +N | +Max 1 | +A note that is not relevant to front users, but might be to editors | +Free text | +database_cross_reference: reference material relating to the note | ++ |
Subset | +http://www.geneontology.org/formats/oboInOwl#inSubset | +N | +No limit | +A tag that marks a term as being part of a subset | +annotation property that is a subproperty of subset_property (see guide on how to select this) | +None | +See Slim documentation for more information on subsets | +
Database Cross Reference | +http://www.geneontology.org/formats/oboInOwl#hasDbXref | +N | +No limit | +Links out to external references. | +string and should* take the form {prefix}:{accession}; see db-xrefs yaml for prefixes | +None | +*Some ontologies allow full URLS in specific cases, but this is controversial | +
Date created | +dcterms:created | +N | +Max 1 | +Date in which the term was created | +ISO-8601 format | +None | ++ |
Date last updated | +dcterms:date | +N | +Max 1 | +Date in which the term was last updated | +ISO-8601 format | +None | ++ |
Deprecation | +http://www.w3.org/2002/07/owl#deprecated | +N | +Max 1 | +A tag that marks a term as being obsolete/deprecated | +xsd:boolean (true/false) | +None | +See obsoletion guide for more details | +
Replaced by | +IAO:0100001 | +N | +Max 1 | +Term that has replaced an obsoleted term | +IRI/ID (e.g. CL:0000001) | +None | +See obsoletion guide and merging terms guide for more details | +
Consider | +oboInOwl:consider | +N | +No limit | +Term that can be considered from manual replacement of an obsoleted term | +IRI/ID (e.g. CL:0000001) | +None | +See obsoletion guide and merging terms guide for more details | +
Based on Intro to GitHub (GO-Centric) with credit to Nomi Harris and Chris Mungall
+Labels are a useful tool to help group and organize issues, allowing people to filter issues by grouping. +Note: Only project contributors can add/change labels
+Superissues are issues that have checklists (added using -[] on items). These are useful as they show progress towards completion. These can be used for issues that require multiple steps to solve.
+ +Milestones are used for issues with a specific date/deadline. Milestones contain issues and issues can be filtered by milestones. They are also useful for visualizing how many issues in it is completed.
+ +Project boards are a useful tool to organise, as the name implies, projects. They can span multiple repos (though the repos need to be in the same organisation). Notes can also be added.
+ + + + + + + +Compiled by Nicole Vasilevsky. Feel free to make pull requests to suggest edits. Note: This currently just provides an overview of disease and phenotype ontologies. Contributors are welcome to add more descriptions of other medical ontologies. This was last updated in 2021.
+Name | +Disease Area | +
---|---|
Artificial Intelligence Rheumatology Consultant System Ontology (AI-RHEUM) | +Rheumatic diseases | +
Autism DSM-ADI-R Ontology (ADAR) | +Autism | +
Autism Spectrum Disorder Phenotype Ontology (ASDPTO) | +Autism | +
Brucellosis Ontology (IDOBRU) | +brucellosis | +
Cardiovascular Disease Ontology (CVDO) | +Cardiovascular | +
Chronic Kidney Disease Ontology (CKDO) | +Chronic kidney disease | +
Chronic Obstructive Pulmonary Disease Ontology (COPDO) | +Chronic obstructive pulmonary disease (COPD) | +
Coronavirus Infectious Disease Ontology (CIDO) | +Coronavirus infectious diseases | +
Diagnostic and Statistical Manual of Mental Disorders (DSM) | +Mental disorders | +
Dispedia Core Ontology (DCO) | +Rare diseases | +
Experimental Factor Ontology (EFO) | +Broad disease coverage | +
Fibrotic Interstitial Lung Disease Ontology (FILDO) | +Fibrotic interstitial lung disease | +
Genetic and Rare Diseases Information Center (GARD) | +Rare diseases | +
Holistic Ontology of Rare Diseases (HORD) | +Rare disease | +
Human Dermatological Disease Ontology (DERMO) | +Dermatology (skin) | +
Human Disease Ontology (DO) | +Human disease | +
Infectious Disease Ontology (IDO) | +Infectious disease | +
International Classification of Functioning, Disability and Health (ICF) | +Cross-discipline, focuses on disabilities | +
International Statistical Classification of Diseases and Related Health Problems (ICD-11) | +Broad coverage | +
International Classification of Diseases for Oncology (ICD-O) | +Cancer | +
Logical Observation Identifier Names and Codes (LOINC) | +Broad coverage | +
Medical Subject Headings (MeSH) | +Broad coverage | +
MedGen | +Human medical genetics | +
Medical Dictionary for Regulatory Activities (MedDRA) | +Broad coverage | +
Mental Disease Ontology (MDO) | +Mental functioning | +
Mondo Disease Ontology (Mondo) | +Broad coverage, Cross species | +
National Cancer Institute Thesaurus (NCIT) | +Human cancer and neoplasms | +
Neurological Disease Ontology (ND) | +Neurology | +
Online Mendelian Inheritance in Man (OMIM) | +Mendelian, genetic diseases. | +
Ontology of Cardiovascular Drug Adverse Events (OCVDAE) | +Cardiovascular | +
Ontology for General Medical Science (OGMS) | +Broad coverage | +
Ontology for Genetic Susceptibility Factor (OGSF) | +Genetic disease | +
Ontology of Glucose Metabolism Disorder (OGMD) | +Metabolic disorders | +
Ontology of Language Disorder in Autism (LDA) | +Autism | +
The Oral Health and Disease Ontology (OHD) | +Oral health and disease | +
Orphanet (ORDO) | +Rare diseases | +
Parkinson Disease Ontology (PDO) | +Parkinson disease | +
Pathogenic Disease Ontology (PDO) | +Pathogenic diseases | +
PolyCystic Ovary Syndrome Knowledgebase (PCOSKB) | +Polycystic ovary syndrome | +
Rat Disease Ontology (RDO) | +Broad coverage | +
Removable Partial Denture Ontology (RPDO) | +Oral health | +
Resource of Asian Primary Immunodeficiency Diseases (RPO) | +Immunodeficiencies | +
Sickle Cell Disease Ontology (SCDO) | +Sickle Cell Disease | +
SNOMED Clinical Terminology (SNOMED CT) | +Broad disease representation for human diseases. | +
Symptom Ontology | +Human diseases | +
Unified Medical Language System | +Broad coverage | +
Description: Contains findings, such as clinical signs, symptoms, laboratory test results, radiologic observations, tissue biopsy results, and intermediate diagnosis hypotheses, for the diagnosis of rheumatic diseases.
+Disease area: Rheumatic diseases
+Use Cases: Used by clinicians and informatics researchers.
+Website: https://bioportal.bioontology.org/ontologies/AI-RHEUM
+Open: Yes
Description: An ontology of autism spectrum disorder (ASD) and related neurodevelopmental disorders.
+Disease area: Autism
+Use Cases: It extends an existing autism ontology to allow automatic inference of ASD phenotypes and Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria based on subjects’ Autism Diagnostic Interview–Revised (ADI-R) assessment data.
+Website: https://bioportal.bioontology.org/ontologies/ADAR
+Open: Yes
Description: Encapsulates the ASD behavioral phenotype, informed by the standard ASD assessment instruments and the currently known characteristics of this disorder.
+Disease area: Autism
+Use Cases: Intended for use in research settings where extensive phenotypic data have been collected, allowing a concept-based approach to identifying behavioral features of importance and for correlating these with genotypic data.
+Website: https://bioportal.bioontology.org/ontologies/ASDPTO
+Open: Yes
Description: Describes the most common zoonotic disease, brucellosis, which is caused by Brucella, a type of facultative intracellular bacteria.
+Disease area: Brucellosis
+Use Cases: An extension ontology of the core Infectious Disease Ontology (IDO-core). This project appears to be inactive.
+Website: https://github.com/biomedontology/idobru
+Open: Yes
Description: An ontology to describe entities related to cardiovascular diseases.
+Disease area: Cardiovascular
+Use Cases: Describes entities related to cardiovascular diseases including the diseases themselves, the underlying disorders, and the related pathological processes. Imports upper level terms from OGMS and imports some terms from Disease Ontology (DO).
+GitHub repo: https://github.com/OpenLHS/CVDO/
+Website: https://github.com/OpenLHS/CVDO
+OBO Foundry webpage: http://obofoundry.org/ontology/cvdo.html
+Open: Yes
Description: An ontology of chronic kidney disease in primary care.
+Disease area: Chronic kidney disease
+Use Cases: CKDO was developed to assist routine data studies and case identification of CKD in primary care.
+Website: http://purl.bioontology.org/ontology/CKDO
+Open: Yes
Description: Models concepts associated with chronic obstructive pulmonary disease in routine clinical databases.
+Disease area: Chronic obstructive pulmonary disease (COPD)
+Use Cases: Clinical use.
+Website: https://bioportal.bioontology.org/ontologies/COPDO
+Open: Yes
Description: Aims to ontologically represent and standardize various aspects of coronavirus infectious diseases, including their etiology, transmission, epidemiology, pathogenesis, diagnosis, prevention, and treatment.
+Disease area: Coronavirus infectious diseases, including COVID-19, SARS, MERS; covers etiology, transmission, epidemiology, pathogenesis, diagnosis, prevention, and treatment.
+Use Cases: Used for disease annotations related to coronavirus infections.
+GitHub repo: https://github.com/cido-ontology/cido
+OBO Foundry webpage: http://obofoundry.org/ontology/cido.html
+Open: Yes
Description: Authoritative source to define and classify mental disorders to improve diagnoses, treatment, and research.
+Disease area: Mental disorders
+Use Cases: Used in clinical healthcare and research by psychiatrists and psychologists.
+Website: https://www.psychiatry.org/psychiatrists/practice/dsm
+Open: No, must be purchased
Description: A schema for information brokering and knowledge management in the complex field of rare diseases. DCO describes patients affected by rare diseases and records expertise about diseases in machine-readable form.
+Disease area: Rare disease
+Use Cases: DCO was initially created with amyotrophic lateral sclerosis as a use case.
+Website: http://purl.bioontology.org/ontology/DCO
+Open: Yes
Description: Provides a systematic description of many experimental variables available in EBI databases, and for projects such as the GWAS catalog.
+Disease area: Broad disease coverage, integrates the Mondo disease ontology.
+Use Cases: Application ontology built for European Bioinformatics Institute (EBI) tools and databases and the Open Targets Genetics Portal.
+Website: https://www.ebi.ac.uk/efo/
+Open: Yes
Description: An in-progress, four-tiered ontology proposed to standardize the diagnostic classification of patients with fibrotic interstitial lung disease.
+Disease area: Fibrotic interstitial lung disease
+Use Cases: Goal is to standardize the diagnostic classification of patients with fibrotic ILD. A paper was published in 2017 and an ontology is not publicly available.
+Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5803648/
+Open: No
Description: NIH resource that provides the public with access to current, reliable, and easy-to-understand information about rare or genetic diseases in English or Spanish.
+Disease area: Rare diseases
+Use Cases: Patient portal. Integrates definitions and synonyms from Orphanet, maps to HPO phenotypes, and is integrated into Mondo.
+Website: https://rarediseases.info.nih.gov/
+Open: Yes
Description: Describes the biopsychosocial state (i.e., disease, psychological, social, and environmental state) of persons with rare diseases in a holistic way.
+Disease area: Rare disease
+Use Cases: Used in Rehabilita (Disruptive Technologies for the Rehabilitation of the Future), a project that aims to make rehabilitation more personalized, ubiquitous and evidence-based.
+Website: http://purl.bioontology.org/ontology/HORD
+Open: Yes
Description: The most comprehensive dermatological disease ontology available, with over 3,500 classes available. There are 20 upper-level disease entities, with features such as anatomical location, heritability, and affected cell or tissue type.
+Disease area: Dermatology (skin)
+Use Cases: DermO can be used to extract data from patient electronic health records using text mining, or to translate existing variable-granularity coding such as ICD-10 to allow capture and standardization of patient/disease annotations.
+Website: https://bioportal.bioontology.org/ontologies/DERMO
+Open: Yes
Description: An ontology for describing the classification of human diseases organized by etiology.
+Disease area: Human disease terms, phenotype characteristics and related medical vocabulary disease concepts.
+Use Cases: Used by Model Organism Databases (MODs), such as Mouse Genome Informatics, for disease model annotations, and by the Alliance of Genome Resources for disease annotations. In 2018, DO tracked over 300 DO project citations, suggesting wide adoption and usage for disease annotations.
+GitHub repo: https://github.com/DiseaseOntology/HumanDiseaseOntology/
+Website: http://www.disease-ontology.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/doid.html
+Open: Yes
Description: A set of interoperable ontologies that will together provide coverage of the infectious disease domain. IDO core is the upper-level ontology that hosts terms of general relevance across the domain, while extension ontologies host terms specific to a particular part of the domain.
+Disease area: Infectious disease features, such as acute, primary, secondary infection, and chronic, hospital acquired and local infection.
+Use Cases: Does not seem active, has not been released since 2017.
+GitHub repo: https://github.com/infectious-disease-ontology/infectious-disease-ontology/
+Website: http://www.bioontology.org/wiki/index.php/Infectious_Disease_Ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/ido.html
+Open: Yes
Description: Represents diseases and provides a conceptual basis for the definition and measurement of health and disability as organized by patient-oriented outcomes of function and disability. ICF considers environmental factors as well as the relevance of associated health conditions in recognizing major models of disability.
+Disease area: Cross-discipline, focuses on health and disability
+Use Cases: ICF is the World Health Organization (WHO) framework for measuring health and disability at both individual and population levels. ICF was officially endorsed by the WHO as the international standard to describe and measure health and disability.
+Website: https://www.who.int/standards/classifications/international-classification-of-functioning-disability-and-health
+Open: Yes
Description: A medical classification list by the World Health Organization (WHO) that contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases.
+Disease area: Broad coverage of human disease features, such as disease of anatomical systems, infectious diseases, injuries, external causes of morbidity and mortality.
+Use Cases: The main purpose of ICD-11 is for clinical care, billing and coding for insurance companies.
+Website: https://www.who.int/standards/classifications/classification-of-diseases
+Open: Yes
Description: A domain-specific extension of the International Statistical Classification of Diseases and Related Health Problems for tumor diseases.
+Disease area: A multi-axial classification of the site, morphology, behaviour, and grading of neoplasms.
+Use Cases: Used principally in tumour or cancer registries for coding the site (topography) and the histology (morphology) of neoplasms, usually obtained from a pathology report.
+Website: https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology
+Open: Yes
Description: Identifies medical laboratory observations.
+Disease area: Broad coverage
+Use Cases: The Regenstrief Institute first developed LOINC in 1994 in response to the demand for an electronic database for clinical care and management. LOINC is publicly available at no cost and is endorsed by the American Clinical Laboratory Association and the College of American Pathologists. Since its inception, LOINC has expanded to include not just medical laboratory code names but also nursing diagnoses, nursing interventions, outcome classifications, and patient care data sets.
+Website: https://loinc.org/
+Open: Yes, registration is required.
Description: Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine.
+Disease area: Broad coverage
+Use Cases: It is used for indexing, cataloging, and searching of biomedical and health-related information. Integrated into Mondo.
+Website: https://meshb.nlm.nih.gov/search
+Open: Yes
Description: Organizes information related to human medical genetics, such as attributes of conditions and phenotypes of genetic contributions.
+Disease area: Human medical genetics
+Use Cases: MedGen is NCBI's portal to information about conditions and phenotypes related to Medical Genetics. Terms from the NIH Genetic Testing Registry (GTR), UMLS, HPO, Orphanet, ClinVar and other sources are aggregated into concepts, each of which is assigned a unique identifier and a preferred name and symbol. The core content of the record may include names, identifiers used by other databases, mode of inheritance, clinical features, and map location of the loci affecting the disorder. The concept identifier (CUI) is used to aggregate information about that concept, similar to the way NCBI Gene serves as a gateway to gene-related information.
+Website: https://www.ncbi.nlm.nih.gov/medgen/
+Open: Yes
Description: Provides a standardized international medical terminology to be used for regulatory communication and evaluation of data about medicinal products for human use.
+Disease area: Broad coverage
+Use Cases: Mainly targeted towards industry and regulatory users.
+Website: https://www.meddra.org/
+Open: Yes
Description: An ontology to describe and classify mental diseases such as schizophrenia, annotated with DSM-IV and ICD codes where applicable.
+Disease area: Mental functioning, including mental processes such as cognition and traits such as intelligence.
+Use Cases: The ontology has been partially aligned with the related projects Cognitive Atlas, knowledge base on cognitive science and the Cognitive Paradigm Ontology, which is used in the Brainmap, a database of neuroimaging experiments.
+GitHub repo: https://github.com/jannahastings/mental-functioning-ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/mfomd.html
+Open: yes
Description: An integrated disease ontology that provides precise mappings between source ontologies and comprehensively covers cross-species diseases, from common to rare.
+Disease area: Cross species, intended to cover all areas of diseases, integrating source ontologies that cover Mendelian diseases (OMIM), rare diseases (Orphanet), neoplasms (NCIt), human diseases (DO), and others. See all sources here.
+Use Cases: Mondo was developed for usage in the Monarch Initiative, a discovery system that allows navigation of similarities between phenotypes, organisms, and human diseases across many data sources and organisms. Mondo is also used by ClinGen for disease curations, the Kids First Data Resource Portal for disease annotations and others, see an extensive list here.
+GitHub repo: https://github.com/monarch-initiative/mondo
+Website: https://mondo.monarchinitiative.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/mondo.html
+Open: yes
Description: NCI Thesaurus (NCIt) is a reference terminology that includes broad coverage of the cancer domain, including cancer-related diseases, findings and abnormalities. The NCIt OBO Edition aims to increase integration of the NCIt with OBO Library ontologies. NCIt OBO Edition releases should be considered experimental.
+Disease area: Cancer and neoplasms
+Use Cases: NCI Thesaurus (NCIt) provides reference terminology for many National Cancer Institute and other systems. It is used by the Clinical Data Interchange Standards Consortium Terminology (CDISC), the U.S. Food and Drug Administration (FDA), the Federal Medication Terminologies (FMT), and the National Council for Prescription Drug Programs (NCPDP). It provides extensive coverage of neoplasms and cancers.
+GitHub repo: https://github.com/NCI-Thesaurus/thesaurus-obo-edition/issues
+Website: https://ncithesaurus.nci.nih.gov/ncitbrowser/pages/home.jsf?version=20.11e
+OBO Foundry webpage: http://obofoundry.org/ontology/ncit.html
+Open: Yes
Description: A framework for the representation of key aspects of neurological disease.
+Disease area: Neurology
+Use Cases: Goal is to provide a framework to enable representation of aspects of neurological diseases that are relevant to their treatment and study. This project may be inactive, the last commit to GitHub was in 2016.
+GitHub repo: https://github.com/addiehl/neurological-disease-ontology
+Open: Yes
Description: A comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily.
+Disease area: Mendelian, genetic diseases.
+Use Cases: Integrated into the disease ontology, used by the Human Phenotype Ontology for disease annotations, patients and researchers.
+Website: https://omim.org/
+Open: yes
Description: A biomedical ontology of cardiovascular drug–associated adverse events.
+Disease area: Cardiovascular
+Use Cases: One novel contribution of the OCVDAE project is the proportional class-level ratio ("PCR") method: an AE-specific drug class effect is defined to exist when all the drugs (drug chemical ingredients or drug products) in a drug class are associated with an AE, which is formulated as PCR = 1. See more information in the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5653862/. This project may be inactive; the last GitHub commit was in 2019.
+GitHub repo: https://github.com/OCVDAE/OCVDAE
+Website: https://bioportal.bioontology.org/ontologies/OCVDAE
+Open: yes
Description: An ontology of entities involved in a clinical encounter.
+Use Cases: Provides a formal theory of disease that can be further elaborated by specific disease ontologies. It is intended to be used as an upper-level ontology for other disease ontologies. Used by the Cardiovascular Disease Ontology.
+GitHub repo: https://github.com/OGMS/ogms
+OBO Foundry webpage: http://obofoundry.org/ontology/ogms.html
+Open: Yes
Description: An application ontology to represent genetic susceptibility to a specific disease, adverse event, or a pathological process.
+Use Cases: Modeling genetic susceptibility to vaccine adverse events.
+GitHub repo: https://github.com/linikujp/OGSF
+OBO Foundry webpage: http://obofoundry.org/ontology/ogsf.html
+Open: Yes
Description: Represents glucose metabolism disorder and diabetes disease names, phenotypes, and their classifications.
+Disease area: Metabolic disorders
+Use Cases: Still under development (the last version released in BioPortal was in 2021), but there is little information about its usage online.
+Website: https://bioportal.bioontology.org/ontologies/OGMD
+Open: Yes
Description: An ontology assembled from a set of language terms mined from the autism literature.
+Disease area: Autism
+Use Cases: This has not been released since 2008 and looks like it is inactive.
+Website: https://bioportal.bioontology.org/ontologies/LDA
+Open: Yes
Description: Represents the content of dental practice health records and is intended to be further developed for use in translational medicine. OHD is structured using BFO (Basic Formal Ontology) and uses terms from many ontologies, NCBITaxon, and a subset of terms from the CDT (Current Dental Terminology).
+Disease area: Oral health and disease
+Use Cases: Used to represent the content of dental practice health records and is intended to be further developed for use in translational medicine. Appears to be inactive.
+OBO Foundry webpage: http://www.obofoundry.org/ontology/ohd.html
+Open: Yes
Description: The portal for rare diseases and orphan drugs. Contains a structured vocabulary for rare diseases capturing relationships between diseases, genes, and other relevant features, jointly developed by Orphanet and the EBI.
+Disease area: Rare diseases
+Use Cases: Used by rare disease research and clinical community. Integrated into the Mondo disease ontology, aligned with OMIM.
+Website: https://www.orpha.net/consor/cgi-bin/index.php
+Open: Yes
Description: A comprehensive semantic framework with a subclass-based taxonomic hierarchy, covering the whole breadth of the Parkinson disease knowledge domain from major biomedical concepts to different views on disease features held by molecular biologists, clinicians, and drug developers.
+Disease area: Parkinson disease
+Use Cases: This resource has been created for use in the IMI-funded AETIONOMY project. Last release was in 2015, may be inactive.
+Website: https://bioportal.bioontology.org/ontologies/PDON
+Open: Yes
Description: Provides information on infectious diseases, disease synonyms, transmission pathways, disease agents, affected populations, and disease properties. Diseases are grouped into syndromic disease categories, organisms are structured hierarchically, and both disease transmission and relevant disease properties are searchable.
+Disease area: Human infectious diseases caused by microbes, and diseases related to microbial infection.
+Use Cases: Has not been released since 2016 and may be inactive.
+Website: https://bioportal.bioontology.org/ontologies/PDO
+Open: Yes.
Description: Comprises genes, single nucleotide polymorphisms, diseases, gene ontology terms, and biochemical pathways associated with polycystic ovary syndrome, a major cause of female subfertility worldwide.
+Disease area: polycystic ovary syndrome
+Use Cases: Ontology underlying the Polycystic Ovary Syndrome Knowledgebase, a manually curated knowledgebase on PCOS.
+Website: http://pcoskb.bicnirrh.res.in/go_d.php
+Open: Yes
Description: Provides the foundation for ten comprehensive disease area–related data sets at the Rat Genome Database Disease Portals.
+Disease area: Broad coverage including animal diseases, infectious diseases, chemically-induced disorders, occupational diseases, wounds and injuries and more.
+Use Cases: Developed for use with the Rat Genome Database Disease Portals.
+Website: https://rgd.mcw.edu/rgdweb/ontology/view.html?acc_id=DOID:4
+Open: Yes
Description: Represents knowledge of a patient’s oral conditions and denture component parts, originally developed to create a clinician decision support model.
+Disease area: Oral health and dentures
+Use Cases: A paper was published on this in 2016, but no other information about the ontology appears to be available; it is presumably an inactive project.
+Publication: https://www.nature.com/articles/srep27855
+Open: No
Description: Represents observed phenotypic terms, sequence variations, and messenger RNA and protein expression levels of all genes involved in primary immunodeficiency diseases.
+Disease area: Primary immunodeficiency diseases
+Use Cases: This terminology is used in a freely accessible, dynamic and integrated database for primary immunodeficiency diseases (PID) called Resource of Asian Primary Immunodeficiency Diseases (RAPID), which is available here.
+Publication: https://academic.oup.com/nar/article/37/suppl_1/D863/1004993
+Open: Yes
+Description: SCDO establishes (a) community-standardized sickle cell disease terms and descriptions, (b) canonical and hierarchical representation of knowledge on sickle cell disease, and (c) links to other ontologies and bodies of work.
+Disease area: Sickle Cell Disease (SCD).
+Use Cases: SCDO is intended to be a comprehensive collection of knowledge on SCD, to facilitate exploration of new scientific questions and ideas, to enable seamless data sharing and collaborations (including meta-analyses) within the SCD community, and to support the building of databases and clinical informatics tools for SCD.
+GitHub repo: https://github.com/scdodev/scdo-ontology/issues
+Website: https://scdontology.h3abionet.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/scdo.html
+Open: Yes
Description: A comprehensive clinical terminology/ontology used in healthcare settings.
+Disease area: Broad disease representation for human diseases.
+Use Cases: Main coding system used in Electronic Health Records (EHRs).
+Website: https://browser.ihtsdotools.org/?
+Open: No, requires a license for usage.
Description: An ontology of disease symptoms, with symptoms encompassing perceived changes in function, sensations or appearance reported by a patient indicative of a disease.
+Disease area: Human diseases
+Use Cases: Developed by the Disease Ontology (DO) team and used for describing symptoms of human diseases in the DO.
+Website: http://symptomontologywiki.igs.umaryland.edu/mediawiki/index.php/Main_Page
+OBO Foundry webpage: http://obofoundry.org/ontology/symp.html
+Open: Yes
Description: The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services.
+Disease area: Broad coverage
+Use Cases: Healthcare settings including electronic health records and HL7.
+Website: https://www.nlm.nih.gov/research/umls/index.html
+Open: Yes
Name | +Species Area | +
---|---|
Ascomycete phenotype ontology (APO) | +Ascomycota | +
C. elegans phenotype (wbphenotype) | +C elegans | +
Dictyostelium discoideum phenotype ontology (ddpheno) | +Dictyostelium discoideum | +
Drosophila Phenotype Ontology (DPO) | +Drosophila | +
Flora Phenotype Ontology (FLOPO) | +Viridiplantae | +
Fission Yeast Phenotype Ontology (FYPO) | +S. pombe | +
Human Phenotype Ontology (HPO) | +Human | +
HPO - ORDO Ontological Module (HOOM) | +Human | +
Mammalian Phenotype Ontology (MP) | +Mammals | +
Ontology of Microbial Phenotypes (OMP) | +Microbe | +
Ontology of Prokaryotic Phenotypic and Metabolic Characters | +Prokaryotes | +
Pathogen Host Interaction Phenotype Ontology | +pathogens | +
Planarian Phenotype Ontology (PLANP) | +Schmidtea mediterranea | +
Plant Trait Ontology (TO) | +Viridiplantae | +
Plant Phenology Ontology | +Plants | +
Unified Phenotype Ontology (uPheno) | +Cross-species coverage | +
Xenopus Phenotype Ontology (XPO) | +Xenopus | +
Zebrafish Phenotype Ontology (ZP) | +Zebrafish | +
Description: A structured controlled vocabulary for the phenotypes of Ascomycete fungi.
+Species: Ascomycota
+GitHub repo: https://github.com/obophenotype/ascomycete-phenotype-ontology/
+Webpage: http://www.yeastgenome.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/apo.html
+Open: Yes
Description: A structured controlled vocabulary of Caenorhabditis elegans phenotypes.
+Species: C elegans
+GitHub repo: https://github.com/obophenotype/c-elegans-phenotype-ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/wbphenotype.html
+Open: Yes
Description: A structured controlled vocabulary of phenotypes of the slime-mould Dictyostelium discoideum.
+Species: Dictyostelium discoideum
+GitHub repo: https://github.com/obophenotype/dicty-phenotype-ontology/issues
+Webpage: http://dictybase.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/ddpheno.html
+Open: Yes
Description: An ontology of commonly encountered and/or high level Drosophila phenotypes.
+Species: Drosophila
+GitHub repo: https://github.com/obophenotype/c-elegans-phenotype-ontology
+Webpage: http://purl.obolibrary.org/obo/fbcv
+OBO Foundry webpage: http://obofoundry.org/ontology/dpo.html
+Open: Yes
Description: Traits and phenotypes of flowering plants occurring in digitized Floras.
+Species: Viridiplantae
+GitHub repo: https://github.com/flora-phenotype-ontology/flopoontology/
+OBO Foundry webpage: http://obofoundry.org/ontology/flopo.html
+Open: Yes
Description: FYPO is a formal ontology of phenotypes observed in fission yeast.
+Species: S. pombe
+GitHub repo: https://github.com/pombase/fypo
+OBO Foundry webpage: http://obofoundry.org/ontology/fypo.html
+Open: Yes
Description: HPO provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality.
+Species: Human
+GitHub repo: https://github.com/obophenotype/human-phenotype-ontology
+Website: https://hpo.jax.org/app/
+OBO Foundry webpage: http://obofoundry.org/ontology/hp.html
+Open: yes
Description: Orphanet provides phenotypic annotations of the rare diseases in the Orphanet nomenclature using the Human Phenotype Ontology (HPO). HOOM is a module that qualifies the annotation between a clinical entity and phenotypic abnormalities according to a frequency and by integrating the notion of diagnostic criterion. In ORDO a clinical entity is either a group of rare disorders, a rare disorder or a subtype of disorder. The phenomes branch of ORDO has been refactored as a logical import of HPO, and the HPO-ORDO phenotype disease-annotations have been provided in a series of triples in OBAN format in which associations, frequency and provenance are modeled. HOOM is provided as an OWL (Web Ontology Language) file, using OBAN, the Orphanet Rare Disease Ontology (ORDO), and HPO ontological models. HOOM provides extra possibilities for researchers, pharmaceutical companies and others wishing to co-analyse rare and common disease phenotype associations, or re-use the integrated ontologies in genomic variants repositories or match-making tools.
+Species: Human
+Website: http://www.orphadata.org/cgi-bin/img/PDF/WhatIsHOOM.pdf
+BioPortal: https://bioportal.bioontology.org/ontologies/HOOM
+Open: yes
Description: Standard terms for annotating mammalian phenotypic data.
+Species: Mammals (main focus is on mouse and rodents)
+GitHub repo: https://github.com/obophenotype/mammalian-phenotype-ontology
+Website: http://www.informatics.jax.org/searches/MP_form.shtml
+OBO Foundry webpage: http://obofoundry.org/ontology/mp.html
+Open: Yes
Description: An ontology of phenotypes covering microbes.
+Species: microbes
+GitHub repo: https://github.com/microbialphenotypes/OMP-ontology
+Website: http://microbialphenotypes.org
+OBO Foundry webpage: http://obofoundry.org/ontology/omp.html
+Open: Yes
Description: An ontology of phenotypes covering microbes.
+Species: Prokaryotes
+GitHub repo: https://github.com/microbialphenotypes/OMP-ontology/issues
+Website: http://microbialphenotypes.org/
+OBO Foundry webpage: http://obofoundry.org/ontology/omp.html
+Open: Yes
Description: PHIPO is a formal ontology of species-neutral phenotypes observed in pathogen-host interactions.
+Species: pathogens
+GitHub repo: https://github.com/PHI-base/phipo
+Website: http://www.phi-base.org
+OBO Foundry webpage: http://obofoundry.org/ontology/phipo.html
+Open: Yes
Description: Planarian Phenotype Ontology is an ontology of phenotypes observed in the planarian Schmidtea mediterranea.
+Species: Schmidtea mediterranea
+GitHub repo: https://github.com/obophenotype/planarian-phenotype-ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/planp.html
+Open: Yes
Description: A controlled vocabulary to describe phenotypic traits in plants.
+Species: Viridiplantae
+GitHub repo: https://github.com/Planteome/plant-trait-ontology/
+OBO Foundry webpage: http://obofoundry.org/ontology/to.html
+Open: Yes
Description: An ontology for describing the phenology of individual plants and populations of plants, and for integrating plant phenological data across sources and scales.
+Species: Plants
+GitHub repo: https://github.com/PlantPhenoOntology/PPO
+OBO Foundry webpage: http://obofoundry.org/ontology/ppo.html
+Open: Yes
Description: The uPheno ontology integrates multiple phenotype ontologies into a unified cross-species phenotype ontology.
+Species: Cross-species coverage
+GitHub repo: https://github.com/obophenotype/upheno
+OBO Foundry webpage: http://obofoundry.org/ontology/upheno.html
+Open: Yes
Description: XPO represents anatomical, cellular, and gene function phenotypes occurring throughout the development of the African frogs Xenopus laevis and tropicalis.
+Species: Xenopus
+GitHub repo: https://github.com/obophenotype/xenopus-phenotype-ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/xpo.html
+Open: Yes
Description: The Zebrafish Phenotype Ontology formally defines all phenotypes of the Zebrafish model organism.
+Species: Zebrafish
+GitHub repo: https://github.com/obophenotype/zebrafish-phenotype-ontology
+OBO Foundry webpage: http://obofoundry.org/ontology/zp.html
+Open: Yes
An index page to find some of our favourite articles on Chris' blog. +This is not a complete list of his articles, but a selection we found useful in our everyday work.
+OntoTips Series. Must read series for the beginning ontology developer.
+Warning about complex modelling. Chris is generally big on Occam's Razor solutions: given two solutions that solve a use case, the simpler is better.
+OntoTip: Don’t over-specify OWL definitions. From the above OntoTip series.
+Some resources on OBOOK are less well developed than others. We use the OBOOK Maturity Indicator to document this (discussion).
+To add a status badge onto a site, simply paste a badge like this right under the title:
+<a href="https://oboacademy.github.io/obook/reference/obook-maturity-indicator/"><img src="https://img.shields.io/endpoint?url=https%3A%2F%2Fraw.githubusercontent.com%2FOBOAcademy%2Fobook%2Fmaster%2Fdocs%2Fresources%2Fobook-badge-final.json" /></a>
+
The ODK is essentially two things:
The ODK bundles a lot of tools together, such as ROBOT, owltools, fastobo-validator and dosdp-tools. To get a better idea, it's best to simply read the Dockerfile specifications of the ODK image:
+One of the tools in the toolbox, the "seed my repo" function, allows us to generate a complete GitHub repository with everything needed to manage an OBO ontology according to OBO best practices. The two central components are
+Schema can be found in ODK documentation here
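As an illustration of the "seed my repo" function, a typical invocation might look like the sketch below. This is a hedged example only: the config file name is a placeholder and the exact flags (here assumed to be -c for a clean run and -C for the configuration file) should be checked against the ODK documentation.

```sh
# Hedged sketch: generate a complete, ready-to-use ontology repository from a
# project configuration file using the seed script that wraps the ODK Docker image.
# Flag names are assumptions - consult the ODK docs for the authoritative usage.
./seed-via-docker.sh -c -C project.yaml
```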
+ + + + + + +Here's a collection of links about the Open Biological and Biomedical Ontologies (OBO), +and related topics.
+If you're completely new to OBO, +I suggest starting with Ontologies 101:
+If you're new to scientific computing more generally, +then I strongly recommend Software Carpentry, +which provides a set of very pragmatic introductions to +the Unix command line, git, Python, Make, +and other tools widely used by OBO developers.
OBO is a community of people collaborating on open source ontologies for science. +We have a set of shared principles and best practices +to help people and data work together effectively.
Here is a very incomplete list of some excellent services +to help you find and use OBO terms and ontologies.
+This is the suite of open source software that most OBO developers use.
+This section is for technical reference, not beginners.
+OBO projects use Semantic Web and Linked Data technologies:
+ +These standards form layers:
+Other useful resources on technical topics:
+Nicole Vasilevsky, James Overton, Rebecca Jackson, Sabrina Toro, Shawn Tan, Bradley Varner, David Osumi-Sutherland, & Nicolas Matentzoglu. (2022, August 3). OBO Academy: Training materials for bio-ontologists. 2022 ISMB Bio-Ontologies Community, Madison, WI. https://doi.org/10.5281/zenodo.6955490
+Available here. Please feel free to use this slide deck to promote the OBO Academy.
+To add an ontology term (such as a GO term) that contains '
in its name (e.g. RNA-directed 5'-3' RNA polymerase activity
) in the class expression editor, you need to escape the '
characters. In Protégé 5.5 this is not automatically handled when you auto-complete with tab. To escape the character, append \
before the '
-> RNA-directed 5\'-3\' RNA polymerase activity
. You won't be able to add the annotation otherwise.
In Protégé 5.5, the \
characters will show up in the description window, and you won't get a clickable link when hovering over the term. However, when you save the file, the relationship is saved correctly. You can double-check by opening the ontology text file and confirming that the term appears correctly in the relationship.
For this reference, we will use the cell ontology to highlight the key information on the user interface in Protege
+'+' button (not shown above) = add +'?' button = explain axiom +'@' button = annotate +'x' button = remove +'o' button = edit
+When you open the ontology on protege, you should land on the Active ontology tab, alternatively, it is available on the top as one of your tabs.
+ +Annotations on the active ontology tab are ontology level annotations and contain metadata about the ontology. +This includes:
+Entities are where your "entries" in the ontology live and where you can add terms etc.
+ + + + + + + +A quick personal perspective up-front. When I was finishing my undergrad, I barely had heard the term Semantic Web. What I had heard vaguely intrigued me, so I decided that for my final project, I would try to combine something Semantic Web related with my other major, Law and build a tool that could automatically infer the applicability of a law (written in OWL) given a legal case. Super naively, I just went went ahead, read a few papers about legal ontologies, build a simple one, loaded it into my application and somehow got it to work, with reasoning and all, without even having heard of Description Logic.
+In my PhD, I worked on actual reasoning algorithms, which meant, no more avoiding logic. But - I did not get it. Up until this point in my life, I could just study harder and harder, and in the end I was confident with what I learned, but First Order Logic, in particular model theory and proofs, caused me anxiety until the date of my viva. In the end, a very basic understanding of model theory and Tableau did help me with charactering the algorithms I was working with (I was studying the effect of modularity, cutting out logically connected subsets of an ontology, on reasoning performance) but I can confidently say today: I never really, like deeply, understood logical proofs. I still cant read them - and I have a PhD in Reasoning (albeit from an empirical angle).
+If you followed the Open HPI courses on logic, and you are anything like me, your head will hurt and you will want to hide under your blankets. Most students feel like that. For a complete education in Semantic Web technologies, going through this part once is essential: it tells you something about how difficult some stuff is under the hood, and how much work has been done to make something like OWL work for knowledge representation. You should have gained some appreciation of the domain, which is no less complex than Machine Learning or Stochastic Processes. But, in my experience, some of the most effective ontology engineers barely understand reasoning - definitely have no idea how it works - and still do amazing work. In that spirit, I would like to invite you at this stage to put logic and reasoning behind you (unless it made you curious of course) - you won't need to know much of that for being an effective Semantic Engineer. In the following, I will summarise some of the key take-aways that I find useful to keep in mind.
+Human SubClassOf: Mammal
means that all instances of the Human
class, like me, are also instances of the Mammal
class. Or, in other words, from the statements:Human SubClassOf: Mammal
+Nico type: Human
+
Semantics allow as to deduce that Nico:Mammal
. What are semantics practically? Show me your semantics? Look at something like the OWL semantics. In there, you will find language statements (syntax) like X SubClassOf: Y
and a bunch of formulae from model theory that describe how to interpret it - no easy read, and not really important for you now.
When we want reasoners to be faster at making inferences (computational complexity), we need to decrease expressivity. + So we need to find a way to balance.
+What are the most important practical applications of reasoning? There are many, and there will be many opinions, but in the OBO world, by far (95%) of all uses of reasoners pertain to the following:
+inconsistent
- which means, totally broken. A slightly less bad, but still undesirable situation is that some of the classes in your ontologies break (in parlance, become unsatisfiable). This happens when you say some contradictory things about them. Reasoners help you find these unsatisfiable classes, and there is a special reasoning algorithm that can generate an explanation for you - to help fixing your problem.So in general, what is reasoning? There are probably a dozen or more official characterisations in the scientific literature, but from the perspective of biomedical ontologies, the question can be roughly split like this:
+How can we capture what we know? This is the (research-) area of knowledge representation, logical formalisms, such as First Order Logic, Description Logic, etc. It is concerned with how we write down what we now:
+All cars have four wheels
+If you are a human, you are also a mammal
+If you are a bird, you can fly (unless you are a penguin)
+
Lets think about a naive approach: using a fact-, or data-, base.
+ + + + + + +For explanation of different release artefacts, please see discussion documentation on owl format variants
+We made a first stab add defining release artefacts that should cover all use cases community-wide. We need to (1) agree they are all that is needed and (2) they are defined correctly in terms of ROBOT commands. This functionality replaces what was previously done using OORT.
+The source ontology is the ontology we are talking about. A release artefact is a version of the ontology modified in some specific way, intended for public use. An import is a module of an external ontology which contains all the axioms necessary for the source ontology. A component is a file containing axioms that belong to the source ontology (but are for one reason or another, like definitions.owl, managed in a separate file). An axiom is said to be foreign if it 'belongs' to a different ontology, and native if it belongs to the source ontology. For example, the source ontology might have, for one reason or another, been physically asserted (rather than imported) the axiom TransitiveObjectProperty(BFO:000005). If the source ontology does not 'own' the BFO namespace, this axiom will be considered foreign.
+There are currently 6 release defined in the ODK:
+We discuss all of them here in detail.
+The base file contains all and only native axioms. No further manipulation is performed, in particular no reasoning, redundancy stripping or relaxation. This release artefact is going to be the new backbone of the OBO strategy to combat incompatible imports and consequent lack of interoperability. (Detailed discussions elsewhere, @balhoff has documentation). Every OBO ontology will contain a mandatory base release (should be in the official OBO recommendations as well).
+The ROBOT command generating the base artefact: +$(SRC): source ontology +$(OTHER_SRC): set of component ontologies
+$(ONT)-base.owl: $(SRC) $(OTHER_SRC)
+ $(ROBOT) remove --input $< --select imports --trim false \
+ merge $(patsubst %, -i %, $(OTHER_SRC)) \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@
+
The full release artefact contains all logical axioms, including inferred subsumptions. Redundancy stripping (i.e. redundant subclass of axioms) and typical relaxation operations are performed. All imports and components are merged into the full release artefact to ensure easy version management. The full release represents most closely the actual ontology as it was intended at the time of release, including all its logical implications. Every OBO ontology will contain a mandatory full release.
+The ROBOT command generating the full artefact: +$(SRC): source ontology +$(OTHER_SRC): set of component ontologies
+$(ONT)-full.owl: $(SRC) $(OTHER_SRC)
+ $(ROBOT) merge --input $< \
+ reason --reasoner ELK \
+ relax \
+ reduce -r ELK \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@
+
The non-classified release artefact reflects the 'unmodified state' of the editors file at release time. No operations are performed that modify the axioms in any way, in particular no redundancy stripping. As opposed to the base artefact, both component and imported ontologies are merged into the non-classified release.
The ROBOT command generating the non-classified artefact: +$(SRC): source ontology +$(OTHER_SRC): set of component ontologies
+$(ONT)-non-classified.owl: $(SRC) $(OTHER_SRC)
+ $(ROBOT) merge --input $< \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@
+
Many users want a release that can be treated as a simple existential graph of the terms defined in an ontology. This corresponds to the state of OBO ontologies before logical definitions and imports. For example, the only logical axioms in the -simple release of CL will be of the form CL1 subClassOf CL2
or CL1 subClassOf R some CL3
where R is any objectProperty and CLn is a CL class. This role has been fulfilled by the -simple artefact, which up to now has been supported by OORT.
To construct this, we first need to assert inferred classifications, relax equivalentClass axioms to sets of subClassOf axioms, and then strip all axioms referencing foreign (imported) classes. As ontologies occasionally end up with foreign classes and axioms merged into the editors file, we achieve this with a filter based on OBO namespace (e.g. finding all terms with an IRI matching http://purl.obolibrary.org/obo/CL_\d{7}).
The ROBOT command generating the simple artefact: +$(SRC): source ontology +$(OTHER_SRC): set of component ontologies +$(SIMPLESEED): all terms that 'belong' to the ontology
+$(ROBOT) merge --input $< $(patsubst %, -i %, $(OTHER_SRC)) \
+ reason --reasoner {{ project.reasoner }} --equivalent-classes-allowed {{ project.allow_equivalents }} \
+ relax \
+ remove --axioms equivalent \
+ relax \
+ filter --term-file $(SIMPLESEED) --select "annotations ontology anonymous self" --trim true --signature true \
+ reduce -r {{ project.reasoner }} \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@.tmp.owl && mv $@.tmp.owl $@
+
NOTES: This requires $(ONTOLOGYTERMS) to include all ObjectProperties used. --select parents
is required for logical axioms to be retained, but results in a few upper-level classes bleeding through. We hope this will be fixed by further improvements to Monarch.
Some legacy users (e.g. MGI) require an OBO DAG version of -simple. OBO files derived from OWL are not guaranteed to be acyclic, but acyclic graphs can be achieved using judicious filtering of relationships (simple existential restrictions) by objectProperty. The -basic release artefact has historically fulfilled this function as part of OORT-driven ontology releases. The default -basic version corresponds to the -simple artefact with only 'part of' relationships (BFO:0000050), but others may be added where ontology editors judge these to be useful and safe to add without introducing cycles. We generate it by taking the simple release and filtering it.
The ROBOT command generating the basic artefact: +$(SRC): source ontology +$(OTHER_SRC): set of component ontologies +$(KEEPRELATIONS): all relations that should be preserved. +$(SIMPLESEED): all terms that 'belong' to the ontology
+$(ROBOT) merge --input $< $(patsubst %, -i %, $(OTHER_SRC)) \
+ reason --reasoner {{ project.reasoner }} --equivalent-classes-allowed {{ project.allow_equivalents }} \
+ relax \
+ remove --axioms equivalent \
+ remove --axioms disjoint \
+ remove --term-file $(KEEPRELATIONS) --select complement --select object-properties --trim true \
+ relax \
+ filter --term-file $(SIMPLESEED) --select "annotations ontology anonymous self" --trim true --signature true \
+ reduce -r {{ project.reasoner }} \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@.tmp.owl && mv $@.tmp.owl $@
+
This artefact caters to the very special and hopefully transient case of some ontologies that do not yet trust reasoning (MP, HP). The simple-non-classified artefact corresponds to the simple artefact, just without the reasoning step.
+$(SRC): source ontology +$(OTHER_SRC): set of component ontologies +$(ONTOLOGYTERMS): all terms that 'belong' to the ontology
+$(ONT)-simple-non-classified.owl: $(SRC) $(OTHER_SRC) $(ONTOLOGYTERMS)
+ $(ROBOT) remove --input $< --select imports \
+ merge $(patsubst %, -i %, $(OTHER_SRC)) \
+ relax \
+ reduce -r ELK \
+ filter --term-file $(ONTOLOGYTERMS) --trim true \
+ annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@
+
Essentials
+Automation
+make
, managed by git
.Text editors:
+ +SPARQL query tool:
+SPARQL endpoints
+Templating systems
+Ontology Mappings
+Where to find ontologies and terms: Term browsers and ontology repositories
+Ontology visualisation
+Dot
.Other tools in my toolbox
+These are a bit less essential than the above, but I consider them still tremendously useful.
+make
, managed by git
.Semantic Data Engineering or Semantic Extract-Transform-Load (ETL) is an engineering discipline that is concerned with extracting information from a variety of sources, linking it together into a knowledge graph and enabling a range of semantic analyses for downstream users such as data scientists or researchers.
+The following glossary only says how we use the terms we are defining, not how they are defined by some higher authority.
+Term | +Definition | +Example | +
---|---|---|
Entity | +An entity is a thing in the world, like a molecule, or something more complex, like a disease. Entities do not have to be material, they can be processes as well, like cell proliferation. | +Marfan syndrome, H2O molecule, Ring finger, Phone | +
Term | +A term is a sequence of characters (string) that refers to an entity in a precise way. | +SMOKER (referring to the role of being a smoker), HP:0004934 (see explanations below) | +
Relation | +A link between two (or more) entities that signifies some kind of interaction. | +:A :loves :B , :smoking :causes :cancer |
+
Property | +A type of relation. | +The :causes in :smoking :causes :cancer |
+
As a Semantic Engineer, you typically coordinate the data collection from three largely separate sources: +1. Unstructured text, for example a corpus of scientific literature +2. External biological databases, such as STRING, a database of Protein-Protein Interaction Networks. +3. Manual in-house bio-curation efforts, i.e. the manual translation and integration of information relevant to biology (or medicine) into a database.
+Here, we are mostly concerned with the automated approaches of Semantic ETL, so we briefly touch on these and provide pointers to the others.
+The task of information extraction is concerned with extracting information from unstructured textual sources to enable identifying entities, like diseases, phenotypes and chemicals, as well as classifying them and storing them in a structured format.
+The discipline that is concerned with techniques for extracting information from text is called Natural Language Processing (NLP).
+NLP is a super exciting and vast engineering discipline which goes beyond the scope of this course. NLP is concerned with many problems such as document classification, speech recognition and language translation. In the context of information extraction, we are particularly interested in Named Entity Recognition (NER), and Relationship Extraction (ER).
+Named Entity Recognition (NER) is the task of identifying and categorising entities in text. NER tooling provides functionality to first isolate parts of sentence that correspond to things in the world, and then assigning them to categories (e.g. Drug, Disease, Publication).
+For example, consider this sentence:
+As in the X-linked Nettleship-Falls form of ocular albinism (300500), the patients showed reduced visual acuity, photophobia, nystagmus, translucent irides, strabismus, hypermetropic refractive errors, and albinotic fundus with foveal hypoplasia.
+
An NER tool would first identify the relevant sentence parts that belong together:
+As in the [X-linked] [Nettleship-Falls] form of [ocular albinism] (300500), the patients showed [reduced visual acuity], [photophobia], [nystagmus], [translucent irides], [strabismus], [hypermetropic refractive errors], and [albinotic fundus] with [foveal hypoplasia].
+
And then categorise them according to some predefined categories:
+As in the Phenotype[X-linked] [Nettleship-Falls] form of Disease[ocular albinism] (300500), the patients showed Phenotype[reduced visual acuity], Phenotype[photophobia], Phenotype[nystagmus], Phenotype[translucent irides], Phenotype[strabismus], Phenotype[hypermetropic refractive errors], and Phenotype[albinotic fundus] with Phenotype[foveal hypoplasia].
+
Interesting sources for further reading:
+ +Relationship extraction (RE) is the task of extracting semantic relationships from text. +RE is an important component for the construction of Knowledge Graphs from the Scientific Literature, a task that many Semantic Data Engineering projects pursue to augment or inform their manual curation processes.
+Interesting sources for further reading:
+There is a huge amount of literature and tutorials on the topic of integrating data, the practice of consolidating data from disparate sources into a single dataset. We want to emphasise here two aspects of data integration, which are of particular importance to the Semantic Data engineer.
+Entity resolution (ER), sometimes called "record linking", is the task of disambiguating records that correspond to real world entities across and within datasets. This task as many dimensions, but for us, the most important one is mapping a string, for example the one that was matched by our Named Entity Recognition pipeline, to ontology terms.
+Given our example:
+As in the Phenotype[X-linked] Nettleship-Falls form of Phenotype[ocular albinism] (300500), the patients showed Phenotype[reduced visual acuity], Phenotype[photophobia], Phenotype[nystagmus], Phenotype[translucent irides], Phenotype[strabismus], Phenotype[hypermetropic refractive errors], and Phenotype[albinotic fundus] with Phenotype[foveal hypoplasia].
+
We could end up, for example, resolving ocular albinism to HP:0001107.
+There are a lot of materials about Entity Resolution in general: +- https://www.districtdatalabs.com/basics-of-entity-resolution +- https://www.sciencedirect.com/topics/computer-science/entity-resolution
+In effect the term Ontology Mapping, which is the focus of this lesson, is Entity Resolution for ontologies - usually we don't have problem to use the two terms synonymously, although you may find that the literature typically favours one or the other.
+Knowledge, Knowledge Graph or Ontology Merging are the disciplines concerned with combining all your data sources into a semantically coherent whole. This is a very complex research area, in particular to do this in a way that is semantically consistent. There are essentially two separate problems to be solved to achieve semantic merging:
+1. The entities aligned during the entity resolution process must be aligned in the semantically correct way: if you you use logical equivalence to align them (owl:equivalentClasses
) the classes must mean absolutely the same thing, or else you may run into the hairball problem, in essence faulty equivalence cliques. In cases of close, narrow or broadly matching classes, the respective specialised semantically correct relationships need to be used in the merging process.
+2. The axioms of the merged ontologies must be logically consistent. For example, one ontology may say: a disease is a material entity. Another: a disease is a process. A background, or upper, ontology such as the ubiquitous Basic Formal Ontology (BFO) furthermore says that a process is not a material entity and vice versa. Merging this two ontologies would cause logical inconsistency.
Unfortunately, the literature on ontology and knowledge graph merging is still sparse and very technical. You are probably best off checking out the OpenHPI course on Ontology Alignment, which is closely related.
+ + + + + + +A basic SELECT query contains a set of prefixes, a SELECT clause and a WHERE clause.
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+SELECT ?term ?value
+WHERE {
+ ?term rdfs:label ?value .
+}
+
Prefixes allow you to specify shortcuts. For example, instead of using the prefixes above, you could have simply said:
+SELECT ?term ?value
+WHERE {
+ ?term <http://www.w3.org/2000/01/rdf-schema#label> ?value .
+}
+
This version, without the prefix declaration, means exactly the same thing - the prefixed form just looks nicer. Some people even go as far as adding entire entities to the prefix header:
+PREFIX label: <http://www.w3.org/2000/01/rdf-schema#label>
+
+SELECT ?term ?value
+WHERE {
+ ?term label: ?value .
+}
+
This query is, again, the same as the ones above, but even more concise.
The SELECT clause defines which part of your query results you want to show, for example, as a table.
+SELECT ?term ?value
+
means: "return" or "show" whatever you find for the variable ?term
and the variable ?value
.
There are other cool things you can do in the SELECT clause:
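For example, you can ask for DISTINCT rows only, compute aggregates such as COUNT, and name computed columns with AS (in combination with GROUP BY, HAVING and ORDER BY). A small illustrative query, not tied to any particular ontology, that reports terms carrying more than one label:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# count the labels of each term and keep only terms with more than one
SELECT ?term (COUNT(DISTINCT ?value) AS ?labelCount)
WHERE {
  ?term rdfs:label ?value .
}
GROUP BY ?term
HAVING (COUNT(DISTINCT ?value) > 1)
ORDER BY DESC(?labelCount)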
+This document contains template SPARQL queries that can be adapted.
+Comments are added in-code with #
above each step to explain them so that queries can be spliced together
note: we assume that all native terms here have the same namespace - that of the ontology
+# select unique instances of the variable
+SELECT DISTINCT ?term
+WHERE {
+ # selecting where the variable term is either used as a subject or object
+ { ?s1 ?p1 ?term . }
+ UNION
+ { ?term ?p2 ?o2 . }
+ # keeping only terms that are in the MONDO namespace (assumed to be native terms)
+ FILTER(isIRI(?term) && (STRSTARTS(str(?term), "http://purl.obolibrary.org/obo/MONDO_")))
+}
+
# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+prefix BFO: <http://purl.obolibrary.org/obo/BFO_>
+
+# selecting only unique instances of the two variables
+SELECT DISTINCT ?entity ?label WHERE
+{
+ # the variable property is restricted to rdfs:label
+ VALUES ?property {
+ rdfs:label
+ }
+
+ # only look for uberon terms. note: rdfs:isDefinedBy works in Ubergraph; for a local ontology file, use a FILTER on the IRI instead.
+ ?entity rdfs:isDefinedBy <http://purl.obolibrary.org/obo/uberon.owl> .
+
+ # defining the order of variables in the triple
+ ?entity ?property ?label .
+ # entity must be material
+ ?entity rdfs:subClassOf BFO:0000040
+ # filtering out triples where the variable label has sulcus or incisure, or fissure in it
+ FILTER(contains(STR(?label), "sulcus")||contains(STR(?label), "incisure")||contains(STR(?label), "fissure"))
+
+}
+# arrange report by entity variable
+ORDER BY ?entity
+
prefix label: <http://www.w3.org/2000/01/rdf-schema#label>
+prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+prefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>
+prefix owl: <http://www.w3.org/2002/07/owl#>
+
+# select a report with 3 variables
+SELECT DISTINCT ?term ?label ?def
+
+# defining the properties to be used
+ WHERE {
+ VALUES ?defproperty {
+ definition:
+ }
+ VALUES ?labelproperty {
+ label:
+ }
+
+# defining the order of the triples
+ ?term ?defproperty ?def .
+ ?term ?labelproperty ?label .
+
+# selects entities that are in a certain namespace
+ FILTER(isIRI(?term) && (STRSTARTS(str(?term), "http://purl.obolibrary.org/obo/CP_")))
+}
+
+# arrange report by term variable
+ORDER BY ?term
+
adaptable for finding terms lacking a particular annotation
+# adding prefixes used
+prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+prefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>
+prefix owl: <http://www.w3.org/2002/07/owl#>
+
+SELECT ?entity ?property ?value WHERE
+{
+ # the variable property has to be definition (IAO:0000115)
+ VALUES ?property {
+ definition:
+ }
+ # defining the order of variables in the triple
+ ?entity ?property ?value .
+
+ # selecting annotation on definition
+ ?def_anno a owl:Axiom ;
+ owl:annotatedSource ?entity ;
+ owl:annotatedProperty definition: ;
+ owl:annotatedTarget ?value .
+
+ # filters out definitions which do not have a dbxref annotation
+ FILTER NOT EXISTS {
+ ?def_anno oboInOwl:hasDbXref ?x .
+ }
+
+ # removes triples where entity is blank
+ FILTER (!isBlank(?entity))
+ # selects entities that are native to ontology (in this case MONDO)
+ FILTER (isIRI(?entity) && STRSTARTS(str(?entity), "http://purl.obolibrary.org/obo/MONDO_"))
+
+}
+# arrange report by entity variable
+ORDER BY ?entity
+
adaptable for checking whether a particular character occurs in an annotation
+# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+prefix IAO: <http://purl.obolibrary.org/obo/IAO_>
+prefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>
+
+# selecting only unique instances of the three variables
+SELECT DISTINCT ?entity ?property ?value WHERE
+{
+ # the variable property has to be definition (IAO:0000115)
+ VALUES ?property {
+ definition:
+ }
+ # defining the order of variables in the triple
+ ?entity ?property ?value .
+ # filtering out triples where the variable value has _ in it
+ FILTER( regex(STR(?value), "_"))
+ # removes triples where entity is blank
+ FILTER (!isBlank(?entity))
+}
+# arrange report by entity variable
+ORDER BY ?entity
+
# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+prefix IAO: <http://purl.obolibrary.org/obo/IAO_>
+prefix RO: <http://purl.obolibrary.org/obo/RO_>
+prefix mondo: <http://purl.obolibrary.org/obo/mondo#>
+prefix skos: <http://www.w3.org/2004/02/skos/core#>
+prefix dce: <http://purl.org/dc/elements/1.1/>
+prefix dcterms: <http://purl.org/dc/terms/>
+
+# selecting only unique instances of the three variables
+SELECT DISTINCT ?term ?property ?value WHERE
+{
+ # order of the variables in the triple
+ ?term ?property ?value .
+ # the variable property is an annotation property
+ ?property a owl:AnnotationProperty .
+ # selects entities that are native to ontology (in this case MONDO)
+ FILTER (isIRI(?term) && regex(str(?term), "^http://purl.obolibrary.org/obo/MONDO_"))
+ # removes triples where the variable value is blank
+ FILTER(!isBlank(?value))
+ # listing the allowed annotation properties
+ FILTER (?property NOT IN (dce:creator, dce:date, IAO:0000115, IAO:0000231, IAO:0100001, mondo:excluded_subClassOf, mondo:excluded_from_qc_check, mondo:excluded_synonym, mondo:pathogenesis, mondo:related, mondo:confidence, dcterms:conformsTo, mondo:should_conform_to, oboInOwl:consider, oboInOwl:created_by, oboInOwl:creation_date, oboInOwl:hasAlternativeId, oboInOwl:hasBroadSynonym, oboInOwl:hasDbXref, oboInOwl:hasExactSynonym, oboInOwl:hasNarrowSynonym, oboInOwl:hasRelatedSynonym, oboInOwl:id, oboInOwl:inSubset, owl:deprecated, rdfs:comment, rdfs:isDefinedBy, rdfs:label, rdfs:seeAlso, RO:0002161, skos:broadMatch, skos:closeMatch, skos:exactMatch, skos:narrowMatch))
+}
+
adaptable for checking that a property is used in a certain way
+# adding prefixes used
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+PREFIX replacedBy: <http://purl.obolibrary.org/obo/IAO_0100001>
+
+# selecting only unique instances of the three variables
+SELECT DISTINCT ?entity ?property ?value WHERE {
+ # the variable property is IAO_0100001 (item replaced by)
+ VALUES ?property { replacedBy: }
+
+ # order of the variables in the triple
+ ?entity ?property ?value .
+ # removing entities that have either owl:deprecated true or oboInOwl:ObsoleteClass (these entities are the only ones that should have replaced_by)
+ FILTER NOT EXISTS { ?entity owl:deprecated true }
+ FILTER (?entity != oboInOwl:ObsoleteClass)
+}
+# arrange report by entity variable
+ORDER BY ?entity
+
# this query counts the number of classes you have with each prefix (eg number of MONDO terms, CL terms, etc.)
+
+# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix obo: <http://purl.obolibrary.org/obo/>
+
+# selecting 2 variables, prefix and numberOfClasses, where number of classes is a count of distinct cls
+SELECT ?prefix (COUNT(DISTINCT ?cls) AS ?numberOfClasses) WHERE
+{
+ # the variable cls is a class
+ ?cls a owl:Class .
+ # removes any cases where the variable cls is blank
+ FILTER (!isBlank(?cls))
+ # Binds the variable prefix as the prefix of the class (eg. MONDO, CL, etc.). classes that do not have obo purls will come out as blank in the report.
+ BIND( STRBEFORE(STRAFTER(str(?cls),"http://purl.obolibrary.org/obo/"), "_") AS ?prefix)
+}
+# grouping the count by prefix
+GROUP BY ?prefix
+
# this query counts the number of classes that are subclass of CL:0000003 (native cell) that are in the pcl namespace
+
+# adding prefixes used
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX CL: <http://purl.obolibrary.org/obo/CL_>
+PREFIX PCL: <http://purl.obolibrary.org/obo/PCL_>
+
+# count the number of unique terms
+SELECT (COUNT (DISTINCT ?term) as ?pclcells)
+WHERE {
+ # the variable term is a class
+ ?term a owl:Class .
+ # the variable term has to be a subclass of CL:0000003, including those that are subclassof by property path
+ ?term rdfs:subClassOf* CL:0000003
+ # only count the term if it is in the pcl namespace
+ FILTER(isIRI(?term) && (STRSTARTS(str(?term), "http://purl.obolibrary.org/obo/PCL_")))
+}
+
adaptable for removing annotations (here: labels of object properties) in a particular namespace
+# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+
+# removing triples
+DELETE {
+ ?s ?p ?o
+}
+WHERE
+{
+ {
+ # the variable p must be a rdfs:label
+ VALUES ?p {
+ rdfs:label
+ }
+ # the variable s is an object property
+ ?s a owl:ObjectProperty ;
+ # the other variables can be anything else (note the above value restriction of p)
+ ?p ?o
+ # keep only subjects whose IRI starts with "http://purl.obolibrary.org/obo/RO_" (these are the triples that will be deleted)
+ FILTER (isIRI(?s) && STRSTARTS(str(?s), "http://purl.obolibrary.org/obo/RO_"))
+ }
+}
+
# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+# note: the original snippet did not declare the synonym_type: and source: prefixes used below;
+# the expansions here are assumed (oboInOwl annotation properties) and may need adjusting
+prefix synonym_type: <http://www.geneontology.org/formats/oboInOwl#hasSynonymType>
+prefix source: <http://www.geneontology.org/formats/oboInOwl#source>
+
+# delete triples
+DELETE {
+ ?anno ?property ?value .
+}
+WHERE {
+ # the variable property is either synonym_type: or source:
+ VALUES ?property { synonym_type: source: }
+ # structure of variable value and variable anno
+ ?anno a owl:Axiom ;
+ owl:annotatedSource ?s ;
+ owl:annotatedProperty ?p ;
+ owl:annotatedTarget ?o ;
+ ?property ?value .
+ # keep only values which start with "ICD10EXP:" (these are the annotations that will be deleted)
+ FILTER(STRSTARTS(STR(?value),"ICD10EXP:"))
+}
+
adaptable for replacing annotation properties on particular axioms
+# adding prefixes used
+prefix owl: <http://www.w3.org/2002/07/owl#>
+prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+
+# delete triples where the relation is oboInOwl:source
+DELETE {
+ ?ax oboInOwl:source ?source .
+}
+# insert triples where the variables ax and source defined above are used, but using oboInOwl:hasDbXref instead
+INSERT {
+ ?ax oboInOwl:hasDbXref ?source .
+}
+WHERE
+{
+ # restricting to triples where the property variable is in this list
+ VALUES ?property { oboInOwl:hasExactSynonym oboInOwl:hasNarrowSynonym oboInOwl:hasBroadSynonym oboInOwl:hasCloseSynonym oboInOwl:hasRelatedSynonym } .
+ # order of the variables in the triple
+ ?entity ?property ?value .
+ # structure on which the variable ax and source applies
+ ?ax rdf:type owl:Axiom ;
+ owl:annotatedSource ?entity ;
+ owl:annotatedTarget ?value ;
+ owl:annotatedProperty ?property ;
+ oboInOwl:source ?source .
+ # keep only triples where the variable entity is an IRI
+ FILTER (isIRI(?entity))
+}
+
A synonym indicates an alternative name for a term. Terms can have multiple synonyms.
The definition of the synonym is exactly the same as the primary term definition. This is used when the same class can have more than one name.
For example, hereditary Wilms' tumor has the exact synonym familial Wilms' tumor.
Additionally, translations into other languages are listed as exact synonyms. For example, the Plant Ontology lists both Spanish and Japanese translations as exact synonyms; e.g. anther wall has the exact synonym ‘pared de la antera’ (Spanish) and ‘葯壁’ (Japanese).
+The definition of the synonym is the same as the primary definition, but has additional qualifiers.
+For example, pod is a narrow synonym of fruit.
+Note - when adding a narrow synonym, please first consider whether a new subclass should be added instead of a narrow synonym. If there is any uncertainty, start a discussion on the GitHub issue tracker.
+The primary definition accurately describes the synonym, but the definition of the synonym may encompass other structures as well. In some cases where a broad synonym is given, it will be a broad synonym for more than one ontology term.
+For example, Cyst of eyelid has the broad synonym Lesion of the eyelid.
+Note - when adding a broad synonym, please first consider whether a new superclass should be added instead of a broad synonym. If there is any uncertainty, start a discussion on the GitHub issue tracker.
This scope is applied when a word or phrase has been used synonymously with the primary term name in the literature, but the usage is not strictly correct. That is, the synonym in fact has a slightly different meaning than the primary term name. Since users may not be aware that the synonym was being used incorrectly when searching for a term, related synonyms are included.
+For example, Autistic behavior has the related synonym Autism spectrum disorder.
Synonyms can also be classified by types. The default is no type. The synonym types vary between ontologies, but some commonly used synonym types include abbreviations, acronyms, plural forms and layperson terms.
+Whenever possible, database cross-references (dbxrefs) for synonyms should be provided, to indicate the publication that used the synonym. References to PubMed IDs should be in the format PMID:XXXXXXX (no space). However, dbxrefs for synonyms are not mandatory in most ontologies.
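As an illustration (a sketch in the style of the SPARQL templates earlier in this document, not a required workflow), the following query lists exact synonyms together with the cross-references recorded on the synonym axiom:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

# list exact synonyms together with the dbxrefs (e.g. PMID:XXXXXXX) on the synonym axiom
SELECT ?term ?synonym ?xref
WHERE {
  ?axiom a owl:Axiom ;
         owl:annotatedSource ?term ;
         owl:annotatedProperty oboInOwl:hasExactSynonym ;
         owl:annotatedTarget ?synonym ;
         oboInOwl:hasDbXref ?xref .
}
ORDER BY ?term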
+ + + + + + +Tables and triples seem very different. +Tables are familiar and predictable. +Triples are weird and floppy. +SQL is normal, SPARQL is bizarre, at least at first.
+Tables are great, and they're the right tool for a lot of jobs, +but they have their limitations. +Triples shine when it comes to merging heterogeneous data. +But it turns out that there's a clear path from tables to triples, +which should help make RDF make more sense.
+Tables are great! +Here's a table!
+first_name | +last_name | +
---|---|
Luke | +Skywalker | +
Leia | +Organa | +
Darth | +Vader | +
Han | +Solo | +
You won't be surprised to find out +that tables have rows and columns. +Often each row corresponds to some thing +that we want to talk about, +such as a fictional character from Star Wars. +Each column usually corresponds to some sort of property +that those things might have. +Then the cells contain the values of those properties +for their respective row. +We take some sort of complex information about the world, +and we break it down along two dimensions: +the things (rows) and their properties (columns).
+Tables are great! +We can add another name to our table:
+first_name | +last_name | +
---|---|
Luke | +Skywalker | +
Leia | +Organa | +
Darth | +Vader | +
Han | +Solo | +
Anakin | +Skywalker | +
Hmm. +That's a perfectly good table, +but it's not capturing the information that we wanted. +It turns out (Spoiler Alert!) that Anakin Skywalker is Darth Vader! +We might have thought that the rows of our table +were describing individual people, +but it turns out that they're just describing individual names. +A person can change their name +or have more than one name.
+We want some sort of identifier +that lets us pick out the same person, +and distinguish them from all the other people. +Sometimes there's a "natural key" that we can use for this purpose: +some bit of information that uniquely identifies a thing. +When we don't have a natural key, we can generate an "artificial key". +Random strings and number can be good artificial keys, +but sometimes a simple incrementing integer is good enough.
+The main problem with artificial keys +is that it's our job to maintain the link +between the thing and the identifier that we gave it. +We prefer natural keys because we just have to inspect that thing +(in some way) +to figure out what to call it. +Even when it's possible, +sometimes that's too much work. +Maybe we could use a DNA sequence as a natural key for a person, +but it probably isn't practical. +We do use fingerprints and facial recognition, +for similar things, though.
+(Do people in Star Wars even have DNA? +Or just midichlorions?)
+Let's add a column with an artificial key to our table:
+sw_id | +first_name | +last_name | +
---|---|---|
1 | +Luke | +Skywalker | +
2 | +Leia | +Organa | +
3 | +Darth | +Vader | +
4 | +Han | +Solo | +
3 | +Anakin | +Skywalker | +
This is our table of names, +allowing a given person to have multiple names. +But what we thought we wanted was a person table +with one row for each person, like this:
+sw_id | +first_name | +last_name | +
---|---|---|
1 | +Luke | +Skywalker | +
2 | +Leia | +Organa | +
3 | +Darth | +Vader | +
4 | +Han | +Solo | +
In SQL we could assert that the "sw_id" column of the person table +is a PRIMARY KEY. +This means it must be unique. +(It probably shouldn't be NULL either!)
+The names in the person table could be the primary names +that we use in our Star Wars database system, +and we could have another alternative_name table:
+sw_id | +first_name | +last_name | +
---|---|---|
3 | +Anakin | +Skywalker | +
Tables are great! +We can add more columns to our person table:
+sw_id | +first_name | +last_name | +occupation | +
---|---|---|---|
1 | +Luke | +Skywalker | +Jedi | +
2 | +Leia | +Organa | +princess | +
3 | +Darth | +Vader | ++ |
4 | +Han | +Solo | +scoundrel | +
The 2D pattern of a table is a strong one. +It not only provides a "slot" (cell) +for every combination of row and column, +it also makes it very obvious when one of those slots is empty. +What does it mean for a slot to be empty? +It could mean many things.
+For example, in the previous table +in the row for Darth Vader, +the cell for the "occupation" column is empty. +This could mean that:
+I'm sure I haven't captured all the possibilities. +The point is that there's lot of possible reasons +why a cell would be blank. +So what can we do about it?
+If our table is stored in a SQL database, +then we have the option of putting a NULL value in the cell. +NULL is pretty strange. +It isn't TRUE and it isn't FALSE. +Usually NULL values are excluded from SQL query results +unless you are careful to ask for them.
+The way that NULL works in SQL eliminates some of the possibilities above. +SQL uses the "closed-world assumption", +which is the assumption that if a statement is true then it's known to be true, +and conversely that if it's not known to be true then it's false. +So if Anakin's occupation is NULL in a SQL database, +then as far as SQL is concerned, +we must know that he doesn't have an occupation. +That might not be what you were expecting!
+The Software Carpentry module on +Missing Data +has more information.
+Tables are great! +Let's add even more information to our table:
+sw_id | +first_name | +last_name | +occupation | +enemy | +
---|---|---|---|---|
1 | +Luke | +Skywalker | +Jedi | +3 | +
2 | +Leia | +Organa | +princess | +3 | +
3 | +Darth | +Vader | ++ | 1,2,4 | +
4 | +Han | +Solo | +scoundrel | +3 | +
We're trying to say that Darth Vader is the enemy of everybody else in our table. +We're using the primary key of the person in the enemy column, which is good, +but we've ended up with multiple values in the "enemy" column +for Darth Vader.
+In any table or SQL database you could +make the "enemy" column a string, +pick a delimiter such as the comma, +and concatenate your values into a comma-separated list. +This works, but not very well.
+In some SQL databases, such as Postgres, +you could given the "enemy" column an array type, +so it can contain multiple values. +You get special operators for querying inside arrays. +This can work pretty well.
+The usual advice is to break this "one to many" information +into a new "enemy" table:
+sw_id | +enemy | +
---|---|
1 | +3 | +
2 | +3 | +
3 | +1 | +
3 | +2 | +
3 | +4 | +
4 | +1 | +
Then you can JOIN the person table to the enemy table as needed.
+Tables are great! +Let's add even more information to our table:
+sw_id | +first_name | +last_name | +occupation | +father | +lightsaber_color | +ship | +
---|---|---|---|---|---|---|
1 | +Luke | +Skywalker | +Jedi | +3 | +green | ++ |
2 | +Leia | +Organa | +princess | +3 | ++ | + |
3 | +Darth | +Vader | ++ | + | red | ++ |
4 | +Han | +Solo | +scoundrel | ++ | + | Millennium Falcon | +
A bunch of these columns only apply to a few rows. +Now we've got a lot more NULLs to deal with. +As the number of columns increases, +this can become a problem.
+Tables are great! +If sparse tables are a problem, +then let's try to apply the same solution +that worked for the "many to one" problem in the previous section.
+name table:
+sw_id | +first_name | +last_name | +
---|---|---|
1 | +Luke | +Skywalker | +
2 | +Leia | +Organa | +
3 | +Darth | +Vader | +
4 | +Han | +Solo | +
3 | +Anakin | +Skywalker | +
occupation table:
+sw_id | +occupation | +
---|---|
1 | +Jedi | +
2 | +princess | +
4 | +scoundrel | +
enemy table:
+sw_id | +enemy | +
---|---|
1 | +3 | +
2 | +3 | +
3 | +1 | +
3 | +2 | +
3 | +4 | +
4 | +1 | +
father table:
+sw_id | +father | +
---|---|
1 | +3 | +
2 | +3 | +
lightsaber_color table:
+sw_id | +lightsaber_color | +
---|---|
1 | +green | +
3 | +red | +
ship table:
+sw_id | +ship | +
---|---|
4 | +Millennium Falcon | +
Hmm. +Yeah, that will work. +But every query we write will need some JOINs. +It feels like we've lost something.
+Tables are great! +But there's such a thing as too many tables. +We started out with a table +with a bunch of rows and a bunch of columns, +and ended up with a bunch of tables +with a bunch of rows but just a few columns.
+I have a brilliant idea! +Let's combine all these property tables into just one table, +by adding a "property" column!
+sw_id | +property | +value | +
---|---|---|
1 | +first_name | +Luke | +
2 | +first_name | +Leia | +
3 | +first_name | +Darth | +
4 | +first_name | +Han | +
3 | +first_name | +Anakin | +
1 | +last_name | +Skywalker | +
2 | +last_name | +Organa | +
3 | +last_name | +Vader | +
4 | +last_name | +Solo | +
3 | +last_name | +Skywalker | +
1 | +occupation | +Jedi | +
2 | +occupation | +princess | +
4 | +occupation | +scoundrel | +
1 | +enemy | +3 | +
2 | +enemy | +3 | +
3 | +enemy | +1 | +
3 | +enemy | +2 | +
3 | +enemy | +4 | +
4 | +enemy | +1 | +
1 | +father | +3 | +
2 | +father | +3 | +
1 | +lightsaber_color | +green | +
3 | +lightsaber_color | +red | +
4 | +ship | +Millennium Falcon | +
It turns out that I'm not the first one to think of this idea. +People call it "Entity, Attribute, Value" or "EAV". +People also call it an "anti-pattern", +in other words: a clear sign that you've made a terrible mistake.
+There are lots of circumstances in which +one big, extremely generic table is a bad idea. +First of all, you can't do very much +with the datatypes for the property and value columns. +They kind of have to be strings. +It's potentially difficult to index. +And tables like this are miserable to query, +because you end up with all sorts of self-joins to handle.
+But there's at least one use case where it turns out to work quite well...
+Tables are great! +Until they're not.
+The strong row and column structure of tables +makes them great for lots of things, +but not so great for merging data from different sources. +Before you can merge two tables +you need to know all about:
+So you need to know the schemas of the two tables +before you can start merging them together. +But if you happen to have two EAV tables then, +as luck would have it, +they already have the same schema!
+You also need to know that you're talking about the same things: +the rows have to be about the same things, +you need to be using the same property names for the same things, +and the cell values also need to line up. +If only there was an open standard for specifying globally unique identifiers...
+Yes, you guessed it: URLs (and URNs and URIs and IRIs)! +Let's assume that we use the same URLs for the same things +across the two tables. +Since we're a close-knit community, +we've come to an agreement on a Star Wars data vocabulary.
+URLs are annoyingly long to use in databases, +so let's use standard "sw" prefix to shorten them. +Now we have table 1:
+sw_id | +property | +value | +
---|---|---|
sw:1 | +sw:first_name | +Luke | +
sw:2 | +sw:first_name | +Leia | +
sw:3 | +sw:first_name | +Darth | +
sw:4 | +sw:first_name | +Han | +
sw:3 | +sw:first_name | +Anakin | +
sw:1 | +sw:last_name | +Skywalker | +
sw:2 | +sw:last_name | +Organa | +
sw:3 | +sw:last_name | +Vader | +
sw:4 | +sw:last_name | +Solo | +
sw:3 | +sw:last_name | +Skywalker | +
sw:1 | +sw:occupation | +sw:Jedi | +
sw:2 | +sw:occupation | +sw:princess | +
sw:4 | +sw:occupation | +sw:scoundrel | +
and table 2:
+sw_id | +property | +value | +
---|---|---|
sw:1 | +sw:enemy | +sw:3 | +
sw:2 | +sw:enemy | +sw:3 | +
sw:3 | +sw:enemy | +sw:1 | +
sw:3 | +sw:enemy | +sw:2 | +
sw:3 | +sw:enemy | +sw:4 | +
sw:4 | +sw:enemy | +sw:1 | +
sw:1 | +sw:father | +sw:3 | +
sw:2 | +sw:father | +sw:3 | +
sw:1 | +sw:lightsaber_color | +green | +
sw:3 | +sw:lightsaber_color | +red | +
sw:4 | +sw:ship | +Millennium Falcon | +
To merge these two tables, we simple concatenate them. +It couldn't be simpler.
+Wait, this looks kinda familiar...
+These tables are pretty much in RDF format. +You just have to squint a little!
+Each row of the table is a subject-predicate-object triple. +Our subjects, predicates, and some objects are URLs. +We also have some literal objects. +We could turn this table directly into Turtle format +with a little SQL magic +(basically just concatenating strings):
+SELECT "@prefix sw: <http://example.com/sw_> ."
+UNION ALL
+SELECT ""
+UNION ALL
+SELECT
+ sw_id
+|| " "
+|| property
+|| " "
+|| IF(
+ INSTR(value, ":"),
+ value, -- CURIE
+ """" || value || """" -- literal
+ )
+|| " ."
+FROM triple_table;
+
The first few lines will look like this:
+@prefix sw: <http://example.com/sw_> .
+
+sw:1 sw:first_name "Luke" .
+sw:2 sw:first_name "Leia" .
+sw:3 sw:first_name "Darth" .
+sw:4 sw:first_name "Han" .
+
Two things we're missing from RDF are +language tagged literals and typed literals. +We also haven't used any blank nodes in our triple table. +These are easy enough to add.
+The biggest thing that's different about RDF +is that it uses the "open-world assumption", +so something may be true even though +we don't have a triple asserting that it's true. +The open-world assumption is a better fit +than the closed-world assumption +when we're integrating data on the Web.
+Tables are great! +We use them all the time, +they're strong and rigid, +and we're comfortable with them.
+RDF, on the other hand, looks strange at first. +For most common data processing, +RDF is too flexible. +But sometimes flexiblity is the most important thing.
+The greatest strength of tables is their rigid structure, +but that's also their greatest weakness. +We saw a number of problems with tables, +and how they could be overcome +by breaking tables apart into smaller tables, +until we got down to the most basic pattern: +subject-predicate-object. +Step by step, we were pushed toward RDF.
+Merging tables is particularly painful. +When working with data on the Web, +merging is one of the most common and important operations, +and so it makes sense to use RDF for these tasks. +If self-joins with SQL is the worst problem for EAV tables, +then SPARQL solves it.
+These examples show that it's not really very hard +to convert tables to triples. +And once you've seen SPARQL, the RDF query language, +you've seen one good way to convert triples to tables: +SPARQL SELECT results are just tables!
+Since it's straightforward to convert +tables to triples and back again, +make sure to use the right tool for the right job. +When you need to merge heterogeneous data, reach for triples. +For most other data processing tasks, use tables. +They're great!
+ + + + + + +Learn common mistakes when using ROBOT and how to troubleshoot and fix them.
+Optional.get() cannot be called on an absent value
Use the -vvv option to show the stack trace.
Use the --help option to see usage information
make: *** [mondo.Makefile:454: merge_template] Error 1
On Wikidata the following licenses applies:
+"All structured data from the main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License"
+Adding non-CC0 licensed OBO ontologies in full might be problematic due to +* License stacking
+IANL, but my understanding is that as long as only URI mappings are created to OBO ontology terms no licenses are breached (even if the ontology is not CC0)
+Welcome to the OBOOK and our OBO Semantic Engineering Training!
"},{"location":"#introduction-to-the-obook-open-biological-and-biomedical-ontologies-organized-knowledge","title":"Introduction to the OBOOK (Open Biological and Biomedical Ontologies Organized Knowledge)","text":"Documentation in the OBOOK is organised into 4 distinct sections based on the Di\u00e1taxis framework of documentation:
To accommodate for the various training use cases we support, we added the following categories:
Note: We are in the process of transforming the documentation accordingly, so please be patient if some of the documentation is not yet in the correct place. Feel free to create an issue if you find something that you suspect isn't in place.
"},{"location":"#editorial-team","title":"Editorial Team","text":"If you would like to contribute to this training, please find out more here.
"},{"location":"#content","title":"Content","text":"Critical Path Institute (CPI) is an independent, nonprofit organisation dedicated to bringing together experts from regulatory agencies, industry and academia to collaborate and improve the medical product development process.
In April 2021, the CPI has commissioned the first version of this OBO course, contributing not only funding for the preparation and delivery of the materials, but also valuable feedback about the course contents and data for the practical exercises. We thank the CPI for contributing significantly to the OBO community and open science!
https://c-path.org/
"},{"location":"contributing/","title":"Contributing to OBO Semantic Engineering Tutorials","text":"We rely on our readers to correct our materials and add to them - the hope is to centralise all the usual teaching materials for OBO ontology professionals in one place. Feel free to:
#obo-training
channel) to ask any questions (you can request access on the issue tracker)The OBOOK is trying to centralise all OBO documentation in one place. It is, and will be, a big construction site, for years to come. The goal is to iterate and make things better.
We follow two philosophies:
There are three main consequences to this:
We just introduced a new concept to OBOOK called pathways
. The idea is that we provide a linear guide for all 6 roles mentioned on the getting started page through the materials. This will help us also complete the materials and provide a good path to reviewing them regularly.
A step-by-step guide to complete a well-defined mini-project. Examples: ROBOT template tutorial. DOSDP template tutorial. Protege tutorial on using the reasoner.
"},{"location":"getting-started-obook/#lesson","title":"Lesson","text":"A collection of materials (tutorials, explanations and how-to-guides) that together seek to teach a well defined concept. Examples: Contributing to OBO ontologies; An Introduction to templates in OBO; An Introduction to OBO Application development. While the distinction to \"tutorial\" is often fuzzy, the main distinguishing feature should be that a lesson conveys a general concept independent of some concrete technology stack. While we use concrete examples in lessons, we do always seek to generalise to problem space.
"},{"location":"getting-started-obook/#course","title":"Course","text":"A convenience content type that allows us to assemble materials from obook for a specific taught unit, such as the yearly ICBO tutorials, or the ongoing Monarch Ontology Tutorials and others. Course pages serve as go-to-pages for course participants and link to all the relevant materials in the documentation. Course usually comprise lessons, tutorials and how-to guides.
"},{"location":"getting-started-obook/#pathways","title":"Pathways","text":"A pathway is a kind of course, but without the expectation that it is ever taught in a concrete setting. A pathways pertains to a single concrete role (Ontology Curator, Pipeline Developer etc). It is a collection of materials (lessons, tutorials, how-to-guides) that is ordered in a linear fashion for the convenience of the student. For example, we are developing a pathway for ontology pipeline developers that start by teaching common concepts such as how to make term requests etc, and then go into depth on ROBOT pipelines, ODK and Make.
"},{"location":"getting-started-obook/#best-practices","title":"Best practices:","text":"Before you start with the lessons of this course, keep the following in mind:
There are a wide variety of entry points into the OBO world, for example:
"},{"location":"getting-started/#database-curator-you-are","title":"Database Curator: You are","text":"make
and ROBOT
Of course, many of you will occupy more than one of the above \"hats\" or roles. While they all require specialised training, many shared skill requirements exist. This course is being developed to:
make
Description: add here
"},{"location":"config/template/#learning-objectives","title":"Learning objectives","text":"Wednesday, September 15, 2021
The goal of this tutorial is to provide a flavor of the OBO landscape, from the OBO Foundry organization to the ontology curators and OBO engineers that are doing the daily ontology development.
"},{"location":"courses/icbo2021/#organizers","title":"Organizers","text":"September 26, 2022, 9:00 am \u2013 12:30 pm ET
We'd love any feedback on this tutorial via this short survey.
"},{"location":"courses/icbo2022/#overview","title":"Overview","text":"The Open Biological and Biomedical Ontologies (OBO) community includes hundreds of open source scientific ontology projects, committed to shared principles and practices for interoperability and FAIR data. An OBO tutorial has been a regular feature of ICBO for a decade, introducing new and experienced ontology users and developers to ontologies in general, and to current OBO tools and techniques specifically. While ICBO attracts many ontology experts, it also includes an audience of ontology beginners, and of ontology users looking to become ontology developers or to further refine their skills. Our OBO tutorial will help beginner and intermediate ontology users with a combination of theory and hands-on practice.
For ICBO 2022 we will host a half-day OBO tutorial consisting of two parts, with a unifying theme of ontology term reuse.
The first part of our tutorial will be introductory, aimed at an audience that is new to ontologies and to the OBO Foundry. We will introduce OBO, its community, principles, resources, and best practices. We will finish the first part with a hands-on lesson in basic tools: ontology browsers, how to contribute to ontologies via GitHub (creating issues and making Pull Requests), and the Protege ontology editor.
The second part will build on the first, addressing an audience that is familiar with ontologies and OBO, and wants to make better use of OBO workflows and tools in their own projects. The focus will be on making best use of OBO community open source software. We will introduce ROBOT, the command-line tool and library for automating ontology development tasks. We will show how the Ontology Development Kit (ODK) is used to standardize ontology projects with a wide range of best practices. The special emphasis of this year's tutorial will be ontology reuse, and specifically on how ROBOT and ODK can be used to manage imports from other ontologies and overcome a number of challenges to term reuse.
This material for this year's OBO Tutorial will build on the content here in the OBO Academy. The OBO Academy offers free, open, online resources with self paced learning materials covering various aspects of ontology development and curation and OBO. Participants are encouraged to continue their learning using this OBO Academy website, and contribute to improving the OBO documentation.
As an outcome of this workshop, we expect that new ontologists will have a clearer understanding of why we need and use ontologies, how to find ontology terms and contribute to ontologies and make basic edits using Protege. Our more advanced participants should be able to apply OBO tools and workflows to their own ontology development practices.
"},{"location":"courses/icbo2022/#organizers","title":"Organizers","text":"Instructor: Nicole Vasilevsky
"},{"location":"courses/icbo2022/#outline","title":"Outline","text":"Example: We will work on this ticket.
"},{"location":"courses/icbo2023/","title":"ICBO OBO Tutorial 2023: Using and Reusing Ontologies","text":"Conference website: https://icbo-conference.github.io/icbo2023/
ICBO Workshops details: https://www.icbo2023.ncor-brasil.org/program.html#workshops
Date: August 28, 2023 13:30-15:00 (Part 1) and 15:30-15:45 (Part 2)
The Open Biological and Biomedical Ontologies (OBO) community includes hundreds of open source scientific ontology projects, committed to shared principles and practices for interoperability and FAIR data. An OBO tutorial has been a regular feature of ICBO for a decade, introducing new and experienced ontology users and developers to ontologies in general, and to current OBO tools and techniques specifically. While ICBO attracts many ontology experts, it also includes an audience of ontology beginners, and of ontology users looking to become ontology developers or to further refine their skills. Our OBO tutorial will help beginner and intermediate ontology users with a combination of theory and hands-on practice.
For ICBO 2023 we will host a half-day OBO tutorial consisting of two parts.
The first part of our tutorial will be introductory, aimed at an audience that is new to ontologies and to the OBO Foundry. We will introduce OBO, its community, principles, resources, and best practices. We will finish the first part with a hands-on lesson in basic tools: ontology browsers, how to contribute to ontologies via GitHub (creating issues and making Pull Requests), and the Protege ontology editor.
The second part will build on the first, addressing an audience that is familiar with ontologies and OBO, and wants to make better use of OBO workflows and tools in their own projects.
This material for this year's OBO Tutorial will build on the content here in the OBO Academy. The OBO Academy offers free, open, online resources with self paced learning materials covering various aspects of ontology development and curation and OBO. Participants are encouraged to continue their learning using this OBO Academy website, and contribute to improving the OBO documentation.
"},{"location":"courses/icbo2023/#organizers","title":"Organizers","text":"The tutorial is designed to be 'show and tell' format, but you are welcome to install the following software on your machine in advance, if you'd like to follow along in real time:
The goal of this course is to provide ongoing training for the OBO community. As with previous tutorials, we follow the flipped classroom concept: as organisers, we provide you with materials to look at, and you will work through the materials on your own. During our biweekly meeting, we will answer your questions, provide you with additional demonstrations where needed and go into depth wherever you as a student are curious to learn more. This means that this course can only work if you are actually putting in the time to preparing the materials. That said, we nevertheless welcome anyone to just lurk or ask related questions.
"},{"location":"courses/monarch-obo-training/#you-students","title":"You (Students)","text":"Note: this is tentative and subject to change
Date Lesson Notes Recordings 2023/10/03 Units modelling in and around OBO James Overton 2023/09/19 Improving ontology interoperability with Biomappings Charlie Hoyt 2023/09/05 Modern prefix management with Bioregistry andcuries
Charlie Hoyt 2023/08/22 How to determine if two entities are the same? Nico (subject open for debate) 2023/08/08 OMOP Vocabulary Management - how does it work and how does it differ from the OBO way of thinking? Nico (subject open for debate) July 2023 Cancelled: Summer break 2023/06/27 Cancelled 2023/06/13 Modelling with Subclass and Equivalent class statements Tutorial by Henriette Harmse slides 2023/05/30 First steps with ChatGPT for semantic engineers and curators Led by Sierra Moxon and Nico Matentzoglu N/A 2023/05/16 Cancelled (Monarch/C-Path workshop) 2023/05/02 Cancelled (No meeting week) 2023/04/18 Overview of Protege 5.6 - the latest features Tutorial by Damien Goutte-Gattat (slides) Here 2023/04/04 Introduction to Exomiser Tutorial by Valentina, Yasemin and Carlo from QMUL. Here 2023/03/21 Introduction to Wikidata Tutorial by experts in the field Andra Waagmeester and Tiago Lubiana Here 2023/03/07 OAK for the Ontology Engineering community Tutorial by Chris Mungall Here 2023/02/21 OBO Academy Clinic Bring your ontology issues and questions to discuss with Sabrina and Nico! Attend the Ontology Summit Seminars instead! 2023/02/07 Querying the Monarch KG using Neo4J Tutorial by Kevin Schaper Here 2023/01/24 OBO Academy Clinic Bring your ontology issues and questions to discuss with Sabrina and Nico! 2023/01/10 Modeling with taxon constraints Tutorial by Jim Balhoff Here 2022/12/27 No Meeting Enjoy the Holidays! 2022/12/13 Introduction to Semantic Entity Matching Slides Here 2022/11/29 OBO Academy hackathon Work on open tickets together. 2022/11/15 Contributing to OBO ontologies - Part 2 Here 2022/11/01 Contributing to OBO ontologies - Part 1 Here 2022/10/18 Introduction to Medical Action Ontology (MAxO) Here 2022/10/04 No meeting - ISB virtual conference: register here 2022/09/20 How to be an open science ontologist Here 2022/09/06 Pull Requests: Part 2 Here 2022/07/26 Pull Requests: Part 1 Here 2022/07/12 Basic introduction to the CLI: Part 2 Due to intermitent connection issues, the first few minutes of this recording are not included. Refer to the Tutorial link for the initial directions. Here 2022/06/28 Basic introduction to the CLI: Part 1 Here 2022/06/14 Application/project ontologies Here 2022/05/31 Contributing to ontologies: annotation properties Here 2022/05/17 Introduction to managing mappings with SSSOM Here 2022/05/03 No meeting 2022/04/19 Disjointness and Unsatisfiability Here 2022/04/05 No meeting 2022/03/22 Creating an ontology from scratch Here 2022/03/08 Obsoletions in OBO ontologies Review Obsoleting an Existing Ontology Term and Merging Ontology Terms. Slides are here. Here 2022/02/22 SPARQL for OBO ontology development Here 2022/02/07 ODK/DOSDPs Here 2022/01/25 Contributing to OBO ontologies This is not new content but we'll start at the beginning again with our previous lessons. Here 2022/01/11 Office hours with Nicole and Sabrina - no formal lesson Bring any open questions. 2021/12/14 Lessons learned from troubleshooting ROBOT Open discussion, no advance preparation is needed. 
2021/11/30 Semantics of object properties (including Relations Ontology) 2021/11/16 SPARQL for OBO ontology development Here 2021/11/02 Templating: DOSDPs and ROBOT 2021/10/19 Ontology Design 2021/10/05 Cancelled due to overlap with ISB conference 2021/09/21 Ontology Pipelines with ROBOT 2 2021/09/08 Migrating legacy ontology systems to ODK 2021/09/07 Ontology Pipelines with ROBOT 2021/09/01 Manage dynamic imports the ODK 2021/08/25 Ontology Release Management with the ODK Here 2021/08/24 Contributing to OBO ontologies 2 Here 2021/08/17 Contributing to OBO ontologies"},{"location":"courses/monarch-obo-training/#notes","title":"Notes","text":"Most of materials used by this course were developed by James Overton, Becky Jackson, Nicole Vasilevsky and Nico Matentzoglu as part of a project with the Critical Path Institute (see here). The materials are improved as part of an internal training program (onboarding and CPD) for the Phenomics First project (NIH / NHGRI #1RM1HG010860-01).
Thanks to Sarah Gehrke for her help with project management.
"},{"location":"courses/ontology-summit-2023/","title":"Ontology Summit 2023","text":"This course unit only covers the OBO part of the Ontology Summit 2023, for a full overview see https://ontologforum.org/index.php/OntologySummit2023.
"},{"location":"courses/ontology-summit-2023/#goal","title":"Goal","text":"Giving a broad overview of the key OBO methodologies and tools to the general ontology community.
"},{"location":"courses/ontology-summit-2023/#tutors","title":"Tutors","text":"Editors: Sabrina Toro (@sabrinatoro), Nicolas Matentzoglu (@matentzn) Examples with images can be found here.
"},{"location":"explanation/annotation-properties/#what-are-annotation-properties","title":"What are annotation properties?","text":"An entity such as an individual, a class, or a property can have annotations, such as labels, synonyms and definitions. An annotation property is used to link the entity to a value, which in turn can be anything from a literal (a string, number, date etc) to another entity (such as, another class).
Here are some examples of frequently used annotation properties: (every element in bold is an annotation property)
http://purl.obolibrary.org/obo/MONDO_0004975
NCIT:C2866
Annotation properties have their own IRIs, just like classes and individuals. For example, the IRI of the RDFS built in label property is http://www.w3.org/2000/01/rdf-schema#label. Other examples:
Annotation properties are just like other entities (classes, individuals) and can have their own annotations. For example, the annotation propert http://purl.obolibrary.org/obo/IAO_0000232 has an rdfs:label ('curator note') and a human readable definition (IAO:0000115): 'An administrative note of use for a curator but of no use for a user'.
Annotation properties can be organised in a hierarchical structure.
For example, the annotation property 'synonym_type_property' (http://www.geneontology.org/formats/oboInOwl#SynonymTypeProperty) is the parent property of other, more specific ones (such as \"abbreviation\").
Annotation properties are (usually) used with specific type of annotation values.
*
Note: the type of annotation required for an annotation property can be defined by adding a Range + \"select datatype\" in the Annotation Property's Description e.g. : 'scheduled for obsoletion on or after' (http://purl.obolibrary.org/obo/IAO_0006012)
Some annotation properties look like data properties (connecting an entity to a literal value) and others look like object properties (connecting an entity to another entity). Other than the fact that statements involving data and object properties look very different in RDF, the key difference from a user perspective is that OWL Reasoners entirely ignore triples involving annotation properties. Data and Object Properties are taken into account by the reasoner.
Object properties are different to annotation properties in that they:
Data properties are different to annotation properties in that they:
Boomer
as all people born between 1946 and 1964. If an individual would be asserted to be a Boomer, but is born earlier than 1946, the reasoner would file a complaint.Note: before creating a new annotation property, it is always a good idea to check for an existing annotation property first.
Detailed explanations for adding a new annotation property can be found here
"},{"location":"explanation/annotation-properties/#the-term-annotation-in-ontologies-and-data-curation-means-different-things","title":"The term \"Annotation\" in Ontologies and Data Curation means different things.","text":"The word \"annotation\" is used in different contexts to mean different things. For instance, \"annotation in owl\" (ie annotations to an ontology term) is different from \"annotation in the biocuration sense\" (ie gene-to-disease, gene-to-phenotype, gene-to-function annotations). It is therefore crucial to give context when using the word \"annotation\".
"},{"location":"explanation/existential-restrictions/","title":"Existential restrictions","text":""},{"location":"explanation/existential-restrictions/#prerequesites","title":"Prerequesites","text":"SubClassOf
vs EquivalentTo
Given
ObjectProperty: r\nClass: D\n EquivalentTo: r some C\nClass: C\n
the semantics of r some C
is the set of individuals such that for each individual x
there is at least 1 individual y
of type C
that is linked to x
via the object property r
.
Based on this semantics, a possible world adhering to our initial equivalence axiom may be:
In this Venn diagram we assume individuals are black dots. Thus, our world consists of 7 individuals, with only 2 classes, namely C
and D
, as well 2 object properties, namely r
and q
. In this world, D
and thus the class r some C
, consist of only 2 individuals. D
and r some C
consist of only 2 individuals because these are the only individuals linked via object property r
to at least 1 individual respectively in C
.
In the following we define a pet owner as someone that owns at least 1 pet.
ObjectProperty: owns\nClass: PetOwner\n EquivalentTo: owns some Pet\nClass: Pet\n
If we want to introduce the class DogOwner
, assuming we can only use the class Pet
and the object property owns
(assuming we have not defined PetOwner
), we could say that a dog owner is a subset of pet owners:
ObjectProperty: owns\nClass: DogOwner\n SubClassOf: owns some Pet\nClass: Pet\n
In this case we use SubClassOf
instead of EquivalentTo
because not every pet owner necessarily owns a dog. This is equivalent to stating:
ObjectProperty: owns\nClass: PetOwner\n EquivalentTo: owns some Pet\nClass: Pet\nClass: DogOwner \n SubClassOf: PetOwner\n
"},{"location":"explanation/existential-restrictions/#variations-on-existential-restrictions","title":"Variations on existential restrictions","text":""},{"location":"explanation/existential-restrictions/#unqualified-existential-restrictions","title":"Unqualified existential restrictions","text":"In the previous section we modeled a PetOwner
as owns some Pet
. In the expression owns some Pet
Pet
is referred to as the filler of owns
and more specifically we say Pet
is the owns
-filler.
The PetOwner EquivalentTo: owns some Pet
state that pet owners are those individuals that own a pet and ignore all other owns
-fillers that are not pets. How can we define arbitrary ownership?
ObjectProperty: owns\nClass: Owner\n EquivalentTo: owns some owl:Thing\n
"},{"location":"explanation/existential-restrictions/#value-restrictions","title":"Value restrictions","text":"We can base restrictions on having a relation to a specific named individual, i.e.:
Individual: UK\nObjectProperty: citizenOf\nClass: UKCitizen\n EquivalentTo: citizenOf hasValue UK\n
"},{"location":"explanation/existential-restrictions/#existential-restrictions-on-data-properties","title":"Existential restrictions on data properties","text":"This far we have only considered existential restrictions based on object properties, but it is possible to define existential restrictions based on data properties. As an example, we all expect that persons have at least 1 name. This could be expressed as follows:
DataProperty: name\nClass: Person\n SubClassOf: name some xsd:string\n
"},{"location":"explanation/existential-restrictions/#when-to-use-subclassof-vs-equivalentto-with-existential-restrictions","title":"When to use SubClassOf vs EquivalentTo with existential restrictions","text":"In our example of Person SubClassOf: name some xsd:string
, why did we use SubClassOf
rather than EquivalentTo
? That is, why did we not use Person EquivalentTo: name some xsd:string
? With using the EquivalentTo
axiom, any individual that has a name, will be inferred to be an instance of Person
. However, there are many things in the world that have names that are not persons. Some examples are pets, places, regions, etc:
Compare this with, for example, DogOwner
:
ObjectProperty: owns\nClass: Dog\nClass: DogOwner\n EquivalentTo: owns some Dog\n
"},{"location":"explanation/intro-to-ontologies/","title":"Introduction to ontologies","text":"Based on CL editors training by David Osumi-Sutherland
"},{"location":"explanation/intro-to-ontologies/#why-do-we-need-ontologies","title":"Why do we need ontologies?","text":"We face an ever-increasing deluge of biological data analysis. Ensuring that this data and analysis are Findable, Accessible, Interoperable, and Re-usable (FAIR) is a major challenge. Findability, Interoperabiltiy, and Resuability can all be enhanced by standardising metadata. Well-standardised metadata can make it easy to find data and analyses despite variations in terminology ('Clara cell' vs 'nonciliated bronchiolar secretory cell' vs 'club cell') and precision ('bronchial epithelial cell' vs 'club cell'). Understanding which entities are referred to in metadata and how they relate to the annotated material can help users work out if the data or analysis they have found is of interest to them and can aid in its re-use and interoperability with other data and analyses. For example, does an annotation of sample data with a term for breast cancer refer to the health status of the patient from which the sample was derived or that the sample itself comes from a breast cancer tumor?
"},{"location":"explanation/intro-to-ontologies/#we-cant-find-what-were-looking-for","title":"We can't find what we're looking for","text":"Given variation in terminology and precision, annotation with free text alone is not sufficient for findability. One very lightweight solution to this problem is to rely on user-generated keyword systems, combined with some method of allowing users to choose from previously used keywords. This can produce some degree of annotation alignment but also results in fragmented annotation and varying levels of precision with no clear way to relate annotations.
For example, trying to refer to feces, in NCBI BioSample:
Query Records Feces 22,592 Faeces 1,750 Ordure 2 Dung 19 Manure 154 Excreta 153 Stool 22,756 Stool NOT faeces 21,798 Stool NOT feces 18,314"},{"location":"explanation/intro-to-ontologies/#we-dont-know-what-were-talking-about","title":"We don't know what we're talking about","text":"Terminology alone can be ambiguous. The same term may be used for completely unrelated or vaguely analogous structures. An insect femur and an mammalian femur are neither evolutionarily related nor structurally similar. Biologists often like to use abbreviations to annotate data, but these can be extremely ambiguous. Drosophila biologists use DA1 to refer to structures in the tracheal system, musculature and nervous system. Outside of Drosophila biology it is used to refer to many other things including a rare disease, and a a neuron type in C.elegans.
Some extreme examples of this ambiguity come from terminological drift in fields with a long history. For example in the male genitalia of a gasteruptiid wasp, these 5 different structures here have each been labeled \"paramere\" by different people, each studying different hymenopteran lineages. How do we know what \"paramere\" means when it is referred to?
This striking example shows that even precise context is not always sufficient for disambiguation.
"},{"location":"explanation/intro-to-ontologies/#controlled-vocabulary-cv","title":"Controlled vocabulary (CV)","text":"Rather than rely on users to generate lists of re-usable keywords, we can instead pre-specify a set of terms to use in annotation. This is usually refered to a controlled vocabulary or CV.
"},{"location":"explanation/intro-to-ontologies/#key-features","title":"Key features","text":"Any controlled vocabulary that is arranged in a hierarchy.
"},{"location":"explanation/intro-to-ontologies/#key-features_1","title":"Key features","text":"Taxonomy describes a hierarchical CV in which hierarchy equals classification. E.g., 'Merlot' is classified as a 'Red' (wine). Not all hierchical CVs are classifications. For example, anatomical atlases often have hierarchical CVs representing \"parthood\". The femur is a part of the leg, but it is not 'a leg'.
"},{"location":"explanation/intro-to-ontologies/#support-for-grouping-and-varying-levels-of-precision","title":"Support for grouping and varying levels of precision","text":"The use of a hierachical CV in which general terms group more specific terms allows for varying precision (glial cell vs some specific subtype) and simple grouping of annotated content.
For example:
"},{"location":"explanation/intro-to-ontologies/#from-hierarchical-cvs-to-ontologies","title":"From hierarchical CVs to ontologies","text":"Hierarchical CVs tend to increase in complexity in particular ways:
"},{"location":"explanation/intro-to-ontologies/#synonyms","title":"Synonyms","text":"To support findability, terms in hierarchical CVs often need to be associated with synonyms, or cross-referenced to closely related terms inside the CV.
"},{"location":"explanation/intro-to-ontologies/#polyhierarchy","title":"Polyhierarchy","text":"CV content is often driven by requests from annotators and so expansion is not driven by any unified vision of scheme. This often leads to pressure for hierarchies to support terms having multiple parents, either reflecting multiple relationship types, or multiple types of classification. For example, an anatomical CV could reasonably put 'retinal bipolar cell' under 'retina' based on location and, at the same time, under 'bipolar neuron' and 'glutamatergic neuron' based on cell type classification.
"},{"location":"explanation/intro-to-ontologies/#named-relationships","title":"Named relationships","text":"Developers of hierarchical CVs often come to realise that multiple relationship types are represented in the hierarchy and that it can be useful to name these relationship for better distinction. For example, a heart glial cell is a 'type of' glial cell, but is 'part of' the heart.
"},{"location":"explanation/intro-to-ontologies/#what-is-an-ontology","title":"What is an ontology?","text":""},{"location":"explanation/intro-to-ontologies/#definition_1","title":"Definition","text":"Definitions of ontologies can be controversial. Rather than attempting a comprehensive definition, this tutorial will emphasise ontologies as:
Terms are arranged in a classification hierarchy
Terms are defined
Terms are richly annotated:
Relationships between terms are defined, allowing logical inference and sophisticated queries as well as graph representations.
Expressed in a knowledge representation language such as RDFS, OBO, or OWL
Terminology can be ambiguous, so text definitions, references, synonyms and images are key to helping users understand the intended meaning of a term.
"},{"location":"explanation/intro-to-ontologies/#identifiers","title":"Identifiers","text":""},{"location":"explanation/intro-to-ontologies/#using-identifiers-devoid-of-intrinsic-meaning","title":"Using identifiers devoid of intrinsic meaning","text":"Identifiers that do not hold any inherent meaning are important to ontologies. If you ever need to change the names of your terms, you're going to need identifiers that stay the same when the term name changes.
For example:
A microglial cell is also known as: Hortega cell, microglia, microgliocyte and brain resident macrophage. In the Cell Ontology, however, it is referred to by a unique identifier: CL:0000129
These identifiers are short ways of referring to IRIs (e.g., CL:0000129 = http://purl.obolibrary.org/obo/CL_0000129). This IRI is a unique, resolvable identifier on the web. A group of ontologies, loosely co-ordinated through the OBO Foundry, have standardised their IRIs (e.g. http://purl.obolibrary.org/obo/CL_0000129 - a term in the Cell Ontology; http://purl.obolibrary.org/obo/cl.owl - the Cell Ontology itself)
OBO ontologies are mostly written in OWL2 or OBO syntax. The latter is a legacy format that maps completely to OWL.
For a more in-depth explanation of formats (OWL, OBO, RDF etc.) refer to explainer on OWL format variants. In the examples below we will use OWL Manchester syntax, which allows us to express formal logic in English-like sentences.
"},{"location":"explanation/intro-to-ontologies/#an-ontology-as-a-classification","title":"An ontology as a classification","text":"Ontology terms refer to classes of things in the world. For example, the class of all wings.
Below you will see a classification of parts of the insect and how it is represented in a simple ontology.
We use SubClassOf (or is_a in OBO format) to represent that one class fully subsumes another. For example: OWL: hindwing SubClassOf wing; OBO: hindwing is_a wing
In English we might say: \"a hindwing is a type of wing\" or more specifically, \"all instances of hindwing are instances of wing.\" 'Instance' here refers to a single wing of an individual fly.
In the previous section, we talked about different types of relationships. In OWL we can define specific relations (known as object properties). One of the most common is 'part of', which you can see used below.
English: all (insect) legs are part of a thoracic segment. OWL: 'leg' SubClassOf part_of some 'thoracic segment'. OBO: 'leg'; relationship: part_of thoracic segment
It might seem odd at first that OWL uses subClassOf here too. The key to understanding this is the concept of an anonymous class - in OWL, we can refer to classes without giving them names. In this case, the anonymous class is the class of all things that are 'part of' (some) 'thoracic segment' (in insects). A vast array of different anatomical structures are subclasses of this anonymous class, some of which, such as wings, legs, and spiracles, are visible in the diagram.
Note the existential quantifier some in OWL format -- it is interpreted as \"there exists\", \"there is at least one\", or \"some\". The quantifier is important to the direction of relations.
For example, 'wing' SubClassOf part_of some 'thoracic segment' is correct, whereas 'thoracic segment' SubClassOf has_part some 'wing' is incorrect, as it implies that all thoracic segments have wings as parts.
Similarly, 'claw' SubClassOf connected_to some 'tarsal segment' is correct, whereas 'tarsal segment' SubClassOf connected_to some 'claw' is incorrect, as it implies that all tarsal segments are connected to claws (whereas, for example, some tarsal segments are connected to other tarsal segments).
These relationships store knowledge in a queryable format. For more information about querying, please refer to the guides on DL queries and SPARQL queries.
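As an illustrative sketch (using the toy insect ontology above, so the labels are assumptions), a DL query is simply an OWL class expression in Manchester syntax:
part_of some 'thoracic segment'
Submitted to a running reasoner (for example via the DL Query tab in Prot\u00e9g\u00e9), this expression returns the classes that satisfy it, such as wing, leg and spiracle.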
"},{"location":"explanation/intro-to-ontologies/#scaling-ontologies","title":"Scaling Ontologies","text":"There are many ways to classify things. For example, a neuron can be classified by structure, electrophysiology, neurotransmitter, lineage, etc. Manually maintaining these multiple inheritances (that occur through multiple classifications) does not scale.
Problems with maintaining multiple inheritance classifications by hand:
- It doesn't scale: when adding a new class, how are human editors to know all of the relevant classifications to add, or how to rearrange the existing class hierarchy?
- It is bad for consistency: the reasons for existing classifications are often opaque, and it is hard to check for consistency with distant superclasses.
- It doesn't allow for querying.
The knowledge an ontology contains can be used to automate classification. For example:
English: Any sensory organ that functions in the detection of smell is an olfactory sensory organ. OWL:
'olfactory sensory organ'\n EquivalentTo 'sensory organ'\n that\n capable_of some 'detection of smell'\n
If we then have an entity nose that is subClassOf sensory organ and capable_of some detection of smell, it will be automatically classified as an olfactory sensory organ.
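A minimal sketch of this inference in OWL Manchester syntax (the nose axioms here are assumptions made for illustration):
Class: 'olfactory sensory organ'
  EquivalentTo: 'sensory organ' and (capable_of some 'detection of smell')
Class: nose
  SubClassOf: 'sensory organ', capable_of some 'detection of smell'
Running a reasoner such as ELK over these axioms infers nose SubClassOf 'olfactory sensory organ' without that classification ever being asserted.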
Many classes, especially in the domains of disease and phenotype, describe combinations of multiple classes - but it is very important to carefully distinguish whether this combination follows \"disjunctive\" logic (\"or\") or \"conjunctive\" logic (\"and\"). The two mean entirely different things. Usually, where a class has 'and' in the label, such as 'neonatal inflammatory skin and bowel disease' (MONDO:0017411), the class follows a conjunctive logic (as expected), and should be interpreted to mean that someone who presents with this disease has both neonatal inflammatory skin disease and bowel disease at once. This class should be classified as a child of 'bowel disease' and 'neonatal inflammatory skin disease'. Note, however, that naming in many ontologies is not consistent with this logic, and you need to be careful to distinguish whether the interpretation is supposed to be conjunctive or disjunctive (i.e. \"and\" could actually mean \"or\", which is especially often the case for clinical terminologies).
Having asserted multiple SubClassOf axioms means that an instance of the class is a combination of all the SubClass Of statements (conjunctive interpretation, see above). For example, if 'neonatal inflammatory skin and bowel disease' is a subclass of both 'bowel disease' and 'neonatal inflammatory skin disease', then an individual with this disease has 'bowel disease' and 'neonatal inflammatory skin disease'.
If there were a class 'neonatal inflammatory skin or bowel disease', the intention is usually that this class follows disjunctive logic. A class following this logic would be interpreted in a way that an individual with this disease has either bowel disease or neonatal inflammatory skin disease or both. It would not be accurate to classify this class as a child of bowel disease and neonatal inflammatory skin disease. This type of class is often called a \"grouping class\", and is used to aggregate related diseases in a way useful to users, like \"disease\" and \"sequelae of disease\".
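A hedged sketch of the two patterns in OWL Manchester syntax (labels follow the MONDO examples above, but the axioms are illustrative rather than quoted from MONDO):
'neonatal inflammatory skin and bowel disease' SubClassOf 'neonatal inflammatory skin disease'
'neonatal inflammatory skin and bowel disease' SubClassOf 'bowel disease'
'neonatal inflammatory skin or bowel disease' EquivalentTo 'neonatal inflammatory skin disease' or 'bowel disease'
With the conjunctive pattern the class sits below both parents; with the disjunctive grouping class the reasoner instead places both diseases below it.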
"},{"location":"explanation/intro-to-ontologies/#acknowledgements","title":"Acknowledgements","text":"This explainer requires understanding of ontology classifications. Please see \"an ontology as a classification\" section of the introduction to ontologies documentation if you are unfamiliar with these concepts.
You can watch this video for an introduction to logical descriptions.
"},{"location":"explanation/logical-axiomatization/#what-are-logical-axioms","title":"What are logical axioms","text":"Logical axioms are relational information about classes that are primarily aimed at machines. This is opposed to annotations like textual definitions which are primarily aimed at humans. These logical axioms allow reasoners to assist in and verify classification, lessening the development burden and enabling expressive queries.
"},{"location":"explanation/logical-axiomatization/#what-should-you-axiomatize","title":"What should you axiomatize?","text":"Ideally, everything in the definition should be axiomatized when possible. For example, if we consider the cell type oxytocin receptor sst GABAergic cortical interneuron
, which has the textual definition:
\"An interneuron located in the cerebral cortex that expresses the oxytocin receptor. These interneurons also express somatostatin.\"
The logical axioms should then follow accordingly:
SubClassOf:
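A hedged sketch of what those axioms might look like in OWL Manchester syntax (relation names follow the conventions used elsewhere in this guide; the exact axioms in the Cell Ontology may differ):
Class: 'oxytocin receptor sst GABAergic cortical interneuron'
  SubClassOf: interneuron,
    'has soma location' some 'cerebral cortex',
    expresses some 'oxytocin receptor',
    expresses some somatostatin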
These logical axioms allow a reasoner to automatically classify the term. For example, through the logical axioms, we can infer that oxytocin receptor sst GABAergic cortical interneuron is a cerebral cortex GABAergic interneuron.
Axiomatizing definitions well will also allow for accurate querying. For example, if I wanted to find a neuron that expresses the oxytocin receptor, having the SubClassOf axioms interneuron and expresses some 'oxytocin receptor' will allow me to do so with a DL query (see the tutorial on DL query for more information about DL queries).
Everything in the logical axioms must be true (do not axiomatize things that are true of only part of the entity). For example, the cell type chandelier pvalb GABAergic cortical interneuron is found in upper L2/3 and deep L5 of the cerebral cortex. We do not add logical axioms for 'has soma location' some layer 2/3 and layer 5: axioms with both layers would mean that a cell of that type must be in both layer 2/3 and layer 5, which is impossible (a cell cannot be in two separate locations at once!). Instead we axiomatize a more general location: 'has soma location' some 'cerebral cortex'
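A sketch of the difference (class names are illustrative; the first axiom is the kind to avoid, the second is the preferred, more general one):
Avoid: 'chandelier pvalb GABAergic cortical interneuron' SubClassOf ('has soma location' some 'cortical layer 2/3') and ('has soma location' some 'cortical layer 5')
Prefer: 'chandelier pvalb GABAergic cortical interneuron' SubClassOf 'has soma location' some 'cerebral cortex'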
An equivalent class axiom is an axiom that defines a class: it states necessary and sufficient conditions for membership of that class. This means that if a class B fulfils all the criteria/restrictions in the equivalent class axiom of class A, then B is by definition a subclass of A. Equivalent class axioms allow the reasoner to automatically classify entities.
For example:
chandelier cell has the equivalent class axiom interneuron and ('has characteristic' some 'chandelier cell morphology').
chandelier pvalb GABAergic cortical interneuron has the subclass axioms 'has characteristic' some 'chandelier cell morphology' and interneuron.
chandelier pvalb GABAergic cortical interneuron is therefore automatically classified as a subclass of chandelier cell.
Classification via equivalent class axioms can be very powerful, as the reasoner takes into consideration complex layers of axioms.
For example:
primary motor cortex pyramidal cell has the equivalent class axiom 'pyramidal neuron' and ('has soma location' some 'primary motor cortex').
Betz cell has the axioms 'has characteristic' some 'standard pyramidal morphology' and 'has soma location' some 'primary motor cortex layer 5'.
Betz cell is inferred to be a primary motor cortex pyramidal cell through the following chain, sketched below (you can also see this in Prot\u00e9g\u00e9 by pressing the ? button on the inferred class).
The ability of the reasoner to infer complex classes helps identify classifications that might have been missed if done manually. However, when creating an equivalent class axiom, you must be sure that it is neither overly constrictive (in which case classes that should be classified under it get missed) nor too loose (in which case classes get wrongly classified under it).
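A hedged sketch of that inference chain in OWL Manchester syntax (the 'pyramidal neuron' equivalence, the layer parthood axiom and the property chain over 'part of' are assumptions made for illustration; the source ontologies may phrase them differently):
Class: 'pyramidal neuron'
  EquivalentTo: neuron and ('has characteristic' some 'standard pyramidal morphology')
Class: 'primary motor cortex pyramidal cell'
  EquivalentTo: 'pyramidal neuron' and ('has soma location' some 'primary motor cortex')
Class: 'Betz cell'
  SubClassOf: neuron, 'has characteristic' some 'standard pyramidal morphology', 'has soma location' some 'primary motor cortex layer 5'
Class: 'primary motor cortex layer 5'
  SubClassOf: 'part of' some 'primary motor cortex'
ObjectProperty: 'has soma location'
  SubPropertyChain: 'has soma location' o 'part of'
From these axioms the reasoner concludes that a Betz cell is a pyramidal neuron with its soma in the primary motor cortex, so it satisfies the equivalent class expression of 'primary motor cortex pyramidal cell'.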
Example of both overly constrictive and overly loose equivalent class axiom:
neuron equivalent to cell and (part_of some 'central nervous system')
This axiom is too loose, because non-neuronal cells such as glial cells are also part of the central nervous system and would be wrongly classified as neurons; it is also too constrictive, because neurons of the peripheral nervous system would be missed. In such cases, sometimes not having an equivalent class axiom is better (as in the case of neuron), and asserting the classification is the best way to classify a child.
"},{"location":"explanation/logical-axiomatization/#style-guide","title":"Style guide","text":"Each ontology has certain styles and conventions in how they axiomatize. This style guide is specific to OBO ontologies. We will also give reasons as to why we choose to axiomatize in the way we do. However, be aware of your local ontology's practices.
"},{"location":"explanation/logical-axiomatization/#respect-the-ontology-style","title":"Respect the ontology style","text":"It is important to note that ontologies have specific axiomatization styles and may apply to, for example, selecting a preferred relation. This usually reflects their use cases. For example, the Cell Ontology has a guide for what relations to use. An example of an agreement in the community is that while anatomical locations of cells are recorded using part of
, neurons should be recorded with has soma location
. This is to accommodate for the fact that many neurons have long reaching axons that cover multiple anatomical locations making them difficult to axiomatize using part of
.
For example, Betz cell
, a well known cell type which defines layer V of the primary motor cortex, synapses lower motor neurons or spinal interneurons (cell types that reside outside the brain). Having the axiom 'Betz cell' part_of 'cortical layer V'
is wrong. In this case has soma location
is used. Because of cases like these that are common in neurons, all neurons in CL should use has soma location
.
Do not add axioms that are not required. If a parent class already has the axiom, it should not be added to the child class too. For example:
retinal bipolar neuron
is a child of bipolar neuron
bipolar neuron
has the axiom 'has characteristic' some 'cortical bipolar morphology'
'has characteristic' some 'cortical bipolar morphology'
to retinal bipolar neuron
Axioms add lines to the ontology, resulting in larger ontologies that are harder to use. They also add redundancy, making the ontology hard to maintain as a single change in classification might require multiple edits.
"},{"location":"explanation/logical-axiomatization/#let-the-reasoner-do-the-work","title":"Let the reasoner do the work","text":"Asserted is_a parents do not need to be retained as entries in the 'SubClass of' section of the Description window in Prot\u00e9g\u00e9 if the logical definition for a term results in their inference.
For example, cerebral cortex GABAergic interneuron
has the following logical axioms:
Equivalent_To\n 'GABAergic interneuron' and\n ('has soma location' some 'cerebral cortex')\n
We do not need to assert that it is a cerebral cortex neuron
, CNS interneuron
, or neuron of the forebrain
as the reasoner automatically does that.
We avoid having asserted subclass axioms as these are redundant lines in the ontology which can result in a larger ontology, making them harder to use.
Good practice to let the reasoner do the work:
1) If you create a logical definition for your term, you should delete all redundant, asserted is_a parent relations by clicking on the X to the right of the term.\n2) If an existing term contains a logical definition and still shows an asserted is_a parent in the 'SubClass of' section, you may delete that asserted parent. Just make sure to run the Reasoner to check that the asserted parent is now replaced with the correct reasoned parent(s).\n3) Once you synchronize the Reasoner, you will see the reasoned classification of your new term, including the inferred is_a parent(s).\n4) If the inferred classification does not contain the correct parentage, or doesn't make sense, then you will need to modify the logical definition.\n
"},{"location":"explanation/ontology-matching/","title":"Ontology Matching","text":""},{"location":"explanation/ontology-matching/#ontology-matching-basic-techniques","title":"Ontology Matching: Basic Techniques","text":"10 min overview of J\u00e9r\u00f4me Euzenat and Pavel Shvaiko ground-breaking Ontology Matching.
"},{"location":"explanation/owl-building-blocks/","title":"The logical building blocks of OWL","text":"Here we briefly list the building blocks that are used in OWL that enable reasoning.
OWL Semantics Example instance or individual A member of a set. A person calledMary
or a dog called Fido
. class A set of in dividuals. The Person
class consisting of persons or the Dog
class consisting of dogs. object property A set of pairs of individuals. The owns
object property can link a pet and its owner: Mary owns Fido
. data property A set of pairs where each pair consists of an individual linked to a data value. The data property hasAge
can link a number representing an age to an individual: hasAge(Mary, 10)
."},{"location":"explanation/owl-format-variants/","title":"OWL, OBO, JSON? Base, simple, full, basic? What should you use, and why?","text":"For reference of the more technical aspects of release artefacts, please see documentation on Release Artefacts
Ontologies come in different serialisations, formalisms, and variants For example, their are a full 9 (!) different release files associated with an ontology released using the default settings of the Ontology Development Kit, which causes a lot of confusion for current and prospective users.
Note: In the OBO Foundry pages, \"variant\" is currently referred to as \"products\", i.e. the way we use \"variant\" here is synonymous with with notion of \"product\".
"},{"location":"explanation/owl-format-variants/#overview-of-the-relevant-concepts","title":"Overview of the relevant concepts","text":"Some people like to also list SHACL and Shex as ontology languages and formalism. Formalisms define syntax (e.g. grammar rules) and semantics (what does what expression mean?). The analogue in the real world would be natural languages, like English or Greek.
git diff
, i.e changes to ontologies in functional syntax are much easier to be reviewed. RDF/XML is not suitable for manual review, due to its verbosity and complexity.The real-world analogue of serialisation or format is the script, i.e. Latin or Cyrillic script (not quite clean analogue).
src/ontology/cl-edit.owl
.subClassOf
and partOf
. Some users that require the ontology to correspond to acyclic graphs, or deliberately want to focus only on a set of core relations, will want to use this variant, see docs). The formal definition of the basic variant can be found here.owl:imports
statements - these are easily ignored by your users and make the intended \"content\" of the ontology quite none-transparent.SubClassOf
vs EquivalentTo
","text":""},{"location":"explanation/subClassOf-vs-equivalentTo/#prerequisites","title":"Prerequisites","text":"This lesson assumes you have basic knowledge wrt ontologies and OWL as explained in:
SubClassOf
","text":"In this section we explain the semantics of SubClassOf
, give an example of using SubClassOf
and provide guidance for when not to use SubClassOf
.
If we have
Class: C\n SubClassOf: D\nClass: D\n
the semantics of it is given by the following Venn diagram:
Thus, the semantics is given by the subset relationship, stating the C
is a subset of D
. This means every individual of C
is necessarily an individual of D
, but not every individual of D
is necessarily an individual of C
.
Class: Dog\n SubClassOf: Pet\nClass: Pet\n
which as a Venn diagram will look as follows:
"},{"location":"explanation/subClassOf-vs-equivalentTo/#guidance","title":"Guidance","text":"There are at least 2 scenarios which at first glance may seem like C SubClassOf D
holds, but it does not hold, or using C EquivalentTo D
may be a better option.
C
has many individuals that are in D
, but there is at least 1 individual of C
that is not in D
. The following Venn diagram is an example. Thus, to check whether you may be dealing with this scenario, you can ask the following question: Is there any individual in C
that is not in D
? If 'yes', you are dealing with this scanario and you should not be using C SubClassOf D
. C
in D
, but also every individual in D
is in C
. This means C
and D
are equivalent. In the case you rather want to make use of EquivalentTo
.EquivalentTo
","text":""},{"location":"explanation/subClassOf-vs-equivalentTo/#semantics_1","title":"Semantics","text":"If we have
Class: C\n EquivalentTo: D\nClass: D\n
this means the sets C
and D
fit perfectly on each other, as shown in the next Venn diagram:
Note that C EquivalentTo D
is shorthand for
Class: C\n SubClassOf: D\nClass: D\n SubClassOf: C\n
though, in general it is better to use EquivalentTo
rather than the 2 SubClassOf
axioms when C
and D
are equivalent.
We all probably think of humans and persons as the exact same set of individuals.
Class: Person\n EquivalentTo: Human\nClass: Human\n
and as a Venn diagram:
"},{"location":"explanation/subClassOf-vs-equivalentTo/#guidance_1","title":"Guidance","text":"When do you not want to use EquivalentTo
?
C
that is not in D
.D
that is not in C
.Taxon restrictions (or, \"taxon constraints\") are a formalised way to record what species a term applies to\u2014something crucial in multi-species ontologies.
Even species neutral ontologies (e.g., GO) have classes that have implicit taxon restriction.
GO:0007595 ! Lactation - defined as \u201cThe secretion of milk by the mammary gland.\u201d\n
"},{"location":"explanation/taxon-constraints-explainer/#uses-for-taxon-restrictions","title":"Uses for taxon restrictions","text":"Finding inconsistencies. Taxon restrictions use terms from the NCBI Taxonomy Ontology, which asserts pairwise disjointness between sibling taxa (e.g., nothing can be both an insect and a rodent). When terms have taxon restrictions, a reasoner can check for inconsistencies.
When GO implemented taxon restrictions, they found 5874 errors!
Defining taxon-specific subclasses. You can define a taxon-specific subclass of a broader concept, e.g., 'human clavicle'. This allows you, for example, to assert relationships for the new term that don't apply to all instances of the broader concept:
'human clavicle' EquivalentTo ('clavicle bone' and ('in taxon' some 'Homo sapiens'))\n'human clavicle' SubClassOf ('connected to' some sternum)\n
Creating SLIMs. Use a reasoner to generate ontology subsets containing only those terms that are logically allowed within a given taxon.
Querying. Facet terms by taxon. E.g., in Brain Data Standards, in_taxon axioms allow faceting cell types by species. (note: there are limitations on this and may be incomplete).
There are, in essence, three categories of taxon-specific knowledge we use across OBO ontologies. Given a class C
, which could be anything from an anatomical entity to a biological process, we have the following categories:
C
are in some instance of taxon T
\"C SubClassOf (in_taxon some T)\n
C
.C
are in taxon T
\"C SubClassOf (not (in_taxon some T))`\n
C DisjointWith (in_taxon some T)
C SubClassOf (in_taxon some (not T))
C never_in_taxon T
# Editors use thisnever_in_taxon
annotations, the taxon should be as broad as possible for the maximum utility, but it must be the case that a C
is never found in any subclass of that taxon.C
and in_taxon some T
\"","text":"C
is in taxon T
\".IND:a Type (C and (in_taxon some T))`\n
C_in_T SubClassOf (C and (in_taxon some T)
(C_in_T
will be unsatisifiable if violates taxon constraints)C present_in_taxon T
# Editors use thisPlease see how-to guide on adding taxon restrictions
"},{"location":"explanation/taxon-constraints-explainer/#using-taxon-restrictions-for-quality-control","title":"Using taxon restrictions for Quality Control","text":"As stated above, one of the major applications for taxon restrictions in OBO is for quality control (QC), by finding logical inconsistencies. Many OBO ontologies consist of a complex web of term relationships, often crossing ontology boundaries (e.g., GO biological process terms referencing Uberon anatomical structures or CHEBI chemical entities). If particular terms are only defined to apply to certain taxa, it is critical to know that a chain of logic implies that the term must exist in some other taxon which should be impossible. Propagating taxon restrictions via logical relationships greatly expands their effectiveness (the GO term above may acquire a taxon restriction via the type of anatomical structure in which it occurs).
It can be helpful to think informally about how taxon restrictions propagate over the class hierarchy. It's different for all three types:
in_taxon
) include all superclasses of the taxon, and all subclasses of the subject term: %% Future editors, note that link styles are applied according to the link index, so be careful if adding or removing links. graph BT; n1(hair) ; n2(whisker) ; n3(Mammalia) ; n4(Tetrapoda) ; n2--is_a-->n1 ; n3--is_a-->n4 ; n1==in_taxon==>n3 ; n1-.in_taxon.->n4 ; n2-.in_taxon.->n3 ; n2-.in_taxon.->n4 ; linkStyle 0 stroke:#999 ; linkStyle 1 stroke:#999 ; style n1 stroke-width:4px ; style n3 stroke-width:4px ;never_in_taxon
) include all subclasses of the taxon, and all subclasses of the subject term: %% Future editors, note that link styles are applied according to the link index, so be careful if adding or removing links. graph BT; n1(facial whisker) ; n2(whisker) ; n3(Homo sapiens) ; n4(Hominidae) ; n1--is_a-->n2 ; n3--is_a-->n4 ; n2==never_in_taxon==>n4 ; n2-.never_in_taxon.->n3 ; n1-.never_in_taxon.->n4 ; n1-.never_in_taxon.->n3 ; linkStyle 0 stroke:#999 ; linkStyle 1 stroke:#999 ; style n2 stroke-width:4px ; style n4 stroke-width:4px ;present_in_taxon
) include all superclasses of the taxon, and all superclasses of the subject term: %% Future editors, note that link styles are applied according to the link index, so be careful if adding or removing links. graph BT; n1(hair) ; n2(whisker) ; n3(Felis) ; n4(Carnivora) ; n2--is_a-->n1 ; n3--is_a-->n4 ; n2==present_in_taxon==>n3 ; n1-.present_in_taxon.->n3 ; n2-.present_in_taxon.->n4 ; n1-.present_in_taxon.->n4 ; linkStyle 0 stroke:#999 ; linkStyle 1 stroke:#999 ; style n2 stroke-width:4px ; style n3 stroke-width:4px ;The Relation Ontology defines number of property chains for the in_taxon
property. This allows taxon restrictions to propagate over other relationships. For example, the part_of o in_taxon -> in_taxon
chain implies that if a muscle is part of a whisker, then the muscle must be in a mammal, but not in a human, since we know both of these things about whiskers:
Property chains are the most common way in which taxon restrictions propagate across ontology boundaries. For example, Gene Ontology uses various subproperties of results in developmental progression of to connect biological processes to Uberon anatomical entities. Any taxonomic restrictions which hold for the anatomical entity will propagate to the biological process via this property.
The graph depictions in the preceding illustrations are informal; in practice never_in_taxon
and present_in_taxon
annotations are translated into more complex logical constructions using the in_taxon
object property, described in the next section. These logical constructs allow the OWL reasoner to determine that a class is unsatisfiable when there are conflicts between taxon restriction inferences.
The OWL axioms required to derive the desired entailments for taxon restrictions are somewhat more complicated than one might expect. Much of the complication is the result of workarounds to limitations dictated by the OWL EL profile. Because of the size and complexity of many of the ontologies in the OBO Library, particularly those heavily using taxon restrictions, we primarily rely on the ELK reasoner, which is fast and scalable since it implements OWL EL rather than the complete OWL language. In the following we discuss the particular kinds of axioms required in order for taxon restrictions to work with ELK, with some comments about how it could work with HermiT (which implements the complete OWL language but is much less scalable). We will focus on this example ontology:
%% Future editors, note that link styles are applied according to the link index, so be careful if adding or removing links. graph BT; n1(hair) ; n2(whisker) ; n3(muscle) ; n4(whisker muscle) ; n5(whisker muscle in human) ; n6(whisker in catfish) ; n7(whisker in human) ; n8(Vertebrata) ; n9(Teleostei) ; n10(Siluriformes) ; n11(Tetrapoda) ; n12(Mammalia) ; n13(Hominidae) ; n14(Homo sapiens) ; n2--is_a-->n1 ; n4--is_a-->n3 ; n9--is_a-->n8 ; n10--is_a-->n9 ; n11--is_a-->n8 ; n12--is_a-->n11 ; n13--is_a-->n12 ; n14--is_a-->n13 ; n5--is_a-->n4 ; n6--is_a-->n2 ; n7--is_a-->n2 ; n4--part_of-->n2 ; n11 --disjoint_with--- n9 ; n1==in_taxon==>n12 ; n2==never_in_taxon==>n13 ; n5==in_taxon==>n14 ; n7==in_taxon==>n14 ; n6==in_taxon==>n10 ; linkStyle 0 stroke:#999 ; linkStyle 1 stroke:#999 ; linkStyle 2 stroke:#999 ; linkStyle 3 stroke:#999 ; linkStyle 4 stroke:#999 ; linkStyle 5 stroke:#999 ; linkStyle 6 stroke:#999 ; linkStyle 7 stroke:#999 ; linkStyle 8 stroke:#999 ; linkStyle 9 stroke:#999 ; linkStyle 10 stroke:#999 ; linkStyle 11 stroke:#008080 ; linkStyle 12 stroke:red ; style n5 stroke-width:4px,stroke:red ; style n6 stroke-width:4px,stroke:red ; style n7 stroke-width:4px,stroke:red ;There are three classes outlined in red which were created mistakenly; the asserted taxon for each of these conflicts with taxon restrictions in the rest of the ontology:
part_of o in_taxon -> in_taxon
. This conflicts with its asserted in_taxon 'Homo sapiens', a subclass of 'Hominidae'.We can start by modeling the two taxon restrictions in the ontology like so:
'hair' SubClassOf (in_taxon some 'Mammalia')
'whisker' SubClassOf (not (in_taxon some 'Hominidae'))
Both HermiT and ELK can derive that 'whisker in human' is unsatisfiable. This is the explanation:
'human whisker' EquivalentTo ('whisker' and (in_taxon some 'Homo sapiens'))
'Homo sapiens' SubClassOf 'Hominidae'
'whisker' SubClassOf (not ('in_taxon' some 'Hominidae'))
Unfortunately, neither reasoner detects the other two problems. We'll address the 'whisker in catfish' first. The reasoner infers that this class is in_taxon
both 'Mammalia' and 'Siluriformes'. While these are disjoint classes (all sibling taxa are asserted to be disjoint in the taxonomy ontology), there is nothing in the ontology stating that something can only be in one taxon at a time. The most intuitive solution to this problem would be to assert that in_taxon
is a functional property. However, due to limitations of OWL, functional properties can't be used in combination with property chains. Furthermore, functional properties aren't part of OWL EL. There is one solution that works for HermiT, but not ELK. We could add an axiom like the following to every \"always in taxon\" restriction:
'hair' SubClassOf (in_taxon only 'Mammalia')
This would be sufficient for HermiT to detect the unsatisfiability of 'whisker in catfish' (assuming taxon sibling disjointness). Unfortunately, only
restrictions are not part of OWL EL. Instead of adding the only
restrictions, we can generate an extra disjointness axiom for every taxon disjointness in the taxonomy ontology, e.g.:
(in_taxon some 'Tetrapoda') DisjointWith (in_taxon some 'Teleostei')
The addition of axioms like that is sufficient to detect the unsatisfiability of 'whisker in catfish' in both HermiT and ELK. This is the explanation:
'whisker in catfish' EquivalentTo ('whisker' and (in_taxon some 'Siluriformes'))
'whisker' SubClassOf 'hair'
'hair' SubClassOf (in_taxon some 'Mammalia')
'Mammalia' SubClassOf 'Tetrapoda'
'Siluriformes' SubClassOf 'Teleostei'
(in_taxon some 'Teleostei') DisjointWith (in_taxon some 'Tetrapoda')
While we can now detect two of the unsatisfiable classes, sadly neither HermiT nor ELK yet finds 'whisker muscle in human' to be unsatisfiable, which requires handling the interaction of a \"never\" assertion with a property chain. If we were able to make in_taxon
a functional property, HermiT should be able to detect the problem; but as we said before, OWL doesn't allow us to combine functional properties with property chains. The solution is to add even more generated disjointness axioms, one for each taxon (in combination with the extra disjointness we added in the previous case), e.g.,:
(in_taxon some Hominidae) DisjointWith (in_taxon some (not Hominidae))
While that is sufficient for HermiT, for ELK we also need to add another axiom to the translation of each never_in_taxon assertion, e.g.,:
'whisker' SubClassOf (in_taxon some (not 'Hominidae'))
Now both HermiT and ELK can find 'whisker muscle in human' to be unsatisfiable. This is the explanation from ELK:
'whisker muscle in human' EquivalentTo ('whisker muscle' and (in_taxon some 'Homo sapiens'))
'Homo sapiens' SubClassOf 'Hominidae'
'whisker muscle' SubClassOf (part_of some 'whisker')
'whisker' SubClassOf (in_taxon some ('not 'Hominidae'))
part_of o in_taxon SubPropertyOf in_taxon
(in_taxon some 'Hominidae') DisjointWith (in_taxon some (not 'Hominidae'))
The above example didn't incorporate any present_in_taxon (SOME-IN) assertions. These work much the same as ALL-IN in_taxon assertions. However, instead of stating that all instances of a given class are in a taxon (C SubClassOf (in_taxon some X)
), we either state that there exists an individual of that class in that taxon, or that there is some subclass of that class whose instances are in that taxon:
<generated individual IRI> Type (C and (in_taxon some X))
\u2014 violations involving this assertion will make the ontology logically inconsistent.
or
<generated class IRI> SubClassOf (C and (in_taxon some X))
\u2014 violations involving this assertion will make the ontology logically incoherent, i.e., a named class is unsatisfiable (here, <generated class IRI>
).
Incoherency is easier to debug than inconsistency, so option 2 is the default expansion for present_in_taxon
.
In summary, the following constructs are all needed for QC using taxon restrictions:
in_taxon
property chains for relations which should propagate in_taxon
inferencesX DisjointWith Y
for all sibling taxa X
and Y
(in_taxon some X) DisjointWith (in_taxon some Y)
for all sibling taxa X
and Y
(in_taxon some X) DisjointWith (in_taxon some (not X))
for every taxon X
C in_taxon X
C SubClassOf (in_taxon some X)
C never_in_taxon X
C SubClassOf (not (in_taxon some X))
C SubClassOf (in_taxon some (not X))
C present_in_taxon X
)<generated class IRI> SubClassOf (C and (in_taxon some X))
If you are checking an ontology for coherency in a QC pipeline (such as by running ROBOT within the ODK), you will need to have the required constructs from the previous section present in your import chain:
http://purl.obolibrary.org/obo/ncbitaxon.owl
)http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim-disjoint-over-in-taxon.owl
(or implement a way to generate the needed disjointness axioms)(in_taxon some X) DisjointWith (in_taxon some (not X))
. You may need to implement a way to generate the needed disjointness axioms until this is corrected.never_in_taxon
and present_in_taxon
shortcut annotation properties, you can expand these into the logical forms using robot expand
.present_in_taxon
expansions add named classes to your ontology, you will probably want to organize your pipeline in such a way that this expansion only happens in a QC check, and the output is not included in your published ontology.Using the DL Query panel and a running reasoner, it is straightforward to check whether a particular taxon restriction holds for a term (such as when someone has requested one be added to your ontology). Given some term of interest, e.g., 'whisker', submit a DL Query such as 'whisker' and (in_taxon some Mammalia)
. Check the query results:
Equivalent classes
includes owl:Nothing
, then a never_in_taxon is implied for that taxon.Equivalent classes
includes the term of interest itself (and not owl:Nothing
), then an in_taxon is implied for that taxon.Superclasses
includes the term of interest (and the query isn't equivalent to owl:Nothing
), then there is no particular taxon restriction involving that taxon.To quickly see exactly which taxon restrictions are in effect for a selected term, install the OBO taxon constraints plugin for Prot\u00e9g\u00e9. Once you have the plugin installed, you can add it to your Prot\u00e9g\u00e9 window by going to the menu Window > Views > OBO views > Taxon constraints
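For example, using the whisker classes from the earlier QC example (a sketch; the labels are assumptions), one might submit the following Manchester syntax expressions:
'whisker' and (in_taxon some 'Mammalia')
'whisker' and (in_taxon some 'Hominidae')
The first query comes back equivalent to 'whisker' itself, indicating an implied in_taxon restriction, while the second comes back equivalent to owl:Nothing, indicating an implied never_in_taxon restriction.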
, and then clicking the location to place the panel. The plugin will show the taxon constraints in effect for the selected OWL class. When a reasoner is running, any inferred taxon constraints will be shown along with directly asserted ones. The plugin executes many reasoner queries behind the scenes, so there may be a delay before the user interface is updated.
Comments are annotations that may be added to ontology terms to further explain their intended usage, or include information that is useful but does not fit in areas like definition.
Some examples of comments, and possible standard language for their usage, are:
WARNING: THESE EXAMPLES ARE NOT UNIVERSALLY USED AND CAN BE CONTROVERSIAL IN SOME ONTOLOGIES! PLEASE CHECK WITH THE CONVENTIONS OF YOUR ONTOLOGY BEFORE DOING THIS!
"},{"location":"explanation/term-comments/#do-not-annotate","title":"Do Not Annotate","text":"This term should not be used for direct annotation. It should be possible to make a more specific annotation to one of the children of this term.
Example: GO:0006810 transport
Note that this term should not be used for direct annotation. It should be possible to make a more specific annotation to one of the children of this term, for e.g. transmembrane transport, microtubule-based transport, vesicle-mediated transport, etc.
"},{"location":"explanation/term-comments/#do-not-manually-annotate","title":"Do Not Manually Annotate","text":"This term should not be used for direct manual annotation. It should be possible to make a more specific manual annotation to one of the children of this term.
Example: GO:0000910 cytokinesis
Note that this term should not be used for direct annotation. When annotating eukaryotic species, mitotic or meiotic cytokinesis should always be specified for manual annotation and for prokaryotic species use 'FtsZ-dependent cytokinesis; GO:0043093' or 'Cdv-dependent cytokinesis; GO:0061639'. Also, note that cytokinesis does not necessarily result in physical separation and detachment of the two daughter cells from each other.
"},{"location":"explanation/term-comments/#additional-information","title":"Additional Information","text":"Information about the term that do not belong belong in the definition or gloss, but are useful for users or editors. This might include information that is adjacent to the class but pertinent to its usage, extended information about the class (eg extended notes about a characteristic of a cell type) that might be useful but does not belong in the definition, important notes on why certain choices were made in the curation of this terms (eg why certain logical axioms were excluded/included in the way they are) (Note: dependent on ontology, some of these might belong in editors_notes, etc.).
Standard language for these are not given as they vary dependent on usage.
"},{"location":"explanation/which-ontology-to-use/","title":"Which biomedical ontologies should we use?","text":"As a rule of thumb, for every single problem/term/use case, you will have 3-6 options to choose from, in some cases even more. The criteria for selecting a good ontology are very much dependent on your particular use case, but some concerns are generally relevant. A good first pass is to apply to \"10 simple rules for selecting a Bio-ontology\" by Malone et al, but I would further recommend to ask yourself the following:
Aside from aspects of your analysis, there is one more thing you should consider carefully: the open-ness of your ontology in question. As a user, you have quite a bit of power on the future trajectory of the domain, and therefore should seek to endorse and promote open standards as much as possible (for egotistic reasons as well: you don't want to have to suddenly pay for the ontologies that drive your semantic analyses). It is true that ontologies such as SNOMED have some great content, and, even more compellingly, some really great coverage. In fact, I would probably compare SNOMED not with any particular disease ontology, but with the OBO Foundry as a whole, and if you do that, it is a) cleaner, b) better integrated. But this comes at a cost. SNOMED is a commercial product - millions are being paid every year in license fees, and the more millions come, the better SNOMED will become - and the more drastic consequences will the lock-in have if one day you are forced to use SNOMED because OBO has fallen too far behind. Right now, the sum of all OBO ontologies is probably still richer and more valuable, given their use in many of the central biological databases (such as the ones hosted by the EBI) - but as SNOMED is seeping into the all aspects of genomics now (for example, it will soon be featured on OLS!) it will become increasingly important to actively promote the use of open biomedical ontologies - by contributing to them as well as by using them.
"},{"location":"explanation/writing-good-issues/","title":"Writing Good Issues","text":"Based on Intro to GitHub (GO-Centric) with credit to Nomi Harris and Chris Mungall
Writing a good ticket (or issue) is crucial to good management of a repo. In this explainer, we will discuss some good practices in writing a ticket and show examples of what not to do.
"},{"location":"explanation/writing-good-issues/#best-practices","title":"Best Practices","text":"See Daily Curator Workflow for creating branches and basic Prot\u00e9g\u00e9 instructions.
In the main Prot\u00e9g\u00e9 window, click on the \"Entities\" tab. Below that, click the \"Annotation properties\" tab.
Select the subset_property
annotation property.
Click on the \"Add sub property\" button.
In the pop-up window, add the name of the new slim. The IRI will automatically populate according to settings in the user's \"New entities\" settings. Click OK.
With the newly created annotation property selected, click on \"Refactor > Rename entity...\" in the menu.
In the pop-up window, select the \"Show full IRI\" checkbox. The IRI will appear. Edit the IRI to fit the following standard:
http://purl.obolibrary.org/obo/{ontology_abbreviation}#{label_of_subset}
For example, in CL, the original IRI will appear as:
http://purl.obolibrary.org/obo/CL_1234567
If the subset was labeled \"kidney_slim\", the IRI should be updated to:
http://purl.obolibrary.org/obo/cl#kidney_slim
In the 'Annotations\" window, click the +
next to \"Annotations\".
In the pop-up window, select the rdfs:comment
annotation property. Under \"Value\" enter a brief descripton for the slim. Under \"Datatype\" select xsd:string
. Click OK.
See Daily Curator Workflow section for commit, push and merge instructions.
"},{"location":"howto/add-new-slim/#adding-a-class-term-to-a-subset-slim","title":"Adding a class (term) to a subset (slim)","text":"See Daily Curator Workflow for creating branches and basic Prot\u00e9g\u00e9 instructions.
In the main Prot\u00e9g\u00e9 window, click on the \"Entities\" tab. Select the class that is to be added to a subset (slim).
In the 'Annotations\" window, click the +
next to \"Annotations\".
In the pop-up window, select the in_subset
annotation property.
Click on the \u2018Entity IRI\u2019 tab.
Search for the slim label under \"Entity IRI\". In the pop-up that appears, double-click on the desired slim. Ensure that a sub property of subset_property
is selected. Click OK.
See Daily Curator Workflow section for commit, push and merge instructions.
"},{"location":"howto/add-taxon-restrictions/","title":"Adding taxon restrictions","text":"Before adding taxon restrictions, please review the types of taxon restrictions documentation.
See Daily Workflow for creating branches and basic Prot\u00e9g\u00e9 instructions.
in taxon
relations are added as Subclasses
.+
.'in taxon' some Viridiplantae
).never in taxon
or present in taxon
relations added as Annotations
.+
.never_in_taxon
or present_in_taxon
as appropriate.See Daily Workflow section for commit, push and merge instructions.
"},{"location":"howto/change-files-pull-request/","title":"How to change files in an existing pull request","text":""},{"location":"howto/change-files-pull-request/#using-github","title":"Using GitHub","text":"Warning: You should only use this method if the files you are editing are reasonably small (less than 1 MB).
This method only works if the file you want to edit has already been editing as part of the pull request.
...
, and then \"Edit file\".If this option is greyed out, it means that - you don't have edit rights on the repository - the edit was made from a different fork, and the person that created the pull request did not activate the \"Allow maintainers to make edits\" option when submitting the PR - the pull request has already been merged
In GitHub Desktop, click the branch switcher button and paste in branch name (or you can type it in).
Now you are on the branch, you can open the files to be edited and make your intended changes and push via the usual workflow.
If a user forked the repository and created a branch, you can find that branch by going to the branch switcher button in GitHub Desktop, click on Pull Requests (next to Branches) and looking for that pull request
Select that pull request and edit the appropriate files as needed and push via the usual workflow.
Prerequisite: Install Github Desktop Github Desktop can be downloaded here
For the purpose of going through this how-to guide, we will use Mondo as an example. However, all obo onotlogies can be cloned in a similar way.
mondo
can be replaced with any ontology that is setup using the ODK as their architecture should be the same.If this all works okay, you are all set to start editing!
"},{"location":"howto/create-new-term/","title":"Creating a New Ontology Term in Protege","text":"To create a new term, the 'Asserted view' must be active (not the 'Inferred view').
In the Class hierarchy window, click on the 'Add subclass' button at the upper left of the window.
+
next to Annotations 2. Add Definition References\n 1. Click on the circle with the \u2018@\u2019 in it next to definition and in the resulting pop-up click on the ```+``` to add a new ref, making sure they are properly formatted with a database abbreviation followed by a colon, followed by the text string or ID. Examples: ```PMID:27450630```.\n 2. Click OK.\n 3. Add each definition reference separately by clicking on the ```+``` sign.\n
3. Add synonyms and dbxrefs following the same procedure if they are required for the term.
+
sign in the appropriate section (usually SubClass Of) and typing it in, using Tab
to autocomplete terms.Converting to Equivalent To axioms:\nIf you want to convert your SubClassOf axioms to EquivalentTo axioms, you can select the appropriate rows and right click, selecting \"Convert selected rows to defined class\"\n
In some cases, logical axioms reuiqre external ontologies (eg in the above example, the newly added CL term has_soma_location in the cerebellar cortex which is an uberon term), it might be necessary to import the term in. For instructions on how to do this, please see the import managment section of your local ontology documentation (an example of that in CL can be found here: https://obophenotype.github.io/cell-ontology/odk-workflows/UpdateImports/)
When you have finished adding the term, run the reasoner to ensure that nothing is problematic with the axioms you have added (if there is an issue, you will see it being asserted under owl:Nothing)
Save the file on protege and review the changes you have made in your Github Desktop (or use git diff
in your terminal if you do not use Github Desktop)
See Daily Workflow section for commit, push and merge instructions.
Editors:
Summary:
This is a guide to build an OBO ontology from scratch. We will focus on the kind of thought processes you want to go through, and providing the following:
Before reading on, there are three simple rules for when NOT to build an ontology everyone interested in ontologies should master, like a mantra:
Do not build a new ontology if:
Scope is one of the hardest and most debated subjects in the OBO Foundry operation calls. There are essentially two aspects to scope:
phenotype
, disease
, anatomical entity
, assay
, environmental exposure
, biological process
, chemical entity
. Before setting out to build an ontology, you should get a rough sense of what kind of entities you need to describe your domain. However, this is an iterative process and more entities will be revealed later on.Alzheimer's Disease
, which will need many different kinds of biological entities (like anatomical entity
and disease
classes).As a rule of thumb, you should NOT create a term if another OBO ontology has a branch of for entities of the same kind
. For example, if you have to add terms for assays, you should work with the Ontology for Biomedical Investigations to add these to their assay branch.
Remember, the vision of OBO is to build a semantically coherent ontology for all of biology, and the individual ontologies in the OBO Foundry should be considered \"modules\" of this super ontology. You will find that while collaboration is always hard the only way for our community to be sustainable and compete with commercial solutions is to take that hard route and work together.
"},{"location":"howto/create-ontology-from-scratch/#something-simpler-works-condition","title":"Something-simpler-works condition","text":"There are many kinds of semantic artefacts that can work for your use case:
Think of it in terms of cost. Building a simple vocabulary with minimal axiomatisation is 10x cheaper than building a full fledged domain model in OWL, and helps solving your use case just the same. Do not start building an ontology unless you have some understanding of these alternatives first.
"},{"location":"howto/create-ontology-from-scratch/#killer-use-case-condition","title":"Killer-use-case condition","text":"Do not build an ontology because someone tells you to or because you \"think it might be useful\". Write out a proper use case description, for example in the form of an agile user story, convince two or three colleagues this is worthwhile and only then get to work. Many ontologies are created for very vague use cases, and not only do they cost you time to build, they also cost the rest of the community time - time it takes them to figure out that they do not want to use your ontology. Often, someone you trust tells you to build one and you believe they know what they are doing - do not do that. Question the use of building the ontology until you are convinced it is the right thing to do. If you do not care about reasoning (either for validation or for your application), do not build an ontology.
"},{"location":"howto/create-ontology-from-scratch/#basic-recipe-to-start-building-an-ontology","title":"Basic recipe to start building an ontology","text":"Depending on your specific starting points, the way you start will be slightly different, but some common principles apply.
workflow
system, i.e. some way to run commands like release
or test
, as you will run these repeatedly. A typical system to achieve this is make, and many projects choose to encode their workflows as make
targets (ODK, OBI Makfile).Note: Later in the process, you also want to think about the following:
There are many different starting points for building an ontology:
There are two fundamentally different kinds of ontologies which need to be distinguished:
Some things to consider:
It is imperative that it is clear which of the two you are building. Project ontologies sold as domain ontologies are a very common practice and they cause a lot of harm for open biomedical data integration.
"},{"location":"howto/create-ontology-from-scratch/#example-building-vertebrate-breed-ontology","title":"Example: Building Vertebrate Breed Ontology","text":"We will re-iterate some of the steps taken to develop the Vertebrate Breed Ontology. At the time of this writing, the VBO is still in early stages, but it nicely illustrates all the points above.
"},{"location":"howto/create-ontology-from-scratch/#use-case","title":"Use case","text":"See here. Initial interactions with the OMIA team further determined more long term goals such as phenotypic similarity and reasoning.
"},{"location":"howto/create-ontology-from-scratch/#similar-ontologies","title":"Similar ontologies","text":"Similar ontologies. While there is no ontology OBO ontology related to breeds, the Livestock Breed Ontology (LBO) served as an inspiration (much different scale). NCBI taxonomy is a more general ontology about existing taxa as they occur in the wild.
"},{"location":"howto/create-ontology-from-scratch/#starting-point","title":"Starting point","text":"Our starting point was the raw OMIA data.
species
represents the same concept as \u2018species\u2019 in NCBI, the ontology should be built \u2018on top of\u2019 NCBI terms to avoid confusion of concepts and to avoid conflation of terms with the same conceptWarnings based on our experience:
For us this was using Google Sheets, ROBOT & ODK.
"},{"location":"howto/create-ontology-from-scratch/#the-ontology-id","title":"The Ontology ID","text":"At first, we chose to name the ontology \"Unified Breed Ontology\" (UBO). Which meant that for everything from ODK setup to creating identifiers for our terms, we used the UBO
prefix. Later in the process, we decided to change the name to \"Vertebrate Breed Ontology\". Migrating all the terms and the ODK setup from ubo
to vbo
required some expert knowledge on the workings of the ODK, and created an unnecessary cost. We should have finalised the choice of name first.
Thank you to Melanie Courtot, Sierra Moxon, John Graybeal, Chris Stoeckert, Lars Vogt and Nomi Harris for their helpful comments on this how-to.
"},{"location":"howto/daily-curator-workflow/","title":"Daily Ontology Curator Workflow with GitHub","text":""},{"location":"howto/daily-curator-workflow/#updating-the-local-copy-of-the-ontology-with-git-pull","title":"Updating the local copy of the ontology with 'git pull'","text":"Navigate to the ontology directory of go-ontology: cd repos/MY-ONTOLOGY/src/ontology
.
If the terminal window is not configured to display the branch name, type: git status
. You will see:
On branch [master] [or the name of the branch you are on] Your branch is up-to-date with 'origin/master'.
If you\u2019re not in the master branch, type: git checkout master
.
From the master branch, type: git pull
. This will update your master branch, and all working branches, with the files that are most current on GitHub, bringing in and merging any changes that were made since you last pulled the repository using the command git pull
. You will see something like this:
~/repos/MY-ONTOLOGY(master) $ git pull\nremote: Counting objects: 26, done.\nremote: Compressing objects: 100% (26/26), done.\nremote: Total 26 (delta 12), reused 0 (delta 0), pack-reused 0\nUnpacking objects: 100% (26/26), done.\nFrom https://github.com/geneontology/go-ontology\n 580c01d..7225e89 master -> origin/master\n * [new branch] issue#13029 -> origin/issue#13029\nUpdating 580c01d..7225e89\nFast-forward\n src/ontology/go-edit.obo | 39 ++++++++++++++++++++++++---------------\n 1 file changed, 24 insertions(+), 15 deletions(-)\n~/repos/MY-ONTOLOGY(master) $\n
"},{"location":"howto/daily-curator-workflow/#creating-a-new-working-branch-with-git-checkout","title":"Creating a New Working Branch with 'git checkout'","text":"When starting to work on a ticket, you should create a new branch of the repository to edit the ontology file.
Make sure you are on the master branch before creating a new branch. If the terminal window is not configured to display the branch name, type: git status
to check which is the active branch. If necessary, go to master by typing git checkout master
.
To create a new branch, type: git checkout -b issue-NNNNN
in the terminal window. For naming branches, we recommend using the string 'issue-' followed by the issue number. For instance, for this issue in the tracker: https://github.com/geneontology/go-ontology/issues/13390, you would create this branch: git checkout -b issue-13390
. Typing this command will automatically put you in the new branch. You will see this message in your terminal window:
~/repos/MY-ONTOLOGY/src/ontology(master) $ git checkout -b issue-13390\nSwitched to a new branch 'issue-13390'\n~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $\n
"},{"location":"howto/daily-curator-workflow/#continuing-work-on-an-existing-working-branch","title":"Continuing work on an existing Working Branch","text":"If you are continuing to do work on an existing branch, in addition to updating master, go to your branch by typing git checkout [branch name]
. Note that you can view the existing local branches by typing git branch -l
.
OPTIONAL: To update the working branch with respect to the current version of the ontology, type git pull origin master
. This step is optional because it is not necessary to work on the current version of the ontology; all changes will be synchronized when git merge is performed.
Before launching Prot\u00e9g\u00e9, make sure you are in the correct branch. To check the active branch, type git status
.
Click on the 'File' pulldown. Open the file: go-edit.obo. The first time, you will have to navigate to repos/MY-ONTOLOGY/src/ontology
. Once you have worked on the file, it will show up in the menu under 'Open'/'Recent'.
Click on the 'Classes' tab.
Searching: Use the search box on the upper right to search for a term in the ontology. Wait for autocomplete to work in the pop-up window.
Viewing a term: Double-click on the term. This will reveal the term in the 'Class hierarchy' window after a few seconds.
Launching the reasoner: To see the term in the 'Class hierarchy' (inferred) window, you will need to run the 'ELK reasoner'. 'Reasoner' > select ELK 0.4.3, then click 'Start reasoner'. Close the various pop-up warnings about the ELK reasoner. You will now see the terms in the inferred hierarchy.
After modification of the ontology, synchronize the reasoner. Go to menu: 'Reasoner' > ' Synchronize reasoner'.
NOTE: The only changes that the reasoner will detect are those impacting the ontology structure: changes in equivalence axioms, subclasses, merges, obsoletions, new terms.
TIP: When adding new relations/axioms, 'Synchronize' the reasoner. When deleting relations/axioms, it is more reliable to 'Stop' and 'Start' the reasoner again.
Use File > Save to save your changes.
Review: Changes made to the ontology can be viewed by typing git diff
in the terminal window. If there are changes that have already been committed, the changes in the active branch relative to master can be viewed by typing git diff master
.
Commit: Changes can be committed by typing: git commit -m \u2018Meaningful message Fixes #ticketnumber\u2019 go-edit.obo
.
For example:
git commit -m \u2018hepatic stellate cell migration and contraction and regulation terms. Fixes #13390\u2019 go-edit.obo\n
This will save the changes to the go-edit.obo file. The terminal window will show something like:
~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $ git commit -m 'Added hepatic stellate cell migration and contraction and regulation terms. Fixes #13390' go-edit.obo\n [issue-13390 dec9df0] Added hepatic stellate cell migration and contraction and regulation terms. Fixes #13390\n 1 file changed, 79 insertions(+)\n ~/repos/MY-ONTOLOGY/src/ontology(issue-13390) $\n
Committer: Kimberly Van Auken vanauken@kimberlukensmbp.dhcp.lbnl.us Your name and email address were configured automatically based on your username and hostname. Please check that they are accurate.
Push: To incorporate the changes into the remote repository, type: git push origin mynewbranch
.
Example:
git push origin issue-13390\n
Pull
geneontology/go-ontology/code
. You will see your commit listed at the top of the page in a light yellow box. If you don\u2019t see it, click on the 'Branches' link to reveal it in the list, and click on it.Merge If the Travis checks are succesful and if you are done working on that branch, merge the pull request. Confirming the merge will close the ticket if you have used the word 'fixes' in your commit comment. NOTE: Merge the branches only when the work is completed. If there is related work to be done as a follow up to the original request, create a new GitHub ticket and start the process from the beginning.
Delete your branch on the repository using the button on the right of the successful merge message.
You may also delete the working branch on your local copy. Note that this step is optional. However, if you wish to delete branches on your local machine, in your terminal window:
git checkout master
.git pull origin master
git branch -d workingbranchname
. Example: git branch -d issue-13390
Dealing with very large ontologies, such as the Protein Ontology (PR), NCBI Taxonomy (NCBITaxon), Gene Ontology (GO) and the CHEBI Ontology is a big challenge when developing ontologies, especially if we want to import and re-use terms from them. There are two major problems:
There are a few strategies we can employ to deal with the problem of memory consumption:
To deal with file size, we:
All four strategies will be discussed in the following. We will then look a bit
"},{"location":"howto/deal-with-large-ontologies/#overwrite-odk-default-less-fancy-custom-modules","title":"Overwrite ODK default: less fancy, custom modules","text":"The default recipe for creating a module looks something like this:
imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt\n if [ $(IMP) = true ]; then $(ROBOT) query -i $< --update ../sparql/preprocess-module.ru \\\n extract -T imports/$*_terms_combined.txt --force true --copy-ontology-annotations true --individuals exclude --method BOT \\\n query --update ../sparql/inject-subset-declaration.ru --update ../sparql/postprocess-module.ru \\\n annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/%_import.owl\n
(Note: This snippet was copied here on 10 February 2021 and may be out of date by the time you read this.)
As you can see, a lot of stuff is going on here: first we run some preprocessing (which is really costly in ROBOT, as we need to load the ontology into Jena, and then back into the OWL API \u2013 so basically the ontology is loaded three times in total), then extract a module, then run more SPARQL queries etc, etc. Costly. For small ontologies, this is fine. All of these processes are important to mitigate some of the shortcomings of module extraction techniques, but even if they could be sorted in ROBOT, it may still not be enough.
So what we can do now is this. In your ont.Makefile
(for example, go.Makefile
, NOT Makefile
), located in src/ontology
, you can add a snippet like this:
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt\n if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \\\n annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/pr_import.owl\n
Note that all the %
variables and uses of $*
are replaced by the ontology ID in question. Adding this to your ont.Makefile
will overwrite the default ODK behaviour in favour of this new recipe.
The ODK supports this reduced module out of the box. To activate it, do this:
import_group:\n products:\n - id: pr\n use_gzipped: TRUE\n is_large: TRUE\n
This will (a) ensure that PR is pulled from a gzipped location (you have to check whether it exists though. It must correspond to the PURL, followed by the extension .gz
, for example http://purl.obolibrary.org/obo/pr.owl.gz
) and (b) that it is considered large, so the default handling of large imports is activated for pr
, and you don't need to paste anything into ont.Makefile
.
If you prefer to do it yourself, in the following sections you can find a few snippets that work for three large ontologies. Just copy and paste them into ont.Makefile
, and adjust them however you wish.
imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt\n if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \\\n annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/pr_import.owl\n
"},{"location":"howto/deal-with-large-ontologies/#ncbi-taxonomy-ncbitaxon","title":"NCBI Taxonomy (NCBITaxon)","text":"imports/ncbitaxon_import.owl: mirror/ncbitaxon.owl imports/ncbitaxon_terms_combined.txt\n if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/ncbitaxon_terms_combined.txt --force true --method BOT \\\n annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/ncbitaxon_import.owl\n
"},{"location":"howto/deal-with-large-ontologies/#chebi","title":"CHEBI","text":"imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt\n if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \\\n annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/chebi_import.owl\n
Feel free to use an even cheaper approach, even one that does not use ROBOT, as long as it produces the target of the goal (e.g. imports/chebi_import.owl
).
For some ontologies, you can find slims that are much smaller than full ontology. For example, NCBITaxon maintains a slim for OBO here: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.owl, which smaller than the 1 or 2 GB of the full version. Many ontologies maintain such slims, and if not, probably should. (I would really like to see an OBO slim for Protein Ontology!)
(note the .obo file is even smaller but currently robot has issues getting obo files from the web)
You can also add your favourite taxa to the NCBITaxon slim by simply making a pull request on here: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt
You can use those slims simply like this:
import_group:\n products:\n - id: ncbitaxon\n mirror_from: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo\n
"},{"location":"howto/deal-with-large-ontologies/#manage-imports-manually","title":"Manage imports manually","text":"This is a real hack \u2013 and we want to strongly discourage it \u2013 but sometimes, importing an ontology just to import a single term is total overkill. What we do in these cases is to maintain a simple template to \"import\" minimal information. I can't stress enough that we want to avoid this, as such information will necessarily go out of date, but here is a pattern you can use to handle it in a sensible way:
Add this to your src/ontology/ont-odk.yaml
:
import_group:\n products:\n - id: my_ncbitaxon\n
Then add this to src/ontology/ont.Makefile
:
mirror/my_ncbitaxon.owl:\n echo \"No mirror for $@\"\n\nimports/my_ncbitaxon_import.owl: imports/my_ncbitaxon_import.tsv\n if [ $(IMP) = true ]; then $(ROBOT) template --template $< \\\n --ontology-iri \"$(ONTBASE)/$@\" --output $@.tmp.owl && mv $@.tmp.owl $@; fi\n\n.PRECIOUS: imports/my_ncbitaxon_import.owl\n
Now you can manage your import manually in the template, and the ODK will not include your manually-curated import in your base release. But again, avoid this pattern for anything except the most trivial case (e.g. you need one term from a huge ontology).
"},{"location":"howto/deal-with-large-ontologies/#file-is-too-large-network-timeouts-and-long-runtimes","title":"File is too large: Network timeouts and long runtimes","text":"Remember that ontologies are text files. While this makes them easy to read in your browser, it also makes them huge: from 500 MB (CHEBI) to 2 GB (NCBITaxon), which is an enormous amount.
Thankfully, ROBOT can automatically read gzipped ontologies without the need of unpacking. To avoid long runtimes and network timeouts, we can do the following two things (with the new ODK 1.2.26):
import_group:\n products:\n - id: pr\n use_gzipped: TRUE\n
This will try to append .gz
to the default download location (http://purl.obolibrary.org/obo/pr.owl \u2192 http://purl.obolibrary.org/obo/pr.owl.gz). Note that you must make sure that this file actually exists. It does for CHEBI and the Protein Ontology, but not for many others.
If the file exists, but is located elsewhere, you can do this:
import_group:\n products:\n - id: pr\n mirror_from: http://purl.obolibrary.org/obo/pr.owl.gz\n
You can put any URL in mirror_from
(including non-OBO ones!)
We developed a completely automated variant of the Custom OBO Dashboard Workflow, which does not require any local installation.
dashboard-config.yml
file, in particular the ontologies
section:mirror_from
field.profile
section to overwrite the custom robot report profile and add custom checks!yaml profile: baseprofile: \"https://raw.githubusercontent.com/ontodev/robot/master/robot-core/src/main/resources/report_profile.txt\" custom: - \"WARN\\tfile:./sparql/missing_xrefs.sparql\"
Click on Settings
> Pages
to configure the GitHub pages
. Set the Source
to deploy from branch, and Branch
to build from main
(or master
if you are still using the old default) and /(root)
as directory. Hit Save
.
Click on the Actions
tab in your repo. On the left, select the Run dashboard
workflow and click on the Run workflow
button. This action will rebuild the dashboard and make a pull request with the changes.
Visit site
and you should find your new shiny dashboard page!Failed: make dashboard ROBOT_JAR=/tools/robot.jar ROBOT=robot -B with return code 2
There is a known bug at the moment requiring at least one ontology with a warning, error, info and pass, see https://github.com/OBOFoundry/OBO-Dashboard/issues/85.
dashboard-config.yml
, add a temporary ontology we created to make this work. This is already in the Dashboard template repository. ontologies:\ncustom:\n- id: tmp\nmirror_from: \"https://raw.githubusercontent.com/monarch-ebi-dev/robot_tests/master/custom-dashboard.owl\"\n
"},{"location":"howto/deploy-custom-obo-dashboard/#error-on-github-action-create-pull-request-section","title":"Error on GitHub Action - Create Pull Request section","text":"remote: Permission to <name of the user or organization>/<name of the repository>.git denied to github-actions[bot].
You need to update the workflow permission for the repository.
Settings
, then Actions
on the left menu, then General
.Error: GitHub Actions is not permitted to create or approve pull requests.
You need to enable GitHub Actions to create pull requests.
Settings
, then Actions
on the left menu, then General
.Contributed by @XinsongDu
, edited by @matentzn
.gitignore
from the obo-nor.github.io
repo is also copied to your new repo (it is frequently skipped or hidden from the user in Finder
or when using the cp
command) and push to everything to GitHub.docker pull obolibrary/odkfull\n
dashboard-config.yml
file, in particular the ontologies
section:mirror_from
field.#
before pip install networkx==2.6.2
to ensure the correct network x version is installed.sh run-dash.sh
(make sure dashboard folder is empty before running, e.g. rm -rf dashboard/*
).Before you start:
Using Prot\u00e9g\u00e9 you can add annotations such as labels, definitions, synonyms, database cross references (dbxrefs) to any OWL entity. The panel on the right, named Annotations, is where these annotations are added. OBO Foundry ontologies includes a pre-declared set of annotation properties. The most commonly used annotations are below.
Note: OBO ontologies allow only one rdfs:label, definition, and comment.
Note, most of these are bold in the annotation property list:
Use this panel to add a definition to the class you created. Select the + button to add an annotation to the selected entity. Click on the annotation 'definition' on the left and copy and paste in the definition to the white editing box on the right. Click OK.
Example (based on MONDO):
Definition: A disorder characterized by episodes of swelling under the skin (angioedema) and an elevated number of the white blood cells known as eosinophils (eosinophilia). During these episodes, symptoms of hives (urticaria), fever, swelling, weight gain and eosinophilia may occur. Symptoms usually appear every 3-4 weeks and resolve on their own within several days. Other cells may be elevated during the episodes, such as neutrophils and lymphocytes. Although the syndrome is often considered a subtype of the idiopathic hypereosinophilic syndromes, it does not typically have organ involvement or lead to other health concerns.
Definitions in OBO ontologies should have a 'database cross reference' (dbxref), which is a reference to the definition source, such as a paper from the primary literature or another database. For references to papers, we cross reference the PubMed Identifier in the format, PMID:XXXXXXXX. (Note, no space)
To add a dbxref to the definition:
To add a synonym:
database_cross_reference
on the left panel and add your reference to the Literal tab on the right hand sideWe have seen how to add sub/superclasses and annotate the class hierarchy. Another way to do the same thing is via the Class description view. When an OWL class is selected in the entities view, the right-hand side of the tab shows the class description panel. If we select the 'vertebral column disease' class, we see in the class description view that this class is a \"SubClass Of\" (= has a SuperClass) the 'musculoskeletal system disease' class. Using the (+) button beside \"SubClass Of\" we could add another superclass to the 'skeletal system disease' class.
Note the Anonymous Ancestors. These are superclasses that are inherited from the parents. If you hover over the Subclass Of (Anonymous Ancestor) you can see the parent that the class inherited the superclass from.
When you press the '+' button to add a SubClass of
axiom, you will notice a few ways you can add a term. The easiest of this is to use the Class expression editor. This allows you to type in the expression utilizing autocomplete. As you start typing, you can press the 'TAB' or '->|' button on your keyboard, and protege will suggest terms. You will also note that the term you enter is not in the ontology, protege will not allow you add it, with the box being highlighted red, and the term underlined red.
This guide explains how to embed a YouTube video into a page in this OBO Academy material. Example, see the videos on the Contributing to OBO Ontologies page.
"},{"location":"howto/embed-video/#instructions","title":"Instructions","text":"The content should look something like this: <iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/_z8-KGDzZ6U\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>
The embedded video should look like this:
"},{"location":"howto/filter-text-file/","title":"Command Line Trick: Filter text files based on a list of strings","text":"Let's say you want to remove some lines from a large text file programmatically. For example, you want to remove every line that contains certain IDs, but you want to keep the rest of the lines intact. You can use the command line utility grep
with option -v
to find all the lines in the file that do NOT contain your search term(s). You can make a file with a list of several search terms and use that file with grep
using the -f
option as follows:
grep -v -f your_list.txt target_file.tsv | tee out_file.tsv\n
"},{"location":"howto/filter-text-file/#explanation","title":"Explanation","text":"csv
, tsv
, obo
etc. For example, you wish to filter a file with these lines:keep this 1 this line is undesired 2, so you do not wish to keep it keep this 3 keep this 4 keep this 5 keep this 6 something undesired 2 this line is undesired 1 keep this 7
your_list.txt
is a text file with your list of search terms. Format: one search term per line. For example:undesired 1 undesired 2
The utility tee
will redirect the standard output to both the terminal and write it out to a file.
You expect the out_file.tsv
to contain lines:
keep this 1 keep this 3 keep this 4 keep this 5 keep this 6 keep this 7
"},{"location":"howto/filter-text-file/#do-the-filtering-and-updating-of-your-target-file-in-one-step","title":"Do the filtering and updating of your target file in one step","text":"You can also do a one-step filter-update when you are confident that your filtering works as expected, or if you have a backup copy of your target_file.tsv
. Use cat
and pipe the contents of your text file as the input for grep
. Redirect the results to both your terminal and overwrite your original file so it will contain only the filtered lines.
cat target_file.tsv | grep -v -f your_list.txt | tee target_file.tsv\n
"},{"location":"howto/fixing-conflicts/","title":"Fixing merge conflicts","text":"This video illustrates an example of fixing a merge conflict in the Mondo Disease Ontology.
Instructions:
If a merge conflict error appears in your Github.com pull request after committing a change, open GitHub Desktop and select the corresponding repository from the \"Current Repository\" button. If the conflict emerged after editing the ontology outside of Prot\u00e9g\u00e9 5.5.0, see Ad hoc Reserialisation below.
With the repository selected, click the \"Fetch origin\" button to fetch the most up-to-date version of the repository.
Click the \"Current Branch\" button and select the branch with the merge conflict.
From the menu bar, select Branch > \"Update from master\".
A message indicating the file with a conflict should appear along with the option to open the file (owl or obo file) in a text/code editor, such as Sublime Text. Click the button to open the file.
Search the file for conflict markings ( <<<<<<< ======= >>>>>>> ).
Make edits to resolve the conflict, e.g., arrange terms in the correct order.
Remove the conflict markings.
Save the file.
Open the file in Prot\u00e9g\u00e9. If prompted, do not reload any previously opened file. Open as a new file.
Check that the terms involved in the conflict appear OK, i.e., have no obvious errors.
Save the file in Prot\u00e9g\u00e9 using File > 'Save as...' from the menu bar and replace the ontology edit file, e.g., mondo-edit.obo
Return to GitHub Desktop and confirm the conflicts are now resolved. Click the \"Continue Merge\" button and then the \"Push origin\" button.
Return to Github.com and allow the QC queries to rerun.
The conflict should be resolved and the branch allowed to be merged.
Ad hoc Reserialisation
If the owl or obo file involved in the merge conflict was edited using Prot\u00e9g\u00e9 5.5.0, the above instructions should be sufficient. If edited in any other way, such as fixing a conflict in a text editor, the serialisation order may need to be fixed. This can be done as follows:
Reserialise the master file using the Ontology Development Kit (ODK). This requires setting up Docker and ODK. If not already set up, follow the instructions here.
Open Docker.
At the line command (PC) or Terminal (Mac), use the cd (change directory) command to navigate to the repository's src/ontology/ directory. For example,
cd PATH_TO_ONTOLOGY/src/ontology/
Replace \"PATH_TO_ONTOLOGY\" with the actual file path to the ontology. If you need to orient yourself, use the pwd
(present working directory) or ls
(list) line commands.
sh run.sh make normalize_src
If you are resolving a conflict in an .obo file, run:
sh run.sh make normalize_obo_src
In some ontologies (such as the Cell ontology (CL)), edits may result in creating a large amount of unintended differences involving ^^xsd:string. If you see these differences after running the command above, they can be resolved by following the instructions here.
Continue by going to step 1 under the main Instructions above.
The command line tool Robot has a diff tool that compares two ontology files and can print the differences between them in multiple formats, among them markdown.
We can use this tool and GitHub actions to automatically post a comment when a Pull Request to master is created, with the differences between the two ontologies.
To create a new GitHub action, create a folder in your ontology project root folder called .github
. Then create a yaml file in a subfolder called workflows
, e.g. .github/workflows/diff.yml
. This file contains code that will be executed in GitHub when certain conditions are meant, in this case, when a PR to master is submitted. The comments in this file from FYPO will help you write an action for your own repository.
The comment will look something like this.
"},{"location":"howto/github-create-fork/","title":"Fork an ontology for editing","text":"Note: Creating a fork allows you to create your copy GitHub repository. This example provides instructions on forking the Mondo GitHub reposiitory. You can't break any of the Mondo files by editing your forked copy.
Clone your forked repo:
If you have GitHub Desktop installed - click Code -> Open with GitHub Desktop
How are you planning to use this fork? To contribute to parent project
In GitHub Desktop, create a new branch:
Click Current Branch - > New Branch
Give your branch a name, like c-path-training-1
You will make changes to the Mondo on the branch of your local copy.
Further instructions on forking a repo
"},{"location":"howto/github-create-pull-request/","title":"Create a Pull Request in GitHub","text":""},{"location":"howto/github-create-pull-request/#overview","title":"Overview","text":""},{"location":"howto/github-create-pull-request/#github-workflows","title":"GitHub workflows","text":"A Git repo consists of a set of branches each with a complete history of all changes ever made to the files and directories. This is true for a local copy you check out to your computer from GitHub or for a copy (fork) you make on GitHub.
A Git repo typically has a master or main branch that is not directly edited. Changes are made by creating a branch from Master (complete copy of the Master + its history) (either a direct branch or via a fork).
"},{"location":"howto/github-create-pull-request/#branch-vs-fork","title":"Branch vs Fork","text":"You can copy (fork) any GitHub repo to some other location on GitHub without having to ask permission from the owners.\u00a0 If you modify some files in that repo, e.g. to fix a bug in some code, or a typo in a document, you can then suggest to the owners (via a Pull Request) that they adopt (merge) you your changes back into their repo. See the Appendix for instructions on how to make a fork.
If you have permission from the owners, you can instead make a new branch.
"},{"location":"howto/github-create-pull-request/#what-is-a-pull-request","title":"What is a Pull Request?","text":"A Pull Request (PR) is an event in Git where a contributor (you!) asks a maintainer of a Git repository to review changes (e.g. edits to an ontology file) they want to merge into a project (e.g. the owl file) (see reference). Create a pull request to propose and collaborate on changes to a repository. These changes are proposed in a branch, which ensures that the default branch only contains finished and approved work. See more details here.
"},{"location":"howto/github-create-pull-request/#committing-pushing-and-making-pull-requests","title":"Committing, pushing and making pull requests","text":"See these instructions on cloning an ontology repo and creating a branch using GitHub Dekstop.
Review: Once changes are made to the ontology file, they can be viewed in GitHub Desktop.
Before committing, check the diff. An example diff from the Cell Ontology (CL) is pasted below. Large diffs are a sign that something went wrong. In this case, do not commit the changes and consider asking the ontology editor team for help instead.
Example 1 (Cell Ontology):
Example 2 (Mondo):
Commit message: Before Committing, you must add a commit message. In GitHub Desktop in the Commit field in the lower left, there is a subject line and a description.
Give a very descriptive title: Add a descriptive title in the subject line. For example: add new class ONTOLOGY:ID [term name] (e.g. add new class MONDO:0000006 heart disease)
Write a great summary of what the change is in the Description box, referring to the issue. The sentence should clearly state how the issue is addressed.
To link the issue, you can use the word 'fixes' or 'closes' in the description of the commit message, followed by the corresponding ticket number (in the format #1234) - these are magic words in GitHub; when used in combination with the ticket number, it will automatically close the ticket. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
Note: 'Fixes' and \"Closes' are case-insensitive.
If you don't want to close the ticket, just refer to the ticket # without the word 'Fixes' or use 'Addresses'. The commit will be associated with the correct ticket but the ticket will remain open. 7.NOTE: It is also possible to type a longer message than allowed when using the '-m' argument; to do this, skip the -m, and a vi window (on mac) will open in which an unlimited description may be typed.
Click Commit to [branch]. This will save the changes to the ontology edit file.
Push: To incorporate the changes into the remote repository, click Publish branch.
Click: Create Pull Request in GitHub Desktop
This will automatically open GitHub Desktop
Click the green button 'Create pull request'
You may now add comments to your pull request.
The CL editors team will review your PR and either ask for changes or merge it.
The changes will be available in the next release.
Curators and projects are assigned specific ID ranges within the prefix for your ontology. See the README-editors.md for your ontology
An example: go-idranges.owl
NOTE: You should only use IDs within your range.
If you have only just set up this repository, modify the idranges file and add yourself or other editors.
Once you have your assigned ID range, you need to configure Protege so that your ID range is recorded in the Preferences menu. Protege does not read the idranges file.
In the Protege menu, select Preferences.
In the resulting pop-up window, click on the New Entities tab and set the values as follows.
In the Entity IRI box:
Start with: Specified IRI: http://purl.obolibrary.org/obo
Followed by: /
End with: Auto-generated ID
Same as label renderer: IRI: http://www.w3.org/2000/01/rdf-schema#label
In the Auto-generated ID section:
Numeric
Prefix GO_
Suffix: leave this blank
Digit Count 7
Start: see go-idranges.owl. Only paste the number after the GO:
prefix. Also, note that when you paste in your GO ID range, the number will automatically be converted to a standard number, e.g. pasting 0110001 will be converted to 110,001.)
End: see go-idranges.owl
Remember last ID between Protege sessions: ALWAYS CHECK THIS
(Note: You want the ID to be remembered to prevent clashes when working in parallel on branches.)
"},{"location":"howto/install-protege/","title":"5 step installation guide for Prot\u00e9g\u00e9","text":".asc
extension to verify the integrity of the downloaded Prot\u00e9g\u00e9 version..zip
or .tar.gz
file with tools appropriate for your operating system.Follow the steps as needed by your operating system to install the Prot\u00e9g\u00e9 application. For example, on macOS: drag and drop Prot\u00e9g\u00e9.app
to the Applications
folder and replace any older versions of the software. You may need to right click Prot\u00e9g\u00e9.app
and then choose Open
from the menu to authorise the programme to run on your machine. Alternatively, go to Preferences -> Security -> General
. You need to open the little lock, then click Mac stopped an application from Running (Prot\u00e9g\u00e9)
-> Open anyways
.
Adjust memory settings if necessary. Memory settings can now be adjusted in a jvm.conf configuration file that can be located either in the .protege/conf directory under your home directory, or in the conf directory within the application bundle itself. For example, to set the maximum amount of memory available for Prot\u00e9g\u00e9 to, say, 12GB, put the following in the jvm.conf file:
max_heap_size=12G\n
/Applications/Prote\u0301ge\u0301.app/Contents/conf/jvm.conf\n
Edit this part:
# Uncomment the line below to set the maximal heap size to 8G\n#max_heap_size=8G\n
"},{"location":"howto/installing-elk-in-protege/","title":"Install Elk 0.5 in Protege","text":"Click here to get the latest Protege Plugin latest build (this is available on the bottom of ELK pages. This will download a zipped file.)
When downloaded, unzip and copy puli and elk jars (two .jar files) in the unpacked directory.
Remove old org.semanticweb.elk.jar
Install ELK plugin on Mac:
This can be done via one of two ways:
Approach 1
open ~/.Protege, then click on plugins
Approach 2
~/.Protege
and a directory called plugins
does not exist in this folder, you can create it.Important: it seems Elk 0.5. Does not work with all versions of Protege, in particular, 5.2 and below. These instructions were only tested with Protege 5.5.
"},{"location":"howto/installing-elk-in-protege/#video-explanation","title":"Video Explanation","text":""},{"location":"howto/merge-terms/","title":"Merging Terms","text":"NOTE This documentation is incomplete, for now you may be better consulting the GO Editor Docs
For instructions on obsoleting terms (without merging/replacing with a new term, see obsoletion how to guide.)
"},{"location":"howto/merge-terms/#merging-ontology-terms","title":"Merging Ontology Terms","text":"See Daily Workflow for creating branches and basic Prot\u00e9g\u00e9 instructions.
Note Before performing a merge, make sure that you know all of the consequences that the merge will cause. In particular, be sure to look at child terms and any other terms that refer to the \u2018obsoleted\u2019 term. In many cases a simple merge of two terms is not sufficient because it will result in equivalent classes for child terms. For example if obsoleted term X is going to be merged into target term Y and \u2018regulation of X\u2019 and \u2018regulation of Y\u2019 terms exist, then you will need to merge the regulation terms in addition to the primary terms. You will also need to edit any terms that refer to the obsoleted term to be sure that the names and definitions are consistent.
"},{"location":"howto/merge-terms/#manual-workflow","title":"Manual Workflow","text":"Duplicate class
then OK in the pop up window. This should create a class with the exact same name.Change IRI (Rename)
_
in the identifier instead of the colon :
, for example: GO_1234567
. Make sure that the 'change all entities with this URI' box is checked.o
to change the label of the obsoleted term.has_broad_synonym
has_exact_synonym
has_narrow_synonym
has_related_synonym
(if unsure, this is the safest choice)x
on the right.x
on the right.x
on the right.rdfs:comment
that states that term was duplicated and to refer to the new new.term replaced by
annotations as per the instructions and add the winning merged term.\u2261
in the class hierarchy view on the left hand panel.See Daily Workflow section for commit, push and merge instructions.
"},{"location":"howto/merge-terms/#merge-using-owltools","title":"Merge using owltools","text":"To use owltools will need to have Docker installed and running (see instructions here).
This is the workflow that is used in Mondo.
owltools --use-catalog mondo-edit.obo --obsolete-replace [CURIE 1] [CURIE 2] -o -f obo mondo-edit.obo
CURIE 1 = term to be obsoleted CURIE 2 = replacement term (ie term to be merged with)
For example: If to merge MONDO:0023052 ectrodactyly polydactyly with MONDO:0009156 ectrodactyly-polydactyly syndrome, the command is:
owltools --use-catalog mondo-edit.obo --obsolete-replace MONDO:0023052 MONDO:0009156 -o -f obo mondo-edit.obo
TROUBLESHOOTING: Travis/Jenkins errors
:: ERROR: ID-mentioned-twice:: GO:0030722 :: ERROR: ID-mentioned-twice:: GO:0048126 GO:0030722 :: ERROR: has-definition: missing definition for id
The cause of this error is that Term A (GO:0048126) was obsoleted and had replace by Term B (GO:0030722). The GO editor tried to merge Term B into a third term term C (GO:0007312). The Jenkins checkk failed because 'Term A replaced by' was an alternative_id rather than by a main_id. Solution: In the ontology, go to the obsolete term A and replace the Term B by term C to have a primary ID as the replace_by.
"},{"location":"howto/obsolete-term/","title":"Obsoleting an Existing Ontology Term","text":"See Daily Workflow for creating branches and basic Prot\u00e9g\u00e9 instructions.
Warning: Every ontology has their procedures on how they obsolete terms (eg notice periods, notification emails, to_be_obsolete tags, etc.), this how-to guide only serves as a guide on how obsolete a term directly on protege.
For instructions on how to merge terms (i.e., replace a term with another term in the ontology), see instructions here.
"},{"location":"howto/obsolete-term/#pre-obsoletion-process-or-basic-obsoletion-etiquette","title":"PRE OBSOLETION PROCESS (or basic obsoletion etiquette)","text":"Check if the term (or any of its children) is being used for annotation:
Go to your ontology browser of choice, search for the term, either by label or ID
Notify affected groups (usually by adding an issue in their tracker)
Check if the term is used elsewhere in the ontology
Warning: some ontologies give advance notice on terms that will be obsoleted through the annotation 'scheduled for obsoletion on or after' instead of directly obsoleting the term. Please check with the conventions of your ontology before obsoleting a term.
Examples of additional annotations to add:
IAO:0000233 term tracker item (type xsd:anyURI) - link to GitHub issue
has_obsolence_reason
add \u2018OBSOLETE.\u2019 to the term definition: In the 'Description' window, click on the o
on the right-hand side of the definition entry. In the resulting window, in the Literal tab, at the beginning of the definition, type: OBSOLETE.
if the obsoleted term was not replaced by another term in the ontology, but there are existing terms that might be appropriate for annotation, add those term IDs in the 'consider' tag: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select consider
and enter the ID of the replacement term.
NOTE: Here you have to add the ID of the entity as an xsd:string
, e.g. GO:0005819, not the term label.
Add a statement about why the term was made obsolete: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select rdfs:comment
and select Type: Xsd:string
. Consult the wiki documentation for suggestions on standard comments:
- [http://wiki.geneontology.org/index.php/Curator_Guide:_Obsoletion](http://wiki.geneontology.org/index.php/Curator_Guide:_Obsoletion)\n\n - [http://wiki.geneontology.org/index.php/Obsoleting_GO_Terms](http://wiki.geneontology.org/index.php/Obsoleting_GO_Terms)\n\n - [http://wiki.geneontology.org/index.php/Editor_Guide](http://wiki.geneontology.org/index.php/Editor_Guide)\n
If the obsoleted term was replaced by another term in the ontology: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select term replaced by
and enter the ID of the replacement term.
If the obsoleted term was not replaced by another term in the ontology, but there are existing terms that might be appropriate for annotation, add those term IDs in the 'consider' tag: In the 'Annotations' window, select +
to add an annotation. In the resulting menu, select consider
and enter the ID of the replacement term.
NOTE: Here you have to add the ID of the entity as an xsd:string
, e.g. GO:0005819, not the term label.
Add any additional annotations needed - this is specific to ontologies and you should consult the conventions of the ontology you are working on.
Examples of additional annotations to add:
See Daily Workflow section for commit, push and merge instructions.
"},{"location":"howto/odk-add-orcidio-module/","title":"Import ORCIDIO","text":""},{"location":"howto/odk-add-orcidio-module/#adding-an-orcidio-import-to-your-ontology-with-odk","title":"Adding an ORCIDIO import to your ontology with ODK","text":"The Open Researcher and Contributor Identifier (ORCID) is a global, unambiguous way to identify a researcher. ORCID URIs (e.g., https://orcid.org/0000-0003-4423-4370) can therefore be used to unambigously and actionably attribute various aspects of ontology terms in combination with DC Terms or IAO predicates. However, URIs themselves are opaque and it is difficult to disambiguate to which person an ORCID corresponds when browsing an ontology (e.g., in Prot\u00e9g\u00e9).
ORCIDIO is an ontology that declares ORCID URIs as named individuals and associates basic metadata (e.g., name, description) to each such that tools like Prot\u00e9g\u00e9 can display a human-readable label rather than the URI itself as in the following example.
In this guide, we discuss how to add ORCIDIO to your ODK setup.
"},{"location":"howto/odk-add-orcidio-module/#1-include-orcidio-as-an-import-into-the-odk-config-file","title":"1. Include ORCIDIO as an import into the ODK config file","text":"In your ODK configuration (e.g. src/ontology/myont-odk.yaml
), add the following to the import_group
:
import_group:\nannotation_properties:\n- rdfs:label\n- dc:description\n- dc:source\n- IAO:0000115\nproducts:\n- id: orcidio\nmirror_from: https://w3id.org/orcidio/orcidio.owl\nmodule_type: filter\nbase_iris:\n- https://orcid.org/\n
The list of annotation properties, in particular dc:source
, is important for the filter
module to work (ORCIDIO relies heavily on axiom annotations for provenance).
TODO: \"as usual\" should be re-written to cross-link to another guide about updating the catalog (or don't say as usual to keep this more self-contained) As usual, add a statement into your catalog (src/ontology/catalog-v001.xml
):
<uri name=\"http://purl.obolibrary.org/obo/ro/imports/orcidio_import.owl\" uri=\"imports/orcidio_import.owl\"/>\n
"},{"location":"howto/odk-add-orcidio-module/#3-update-the-edit-file","title":"3. Update the edit file","text":"TODO: \"as usual\" should be re-written to cross-link to another guide about updating the edit file (or don't say as usual to keep this more self-contained) As usual, add an imports declaration to your edit file (src/ontology/myont-edit.owl
):
Import(<http://purl.obolibrary.org/obo/ro/imports/orcidio_import.owl>)\n
TODO: link to explanation of base merging strategy Note: This is not necessary when using the base merging
strategy (you will know what this means when you do use it).
Add a new SPARQL query: src/sparql/orcids.sparql
. This is used to query for all ORCIDs used in your ontology.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nprefix owl: <http://www.w3.org/2002/07/owl#>\nSELECT DISTINCT ?orcid\nWHERE {\n VALUES ?property {\n <http://purl.org/dc/elements/1.1/creator>\n <http://purl.org/dc/elements/1.1/contributor>\n <http://purl.org/dc/terms/creator>\n <http://purl.org/dc/terms/contributor> \n }\n ?term ?property ?orcid . \n FILTER(isIRI(?term))\n}\n
Next, overwrite your ORCID seed generation to using this query by adding the following to your src/ontology/myont.Makefile
(not Makefile
!):
$(IMPORTDIR)/orcidio_terms_combined.txt: $(SRCMERGED)\n$(ROBOT) query -f csv -i $< --query ../sparql/orcids.sparql $@.tmp &&\\\ncat $@.tmp | sort | uniq > $@\n
For your specific use-case, it may be necessary to tweak this SPARQL query, for example if your ORCIDs are used on axiom annotation level rather than entity annotation level.
"},{"location":"howto/odk-add-orcidio-module/#5-updating-config-and-orcidio","title":"5. Updating Config and ORCIDIO","text":"Now run to apply your ODK changes:
sh run.sh make update_repo\n
This will update a number of files in your project, such as the autogenerated Makefile
.
Lastly, update your ORCIDIO import to apply the changes:
sh run.sh make refresh-orcidio\n
Commit all the changes to a branch, wait for continuous integration to finish, and enjoy your new ORCIDIO import module.
"},{"location":"howto/odk-create-repo/","title":"Creating a new Repository with the Ontology Development Kit","text":"This is instructions on how to create an ontology repository in GitHub. This will only need to be done once per project. You may need assistance from someone with basic unix knowledge in following instructions here.
We will walk you though the steps to make a new ontology project
"},{"location":"howto/odk-create-repo/#1-install-requirements","title":"1. Install requirements","text":"docker ps
in your terminal or command line (CMD). If all is ok, you should be seeing something like:CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n
.gitconfig
file in your user directory!docker pull obolibrary/odkfull
NOTE The very first time you run this it may be slow, while docker downloads necessary images. Don't worry, subsequent runs should be much faster!
NOTE Windows users, occasionally it has been reported that files downloaded on a Windows machine get a wrong file ending, for example seed-via-docker.bat.txt
instead of seed-via-docker.bat
, or, as we will see later, project.yaml.txt
instead of project.yaml
. If you have problems, double check your files are named correctly after the download!
You can either pass in a configuration file in YAML format that specifies your ontology project setup, or you can pass arguments on the command line. You can use dir
in your command line on PC to ensure that your wrapper script, .gitconfig, and project.yaml (if you so choose) are all in the correct directory before running the wrapper script.
Passing arguments on the command line:
./seed-via-docker.sh -d po -d ro -d pato -u cmungall -t \"Triffid Behavior ontology\" triffo\n
Using a the predefined project.yaml file:
./seed-via-docker.sh -C examples/triffo/project.yaml\n
"},{"location":"howto/odk-create-repo/#windows","title":"Windows","text":"Passing arguments on the command line:
seed-via-docker.bat -d po -d ro -d pato -u cmungall -t \"Triffid Behavior ontology\" triffo\n
Using a the predefined project.yaml config file:
seed-via-docker.bat -C project.yaml\n
"},{"location":"howto/odk-create-repo/#general-instructions-for-both-linux-and-windows","title":"General instructions for both Linux and Windows","text":"-u cmungall
you should be using your own username (i.e. -u nico
), for example for your GitHub or GitLab hosting sites.-c
stands for clean
or \"clean up previous attempts before running again\" and -C
stands for \"the next parameter is the relative path to my config file\".command+s
on Mac or ctrl+s
on Windows to save it in the same directory as your seed-via-docker
script. Then you can open the file with a text editor like Notepad++, Atom, Sublime or even nano, and adapt it to your project. Other more comprehensive examples can be found here.This will create your starter files in target/triffid-behavior-ontology
. It will also prepare an initial release and initialize a local repository (not yet pushed to your Git host site such as GitHub or GitLab).
There are three frequently encountered problems at this stage:
.gitconfig
in user directory.gitconfig
in user directory","text":"The seed-via-docker script requires a .gitconfig
file in your user directory. If your .gitconfig
is in a different directory, you need to change the path in the downloaded seed-via-docker
script. For example on Windows (look at seed-via-docker.bat
):
docker run -v %userprofile%/.gitconfig:/root/.gitconfig -v %cd%:/work -w /work --rm -ti obolibrary/odkfull /tools/odk.py seed %*\n
%userprofile%/.gitconfig
should be changed to the correct path of your local .gitconfig
file.
We have had reports of users having trouble if there paths (say, D:\\data
) contain a space symbol, like D:/Dropbox (Personal)
or similar. In this case, we recommend to find a directory you can work in that does not contain a space symbol.
You can customize at this stage, but we recommend to first push the changes to you Git hosting site (see next steps).
"},{"location":"howto/odk-create-repo/#during-download-your-filenames-got-changed-windows","title":"During download, your filenames got changed (Windows)","text":"Windows users, occasionally it has been reported that files downloaded on a Windows machine get a wrong file ending, for example seed-via-docker.bat.txt
instead of seed-via-docker.bat
, or, as we will see later, project.yaml.txt
instead of project.yaml
. If you have problems, double check your files are named correctly after the download!
The development kit will automatically initialize a git project, add all files and commit.
You will need to create a project on you Git hosting site.
For GitHub:
-u
option. The name MUST be the one you set with -t
, just with lower case letters and dashes instead of spaces. In our example above, the name \"Triffid Behavior Ontology\" translates to triffid-behavior-ontology
.For GitLab:
-u
option. The name MUST be the one you set with -t
.Follow the instructions there. E.g. (make sure the location of your remote is exactly correct!).
cd target/triffo\ngit remote add origin https://github.com/matentzn/triffid-behavior-ontology.git\ngit branch -M main\ngit push -u origin main\n
Note: you can now mv target/triffid-behavior-ontology
to anywhere you like in your home directory. Or you can do a fresh checkout from github.
I generally feel its easier and less error prone to deviate from the standard instructions above. I keep having problems with git, passwords, typose etc, so I tend to do it, inofficially, as follows:
target/triffo
).In your repo you will see a README-editors.md file that has been customized for your project. Follow these instructions.
"},{"location":"howto/odk-create-repo/#obo-library-metadata","title":"OBO Library metadata","text":"The assumption here is that you are adhering to OBO principles and want to eventually submit to OBO. Your repo will contain stub metadata files to help you do this.
You can create pull requests for your ontology on the OBO Foundry. See the src/metadata
file for more details.
For more documentation, see http://obofoundry.org
"},{"location":"howto/odk-create-repo/#additional","title":"Additional","text":"You will want to also:
See the README-editors.md file that has been generated for your project.
"},{"location":"howto/odk-setup/","title":"Getting set up with Docker and the Ontology Development Kit","text":""},{"location":"howto/odk-setup/#installation","title":"Installation","text":""},{"location":"howto/odk-setup/#for-windows","title":"For Windows","text":"docker pull obolibrary/odkfull
. This will download the ODK (will take a few minutes, depending on you internet connection).Raw
, and then, when the file is open in your browser, CTRL+S to save it. Ideally, you save this file in your project directory, the directory you will be using for your exercises, as it will only allow you to edit files in that very same directory (or one of its sub-directories).docker ps
in your terminal or command line (CMD). If all is ok, you should be seeing something like:CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n
docker pull obolibrary/odkfull
on your command line to install the ODK. This will take while.sh odk.sh robot --version
to see whether it works.sh odk.sh bash
(to leave the ODK container again, simply run exit
from within the container). On Windows, use run.bat bash
instead. However, for many of the ontologies we develop, we already ship an ODK wrapper script in the ontology repo, so we dont need the odk.sh or odk.bat file. That file is usually called run.sh
or run.bat
and can be found in your ontology repo in the src/ontology
directory and can be used in the exact same way.One of the most frequent problems with running the ODK for the first time is failure because of lack of memory. There are two potential causes for out-of-memory errors:
JAVA
inside the ODK docker container. This memory is set as part of the ODK wrapper files, i.e. src/ontology/run.bat
or src/ontology/run.sh
, usually with ODK_JAVA_OPTS
.Out-of-memory errors can take many forms, like a Java OutOfMemory exception, but more often than not it will appear as something like an Error 137
.
There are two places you need to consider to set your memory:
robot_java_args: '-Xmx8G'
to your src/ontology/cl-odk.yaml file, see for example here.robot_java_args
variable. You can manage your memory settings by right-clicking on the docker whale in your system bar-->Preferences-->Resources-->Advanced, see picture below.If your problem is that you do not have enough memory on your machine, the only solution is to try to engineer the pipelines a bit more intelligently, but even that has limits: large ontologies require a lot of memory to process when using ROBOT. For example, handling ncbitaxon as an import in any meaningful way easily consumes up to 12GB alone. Here are some tricks you may want to contemplate to reduce memory:
robot query
uses an entirely different framework for representing the ontology, which means that whenever you use ROBOT query, for at least a short moment, you will have the entire ontology in memory twice. Sometimes you can optimse memory by seperating query
and other robot
commands into seperate commands (i.e. not chained in the same robot
command).The robot reason
command consumes a lot of memory. reduce
and materialise
potentially even more. Use these only ever in the last possible moment in a pipeline.
`
A new version of the Ontology Development Kit (ODK) is out? This is what you should be doing:
docker pull obolibrary/odkfull\n
src/ontology
directory.cd myrepo/src/ontology\n
Now run the update command TWICE (the first time it may fail, as the update command needs to update itself).
sh run.sh make update_repo\nsh run.sh make update_repo\n
.github/workflows/qc.yml
(from the top level of your repository) and make sure that it is using the latest version of the ODK.For example, container: obolibrary/odkfull:v1.3.0
, if v1.3.0
. Is the latest version. If you are unsure what the latest version is, you can find that information here: https://hub.docker.com/r/obolibrary/odkfull/tags
OPTIONAL: if you have any other GitHub actions you would like to update to the latest ODK, now is the time! All of your GitHub actions can be found in the .github/workflows/
directory from the top level of your repo.
Review all the changes and commit them, and make a PR the usual way. 100% wait for the PR to pass QC - ODK updates can be significant!
Send a reminder to all other ontology developers of your repo and tell them to install the latest version of ODK (step 1 only).
This 'how to' guide provides a template for an Ontology Overview for your ontology. Please create a markdown file using this template and share it in your ontology repository, either as part of your ReadMe file or as a separate document in your documentation. The Ontology Overview should include the following three sections:
Describe the ontology level curation, ie how to add terms. For example, terms are added to the ontology via:
Note: There is no need for details about QC, ODK unless it is related to curation (ie pipeline that automatically generates mappings, include that)
"},{"location":"howto/ontology-overview/#how-the-ontology-used-in-practice","title":"How the ontology used in practice","text":"Include 1-3 actual use cases. Please provide concrete examples.
For example:
Contributors:
Status: This is a working document! Feel free to add more content!
The Open Science Engineer contributes to the collection and standardisation of publicly available scientific knowledge through curation, community-building and data, ontology and software engineering.
Open Science and all its sub-divisions, including Open Data and Open Ontologies, are central to tackling global challenges from rare disease to climate change. Open licenses are only part of the answer - the really tough part is the standardisation of data (including the unification of ontologies, the FAIRification of data and adoption of common semantic data models) and the organisation of a global, fully decentralised community of Open Science engineers. Here, we will discuss some basic principles on how we can maximise our impact as members of a global community combating the issues of our time:
We discuss how to best utilise social workflows to achieve positive impact. We will try to convince you that building a close collaborative international community by spending time on submitting and answering issues on GitHub, helping on Stack Overflow and other online platforms, or just reaching out and donating small amounts of time to other open science efforts can make a huge difference.
"},{"location":"howto/open-science-engineer/#table-of-contents","title":"Table of contents","text":"For a quick 10 minute overview:
"},{"location":"howto/open-science-engineer/#monarch-obo-training-tutorial","title":"Monarch OBO training Tutorial","text":"How to be an open science ontologist
"},{"location":"howto/open-science-engineer/#principle-of-collaboration","title":"Principle of Collaboration","text":"The heart and soul of a successful Open Science culture is collaboration. The relative isolation into which many projects are forced due to limitations imposed by certain kinds of funding makes it even more important to develop effective social, collaborative workflows. This involve effective online communication, vocal appreciation (likes, upvotes, comments), documentation and open-ness.
"},{"location":"howto/open-science-engineer/#question-answering-and-documentation","title":"Question answering and documentation","text":"<details>
tag: <details><summary>[click arrow to expand]</summary>
. See example hereMaximising impact of your changes is by far the best way you can benefit society as an Open Science Engineer. Open Science projects are a web of mutually dependent efforts, for example:
The key to maximising your impact is to push any fixes as far upstream as possible. Consider the following projects and the way they depend on each other (note that this is a gross simplification for illustration; in reality the number of dependencies is much higher):
Let's think of the following (entirely fabricated) scenario based on the image above.
It is, therefore, possible that:
Imagine a user of Open Targets that sees this evidence, and reports it to Open Targets as a bug. Open Targets could take the easy way out: remove the erroneous record from the database permanently. This means that the IMPC (itself with hundreds of dependent users and tools), Monarch (again with many dependents), uPheno and HPO (with probably thousands of dependents) would still carry forward that (tiny) mistake. This is the basic idea of maximising impact through Upstream Fixing: The higher up-stream (up the dependency graph) an error is fixed, the more cumulative benefit there is to a huge ecosystem of tools and services.
An even better fix would be to have each fix to the ontology result in a new, shared quality control test. For example, some errors (duplicate labels, missing definition, etc) can be caught by automated testing. Here is a cool story.
"},{"location":"howto/open-science-engineer/#case-study-external-contribution-and-upstream-fixing","title":"Case Study: External contribution and upstream fixing","text":"@vasvir
(GitHub name), a member of the global community, reached out to us on Uberon: https://github.com/obophenotype/uberon/issues/2424, https://github.com/obophenotype/uberon/pull/2640 (Gasserian ganglion and gasserian ganglion were previously considered distinct). Note: before the PRs, @vasvir did not speak any SPARQL. Instead of simply deleting the synonyms for his NLP projects, @vasvir decided to report the issues straight to the source. This way, hundreds, if not thousands, of projects will directly or indirectly benefit from his contribution!
Example 1: While curating Mondo, Nicole identified issues relevant to Orphanet and created this issue.
Example 2: There is overlap between Mondo and Human Phenotype Ontology and the Mondo and HPO curators tag each other on relevant tickets.
Example 3: In Mondo, if new classifications are made, Mondo curators report this back to the source ontology to see if they would like to follow our classification.
"},{"location":"howto/open-science-engineer/#conclusions-upstream-fixing","title":"Conclusions: Upstream Fixing","text":"Have you ever wondered how much impact changing a synonym from exact
to related
could have? Or the addition of a precise mapping? The fixing of a typo in a label? It can be huge. And this does not only relate to ontologies; it goes for tool development as well. We tend to work around bugs when we are building software. Instead, or at least in addition, we should always report the bug at the source to make sure it gets fixed eventually.
Many of the resources we develop are financed by grants. Grants are financed, in the end, by the taxpayer. While it is occasionally appropriate to protect open work with creative licenses, it rarely makes sense to restrict access to Open Ontologies work - whether for commercial or research exploitation (we may want to insist on appropriate attribution to satisfy our funders).
On the other hand, there is always the risk of well-funded commercial endeavours simply \"absorbing\" our work - and then tying stakeholders into their closed, commercial ecosystem. However, this is not our concern. We cannot really call it stealing if it is not really ours to begin with! Instead of trying to prevent unwanted commercialisation and closing, it is better to work with corporations in pre-competitive schemes such as the Pistoia Alliance or Allotrope Foundation and lobby for more openness. (Also, grant authorities should probably not allow linking scientific data to less than totally open controlled vocabularies.)
Here, we invite you to embrace the idea that ontologies and many of the tools we develop are actually community-driven, with no particular \"owners\" and \"decision makers\". While we are not yet there (we don't have sufficiently mature governance workflows for full fledged onto-communism), and most ontologies are still \"owned\" by an organisation that provides a major source of funding, we invite you to think of this as a preliminary state. It is better to embrace the idea of \"No-ownership\" and figure out social workflows and governance processes that can handle the problems of decision making.
"},{"location":"howto/open-science-engineer/#take-responsibility-for-your-community-ontologies","title":"Take responsibility for your community (ontologies)","text":"Feel empowered to nudge reviewers or experts to help. Get that issue answered and PR merged whatever it takes!
Example: After waiting for the PR to be reviewed, Meghan kindly asked Nicole if she should find a different reviewer.
1. Find review buddies. For every ontology you seek to contribute to, pair up with someone who will review your pull requests and whose pull requests you will review. Sometimes it is very difficult to get anyone to review your pull request. Reach out to people directly, and form an alliance for review. It is fun, and you learn new things (and get to know new people!).
2. Be proactive.
Prettier standardizes the representation and formatting of Markdown. More information is available at https://prettier.io/. Note, these instructions are for a Mac.
"},{"location":"howto/prettify/#install-npm","title":"Install npm","text":"If you do not have npm installed, this can be installed using homebrew (if you have homebrew installed).
brew install node
npm install --save-dev --save-exact prettier
npx prettier --write .
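If you only want to verify formatting without rewriting any files, Prettier also has a check mode (run from the same working directory as above):

# Report files whose formatting differs from Prettier's style, without modifying them
npx prettier --check .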
Note: Windows users should open Protege using run.bat. Note: For the purpose of this how-to, we will be using MONDO as the ontology.
The Protégé interface follows a basic paradigm of Tabs and Panels. By default, Protégé launches with the main tabs seen below. The layout of tabs and panels is configurable by the user. The Tab list will have slight differences from version to version, and depending on your configuration. It will also reflect your customizations.
To customize your view, go to the Window tab on the toolbar and select Views. Here you can customize which panels you see in each tab. In the tabs view, you can select which tabs you will see. You will commonly want to see the Entities tab, which has the Classes tab and the Object Properties tab.
Note: if you open a new ontology while viewing your current ontology, Protégé will ask you if you'd like to open it in a new window. For most normal usage you should answer no. This will open in a new window.
The panel in the center is the ontology annotations panel. This panel contains basic metadata about the ontology, such as the authors, a short description and license information.
"},{"location":"howto/protege-browse-search/#running-the-reasoner","title":"Running the reasoner","text":"Before browsing or searching an ontology, it is useful to run an OWL reasoner first. This ensures that you can view the full, intended classification and allows you to run queries. Navigate to the query menu, and run the ELK reasoner:
"},{"location":"howto/protege-browse-search/#entities-tab","title":"Entities tab","text":"You will see various tabs along the top of the screen. Each tab provides a different perspective on the ontology. For the purposes of this tutorial, we care mostly about the Entities tab, the DL query tab and the search tool. OWL Entities include Classes (which we are focussed on editing in this tutorial), relations (OWL Object Properties) and Annotation Properties (terms like, 'definition' and 'label' which we use to annotate OWL entities. Select the Entities tab and then the Classes sub-tab. Now choose the inferred view (as shown below).
The Entities tab is split into two halves. The left-hand side provides a suite of panels for selecting various entities in your ontology. When a particular entity is selected the panels on the right-hand side display information about that entity. The entities panel is context specific, so if you have a class selected (like Thing) then the panels on the right are aimed at editing classes. The panels on the right are customizable. Based on prior use you may see new panes or alternate arrangements. You should see the class OWL:Thing. You could start browsing from here, but the upper level view of the ontology is too abstract for our purposes. To find something more interesting to look at we need to search or query.
"},{"location":"howto/protege-browse-search/#searching-in-protege","title":"Searching in Protege","text":"You can search for any entity using the search bar on the right:
The search window will open on top of your Protege pane, we recommend resizing it and moving it to the side of the main window so you can view together.
Here's an example search for 'COVID-19':
It shows results found in display names, definitions, synonyms and more. The default results list is truncated. To see full results check the 'Show all results option'. You may need to resize the box to show all results. Double clicking on a result, displays details about it in the entities tab, e.g.
In the Entities tab, you can browse related types, opening/closing branches and clicking on terms to see details on the right. In the default layout, annotations on a term are displayed in the top panel and logical assertions in the 'Description' panel at the bottom.
Try to find these specific classes:
Note - a cool feature in the search tool in Protege is you can search on partial string matching. For example, if you want to search for 'down syndrome', you could search on a partial string: 'do synd'.
Note - if the search is slow, you can uncheck the box 'Search in annotation values'. Try this and search for a term and note if the search is faster. Then search for 'shingles' again and note what results you get.
"},{"location":"howto/revert-commit/","title":"How to revert a commit using GitHub Desktop","text":""},{"location":"howto/revert-commit/#prerequisites","title":"Prerequisites","text":"You need to have a GitHub account GitHub and download GitHub Desktop
"},{"location":"howto/revert-commit/#background","title":"Background","text":""},{"location":"howto/revert-commit/#reversing-a-commit","title":"Reversing a commit","text":"Pull Requests are GitHub's mechanism for allowing one person to propose changes to a file (which could be a chunk of code, documentation, or an ontology) and enabling others to comment on (review) the proposed changes. You can learn more about creating Pull Requests (PRs) here; this document is about reviewing other people's PRs.
One key aspect of reviewing pull requests (aka code review or ontology change review) is that the purpose is not just to improve the quality of the proposed change. It is also about building shared coding habits and practices and improving those practices for all engineers (ontology and software) across a whole organisation (effectively building the breadth of project knowledge of the developers and reducing the amount of hard-to-understand code).
Reviewing is an important aspect of open science and engineering culture that needs to be learned and developed. In the long term, this habit will have an effect on the growth and impact of our tools and ontologies comparable to the engineering itself.
It is central to open science work that we review other people's work outside our immediate team. We recommend choosing a few people with whom to mutually review your work, whether you develop ontologies, code or both. It is of great importance that pull requests are addressed in a timely manner, ideally within 24 hours of the request. The requestor is likely in the headspace of being receptive to changes and working hard to get the code fixed when they ask for a code review.
"},{"location":"howto/review-pull-request/#overarching-workflow","title":"Overarching workflow","text":"Understand the Context: First, read the description of the pull request (PR). It should explain what changes have been made and why. Understand the linked issue or task related to this PR. This will help you understand the context of the changes.
Check the Size: A good PR should not be too large, as this makes it difficult to understand the full impact of the changes. If the PR is very large, it may be a good idea to ask the author to split it into smaller, more manageable PRs.
Review the Code: Go through the code changes line by line. Check the code for clarity, performance, and maintainability. Make sure the code follows the style guide and best practices of your project. Look out for any potential issues such as bugs, security vulnerabilities, or performance bottlenecks.
Check the Tests: The PR should include tests that cover the new functionality or changes. Make sure the tests are meaningful, and they pass. If the project has a continuous integration (CI) system, all tests should pass in the CI environment. In some cases, manual testing may be helpful (see below).
Check the Documentation: If the PR introduces new functionality, it should also update the documentation accordingly. Even for smaller changes, make sure that comments in the code are updated.
Give Feedback: Provide constructive feedback on the changes. If you suggest changes, explain why you think they are necessary. Be clear, respectful, and concise. Remember, your goal is to help improve the quality of the code.
Follow Up: After you have provided feedback, check back to see if the author of the PR has made the suggested changes. You might need to have a discussion or explain your points further.
Approve/Request Changes: If you are satisfied with the changes and all your comments have been addressed, approve the PR. If not, request changes and explain what should be done before the PR can be approved.
Merge the PR: Once the PR is approved and all CI checks pass, it can be merged into the main branch. If your project uses a specific merge strategy (like squash and merge or rebase and merge), make sure it's followed.
xsd:string
declarations), request that the changes be reduced to only those pertaining to the specific issue at hand before doing a review. In many cases, we may not have the time to perform a proper code review. In that case, try at least to achieve this:
The instructions below describe how to capture a screenshot of your screen, either your entire screen or a partial screenshot. These can be pasted into GitHub issues, pull requests or any markdown file.
"},{"location":"howto/screenshot/#screenshot-instructions-mac","title":"Screenshot Instructions (Mac)","text":"Different keyboards have different keys. One of the following options should work:
(This was adapted from the Gene Ontology editors guide and Mondo documentation.) Updated 2023-03-13 by Nicole Vasilevsky
"},{"location":"howto/set-up-protege/#mac-instructions","title":"Mac Instructions","text":"These instructions are for Mac OS
"},{"location":"howto/set-up-protege/#protege-version","title":"Protege version","text":"As of February 2023, OBO ontology editors are using Protege version 5.6.1.
"},{"location":"howto/set-up-protege/#download-and-install-protege","title":"Download and install Protege","text":"Protege needs at least 4G of RAM to cope with large ontologie like Mondo, ideally use 12G or 16G if your machine can handle it. Edits to the Protege configuration files will not take effect until Protege is restarted.
<string>-Xss16M</string>
<string>-Xmx12G</string>
Some Mac users might find that the edits need to be applied to /Applications/Protégé.app/Contents/Info.plist
.
Taken in part from Memory Management with Protégé by Michael DeBellis. Updated by Nicole Vasilevsky.
The following instructions will probably not work if Protégé was installed from the platform independent version, which does not include the Java Runtime Environment or a Windows .exe launcher.
Protege-<version>-win.zip
Protege.l4j.ini
in the same directory as Protege.exe
. Opening large ontologies like MONDO will require an increase to Protege's default maximum Java heap size, which is symbolized as -Xmx<size>
. 4GB is usually adequate for opening MONDO, as long as 4GB of free memory is really available on your system before you launch Protégé! Allocating even more memory will improve some tasks, like reasoning. You can check your available memory by launching the Windows Task Manager, clicking on the More details button on the bottom of the window and then checking the Performance tab at the top of the window. Make a copy of Protege.l4j.ini
before editingOpen Protege.l4j.ini
with a lightweight text editor like Atom or Sublime. Using notepad.exe instead might work, but may change character encodings or the character(s) used to represent End of Line.
After increasing the memory available to Prot\u00e9g\u00e9, Protege.l4j.ini
might look like this.
-Xms200M\n-Xmx4G\n-Xss16M\n
Note that there is no whitespace between -Xmx
, the numerical amount of memory, and the Megabytes/Gigabytes suffix. Don't forget to save.
Taking advantage of the memory increase requires that Prot\u00e9g\u00e9 is shut down and relaunched, if applicable. The methods discussed here may not apply if Prot\u00e9g\u00e9 is launched through any method other than double clicking Protege.exe
from the folder where the edited Protege.l4j.ini
resides.
If you have issues opening Protege, then reduce the memory, try 10G (or lower) instead.
"},{"location":"howto/set-up-protege/#add-elk-reasoner","title":"Add ELK reasoner","text":"See instructions here. Note: Protege 5.6.1 has the ELK reasoner installed.
"},{"location":"howto/set-up-protege/#instructions-for-new-protege-users","title":"Instructions for new Protege users","text":""},{"location":"howto/set-up-protege/#setting-your-id-range","title":"Setting your ID range","text":"See instructions here.
"},{"location":"howto/set-up-protege/#user-details","title":"User details","text":"User name
Click Use supplied user name:
add your name (ie nicolevasilevsky)Use Git user name when available
ORCID
. Add the ID number only, do not include https://, ie 0000-0001-5208-3432Preferences
> New Entities Metadata
tabAnnotate new entities with creator (user)
boxCreator property
Add http://purl.org/dc/terms/contributorCreator value
Select Use ORCIDDate property
Add http://purl.org/dc/terms/dateDate value format
Select ISO-8601This plugin enables some extra functionality, such as the option to obsolete entities from the menu. To install it:
File > Check for plugins...
.OBO Annotations Editor
and click on Install
.Edit > Make entity obsolete
.Preferences > Plugins
.docker pull obolibrary/odkfull
. This will download the ODK (will take a few minutes, depending on you internet connection).By: Nicole Vasilevsky
"},{"location":"howto/switching-ontologies/#description","title":"Description","text":"When you edit an ontology, you need to make sure you are using the correct prefix and your assigned ID range for that on ontology. Protege (unfortunately) does not remember the last prefix or ID range that you used when you switch between ontologies. Therefore we need to manually update this each time we switch ontologies.
"},{"location":"howto/switching-ontologies/#instructions","title":"Instructions","text":"src/ontology/[ontology-name]-idranges.owl
. (For example, src/ontology/mondo-idranges.owl.)You need to have a GitHub account to make term requests. Sign up for a free GitHub account.
"},{"location":"howto/term-request/#background","title":"Background","text":""},{"location":"howto/term-request/#recommended-reading","title":"Recommended reading","text":"This guide on How to select and request terms from ontologies by Chris Mungall provides some helpful background and tips for making term requests.
"},{"location":"howto/term-request/#why-make-a-new-term-request","title":"Why make a new term request?","text":"Onologies are under constant development and are continuously expanded and iterated upon. You may discover that a term you need is not available in your preferred ontology. In this case, please make a new term request to the ontology.
"},{"location":"howto/term-request/#making-term-requests-to-existing-ontologies","title":"Making term requests to existing ontologies","text":"In the following text below, we describe best practices for making a term request to an ontology. In general, requests for new terms are make on the ontology GitHub issue tracker. For example, this is the GitHub issue tracker for the Uberon Anatomy onology.
Note: These are suggestions and not strict rules. We appreciate your contributions to extending and improving ontologies. Following best guidelines is appreciated by the curators and developers, and assists them in addressing your issue more quickly. However, we understand if you are not always able to follow these best practices. Please add as much information as possible, and if there are any questions, the ontology developer may follow up with you for further clarification.
"},{"location":"howto/term-request/#making-a-new-term-request","title":"Making a new term request","text":"This page discusses how to update the contents of your imports using the ODK, like adding or removing terms.
Note: This is a specialised how-to for ODK managed ontologies and is replicated from ODK docs to consolidate workflows in the obook. Not all ontologies use ODKs and many ontologies have their own workflows for imports, please also check with your local ontology documents and/or developers.
Note: The extract function in ROBOT can also be used to extract subsets from onotlogies for modular imports without the use of the ODK. For details on that, please refer to the ROBOT documentation
"},{"location":"howto/update-import/#importing-a-new-term","title":"Importing a new term","text":"Note: some ontologies now use a merged-import system to manage dynamic imports, for these please follow instructions in the section title \"Using the Base Module approach\".
Importing a new term is split into two sub-phases:
There are three ways to declare terms that are to be imported from an external ontology. Choose the appropriate one for your particular scenario (all three can be used in parallel if need be):
This workflow is to be avoided, but may be appropriate if the editor does not have access to the ODK docker container. This approach also applies to ontologies that use base module import approach.
Now you can use this term for example to construct logical definitions. The next time the imports are refreshed (see how to refresh here), the metadata (labels, definitions, etc) for this term are imported from the respective external source ontology and becomes visible in your ontology.
"},{"location":"howto/update-import/#using-term-files","title":"Using term files","text":"Every import has, by default a term file associated with it, which can be found in the imports directory. For example, if you have a GO import in src/ontology/go_import.owl
, you will also have an associated term file src/ontology/go_terms.txt
. You can add terms in there simply as a list:
GO:0008150\nGO:0008151\n
Now you can run the refresh imports workflow) and the two terms will be imported.
"},{"location":"howto/update-import/#using-the-custom-import-template","title":"Using the custom import template","text":"This workflow is appropriate if:
To enable this workflow, you add the following to your ODK config file (src/ontology/cl-odk.yaml
), and update the repository (using sh run.sh make update_repo
):
use_custom_import_module: TRUE\n
Now you can manage your imported terms directly in the custom external terms template, which is located at src/templates/external_import.owl
. Note that this file is a ROBOT template, and can, in principle, be extended to include any axioms you like. Before extending the template, however, read the following carefully.
The main purpose of the custom import template is to enable the management off all terms to be imported in a centralised place. To enable that, you do not have to do anything other than maintaining the template. So if you, say current import APOLLO_SV:00000480
, and you wish to import APOLLO_SV:00000532
, you simply add a row like this:
ID Entity Type\nID TYPE\nAPOLLO_SV:00000480 owl:Class\nAPOLLO_SV:00000532 owl:Class\n
When the imports are refreshed see imports refresh workflow, the term(s) will simply be imported from the configured ontologies.
Now, if you wish to extent the Makefile (which is beyond these instructions) and add, say, synonyms to the imported terms, you can do that, but you need to (a) preserve the ID
and ENTITY
columns and (b) ensure that the ROBOT template is valid otherwise, see here.
WARNING. Note that doing this is a widespread antipattern (see related issue). You should not change the axioms of terms that do not belong into your ontology unless necessary - such changes should always be pushed into the ontology where they belong. However, since people are doing it, whether the OBO Foundry likes it or not, at least using the custom imports module as described here localises the changes to a single simple template and ensures that none of the annotations added this way are merged into the base file (see format variant documentation for explanation on what base file is)
"},{"location":"howto/update-import/#refresh-imports","title":"Refresh imports","text":"If you want to refresh the import yourself (this may be necessary to pass the travis tests), and you have the ODK installed, you can do the following (using go as an example):
First, you navigate in your terminal to the ontology directory (underneath src in your hpo root directory).
cd src/ontology\n
Then, you regenerate the import that will now include any new terms you have added. Note: You must have docker installed.
sh run.sh make PAT=false imports/go_import.owl -B\n
Since ODK 1.2.27, it is also possible to simply run the following, which is the same as the above:
sh run.sh make refresh-go\n
Note that in case you changed the defaults, you need to add IMP=true
and/or MIR=true
to the command below:
sh run.sh make IMP=true MIR=true PAT=false imports/go_import.owl -B\n
If you wish to skip refreshing the mirror, i.e. skip downloading the latest version of the source ontology for your import (e.g. go.owl
for your go import) you can set MIR=false
instead, which will do the exact same thing as the above, but is easier to remember:
sh run.sh make IMP=true MIR=false PAT=false imports/go_import.owl -B\n
"},{"location":"howto/update-import/#using-the-base-module-approach","title":"Using the Base Module approach","text":"Since ODK 1.2.31, we support an entirely new approach to generate modules: Using base files. The idea is to only import axioms from ontologies that actually belong to it. A base file is a subset of the ontology that only contains those axioms that nominally belong there. In other words, the base file does not contain any axioms that belong to another ontology. An example would be this:
Imagine this being the full Uberon ontology:
Axiom 1: BFO:123 SubClassOf BFO:124\nAxiom 1: UBERON:123 SubClassOf BFO:123\nAxiom 1: UBERON:124 SubClassOf UBERON 123\n
The base file is the set of all axioms that are about UBERON terms:
Axiom 1: UBERON:123 SubClassOf BFO:123\nAxiom 1: UBERON:124 SubClassOf UBERON 123\n
I.e.
Axiom 1: BFO:123 SubClassOf BFO:124\n
Gets removed.
The base file pipeline is a bit more complex then the normal pipelines, because of the logical interactions between the imported ontologies. This is solved by _first merging all mirrors into one huge file and then extracting one mega module from it.
Example: Let's say we are importing terms from Uberon, GO and RO in our ontologies. When we use the base pipelines, we
imports/merged_import.owl
The first implementation of this pipeline is PATO, see https://github.com/pato-ontology/pato/blob/master/src/ontology/pato-odk.yaml.
To check if your ontology uses this method, check src/ontology/cl-odk.yaml to see if use_base_merging: TRUE
is declared under import_group
If your ontology uses Base Module approach, please use the following steps:
First, add the term to be imported to the term file associated with it (see above \"Using term files\" section if this is not clear to you)
Next, you navigate in your terminal to the ontology directory (underneath src in your hpo root directory).
cd src/ontology\n
Then refresh imports by running
sh run.sh make imports/merged_import.owl\n
Note: if your mirrors are updated, you can run sh run.sh make no-mirror-refresh-merged
This requires quite a bit of memory on your local machine, so if you encounter an error, it might be a lack of memory on your computer. A solution would be to create a ticket in an issue tracker requesting for the term to be imported, and your one of the local devs should pick this up and run the import for you.
Lastly, restart Protege, and the term should be imported in ready to be used.
"},{"location":"images/","title":"About using images in Git/GitHub","text":"There are two places you'll probaby want to use images in GitHub, in issue tracker and in markdown files, html etc. The way you handle images in these contexts is quite different, but easy once you get the hang of it.
"},{"location":"images/#in-markdown-files-and-html-etc","title":"In markdown files (and html etc)","text":"All images referenced in static files such as html and markdown need to be referenced using a URL; dragging and dropping is not supported and could actually cause problems. Keeping images in a single directory enables them to be referenced more readily. Sensible file names are highly recommended, preferably without spaces as these are hard to read when encoded.
An identical file, named in two different ways is shown as an example below. They render in the same way, but the source \"code\" looks ugly when spaces are used in file names.
Eg.
encoding needed no encoding needed![](github%20organizations%20teams%20repos.png
![](github-organizations-teams-repos.png)
In this example, the filename is enough of a 'url' because this file (https://ohsu-library.github.io/github-tutorial/howto/images/index.md) and the images are in the same directory https://ohsu-library.github.io/github-tutorial/howto/images/.
To reference/embed an image that is not in the same directory, a more careful approach is needed.
"},{"location":"images/#referencing-images-in-your-repository-and-elsewhere","title":"Referencing images in your repository and elsewhere","text":"Absolute path referencing Relative path referencing![](https://github.com/OHSU-Library/github-tutorial/raw/master/docs/other-images/owl.jpg)
![](other-images/owl.jpg)
Each instance of ../
means 'go up one level' in the file tree.
It is also possible to reference an image using an external URL outside your control, in another github organization, or anywhere on the web, however this method can be fragile if the URL changes or could lead to unintended changes. Therefore make your own copies and reference those unless:
For example, it is not clear for how long the image below will manage to persist at this EPA link, or sadly, for how long the image will even be an accurate reflection of the current situation in the arctic. https://www.epa.gov/sites/production/files/styles/microsite_banner/public/2016-12/epa-banner-images/science_banner_arctic.png
"},{"location":"images/#in-github-issue-tracker","title":"In GitHub issue tracker","text":"Images that are embedded into issues can be dragged and dropped in the GitHub issues interface. Once you've done so, it will look something like this with GitHub assigning an arbitrary URL (githubuserassets) for the image.
![](screenshot-of-images-in-issues.png)
Ideally, a Markdown document is renderable in a variety of output formats and devices. In some cases, it may be desirable to create non-portable Markdown that uses HTML syntax to position images. This limits the longevity of the artifact, but may be necessary sometimes. We describe how to manage this below.
In order to size images, use the native html syntax: width =
with the <img src=, as per below.
<img src=\"https://github.com/monarch-initiative/monarch-app/raw/master/image/Phenogrid3Compare.png\" width=\"53\">
These materials are under construction and incomplete.
"},{"location":"lesson/analysing-linked-data/#prerequisites","title":"Prerequisites","text":"In the following we will look a bit at the general Linked Data landscape, and name some of its flagship projects and standards. It is important to be clear that the Semantic Web field is a very heterogenous one:
"},{"location":"lesson/analysing-linked-data/#flagship-projects-of-the-wider-semantic-web-community","title":"Flagship projects of the wider Semantic Web community","text":"While these Semantic Web flagship projects are doubtlessly useful, it is sometimes hard to see how they can help for your biomedical research. We rarely make use of them in our day to day work as ontologists, but there are some notable exceptions:
The OBO format is a very popular syntax for representing biomedical ontologies. A lot of tools have been built over the years to hack OBO ontologies on the basis of that format - I still work with it on a daily basis. Although it has semantically been proven to be a subset of OWL (i.e. there is a lossless mapping of OBO into OWL) and can be viewed as just another syntax, it is in many ways idiosyncratic. For starters, you wont find many, if any, IRIs in OBO ontologies. The format itself uses CURIEs which are mapped to the general OBO PURL namespace during transformation to OWL. For example, if you see MONDO:0003847 in an OBO file, and were to translate it to OWL, you will see this term being translated to http://purl.obolibrary.org/obo/MONDO_0003847. Secondly, you have a bunch of built-in properties like BROAD or ABBREVIATION that mapped to a vocabulary called oboInOwl (oio). These are pretty non-standard on the general Semantic Web, and often have to be manually mapped to the more popular counterparts in the Dublin Core or SKOS namespaces.
Having URIs as identifiers is not generally popular in the life sciences. As discussed elsewhere, it is much more likely to encounter CURIEs such as MONDO:0003847 than URIs such as http://purl.obolibrary.org/obo/MONDO_0003847 in biomedical databases.
"},{"location":"lesson/analysing-linked-data/#useful-tools-for-biomedical-research","title":"Useful tools for biomedical research","text":"Why does the biomedical research, and clinical, community care about the Semantic Web and Linked Data? There are endless lists of applications that try to apply semantic technologies to biomedical problems, but for this week, we only want to look at the broader picture. In our experience, the use cases where Semantic Web standards are applied successfully are:
As a rule of thumb, for every single problem/term/use case, you will have 3-6 options to choose from, in some cases even more. The criteria for selecting a good ontology are very much dependent on your particular use case, but some concerns are generally relevant. A good first pass is to apply to \"10 simple rules for selecting a Bio-ontology\" by Malone et al, but I would further recommend to ask yourself the following:
Aside from aspects of your analysis, there is one more thing you should consider carefully: the open-ness of your ontology in question. As a user, you have quite a bit of power on the future trajectory of the domain, and therefore should seek to endorse and promote open standards as much as possible (for egotistic reasons as well: you don't want to have to suddenly pay for the ontologies that drive your semantic analyses). It is true that ontologies such as SNOMED have some great content, and, even more compellingly, some really great coverage. In fact, I would probably compare SNOMED not with any particular disease ontology, but with the OBO Foundry as a whole, and if you do that, it is a) cleaner, b) better integrated. But this comes at a cost. SNOMED is a commercial product - millions are being payed every year in license fees, and the more millions come, the better SNOMED will become - and the more drastic consequences will the lock-in have if one day you are forced to use SNOMED because OBO has fallen too far behind. Right now, the sum of all OBO ontologies is probably still richer and more valuable, given their use in many of the central biological databases (such as the ones hosted by the EBI) - but as SNOMED is seeping into the all aspects of genomics now (for example, it will soon be featured on OLS!) it will become increasingly important to actively promote the use of open biomedical ontologies - by contributing to them as well as by using them.
We will discuss ontologies in the medical, phenomics and genomics space in more detail in a later session of the course.
"},{"location":"lesson/analysing-linked-data/#other-interesting-links","title":"Other interesting links","text":"In this section we will discuss the following:
Note of caution: No two Semantic Web overviews will be equivalent to each other. Some people claim the Semantic Web as an idea is an utter failure, while others praise it as a great success (in the making) - in the end you will have to make up your own mind. In this section I focus on parts of the Semantic Web step particularly valuable to the biomedical domain, and I will omit many relevant topics in the wider Semantic Web area, such as Enterprise Knowledge Graphs, decentralisation and personalisation, and many more. Also, the reader is expected to be familiar with the basic notions of the Semantic Web, and should use this overview mainly to tie some of the ideas together.
The goal of this section is to give the aspiring Semantic Data Engineer in the biomedical domain a rough idea of key concepts around Linked Data and the Semantic Web insofar as they relate to their data science and and data engineering problems. Even after 20 years of Semantic Web research (the seminal paper, conveniently and somewhat ironically behind a paywall, was published in May 2001), the area is still dominated by \"academic types\", although the advent of the Knowledge Graph is already changing that. As I already mentioned above, no two stories of what the Semantic Web is will sound the same. However, there are a few stories that are often told to illustrate why we need semantics. The OpenHPI course names a few:
<span about=\"dbpedia:Jaguar\">Jaguar</span>
, will make it easier for the search engine to understand what your site is about and link it to other relevant content. From this kind of mark-up, structured data can be extracted and integrate into a giant, worldwide database, and exposed through SPARQL endpoints, that can then be queried using a suitable query language.I am not entirely sure anymore that any of these ways (web of data, machine understanding, layered stack of matching standards) to motivate the Semantic Web are particularly effective for the average data scientists or engineer. If I had to explain the Semantic Web stack to my junior self, just having finished my undergraduate, I would explain it as follows (no guarantee though it will help you).
The Semantic Web / Linked Data stack comprises roughly four components that are useful for the aspiring Semantic (Biomedical) Data Engineer/Scientist to distinguish:
"},{"location":"lesson/analysing-linked-data/#a-way-to-refer-to-things-including-entities-and-relations-in-a-global-namespace","title":"A way to refer to things (including entities and relations) in a global namespace.","text":"You, as a scientist, might be using the term \"gene\" to refer to basic physical and functional unit of heredity, but me, as a German, prefer the term \"Gen\". In the Semantic Web, instead of natural language words, we prefer to use URIs to refer to things such as https://www.wikidata.org/wiki/Q7187: if you say something using the name https://www.wikidata.org/wiki/Q7187, both your German and Japanese colleagues will \"understand\" what you are referring to. More about that in the next chapter.
"},{"location":"lesson/analysing-linked-data/#lots-loaaaads-of-ways-to-make-statements-about-things","title":"Lots (loaaaads!) of ways to make statements about things.","text":"For example, to express \"a mutation of SHH in humans causes isolated microphthalmia with coloboma-5\" you could say something like (http://purl.obolibrary.org/obo/MONDO_0012709 | \"microphthalmia, isolated, with coloboma 5\")--[http://purl.obolibrary.org/obo/RO_0004020 | \"has basis in dysfunction of\"]-->(https://identifiers.org/HGNC:10848 | \"SSH (gene)\"). Or you could say: (http://purl.obolibrary.org/obo/MONDO_0012709 | \"microphthalmia, isolated, with coloboma 5\")--[http://www.w3.org/2000/01/rdf-schema#subClassOf | \"is a\"]-->(http://purl.obolibrary.org/obo/MONDO_0003847 | \"Mendelian Disease\"). If we use the analogy of \"language\", then the URIs (above) are the words, and the statements are sentences in a language. Unfortunately, there are many languages in the Semantic Web, such as OWL, RDFS, SKOS, SWRL, SHACL, SHEX, and dialects (OWL 2 EL, OWL 2 RL) and a plethora of formats, or serialisations (you can store the exact same sentence in the same language such as RDF, or OWL, in many different ways)- more about that later. In here lies also one of the largest problems of the Semantic Web - lots of overlapping standards means, lots of incompatible data - which raises the bar for actually being able to seamlessly integrate \"statements about things\" across resources.
"},{"location":"lesson/analysing-linked-data/#collections-of-statements-about-things-that-somehow-belong-together-and-provide-some-meaning-or-context-for-those-things","title":"Collections of statements about things that somehow belong together and provide some meaning, or context, for those things.","text":"Examples include:
For example (as always, non exhaustive):
This week will focus on 1 (identifiers) and 4 (applications) - 2 (languages and standards) and 3 (controlled vocabularies and ontologies) will be covered in depth in the following weeks.
Note on the side: Its not always 100% clear what is meant by Linked Data in regular discourse. There are some supposedly \"clear\" definitions (\"method for publishing structured data\", \"collection of interrelated datasets on the Web\"), but when it comes down to the details, there is plenty of confusion (does an OWL ontology constitute Linked Data when it is published on the Web? Is it Linked Data if it does not use RDF? Is it Linked Data if it is less than 5-star - see below). In practice all these debates are academic and won't mean much to you and your daily work. There are entities, statements (context) being said about these entities using some standard (associated with the Semantic Web, such as OWL or RDFS) and tools that do something useful with the stuff being said.
"},{"location":"lesson/analysing-linked-data/#when-i-say-mendelian-disease-i-mean-httppurlobolibraryorgobomondo_0003847","title":"When I say \"Mendelian Disease\" I mean http://purl.obolibrary.org/obo/MONDO_0003847","text":"One of the top 5 features of the Semantic Web (at least in the context of biomedical sciences) is the fact that we can use URIs as a global identifier scheme that is unambiguous, independent of database implementations, independent of language concerns to refer to the entities in our domain.
For example, if I want to refer to the concept of \"Mendelian Disease\", I simply refer to http://purl.obolibrary.org/obo/MONDO_0003847 - and everyone, in Japan, Germany, China or South Africa, will be able to \"understand\" or look up what I mean. I don't quite like the word \"understanding\" in this context as it is not actually trivial to explain to a human how a particular ID relates to a thing in the real world (semiotics). In my experience, this process is a bit rough in practice - it requires that there is a concept like \"Mendelian Disease\" in the mental model of the person, and it requires some way to link the ID http://purl.obolibrary.org/obo/MONDO_0003847 to that \"mental\" concept - not always as trivial as in this case (where there are standard textbook definitions). The latter is usually achieved (philosophers and linguists please stop reading) by using an annotation that somehow explains the term - either a label or some kind of formal definition - that a person can understand. In any case, not trivial, but thankfully not the worst problem in the biomedical domain where we do have quite a wide range of shared \"mental models\" (more so in Biology than Medical Science..). Using URIs allows us to facilitate this \"understanding\" process by leaving behind some kind of information at the location that is dereferenced by the URI (basically you click on the URI and see what comes up). Note that there is a huge deal of compromise already happening across communities. In the original Semantic Web community, the hope was somehow that dereferencing the URI (clicking on it, navigating to it) would reveal structured information about the entity in question that could used by machines to understand what the entity is all about. In my experience, this was rarely ever realised in the biomedical domain. Some services like Ontobee expose such machine readable data on request (using a technique called content negotiation), but most URIs simply refer to some website that allow humans to understand what it means - which is already a huge deal. For more on names and identifiers I refer the interested reader to James Overton's OBO tutorial here.
Personal note: Some of my experienced friends in the bioinformatics world say that \"IRI have been more pain than benefit\". It is clear that there is no single thing in the Semantic Web is entirely uncontested - everything has its critics and proponents.
"},{"location":"lesson/analysing-linked-data/#the-advent-of-the-curie-and-the-bane-of-the-curie-map","title":"The advent of the CURIE and the bane of the CURIE map","text":"In reality, few biological resources will contain a reference to http://purl.obolibrary.org/obo/MONDO_0003847. More often, you will find something like MONDO:0003847
, which is called a CURIE. You will find CURIEs in many contexts, to make Semantic Web languages easier to read and manage. The premise is basically that your document contains a prefix declaration that says something like this:
PREFIX MONDO: <http://purl.obolibrary.org/obo/MONDO_>\n
which allows allows the interpreter to unfold the CURIE into the IRI:
MONDO:0003847 -> http://purl.obolibrary.org/obo/MONDO_0003847\n
In reality, the proliferation of CURIEs has become a big problem for data engineers and data scientists when analysing data. Databases rarely, if ever, ship the CURIE maps with their data required to understand what a prefix effectively stands for, leading to a lot of guess-work in the daily practice of the Semantic Data Engineer (if you ever had to distinguish ICD: ICD10: ICD9: UMLS:, UMLSCUI: without a prefix map, etc you will know what I am talking about). Efforts to bring order to this chaos, essentially globally agreed CURIE maps (e.g. prefixcommons), or ID management services such as identifiers.org exist, but right now there is no one solution - prepare yourself to having to deal with this issue when dealing with data integration efforts in the biomedical sciences. More likely than not, your organisation will build its own curie map and maintain it for the duration of your project.
"},{"location":"lesson/analysing-linked-data/#semantic-web-in-the-biomedical-domain-success-stories","title":"Semantic Web in the biomedical domain: Success stories","text":"There are probably quite a few divergent opinions on this, but I would like to humbly list the following four use cases as among the most impactful applications of Semantic Web Technology in the biomedical domain.
"},{"location":"lesson/analysing-linked-data/#light-semantics-for-data-aggregation","title":"Light Semantics for data aggregation.","text":"We can use hierarchical relations in ontology to group data. For example, if I know that http://purl.obolibrary.org/obo/MONDO_0012709 (\"microphthalmia, isolated, with coloboma 5\") http://www.w3.org/2000/01/rdf-schema#subClassOf (\"is a\") http://purl.obolibrary.org/obo/MONDO_0003847 (\"Mendelian Disease\"), then a specialised Semantic Web tool called a reasoner will know that, if I ask for all genes associated with Mendelian diseases, you also want to get those associated with \"microphthalmia, isolated, with coloboma 5\" specifically (note that many query engines such as SPARQL with RDFS entailment regime have simple reasoners embedded in them, but we would not call them \"reasoner\" - just query engine).
"},{"location":"lesson/analysing-linked-data/#heavy-semantics-for-ontology-management","title":"Heavy Semantics for ontology management.","text":"Ontologies are extremely hard to manage and profit from the sound logical foundation provided by the Web Ontology Language (OWL). We can logically define our classes in terms of other ontologies, and then use a reasoner to classify our ontology automatically. For example, we can define abnormal biological process phenotypes in terms of biological processes (Gene Ontology) and classify our phenotypes entirely using the classification of biological processes in the Gene Ontology (don't worry if you don't understand a thing - we will get to that in a later week).
"},{"location":"lesson/analysing-linked-data/#globally-unique-identifiers-for-data-integration","title":"Globally unique identifiers for data integration.","text":"Refer to the same thing the same way. While this goal was never reached in total perfection, we have gotten quite close. In my experience, there are roughly 3-6 ways to refer to entities in the biomedical domain (like say, ENSEMBL, HGNC, Entrez for genes; or SNOMED, NCIT, DO, MONDO, UMLS for diseases). So while the \"refer to the same thing the same way\" did not truly happen, a combination of standard identifiers with terminological mappings, i.e. links between terms, can be used to integrate data across resources (more about Ontology Matching later). Again, many of my colleagues disagree - they don't like IRIs, and unfortunately, you will have to build your own position on that.
Personal note: From an evolutionary perspective, I sometimes think that having 2 or 3 competing terminological systems is better than 1, as the competition also drives the improvements in quality, but there is a lot of disagreement on this.
"},{"location":"lesson/analysing-linked-data/#coordinated-development-of-mutually-compatible-ontologies-across-the-biomedical-domain-the-open-biological-and-biomedical-ontologies-obo-foundry","title":"Coordinated development of mutually compatible ontologies across the biomedical domain: The Open Biological and Biomedical Ontologies (OBO) Foundry.","text":"The OBO Foundry is a community-driven effort to coordinate the development of vocabularies and ontologies across the biomedical domain. It develops standards for the representation of terminological content (like standard properties), and ontological knowledge (shared design patterns) as well as shared systems for quality control. Flagship projects include:
In the following, we will list some of the technologies you may find useful, or will be forced to use, as a Semantic Data Engineer. Most of these standards will be covered in the subsequent weeks of this course.
Standard Purpose Use case Web Ontology Language (OWL) Representing Knowledge in Biomedical Ontologies All OBO ontologies must be provided in OWL as well. Resource Description Framework (RDF) Model for data interchange. Triples, the fundamental unit of RDF, are ubiquitous on the Semantic Web SPARQL Query Language for RDF A standard query language for RDF and RDFS. Primary query language to interrogate RDF/RDFS/Linked Data on the Web. Simple Knowledge Organization System (SKOS) Another, more lightweight, knowledge organisation system in many ways competing with OWL. Not as widely used in the biomedical domain as OWL, but increasing uptake of \"matching\" vocabulary (skos:exactMatch, etc). RDF-star A key shortcoming of RDF is that, while I can in principle say everything about everything, I cannot directly talk about edges, for example to attribute provenance: \"microphthalmia, isolated, with coloboma 5 is kind of Mendelian disease\"--source: Wikipedia Use cases here. JSON-LD A method to encoding linked data in JSON format. (Very useful to at least know about). RDFa W3C Recommendation to embed rich semantic metadata in HTML (and XML). I have to admit - in 11 years Semantic Web Work I have not come across much use of RDFa in the biomedical domain. But @jamesaoverton is using it in his tools!A thorough overview of all the key standards and tools can be found on the Awesome Semantic Web repo.
For a rough sense of current research trends it is always good to look at the accepted papers at one of the major conferences in the area. I like ISWC (2020 papers), but for the aspiring Semantic Data Engineering in the biomedical sphere, it is probably a bit broad and theoretical. Other interesting specialised venues are the Journal of Biomedical Semantics and the International Conference on Biomedical Ontologies, but with the shift of the focus in the whole community towards Knowledge Graphs, other journals and conferences are becoming relevant.
Here are a few key research areas, which are, by no means (!), exhaustive.
It is useful to get a picture of the typical tasks a Semantic Data Engineer faces when building ontologies are Knowledge Graphs. In my experience, it is unlikely that any particular set of tools will work in all cases - most likely you will have to try and assemble the right toolchain for your use case and refine it over the lifetime of your project. The following are just a few points for consideration of tasks I regularly encountered - which may or may not overlap with the specific problems you will face.
"},{"location":"lesson/analysing-linked-data/#finding-the-right-ontologies","title":"Finding the right ontologies","text":"There are no simple answers here and it very heavily depends on your use cases. We are discussing some places to look for ontologies here, but it may also be useful to simply upload the terms you are interested in to a service like Zooma and see what the terms map to at a major database provider like EBI.
"},{"location":"lesson/analysing-linked-data/#finding-the-right-data-sources","title":"Finding the right data sources","text":"This is much harder still than it should have to be. Scientific databases are scattered across institutions that often do not talk to each other. Prepare for some significant work in researching the appropriate databases that could benefit your work, using Google and the scientific literature.
"},{"location":"lesson/analysing-linked-data/#extending-existing-ontologies","title":"Extending existing ontologies","text":"It is rare nowadays that you will have to develop an ontology entirely from scratch - most biomedical sub-domains will have some kind of reasonable ontology to build upon. However, there is often a great need to extend existing ontologies - usually because you have the need of representing certain concepts in much more detail, or your specific problem has not been modelled yet - think for example when how disease ontologies needed to be extended during the Coronavirus Crisis. Extending ontologies usually have two major facets:
Also sometimes more broadly referred to as \"data integration\", this problem involves a variety of tasks, such as:
To make your data discoverable, it is often useful to extract a view from the ontologies you are using (for example, Gene Ontology, Disease Ontology) that only contains the terms and relationships of relevance to your data. We usually refer to this kind of ontology as an application ontology: an ontology specific to your application, which integrates subsets of other ontologies. This process will typically involve the following steps (a module-extraction example is sketched below):
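As a hedged sketch of the module-extraction step, the following uses ROBOT's extract command; it assumes ROBOT is installed, and go.owl, terms.txt and go_subset.owl are placeholder names for the source ontology, the seed term list and the resulting module.

```sh
# Extract a BOT module: the seed terms plus their superclasses and the axioms between them.
robot extract --method BOT \
  --input go.owl \
  --term-file terms.txt \
  --output go_subset.owl
```

In an ODK-managed repository this kind of extraction is usually wired into the import pipeline rather than run by hand.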
There are many ways your semantic data can be leveraged for data analysis, but in my experience, two are particularly central:
The open courses of the Hasso Plattner Institute (HPI) offer introductions to the concepts around Linked Data, the Semantic Web and Knowledge Engineering. There are three courses of relevance to this week's topics, all of which overlap significantly.
These materials are under construction and incomplete.
"},{"location":"lesson/automating-ontology-workflows/#prerequisites","title":"Prerequisites","text":"In this course, you will learn the basics of automation in and around the OBO ontology world - and beyond. The primary goal is to enable ontology pipeline developers to plan the automation of their ontology workflows and data pipelines, but some of the materials are very general and apply to scientific computing more widely. The course serves also as a prerequisite for advanced application ontology development.
"},{"location":"lesson/automating-ontology-workflows/#learning-objectives","title":"Learning objectives","text":"make
Please complete the following tutorials.
By: James Overton
Automation is part of the foundation of the modern world. The key to using and building automation is a certain way of thinking about processes, how they can be divided into simple steps, and how they operate on inputs and outputs that must be exactly the same in some respects but different in others.
In this article I want to make some basic points about automation and how to think about it. The focus is on automation with software and data, but not on any particular software or data. Some of these points may seem too basic, especially for experienced programmers, but in 20+ years of programming I've never seen anybody lay out these basic points in quite this way. I hope it's useful.
"},{"location":"lesson/automating-ontology-workflows/#the-basics","title":"The Basics","text":"\"automatos\" from the Greek: \"acting of itself\"
Automation has two key aspects:
The second part is more visible, and tends to get more attention, but the first part is at least as important. While automation makes much of the modern world possible, it is not new, and there are serious pitfalls to avoid. No system is completely automatic, so it's best to think of automation on a spectrum, and to start thinking about automation at the beginning of a new project.
"},{"location":"lesson/automating-ontology-workflows/#examples-of-automation","title":"Examples of Automation","text":"To my mind, the word \"automation\" brings images of car factories, with conveyor belts and robotic arms moving parts and welding them together. Soon they might be assembling self-driving (\"autonomous\") cars. Henry Ford is famous for making cars affordable by building the first assembly lines, long before there were any robots. The essential steps for Ford were standardizing the inputs and the processes to get from raw materials to a completed car. The history of the 20th century is full of examples of automation in factories of all sorts.
Automation was essential to the Industrial Revolution, but it didn't start then. We can look to the printing press. We can look to clocks, which regimented lives in monasteries and villages. We can think of recipes, textiles, the logistics of armies, advances in agriculture, banking, the administration of empires, and so on. The scientific revolution was built on repeatable experiments published in letters and journal articles. I think that the humble checklist is also an important relative of automation.
Automation is not new, but it's an increasingly important part of our work and our lives.
"},{"location":"lesson/automating-ontology-workflows/#software-automation-is-special","title":"Software Automation is Special","text":"Software is almost always written as source code in text files that are compiled and/or interpreted as machine code for a specific set of hardware. Software can drive machines of all sorts, but a lot of software automation stays inside the computer, working on data in files and databases, and across networks. We'll be focused on this kind of software automation, transforming data into data.
The interesting thing about this is that source code is a kind of data, so there are software automation workflows that operate on data that defines software. The upshot is that you can have automation that modifies itself. Doing this on a large scale introduces a lot of complexity, but doing it on a small scale can be a clean solution to certain problems.
Another interesting thing about software is that once we solve an automation problem, we can copy that solution and apply it again and again for almost zero cost. We don't need to build a new factory or a new threshing machine. We can just download a program and run it. Henry Ford could make an accurate estimate of how long it would take to build a car on his assembly line, but software development is not like working on the assembly line, and estimating time and budget for software development is notoriously hard. I think this is because software developers aren't just executing automation, they're building new automation for each new project.
Although we talk about \"bit rot\", and software does require maintenance of a sort, software doesn't break down or wear out in the same ways that physical machines do. So while the Industrial Revolution eliminated many jobs, it also created different jobs, building and maintaining the machines. It's not clear that software automation will work the same way.
Software automation is special because it can operate on itself, and once complete can be cheaply copied. Software development is largely about building automated systems of various sorts, usually out of many existing pieces. We spend most of our time building new systems, or modifying an existing system to handle new inputs, or adapting existing software to a new use case.
"},{"location":"lesson/automating-ontology-workflows/#the-dangers-of-automation","title":"The Dangers of Automation","text":"To err is human; to really foul things up requires a computer.
An obvious danger of automation is that machines are faster than humans, so broken automation can often do more damage more quickly than a human can. A related problem is that humans usually have much more context and depth of experience, which we might call \"common sense\", and a wider range of sensory inputs than most automated systems. This makes humans much better at recognizing that something has gone wrong with a process and that it's time to stop.
New programmers soon learn that a simple program that performs perfectly when the input is in exactly the right format becomes a complex program once it's updated to handle a wide range of error conditions. In other words, it's almost always much harder to build automation that can gracefully handle errors and problems than it is to automate just the \"happy path\". Old programmers have learned through bitter experience that it's often practically impossible to predict all the things that can go wrong with an automated system in practice.
I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. -- Abraham Maslow
A less obvious danger of automation comes from the sameness requirement. When you've built a great piece of automation, perfectly suited to inputs of a certain type, it's very tempting to apply that automation more generally. You start paying too much attention to how things are the same, and not enough attention to their differences. You may begin to ignore important differences. You may surrender your common sense and good judgment, to save yourself the work of changing the automated system or making an exception.
Bureaucracies are a form of automation. Everyone has had a bad experience filling out some form that ignores critical information, and with some bureaucrat who would not apply common sense and make an exception.
Keep all this in mind as you build automated systems: a broken machine can do a lot of damage very quickly, and a system built around bad assumptions can do a lot of hidden damage.
"},{"location":"lesson/automating-ontology-workflows/#a-spectrum-of-automation","title":"A Spectrum of Automation","text":"Let's consider a simple case of automation with software, and build from the most basic sort of automation to a full-fledged system.
Say you have a bunch of text files in a directory, each containing minutes from meetings that we had together over the years. You can remember that I talked about a particular software package that might solve a problem that you just discovered, but you can't remember the name.
"},{"location":"lesson/automating-ontology-workflows/#1-ad-hoc","title":"1. Ad Hoc","text":"The first thing you try is to just search the directory. On a Mac you would open the Finder, navigate to the directory, and type \"James\" into the search bar. Unfortunately that gives too many results: all the files with the minutes for a meeting where I said something.
The next thing to do is double-click some text files, which would open them in the Text Edit program, and skim them. You might get lucky!
You know that the meeting was in 2019, so you can try to filter for files modified in that year. Unfortunately the files have been updated at different times, so the file dates aren't useful.
Now if each file was named with a consistent pattern, including the meeting date, then it would be simple to filter for files with \"2019\" in the name. This isn't automation, but it's the first step in the right direction. Consistent file names are one way to make inputs the same so that you can process them in the same way.
Let's say it works: you filter for files from 2019 with \"James\" in them, skim a few, and find a note where I recommended using Pandoc to convert between document formats. Mission accomplished!
"},{"location":"lesson/automating-ontology-workflows/#2-notes","title":"2. Notes","text":"Next week you need to do something very similar: Becky mentioned a website where you can find an important dataset. It's basically the same problem with different inputs. If you remember exactly what you did last time, then you can get the job done quickly. As the job gets more complicated and more distant in time, and as you find yourself doing similar tasks more often, it's nice to have notes about what you did and how you did it.
If I'm using a graphical user interface (GUI) then for each step I'll note the program I used, and the menu item or button I clicked, e.g. \"Preferences > General > Font Size\", or \"Search\" or \"Run\". If I'm using a command-line interface (CLI) then I'll copy-paste the commands into my notes.
I often keep informal notes like this in a text file in the relevant directory. I name the file \"notes.txt\". A \"README\" file is similar. It's used to describe the contents of a directory, often saying which files are which, or what the column headers for a given table mean.
Often the task is more complicated and requires one or more pieces of software that I don't use every day. If there's relevant documentation, I'll put a link to it in my notes, and then a short summary of exactly what I did.
In this example I look in the directory of minutes and see my \"notes.txt\" file. I read that and remember how I filtered on \"2019\" and searched for \"James\". This time I filter on \"2020\" and search for \"Becky\", and I find the website for the dataset quickly enough.
As a rule of thumb, it might take you three times longer to find your notes file, write down the steps you took, and provide a short description, than it would to just do the job without taking notes. When you're just taking notes for yourself, this often feels like a waste of time (you'll remember, right?!), and sometimes it is a bit of a waste. If you end up using your notes to help with similar tasks in the future, then this will likely be time well spent.
As a rule of thumb, it might take three times longer to write notes for a broader audience than notes for just yourself. This is because you need to take into account the background knowledge of your reader, including her skills and assumptions and context, and especially the possible misunderstandings that you can try to avoid with careful writing. I often start with notes for just myself and then expand them for a wider audience only when needed.
"},{"location":"lesson/automating-ontology-workflows/#3-checklist","title":"3. Checklist","text":"When tasks get more complicated or more important then informal notes are not enough. The next step on the spectrum of automation is the humble checklist.
The most basic checklists are for making sure that each item has been handled. Often the order isn't important, but lists are naturally ordered from top to bottom, and in many cases that order is useful. For example, my mother lays out her shopping lists in the order of the aisles in her local grocery store, making it easier to get each item and check it off without skipping around and perhaps having to backtrack.
I think of a checklist as a basic form of automation. It's like a recipe. It should lay out the things you need to start, then proceed through the required steps in enough detail that you can reproduce them. In some sense, by using the checklist you are becoming the \"machine\". You are executing an algorithm that should take you from the expected inputs to the expected output.
Humble as the checklist is, there's a reason that astronauts, pilots, and surgical teams live by their checklists. Even when the stakes are not so high, it's often nice to \"put your brain on autopilot\" and just work the checklist without having to remember and reconsider the details of each step.
A good checklist is more focused than a file full of notes. A checklist has a goal at the end. It has specific starting conditions. The steps have been carefully considered, so that they have the proper sequence, and none are missing. Perhaps most importantly, a checklist helps you break a complex task down into simple parts. If one of the parts is still too complex, then break it down again into a nested checklist (really a sort of tree structure).
Checklists sometimes include another key element of automation: conditionals. A shopping list might say \"if there's a sale on crackers, then buy three boxes\". If-then conditions let our automated systems adapt to circumstances. The \"then\" part is just another step, but the \"if\" part is a little different. It's a test to determine whether a condition holds. We almost always want the result of the test to be a simple True or False. Given a bunch of inputs, some of which pass the test and some of which fail it, we can think of the test as determining some way in which all the things that pass are the same and all the things that fail are the same. Programmers will also be familiar with more complex conditionals such as if-then-else, if-elseif-else, and \"case\", which divide process execution across multiple \"branches\".
As a rule of thumb, turning notes into a checklist will likely take at least three times as long as simply writing the notes. If the checklist is for a wider audience, expect it to take three times as long to write, for the same reasons mentioned above for notes.
If a task is simple and I can hold all the steps in my head, and I can finish it in one sitting without distractions, then I won't bother with a checklist. But more and more I find myself writing myself a checklist before I begin any non-trivial tasks. I use bullet points in my favourite text editor, or sometimes the Notes app on my iPhone. I lay out the steps in the expected order, and I check them off as I go. Sometimes I start making the checklist days before I need it, so I have lots of time to think about it and improve it. If there's a job that I'm worried about, breaking it down into smaller pieces usually helps to make the job feel more manageable. Actually, I try to start every workday by skimming my (long) To Do list, picking the most important tasks, and making a checklist for what I want to get done by quitting time.
"},{"location":"lesson/automating-ontology-workflows/#3-checkscript","title":"3. Checkscript","text":"\"Checkscript\" is a word that I think I made up, based on insights from a couple of sources, primarily this blog post on \"Do-nothing scripting: the key to gradual automation\" This is where \"real\" automation kicks in, writing \"real\" code and stuff, but hopefully you'll see that it's just one more step on the spectrum of automation that I'm describing.
The notes and checklists we've been discussing are just text in your favourite text editor. A checkscript is a program. It can be written in whatever programming language you prefer. I'll give examples in POSIX shell, but that blog post uses Python, and it really doesn't matter. You start with a checklist (in your mind at least). The first version of your program should just walk you through that checklist, one step at a time. That's it.
Here's a checkscript based on the example above. It just prints the first step (echo
), waits for you to press enter (read
), then prints the next step, and so on.
###!/bin/sh\n\necho \"1. Use Finder to filter for files with '2019' in the name\"\nread -p \"Press enter to continue\"\n\necho \"2. Use finder to search file content for 'James'\"\nread -p \"Press enter to continue\"\n\necho \"3. Open files in Text Edit and search for 'James'\"\nread -p \"Press enter to continue\"\n\necho \"Done!\"\n
So far this is just a more annoying way to use a checklist. The magic happens once you break the steps down into small enough pieces and realize that you know how to tell the computer to do some of the steps instead of doing them all yourself.
For example, you know that the command-line tool grep
is used for searching the contents of files, and that you can use \"fileglob\"s to select just the files that you want to search, and that you can send the output of grep
to another file to read in your favourite text editor. Now you know how to automate the first two steps. The computer can just do that work without waiting for you:
###!/bin/sh\n\ngrep \"James\" *2019* > search_results.txt\n\necho \"1. Open 'search_results.txt' in Text Edit and search for 'James'\"\nread -p \"Press enter to continue\"\n\necho \"Done!\"\n
Before we were using the Finder, and it is possible to write code to tell the Finder to filter and search for files. The key advantage of grep
here is that we send the search results to another file that we can read now or save for later.
This is also a good time to mention the advantage of text files over word processor files. If the minutes were stored in Word files, for example, then Finder could probably search them and you could use Word to read them, but you wouldn't be able to use grep
or easily output the results to another file. Unix tools such as grep
treat all text files the same, whether they're source code or meeting minutes, which means that these tools work pretty much the same on any text file. By keeping your data in Word you restrict yourself to a much smaller set of tools and make it harder to automate your work with simple scripts like this one.
Even if you can't get the computer to run any of the steps for you automatically, a checkscript can still be useful by using variables instead of repeating yourself:
#!/bin/sh\n\nFILE_PATTERN=\"*2019*\"\nFILE_CONTENTS=\"James\"\n\necho \"1. Use Finder to filter for files with '${FILE_PATTERN}' in the name\"\nread -p \"Press enter to continue\"\n\necho \"2. Use Finder to search file content for '${FILE_CONTENTS}'\"\nread -p \"Press enter to continue\"\n\necho \"3. Open files in Text Edit and search for '${FILE_CONTENTS}'\"\nread -p \"Press enter to continue\"\n\necho \"Done!\"\n
Now if I want to search for \"Becky\" I can just change the FILE_CONTENTS variable in one place. I find this especially useful for dates and version numbers.
This is pretty simple for a checkscript, with very few steps. A more realistic example would be if there were many directories containing the minutes of many meetings, maybe in different file formats and with different naming conventions. In order to be sure that we're searching all of them we might need a longer checkscript.
Writing and using a checkscript instead of a checklist will likely take (you guessed it) about three times as long. But the magic of the checkscript is in the title of the blog post I mentioned: \"gradual automation\". Once you have a checkscript, you can run through it all manually, but you can also automate bits and pieces of the task, saving yourself time and effort next time.
"},{"location":"lesson/automating-ontology-workflows/#5-script","title":"5. Script","text":"A \"script\" is a kind of program that's easy to edit and run. There are technical distinctions to be made between \"compiled\" programs and \"interpreted\" programs, but they turn out to be more complicated and less helpful than they seem at first. Technically, a checkscript is just a script that waits for you to do the hard parts. In this section I want to talk about \"fully automated\" or \"standalone\" scripts that you just provide some input and execute.
Most useful programs are useful because they call other programs (in the right ways). I like shell scripts because they're basically just commands that are copied and pasted from work I was doing on the command-line. It's really easy to call other programs.
To continue our example, say that our minutes were stored in Word files. There are Python libraries for this, such as python-docx. You can write a little script using this library that works like grep
to search for specified text in selected files, and output the results to a search results file.
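If you would rather stay in the shell than write Python, one hedged alternative (a different technique from the python-docx approach described above) is to convert each Word file to plain text with pandoc and reuse grep. This assumes pandoc is installed and minutes/*.docx is a placeholder pattern.

```sh
#!/bin/sh

# Print the names of the Word files that mention "James" anywhere in their text.
for f in minutes/*.docx; do
    if pandoc "$f" -t plain | grep -q -i "James"; then
        echo "$f"
    fi
done
```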
As you add more and more functionality to a script it can become unwieldy. Scripts work best when they have a simple \"flow\" from beginning to end. They may have some conditionals and some loops, but once you start seeing nested conditionals and loops, then your script is doing too much. There are two main options to consider:
The key difference between a checkscript and a \"standalone\" script is handling problems. A checkscript relies on you to supervise it. A standalone script is expected to work properly without supervision. So the script has to be designed to handle a wider range of inputs and fail gracefully when it gets into trouble. This is a typical case of the \"80% rule\": the last 20% takes 80% of the time. As a rule of thumb, expect it to take three times as long to write a script that can run unsupervised than it takes you to write a checkscript that does \"almost\" the same thing.
"},{"location":"lesson/automating-ontology-workflows/#6-specialized-tool","title":"6. Specialized Tool","text":"When your script needs nested conditionals and loops, then it's probably time to reach for a programming language that's designed to write code \"in the large\". Some languages such as Python can make a pretty smooth transition from a script in a single file to a set of files in a module, working together nicely. You might also choose another language that can provide better performance or efficiency.
It's not just the size and the logical complexity of your script; consider also its purpose. The specialized tools that I have in mind have a clear purpose that helps guide their design. This also makes them easier to reuse across multiple projects.
I often divide my specialized tools into two parts: a library and a command-line interface. The library can be used in other programs, and contains the most distinctive and important functionality. But the command-line interface is essential, because it lets me use my specialized tool in the shell and in scripts, so I can build more automation on top of it.
Writing a tool in Java or C++ or Rust usually takes longer than a script in shell or Python because there are more details to worry about such as types and efficient memory management. In return you usually get more reliability and efficiency. But as a rule of thumb, expect it to take three times as long to write a specialized tool than it would to \"just\" write the script. On the other hand, if you already have a script that does most of what you want, and you're already familiar with the target you are moving to, then it can be fairly straightforward to translate from the script to the specialized tool. That's why it's often most efficient to write a prototype script first, do lots of quick experiments to explore the design space, and when you're happy with the design then start on the \"production\" version.
"},{"location":"lesson/automating-ontology-workflows/#7-workflow","title":"7. Workflow","text":"The last step in the spectrum of automation is to bring together all your scripts into a single \"workflow\". My favourite tool for this is the venerable Make. A Makefile
is essentially a bunch of small scripts with their input and output files carefully specified. When you ask Make to build a given output file, it will look at the whole tree of scripts, figure out which input files are required to build your requested output file, then which files are required to build those files, and so on until it has determined a sequence of steps. Make is also smart enough to check whether some of the dependencies are already up-to-date, and can skip those steps. Looking at a Makefile
you can see everything broken down into simple steps and organized into a tree, through which you can trace various paths. You can make changes at any point, and run Make again to update your project.
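As a minimal, hedged sketch (file names are hypothetical), the shell snippet below writes a two-line Makefile for the minutes-searching example and then asks Make to build the result; note the \t in the printf format, because Make requires recipe lines to start with a tab.

```sh
# Create a Makefile with one rule: search_results.txt depends on the 2019 minutes.
printf 'search_results.txt: minutes/*2019*.txt\n\tgrep "James" $^ > $@\n' > Makefile

# Build the target; rerunning this does nothing unless a minutes file has changed.
make search_results.txt
```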
I've done this all so many times that now I often start with a Makefile
in an empty directory and build from there. I try experiments on the command line. I make notes. I break the larger task into parts with a checklist. I automate the easy parts first, and leave some parts as manual steps with instructions. I write little scripts in the Makefile
. I write larger scripts in the src/
directory. If these get too big or complex, I start thinking about building a specialized tool. (And of course, I store everything in version control.) It takes more time at the beginning, but I think that I usually save time later, because I have a nice place to put everything from the start.
In other words, I start thinking about automation at the very beginning of the project, assuming from the start that it will grow, and that I'll need to go back and change things. With a mindset for automation, from the start I'm thinking about how the inputs I care about are the same and different, which similarities I can use for my tests and code, and which differences are important or unimportant.
"},{"location":"lesson/automating-ontology-workflows/#conclusion","title":"Conclusion","text":"In the end, my project isn't ever completely automated. It doesn't \"act of itself\". But by making everything clear and explicit I'm telling the computer how to do a lot of the work and other humans (or just my future self) how to do the rest of it. The final secret of automation, especially when it comes to software and data, is communication: expressing things clearly for humans and machines so they can see and do exactly what you did.
"},{"location":"lesson/automating-ontology-workflows/#scientific-computing-an-overview","title":"Scientific Computing: An Overview","text":"By: James Overton
By \"scientific computing\" we mean using computers to help with key aspect of science such as data collection, cleaning, interpretation, analysis, and visualization. Some people use \"scientific computing\" to mean something more specific, focusing on computational modelling or computationally intensive analysis. We'll be focusing on more general and day-to-day topics: how can a scientist make best use of a computer to do their work well?
These three things apply to lots of fields, but are particularly important to scientists:
It should be no surprise that automation can help with all of these. When working properly, computers make fewer mistakes than people, and the mistakes they do make are more predictable. If we're careful, our software systems can be easily reproduced, which means that an entire data analysis pipeline can be copied and run by another lab to confirm the results. And scientific publications are increasingly including data and code as part of the review and final publication process. Clear code is one of the best ways to communicate detailed steps.
Automation is critical to scientific instruments and experiments, but we'll focus on the data processing and analysis side: after the data has been generated, how should you deal with it?
Basic information management is always important:
More advanced data management is part of this course:
Some simple rules of thumb can help reduce complexity and confusion:
When starting a new project, make a nice clean new space for it. Try for that \"new project smell\".
It's not always clear when a project is really \"new\" or just a new phase of an old project. But try to clear some space to make a fresh start.
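A minimal sketch of what that fresh start can look like in the shell (the directory names are just one convention, not a requirement):

```sh
# Create a clean project skeleton and put it under version control right away.
mkdir -p myproject/data myproject/src myproject/build
cd myproject
git init
printf '# myproject\n\nWhere the data came from and how to run things.\n' > README.md
```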
"},{"location":"lesson/automating-ontology-workflows/#firm-foundations","title":"Firm Foundations","text":"A lot of data analysis starts with a reference data set. It might be a genome or a proteome. It might be a corpus. It might be a set of papers or data from those papers.
Start by finding that data and selecting a particular version of it. Write that down clearly in your notes. If possible, include a unique identifier such as a (persistent) URL or DOI. If that's not possible, write down the steps you took. If the data isn't too big, keep a copy of it in your fresh new project directory. If the data is a bit too big, keep a compressed copy in a zip
or gz
file. A lot of software is perfectly happy to read directly from compressed files, and you can compress or uncompress data using piped commands in your shell or script. If the data is really too big, then be extra careful to keep notes on exactly where you can find it again. Consider storing just the hashes of the big files, so you can confirm that they have exactly the same contents.
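A hedged example of the checksum idea (file names are placeholders; on macOS you may need shasum -a 256 instead of sha256sum):

```sh
# Record checksums of the big reference files once...
sha256sum data/*.fasta.gz > checksums.txt

# ...and later (or on another machine) verify that the contents are unchanged.
sha256sum -c checksums.txt
```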
If you know from the start that you will need to compare your results with someone else's, make sure that you're using the same reference data that they are. This may require a conversation, but trust me that it's better to have this conversation now than later.
"},{"location":"lesson/automating-ontology-workflows/#one-way-data-flow","title":"One-Way Data Flow","text":"It's much easier to think about processes that flow in one direction. Branches are a little trickier, but usually fine. The real trouble comes with loops. Once a process loops back on itself it's much more difficult to reason about what's happening. Loops are powerful, but with great power comes great responsibility. Keep the systems you design as simple as possible (but no simpler).
In practical terms:
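As a minimal illustration of one-way flow in the shell (all file names are hypothetical), each step reads existing files and writes new ones, and the raw input is never edited in place:

```sh
# Guard the raw input against accidental edits, then derive everything into build/.
chmod a-w data/raw_counts.tsv
mkdir -p build
sort -k1,1 data/raw_counts.tsv > build/sorted.tsv
cut -f1,3 build/sorted.tsv > build/selected.tsv
```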
It's very tempting: you could automate this step, or you could just do it manually. It might take three times as long to automate it, right? So you can save yourself some precious time by just opening Excel and \"fixing\" things by hand.
Sometimes that bet will pay off, but I lose that bet most of the time. I tend to realize my mistake only at the last minute. The submission deadline is tomorrow but the core lab \"fixed\" something and they have a new version of the dataset that we need to use for the figures. Now I really don't have time to automate, so I'm up late clicking through Excel again and hoping that I remembered to redo all the changes that I made last time.
Automating the process would have actually saved me time, but more importantly it would have avoided a lot of stress. By now I should know that the dataset will almost certainly be revised at the last minute. If I have the automation set up, then I just update the data, run the automation again, and quickly check the results.
"},{"location":"lesson/automating-ontology-workflows/#test-from-the-start","title":"Test from the Start","text":"Tests are another thing that take time to implement.
One of the key benefits to tests is (again) communication. When assessing or trying out some new piece of software I often look to the test files to see examples of how the code is really used, and the shape of the inputs and outputs.
There's a spectrum of tests that apply to different parts of your system:
Tests should be automated. The test suite should either pass or fail, and if it fails something needs to be fixed before any more development is done. The automated test suite should run before each new version is committed to version control, and ideally more often during development.
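A minimal sketch of such an automated test (clean_data.sh and the files under tests/ are hypothetical names): run one step of the pipeline on a small fixed input and diff the output against a saved expected result.

```sh
#!/bin/sh

# Fail loudly if the pipeline step itself fails.
set -e

# Run the step on a tiny, fixed input and compare against the expected output.
./clean_data.sh tests/input.csv > tests/actual.csv
if diff -u tests/expected.csv tests/actual.csv; then
    echo "PASS"
else
    echo "FAIL: output differs from tests/expected.csv" >&2
    exit 1
fi
```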
Tests come with costs:
The first is obvious, but the other two are often more important. A slow test suite is annoying to run, and so it won't get run. A test suite that's hard to update won't get updated, and then failures will be ignored, which defeats the entire purpose.
"},{"location":"lesson/automating-ontology-workflows/#documentation-is-also-for-you","title":"Documentation is also for You","text":"I tend to forget how bad a memory I have. In the moment, when I'm writing brilliant code nothing could be more obvious than the perfect solution that is pouring forth from my mind all over my keyboard. But when I come back to that code weeks, months, or years later, I often wonder what the heck I was thinking.
We think about the documentation we write as being for other people, but for a lot of small projects it's really for your future self. Be kind to your future self. They may be even more tired, even more stressed than you are today.
There's a range of different forms of documentation, worth a whole discussion of its own. I like this four-way distinction:
You don't need all of these for your small project, but consider a brief explanation of why it works the way it does (aimed at a colleague who knows your field well), and some brief notes on how-to do the stuff this project is for. These could both go in the README of a small project.
"},{"location":"lesson/automating-ontology-workflows/#additional-materials-and-resources","title":"Additional materials and resources","text":"In this lesson, we will take a look at the generative capabilities of LLM's in general and ChatGPT in particular, to try and get a beginning sense on how to leverage it to enhance ontology curation workflows.
The goal of the lesson is to give a mental model of what ChatGPT and LLMs are used for (ignoring details on how they work), to contextualise the public discourse a bit, and then to move on to looking at some concrete examples of their potential for improving curation activities.
To achieve this we engaged in a dialogue with ChatGPT to generate almost the entire content of the lesson. The lesson authors supply the general \"structure\" of the lesson as a series of prompts, and get ChatGPT to provide the content. This content is obviously not as good as it could have been if it had been created by a human with infinite resources, but we hope it does get the following points across:
We believe that, from a user perspective, prompt engineering will be the most important skill that needs to be learned when dealing with generative AI. This applies not just to ChatGPT (which generates text), but also to tools that generate images from text such as DALL-E or Midjourney, so this is what we will focus on. In the long term, applications like Monarch's OntoGPT will do some of the heavy lifting around writing perfect prompts, but it seems pretty clear that some basic knowledge of prompt engineering will be useful, or even necessary, for a long time to come.
For a reference of effective ChatGPT prompts for ontology development see here.
Note:
- ChatGPT is rapidly evolving. The moment we add an answer, it will probably be outdated. For example, I created the first version of this tutorial on April 17th 2023. On May 27th, almost all answers ChatGPT is giving are completely different from the ones given in the first round. This is also important to remember when building applications around ChatGPT.
- https://open-assistant.io/chat is free and can be used to follow this tutorial instead of ChatGPT.
"},{"location":"lesson/chatgpt-ontology-curation/#contributors","title":"Contributors","text":"Prompts
We use quote syntax with the prompt icon to indicate a concrete prompt for ChatGPT
Comments
We use quote syntax with the comment icon to indicate a comment by the author
Replies by ChatGPT
Replies are given in normal text form. All text after the table of contents, apart from comments, prompts and the section on executable workflows, is generated by ChatGPT.
"},{"location":"lesson/chatgpt-ontology-curation/#prequisites","title":"Prequisites","text":"None of the text in this section is generated with ChatGPT.
In essence, an LLM takes a piece of text as input and returns text as output. A \"prompt\" is a piece of text that is written by an agent. This can be a human, or a software tool, or a combination of the two. In most cases, a human agent will pass the prompt to a specialised tool that pre-processes the prompt in certain ways (like translating it, adding examples, structuring it and more) before passing it to the large language model (LLM). For example, when a chatbot tool like ChatGPT receives a prompt, it processes the prompt in certain ways, then leverages the trained LLM to generate the text (which is probably postprocessed) and passes it back to the human agent.
There are an infinite number of possible tools you can imagine following this rough paradigm. Monarch's own ontogpt, for example, receives the prompt from the human agent, then augments the prompt in a certain way (by adding additional instructions to it) before passing the augmented prompt to an LLM like gpt3.5 (or lately even gpt4), which generates an instance of a curation schema. This is a great example of an LLM generating not only human-readable text, but structured text. Another example of this is to ask an LLM to generate a SPARQL query to obtain publications from Wikidata.
Given the wide range of applications LLMs can serve, it is important to get a mental model of how these can be leveraged to improve our ontology and data curation workflows. It makes sense for our domain (semantic engineering and curation) to distinguish four basic models of interacting with LLMs (which are technically not much different):
Using LLMs as advisors has a huge number of creative applications. An advisor in this sense is a machine that \"knows a lot\" and helps you with your own understanding of the world. Large language models trained on a wide range of inputs are particularly interesting in this regard because of the immense breadth of their knowledge (rather than depth), which is something that can be difficult to get from human advisors. For example, the authors of this article have used ChatGPT and other LLM-based chatbots to help with understanding different domains, and how they might relate to knowledge management and ontologies in order to give specific career advice or to prepare for scientific panel discussions. For ontology curators, LLMs can be used to generate arguments for certain classification decisions (like a disease classification) or even suggest a mapping.
Using LLMs as assistants is probably the most important use of LLM-based tools at the moment, which includes aspects like summarising texts and generating sometimes boring, yet important, creative work (documentation pages, tutorials, blog posts etc.). It is probably not a good idea, at least as of May 2023, to defer to LLM-based tools to classify a term in an ontology, for example because of their tendency to hallucinate. Despite many arguments to the contrary, LLMs are not databases. They are programs to generate text.
Using LLMs to extract information is, similar to \"LLMs as assistants\" above, also about automating certain tasks, but the endpoint is a software program rather than a human. It is the most important basic model of LLMs for us curators and software engineers to understand, because it is, in essence, the one that threatens our current work-life the most: What happens if LLMs become better at extracting structured knowledge from papers (or, similarly, at generating software code for user stories) than us? It is important that this thought is not ignored out of fear, but approached with a realistic and positive mindset.
Training. Apart from the fact that LLMs take text as an input and return text as an output, it is important to be aware how they are trained.
The basic technique for training is called \"next token prediction\". In essence, tokens in the text such as words or phrases are masked out by the training function and the LLM is trained to correctly predict these masked-out phrases given the previous words in the sentence (or paragraph). The corpus used for this training ranges from specialised databases all the way to a large chunk of the publicly accessible textual information on the internet. Again, it is important to realise that text does not have to be composed of natural language sentences - learning how to generate programming code and structured data works much the same way.
In the following, we will ask ChatGPT or Open Assistant a few questions that are relevant to this lesson, which allows us to make certain points. The main point is that a lot of the work for preparing a useful tutorial can be deferred to ChatGPT as an assistant! Many of the questions explore how these tools could relate to us biocurators and semantic engineers in particular, but none of this should be viewed as written in stone. The most important takeaway from this lesson is that you should play with these tools yourself!
Can you explain in 3 simple sentences what an LLM is and how it works? [Note this question was asked to [Open Assistant](https://open-assistant.io/chat/) instead of ChatGPT.]
An LLM (Langauge Modell) is a AI model that can be trained to perform specific tasks such as speech recognition, text generation or machine translation. It's called a \"model\" because it is being trained on large sets of data to predict the next step based on previously learned relationships between inputs and outputs. LLMs are designed to recognize patterns in human language by processing input data through multiple layers of artificial neural networks that simulate the way neurons communicate in the brain. The training process involves feeding vast amounts of text into the model so that it can learn to identify underlying structures within the language itself. This allows the model to make predictions about likely next words or phrases given previous context. By continually adjusting its internal weights and biases using gradient descent techniques, the LLM becomes increasingly proficient at generating coherent text based on statistical probabilities derived from its knowledge of the language.
What is interesting here: the LLM actually made a typo, or rather, three: Langauge Modell
- missing the first L (large), Langauge instead of Language, and Modell with a double L (which is German). Note that both words are spelled correctly later in the text.
What are the 5 core concepts relevant for end-users to understand LLMs?
Describe the idea of prompt engineering in simple terms for an audience of domain experts
Prompt engineering is the process of carefully crafting input queries, or \"prompts,\" to guide the behavior of a large language model (LLM) like GPT-4, ensuring that it provides useful, relevant, and accurate responses. As domain experts, you understand the importance of asking the right questions to extract valuable information. Similarly, with prompt engineering, the goal is to design questions or statements in a way that maximizes the LLM's ability to generate meaningful, high-quality output.
Essentially, prompt engineering serves as a \"communication bridge\" between humans and the LLM, allowing the model to better understand the intent behind the query and generate the desired response. This can involve refining the wording, providing context, or specifying the format of the desired answer, among other techniques.
By focusing on the art and science of prompt engineering, researchers and practitioners can better harness the potential of LLMs and overcome some of the challenges related to biases, incomplete knowledge, and misleading responses.
Providing an effective prompt to ChatGPT is an art form, and as resources like learnprompting.org testify, one that requires cultivation and care. The better you become at writing prompts, the more value you will draw from ChatGPT for your daily ontology curation tasks. How effective our prompts are depends on the exact language models used in ChatGPT (at the moment 3.5 or 4), which means that this lesson will keep evolving.
"},{"location":"lesson/chatgpt-ontology-curation/#what-are-some-basic-tips-for-writing-good-prompts-for-chatgpt","title":"What are some basic tips for writing good prompts for ChatGPT?","text":"In the excellent 60 minute OpenAI, Deeplearning.ai course on Prompt Engineering, we learn that we should be (1) writing clear an specific instructions and (2) give ChatGPT \"time to think\".
What are some potential applications of ChatGPT for ontology development?
ChatGPT can be a valuable tool for ontology development in various ways. Some potential applications include:
By leveraging the capabilities of ChatGPT, ontology developers can streamline the development process, enhance the quality of their ontologies, and facilitate their adoption and use across various domains.
"},{"location":"lesson/chatgpt-ontology-curation/#how-can-we-concretely-enhance-the-ontology-curation-process","title":"How can we, concretely, enhance the ontology curation process?","text":"A normal workflow of ontology curation is this:
How can ChatGPT be used concretely to help with that process?
ChatGPT can be incorporated into various stages of the ontology curation workflow to assist biocurators and ontology curators in streamlining and improving the process. Here's how ChatGPT can be used concretely at each step:
By integrating ChatGPT into the ontology curation workflow, biocurators and ontology curators can benefit from its natural language understanding and generation capabilities, making the process more efficient and accurate. However, it is essential to keep in mind that ChatGPT should be used as a complementary tool and not as a replacement for human expertise or specialized software designed for ontology development and curation.
"},{"location":"lesson/chatgpt-ontology-curation/#using-chatgpt-for-ontology-mappings","title":"Using ChatGPT for ontology mappings","text":"What about ontology mappings?
LLMs like ChatGPT can be used to identify equivalent or similar concepts across different ontologies by analyzing concept labels, definitions, and other metadata. They can understand and process natural language descriptions and compare them to find potential matches or alignments between concepts. Specifically, LLMs can:
By leveraging these capabilities, LLMs can assist ontology curators in the challenging task of concept matching, ultimately enhancing the ontology mapping and alignment process.
Can you provide an example of mapping a concept from the Human Phenotype Ontology to the Mammalian Phenotype Ontology?
Let's consider the concept \"Abnormality of the ear\" in HPO.
Our goal is to find a matching or similar concept in the Mammalian Phenotype Ontology (MPO).
In this case, both concepts describe abnormalities of the ear in their respective domains (human and mammalian). The labels and definitions are similar, indicating that these concepts might be considered equivalent or closely related.
Thus, we can propose the following mapping:
COMMENT. The 27th May version of ChatGPT is giving a different answer:
Sure, let's consider an example for this mapping:
Human Phenotype Ontology (HPO) is designed to classify and annotate the phenotypic abnormalities encountered in human disease. Each term in HPO describes a specific phenotypic abnormality, such as 'HP:0004322 - Short stature' which describes an abnormality of body height that is less than -2 standard deviations compared to the appropriate reference population.
Mammalian Phenotype Ontology (MPO), on the other hand, is used in the annotation of data from mammalian phenotyping experiments, typically in model organisms like mice. Its equivalent term for short stature might be 'MP:0001265 - decreased body size' which describes a mammal with an overall smaller body size.
So, in this case, 'HP:0004322 - Short stature' in HPO could be mapped to 'MP:0001265 - decreased body size' in MPO.
Please note that this is a simplified example and the actual mapping process could be more complex, due to the nuanced differences between human and mammalian phenotypes and the structure of these ontologies. For actual mappings, it is essential to ensure the descriptions and definitions align appropriately, and the context of use is correctly considered.
Also, there may not always be a one-to-one correspondence between the terms in different ontologies. Sometimes, one term in an ontology may correspond to multiple terms in another, or there may not be an equivalent term at all. This is why ontological mapping is a complex task requiring domain knowledge and careful consideration.
Can you express the above mapping in SSSOM?
The Simple Standard for Sharing Ontology Mappings (SSSOM) is a standard that enables sharing and comparing ontology mappings using a simple tabular format (TSV). Here's the example mapping from Human Phenotype Ontology (HPO) to Mammalian Phenotype Ontology (MPO) expressed in SSSOM format:
subject_id subject_label predicate_id object_id object_label match_type confidence\nHP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology Lexical 0.90\n
In this case, I have used skos:closeMatch as the predicate to represent the close relationship between the two concepts. The match_type is \"Lexical\" since the mapping is based on similar labels and definitions. The confidence score is set to 0.90 to reflect a high degree of similarity between the concepts, but this value may vary depending on the actual similarity estimation method used.
Please note that the SSSOM format includes metadata and other required fields in a TSV header. The provided example only focuses on the core mapping information. For a complete SSSOM file, you would need to include the appropriate header and metadata.
The SSSOM example you gave is using an old version of the SSSOM standard. In particular, match_type now is mapping_justification. Can you try to rewrite the mapping using the new standard?
Apologies for the confusion. Here's the example mapping from Human Phenotype Ontology (HPO) to Mammalian Phenotype Ontology (MPO) expressed in the updated SSSOM format:
subject_id subject_label predicate_id object_id object_label mapping_justification confidence\nHP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology Lexical 0.90\n
In this revised example, the match_type field has been replaced with mapping_justification, which is now set to \"Lexical\" to indicate that the mapping is based on similar labels and definitions. The remaining fields are unchanged from the previous example.
mapping_justification values should be populated from a controlled vocabulary, called the Semantic Mapping Vocabulary (SEMAPV). You can look at the vocabulary in RDF here: https://raw.githubusercontent.com/mapping-commons/semantic-mapping-vocabulary/main/semapv.owl. Given that vocabulary, can you revise the value for the mapping_justification?
My apologies for the oversight. Based on the Semantic Mapping Vocabulary (SEMAPV), I will revise the mapping justification for the example. Here's the updated SSSOM mapping using the SEMAPV:
subject_id subject_label predicate_id object_id object_label mapping_justification confidence\nHP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatch 0.90\n
In this updated example, I have replaced the previous mapping_justification value with semapv:LexicalMatch from the Semantic Mapping Vocabulary (SEMAPV) to represent the lexical similarity between the two concepts based on their labels and definitions. The remaining fields are unchanged.
You were hallucinating earlier. Here is the semapv vocabulary as a tsv:
IRI skos:prefLabel skos:definition dc:source skos:example rdfs:comment altLabel Parent\nID A skos:prefLabel A skos:definition AI dc:source A skos:example A rdfs:comment A skos:altLabel SC %\nsemapv:MappingActivity mapping activity A process that relates to the creation, confirmation, rejection or curation of a mapping. Matching is a mapping activity that results in the creating of a mapping; mapping review is an activity that results in the confirmation of a mapping. \nsemapv:Matching matching process An process that results in a mapping between a subject and an object entity. The label of a subject entity matches to an exact synonym of an object entity. matching operation|matching task semapv:MappingActivity\nsemapv:Mapping mapping A triple <s,p,o> comprising a subject entity s, an object entity o and a mapping predicate p. The subject entity NCI:C9305 is mapped to the object entity ICD10:C80.9 using the skos:relatedMatch mapping predicate. \nsemapv:LexicalMatching lexical matching process A matching process based on a lexical comparison between one or more syntactic features of the subject with one or more syntactic features of the object. The label of a subject entity matches to an exact synonym of an object entity. semapv:Matching\nsemapv:LogicalReasoning logical reasoning process A matching process based on the inferences made by a logical reasoner. Two classes across ontologies are determined equivalent by an OWL reasoner such as ELK. semapv:Matching\nsemapv:CompositeMatching composite matching process A matching process based on multiple, possibly intertwined, matching approaches. An ontology matching tool determines that a subject should be mapped to an object by employing a range of techniques, including lexical, semantic and structural. semapv:Matching\nsemapv:UnspecifiedMatching unspecified matching process A matching process based on an unspecified comparison. A mapping between a subject and an object was established, but it is no longer clear how or why. semapv:Matching\nsemapv:SemanticSimilarityThresholdMatching semantic similarity threshold-based matching process A matching process based on a minimum threshold of a score from a comparison based on a semantic similarity algorithm. A match between a subject and an object entity was established because they had a Jaccard score higher than 0.8 based on the set of (common) superclasses. semapv:Matching\nsemapv:LexicalSimilarityThresholdMatching lexical similarity threshold-based matching process A lexical matching process based on a minimum threshold of a score from a comparison based on a lexical similarity algorithm. A match between a subject and an object was established because they had a Levenshtein score higher than 0.8. semapv:Matching\nsemapv:StructuralMatching structural matching process https://doi.org/10.1016/j.websem.2009.11.002 A match between a subject and an object was established because of the similarity of their structural features, e.g., the number of direct property of a class. Structural matching does not involve looking at \"values\" of properties. semapv:Matching\nsemapv:InstanceBasedMatching instance-based matching process A matching process based on individual representations (or instances). https://doi.org/10.1007/978-3-642-38721-0 A match between a subject A and an object B was established because they share the same instances. 
semapv:Matching\nsemapv:BackgroundKnowledgeBasedMatching background knowledge-based matching process A matching process that exploits background knowledge from external resources, commonly referred to as background knowledge resources. This approach is also known as indirect matching, BK-based matching or context-based matching. https://doi.org/10.1016/j.websem.2018.04.001 A match between a subject A and an object B was established because they appear equivalent under consideration of externally provided background knowledge. semapv:Matching\nsemapv:MappingChaining mapping chaining-based matching process A matching process based on the traversing of multiple mappings. A match between a subject A and an object B was established because A was mapped to C, C was mapped to D and D was mapped to B. semapv:Matching\nsemapv:MappingReview mapping review A process that is concerned with determining if a mapping \u201ccandidate\u201d (otherwise determined) is reasonable/correct. A match between a subject A and an object B was established elsewhere, and a human reviewer determined that the mapping is true (or false) based on an independent evaluation. semapv:MappingActivity\nsemapv:ManualMappingCuration manual mapping curation An matching process that is performed by a human agent and is based on human judgement and domain knowledge. A human curator determines that a subject should be mapped to an object by virtue of their domain expertise. semapv:Matching\nsemapv:MatchingPreprocessing matching process pre-processing A preliminary processing of inputs prior to performing matching. semapv:MappingActivity\nsemapv:Stemming stemming The process of reducing a word to its word stem. semapv:Normalization\nsemapv:Lemmatization lemmatization The process of reducing the different forms of a word to one single form. https://doi.org/10.1007/978-3-642-38721-0 semapv:Normalization\nsemapv:StopWordRemoval stop-word removal A preprocessing method transforming a string by removing a list of stop words. https://doi.org/10.1007/978-3-642-38721-0 The stop-words \u201cof, and, is, a\u201d are removed from the label field of the subject entity in the mapping. semapv:Normalization\nsemapv:RegexRemoval regex removal A preprocessing method transforming a string by matching a regular expression and then removing that match. The regex match \u201c[ -]phenotype\u201d is removed from the label field of the subject entity in the mapping. semapv:Normalization\nsemapv:RegexReplacement regex replacement A preprocessing method transforming a string by matching a regular expression and then replacing that match with a specified substitution string. The regex match \u201c[ -]phenotype\u201d is replaced by \u201c-disease\u201d for the label field of the subject entity in the mapping. semapv:Normalization\nsemapv:MatchingPostProcessing matching process post-processing A process applied to a set of mappings, usually with the intention of changing it. semapv:MappingActivity\nsemapv:ThresholdFiltering threshold filtering semapv:MatchingPostProcessing\nsemapv:StableMarriageFiltering stable marriage filtering semapv:MatchingPostProcessing\nsemapv:CardinalityFiltering cardinality filtering semapv:MatchingPostProcessing\nsemapv:LogicalConsistencyFiltering logical consistency filtering semapv:MatchingPostProcessing\nsemapv:Normalization normalization A family of preprocessing techniques for reducing strings to be compared to a common format. 
https://doi.org/10.1007/978-3-642-38721-0 semapv:MatchingPreprocessing\nsemapv:CaseNormalization case normalization A normalization technique converting each alphabetic character in a string to their lower case counterpart. https://doi.org/10.1007/978-3-642-38721-0 The string 'Ontology' is normalised to 'ontology' (lower case). semapv:Normalization\nsemapv:DiacriticsSuppression diacritics suppression A normalization technique replacing diactritic signs (e.g. accents, hats) with their most frequent replacements. https://doi.org/10.1007/978-3-642-38721-0 The string 'Prot\u00e9g\u00e9' is normalised to 'Protege'. semapv:Normalization\nsemapv:BlankNormalisation blank normalisation A normalization technique replacing all blank characters, such as space, tabulation, carriage return (or sequences of these) into a single blank character. https://doi.org/10.1007/978-3-642-38721-0 The string 'ontology matching' (two spaces between 'ontology' and 'matching) becomes 'ontology matching' (one space). semapv:Normalization\nsemapv:LinkStripping link stripping A normalization technique replacing specific links between words, such as apostrophes, dashes, underscore, etc into dashes or blanks. https://doi.org/10.1007/978-3-642-38721-0 The string 'alzheimer's' becomes 'alzheimers'. semapv:Normalization\nsemapv:DigitSuppression digit suppression A normalization technique removing all numbers in a string. https://doi.org/10.1007/978-3-642-38721-0 The string 'ontology1' becomes 'ontology'. semapv:Normalization\n
Can you use this to pick the right mapping justification?
Apologies for the confusion earlier. Based on the provided SEMAPV vocabulary as a TSV, I will revise the mapping justification for the example. Here's the updated SSSOM mapping using the correct SEMAPV:
subject_id subject_label predicate_id object_id object_label mapping_justification confidence\nHP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatching 0.90\nIn this updated example, I have replaced the previous mapping_justification value with semapv:LexicalMatching from the provided Semantic Mapping Vocabulary (SEMAPV) to represent the lexical similarity between the two concepts based on their labels and definitions. The remaining fields are unchanged.\n
Please remember that a complete SSSOM file would require the appropriate header and metadata in addition to the core mapping information provided in the example.
Can you add 4 more examples to that SSSOM table
Sure, here are 4 more examples of mappings between the Human Phenotype Ontology (HPO) and the Mammalian Phenotype Ontology (MPO) using SSSOM format:
subject_id subject_label predicate_id object_id object_label mapping_justification confidence\nHP:0000598 Abnormality of the ear skos:closeMatch MP:0002161 abnormal ear morphology semapv:LexicalMatching 0.90\nHP:0000365 Hearing impairment skos:closeMatch MP:0006325 hearing loss semapv:LexicalMatching 0.92\nHP:0001382 Joint hypermobility skos:closeMatch MP:0004927 increased joint flexibility semapv:LexicalMatching 0.85\nHP:0001392 Hepatomegaly skos:closeMatch MP:0002764 enlarged liver semapv:LexicalMatching 0.88\nHP:0001250 Seizures skos:closeMatch MP:0002064 seizures semapv:LexicalMatching 0.95\n
These examples illustrate mappings between HPO and MPO concepts based on lexical matching, where the similarity between the labels and definitions is considered. Note that the confidence values are provided as an example and may differ in real-world scenarios.
Can you print only the table using markdown table syntax in a way that I can copy paste it
| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence |
|------------|---------------|--------------|-----------|--------------|-----------------------|------------|
| HP:0000598 | Abnormality of the ear | skos:closeMatch | MP:0002161 | abnormal ear morphology | semapv:LexicalMatching | 0.90 |
| HP:0000365 | Hearing impairment | skos:closeMatch | MP:0006325 | hearing loss | semapv:LexicalMatching | 0.92 |
| HP:0001382 | Joint hypermobility | skos:closeMatch | MP:0004927 | increased joint flexibility | semapv:LexicalMatching | 0.85 |
| HP:0001392 | Hepatomegaly | skos:closeMatch | MP:0002764 | enlarged liver | semapv:LexicalMatching | 0.88 |
| HP:0001250 | Seizures | skos:closeMatch | MP:0002064 | seizures | semapv:LexicalMatching | 0.95 |
Can you add 100 more examples, this time only providing a markdown table that I can copy paste, with no further comments around it?
Creating 100 mappings with accuracy requires domain expertise and a comprehensive understanding of the ontology mappings. However, I can provide you with 20 examples to help you understand the format. Please note that these examples may not be completely accurate and should not be used for any critical applications.
| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence |
|------------|---------------|--------------|-----------|--------------|-----------------------|------------|
| HP:0000010 | Megabladder | skos:closeMatch | MP:0000547 | enlarged bladder | semapv:LexicalMatching | 0.85 |
| HP:0000014 | Renal agenesis | skos:closeMatch | MP:0000548 | absent kidney | semapv:LexicalMatching | 0.82 |
| HP:0000015 | Bladder diverticulum | skos:closeMatch | MP:0000556 | bladder diverticula | semapv:LexicalMatching | 0.88 |
| HP:0000016 | Urethral atresia | skos:closeMatch | MP:0000565 | closed urethra | semapv:LexicalMatching | 0.90 |
| HP:0000020 | Renal duplication | skos:closeMatch | MP:0000571 | duplex kidney | semapv:LexicalMatching | 0.84 |
| HP:0000021 | Hypospadias | skos:closeMatch | MP:0000573 | abnormal penis morphology | semapv:LexicalMatching | 0.76 |
| HP:0000022 | Urethral stricture | skos:closeMatch | MP:0000574 | urethral obstruction | semapv:LexicalMatching | 0.86 |
| HP:0000025 | Polycystic kidney dysplasia | skos:closeMatch | MP:0000576 | polycystic kidney | semapv:LexicalMatching | 0.92 |
| HP:0000026 | Hyperechogenic kidneys | skos:closeMatch | MP:0000581 | abnormal kidney echogenicity | semapv:LexicalMatching | 0.81 |
| HP:0000027 | Renal cortical microcysts | skos:closeMatch | MP:0000584 | renal cortical cysts | semapv:LexicalMatching | 0.89 |
| HP:0000028 | Renal tubular atrophy | skos:closeMatch | MP:0000585 | atrophic renal tubules | semapv:LexicalMatching | 0.87 |
| HP:0000031 | Renal hypoplasia | skos:closeMatch | MP:0000591 | small kidney | semapv:LexicalMatching | 0.79 |
| HP:0000032 | Renal hypoplasia/aplasia | skos:closeMatch | MP:0000592 | renal hypoplasia | semapv:LexicalMatching | 0.91 |
| HP:0000033 | Proximal tubulopathy | skos:closeMatch | MP:0000593 | abnormal proximal tubule | semapv:LexicalMatching | 0.75 |
AT THIS POINT CHATGPT JUST STOPPED GENERATING CONTENT. "},{"location":"lesson/chatgpt-ontology-curation/#generating-pre-coordinated-expressions-and-schema-instances","title":"Generating pre-coordinated expressions and schema instances","text":"Here are a few slots in a schema: quality, anatomical_entity and chemical_entity
quality:\ndescription: The property being measured, or changes in this property, for example, amount, level, increased amount, decreased concentration\nannotations:\n prompt.example: amount, level, increased amount, decreased concentration\nrange: Quality\n\nanatomical_entity:\ndescription: The anatomical location that the chemical entity is measured in\nrange: AnatomicalEntity\nannotations:\n prompt.example: liver, heart, brain, finger\n\nchemical_entity:\ndescription: The chemical entity that is being measured\nrange: ChemicalEntity\nannotations:\n prompt.example: lysine, metabolite\n
Can you create a YAML file with those three elements as keys, and extract the contents of the string \"increased blood glucose levels\" into as values to these keys? Output should be just a simple yaml file, like:
quality: concentration\nanatomical_entity: liver\nchemical_entity: lysine\n
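A plausible answer for the string "increased blood glucose levels" (added here purely as an illustration; it is not taken from an actual ChatGPT reply, and the exact values the model returns may differ) would decompose the string along these lines:

quality: increased level
anatomical_entity: blood
chemical_entity: glucose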
"},{"location":"lesson/chatgpt-ontology-curation/#todo-finish-this-section","title":"TODO FINISH THIS SECTION","text":""},{"location":"lesson/chatgpt-ontology-curation/#from-chat-to-exectutable-workflows-what-we-need-to-do-to-leverage-llms","title":"From Chat to exectutable workflows: what we need to do to leverage LLMs","text":"The above tutorial was a fun case study using ChatGPT with GPT-4. 95% of the content provided was generated by ChatGPT with GPT-4. While certainly not as great as possible, it took a solid ontology engineer (@matentzn
) about 90 minutes to write this lesson, which would have usually cost him more than 8 hours.
It is clear that learning how to talk to AI, the process we refer to as \"prompt engineering\", is going to be absolutely essential for ontology curators moving forward - as LLMs improve and get better at even complex languages like OWL, it may become as important as ontology modelling itself. I don't think there is any doubt that engaging in a good amount of play and study on this subject is both fun and hugely beneficial.
All that said, perceiving LLMs through the lens of a chat bot leaves a lot of potential unexplored. For example, if ChatGPT (or LLMs in general) can generate structured data, why not integrate this directly into our curation tools (like Protege)? Tools like GitHub Copilot are already making good programmers a lot more effective, but so far, these tools focus on development environments where the majority of the generated content is text (e.g. software code), and not so much on heavily UI-driven ones like Protege.
A lot of blog posts that have circulated recently on Twitter and LinkedIn have explored the potential of LLMs to generate RDF and OWL directly. It is already clear that LLMs can and will do this moving forward. For ontology curation specifically, we will need to develop executable workflows that fit into our general ontology curation process. As a first pass, some members of our community have developed OntoGPT. We will explore how to use OntoGPT in a future lesson.
"},{"location":"lesson/chatgpt-ontology-curation/#some-more-thoughts-on-hallucinations","title":"Some more thoughts on hallucinations","text":"Update 27 May 2023: It seems that complaints wrt to hallucinations, the chat part of ChatGPT is a bit more sensitive to database like queries:
"},{"location":"lesson/chatgpt-ontology-curation/#cool-applications-the-authors-of-this-tutorial-used","title":"Cool applications the authors of this tutorial used","text":"(Current) Limitations:
Participants will need to have access to the following resources and tools prior to the training:
Description: How to contribute terms to existing ontologies.
"},{"location":"lesson/contributing-to-obo-ontologies/#learning-objectives","title":"Learning objectives","text":"GitHub - distributed version control (Git) + social media for geeks who like to build code/documented collaboratively.
A Git repo consists of a set of branches each with a complete history of all changes ever made to the files and directories. This is true for a local copy you check out to your computer from GitHub or for a copy (fork) you make on GitHub.
A Git repo typically has a master or main branch that is not edited directly. Changes are made by creating a branch from Master (a complete copy of Master plus its history).
"},{"location":"lesson/contributing-to-obo-ontologies/#branch-vs-fork","title":"Branch vs Fork","text":"You can copy (fork) any GitHub repo to some other location on GitHub without having to ask permission from the owners.\u00a0 If you modify some files in that repo, e.g. to fix a bug in some code, or a typo in a document, you can then suggest to the owners (via a Pull Request) that they adopt (merge) you your changes back into their repo.
If you have permission from the owners, you can instead make a new branch. For this training, we gave you access to the repository. See the Appendix for instructions on how to make a fork.
"},{"location":"lesson/contributing-to-obo-ontologies/#create-github-issues","title":"Create GitHub Issues","text":"Tip: you can easily obtain term metadata like OBO ID, IRI, or the term label by clicking the three lines above the Annotations box (next to the term name) in Protege, see screenshot below. You can also copy the IRI in markdown, which is really convenient for pasting into GitHub.
"},{"location":"lesson/contributing-to-obo-ontologies/#video-explanation","title":"Video Explanation","text":"See this example video on creating a new term request to the Mondo Disease Ontology:
"},{"location":"lesson/contributing-to-obo-ontologies/#basic-open-source-etiquette","title":"Basic Open Source etiquette","text":"A README is a text file that introduces and explains a project. It is intended for everyone, not just the software or ontology developers. Ideally, the README file will include detailed information about the ontology, how to get started with using any of the files, license information and other details. The README is usually on the front page of the GitHub repository.
"},{"location":"lesson/contributing-to-obo-ontologies/#basics-of-ontology-development-workflows","title":"Basics of ontology development workflows","text":""},{"location":"lesson/contributing-to-obo-ontologies/#ontology-development-workflows","title":"Ontology development workflows","text":"The steps below describe how to make changes to an ontology.
training-NV
)The instructions below are using the Mondo Disease Ontology as an example, but this can be applied to any ontology.
"},{"location":"lesson/contributing-to-obo-ontologies/#open-the-mondo-in-protege","title":"Open the Mondo in Prot\u00e9g\u00e9","text":"Note: Windows users should open Protege using run.bat
The Prot\u00e9g\u00e9 interface follows a basic paradigm of Tabs and Panels. By default, Prot\u00e9g\u00e9 launches with the main tabs seen below. The layout of tabs and panels is configurable by the user. The Tab list will have slight differences from version to version, and depending on your configuration. It will also reflect your customizations.
To customize your view, go to the Window tab on the toolbar and select Views. Here you can customize which panels you see in each tab. In the tabs view, you can select which tabs you will see. You will commonly want to see the Entities tab, which has the Classes tab and the Object Properties tab.
Note: if you open a new ontology while viewing your current ontology, Prot\u00e9g\u00e9 will ask you if you'd like to open it in a new window. \u00a0For most normal usage you should answer no. This will open in a new window.
The panel in the center is the ontology annotations panel. This panel contains basic metadata about the ontology, such as the authors, a short description and license information.
"},{"location":"lesson/contributing-to-obo-ontologies/#running-the-reasoner","title":"Running the reasoner","text":"Before browsing or searching an ontology, it is useful to run an OWL reasoner first. This ensures that you can view the full, intended classification and allows you to run queries. Navigate to the query menu, and run the ELK reasoner:
For more details on why it is important to have the reasoner on when using the editors version of an ontology, see the Reasoning reference guide. But for now, you don't need a deeper understanding, just be sure that you always have the reasoner on.
"},{"location":"lesson/contributing-to-obo-ontologies/#entities-tab","title":"Entities tab","text":"You will see various tabs along the top of the screen. Each tab provides a different perspective on the ontology. For the purposes of this tutorial, we care mostly about the Entities tab, the DL query tab and the search tool. OWL Entities include Classes (which we are focussed on editing in this tutorial), relations (OWL Object Properties) and Annotation Properties (terms like, 'definition' and 'label' which we use to annotate OWL entities. Select the Entities tab and then the Classes sub-tab. Now choose the inferred view (as shown below).
The Entities tab is split into two halves. The left-hand side provides a suite of panels for selecting various entities in your ontology. When a particular entity is selected the panels on the right-hand side display information about that entity. The entities panel is context specific, so if you have a class selected (like Thing) then the panels on the right are aimed at editing classes. The panels on the right are customizable. Based on prior use you may see new panes or alternate arrangements. You should see the class OWL:Thing. You could start browsing from here, but the upper level view of the ontology is too abstract for our purposes. To find something more interesting to look at we need to search or query.
"},{"location":"lesson/contributing-to-obo-ontologies/#searching-in-protege","title":"Searching in Protege","text":"You can search for any entity using the search bar on the right:
The search window will open on top of your Protege pane, we recommend resizing it and moving it to the side of the main window so you can view together.
Here's an example search for 'COVID-19':
It shows results found in display names, definitions, synonyms and more. The default results list is truncated. To see full results check the 'Show all results option'. You may need to resize the box to show all results. Double clicking on a result, displays details about it in the entities tab, e.g.
In the Entities, tab, you can browse related types, opening/closing branches and clicking on terms to see details on the right. In the default layout, annotations on a term are displayed in the top panel and logical assertions in the 'Description' panel at the bottom.
Try to find these specific classes:
Note - a cool feature in the search tool in Protege is you can search on partial string matching. For example, if you want to search for \u2018down syndrome\u2019, you could search on a partial string: \u2018do synd\u2019.
Note - if the search is slow, you can uncheck the box \u2018Search in annotation values\u2019. Try this and search for a term and note whether the search is faster. Then search for \u2018shingles\u2019 again and note what results you get.
"},{"location":"lesson/contributing-to-obo-ontologies/#use-github-make-pull-requests","title":"Use GitHub: make pull requests","text":""},{"location":"lesson/contributing-to-obo-ontologies/#committing-pushing-and-making-pull-requests","title":"Committing, pushing and making pull requests","text":"Changes made to the ontology can be viewed in GitHub Desktop.
Before committing, check the diff. Examples of a diff are pasted below. Large diffs are a sign that something went wrong. In this case, do not commit the changes and ask the ontology editors for help instead.
Example 1:
NOTE: You can use the word 'fixes' or 'closes' in the description of the commit message, followed by the corresponding ticket number (in the format #1234) - these are magic words in GitHub; when used in combination with the ticket number, it will automatically close the ticket. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
Note: 'Fixes' and \"Closes' are case-insensitive.
If you don't want to close the ticket, just refer to the ticket # without the word 'Fixes' or use 'Adresses'. The commit will be associated with the correct ticket but the ticket will remain open. NOTE: It is also possible to type a longer message than allowed when using the '-m' argument; to do this, skip the -m, and a vi window (on mac) will open in which an unlimited description may be typed.
Click Commit to [branch]. This will save the changes to the cl-edit.owl file.
Push: To incorporate the changes into the remote repository, click Publish branch.
The instructions below are using the Mondo Disease Ontology as an example, but this can be applied to any ontology.
"},{"location":"lesson/contributing-to-obo-ontologies/#setup","title":"Setup","text":""},{"location":"lesson/contributing-to-obo-ontologies/#setting-preferences-for-new-entities","title":"Setting Preferences for New entities","text":"Ontology terms have separate names and IDs. The names are annotation values (labels) and the IDs are represented using IRIs. The OBO foundry has a policy on IRI (or ID) generation (http://www.obofoundry.org/principles/fp-003-uris.html). You can set an ID strategy using the \"New Entities\" tab under the Prot\u00e9g\u00e9 Preferences -- on the top toolbar, click the \"Prot\u00e9g\u00e9 dropdown, then click Preferences.
Set your new entity preferences precisely as in the following screenshot of the New Entities tab.
Note - you have been assigned an ID range in the Mondo idranges file\u00a0 - you should be able to find your own range assigned there.
DIY (only if you know what you are doing!)
To add your own ID ranges:
Go into src/ontology
create a branch
Find and edit mondo-idranges.owl by adding the following:
Datatype: idrange:10 #update this to next following integer from previous\n\n Annotations:\n allocatedto: \"Your Name\" #change to your name\n\n EquivalentTo:\n xsd:integer[>= 0806000 , <= 0806999]. #add a range of 999 above the previous integer\n
Be sure to change \"Your Name\" to your actual name! And note that this value should almost always be an individual, and not an organization or group.
create a pull request and add matentzn or nicolevasilevsky as a reviewer
proceed to settting up as below:
Specified IRI: http://purl.obolibrary.org/obo/
Note - if you edit more than one ontology in Protege, you will need to update your Preferences for each ontology before you edit.
"},{"location":"lesson/contributing-to-obo-ontologies/#setting-preferences-for-user-details","title":"Setting Preferences for User details","text":"User name: click Use supplied user name and enter your username in the field below
Click Use Git user name when available
In the ORCID field, add your ORCID ID (in the format 0000-0000-0000-0000)
"},{"location":"lesson/contributing-to-obo-ontologies/#setting-preferences-for-new-entities-metadata","title":"Setting Preferences for New entities metadata","text":"The current recommendation of the OBO Foundry Technical Working Group is that an editor who creates a new term SHOULD add a http://purl.org/dc/terms/contributor
annotation, set to the ORCID or GitHub username of the editor, and a http://purl.org/dc/terms/date
annotation, set to the current date.
You can have Prot\u00e9g\u00e9 automatically add those annotations by setting your preferences to match the screenshot below, in the New entities metadata tab (under preferences).
If you do not have an ORCID, register for for free here: https://orcid.org/
"},{"location":"lesson/contributing-to-obo-ontologies/#protege-editing","title":"Protege editing","text":""},{"location":"lesson/contributing-to-obo-ontologies/#creating-a-new-class","title":"Creating a new class","text":"Before you start:
make sure you are working on a branch - see quick guide here.
make sure you have the editor's file open in Protege as detailed here.
New classes are created in the Class hierarchy panel on the left.
There are three buttons at the top of the class hierarchy view. These allow you to add a subclass (L-shaped icon), add a sibling class (c-shaped icon), or delete a selected class (x'd circle).
Practice adding a new term:
We will work on these two tickets:
Search for the parent term 'hypereosinophilic syndrome' (see search guide if you are unsure how to do this).
When you are clicked on the term in the Class hierarchy pane, click the add subclass button to add a child class to 'hypereosinophilic syndrome'
A dialog will popup. Name this new subclass: migratory muscle precursor. Click \"OK\" to add the class.
"},{"location":"lesson/contributing-to-obo-ontologies/#adding-annotations","title":"Adding annotations","text":"Using Prot\u00e9g\u00e9 you can add annotations such as labels, definitions, synonyms, database cross references (dbxrefs) to any OWL entity. The panel on the right, named Annotations, is where these annotations are added. CL includes a pre-declared set of annotation properties. The most commonly used annotations are below.
Note, most of these are bold in the annotation property list:
Use this panel to add a definition to the class you created. Select the + button to add an annotation to the selected entity. Click on the annotation 'definition' on the left and copy and paste in the definition to the white editing box on the right. Click OK.
Definition: A disorder characterized by episodes of swelling under the skin (angioedema) and an elevated number of the white blood cells known as eosinophils (eosinophilia). During these episodes, symptoms of hives (urticaria), fever, swelling, weight gain and eosinophilia may occur. Symptoms usually appear every 3-4 weeks and resolve on their own within several days. Other cells may be elevated during the episodes, such as neutrophils and lymphocytes. Although the syndrome is often considered a subtype of the idiopathic hypereosinophilic syndromes, it does not typically have organ involvement or lead to other health concerns.
Definitions in Mondo should have a 'database cross reference' (dbxref), which is a reference to the definition source, such as a paper from the primary literature or another database. For references to papers, we cross reference the PubMed Identifier in the format, PMID:XXXXXXXX. (Note, no space)
To add a dbxref to the definition:
We have seen how to add sub/superclasses and annotate the class hierarchy. Another way to do the same thing is via the Class description view. When an OWL class is selected in the entities view, the right-hand side of the tab shows the class description panel. If we select the 'vertebral column disease' class, we see in the class description view that this class is a \"SubClass Of\" (= has a SuperClass) the 'musculoskeletal system disease' class. Using the (+) button beside \"SubClass Of\" we could add another superclass to the 'skeletal system disease' class.
Note the Anonymous Ancestors. This is a difficult concept we will return to later, and the contents of this portion may seem confusing at first (some of these may be clearer after you complete the \"Basics of OWL\" section below). These are OWL expressions that are inherited from the parents. If you hover over the Subclass Of (Anonymous Ancestor) you can see the parent that the class inherited the expression from. For many ontologies, you will see some quite abstract expressions in here inherited from upper ontologies, but these can generally be ignored for most purposes.
"},{"location":"lesson/contributing-to-obo-ontologies/#revising-a-superclass","title":"Revising a superclass:","text":"If you want to revise the superclass, click the 'o' symbol next to the superclass and replace the text. Try to revise 'musculoskeletal system disease' to\u00a0 'disease by anatomical system'.
If you want to delete a superclass, click the 'x' button next to the superclass. Delete the 'disease by anatomical system' superclass.
Close this window without saving.
Save your work.
"},{"location":"lesson/contributing-to-obo-ontologies/#make-a-pull-request","title":"Make a Pull Request","text":"Click: Create Pull Request in GitHub Desktop
This will automatically open GitHub Desktop
Click the green button 'Create pull request'
You may now add comments to your pull request.
The CL editors team will review your PR and either ask for changes or merge it.
The changes will be available in the next release.
Dead Simple Ontology Design Patterns (DOSDPs) are specifications, written in yaml format, that specify how ontology terms should be created (see article here). They can be used to:
DOSDPs have some key features:
Examples of design patterns are available here:
under development
"},{"location":"lesson/contributing-to-obo-ontologies/#basics-of-owl","title":"Basics of OWL","text":"BDK14_exercises
from your file systembasic-subclass/chromosome-parts.owl
in Prot\u00e9g\u00e9, then do the following exercises:basic-restriction/er-sec-complex.owl
in Prot\u00e9g\u00e9, then do the following exercise:basic-dl-query/cc.owl
in Prot\u00e9g\u00e9, then do the following exercises:owl:Nothing
is defined as the very bottom node of an ontology, therefore the DL query results will show owl:Nothing
as a subclass. This is expected and does not mean there is a problem with your ontology! It's only bad when something is a subclass of owl:Nothing
and therefore unsatisfiable (more on that below).basic-classification/ubiq-ligase-complex.owl
in Prot\u00e9g\u00e9, then do the following exercises:Below are exercises to demonstrate how to:
These instructions will use the Mondo disease ontology as an example.
"},{"location":"lesson/contributing-to-obo-ontologies/#practice-1","title":"Practice 1","text":""},{"location":"lesson/contributing-to-obo-ontologies/#add-new-terms-with-an-equivalance-axiom-to-mondo","title":"Add New Terms with an Equivalance Axiom to Mondo:","text":""},{"location":"lesson/contributing-to-obo-ontologies/#creating-a-new-class_1","title":"Creating a new class","text":"New classes are created in the Class hierarchy panel on the left.
There are three buttons at the top of the class hierarchy view. These allow you to add a subclass (L-shaped icon), add a sibling class (c-shaped icon), or delete a selected class (x'd circle).
"},{"location":"lesson/contributing-to-obo-ontologies/#practice-adding-a-new-term","title":"Practice adding a new term:","text":""},{"location":"lesson/contributing-to-obo-ontologies/#add-the-new-term-mycotoxin-allergy","title":"Add the new term 'mycotoxin allergy'","text":"Equivalence axioms in Mondo are added according to Dead Simple Ontology Design Patterns (DOSDPs). You can view all of the design patterns in Mondo by going to code/src/patterns/dosdp-patterns/
For this class, we want to follow the design pattern for allergy.
As noted above, equivalence axioms in Mondo are added according to Dead Simple Ontology Design Patterns (DOSDPs). You can view all of the design patterns in Mondo by going to code/src/patterns/dosdp-patterns/
For this class, we want to follow the design pattern for acquired.
Develop skills to lead a new or existing OBO project, or reference ontology develoment.
"},{"location":"lesson/developing-an-obo-ontology/#learning-objectives","title":"Learning objectives","text":"Please complete the following and then continue with this tutorial below:
By the end of this session, you should be able to:
robot merge
robot reason
robot annotate
Like software, official OBO Foundry ontologies have versioned releases. This is important because OBO Foundry ontologies are expected to be shared and reused. Since ontologies are bound to change over time as more terms are added and refined, other developers need stable versions to point to so that there are no surprises. OBO Foundry ontologies use GitHub releases to maintain these stable copies of older versions.
Generally, OBO Foundry ontologies maintain an \"edit\" version of their file that changes without notice and should not be used by external ontology developers because of this. The edit file is used to create releases on a (hopefully) regular basis. The released version of an OBO Foundry ontology is generally a merged and reasoned version of the edit file. This means that all modules and imports are combined into one file, and that file has the inferred class hierarchy actually asserted. It also often has some extra metadata, including a version IRI. OBO Foundry defines the requirements for version IRIs here.
"},{"location":"lesson/developing-an-obo-ontology/#the-release-workflow-process-should-be-stable-and-can-be-written-as-a-series-of-steps-for-example","title":"The release workflow process should be stable and can be written as a series of steps. For example:","text":"robot template
robot merge
robot reason
robot annotate
Since we can turn these steps into a series of commands, we can create a Makefile
that stores these as \"recipes\" for our ontology release!
report
and query
convert
, extract
, and template
merge
, reason
, annotate
, and diff
These materials are under construction and incomplete.
"},{"location":"lesson/developing-application-ontologies/#prerequisites","title":"Prerequisites","text":"Description: Combining ontology subsets for use in a project.
"},{"location":"lesson/developing-application-ontologies/#learning-objectives","title":"Learning objectives","text":"All across the biomedical domain, we refer to domain entities (such as chemicals or anatomical parts) using identifiers, often from controlled vocabularies.
The decentralised evolution of scientific domains has led to to the emergence of disparate \"semantic spaces\" with different annotation practices and reference vocabularies and formalisms.
To bridge between these spaces, entity mappings have emerged, which link, for example, genes from HGNC to ENSEMBL, diseases between OMIM and Mondo and anatomical entities between FMA and Uberon.
Entity matching is the process of establishing a link between an identifier in one semantic space to an identifier in another. There are many cultures of thought around entity matching, including Ontology Matching, Entity Resolution and Entity Linking.
"},{"location":"lesson/entity-matching/#table-of-contents","title":"Table of Contents","text":"The excellent OpenHPI course on Knowledge Engineering with Semantic Web Technologies gives a good overview:
Another gentle overview on Ontology Matching was taught as part of the Knowledge & Data course at Vrije Universiteit Amsterdam.
"},{"location":"lesson/entity-matching/#basic-tutorials","title":"Basic tutorials","text":"In the following, we consider a entity a symbol that is intended to refer to a real world entity, for example:
rdfs:label
\"Friedreichs Ataxia\". The label itself is not necessarily a term - it could change, for example to \"Friedreichs Ataxia (disease)\", and still retain the same meaning.Friedreich's Ataxia
\" (example on the left) may be a term in my controlled vocabulary which I understand to correspond to that respective disease (not all controlled vocabularies have IDs for their terms). This happens for example in clinical data models that do not use formal identifiers to refer to the values of slots in their data model, like \"MARRIED\" in /datamodel/marital_status.In our experience, there are roughly four kinds of mappings:
cheese sandwich (wikidata:Q2734068)
to sandwich (wikidata:Q111836983)
and cheese wikidata:Q10943
. These are the rarest and most complicated kinds of mappings and are out of scope for this lesson.In some ways, these four kinds of mappings can be very different. We do believe, however, that there are enough important commonalities such as common features, widely overlapping use cases and overlapping toolkits to consider them together. In the following, we will discuss these in more detail, including important features of mappings and useful tools.
"},{"location":"lesson/entity-matching/#important-features-of-mappings","title":"Important features of mappings","text":"Mappings have historically been neglected as second-class citizens in the medical terminology and ontology worlds - the metadata is insufficient to allow for precise analyses and clinical decision support, they are frequently stale and out of date, etc. The question \"Where can I find the canonical mappings between X and Y\"? is often shrugged off and developers are pointed to aggregators such as OxO or UMLS which combine manually curated mappings with automated ones causing \"mapping hairballs\".
There are many important metadata elements to consider, but the ones that are by far the most important to consider one way or another are:
Whenever you handle mappings (either create, or re-use), make sure you are keenly aware of at least these three metrics, and capture them. You may even want to consider using a proper mapping model like the Simple Shared Standard for Ontology Mappings (SSSOM) which will make your mappings FAIR and reusable.
"},{"location":"lesson/entity-matching/#string-string-mappings","title":"String-string mappings","text":"String-string mappings are mappings that relate two strings. The task of matching two strings is ubiquitous for example in database search fields (where a user search string needs to be mapped to some strings in a database). Most, if not all effective ontology matching techniques will employ some form of string-string matching. For example, to match simple variations of labels such as \"abnormal heart\" and \"heart abnormality\", various techniques such as Stemming and bag of words can be employed effectively. Other techniques such as edit-distance or Levenshtein can be used to quantify the similarity of two strings, which can provide useful insights into mapping candidates.
"},{"location":"lesson/entity-matching/#string-entity-mappings-synonyms","title":"String-entity mappings / synonyms","text":"String-entity mappings relate a specific string or \"label\" to their corresponding term in a terminology or ontology. Here, we refer to these as \"synonyms\", but there may be other cases for string-entity mappings beyond synonymy.
There are a lot of use cases for synonyms so we will name just a few here that are relevant to typical workflows of Semantic Engineers in the life sciences.
Thesauri are reference tools for finding synonyms of terms. Modern ontologies often include very rich thesauri, with some ontologies like Mondo capturing more than 70,000 exact and 35,000 related synonyms. They can provide a huge boost to traditional NLP pipelines by providing synonyms that can be used for both Named Entity Recognition and Entity Resolution. Some insight on how, for example, Uberon was used to boost text mining can be found here.
"},{"location":"lesson/entity-matching/#entity-entity-mappings-ontology-mappings","title":"Entity-entity mappings / ontology mappings","text":"Entity-entity mappings relate a entity (or identifier), for example a class in an ontology, to another entity, usually from another ontology or database. The entity-entity case of mappings is what most people in the ontology domain would understand when they hear \"ontology mappings\". This is also what most people understand when they here \"Entity Resolution\" in the database world - the task of determining whether, in essence, two rows in a database correspond to the same thing (as an example of a tool doing ER see deepmatcher, or py-entitymatcher). For a list standard entity matching toolkit outside the ontology sphere see here.
"},{"location":"lesson/entity-matching/#monarch-obo-training-tutorials","title":"Monarch OBO Training Tutorials","text":""},{"location":"lesson/entity-matching/#introduction-to-semantic-entity-matching","title":"Introduction to Semantic Entity Matching","text":""},{"location":"lesson/entity-matching/#how-are-mappings-collected-in-practice","title":"How are mappings collected in practice?","text":"Mappings between terms/identifiers are typically collected in four ways:
The main trade-off for mappings is very simple: 1. Automated mappings are very error prone (not only are they hugely incomplete, they are also often faulty). 1. Human curated mappings are very costly.
--> The key for any given mapping project is to determine the highest acceptable error rate, and then distribute the workload between human and automated matching approaches. We will discuss all three ways of collecting mappings in the following.
Aside from the main tradeoff above, there are other issues to keep in mind: - Manually curated mappings are far from perfect. Most of the cost of mapping review lies in the decision how thorough a mapping should be reviewed. For example, a human reviewer may be tasked with reviewing 1000 mappings. If the acceptable error rate is quite high, the review may simply involve the comparison of labels (see here), which may take around 20 seconds. A tireless reviewer could possibly accept or dismiss 1000 mappings just based on the label in around 6 hours. Note that this is hardly better than what most automated approaches could do nowadays. - Some use cases involve so much data that manual curation is nearly out of the question.
"},{"location":"lesson/entity-matching/#manual-curation-of-mappings","title":"Manual curation of mappings","text":"It is important to remember that matching in its raw form should not be understood to result in semantic mappings. The process of matching, in particular lexical or fuzzy semantic matching is error prone and usually better treated as resulting in candidates for mappings. This means that when we calculate the effort of a mapping project, we should always factor in the often considerable effort required by a human to verify the correctness of a candidate mapping. There are many tools that can help with this process, for example by filtering out conflicting lower-confidence mappings, but in the end the reality is this: due to the fact that source and target do not share the same semantics, mappings will always be a bit wobbly. There are two important kinds of review which are very different:
orange juice [wikidata:Q219059]
and orange juice (unpasteurized) [FOODON:00001277]
may not be considered as the same thing in the sense of skos:exactMatch
. oak lexmatch
this usually involves hacking labels and synonyms by removing or replacing words. More sophisticated matchers like Agreement Maker Light (AML) have many more tuning options, and it requires patience and expertise to find the right ones. One good approach here is to include semantically or lexically similar matches in the results, and review if generally consistent patterns of lexical variation can be spotted. For example: orange juice (liquid) [FOODON:00001001]
seems to be exactly what orange juice [wikidata:Q219059]
is supposed to mean. The labels are not the same, but lexically similar: a simple lexical distance metric like Levenshtein could have been used to identify these.Tip: always keep a clear visible list of unmapped classes around to sanity check how good your mapping has been so far.
"},{"location":"lesson/entity-matching/#automated-matching","title":"Automated matching","text":"There are many (many) tools out there that have been developed for entity matching. A great overview can be found in Euzenats Ontology Matching. Most of the matchers apply a mix of lexical and semantic approaches.
As a first pass, we usually rely on a heuristic that an exact match on the label is strong evidence that the two entities correspond to the same thing. Obviously, this cannot always be the case Apple
(the fruit) and Apple
(the company) are two entirely different things, yet a simple matching tool (like OAK lexmatch
) would return these as matching. The reason why this heuristic works in practice is because we usually match between already strongly related semantic spaces, such as two gene databases, two fruit ontologies or two disease terminologies. When the context is narrow, lexical heuristics have a much lower chance to generate excessively noisy mappings.
After lexical matchings are created, other techniques can be employed, including syntactic similarity (match all entities which have labels that are more than 80% similar and end with disease
) and semantic similarity (match all entities whose node(+graph)-embedding have a cosine similarity of more than 80%). Automated matching typically results in a large number of false positives that need to be filtered out using more sophisiticated approaches for mapping reconciliation.
The refinement step may involve automated approaches that are sensitive to the logical content of the sources involved (for example by ensuring that the result does not result in equivalence cliques, or unsatisfiable classes), but more often than not, human curators are employed to curate the mapping candidates generated by the various automated approaches.
"},{"location":"lesson/entity-matching/#some-examples-of-domain-specific-mapping-of-importance-to-the-biomedical-domain","title":"Some examples of domain-specific mapping of importance to the biomedical domain","text":""},{"location":"lesson/entity-matching/#phenotype-ontology-mappings","title":"Phenotype ontology mappings","text":"Mapping phenotypes across species holds great promise for leveraging the knowledge generated by Model Organism Database communities (MODs) for understanding human disease. There is a lot of work happening at the moment (2021) to provide standard mappings between species specific phenotype ontologies to drive translational research (example). Tools such as Exomiser leverage such mappings to perform clinical diagnostic tasks such as variant prioritisation. Another app you can try out that leverages cross-species mappings is the Monarch Initiatives Phenotype Profile Search.
"},{"location":"lesson/entity-matching/#disease-ontology-mappings","title":"Disease ontology mappings","text":"Medical terminology and ontology mapping is a huge deal in medical informatics (example). Mondo is a particularly rich source of well provenanced disease ontology mappings.
"},{"location":"lesson/entity-matching/#further-reading","title":"Further reading","text":"Sign up for a free GitHub account
"},{"location":"lesson/getting-hands-on/#preparation","title":"Preparation","text":"No advance preparation is necessary.
Optional: If you are unfamiliar with ontologies, this introduction to ontologies explanation may be helpful.
"},{"location":"lesson/getting-hands-on/#what-is-delivered-as-part-of-the-course","title":"What is delivered as part of the course","text":"Description: The purpose of this lesson is to train biomedical researchers on how to find a term, what to do if they find too many terms, how to decide on which term to use, and what to do if no term is found.
"},{"location":"lesson/getting-hands-on/#learning-objectives","title":"Learning objectives","text":"This how to guide on How to be an Open Science Engineer - maximizing impact for a better world has a lot of details about the philosophy behind open science ontology engineering. Some key points are summarized below.
See lesson on Using Ontologies and Ontology Terms
"},{"location":"lesson/getting-hands-on/#how-to-make-new-term-requests","title":"How to make new term requests","text":"See How to guide on Make term requests to existing ontologies
"},{"location":"lesson/getting-hands-on/#exercise","title":"Exercise","text":"Pull Requests lesson
"},{"location":"lesson/hackathon/#outline","title":"Outline","text":"In this lesson, we will give an intuition of how to work with object properties
in OBO ontologies, also referred to as \"relations\".
We will cover, in particular, the following subjects:
We have worked with the University of Manchester to incorporate the Family History Knowledge Base Tutorial fully into OBO Academy.
This is it: OBOAcademy: Family History - Modelling with Object Properties.
In contrast to the Pizza tutorial, the Family history tutorial focuses on modelling with individuals. Chapters 4, 5, 8 and 9 are full of object property modelling, and are not only great to get a basic understanding of using them in your ontology, but also give good hints at where OWL and object properties fall short. We refer to the FHKB in the following and expect you to have completed at least chapter 5 before reading on.
"},{"location":"lesson/modelling-with-object-properties/#the-role-of-object-properties-in-the-obo-sphere","title":"The Role of Object Properties in the OBO-sphere","text":"To remind ourselves, there are three different types of relations in OWL:
For some example usage, run the following query in the ontobee OLS endpoint:
http://www.ontobee.org/sparql
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nprefix owl: <http://www.w3.org/2002/07/owl#>\nSELECT distinct *\nWHERE {\nGRAPH ?graph_uri\n{ ?dp rdf:type owl:DatatypeProperty .\n ?sub ?dp ?obj }\n}\n
Note that many uses of data properties across OBO are a bit questionable, for example, you do never want to attach a modification dates or similar to your classes using data properties, as these fall under OWL semantics. This means that logically, if a superclass has a relation using a DatatypeProperty, then this relation _holds for all subclasses of that class as well.
Annotation properties are similar to data properties, but they are outside of OWL semantics, i.e. OWL reasoners and reasoning do not care, in fact ignore, anything related to annotation properties. This makes them suitable for attaching metadata like labels etc to our classes and properties. We sometimes use annotation properties even to describe relationships between classes if we want reasoners to ignore them. The most typical example is IAO:replaced_by, which connects an obsolete term with its replacement. Widely used annotation properties in the OBO-sphere are standardised in the OBO Metadata Ontology (OMO).
The main type of relation we use in OBO Foundry are object properties. Object properties relate two individuals or classes with each other, for example:
OWLObjectPropertyAssertion(:part_of, :heart, :cardiovascular_system)\n
In the same way as annotation properties are maintained in OMO (see above), object properties are maintained in the Relation Ontology (RO).
Object properties are of central importance to all ontological modelling in the OBO sphere, and understanding their semantics is critical for any put the most trivial ontologies. We assume the reader to have completed the Family History Tutorial mentioned above.
"},{"location":"lesson/modelling-with-object-properties/#object-property-semantics-in-obo","title":"Object property semantics in OBO","text":"In our experience, these are the most widely used characteristics we specify about object properties (OP):
ecologically co-occurs with
in RO has the domain 'organism or virus or viroid'
, which means that whenever anything ecologically co-occurs with
something else, it will be inferred to be a 'organism or virus or viroid'
.produced by
has the domain material entity
. Note that in ontologies, ranges are slightly less powerful then domains: If we have a class Moderna Vaccine
which is SubClass of 'produced by' some 'Moderna'
we get that Moderna Vaccine
is a material entity
due to the domain constraint, but NOT that Moderna
is a material entity
due to the range constraint (explanation to this is a bit complicated, sorry).Other characteristics like functionality and symmetry are used across OBO ontologies, but not nearly to the same extend as the 5 described above.
"},{"location":"lesson/modelling-with-object-properties/#the-relation-ontology-ro","title":"The Relation Ontology (RO)","text":"The Relation Ontology serves two main purposes in the OBO world:
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nprefix owl: <http://www.w3.org/2002/07/owl#>\nSELECT distinct ?graph_uri ?s\nWHERE {\nGRAPH ?graph_uri\n{ ?s rdf:type owl:ObjectProperty ;\n rdfs:label \"part of\" . }\n}\n
On the OntoBee SPARQL endpoint still reveals a number of ontologies using non-standard part-of relations. In our experience, most of these are accidental due to past format conversions, but not all. This problem was much worse before RO came along, and our goal is to unify the representation of key properties like \"part of\" across all OBO ontologies. The OBO Dashboard checks for object properties that are not aligned with RO.
To add a relationship we usually follow the following process. For details, please refer to the RO documentation.
These materials are under construction and incomplete.
"},{"location":"lesson/ontology-design/#prerequisites","title":"Prerequisites","text":"Participants will need to have access to the following resources and tools prior to the training:
Description: This course will cover reasoning with OWL.
"},{"location":"lesson/ontology-design/#learning-objectives","title":"Learning objectives","text":"At the end of this lesson, you should know how to do:
OpenHPI Course Content
In OWL, we use object properties to describe binary relationships between two individuals (or instances). We can also use the properties to describe new classes (or sets of individuals) using restrictions. A restriction describes a class of individuals based on the relationships that members of the class participate in. In other words, a restriction is a kind of class, in the same way that a named class is a kind of class.
For example, we can use a named class to capture all the individuals that are idiopathic diseases. But we could also describe the class of idiopathic disease as all the instances that are 'has modifier' idiopathic disease.
In OWL, there are three main types of restrictions that can be placed on classes. These are quantifier restriction, cardinality restrictions, and hasValue restriction. In this tutorial, we will initially focus on quantifier restrictions.
Quantifier restrictions are further categorized into two types, the existential and the universal restriction.
idiopathic disease
class. In Protege, the keyword 'some' is used to denote existential restrictions.In this tutorial, we will deal exclusively with the existential (some) quantifier.
"},{"location":"lesson/ontology-design/#superclass-restrictions","title":"Superclass restrictions","text":"Strictly speaking in OWL, you don't make relationships between classes, however, using OWL restrictions we essentially achieve the same thing.
We wanted to capture the knowledge that the named class 'idiopathic achalasia' is an idiopathic disease. In OWL speak, we want to say that every instance of an ' idiopathic achalasia' is also an instance of the class of things that have at least one 'has modifier' relationship to an idiopathic disease. In OWL, we do this by creating an existential restriction on the idiopathic achalasia class.
This example introduces equivalence axioms or defined classes (also called logical definitions) and automatic classification.
The example involves classification of Mendelian diseases that have a monogenic (single gene) varation. These equivalence axioms are based off the Mondo Design Pattern disease_series_by_gene.
Constructs:
'cardioacrofacial dysplasia 1'
'cardioacrofacial dysplasia'
that has dysfunction in the PRKACA gene.'cardioacrofacial dysplasia' and ('disease has basis in dysfunction of' some PRKACA)
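As referenced above, the automatic classification driven by such equivalence axioms can also be computed outside Protege; a minimal sketch using ROBOT with the ELK reasoner (the file names here are only illustrative):
# classify the ontology and assert the inferred subclass axioms\nrobot reason --reasoner ELK --input mondo-edit.obo --output mondo-reasoned.owl\n
After reasoning, classes whose logical definitions match the equivalence axiom are automatically placed underneath the defined class.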
For teaching purposes, let's say we need a new class that is 'fungal allergy'.
By default, OWL assumes that these classes can overlap, i.e. there are individuals who can be instances of more than one of these classes. We want to create a restriction on our ontology that states these classes are different and that no individual can be a member of more than one of these classes. We can say this in OWL by creating a disjoint classes axiom.
Below we'll review an example of one class and how to fix it. Next you should review and fix another one on your own and create a pull request for Nicole or Nico to review. Note, fixing these may require a bit of review and subjective decision making and the fix described below may not necessarily apply to each case.
Bickerstaff brainstem encephalitis
: To understand why this class appeared under owl:Nothing, first click the ? next to owl:Nothing in the Description box. (Note, this can take a few minutes).Guillain-Barre syndrome
, which is a child of syndromic disease
.Bickerstaff brainstem encephalitis
is an appropriate child of regional variant of Guillain-Barre syndrome
. Note, Mondo integrates several disease terminologies and ontologies, and brought in all the subclass hierarchies from these source ontologies. To see the source of this superclass assertion, click the @ next to the assertion.regional variant of Guillain-Barre syndrome
(see this paper and this paper. It seems a bit unclear what the relationship of BBE is to Guillain-Barre syndrome. This also brings into the question if a disease can be syndromic and an infectious disease - maybe this disjoint axiom is wrong, but let's not worry about this for the teaching purposes.)These materials are under construction and incomplete.
"},{"location":"lesson/ontology-development/#prerequisites","title":"Prerequisites","text":"These materials are under construction and incomplete.
"},{"location":"lesson/ontology-fundamentals/#prerequisites","title":"Prerequisites","text":"BDK14_exercises
from your file systembasic-subclass/chromosome-parts.owl
in Prot\u00e9g\u00e9, then do the following exercises:basic-restriction/er-sec-complex.owl
in Prot\u00e9g\u00e9, then do the following exercise:basic-dl-query/cc.owl
in Prot\u00e9g\u00e9, then do the following exercises:owl:Nothing
is defined as the very bottom node of an ontology, therefore the DL query results will show owl:Nothing
as a subclass. This is expected and does not mean there is a problem with your ontology! It's only bad when something is a subclass of owl:Nothing
and therefore unsatisfiable (more on that below).basic-classification/ubiq-ligase-complex.owl
in Prot\u00e9g\u00e9, then do the following exercises:Description: Learn the fundamentals of ontologies.
"},{"location":"lesson/ontology-fundamentals/#learning-objectives","title":"Learning objectives","text":"robot convert
(Review; ~15 minutes)robot extract
(Review; ~15 minutes)robot template
(Review; ~15 minutes)These materials are under construction and may be incomplete.
"},{"location":"lesson/ontology-pipelines/#prerequisites","title":"Prerequisites","text":"convert
, extract
and template
annotate
, merge
, reason
and diff
There are two basic ways to edit an ontology: 1. Manually, using tools such as Protege, or 2. Using computational tools such as ROBOT.
Both have their advantages and disadvantages: manual curation is often more practical for one-off changes that do not follow a standard pattern, such as adding a textual definition or a synonym to a single term, while automated approaches are usually much more scalable for systematic changes (for example, ensuring that all axioms in the ontology are consistent, that imported terms from external ontologies are up-to-date, or that all labels start with a lower-case letter).
Here, we will do a first dive into the \"computational tools\" side of the edit process. We strongly believe that the modern ontology curator should have a basic set of computational tools in their Semantic Engineering toolbox, and many of the lessons in this course should apply to this role of the modern ontology curator.
ROBOT is one of the most important tools in the Semantic Engineering Toolbox. For a bit more background on the tool, please refer to the paper ROBOT: A Tool for Automating Ontology Workflows.
We also recommend getting a basic familiarity with SPARQL, the query language of the Semantic Web, which in combination with ROBOT can be a powerful way to perform changes and quality control checks on your ontology.
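To give a flavour of what this looks like in practice, here is a minimal, hypothetical sketch of a chained ROBOT pipeline that merges an ontology with its imports, classifies it with ELK, and then runs a SPARQL quality control query (the file names are made up; the commands and options follow the ROBOT documentation):
# merge, reason and run a SPARQL QC query in one chained ROBOT call\nrobot merge --input my-ontology-edit.owl reason --reasoner ELK query --query checks/missing-labels.sparql results/missing-labels.tsv\n
Each command in the chain passes its ontology on to the next one, so no intermediate files are needed.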
"},{"location":"lesson/ontology-pipelines/#additional-materials-and-resources","title":"Additional materials and resources","text":""},{"location":"lesson/ontology-pipelines/#contributors","title":"Contributors","text":"These materials are under construction and may be incomplete.
"},{"location":"lesson/ontology-term-use/#prerequisites","title":"Prerequisites","text":"Description: Using ontology terms for annotations and structuring data.
"},{"location":"lesson/ontology-term-use/#learning-objectives","title":"Learning objectives","text":"Ontologies provide a logical classification of information in a particular domain or subject area. Ontologies can be used for data annotations, structuring disparate data types, classifying information, inferencing and reasoning across data, and computational analyses.
"},{"location":"lesson/ontology-term-use/#difference-between-a-terminology-and-an-ontology","title":"Difference between a terminology and an ontology","text":""},{"location":"lesson/ontology-term-use/#terminology","title":"Terminology","text":"A terminology is a collection of terms; a term can have a definition and synonyms.
"},{"location":"lesson/ontology-term-use/#ontology","title":"Ontology","text":"An ontology contains a formal classification of terminology in a domain that provides textual and machine readable definitions, and defines the relationships between terms. An ontology is a terminology, but a terminology is not (necessarily) an ontology.
"},{"location":"lesson/ontology-term-use/#2-finding-good-ontologies","title":"2. Finding good ontologies","text":"Numerous ontologies exist. Some recommended sources to find community developed, high quality, and frequently used ontologies are listed below.
example of usage
.The OBO Foundry is a community of ontology developers that are committed to developing a library of ontologies that are open, interoperable ontologies, logically well-formed and scientifically accurate. OBO Foundry participants follow and contribute to the development of an evolving set of principles including open use, collaborative development, non-overlapping and strictly-scoped content, and common syntax and relations, based on ontology models that work well, such as the Gene Ontology (GO).
The OBO Foundry is overseen by an Operations Committee with Editorial, Technical and Outreach working groups.
"},{"location":"lesson/ontology-term-use/#find-terms-using-ontology-browsers","title":"Find terms using ontology browsers","text":"Various ontology browsers are available, we recommend using one of the ontology browsers listed below.
Some considerations for determining which ontologies to use include the license and quality of the ontology.
"},{"location":"lesson/ontology-term-use/#license","title":"License","text":"Licenses define how an ontology can legally be used or reused. One requirement for OBO Foundry Ontologies is that they are open, meaning that the ontologies are openly and freely available for use with acknowledgement and without alteration. OBO ontologies are required to be released under a Creative Commons CC-BY license version 3.0 or later, OR released into the public domain under CC0. The license should be clearly stated in the ontology file.
"},{"location":"lesson/ontology-term-use/#quality","title":"Quality","text":"Some criteria that can be applied to determine the quality of an ontology include:
Data can be mapped to ontology terms manually, using spreadsheets, or via curation tools such as:
The figure below by Chris Mungall on his blog post on How to select and request terms from ontologies describes a workflow on searching for identifying missing terms from an ontology.
"},{"location":"lesson/ontology-term-use/#7-make-term-requests-to-existing-ontologies","title":"7. Make term requests to existing ontologies","text":"See separate lesson on Making term requests to existing ontologies.
"},{"location":"lesson/ontology-term-use/#8-differences-between-iris-curies-and-labels","title":"8. Differences between IRIs, CURIEs, and labels","text":""},{"location":"lesson/ontology-term-use/#uri","title":"URI","text":"A uniform resource identifier (URI) is a string of characters used to identify a name or a resource.
"},{"location":"lesson/ontology-term-use/#url","title":"URL","text":"A URL is a URI that, in addition to identifying a network-homed resource, specifies the means of acting upon or obtaining the representation.
A URL such as this one:
https://github.com/obophenotype/uberon/blob/master/uberon_edit.obo
has three main parts:
The protocol tells you how to get the resource. Common protocols for web pages are http (HyperText Transfer Protocol) and https (HTTP Secure). The host is the name of the server to contact (the where), which can be a numeric IP address, but is more often a domain name. The path is the name of the resource on that server (the what), here the Uberon anatomy ontology file.
"},{"location":"lesson/ontology-term-use/#iri","title":"IRI","text":"A Internationalized Resource Identifiers (IRI) is an internet protocol standard that allows permitted characters from a wide range of scripts. While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the Universal Character Set (Unicode/ISO 10646), including Chinese or Japanese kanji, Korean, Cyrillic characters, and so forth. It is defined by RFC 3987.
More information is available here.
"},{"location":"lesson/ontology-term-use/#curies","title":"CURIEs","text":"A Compact URI (CURIE) consists of a prefix and a suffix, where the prefix stands in place of a longer base IRI.
By converting the prefix and appending the suffix we get back to full IRI. For example, if we define the obo prefix to stand in place of the IRI as: http://purl.obolibrary.org/obo/, then the CURIE obo:UBERON_0002280 can be expanded to http://purl.obolibrary.org/obo/UBERON_0002280, which is the UBERON Anatomy term for \u2018otolith\u2019. Any file that contains CURIEs need to define the prefixes in the file header.
"},{"location":"lesson/ontology-term-use/#label","title":"Label","text":"A label is the textual, human readable name that is given to a term, class property or instance in an ontology.
"},{"location":"lesson/rdf/","title":"Introduction to RDF","text":"First Instructor: James Overton Second Instructor: Becky Jackson
"},{"location":"lesson/rdf/#warning","title":"Warning","text":"These materials are under construction and incomplete.
"},{"location":"lesson/rdf/#description","title":"Description","text":"Modelling and querying data with RDF triples, and working with RDF using tables
"},{"location":"lesson/rdf/#topics","title":"Topics","text":"OpenHPI Linked Data Engineering (2016)
Using Databases and SQL
"},{"location":"lesson/rdf/#new-material","title":"New Material","text":"These materials are under construction and incomplete.
"},{"location":"lesson/semantic-database-fundamentals/#prerequisites","title":"Prerequisites","text":"Description: Using ontology terms in a database.
"},{"location":"lesson/semantic-database-fundamentals/#learning-objectives","title":"Learning objectives","text":"Ontologies are notoriously hard to edit. This makes it a very high burden to edit ontologies for anyone but a select few. However, many of the contents of ontologies are actually best edited by domain experts with often little or known ontological training - editing labels and synonyms, curating definitions, adding references to publications and many more. Furthermore, if we simply remove the burden of writing OWL axioms, editors with very little ontology training can actually curate even logical content: for example, if we want to describe that a class is restricted to a certain taxon (also known as taxon-restriction), the editor is often capable to select the appropriate taxon for a term (say, a \"mouse heart\" is restricted to the taxon of Mus musculus), but maybe they would not know how to \"add that restriction to the ontology\".
Tables are great (for a deep dive into tables and triples see here). Scientists in particular love tables, and, even more importantly, can be trained easily to edit data in spreadsheet tools, such as Google Sheets or Microsoft Excel.
Ontology templating systems, such as DOSDP templates, ROBOT templates and Reasonable Ontology Templates (OTTR) allow separating the raw data in the ontology (labels, synonyms, related ontological entities, descriptions, cross-references and other metadata) from the OWL language patterns that are used to manifest them in the ontology. There are three main ingredients to a templating system:
In OBO we are currently mostly concerned with ROBOT templates and DOSDP templates. Before moving on, we recommend to complete a basic tutorial in both:
Ontologies, especially in the biomedical domain, are complex and, while growing in size, increasingly hard to manage for their curators. In this section, we will look at some of the key differences of two popular templating systems in the OBO domain: Dead Simple Ontology Design Patterns (DOSDPs) and ROBOT templates. We will not cover the rationale for templates in general in much depth (the interested reader should check ontology design patterns and Reasonable Ontology Templates (OTTR): Motivation and Overview, which pertains to a different system, but applies none-the-less in general), and focus on making it easier for developers to pick the right templating approach for their particular use case. We will first discuss in detail representational differences, before we go through the functional ones and delineate use cases.
"},{"location":"lesson/templates-for-obo/#structural-differences-formats-and-tools","title":"Structural differences, formats and tools","text":""},{"location":"lesson/templates-for-obo/#dosdp-templates-structure-and-format","title":"DOSDP templates: structure and format","text":"DOSDP separates data and templates into two files: a yaml file which defines the template, and a TSV file which holds the data. Lets look at s example.
The template: abnormalAnatomicalEntity
pattern_name: abnormalAnatomicalEntity\npattern_iri: http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml\ndescription: \"Any unspecified abnormality of an anatomical entity.\"\n\ncontributors:\n - https://orcid.org/0000-0002-9900-7880\n - https://orcid.org/0000-0001-9076-6015\n - https://orcid.org/0000-0003-4148-4606\n - https://orcid.org/0000-0002-3528-5267\n\nclasses:\n quality: PATO:0000001\n abnormal: PATO:0000460\n anatomical entity: UBERON:0001062\n\nrelations:\n inheres_in_part_of: RO:0002314\n has_modifier: RO:0002573\n has_part: BFO:0000051\n\nannotationProperties:\n exact_synonym: oio:hasExactSynonym\n\nvars:\n anatomical_entity: \"'anatomical entity'\"\n\nname:\n text: \"abnormal %s\"\n vars:\n - anatomical_entity\n\nannotations:\n - annotationProperty: exact_synonym\n text: \"abnormality of %s\"\n vars:\n - anatomical_entity\n\ndef:\n text: \"Abnormality of %s.\"\n vars:\n - anatomical_entity\n\nequivalentTo:\n text: \"'has_part' some ('quality' and ('inheres_in_part_of' some %s) and ('has_modifier' some 'abnormal'))\"\n vars:\n - anatomical_entity\n
The data: abnormalAnatomicalEntity.tsv
defined_class defined_class_label anatomical_entity anatomical_entity_label HP:0040286 Abnormal axial muscle morphology UBERON:0003897 axial muscle HP:0011297 Abnormal digit morphology UBERON:0002544 digit"},{"location":"lesson/templates-for-obo/#robot-templates-structure-and-format","title":"ROBOT templates: structure and format","text":"ROBOT encodes both the template and the data in the same TSV; after the table header, the second row basically encodes the entire template logic, and the data follows in table row 3.
ID Label EQ Anatomy Label ID LABEL EC 'has_part' some ('quality' and ('inheres_in_part_of' some %) and ('has_modifier' some 'abnormal')) HP:0040286 Abnormal axial muscle morphology UBERON:0003897 axial muscle HP:0011297 Abnormal digit morphology UBERON:0002544 digitNote that for the Anatomy Label
we deliberately left the second row empty, which instructs the ROBOT template tool to completely ignore this column.
From an ontology engineering perspective, the essence of the difference between DOSDP and ROBOT templates could be captured as follows:
DOSDP templates are more about generating annotations and axioms, while ROBOT templates are more about curating annotations and axioms.\n
Curating annotations and axioms
means that an editor, or ontology curator, manually enters the labels, synonyms, definitions and so forth into the spreadsheet.
Generating axioms
in the sense of this section means that we try to automatically generate labels, synonyms, definitions and so forth based on the related logical entities in the patterns. E.g., using the example template above, the label \"abnormal kidney\" would automatically be generated when the Uberon term for kidney is supplied.
While both ROBOT and DOSDP can be used for \"curation\" of annotation of axioms, DOSDP seeks to apply generation rules to automatically generate synonyms, labels, definitions and so forth while for ROBOT template seeks to collect manually curated information in an easy-to-use table which is then compiled into OWL. In other words:
However, there is another dimension in which both approaches differ widely: sharing and re-use. DOSDPs by far the most important feature is that it allows a community of developers to rally around a modelling problem, debate and establish consensus; for example, a pattern can be used to say: this is how we model abnormal anatomical entities. Consensus can be made explicit by \"signing off\" on the pattern (e.g. by adding your ORCId to the list of contributors), and due to the template/data separation, the template can be simply imported using its IRI (for example http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml) and re-used by everyone. Furthermore, additional metadata fields including textual descriptions, and more recently \"examples\", make DOSDP template files comparatively easy to understand, even by a less technically inclined editor.
ROBOT templates on the other hand do not lend themselves to community debates in the same way; first of all, they are typically supplied including all data merged in; secondly, they do not provide additional metadata fields that could, for example, conveniently be used to represent a sign off (you could, of course, add the ORCId's into a non-functional column, or as a pipe-separated string into a cell in the first or second row; but its obvious that this would be quite clunky) or a textual description. A yaml file is much easier for a human to read and understand then the header of a TSV file, especially when the template becomes quite large.
However, there is a flipside to the strict separation of data and templates. One is that DOSDP templates are really hard to change. Once, for example, a particular variable name was chosen, renaming the variable will require an excessive community-wide action to rename columns in all associated spreadsheets - which requires them all to be known beforehand (which is not always the case). You don't have such a problem with ROBOT templates; if you change a column name, or a template string, everything will continue to work without any additional coordination.
"},{"location":"lesson/templates-for-obo/#summary","title":"Summary","text":"Both ROBOT templates and DOSDP templates are widely used. The author of this page uses both in most of the projects he is involved in, because of their different strengths and capabilities. You can use the following rules of thumb to inform your choice:
Consider ROBOT templates if your emphasis is on
Consider DOSDP templates if your emphasis is on
There is a nice debate going on which questions the use of tables in ontology curation altogether. There are many nuances in this debate, but I want to stylise it here as two schools of thoughts (there are probably hundreds in between, but this makes it easier to follow): The one school (let's call them Tablosceptics) claims that using tables introduces a certain degree of fragility into the development process due to a number of factors, including:
They prefer to use tools like Protege that show the curator immediately the consequences of their actions, like reasoning errors (unintended equivalent classes, unsatisfiable classes and other unintended inferences). The Tablophile school of thought responds to these accusations in essence with \"tools\"; they say that tables are essentially a convenient matrix to input the data (which in turns opens ontology curation to a much wider range of people), and it is up to the tools to ensure that QC is run, hierarchies are being presented for review and weird ID space clashes are flagged up. Furthermore, they say, having a controlled input matrix will actually decrease the number of faulty annotations or axioms (which is evidenced by the large number of wrongful annotation assertions across OBO foundry ontologies I see every day as part of my work). At first sight, both template systems are affected equally by the war of the Tablosceptics and the Tablophile. Indeed, in my on practice, the ID space issue is really problematic when we manage 100s and more templates, and so far, I have not seen a nice and clear solution that ensures that no ID used twice unless it is so intended and respects ID spaces which are often semi-formally assigned to individual curators of an ontology.
Generally in this course we do not want to take a 100% stance. The author of this page believes that the advantage of using tables and involving many more people in the development process outweighs any concerns, but tooling is required that can provide more immediate feedback when such tables such as the ones presented here are curated at scale.
"},{"location":"lesson/using-disease-and-phenotype-ontologies/","title":"Finding and using Disease and Phenotype Ontologies","text":""},{"location":"lesson/using-disease-and-phenotype-ontologies/#prerequisites","title":"Prerequisites","text":"Description: An introduction to the landscape of disease and phenotype terminologies and ontologies, and how they can be used to add value to your analysis.
"},{"location":"lesson/using-disease-and-phenotype-ontologies/#learning-objectives","title":"Learning objectives","text":"A landscape analysis of major disease and phenotype ontologies that are currently available is here (also available in Zenodo here).
"},{"location":"lesson/using-disease-and-phenotype-ontologies/#decide-which-phenotype-or-disease-ontology-to-use-for-different-use-cases","title":"Decide which phenotype or disease ontology to use for different use cases","text":"Different ontologies are built for different purposes and were created for various reasons. For example, some ontologies are built for text mining purposes, some are built for annotating data and downstream computational analysis.
The unified phenotype ontology (uPheno) aggregates species-specific phenotype ontologies into a unified resource. Several species-specific phenotype ontologies exist, such as the Human Phenotype Ontology, Mammalian Phenotype Ontology (http://www.informatics.jax.org/searches/MP_form.shtml), and many more.
Similarly to the phenotype ontologies, there are many disease ontologies that exist that are specific to certain areas of diseases, such as infectious diseases (e.g. Infectious Disease Ontology), cancer (e.g. National Cancer Institute Thesaurus), rare diseases (e.g. Orphanet), etc.
In addition, there are several more general disease ontologies, such as the Mondo Disease Ontology, the Human Disease Ontology (DO), SNOMED, etc.
Different disease ontologies may be built for different purposes; for example, ontologies like Mondo and DO are intended to be used for classifying data, and downstream computational analyses. Some terminologies are used for indexing purposes, such as the International classification of Diseases (ICD). ICD-11 is intended for indexing medical encounters for the purposes of billing and coding. Some of the disease ontologies listed on the landscape contain terms that define diseases, such as Ontology for General Medical Sciences (OGMS) are upper-level ontologies and are intended for integration with other ontologies.
When deciding on which phenotype or disease ontology to use, some things to consider:
# Early Career Data Scientist
Description: A collection of videos, tutorials, training materials, and exercises targeted towards any entry-level, early-career person interested in learning basic skills in data science.
"},{"location":"pathways/early-career-data-scientist/#1-data-science-ethics","title":"1. Data Science Ethics","text":"Description: Introductory data science ethics videos hosted by Northeat Big Data Innovation Hub.
"},{"location":"pathways/early-career-data-scientist/#videos","title":"Videos","text":"6 videos available here
"},{"location":"pathways/early-career-data-scientist/#2-overview-what-is-data-science","title":"2. Overview: What is Data Science","text":""},{"location":"pathways/early-career-data-scientist/#videos_1","title":"Videos","text":"Note: for the tutorials below PC users need to install ODK (instructions are linked from the tutorial)
Survival strategies for team communication
"},{"location":"pathways/ontology-contributor/","title":"Ontology Contributor Pathway","text":"Description: These guidelines are developed for anyone interested in contributing to ontologies to guide how to contribute to OBO Foundry ontologies.
"},{"location":"pathways/ontology-contributor/#why-should-you-contribute-to-ontology-development-efforts","title":"Why should you contribute to ontology development efforts?","text":"Ontologies are routinely used for data standardization and in analytical analysis, but the ontologies themselves are under constant revisions and iterative development. Building ontologies is a community effort, and we need expertise from different areas:
The OBO foundry ontologies are open, meaning anyone can access them and contribute to them. The types of contributions may include reporting issues, identifying bugs, making requests for new terms or changes, and you can also contribute directly to the ontology itself- if you are familiar with ontology editing workflows, you can download our ontologies and make edits on a branch and make a pull request in GitHub.
"},{"location":"pathways/ontology-contributor/#providing-feedback-to-an-ontology","title":"Providing Feedback to an Ontology","text":"Community feedback is welcome for all open OBO Foundry ontologies. Feedback is often provided in the form of:
Note: There is no one single accepted way of doing ontology curation in the OBO-World, see here. This guide reflects the practice of the GO-style ontology curation, as it is used by GO, Uberon, CL, PATO and others.
Note: Work on this document is still in progress, items that are not linked are currently being worked on.
"},{"location":"pathways/ontology-curator-go-style/#getting-set-up","title":"Getting Set-up","text":"This section is a non-ordered collection of how to documents that a curator might needs
Note: There is no one single accepted way of doing ontology curation in the OBO-World, see here. This guide reflects the practice of the OBI-style ontology curation, as it is used by OBI, IAO and others.
"},{"location":"pathways/ontology-curator-obi-style/#getting-set-up","title":"Getting Set-up","text":""},{"location":"pathways/ontology-curator-obi-style/#learning","title":"Learning","text":""},{"location":"pathways/ontology-curator-obi-style/#learning-git-and-github","title":"Learning Git and GitHub","text":"There is no one single accepted methodology for building ontologies in the OBO-World. We can distinguish at least two major schools of ontology curation
Note that there are many more variants, probably as many as there are ontologies. Both schools differ only in how they curate their ontologies - the final product is always an ontology in accordance with OBO Principles. These are some of the main differences of the two schools:
GO-style OBI-style Edit format Historically developed in OBO format Developed in an OWL format Annotation properties Many annotation properties from the oboInOwl namespace, for example for synonyms and provenance. Many annotation properties from the IAO namespace. Upper Ontology Hesitant alignment with BFO, often uncommitted. Strong alignment with BFO. Logic Tend do be simple existential restrictions (some
), ontologies in OWL 2 EL. No class expression nesting. Simple logical definition patterns geared towards automating classification Tend to use a lot more expressive logic, including only
and not
. Class expression nesting can be more complex. Examples GO, Uberon, Mondo, HPO, PATO, CL, BSPO OBI, IAO, OGMS There are a lot of processes happening that are bringing these schools together, sharing best practices (GitHub, documentation) and reconciling metadata conventions and annotation properties in the OBO Metadata Ontology (OMO). The Upper Level alignment is now done by members of both schools through the Core Ontology for Biology and Biomedicine (COB). While these processes are ongoing, we decided to curate separate pathways for both schools:
As a ontology engineer, it would be useful for you to know how curators work, as such, it would be useful to be familiar with all the concepts in the ontology curator pathways document. This pathways will however be focusing on the engineering side of things.
"},{"location":"pathways/ontology-engineer/#very-basics","title":"Very basics","text":"This section is a non-ordered collection of how to documents that an engineer might need (this includes everything from the curators list as they may be pertinent knowledge to an engineer).
Pathways are materials from OBOOK in a linear fashion for the purpose of helping people in different roles finding the materials relevant to their work more easily. To browse through the pathways, look under the \"Pathways\" menu item.
"},{"location":"reference/chatgpt-prompts-for-ontology-development/","title":"Leveraging ChatGPT for ontology curation","text":""},{"location":"reference/chatgpt-prompts-for-ontology-development/#effective-chatgpt-prompts-for-ontology-development","title":"Effective ChatGPT prompts for ontology development","text":"For a basic tutorial on how to leverage ChatGPT for ontology development see here.
"},{"location":"reference/chatgpt-prompts-for-ontology-development/#act-as-a-mapping-api","title":"Act as a mapping API","text":"I want you to act as a REST API, which takes natural language searches a an input and returns an SSSOM mapping in valid JSON in a codeblock, no comments, no additional text. An example of a valid mapping is
{ \"subject_id\": \"a:something\", \"predicate_id\": \"rdfs:subClassOf\", \"object_id\": \"b:something\", \"mapping_justification\": \"semapv:LexicalMatching\", \"subject_label\": \"XXXXX\", \"subject_category\": \"biolink:AnatomicalEntity\", \"object_label\": \"xxxxxx\", \"object_category\": \"biolink:AnatomicalEntity\", \"subject_source\": \"a:example\", \"object_source\": \"b:example\", \"mapping_tool\": \"rdf_matcher\", \"confidence\": 0.8, \"subject_match_field\": [ \"rdfs:label\" ], \"object_match_field\": [ \"rdfs:label\" ], \"match_string\": [ \"xxxxx\" ], \"comment\": \"mock data\" }
As a first task, I want you to return a suitable mapping for MONDO:0004975 in ICD 10 CM.
"},{"location":"reference/formatting-license/","title":"Formatting your ontology annotations correctly","text":"The new OBO Foundry guidelines encourage the annotation of ontologies with an appropriately formatted description, title and license. Here are some examples that can be used as a guide to implement those in your ontology.
Note: these examples purposefully do not include version information, this should not be manually added, instead it should be added by ROBOT as part of a pipeline. An ontology set up with the ODK will take care of all of this for you.
"},{"location":"reference/formatting-license/#rdfxml-example","title":"RDF/XML Example:","text":"<?xml version=\"1.0\"?>\n<rdf:RDF xmlns=\"http://purl.obolibrary.org/obo/license.owl#\"\n xml:base=\"http://purl.obolibrary.org/obo/license.owl\"\n xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n xmlns:owl=\"http://www.w3.org/2002/07/owl#\"\n xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n xmlns:xml=\"http://www.w3.org/XML/1998/namespace\"\n xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"\n xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"\n xmlns:terms=\"http://purl.org/dc/terms/\">\n <owl:Ontology rdf:about=\"http://purl.obolibrary.org/obo/license.owl\">\n <dc:description rdf:datatype=\"http://www.w3.org/2001/XMLSchema#string\">An integrated and fictional ontology for the description of abnormal tomato phenotypes.</dc:description>\n <dc:title rdf:datatype=\"http://www.w3.org/2001/XMLSchema#string\">Tomato Phenotype Ontology (TPO)</dc:title>\n <terms:license rdf:resource=\"https://creativecommons.org/licenses/by/3.0/\"/>\n </owl:Ontology>\n <owl:AnnotationProperty rdf:about=\"http://purl.org/dc/elements/1.1/description\"/>\n <owl:AnnotationProperty rdf:about=\"http://purl.org/dc/elements/1.1/title\"/>\n <owl:AnnotationProperty rdf:about=\"http://purl.org/dc/terms/license\"/>\n</rdf:RDF>\n
"},{"location":"reference/formatting-license/#functional-syntax-example","title":"Functional Syntax Example:","text":"Prefix(:=<http://purl.obolibrary.org/obo/license.owl#>)\nPrefix(owl:=<http://www.w3.org/2002/07/owl#>)\nPrefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)\nPrefix(xml:=<http://www.w3.org/XML/1998/namespace>)\nPrefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)\nPrefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)\n\n\nOntology(<http://purl.obolibrary.org/obo/license.owl>\nAnnotation(<http://purl.org/dc/elements/1.1/description> \"An integrated and fictional ontology for the description of abnormal tomato phenotypes.\"^^xsd:string)\nAnnotation(<http://purl.org/dc/elements/1.1/title> \"Tomato Phenotype Ontology (TPO)\"^^xsd:string)\nAnnotation(<http://purl.org/dc/terms/license> <https://creativecommons.org/licenses/by/3.0/>)\n\n)\n
"},{"location":"reference/formatting-license/#owlxml-example","title":"OWL/XML Example:","text":"<?xml version=\"1.0\"?>\n<Ontology xmlns=\"http://www.w3.org/2002/07/owl#\"\n xml:base=\"http://purl.obolibrary.org/obo/license.owl\"\n xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n xmlns:xml=\"http://www.w3.org/XML/1998/namespace\"\n xmlns:xsd=\"http://www.w3.org/2001/XMLSchema#\"\n xmlns:rdfs=\"http://www.w3.org/2000/01/rdf-schema#\"\n ontologyIRI=\"http://purl.obolibrary.org/obo/license.owl\">\n <Prefix name=\"\" IRI=\"http://purl.obolibrary.org/obo/license.owl#\"/>\n <Prefix name=\"owl\" IRI=\"http://www.w3.org/2002/07/owl#\"/>\n <Prefix name=\"rdf\" IRI=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>\n <Prefix name=\"xml\" IRI=\"http://www.w3.org/XML/1998/namespace\"/>\n <Prefix name=\"xsd\" IRI=\"http://www.w3.org/2001/XMLSchema#\"/>\n <Prefix name=\"rdfs\" IRI=\"http://www.w3.org/2000/01/rdf-schema#\"/>\n <Annotation>\n <AnnotationProperty IRI=\"http://purl.org/dc/elements/1.1/description\"/>\n <Literal>An integrated and fictional ontology for the description of abnormal tomato phenotypes.</Literal>\n </Annotation>\n <Annotation>\n <AnnotationProperty IRI=\"http://purl.org/dc/elements/1.1/title\"/>\n <Literal>Tomato Phenotype Ontology (TPO)</Literal>\n </Annotation>\n <Annotation>\n <AnnotationProperty abbreviatedIRI=\"terms:license\"/>\n <IRI>https://creativecommons.org/licenses/by/3.0/</IRI>\n </Annotation>\n <Declaration>\n <AnnotationProperty IRI=\"http://purl.org/dc/elements/1.1/title\"/>\n </Declaration>\n <Declaration>\n <AnnotationProperty IRI=\"http://purl.org/dc/elements/1.1/description\"/>\n </Declaration>\n <Declaration>\n <AnnotationProperty IRI=\"http://purl.org/dc/terms/license\"/>\n </Declaration>\n</Ontology>\n
"},{"location":"reference/formatting-license/#obo-example","title":"OBO Example:","text":"format-version: 1.2\nontology: license\nproperty_value: http://purl.org/dc/elements/1.1/description \"An integrated and fictional ontology for the description of abnormal tomato phenotypes.\" xsd:string\nproperty_value: http://purl.org/dc/elements/1.1/title \"Tomato Phenotype Ontology (TPO)\" xsd:string\nproperty_value: http://purl.org/dc/terms/license https://creativecommons.org/licenses/by/3.0/\n
"},{"location":"reference/frequently-used-odk-commands/","title":"Frequently used ODK commands","text":""},{"location":"reference/frequently-used-odk-commands/#updates-the-makefile-to-the-latest-odk","title":"Updates the Makefile to the latest ODK","text":"sh run.sh make update_repo\n
"},{"location":"reference/frequently-used-odk-commands/#recreates-and-deploys-the-automated-documentation","title":"Recreates and deploys the automated documentation","text":"sh run.sh make update_docs\n
"},{"location":"reference/frequently-used-odk-commands/#preparing-a-new-release","title":"Preparing a new release","text":"sh run.sh make prepare_release\n
"},{"location":"reference/frequently-used-odk-commands/#refreshing-a-single-import","title":"Refreshing a single import","text":"sh run.sh make refresh-%\n
Example:
sh run.sh make refresh-chebi\n
"},{"location":"reference/frequently-used-odk-commands/#refresh-all-imports","title":"Refresh all imports","text":"sh run.sh make refresh-imports\n
"},{"location":"reference/frequently-used-odk-commands/#refresh-all-imports-excluding-large-ones","title":"Refresh all imports excluding large ones","text":"sh run.sh make refresh-imports-excluding-large\n
"},{"location":"reference/frequently-used-odk-commands/#run-all-the-qc-checks","title":"Run all the QC checks","text":"sh run.sh make test\n
"},{"location":"reference/frequently-used-odk-commands/#print-the-version-of-the-currently-installed-odk","title":"Print the version of the currently installed ODK","text":"sh run.sh make odkversion\n
"},{"location":"reference/frequently-used-odk-commands/#checks-the-owl2-dl-profile-validity","title":"Checks the OWL2 DL profile validity","text":"(of a specific file)
sh run.sh make validate_profile_%\n
Example:
sh run.sh make validate_profile_hp-edit.owl\n
"},{"location":"reference/gh-actions-errors/","title":"Common Errors in GitHub actions","text":""},{"location":"reference/gh-actions-errors/#killed-running-out-of-memory","title":"Killed
: Running out of memory","text":"Running the same workflow several times simultaneously (e.g. if two PRs are submitted in a short time, and the second PR triggers the CI workflow while the CI workflow triggered by the first PR is still running) could lead to lack-of-memory situations because all concurrent workflows have to share a single memory limit.
(Note: it isn't really clear with documentation of GitHub Actions on whether concurrent workflow runs share a single memory limit.)
What could possibly be done is to forbid a given workflow from ever running as long as there is already a run of the same workflow ongoing, using the concurrency property.
"},{"location":"reference/git-faq/","title":"Git FAQs","text":"This page aims to consolidate some tips and tricks that ontology editors have found useful in using git
. It is not meant to be a tutorial of git
, but rather as a page with tips that could help in certain specialised situations.
src/ontology
, in terminal use: git checkout master -- imports/uberon_import.owl
.git log
to list out the previous commits and copy the commit code of the commit you would like to revert to (example: see yellow string of text in screenshot below).git checkout ff18c9482035062bbbbb27aaeb50e658298fb635 -- imports/uberon_import.owl
using whichever commit code you want instead of the commit code in this example.For most of our training activities, we recommend using GitHub Desktop. It provides a very convenient way to push and pull changes, and inspect the \"diff\". It is, however, not mandatory if you are already familiar with other git workflows (such as command line, or Sourcetree).
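Putting the steps above together, a typical terminal session for rolling a single file back might look like the following sketch (the commit hash and file path are just the examples used above):
cd src/ontology\ngit log -- imports/uberon_import.owl   # find the commit you want to roll back to\ngit checkout ff18c9482035062bbbbb27aaeb50e658298fb635 -- imports/uberon_import.owl\ngit commit -m 'Revert uberon import to an earlier version'\n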
"},{"location":"reference/github-intro/","title":"Git, GitHub and GitHub Desktop (version control)","text":"A repository can consist of many files with several users simultaneously editing those files at any moment in time. In order to ensure conflicting edits between the users are not made and a history of the edits are tracked, software classified as a \"distributed version control system\" is used.
All OBO repositories are managed by the Git version control system. This allows users to make their own local branch of the repository, i.e., making a mirror copy of the repository directories and files on their own computers, and make edits as desired. The edits can then be reviewed by other users before the changes are incorporated in the 'main' or 'master' branch of the repository. This process can be executed by running Git line commands and/or by using a web interface (Github.com) along with a desktop application (GitHub Desktop).
Documentation, including an introduction to GitHub, can be found here: Hello World.
"},{"location":"reference/glossary-of-terms/","title":"Glossary of Terms","text":"This document is a list of terms that you might encounter in the ontology world. It is not an exhaustive list and will continue to evolve. Please create a ticket if there is a term you find missing or a term you encounter that you do not understand, and we will do our best to add them. This list is not arranged in any particular order. Please use the search function to find terms.
Acknowledgement: Many terms are taken directly from OAK documentation with the permission of Chris Mungall. Many descriptions are also taken from https://www.w3.org/TR/owl2-syntax/.
"},{"location":"reference/glossary-of-terms/#annotation","title":"Annotation","text":"This term is frequently ambiguous. It can refer to Text Annotation, OWL Annotation, or Association.
"},{"location":"reference/glossary-of-terms/#annotationproperty","title":"AnnotationProperty","text":"Annotation properties are OWL axioms that are used to place annotations on individuals, class names, property names, and ontology names. They do not affect the logical definition unless they are used as a \"shortcut\" that a pipeline expands to a logical axiom.
"},{"location":"reference/glossary-of-terms/#anonymous-ancestor","title":"Anonymous Ancestor","text":"An accumulation of all of the superclasses from ancestors of a class.
"},{"location":"reference/glossary-of-terms/#anonymous-individual","title":"Anonymous Individual","text":"If an individual is not expected to be used outside an ontology, one can use an anonymous individual, which is identified by a local node ID rather than a global IRI. Anonymous individuals are analogous to blank nodes in RDF.
"},{"location":"reference/glossary-of-terms/#api","title":"API","text":"Application Programming Interface. An intermediary that allows two or more computer programs to communicate with each other. In ontologies, this usually means an Endpoint in which the ontology can be programmatically accessed.
"},{"location":"reference/glossary-of-terms/#application-ontology","title":"Application Ontology","text":"Usually refers to a Project Ontology.
"},{"location":"reference/glossary-of-terms/#axiom","title":"Axiom","text":"Axioms are statements that are asserted to be true in the domain being described. For example, using a subclass axiom, one can state that the class a:Student is a subclass of the class a:Person. (Note: in OWL, there are also annotation axioms which does not apply any logical descriptions)
"},{"location":"reference/glossary-of-terms/#bioportal","title":"Bioportal","text":"An Ontology Repository that is a comprehensive collection of multiple biologically relevant ontologies.
"},{"location":"reference/glossary-of-terms/#controlled-vocabulary","title":"Controlled Vocabulary","text":"Standardized and organized arrangements of words and phrases that provide a consistent way to describe data. A controlled vocabulary may or may not include definitions. Ontologies can be seen as a controlled vocabulary expressed in an ontological language which includes relations.
"},{"location":"reference/glossary-of-terms/#class","title":"Class","text":"An OWL entity that formally represents something that can be instantiated. For example, the class \"heart\".
"},{"location":"reference/glossary-of-terms/#curie","title":"CURIE","text":"A CURIE is a compact URI. For example, CL:0000001
expands to http:purl.obolibrary.org/obo/CL_0000001. For more information, please see https://www.w3.org/TR/curie/.
An abstract model that organizes elements of data and standardizes how they relate to one another.
"},{"location":"reference/glossary-of-terms/#dataproperty","title":"dataProperty","text":"dataProperty relate OWL entities to literal data (e.g., strings, numbers, datetimes, etc.) as opposed to ObjectProperty which relate individuals to other OWL entities. Unlike AnnotationProperty, dataProperty axioms fall on the logical side of OWL and are hence useable by reasoners.
"},{"location":"reference/glossary-of-terms/#datatype","title":"Datatype","text":"Datatypes are OWL entities that refer to sets of data values. Thus, datatypes are analogous to classes, the main difference being that the former contain data values such as strings and numbers, rather than individuals. Datatypes are a kind of data range, which allows them to be used in restrictions. For example, the datatype xsd:integer denotes the set of all integers, and can be used with the range of a dataProperty to state that the range of said dataProperty must be an integer.
"},{"location":"reference/glossary-of-terms/#description-logic","title":"Description Logic","text":"Description Logics (DL) are a family of formal knowledge representation languages. It provides a logical formalism for ontologies and is what OWL is based on. DL querying can be used to query ontologies in Protege.
"},{"location":"reference/glossary-of-terms/#domain","title":"Domain","text":"Domain, in reference to a dataProperty or ObjectProperty, refers to the restriction on the subject of a triple - if a given property has a given class in its domain this means that any individual that has a value for the property, will be inferred to be an instance of that domain class. For example, if John hasParent Mary
and Person
is listed in the domain of hasParent
, then John
will be inferred to be an instance of Person
.
Dead Simple Ontology Design Patterns. A templating system for ontologies with well-documented patterns and templates.
"},{"location":"reference/glossary-of-terms/#edge","title":"Edge","text":"A typed, directed link between Nodes in a knowledge graph. Translations of OWL into Knowledge graphs vary, but typically edges are generated for simple triples, relating two individuals or two classes via an AnnotationProperty or ObjectProperty and simple existential restrictions (A SubClassOf R some B), with the edge type corresponding to the property.
"},{"location":"reference/glossary-of-terms/#endpoint","title":"Endpoint","text":"Where an API interfaces with the ontology.
"},{"location":"reference/glossary-of-terms/#existential-restriction","title":"Existential Restriction","text":"A relationship between two classes, A R (some) B, that states that all individuals of class A stand in relation R to at least one individual of class B. For example, neuron has_part some dendrite
states that all instances of neuron have at least one individual of type dentrite as a part. In Manchester syntax, the keyword 'some' is used to denote existential restrictions and is interpreted as \"there exists\", \"there is at least one\", or \"some\". See documentation on classifications for more details.
An official syntax of OWL (others are RDF-XML and OWL-XML) in which each line represents and axiom (although things get a little more complex with axiom annotations, and axioms use prefix syntax (order = relation (subject, object)). This is in contrast to in-fix syntax (e.g. Manchester syntax) (order = subject relation object). Functional syntax is the preferred syntax for editor files maintained on GitHub, because it can be safely diff'd and (somewhat) human readable.
"},{"location":"reference/glossary-of-terms/#graph","title":"Graph","text":"Formally a graph is a data structure consisting of Nodes and Edges. There are different forms of graphs, but for our purposes an ontology graph has all Terms as nodes, and relationships connecting terms (is-a, part-of) as edges. Note the concept of an ontology graph and an RDF graph do not necessarily fully align - RDF graphs of OWL ontologies employ numerous blank nodes that obscure the ontology structure.
"},{"location":"reference/glossary-of-terms/#individual","title":"Individual","text":"An OWL entity that represents an instance of a class. For example, the instance \"John\" or \"John's heart\". Note that instances are not commonly represented in ontologies. For instance, \"John\" (an instance of person) or \"John's heart\" (an instance of heart).
"},{"location":"reference/glossary-of-terms/#information-content","title":"Information Content","text":"A measure of how informative an ontology concept is; broader concepts are less informative as they encompass many things, whereas more specific concepts are more unique. This is usually measured as -log2(Pr(term))
. The method of calculating the probability varies, depending on which predicates are taken into account (for many ontologies, it makes sense to use part-of as well as is-a), and whether the probability is the probability of observing a descendant term, or of an entity annotated using that term.
A programmatic abstraction that allows us to focus on what something should do rather than how it is done.
"},{"location":"reference/glossary-of-terms/#jaccard-similarity","title":"Jaccard Similarity","text":"A measures of the similarity between two sets of data to see which members are shared and distinct.
"},{"location":"reference/glossary-of-terms/#kgcl","title":"KGCL","text":"Knowledge Graph Change Language (KGCL) is a data model for communicating desired changes to an ontology. It can also be used to communicate differences between two ontologies. See KGCL docs.
"},{"location":"reference/glossary-of-terms/#knowledge-graph","title":"Knowledge Graph","text":"A network of real-world entities (i.e., objects, events, situations, and concepts) that illustrates the relationships between them. Knowledge graphs (in relation to ontologies) are thought of as real data built using an ontology as a framework.
"},{"location":"reference/glossary-of-terms/#label","title":"Label","text":"Usually refers to a human-readable text string corresponding to the rdfs:label
predicate. Labels are typically unique per ontology. In OBO Format and in the bio-ontology literature, labels are sometimes called Names. Sometimes in the machine learning literature, and in databases such as Neo4J, \"label\" actually refers to a Category.
Lutra is the open source reference implementation of the OTTR templating language.
"},{"location":"reference/glossary-of-terms/#mapping","title":"Mapping","text":"A means of linking two resources (e.g. two ontologies, or an ontology and a database) together. Also see SSSOM
"},{"location":"reference/glossary-of-terms/#materialised","title":"Materialised","text":"The process of making inferred axioms explicit by asserting them.
"},{"location":"reference/glossary-of-terms/#name","title":"Name","text":"Usually synonymous with Label, but in the formal logic and OWL community, \"Name\" sometimes denotes an Identifier
"},{"location":"reference/glossary-of-terms/#named-individual","title":"Named Individual","text":"An Individual that is given an explicit name that can be used in any ontology to refer to the same object; named individuals get IRIs whereas anonymous individuals do not.
"},{"location":"reference/glossary-of-terms/#nodes","title":"Nodes","text":"Terms represented in a graph
"},{"location":"reference/glossary-of-terms/#object","title":"Object","text":"The \"right\" side of a Triple.
"},{"location":"reference/glossary-of-terms/#objectproperty","title":"ObjectProperty","text":"An owl entity that is used to related 2 individuals ('my left foot' part_of 'my left leg') or two classes ('foot' part_of some leg) or an individual and a class ('the neuron depicted in this image' (is) has_soma_location some 'primary motor cortex. More rarely it is used to define a class in terms of some individual (the class 'relatives of Shawn' related_to Value Shawn.
"},{"location":"reference/glossary-of-terms/#obo","title":"OBO","text":"Open Biological and Biomedical Ontology. This could refer to the OBO Foundry (e.g. OBO ontologies = ontologies that follow the standards of the OBO Foundry) or OBO Format
"},{"location":"reference/glossary-of-terms/#obo-format","title":"OBO Format","text":"A serialization format for ontologies designed for easy viewing, direct editing, and readable diffs. It is popular in bioinformatics, but not widely used or known outside the genomics sphere. OBO is mapped to OWL, but only expresses a subset, and provides some OWL abstractions in a more easy to understand fashion.
"},{"location":"reference/glossary-of-terms/#ols","title":"OLS","text":"Ontology Lookup Service. An Ontology Repository that is a curated collection of multiple biologically relevant ontologies, many from OBO. OLS can be accessed with this link
"},{"location":"reference/glossary-of-terms/#ontology","title":"Ontology","text":"A flexible concept loosely encompassing any collection of OWL entities and statements or relationships connecting them.
"},{"location":"reference/glossary-of-terms/#odk","title":"ODK","text":"Ontology Development Kit. A toolkit and docker image for managing ontologies.
"},{"location":"reference/glossary-of-terms/#ontology-library","title":"Ontology Library","text":"The systems or platform where various types of ontologies are stored from different sources and provide the ability to data providers and application developers to share and reuse the ontologies.
"},{"location":"reference/glossary-of-terms/#ontology-repository","title":"Ontology Repository","text":"A curated collection of ontologies.
"},{"location":"reference/glossary-of-terms/#ottr","title":"OTTR","text":"Reasonable Ontology Templates. A system for composable ontology templates and documentation.
"},{"location":"reference/glossary-of-terms/#owl","title":"OWL","text":"Web Ontology Language. An ontology language that uses constructs from Description Logic. OWL is not itself an ontology format, it can be serialized through different formats such as Functional Syntax, and it can be mapped to :RDF and serialized via an RDF format.
"},{"location":"reference/glossary-of-terms/#owl-annotation","title":"OWL Annotation","text":"In the context of OWL, the term Annotation means a piece of metadata that does not have a strict logical interpretation. Annotations can be on entities, for example, Label annotations, or annotations can be on Axioms.
"},{"location":"reference/glossary-of-terms/#owl-api","title":"OWL API","text":"A java-based API to interact with OWL ontologies. Full documentation can be found at http://owlcs.github.io/owlapi/apidocs_5/index.html
"},{"location":"reference/glossary-of-terms/#owl-entity","title":"OWL Entity","text":"OWL Entities, such as classes, properties, and individuals, are identified by IRIs. They form the primitive terms of an ontology and constitute the basic elements of an ontology. For example, a class a:Person can be used to represent the set of all people. Similarly, the object property a:parentOf can be used to represent the parent-child relationship. Finally, the individual a:Peter can be used to represent a particular person called \"Peter\". The following is a complete list of types of OWL Entities:
An OWL entity that represents the type of a Relationship. Typically corresponds to an ObjectProperty in OWL, but this is not always true; in particular, the is-a relationship type is a builtin construct SubClassOf
in OWL Examples:
An ontology that is specific to a project and does not necessarily have interoperability with other ontologies in mind.
"},{"location":"reference/glossary-of-terms/#pronto","title":"Pronto","text":"An Ontology Library for parsing obo and owl files.
"},{"location":"reference/glossary-of-terms/#property","title":"Property","text":"An OWL entity that represents an attribute or a characteristic of an element. In OWL, properties are divided into disjoint categories:
A typical ontology development tool used by ontology developers in the OBO-sphere. Full documentation can be found at https://protege.stanford.edu/.
"},{"location":"reference/glossary-of-terms/#range","title":"Range","text":"Range, in reference to a dataProperty or ObjectProperty, refers to the restriction on the object of a triple - if a given property has a given class in its domain this means that any individual that has a value for the property (i.e. is the subject of a relation along the property), will be inferred to be an instance of that domain class. For example, if John hasParent Mary
and Person
is listed in the domain of hasParent
, then John
will be inferred to be an instance of Person
.
A data model consisting of simple Subject-predicate-Object Triples organized into an RDF Graph.
"},{"location":"reference/glossary-of-terms/#rdflib","title":"rdflib","text":"A python library to interact with RDF data. Full documentation can be found at https://rdflib.readthedocs.io/en/stable/.
"},{"location":"reference/glossary-of-terms/#reasoner","title":"Reasoner","text":"An ontology tool that will perform inference over an ontology to yield new axioms (e.g. new Edges) or to determine if an ontology is logically coherent.
"},{"location":"reference/glossary-of-terms/#relationship","title":"Relationship","text":"A Relationship is a type connection between two OWL entities. The first element is called the subject, and the second one the Object, with the type of connection being the Relationship Type. Sometimes Relationships are equated with Triples in RDF but this can be confusing, because some relationships map to multiple triples when following the OWL RDF serialization. An example is the relationship \"finger part-of hand\", which in OWL is represented using a Existential Restriction that maps to 4 triples.
"},{"location":"reference/glossary-of-terms/#relationship-type","title":"Relationship Type","text":"See predicate
"},{"location":"reference/glossary-of-terms/#robot","title":"ROBOT","text":"A toolkit for transforming and interacting with ontologies. Full documentation can be found at http://robot.obolibrary.org/
"},{"location":"reference/glossary-of-terms/#semantic-similarity","title":"Semantic Similarity","text":"A means of measuring similarity between either pairs of ontology concepts, or between entities annotated using ontology concepts. There is a wide variety of different methods for calculating semantic similarity, for example Jaccard Similarity and Information Content based measures.
"},{"location":"reference/glossary-of-terms/#semantic-sql","title":"Semantic SQL","text":"Semantic SQL is a proposed standardized schema for representing any RDF/OWL ontology, plus a set of tools for building a database conforming to this schema from RDF/OWL files. See Semantic-SQL
"},{"location":"reference/glossary-of-terms/#sparql","title":"SPARQL","text":"The standard query language and protocol for Linked Open Data on the web or for RDF triplestores - used to query ontologies.
"},{"location":"reference/glossary-of-terms/#sssom","title":"SSSOM","text":"Simple Standard for Sharing Ontological Mappings (https://github.com/mapping-commons/sssom).
"},{"location":"reference/glossary-of-terms/#subject","title":"Subject","text":"The \"left\" side of a Triple.
"},{"location":"reference/glossary-of-terms/#subset","title":"Subset","text":"A named collection of elements, typically grouped for some purpose. In the ODK/OBO world, there is a standard annotation property and pattern for this, for more information, see the subset documentation.
"},{"location":"reference/glossary-of-terms/#term","title":"Term","text":"Usually used to mean Class and Individuals, however sometimes used to refer to wider OWL entities.
"},{"location":"reference/glossary-of-terms/#text-annotation","title":"Text Annotation","text":"The process of annotating spans of texts within a text document with references to ontology terms, or the result of this process. This is frequently done automatically. The Bioportal implementation provides text annotation services.
"},{"location":"reference/glossary-of-terms/#triple","title":"Triple","text":"A set of three entities that codifies a statement about semantic data in the form of Subject-predicate-Object expressions (e.g., \"Bob is 35\", or \"Bob knows John\"). Also see Relationship.
"},{"location":"reference/glossary-of-terms/#triplestore","title":"Triplestore","text":"A purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject\u2013predicate\u2013object, like \"Bob is 35\" or \"Bob knows Fred\".
"},{"location":"reference/glossary-of-terms/#ubergraph","title":"Ubergraph","text":"An integrated OBO ontology Triplestore and a Ontology Repository, with merged set of mutually referential OBO ontologies (see the ubergraph github for list of ontologies included), that allows for SPARQL querying of integrated OBO ontologies.
"},{"location":"reference/glossary-of-terms/#uri","title":"URI","text":"A Uniform Resource Indicator, a generalization of URL. Most people think of URLs as being solely for addresses for web pages (or APIs) but in semantic web technologies, URLs can serve as actual identifiers for entities like OWL entities. Data models like OWL and RDF use URIs as identifiers. In OAK, URIs are mapped to CURIE
"},{"location":"reference/glossary/","title":"Glossary for concepts in and around OBO","text":"IMPORTANT NOTE TO EDITORS, MERGE THIS WITH glossary.md.\n
New OBOOK Glossary
"},{"location":"reference/glossary/#tools","title":"Tools","text":"Term Definition Type Docs Ontology Development Kit (ODK) A toolkit and docker image for managing ontology releases. Tool docs ROBOT A toolkit for transforming and interacting with ontologies. Tool docs rdflib A python library to interact with RDF data Library docs OWL API A java-based API to interact with OWL ontologies Library docs Protege A typical ontology development tool used by ontology developers in the OBO-sphere Tool docs ROBOT templates A templating system based on tables, where the templates are integrated in the same table as the data Standard docs Dead Simple Ontology Design Patterns (DOSDP) A templating system for ontologies with well-documented patterns and templates. Standard docs DOSDP tools DOSDP is the open source reference implementation of the DOSDP templating language. Tool docs Reasonable Ontology Templates (OTTR) A system for composable ontology templates and documentation Standard docs Lutra Lutra is the open source reference implementation of the OTTR templating language. Tool docs"},{"location":"reference/go-style-annotation-property-practice/","title":"Recommended metadata properties to use in curating OBO ontologies (GO-style)","text":"Note that while most of the practices documented here apply to all OBO ontologies this recommendation applies only to ontologies that are developed using GO-style curation workflows.
Type Property to use Required Number/Limit Description Format Annotation Reference/Comments Label rdfs:label Y Max 1 * Full name of the term, must be unique. Free text None * some ontologies have multiple labels for different languages, in which case, there should maximum be one label per language Definition IAO:0000115 Y Max 1 A textual definition of ther term. In most ontologies, must be unique. Free text database_cross_reference: reference materials used and contributors (in ORCID ID link format) See this document for guide on writing definitions Contributor dcterms:contributor N (though highly reccomended) No limit The ORCID ID of people who contributed to the creation of the term. ORCID ID (using full link) None Synonyms http://www.geneontology.org/formats/oboInOwl#hasExactSynonym, http://www.geneontology.org/formats/oboInOwl#hasBroadSynonym, http://www.geneontology.org/formats/oboInOwl#hasNarrowSynonym, http://www.geneontology.org/formats/oboInOwl#hasRelatedSynonym N No limit Synonyms of the term. Free text database_cross_reference: reference material in which the synonymn is used See synonyms documentation for guide on using synonyms Comments rdfs:comment N Max 1 Comments about the term, extended descriptions that might be useful, notes on modelling choices, other misc notes. Free text database_cross_reference: reference material relating to the comment See documentation on comments for more information about comments Editor note IAO:0000116 N Max 1 A note that is not relevant to front users, but might be to editors Free text database_cross_reference: reference material relating to the note Subset http://www.geneontology.org/formats/oboInOwl#inSubset N No limit A tag that marks a term as being part of a subset annotation property that is a subproperty of subset_property (see guide on how to select this) None See Slim documentation for more information on subsets Database Cross Reference http://www.geneontology.org/formats/oboInOwl#hasDbXref N No limit Links out to external references. string and should* take the form {prefix}:{accession}; see db-xrefs yaml for prefixes None *Some ontologies allow full URLS in specific cases, but this is controversial Date created dcterms:created N Max 1 Date in which the term was created ISO-8601 format None Date last updated dcterms:date N Max 1 Date in which the term was last updated ISO-8601 format None Deprecation http://www.w3.org/2002/07/owl#deprecated N Max 1 A tag that marks a term as being obsolete/deprecated xsd:boolean (true/false) None See obsoletion guide for more details Replaced by IAO:0100001 N Max 1 Term that has replaced an obsoleted term IRI/ID (e.g. CL:0000001) None See obsoletion guide and merging terms guide for more details Consider oboInOwl:consider N No limit Term that can be considered from manual replacement of an obsoleted term IRI/ID (e.g. CL:0000001) None See obsoletion guide and merging terms guide for more details"},{"location":"reference/managing-issues/","title":"Tools for Managing Issues","text":"Based on Intro to GitHub (GO-Centric) with credit to Nomi Harris and Chris Mungall
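As a rough illustration of the cardinality rules in the GO-style metadata table above, here is a sketch of a simple validation check. The property keys and the example record are hypothetical, and real pipelines would more typically enforce such rules with ROBOT report or SPARQL checks:

```python
# Required properties and properties limited to at most one value,
# following the table above (keys are illustrative CURIEs).
REQUIRED = {"rdfs:label", "IAO:0000115"}
MAX_ONE = {"rdfs:label", "IAO:0000115", "rdfs:comment", "IAO:0000116",
           "dcterms:created", "dcterms:date", "owl:deprecated", "IAO:0100001"}

def check_term_metadata(metadata: dict) -> list:
    """Return a list of problems found in one term's annotation values."""
    problems = []
    for prop in REQUIRED:
        if not metadata.get(prop):
            problems.append(f"missing required property {prop}")
    for prop, values in metadata.items():
        if prop in MAX_ONE and len(values) > 1:
            problems.append(f"{prop} has {len(values)} values, maximum is 1")
    return problems

# Hypothetical term record.
term = {
    "rdfs:label": ["T cell"],
    "IAO:0000115": ["A type of lymphocyte that ..."],
    "oboInOwl:hasExactSynonym": ["T lymphocyte", "T-cell"],
}
print(check_term_metadata(term))  # [] means the basic checks pass
```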
"},{"location":"reference/managing-issues/#labels","title":"Labels","text":"Labels are a useful tool to help group and organize issues, allowing people to filter issues by grouping. Note: Only project contributors can add/change labels
"},{"location":"reference/managing-issues/#best-practices-for-labels","title":"Best Practices for Labels","text":"Superissues are issues that have checklists (added using -[] on items). These are useful as they show progress towards completion. These can be used for issues that require multiple steps to solve.
"},{"location":"reference/managing-issues/#milestones","title":"Milestones","text":"Milestones are used for issues with a specific date/deadline. Milestones contain issues and issues can be filtered by milestones. They are also useful for visualizing how many issues in it is completed.
"},{"location":"reference/managing-issues/#project-boards","title":"Project Boards","text":"Project boards are a useful tool to organise, as the name implies, projects. They can span multiple repos (though the repos need to be in the same organisation). Notes can also be added.
"},{"location":"reference/medical-ontology-landscape/","title":"Medical Ontology landscape","text":""},{"location":"reference/medical-ontology-landscape/#the-landscape-of-disease-and-phenotype-ontologies","title":"The Landscape of Disease and Phenotype Ontologies","text":"Compiled by Nicole Vasilevsky. Feel free to make pull requests to suggest edits. Note: This currently just provides an overview of disease and phenotype ontologies. Contributors are welcome to add more descriptions of other medical ontologies. This was last updated in 2021.
"},{"location":"reference/medical-ontology-landscape/#disease-ontologies-terminologies","title":"Disease Ontologies & Terminologies","text":""},{"location":"reference/medical-ontology-landscape/#disease-summary-table","title":"Disease Summary Table","text":"Name Disease Area Artificial Intelligence Rheumatology Consultant System Ontology (AI-RHEUM) Rheumatic diseases Autism DSM-ADI-R Ontology (ADAR) Autism Autism Spectrum Disorder Phenotype Ontology (ASDPTO) Autism Brucellosis Ontology (IDOBRU) brucellosis Cardiovascular Disease Ontology (CVDO) Cardiovascular Chronic Kidney Disease Ontology (CKDO) Chronic kidney disease Chronic Obstructive Pulmonary Disease Ontology (COPDO) Chronic obstructive pulmonary disease (COPD) Coronavirus Infectious Disease Ontology (CIDO) Coronavirus infectious diseases Diagnostic and Statistical Manual of Mental Disorders (DSM) Mental disorders Dispedia Core Ontology (DCO) Rare diseases Experimental Factor Ontology (EFO) Broad disease coverage Fibrotic Interstitial Lung Disease Ontology (FILDO) Fibrotic interstitial lung disease Genetic and Rare Diseases Information Center (GARD) Rare diseases Holistic Ontology of Rare Diseases (HORD) Rare disease Human Dermatological Disease Ontology (DERMO) Dermatology (skin) Human Disease Ontology (DO) Human disease Infectious Disease Ontology (IDO) Infectious disease International Classification of Functioning, Disability and Health (ICF) Cross-discipline, focuses disabilities International Statistical Classification of Diseases and Related Health Problems (ICD-11) Broad coverage International Classification of Diseases for Oncology (ICD-O) Cancer Logical Observation Identifier Names and Codes (LOINC) Broad coverage Medical Subject Headings (MeSH) Broad coverage MedGen Human medical genetics Medical Dictionary for Regulatory Activities (MedDRA) Broad coverage Mental Disease Ontology (MDO) Mental functioning Mondo Disease Ontology (Mondo) Broad coverage, Cross species National Cancer Institute Thesaurus (NCIT) Humam cancer and neoplasms Neurological Disease Ontology (ND) Neurology Online Mendelian Inheritance in Man (OMIM) Mendelian, genetic diseases. Ontology of Cardiovascular Drug Adverse Events (OCVDAE) Cardiovascular Ontology for General Medical Science (OGMS) Broad coverage Ontology for Genetic Susceptibility Factor (OGSF) Genetic disease Ontology of Glucose Metabolism Disorder (OGMD) Metabolic disorders Ontology of Language Disorder in Autism (LDA) Austism The Oral Health and Disease Ontology (OHD) Oral health and disease Orphanet (ORDO) Rare diseases Parkinson Disease Ontology (PDO) Parkinson disease Pathogenic Disease Ontology (PDO) Pathogenic diseases PolyCystic Ovary Syndrome Knowledgebase (PCOSKB) Polycystic ovary syndrome Rat Disease Ontology (RDO) Broad coverage Removable Partial Denture Ontology (RPDO) Oral health Resource of Asian Primary Immunodeficiency Diseases (RPO) Immunodeficiencies Sickle Cell Disease Ontology (SCDO) Sickle Cell Disease SNOMED Clinical Terminology (SNOMED CT) Broad disease representation for human diseases. 
Symptom Ontology Human diseases Unified Medical Language System Broad coverage"},{"location":"reference/medical-ontology-landscape/#artificial-intelligence-rheumatology-consultant-system-ontology-ai-rheum","title":"Artificial Intelligence Rheumatology Consultant System ontology (AI-RHEUM)","text":"Description: Contains findings, such as clinical signs, symptoms, laboratory test results, radiologic observations, tissue biopsy results, and intermediate diagnosis hypotheses, for the diagnosis of rheumatic diseases. Disease area: Rheumatic diseases Use Cases: Used by clinicians and informatics researchers. Website: https://bioportal.bioontology.org/ontologies/AI-RHEUM Open: Yes
"},{"location":"reference/medical-ontology-landscape/#autism-dsm-adi-r-ontology-adar","title":"Autism DSM-ADI-R Ontology (ADAR)","text":"Description: An ontology of autism spectrum disorder (ASD) and related neurodevelopmental disorders. Disease area: Autism Use Cases: It extends an existing autism ontology to allow automatic inference of ASD phenotypes and Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria based on subjects\u2019 Autism Diagnostic Interview\u2013Revised (ADI-R) assessment data. Website: https://bioportal.bioontology.org/ontologies/ADAR Open: Yes
"},{"location":"reference/medical-ontology-landscape/#autism-spectrum-disorder-phenotype-ontology-asdpto","title":"Autism Spectrum Disorder Phenotype Ontology (ASDPTO)","text":"Description: Encapsulates the ASD behavioral phenotype, informed by the standard ASD assessment instruments and the currently known characteristics of this disorder. Disease area: Autism Use Cases: Intended for use in research settings where extensive phenotypic data have been collected, allowing a concept-based approach to identifying behavioral features of importance and for correlating these with genotypic data. Website: https://bioportal.bioontology.org/ontologies/ASDPTO Open: Yes
"},{"location":"reference/medical-ontology-landscape/#brucellosis-ontology-idobru","title":"Brucellosis Ontology (IDOBRU)","text":"Description: Describes the most common zoonotic disease, brucellosis, which is caused by Brucella, a type of facultative intracellular bacteria. Disease area: brucellosis bacteria Use Cases: An extension ontology of the core Infectious Disease Ontology (IDO-core). This project appears to be inactive. Website: https://github.com/biomedontology/idobru Open: Yes
"},{"location":"reference/medical-ontology-landscape/#cardiovascular-disease-ontology-cvdo","title":"Cardiovascular Disease Ontology (CVDO)","text":"Description: An ontology to describe entities related to cardiovascular diseases. Disease area: Cardiovascular Use Cases: Describes entities related to cardiovascular diseases including the diseases themselves, the underlying disorders, and the related pathological processes. Imports upper level terms from OGMS and imports some terms from Disease Ontology (DO). GitHub repo: https://github.com/OpenLHS/CVDO/ Website: https://github.com/OpenLHS/CVDO OBO Foundry webpage: http://obofoundry.org/ontology/cvdo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#chronic-kidney-disease-ontology-ckdo","title":"Chronic Kidney Disease Ontology (CKDO)","text":"Description: An ontology of chronic kidney disease in primary care. Disease area: Chronic kidney disease Use Cases: CKDDO was developed to assist routine data studies and case identification of CKD in primary care. Website: http://purl.bioontology.org/ontology/CKDO Open: Yes
"},{"location":"reference/medical-ontology-landscape/#chronic-obstructive-pulmonary-disease-ontology-copdo","title":"Chronic Obstructive Pulmonary Disease Ontology (COPDO)","text":"Description: Models concepts associated with chronic obstructive pulmonary disease in routine clinical databases. Disease area: Chronic obstructive pulmonary disease (COPD) Use Cases: Clinical use. Website: https://bioportal.bioontology.org/ontologies/COPDO Open: Yes
"},{"location":"reference/medical-ontology-landscape/#coronavirus-infectious-disease-ontology-cido","title":"Coronavirus Infectious Disease Ontology (CIDO)","text":"Description: Aims to ontologically represent and standardize various aspects of coronavirus infectious diseases, including their etiology, transmission, epidemiology, pathogenesis, diagnosis, prevention, and treatment. Disease area: Coronavirus infectious diseases, including COVID-19, SARS, MERS; covers etiology, transmission, epidemiology, pathogenesis, diagnosis, prevention, and treatment. Use Cases: Used for disease annotations related to coronavirus infections. GitHub repo: https://github.com/cido-ontology/cido OBO Foundry webpage: http://obofoundry.org/ontology/cido.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#diagnostic-and-statistical-manual-of-mental-disorders-dsm","title":"Diagnostic and Statistical Manual of Mental Disorders (DSM)","text":"Description: Authoritative source to define and classify mental disorders to improve diagnoses, treatment, and research. Disease area: Mental disorders Use Cases: Used in clinical healthcare and research by pyschiatrists and psychologists. Website: https://www.psychiatry.org/psychiatrists/practice/dsm Open: No, must be purchased
"},{"location":"reference/medical-ontology-landscape/#dispedia-core-ontology-dco","title":"Dispedia Core Ontology (DCO)","text":"Description: A schema for information brokering and knowledge management in the complex field of rare diseases. DCO describes patients affected by rare diseases and records expertise about diseases in machine-readable form. Disease area: Rare disease Use Cases: DCO was initially created with amyotrophic lateral sclerosis as a use case. Website: http://purl.bioontology.org/ontology/DCO Open: Yes
"},{"location":"reference/medical-ontology-landscape/#experimental-factor-ontology-efo","title":"Experimental Factor Ontology (EFO)","text":"Description: Provides a systematic description of many experimental variables available in EBI databases, and for projects such as the GWAS catalog. Disease area: Broad disease coverage, integrates the Mondo disease ontology. Use Cases: Application ontology build for European Bioinformatics (EBI) tools and databases and Open Targets Genetics Portal. Website: https://www.ebi.ac.uk/efo/ Open: Yes
"},{"location":"reference/medical-ontology-landscape/#fibrotic-interstitial-lung-disease-ontology-fildo","title":"Fibrotic Interstitial Lung Disease Ontology (FILDO)","text":"Description: An in-progress, four-tiered ontology proposed to standardize the diagnostic classification of patients with fibrotic interstitial lung disease. Disease area: Fibrotic interstitial lung disease Use Cases: Goal is to standardize the diagnostic classification of patients with fibrotic ILD. A paper was published in 2017 and an ontology is not publicly available. Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5803648/ Open: No
"},{"location":"reference/medical-ontology-landscape/#genetic-and-rare-diseases-information-center-gard","title":"Genetic and Rare Diseases Information Center (GARD)","text":"Description: NIH resource that provides the public with access to current, reliable, and easy-to-understand information about rare or genetic diseases in English or Spanish. Disease area: Rare diseases Use Cases: Patient portal. Integrates defintions and synonyms from Orphanet, maps to HPO phenotypes, and is integrated by Mondo. Website: https://rarediseases.info.nih.gov/ Open: Yes
"},{"location":"reference/medical-ontology-landscape/#holistic-ontology-of-rare-diseases-hord","title":"Holistic Ontology of Rare Diseases (HORD)","text":"Description: Describes the biopsychosocial state (i.e., disease, psychological, social, and environmental state) of persons with rare diseases in a holistic way. Disease area: Rare disease Use Cases: Rehabilita, Disruptive Technologies for the Rehabilitation of the Future, a project that aims to enhance rehabilitation transforming it to a more personalized, ubiquitous and evidence-based rehabilitation. Website: http://purl.bioontology.org/ontology/HORD Open: Yes
"},{"location":"reference/medical-ontology-landscape/#human-dermatological-disease-ontology-dermo","title":"Human Dermatological Disease Ontology (DERMO)","text":"Description: The most comprehensive dermatological disease ontology available, with over 3,500 classes available. There are 20 upper-level disease entities, with features such as anatomical location, heritability, and affected cell or tissue type. Disease area: Dermatology (skin) Use Cases: DermO can be used to extract data from patient electronic health records using text mining, or to translate existing variable-granularity coding such as ICD-10 to allow capture and standardization of patient/disease annotations. Website: https://bioportal.bioontology.org/ontologies/DERMO Open: Yes
"},{"location":"reference/medical-ontology-landscape/#human-disease-ontology-do","title":"Human Disease Ontology (DO)","text":"Description: An ontology for describing the classification of human diseases organized by etiology. Disease area: Human disease terms, phenotype characteristics and related medical vocabulary disease concepts. Use Cases: Used by Model Organism Databases (MOD), such as Mouse Genome Informatics disease model for diseae annotations, and Alliance for Genome Resources for disease annotations. In 2018, DO tracked over 300 DO project citations suggesting wide adoption and usage for disease annotations. GitHub repo: https://github.com/DiseaseOntology/HumanDiseaseOntology/ Website: http://www.disease-ontology.org/ OBO Foundry webpage: http://obofoundry.org/ontology/doid.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#infectious-disease-ontology-ido","title":"Infectious Disease Ontology (IDO)","text":"Description: A set of interoperable ontologies that will together provide coverage of the infectious disease domain. IDO core is the upper-level ontology that hosts terms of general relevance across the domain, while extension ontologies host terms to specific to a particular part of the domain. Disease area: Infectious disease features, such as acute, primary, secondary infection, and chronic, hospital acquired and local infection. Use Cases: Does not seem active, has not been released since 2017. GitHub repo: https://github.com/infectious-disease-ontology/infectious-disease-ontology/ Website: http://www.bioontology.org/wiki/index.php/Infectious_Disease_Ontology OBO Foundry webpage: http://obofoundry.org/ontology/ido.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#international-classification-of-functioning-disability-and-health-icf","title":"International Classification of Functioning, Disability and Health (ICF)","text":"Description: Represents diseases and provides a conceptual basis for the definition and measurement of health and disability as organized by patient-oriented outcomes of function and disability. ICF considers environmental factors as well as the relevance of associated health conditions in recognizing major models of disability. Disease area: Cross-discipline, focuses on health and disability Use Cases: ICF is the World Health Organization (WHO) framework for measuring health and disability at both individual and population levels. ICF was officially endorsed by the WHO as the international standard to describe and measure health and disability. Website: https://www.who.int/standards/classifications/international-classification-of-functioning-disability-and-health Open: Yes
"},{"location":"reference/medical-ontology-landscape/#international-statistical-classification-of-diseases-and-related-health-problems-icd-11","title":"International Statistical Classification of Diseases and Related Health Problems (ICD-11)","text":"Description: A medical classification list by the World Health Organization (WHO) that contains codes for diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. Disease area: Broad coverage of human disease features, such as disease of anatomical systems, infectious diseases, injuries, external causes of morbidity and mortality. Use Cases: The main purpose of ICD-11 is for clinical care, billing and coding for insurance companies. Website: https://www.who.int/standards/classifications/classification-of-diseases Open: Yes
"},{"location":"reference/medical-ontology-landscape/#international-classification-of-diseases-for-oncology-icd-o","title":"International Classification of Diseases for Oncology (ICD-O)","text":"Description: A domain-specific extension of the International Statistical Classification of Diseases and Related Health Problems for tumor diseases. Disease area: A multi-axial classification of the site, morphology, behaviour, and grading of neoplasms. Use Cases: Used principally in tumour or cancer registries for coding the site (topography) and the histology (morphology) of neoplasms, usually obtained from a pathology report. Website: https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology Open: Yes
"},{"location":"reference/medical-ontology-landscape/#logical-observation-identifier-names-and-codes-loinc","title":"Logical Observation Identifier Names and Codes (LOINC)","text":"Description: Identifies medical laboratory observations. Disease area: Broad coverage Use Cases: The Regenstrief Institute first developed LOINC in 1994 in response to the demand for an electronic database for clinical care and management. LOINC is publicly available at no cost and is endorsed by the American Clinical Laboratory Association and the College of American Pathologists. Since its inception, LOINC has expanded to include not just medical laboratory code names but also nursing diagnoses, nursing interventions, outcome classifications, and patient care data sets. Website: https://loinc.org/ Open: Yes, registration is required.
"},{"location":"reference/medical-ontology-landscape/#medical-subject-headings-mesh","title":"Medical Subject Headings (MeSH)","text":"Description: Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. Disease area: Broad coverage Use Cases: It is used for indexing, cataloging, and searching of biomedical and health-related information. Integrated into Mondo. Website: https://meshb.nlm.nih.gov/search Open: Yes
"},{"location":"reference/medical-ontology-landscape/#medgen","title":"MedGen","text":"Description: Organizes information related to human medical genetics, such as attributes of conditions and phenotypes of genetic contributions. Disease area: Human medical genetics Use Cases: MedGen is NCBI's portal to information about conditions and phenotypes related to Medical Genetics. Terms from the NIH Genetic Testing Registry (GTR), UMLS, HPO, Orphanet, ClinVar and other sources are aggregated into concepts, each of which is assigned a unique identifier and a preferred name and symbol. The core content of the record may include names, identifiers used by other databases, mode of inheritance, clinical features, and map location of the loci affecting the disorder. The concept identifier (CUI) is used to aggregate information about that concept, similar to the way NCBI Gene serves as a gateway to gene-related information. Website: https://www.ncbi.nlm.nih.gov/medgen/ Open: Yes
"},{"location":"reference/medical-ontology-landscape/#medical-dictionary-for-regulatory-activities-meddra","title":"Medical Dictionary for Regulatory Activities (MedDRA)","text":"Description: Provides a standardized international medical terminology to be used for regulatory communication and evaluation of data about medicinal products for human use. Disease area: Broad coverage Use Cases: Mainly targeted towards industry and regulatory users. Website: https://www.meddra.org/ Open: Yes
"},{"location":"reference/medical-ontology-landscape/#mental-disease-ontology-mdo","title":"Mental Disease Ontology (MDO)","text":"Description: An ontology to describe and classify mental diseases such as schizophrenia, annotated with DSM-IV and ICD codes where applicable. Disease area: Mental functioning, including mental processes such as cognition and traits such as intelligence. Use Cases: The ontology has been partially aligned with the related projects Cognitive Atlas, knowledge base on cognitive science and the Cognitive Paradigm Ontology, which is used in the Brainmap, a database of neuroimaging experiments. GitHub repo: https://github.com/jannahastings/mental-functioning-ontology OBO Foundry webpage: http://obofoundry.org/ontology/mfomd.html Open: yes
"},{"location":"reference/medical-ontology-landscape/#mondo-disease-ontology-mondo","title":"Mondo Disease Ontology (Mondo)","text":"Description: An integrated disease ontology that provides precise mappings between source ontologies that comprehensively covers cross-species diseases, from common to rare diseases. Disease area: Cross species, intended to cover all areas of diseases, integrating source ontologies that cover Mendelian diseases (OMIM), rare diseases (Orphanet), neoplasms (NCIt), human diseases (DO), and others. See all sources here. Use Cases: Mondo was developed for usage in the Monarch Initiative, a discovery system that allows navigation of similarities between phenotypes, organisms, and human diseases across many data sources and organisms. Mondo is also used by ClinGen for disease curations, the Kids First Data Resource Portal for disease annotations and others, see an extensive list here. GitHub repo: https://github.com/monarch-initiative/mondo Website: https://mondo.monarchinitiative.org/ OBO Foundry webpage: http://obofoundry.org/ontology/mondo.html Open: yes
"},{"location":"reference/medical-ontology-landscape/#national-cancer-institute-thesaurus-ncit","title":"National Cancer Institute Thesaurus (NCIT)","text":"Description: NCI Thesaurus (NCIt)is a reference terminology that includes broad coverage of the cancer domain, including cancer related diseases, findings and abnormalities. The NCIt OBO Edition aims to increase integration of the NCIt with OBO Library ontologies. NCIt OBO Edition releases should be considered experimental. Disease area: Cancer and neoplasms Use Cases: NCI Thesaurus (NCIt) provides reference terminology for many National Cancer Institute and other systems. It is used by the Clinical Data Interchange Standards Consortium Terminology (CDISC), the U.S. Food and Drug Administration (FDA), the Federal Medication Terminologies (FMT), and the National Council for Prescription Drug Programs (NCPDP). It provides extensive coverage of neoplasms and cancers. GitHub repo: https://github.com/NCI-Thesaurus/thesaurus-obo-edition/issues Website: https://ncithesaurus.nci.nih.gov/ncitbrowser/pages/home.jsf?version=20.11e OBO Foundry webpage: http://obofoundry.org/ontology/ncit.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#neurological-disease-ontology-nd","title":"Neurological Disease Ontology (ND)","text":"Description: A framework for the representation of key aspects of neurological disease. Disease area: Neurology Use Cases: Goal is to provide a framework to enable representation of aspects of neurological diseases that are relevant to their treatment and study. This project may be inactive, the last commit to GitHub was in 2016. GitHub repo: https://github.com/addiehl/neurological-disease-ontology Open: Yes
"},{"location":"reference/medical-ontology-landscape/#online-mendelian-inheritance-in-man-omim","title":"Online Mendelian Inheritance in Man (OMIM)","text":"Description: a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. Disease area: Mendelian, genetic diseases. Use Cases: Integrated into the disease ontology, used by the Human Phenotype Ontology for disease annotations, patients and researchers. Website: https://omim.org/ Open: yes
"},{"location":"reference/medical-ontology-landscape/#ontology-of-cardiovascular-drug-adverse-events-ocvdae","title":"Ontology of Cardiovascular Drug Adverse Events (OCVDAE)","text":"Description: A biomedical ontology of cardiovascular drug\u2013associated adverse events. Disease area: Cardiovascular Use Cases: One novel study of the OCVDAE project is the development of the PCR method. Specifically, an AE-specific drug class effect is defined to exist when all the drugs (drug chemical ingredients or drug products) in a drug class are associated with an AE, which is formulated as a proportional class level ratio (\u201cPCR\u201d)\u2009=\u20091. See more information in the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5653862/. This project may be inactive, the last GitHub commit was in 2019. GitHub repo: https://github.com/OCVDAE/OCVDAE Website: https://bioportal.bioontology.org/ontologies/OCVDAE Open: yes
"},{"location":"reference/medical-ontology-landscape/#ontology-for-general-medical-science-ogms","title":"Ontology for General Medical Science (OGMS)","text":"Description: An ontology of entities involved in a clinical encounter. Use Cases: Provides a formal theory of disease that can be further elaborated by specific disease ontologies. It is intended to be used as a upper level ontology for other disease ontologies. Used by Cardiovascular Disease Ontology. GitHub repo: https://github.com/OGMS/ogms OBO Foundry webpage: http://obofoundry.org/ontology/ogms.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#ontology-for-genetic-susceptibility-factor-ogsf","title":"Ontology for Genetic Susceptibility Factor (OGSF)","text":"Description: An application ontology to represent genetic susceptibility to a specific disease, adverse event, or a pathological process. Use Cases: Modeling genetic susceptibility to vaccine adverse events. GitHub repo: https://github.com/linikujp/OGSF OBO Foundry webpage: http://obofoundry.org/ontology/ogsf.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#ontology-of-glucose-metabolism-disorder-ogmd","title":"Ontology of Glucose Metabolism Disorder (OGMD)","text":"Description: Represents glucose metabolism disorder and diabetes disease names, phenotypes, and their classifications. Disease area: Metabolic disorders Use Cases: Still under development (last verssion released in BioPortal was in 2021) but there is little information about its usage online. Website: https://bioportal.bioontology.org/ontologies/OGMD Open: Yes
"},{"location":"reference/medical-ontology-landscape/#ontology-of-language-disorder-in-autism-lda","title":"Ontology of Language Disorder in Autism (LDA)","text":"Description: An ontology assembled from a set of language terms mined from the autism literature. Disease area: Austism Use Cases: This has not been released since 2008 and looks like it is inactive. Website: https://bioportal.bioontology.org/ontologies/LDA Open: Yes
"},{"location":"reference/medical-ontology-landscape/#the-oral-health-and-disease-ontology-ohd","title":"The Oral Health and Disease Ontology (OHD)","text":"Description: Represents the content of dental practice health records and is intended to be further developed for use in translational medicine. OHD is structured using BFO (Basic Formal Ontology) and uses terms from many ontologies, NCBITaxon, and a subset of terms from the CDT (Current Dental Terminology). Disease area: Oral health and disease Use Cases: Used to represent the content of dental practice health records and is intended to be further developed for use in translation medicine. Appears to be inactive. OBO Foundry webpage: http://www.obofoundry.org/ontology/ohd.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#orphanet-ordo","title":"Orphanet (ORDO)","text":"Description: The portal for rare diseases and orphan drugs. Contains a structured vocabulary for rare diseases capturing relationships between diseases, genes, and other relevant features, jointly developed by Orphanet and the EBI. It contains information on nearly 10,000 cancers and related diseases, 8,000 single agents and combination therapies, and a wide range of other topics related to cancer and biomedical research. Disease area: Rare diseases Use Cases: Used by rare disease research and clinical community. Integrated into the Mondo disease ontology, aligned with OMIM. Website: https://www.orpha.net/consor/cgi-bin/index.php Open: Yes
"},{"location":"reference/medical-ontology-landscape/#parkinson-disease-ontology-pdo","title":"Parkinson Disease ontology (PDO)","text":"Description: A comprehensive semantic framework with a subclass-based taxonomic hierarchy, covering the whole breadth of the Parkinson disease knowledge domain from major biomedical concepts to different views on disease features held by molecular biologists, clinicians, and drug developers. Disease area: Parkinson disease Use Cases: This resource has been created for use in the IMI-funded AETIONOMY project. Last release was in 2015, may be inactive. Website: https://bioportal.bioontology.org/ontologies/PDON Open: Yes
"},{"location":"reference/medical-ontology-landscape/#pathogenic-disease-ontology-pdo","title":"Pathogenic Disease Ontology (PDO)","text":"Description: Provides information on infectious diseases, disease synonyms, transmission pathways, disease agents, affected populations, and disease properties. Diseases are grouped into syndromic disease categories, organisms are structured hierarchically, and both disease transmission and relevant disease properties are searchable. Disease area: human infectious diseases caused by microbes and the diseases that is related to microbial infection. Use Cases: Has not been released since 2016 and may be inactive. Website: https://bioportal.bioontology.org/ontologies/PDO Open: Yes.
"},{"location":"reference/medical-ontology-landscape/#polycystic-ovary-syndrome-knowledgebase-pcoskb","title":"PolyCystic Ovary Syndrome Knowledgebase (PCOSKB)","text":"Description: Comprises genes, single nucleotide polymorphisms, diseases, gene ontology terms, and biochemical pathways associated with polycystic ovary syndrome, a major cause of female subfertility worldwide. Disease area: polycystic ovary syndrome Use Cases: Ontology underlying the Polycystic Ovary Syndrome Knowledgebase, a manually curated knowledgebase on PCOS. Website: http://pcoskb.bicnirrh.res.in/go_d.php Open: Yes
"},{"location":"reference/medical-ontology-landscape/#rat-disease-ontology-rdo","title":"Rat Disease Ontology (RDO)","text":"Description: Provides the foundation for ten comprehensive disease area\u2013related data sets at the Rat Genome Database Disease Portals. Disease area: Broad coverage including animal diseases, infectious diseases, chemically-induced disorders, occupational diseases, wounds and injuries and more. Use Cases: Developed for use with the Rat Genome Database Disease Portals. Website: https://rgd.mcw.edu/rgdweb/ontology/view.html?acc_id=DOID:4 Open: Yes
"},{"location":"reference/medical-ontology-landscape/#removable-partial-denture-ontology-rpdo","title":"Removable Partial Denture Ontology (RPDO)","text":"Description: Represents knowledge of a patient\u2019s oral conditions and denture component parts, originally developed to create a clinician decision support model. Disease area: Oral health and dentures Use Cases: A paper was published on this in 2016 but it does not appear any other information is available about this ontology on the website, presumably it is an inactive project. Publication: https://www.nature.com/articles/srep27855 Open: No
"},{"location":"reference/medical-ontology-landscape/#resource-of-asian-primary-immunodeficiency-diseases-rpo","title":"Resource of Asian Primary Immunodeficiency Diseases (RPO)","text":"Description: Represents observed phenotypic terms, sequence variations, and messenger RNA and protein expression levels of all genes involved in primary immunodeficiency diseases. Disease area: Primary immunodeficiency diseases Use Cases: This terminology is used in a freely accessible, dynamic and integrated database for primary immunodeficiency diseases (PID) called Resource of Asian Primary Immunodeficiency Diseases (RAPID), which is available here. Publication: https://academic.oup.com/nar/article/37/suppl_1/D863/1004993 Open: Yes
"},{"location":"reference/medical-ontology-landscape/#sickle-cell-disease-ontology-scdo","title":"Sickle Cell Disease Ontology (SCDO)","text":"Description: SCDO establishes (a) community-standardized sickle cell disease terms and descriptions, (b) canonical and hierarchical representation of knowledge on sickle cell disease, and (c) links to other ontologies and bodies of work. Disease area: Sickle Cell Disease (SCD). Use Cases: SCDO is intended to be a comprehensive collection of knowledge on SCD, facilitate exploration of new scientific questions and ideas, facilitate seamless data sharing and collaborations including meta-analysis within the SCD community, support the building of databasing and clinical informatics in SCD. GitHub repo: https://github.com/scdodev/scdo-ontology/issues Website: https://scdontology.h3abionet.org/ OBO Foundry webpage: http://obofoundry.org/ontology/scdo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#snomed-clinical-terminology-snomed-ct","title":"SNOMED Clinical Terminology (SNOMED CT)","text":"Description: A comprehensive clinical terminology/ontology used in healthcare settings. Disease area: Broad disease representation for human diseases. Use Cases: Main coding system used in Electronic Health Records (EHRs). Website: https://browser.ihtsdotools.org/? Open: No, requires a license for usage.
"},{"location":"reference/medical-ontology-landscape/#symptom-ontology","title":"Symptom Ontology","text":"Description: An ontology of disease symptoms, with symptoms encompasing perceived changes in function, sensations or appearance reported by a patient indicative of a disease. Disease area: Human diseases Use Cases: Developed by the Disease Ontology (DO) team and used for describing symptoms of human diseases in the DO. Website: http://symptomontologywiki.igs.umaryland.edu/mediawiki/index.php/Main_Page OBO Foundry webpage: http://obofoundry.org/ontology/symp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#unified-medical-language-system","title":"Unified Medical Language System","text":"Description: The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources to promote creation of more effective and interoperable biomedical information systems and services. Disease area: Broad coverage Use Cases: Healthcare settings including electronic health records and HL7. Website: https://www.nlm.nih.gov/research/umls/index.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#phenotype-ontologies","title":"Phenotype ontologies","text":""},{"location":"reference/medical-ontology-landscape/#phenotype-summary-table","title":"Phenotype Summary Table","text":"Name Species Area Ascomycete phenotype ontology (APO) Ascomycota C. elegans phenotype (wbphenotype) C elegans Dictyostelium discoideum phenotype ontology (ddpheno) Dictyostelium discoideum Drosophila Phenotype Ontology (DPO) Drosophila Flora Phenotype Ontology (FLOPO) Viridiplantae Fission Yeast Phenotype Ontology (FYPO) S. pombe Human Phenotype Ontology (HPO) Human HPO - ORDO Ontological Module (HOOM) Human Mammalian Phenotype Ontology (MP) Mammals Ontology of Microbial Phenotypes (OMP) Microbe Ontology of Prokaryotic Phenotypic and Metabolic Characters Prokaryotes Pathogen Host Interaction Phenotype Ontology pathogens Planarian Phenotype Ontology (PLANP) Schmidtea mediterranea Plant Trait Ontology (TO) Viridiplantae Plant Phenology Ontology Plants Unified Phenotype Ontology (uPheno) Cross-species coverage Xenopus Phenotype Ontology (XPO) Xenopus Zebrafish Phenotype Ontology (ZP) Zebrafish"},{"location":"reference/medical-ontology-landscape/#ascomycete-phenotype-ontology-apo","title":"Ascomycete phenotype ontology (APO)","text":"Description: A structured controlled vocabulary for the phenotypes of Ascomycete fungi. Species: Ascomycota GitHub repo: https://github.com/obophenotype/ascomycete-phenotype-ontology/ Webpage: http://www.yeastgenome.org/ OBO Foundry webpage: http://obofoundry.org/ontology/wbphenotype.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#c-elegans-phenotype-wbphenotype","title":"C. elegans phenotype (wbphenotype)","text":"Description: A structured controlled vocabulary of Caenorhabditis elegans phenotypes. Species: C elegans GitHub repo: https://github.com/obophenotype/c-elegans-phenotype-ontology OBO Foundry webpage: http://obofoundry.org/ontology/wbphenotype.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#dictyostelium-discoideum-phenotype-ontology-ddpheno","title":"Dictyostelium discoideum phenotype ontology (ddpheno)","text":"Description: A structured controlled vocabulary of phenotypes of the slime-mould Dictyostelium discoideum. Species: Dictyostelium discoideum GitHub repo: https://github.com/obophenotype/dicty-phenotype-ontology/issues Webpage: http://dictybase.org/ OBO Foundry webpage: http://obofoundry.org/ontology/ddpheno.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#drosophila-phenotype-ontology-dpo","title":"Drosophila Phenotype Ontology (DPO)","text":"Description: An ontology of commonly encountered and/or high level Drosophila phenotypes. Species: Drosophila GitHub repo: https://github.com/obophenotype/c-elegans-phenotype-ontology Webpage: http://purl.obolibrary.org/obo/fbcv OBO Foundry webpage: http://obofoundry.org/ontology/dpo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#flora-phenotype-ontology-flopo","title":"Flora Phenotype Ontology (FLOPO)","text":"Description: Traits and phenotypes of flowering plants occurring in digitized Floras. Species: Viridiplantae GitHub repo: https://github.com/flora-phenotype-ontology/flopoontology/ OBO Foundry webpage: http://obofoundry.org/ontology/flopo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#fission-yeast-phenotype-ontology-fypo","title":"Fission Yeast Phenotype Ontology (FYPO)","text":"Description: FYPO is a formal ontology of phenotypes observed in fission yeast. Species: S. pombe GitHub repo: https://github.com/pombase/fypo OBO Foundry webpage: http://obofoundry.org/ontology/fypo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#human-phenotype-ontology-hpo","title":"Human Phenotype Ontology (HPO)","text":"Description: HPO provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality. Species: Human GitHub repo: https://github.com/obophenotype/human-phenotype-ontology Website: https://hpo.jax.org/app/ OBO Foundry webpage: http://obofoundry.org/ontology/hp.html Open: yes
"},{"location":"reference/medical-ontology-landscape/#hpo-ordo-ontological-module-hoom","title":"HPO - ORDO Ontological Module (HOOM)","text":"Description: Orphanet provides phenotypic annotations of the rare diseases in the Orphanet nomenclature using the Human Phenotype Ontology (HPO). HOOM is a module that qualifies the annotation between a clinical entity and phenotypic abnormalities according to a frequency and by integrating the notion of diagnostic criterion. In ORDO a clinical entity is either a group of rare disorders, a rare disorder or a subtype of disorder. The phenomes branch of ORDO has been refactored as a logical import of HPO, and the HPO-ORDO phenotype disease-annotations have been provided in a series of triples in OBAN format in which associations, frequency and provenance are modeled. HOOM is provided as an OWL (Ontologies Web Languages) file, using OBAN, the Orphanet Rare Disease Ontology (ORDO), and HPO ontological models. HOOM provides extra possibilities for researchers, pharmaceutical companies and others wishing to co-analyse rare and common disease phenotype associations, or re-use the integrated ontologies in genomic variants repositories or match-making tools. Species: Human Website: http://www.orphadata.org/cgi-bin/img/PDF/WhatIsHOOM.pdf BioPortal: https://bioportal.bioontology.org/ontologies/HOOM Open: yes
"},{"location":"reference/medical-ontology-landscape/#mammalian-phenotype-ontology-mp","title":"Mammalian Phenotype Ontology (MP)","text":"Description: Standard terms for annotating mammalian phenotypic data. Species: Mammals (main focus is on mouse and rodents) GitHub repo: https://github.com/obophenotype/mammalian-phenotype-ontology Website: http://www.informatics.jax.org/searches/MP_form.shtml OBO Foundry webpage: http://obofoundry.org/ontology/mp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#ontology-of-microbial-phenotypes-omp","title":"Ontology of Microbial Phenotypes (OMP)","text":"Description: An ontology of phenotypes covering microbes. Species: microbes GitHub repo: https://github.com/microbialphenotypes/OMP-ontology Website: http://microbialphenotypes.org OBO Foundry webpage: http://obofoundry.org/ontology/omp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#ontology-of-prokaryotic-phenotypic-and-metabolic-characters","title":"Ontology of Prokaryotic Phenotypic and Metabolic Characters","text":"Description: An ontology of phenotypes covering microbes. Species: Prokaryotes GitHub repo: https://github.com/microbialphenotypes/OMP-ontology/issues Website: http://microbialphenotypes.org/ OBO Foundry webpage: http://obofoundry.org/ontology/omp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#pathogen-host-interaction-phenotype-ontology","title":"Pathogen Host Interaction Phenotype Ontology","text":"Description: PHIPO is a formal ontology of species-neutral phenotypes observed in pathogen-host interactions. Species: pathogens GitHub repo: https://github.com/PHI-base/phipo Website: http://www.phi-base.org OBO Foundry webpage: http://obofoundry.org/ontology/phipo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#planarian-phenotype-ontology-planp","title":"Planarian Phenotype Ontology (PLANP)","text":"Description: Planarian Phenotype Ontology is an ontology of phenotypes observed in the planarian Schmidtea mediterranea. Species: Schmidtea mediterranea GitHub repo: https://github.com/obophenotype/planarian-phenotype-ontology OBO Foundry webpage: http://obofoundry.org/ontology/planp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#plant-trait-ontology-to","title":"Plant Trait Ontology (TO)","text":"Description: A controlled vocabulary of describe phenotypic traits in plants. Species: Viridiplantae GitHub repo: https://github.com/Planteome/plant-trait-ontology/ OBO Foundry webpage: http://obofoundry.org/ontology/to.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#plant-phenology-ontology","title":"Plant Phenology Ontology","text":"Description: An ontology for describing the phenology of individual plants and populations of plants, and for integrating plant phenological data across sources and scales. Species: Plants GitHub repo: https://github.com/PlantPhenoOntology/PPO OBO Foundry webpage: http://obofoundry.org/ontology/ppo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#unified-phenotype-ontology-upheno","title":"Unified Phenotype Ontology (uPheno)","text":"Description: The uPheno ontology integrates multiple phenotype ontologies into a unified cross-species phenotype ontology. Species: Cross-species coverage GitHub repo: https://github.com/obophenotype/upheno OBO Foundry webpage: http://obofoundry.org/ontology/upheno.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#xenopus-phenotype-ontology-xpo","title":"Xenopus Phenotype Ontology (XPO)","text":"Description: XPO represents anatomical, cellular, and gene function phenotypes occurring throughout the development of the African frogs Xenopus laevis and tropicalis. Species: Xenopus GitHub repo: https://github.com/obophenotype/xenopus-phenotype-ontology OBO Foundry webpage: http://obofoundry.org/ontology/xpo.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#zebrafish-phenotype-ontology-zp","title":"Zebrafish Phenotype Ontology (ZP)","text":"Description: The Zebrafish Phenotype Ontology formally defines all phenotypes of the Zebrafish model organism. Species: Zebrafish GitHub repo: https://github.com/obophenotype/zebrafish-phenotype-ontology OBO Foundry webpage: http://obofoundry.org/ontology/zp.html Open: Yes
"},{"location":"reference/medical-ontology-landscape/#references","title":"References","text":"An index page to find some of our favourite articles on Chris' blog. These are not all articles, but I selection we found useful during our every work.
"},{"location":"reference/mungall-blog-radar/#ontology-development-and-modelling","title":"Ontology development and modelling","text":"OntoTips Series. Must read series for the beginning ontology developer.
Warning about complex modelling. Chris is generally big on Occam's Razor solutions: given two solutions that solve a use case, the simpler is better.
OntoTip: Don\u2019t over-specify OWL definitions. From the above OntoTip series.
How to deal with unintentional equivalent classes
Some resources on OBOOK are less well developed than others. We use the OBOOK Maturity Indicator to document this (discussion).
To add a status badge onto a site, simply paste a badge like this right under the title:
<a href=\"https://oboacademy.github.io/obook/reference/obook-maturity-indicator/\"><img src=\"https://img.shields.io/endpoint?url=https%3A%2F%2Fraw.githubusercontent.com%2FOBOAcademy%2Fobook%2Fmaster%2Fdocs%2Fresources%2Fobook-badge-final.json\" /></a>\n
"},{"location":"reference/odk/","title":"Ontology Development Kit (ODK) Reference","text":"The ODK is essentially two things:
The ODK bundles a lot of tools together, such as ROBOT, owltools, fastobo-validator and dosdp-tools. To get a better idea, it's best to simply read the Dockerfile specifications of the ODK image:
One of the tools in the toolbox, the \"seed my repo\" function, allows us to generate a complete GitHub repository with everything needed to manage an OBO ontology according to OBO best practices. The two central components are:
The schema can be found in the ODK documentation here
"},{"location":"reference/ontology-curator/","title":"A Day in the Life of an Ontology Curator","text":"Here's a collection of links about the Open Biological and Biomedical Ontologies (OBO), and related topics.
If you're completely new to OBO, I suggest starting with Ontologies 101:
If you're new to scientific computing more generally, then I strongly recommend Software Carpentry, which provides a set of very pragmatic introductions to the Unix command line, git, Python, Make, and other tools widely used by OBO developers.
"},{"location":"reference/other-resources/#open-biological-and-biomedical-ontologies","title":"Open Biological and Biomedical Ontologies","text":"OBO is a community of people collaborating on open source ontologies for science. We have a set of shared principles and best practises to help people and data work together effectively.
Here is a very incomplete list of some excellent services to help you find and use OBO terms and ontologies.
This is the suite of open source software that most OBO developers use.
This section is for technical reference, not beginners.
OBO projects use Semantic Web and Linked Data technologies:
These standards form layers:
Other useful resources on technical topics:
Nicole Vasilevsky, James Overton, Rebecca Jackson, Sabrina Toro, Shawn Tan, Bradley Varner, David Osumi-Sutherland, & Nicolas Matentzoglu. (2022, August 3). OBO Academy: Training materials for bio-ontologists. 2022 ISMB Bio-Ontologies Community, Madison, WI. https://doi.org/10.5281/zenodo.6955490
"},{"location":"reference/outreach/#generic-obo-academy-slide-deck","title":"Generic OBO Academy slide deck","text":"Available here. Please feel free to use this slide deck to promote the OBO Academy.
"},{"location":"reference/outreach/#presentations","title":"Presentations","text":"To add an ontology term (such as a GO term) that contains '
in its name (e.g. RNA-directed 5'-3' RNA polymerase activity
) in the class expression editor, you need to escape the '
characters. In Proteg\u00e9 5.5 this is not automatically handled when you auto-complete with tab. To escape the character, add \\
before the '
-> RNA-directed 5\\'-3\\' RNA polymerase activity
. You won't be able to add the annotation otherwise.
Note that in Proteg\u00e9 5.5, the \\
characters will show up in the description window, and when hovering over the term, you won't be able to click through to it as a link. However, when you save the file, the relationship is saved correctly. You can double-check by opening the ontology text file and confirming that the term is correctly mentioned in the relationship.
For this reference, we will use the Cell Ontology to highlight the key information on the user interface in Proteg\u00e9.
"},{"location":"reference/protege-interface/#general-interface-buttons","title":"General interface buttons","text":"'+' button (not shown above) = add '?' button = explain axiom '@' button = annotate 'x' button = remove 'o' button = edit
"},{"location":"reference/protege-interface/#active-ontology-tab","title":"Active Ontology tab","text":""},{"location":"reference/protege-interface/#overview","title":"Overview","text":"When you open the ontology on protege, you should land on the Active ontology tab, alternatively, it is available on the top as one of your tabs.
"},{"location":"reference/protege-interface/#ontology-level-annotations","title":"Ontology Level Annotations","text":"Annotations on the active ontology tab are ontology level annotations and contain metadata about the ontology. This includes:
Entities are where your \"entries\" in the ontology live and where you can add terms etc.
"},{"location":"reference/reasoning/","title":"Why do we need reasoning?","text":"A quick personal perspective up-front. When I was finishing my undergrad, I barely had heard the term Semantic Web. What I had heard vaguely intrigued me, so I decided that for my final project, I would try to combine something Semantic Web related with my other major, Law and build a tool that could automatically infer the applicability of a law (written in OWL) given a legal case. Super naively, I just went went ahead, read a few papers about legal ontologies, build a simple one, loaded it into my application and somehow got it to work, with reasoning and all, without even having heard of Description Logic.
In my PhD, I worked on actual reasoning algorithms, which meant no more avoiding logic. But - I did not get it. Up until this point in my life, I could just study harder and harder, and in the end I was confident with what I learned, but First Order Logic, in particular model theory and proofs, caused me anxiety until the date of my viva. In the end, a very basic understanding of model theory and Tableau did help me with characterising the algorithms I was working with (I was studying the effect of modularity, cutting out logically connected subsets of an ontology, on reasoning performance), but I can confidently say today: I never really, like deeply, understood logical proofs. I still can't read them - and I have a PhD in Reasoning (albeit from an empirical angle).
If you followed the Open HPI courses on logic, and you are anything like me, your head will hurt and you will want to hide under your blankets. Most students feel like that. For a complete education in Semantic Web technologies, going through this part once is essential: it tells you something about how difficult some stuff is under the hood, and how much work has been done to make something like OWL work for knowledge representation. You should have gained some appreciation of the domain, which is no less complex than Machine Learning or Stochastic Processes. But, in my experience, some of the most effective ontology engineers barely understand reasoning - definitely have no idea how it works - and still do amazing work. In that spirit, I would like to invite you at this stage to put logic and reasoning behind you (unless it made you curious of course) - you won't need to know much of that for being an effective Semantic Engineer. In the following, I will summarise some of the key take-aways that I find useful to keep in mind.
Human SubClassOf: Mammal
means that all instances of the Human
class, like me, are also instances of the Mammal
class. Or, in other words, from the statements:Human SubClassOf: Mammal\nNico type: Human\n
Semantics allow as to deduce that Nico:Mammal
. What are semantics practically? Show me your semantics? Look at something like the OWL semantics. In there, you will find language statements (syntax) like X SubClassOf: Y
and a bunch of formulae from model theory that describe how to interpret it - no easy read, and not really important for you now.
When we want reasoners to be faster at making inferences (computational complexity), we need to decrease expressivity. So we need to find a way to balance.
What are the most important practical applications of reasoning? There are many, and there will be many opinions, but in the OBO world, by far (95%) of all uses of reasoners pertain to the following:
inconsistent
- which means, totally broken. A slightly less bad, but still undesirable situation is that some of the classes in your ontologies break (in parlance, become unsatisfiable). This happens when you say some contradictory things about them. Reasoners help you find these unsatisfiable classes, and there is a special reasoning algorithm that can generate an explanation for you - to help fixing your problem.So in general, what is reasoning? There are probably a dozen or more official characterisations in the scientific literature, but from the perspective of biomedical ontologies, the question can be roughly split like this:
How can we capture what we know? This is the (research-) area of knowledge representation, logical formalisms, such as First Order Logic, Description Logic, etc. It is concerned with how we write down what we now:
All cars have four wheels\nIf you are a human, you are also a mammal\nIf you are a bird, you can fly (unless you are a penguin)\n
Lets think about a naive approach: using a fact-, or data-, base.
"},{"location":"reference/release-artefacts/","title":"Release artefacts","text":"For explanation of different release artefacts, please see discussion documentation on owl format variants
We made a first stab add defining release artefacts that should cover all use cases community-wide. We need to (1) agree they are all that is needed and (2) they are defined correctly in terms of ROBOT commands. This functionality replaces what was previously done using OORT.
"},{"location":"reference/release-artefacts/#terminology","title":"Terminology:","text":"The source ontology is the ontology we are talking about. A release artefact is a version of the ontology modified in some specific way, intended for public use. An import is a module of an external ontology which contains all the axioms necessary for the source ontology. A component is a file containing axioms that belong to the source ontology (but are for one reason or another, like definitions.owl, managed in a separate file). An axiom is said to be foreign if it 'belongs' to a different ontology, and native if it belongs to the source ontology. For example, the source ontology might have, for one reason or another, been physically asserted (rather than imported) the axiom TransitiveObjectProperty(BFO:000005). If the source ontology does not 'own' the BFO namespace, this axiom will be considered foreign.
There are currently 6 release defined in the ODK:
We discuss all of them here in detail.
"},{"location":"reference/release-artefacts/#release-artefact-1-base-required","title":"Release artefact 1: base (required)","text":"The base file contains all and only native axioms. No further manipulation is performed, in particular no reasoning, redundancy stripping or relaxation. This release artefact is going to be the new backbone of the OBO strategy to combat incompatible imports and consequent lack of interoperability. (Detailed discussions elsewhere, @balhoff has documentation). Every OBO ontology will contain a mandatory base release (should be in the official OBO recommendations as well).
The ROBOT command generating the base artefact: $(SRC): source ontology $(OTHER_SRC): set of component ontologies
$(ONT)-base.owl: $(SRC) $(OTHER_SRC)\n $(ROBOT) remove --input $< --select imports --trim false \\\n merge $(patsubst %, -i %, $(OTHER_SRC)) \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@\n
"},{"location":"reference/release-artefacts/#release-artefact-2-full-required","title":"Release artefact 2: full (required)","text":"The full release artefact contains all logical axioms, including inferred subsumptions. Redundancy stripping (i.e. redundant subclass of axioms) and typical relaxation operations are performed. All imports and components are merged into the full release artefact to ensure easy version management. The full release represents most closely the actual ontology as it was intended at the time of release, including all its logical implications. Every OBO ontology will contain a mandatory full release.
The ROBOT command generating the full artefact: $(SRC): source ontology $(OTHER_SRC): set of component ontologies
$(ONT)-full.owl: $(SRC) $(OTHER_SRC)\n $(ROBOT) merge --input $< \\\n reason --reasoner ELK \\\n relax \\\n reduce -r ELK \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@\n
"},{"location":"reference/release-artefacts/#release-artefact-3-non-classified-optional","title":"Release artefact 3: non-classified (optional)","text":"The non-classified release artefact reflects the 'unmodified state' of the editors file at release time. No operations are performed that modify the axioms in any way, in particular no redundancy stripping. As opposed to the base artefact, both component and imported ontologies are merged into the non-classified release.
The ROBOT command generating the full artefact: $(SRC): source ontology $(OTHER_SRC): set of component ontologies
$(ONT)-non-classified.owl: $(SRC) $(OTHER_SRC)\n $(ROBOT) merge --input $< \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@\n
"},{"location":"reference/release-artefacts/#release-artefact-4-simple-optional","title":"Release artefact 4: simple (optional)","text":"Many users want a release that can be treated as a simple existential graph of the terms defined in an ontology. This corresponds to the state of OBO ontologies before logical definitions and imports. For example, the only logical axioms in -simple release of CL will contain be of the form CL1 subClassOf CL2
or CL1 subClassOf R some CL3
where R is any objectProperty and CLn is a CL class. This role has be fulfilled by the -simple artefact, which up to now has been supported by OORT.
To construct this, we first need to assert inferred classifications, relax equivalentClass axioms to sets of subClassOf axioms and then strip all axioms referencing foreign (imported) classes. As ontologies occasionally end up with forieign classes and axioms merged into the editors file, we achieve this will a filter based on obo-namespace. (e.g. finding all terms with iri matching http://purl.obolibrary.org/obo/CL_{\\d}7).
The ROBOT command generating the full artefact: $(SRC): source ontology $(OTHER_SRC): set of component ontologies $(SIMPLESEED): all terms that 'belong' to the ontology
$(ROBOT) merge --input $< $(patsubst %, -i %, $(OTHER_SRC)) \\\n reason --reasoner {{ project.reasoner }} --equivalent-classes-allowed {{ project.allow_equivalents }} \\\n relax \\\n remove --axioms equivalent \\\n relax \\\n filter --term-file $(SIMPLESEED) --select \"annotations ontology anonymous self\" --trim true --signature true \\\n reduce -r {{ project.reasoner }} \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@.tmp.owl && mv $@.tmp.owl $@\n
NOTES: This requires $(ONTOLOGYTERMS) to include all ObjectProperties usesd. --select parents
is required for logical axioms to be retained, but results in a few upper-level classes bleeding through. We hope this will be fixed by further improvments to Monarch.
Some legacy users (e.g. MGI) require an OBO DAG version of -simple. OBO files derived from OWL are not guarenteed to be acyclic, but acyclic graphs can be achieved using judicious filtering of relationships (simple existential restrictions) by objectProperty. The -basic release artefact has historically fulfilled this function as part of OORT driven ontology releases. The default -basic version corresponds to the -simple artefact with only 'part of' relationships (BFO:0000050), but others may be added where ontology editors judge these to be useful and safe to add without adding cycles. We generate by taking the simple release and filtering it
The ROBOT command generating the full artefact: $(SRC): source ontology $(OTHER_SRC): set of component ontologies $(KEEPRELATIONS): all relations that should be preserved. $(SIMPLESEED): all terms that 'belong' to the ontology
$(ROBOT) merge --input $< $(patsubst %, -i %, $(OTHER_SRC)) \\\n reason --reasoner {{ project.reasoner }} --equivalent-classes-allowed {{ project.allow_equivalents }} \\\n relax \\\n remove --axioms equivalent \\\n remove --axioms disjoint \\\n remove --term-file $(KEEPRELATIONS) --select complement --select object-properties --trim true \\\n relax \\\n filter --term-file $(SIMPLESEED) --select \"annotations ontology anonymous self\" --trim true --signature true \\\n reduce -r {{ project.reasoner }} \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@ --output $@.tmp.owl && mv $@.tmp.owl $@\n
"},{"location":"reference/release-artefacts/#release-artefact-6-simple-non-classified-optional","title":"Release artefact 6: simple-non-classified (optional)","text":"This artefact caters to the very special and hopefully transient case of some ontologies that do not yet trust reasoning (MP, HP). The simple-non-classified artefact corresponds to the simple artefact, just without the reasoning step.
$(SRC): source ontology $(OTHER_SRC): set of component ontologies $(ONTOLOGYTERMS): all terms that 'belong' to the ontology
$(ONT)-simple-non-classified.owl: $(SRC) $(OTHER_SRC) $(ONTOLOGYTERMS)\n $(ROBOT) remove --input $< --select imports \\\n merge $(patsubst %, -i %, $(OTHER_SRC)) \\\n relax \\\n reduce -r ELK \\\n filter --term-file $(ONTOLOGYTERMS) --trim true \\\n annotate --ontology-iri $(ONTBASE)/$@ --version-iri $(ONTBASE)/releases/$(TODAY)/$@\n
"},{"location":"reference/semantic-engineering-toolbox/","title":"The Semantic OBO Engineer's Toolbox","text":"Essentials
Automation
make
, managed by git
.Text editors:
SPARQL query tool:
SPARQL endpoints
Templating systems
Ontology Mappings
Where to find ontologies and terms: Term browsers and ontology repositories
Ontology visualisation
Dot
.Other tools in my toolbox
These are a bit less essential than the above, but I consider them still tremendously useful.
make
, managed by git
.Semantic Data Engineering or Semantic Extract-Transform-Load (ETL) is an engineering discipline that is concerned with extracting information from a variety of sources, linking it together into a knowledge graph and enabling a range of semantic analyses for downstream users such as data scientists or researchers.
The following glossary only says how we use the terms we are defining, not how they are defined by some higher authority.
Term Definition Example Entity An entity is a thing in the world, like a molecule, or something more complex, like a disease. Entities do not have to be material, they can be processes as well, like cell proliferation. Marfan syndrome, H2O molecule, Ring finger, Phone Term A term is a sequence of characters (string) that refers to an entity in a precise way. SMOKER (referring to the role of being a smoker), HP:0004934 (see explanations below) Relation A link between two (or more) entities that signifies some kind of interaction.:A :loves :B
, :smoking :causes :cancer
Property A type of relation. The :causes
in :smoking :causes :cancer
"},{"location":"reference/semantic-etl/#getting-the-data","title":"Getting the data","text":"As a Semantic Engineer, you typically coordinate the data collection from three largely separate sources: 1. Unstructured text, for example a corpus of scientific literature 2. External biological databases, such as STRING, a database of Protein-Protein Interaction Networks. 3. Manual in-house bio-curation efforts, i.e. the manual translation and integration of information relevant to biology (or medicine) into a database.
Here, we are mostly concerned with the automated approaches of Semantic ETL, so we briefly touch on these and provide pointers to the others.
"},{"location":"reference/semantic-etl/#information-extraction-from-text","title":"Information Extraction from text","text":"The task of information extraction is concerned with extracting information from unstructured textual sources to enable identifying entities, like diseases, phenotypes and chemicals, as well as classifying them and storing them in a structured format.
The discipline that is concerned with techniques for extracting information from text is called Natural Language Processing (NLP).
NLP is a super exciting and vast engineering discipline which goes beyond the scope of this course. NLP is concerned with many problems such as document classification, speech recognition and language translation. In the context of information extraction, we are particularly interested in Named Entity Recognition (NER), and Relationship Extraction (ER).
"},{"location":"reference/semantic-etl/#named-entity-recognition","title":"Named Entity Recognition","text":"Named Entity Recognition (NER) is the task of identifying and categorising entities in text. NER tooling provides functionality to first isolate parts of sentence that correspond to things in the world, and then assigning them to categories (e.g. Drug, Disease, Publication).
For example, consider this sentence:
As in the X-linked Nettleship-Falls form of ocular albinism (300500), the patients showed reduced visual acuity, photophobia, nystagmus, translucent irides, strabismus, hypermetropic refractive errors, and albinotic fundus with foveal hypoplasia.\n
An NER tool would first identify the relevant sentence parts that belong together:
As in the [X-linked] [Nettleship-Falls] form of [ocular albinism] (300500), the patients showed [reduced visual acuity], [photophobia], [nystagmus], [translucent irides], [strabismus], [hypermetropic refractive errors], and [albinotic fundus] with [foveal hypoplasia].\n
And then categorise them according to some predefined categories:
As in the Phenotype[X-linked] [Nettleship-Falls] form of Disease[ocular albinism] (300500), the patients showed Phenotype[reduced visual acuity], Phenotype[photophobia], Phenotype[nystagmus], Phenotype[translucent irides], Phenotype[strabismus], Phenotype[hypermetropic refractive errors], and Phenotype[albinotic fundus] with Phenotype[foveal hypoplasia].\n
Interesting sources for further reading:
Relationship extraction (RE) is the task of extracting semantic relationships from text. RE is an important component for the construction of Knowledge Graphs from the Scientific Literature, a task that many Semantic Data Engineering projects pursue to augment or inform their manual curation processes.
Interesting sources for further reading:
There is a huge amount of literature and tutorials on the topic of integrating data, the practice of consolidating data from disparate sources into a single dataset. We want to emphasise here two aspects of data integration, which are of particular importance to the Semantic Data engineer.
Entity resolution (ER), sometimes called \"record linking\", is the task of disambiguating records that correspond to real world entities across and within datasets. This task as many dimensions, but for us, the most important one is mapping a string, for example the one that was matched by our Named Entity Recognition pipeline, to ontology terms.
Given our example:
As in the Phenotype[X-linked] Nettleship-Falls form of Phenotype[ocular albinism] (300500), the patients showed Phenotype[reduced visual acuity], Phenotype[photophobia], Phenotype[nystagmus], Phenotype[translucent irides], Phenotype[strabismus], Phenotype[hypermetropic refractive errors], and Phenotype[albinotic fundus] with Phenotype[foveal hypoplasia].\n
We could end up, for example, resolving ocular albinism to HP:0001107.
There are a lot of materials about Entity Resolution in general: - https://www.districtdatalabs.com/basics-of-entity-resolution - https://www.sciencedirect.com/topics/computer-science/entity-resolution
In effect the term Ontology Mapping, which is the focus of this lesson, is Entity Resolution for ontologies - usually we don't have problem to use the two terms synonymously, although you may find that the literature typically favours one or the other.
"},{"location":"reference/semantic-etl/#knowledge-graph-ontology-merging","title":"Knowledge Graph / Ontology merging","text":"Knowledge, Knowledge Graph or Ontology Merging are the disciplines concerned with combining all your data sources into a semantically coherent whole. This is a very complex research area, in particular to do this in a way that is semantically consistent. There are essentially two separate problems to be solved to achieve semantic merging: 1. The entities aligned during the entity resolution process must be aligned in the semantically correct way: if you you use logical equivalence to align them (owl:equivalentClasses
) the classes must mean absolutely the same thing, or else you may run into the hairball problem, in essence faulty equivalence cliques. In cases of close, narrow or broadly matching classes, the respective specialised semantically correct relationships need to be used in the merging process. 2. The axioms of the merged ontologies must be logically consistent. For example, one ontology may say: a disease is a material entity. Another: a disease is a process. A background, or upper, ontology such as the ubiquitous Basic Formal Ontology (BFO) furthermore says that a process is not a material entity and vice versa. Merging this two ontologies would cause logical inconsistency.
Unfortunately, the literature on ontology and knowledge graph merging is still sparse and very technical. You are probably best off checking out the OpenHPI course on Ontology Alignment, which is closely related.
"},{"location":"reference/sparql-basics/","title":"Basic SPARQL commands useful for OBO Engineers","text":""},{"location":"reference/sparql-basics/#basic-select-query","title":"Basic SELECT query","text":"A basic SELECT query contains a set of prefixes, a SELECT clause and a WHERE clause.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT ?term ?value\nWHERE {\n ?term rdfs:label ?value .\n}\n
"},{"location":"reference/sparql-basics/#prefixes","title":"Prefixes","text":"Prefixes allow you to specify shortcuts. For example, instead of using the prefixes above, you could have simply said:
SELECT ?term ?value\nWHERE {\n ?term <http://www.w3.org/2000/01/rdf-schema#label> ?value .\n}\n
Without the prefix. It means the exact same thing. But it looks nicer. Some people even go as far as adding entire entities into the prefix header:
PREFIX label: <http://www.w3.org/2000/01/rdf-schema#label>\n\nSELECT ?term ?value\nWHERE {\n ?term label: ?value .\n}\n
This query is, again, the same as the ones above, but even more concise.
"},{"location":"reference/sparql-basics/#select-clause","title":"SELECT clause","text":"The SELECT clause defines what you part of you query you want to show, for example, as a table.
SELECT ?term ?value\n
means: \"return\" or \"show\" whatever you find for the variable ?term
and the variable ?value
.
There are other cool things you can do in the SELECT clause:
This document contains template SPARQL queries that can be adapted. Comments are added in-code with #
above each step to explain them so that queries can be spliced together
note: we assume that all native terms here have the same namespace - that of the ontology
# select unique instances of the variable\nSELECT DISTINCT ?term\nWHERE {\n # selecting where the variable term is either used as a subject or object\n { ?s1 ?p1 ?term . }\n UNION\n { ?term ?p2 ?o2 . }\n # filtering out only terms that have the MONDO namespace (assumed to be native terms)\n FILTER(isIRI(?term) && (STRSTARTS(str(?term), \"http://purl.obolibrary.org/obo/MONDO_\")))\n}\n
"},{"location":"reference/sparql-reference/#report-of-terms-with-labels-containing-certain-strings-in-ubergraph","title":"Report of terms with labels containing certain strings in ubergraph","text":"# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nprefix BFO: <http://purl.obolibrary.org/obo/BFO_>\n\n# selecting only unique instances of the three variables\nSELECT DISTINCT ?entity ?label WHERE\n{\n # the variable label is a rdfs:label\n VALUES ?property {\n rdfs:label\n }\n\n # only look for uberon terms. note: this is only used in ubergraph, use filter for local ontology instead.\n ?entity rdfs:isDefinedBy <http://purl.obolibrary.org/obo/uberon.owl> .\n\n # defining the order of variables in the triple\n ?entity ?property ?label .\n # entity must be material\n ?entity rdfs:subClassOf BFO:0000040\n # filtering out triples where the variable label has sulcus or incisure, or fissure in it\n FILTER(contains(STR(?label), \"sulcus\")||contains(STR(?label), \"incisure\")||contains(STR(?label), \"fissure\"))\n\n}\n# arrange report by entity variable\nORDER BY ?entity\n
"},{"location":"reference/sparql-reference/#report-of-labels-and-definitions-of-terms-with-certain-namespace","title":"Report of labels and definitions of terms with certain namespace","text":"prefix label: <http://www.w3.org/2000/01/rdf-schema#label>\nprefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nprefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>\nprefix owl: <http://www.w3.org/2002/07/owl#>\n\n# select a report with 3 variables\nSELECT DISTINCT ?term ?label ?def\n\n# defining the properties to be used\n WHERE {\n VALUES ?defproperty {\n definition:\n }\n VALUES ?labelproperty {\n label:\n }\n\n# defining the order of the triples\n ?term ?defproperty ?def .\n ?term ?labelproperty ?label .\n\n# selects entities that are in a certain namespace\n FILTER(isIRI(?term) && (STRSTARTS(str(?term), \"http://purl.obolibrary.org/obo/CP_\")))\n}\n\n# arrange report by term variable\nORDER BY ?term\n
"},{"location":"reference/sparql-reference/#definition-lacks-xref","title":"Definition lacks xref","text":"adaptable for lacking particular annotation
# adding prefixes used\nprefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nprefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>\nprefix owl: <http://www.w3.org/2002/07/owl#>\n\nSELECT ?entity ?property ?value WHERE\n{\n # the variable property has to be defintion (IAO:0000115)\n VALUES ?property {\n definition:\n }\n # defining the order of variables in the triple\n ?entity ?property ?value .\n\n # selecting annotation on definition\n ?def_anno a owl:Axiom ;\n owl:annotatedSource ?entity ;\n owl:annotatedProperty definition: ;\n owl:annotatedTarget ?value .\n\n # filters out definitions which do not have a dbxref annotiton\n FILTER NOT EXISTS {\n ?def_anno oboInOwl:hasDbXref ?x .\n }\n\n # removes triples where entity is blank\n FILTER (!isBlank(?entity))\n # selects entities that are native to ontology (in this case MONDO)\n FILTER (isIRI(?entity) && STRSTARTS(str(?entity), \"http://purl.obolibrary.org/obo/MONDO_\"))\n\n}\n# arrange report by entity variable\nORDER BY ?entity\n
"},{"location":"reference/sparql-reference/#checks-wether-definitions-contain-underscore-characters","title":"Checks wether definitions contain underscore characters","text":"adaptable for checking if there is particular character in annotation
# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nprefix IAO: <http://purl.obolibrary.org/obo/IAO_>\nprefix definition: <http://purl.obolibrary.org/obo/IAO_0000115>\n\n# selecting only unique instances of the three variables\nSELECT DISTINCT ?entity ?property ?value WHERE\n{\n # the variable property has to be definition (IAO:0000115)\n VALUES ?property {\n definition:\n }\n # defining the order of variables in the triple\n ?entity ?property ?value .\n # filtering out triples where the variable value has _ in it\n FILTER( regex(STR(?value), \"_\"))\n # removes triples where entity is blank\n FILTER (!isBlank(?entity))\n}\n# arrange report by entity variable\nORDER BY ?entity\n
"},{"location":"reference/sparql-reference/#only-allowing-a-fix-set-of-annotation-properties","title":"Only allowing a fix set of annotation properties","text":"# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nprefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nprefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nprefix IAO: <http://purl.obolibrary.org/obo/IAO_>\nprefix RO: <http://purl.obolibrary.org/obo/RO_>\nprefix mondo: <http://purl.obolibrary.org/obo/mondo#>\nprefix skos: <http://www.w3.org/2004/02/skos/core#>\nprefix dce: <http://purl.org/dc/elements/1.1/>\nprefix dcterms: <http://purl.org/dc/terms/>\n\n# selecting only unique instances of the three variables\nSELECT DISTINCT ?term ?property ?value WHERE\n{\n # order of the variables in the triple\n ?term ?property ?value .\n # the variable property is an annotation property\n ?property a owl:AnnotationProperty .\n # selects entities that are native to ontology (in this case MONDO)\n FILTER (isIRI(?term) && regex(str(?term), \"^http://purl.obolibrary.org/obo/MONDO_\"))\n # removes triples where the variable value is blank\n FILTER(!isBlank(?value))\n # listing the allowed annotation properties\n FILTER (?property NOT IN (dce:creator, dce:date, IAO:0000115, IAO:0000231, IAO:0100001, mondo:excluded_subClassOf, mondo:excluded_from_qc_check, mondo:excluded_synonym, mondo:pathogenesis, mondo:related, mondo:confidence, dcterms:conformsTo, mondo:should_conform_to, oboInOwl:consider, oboInOwl:created_by, oboInOwl:creation_date, oboInOwl:hasAlternativeId, oboInOwl:hasBroadSynonym, oboInOwl:hasDbXref, oboInOwl:hasExactSynonym, oboInOwl:hasNarrowSynonym, oboInOwl:hasRelatedSynonym, oboInOwl:id, oboInOwl:inSubset, owl:deprecated, rdfs:comment, rdfs:isDefinedBy, rdfs:label, rdfs:seeAlso, RO:0002161, skos:broadMatch, skos:closeMatch, skos:exactMatch, skos:narrowMatch))\n}\n
"},{"location":"reference/sparql-reference/#checking-for-misused-replaced_by","title":"Checking for misused replaced_by","text":"adaptable for checking that a property is used in a certain way
# adding prefixes used\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\nPREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nPREFIX replacedBy: <http://purl.obolibrary.org/obo/IAO_0100001>\n\n# selecting only unique instances of the three variables\nSELECT DISTINCT ?entity ?property ?value WHERE {\n # the variable property is IAO_0100001 (item replaced by)\n VALUES ?property { replacedBy: }\n\n # order of the variables in the triple\n ?entity ?property ?value .\n # removing entities that have either owl:deprecated true or oboInOwl:ObsoleteClass (these entities are the only ones that should have replaced_by)\n FILTER NOT EXISTS { ?entity owl:deprecated true }\n FILTER (?entity != oboInOwl:ObsoleteClass)\n}\n# arrange report by entity variable\nORDER BY ?entity\n
"},{"location":"reference/sparql-reference/#count","title":"Count","text":""},{"location":"reference/sparql-reference/#count-class-by-prefixes","title":"Count class by prefixes","text":"# this query counts the number of classes you have with each prefix (eg number of MONDO terms, CL terms, etc.)\n\n# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix obo: <http://purl.obolibrary.org/obo/>\n\n# selecting 2 variables, prefix and numberOfClasses, where number of classes is a count of distinct cls\nSELECT ?prefix (COUNT(DISTINCT ?cls) AS ?numberOfClasses) WHERE\n{\n # the variable cls is a class\n ?cls a owl:Class .\n # removes any cases where the variable cls is blank\n FILTER (!isBlank(?cls))\n # Binds the variable prefix as the prefix of the class (eg. MONDO, CL, etc.). classes that do not have obo purls will come out as blank in the report.\n BIND( STRBEFORE(STRAFTER(str(?cls),\"http://purl.obolibrary.org/obo/\"), \"_\") AS ?prefix)\n}\n# grouping the count by prefix\nGROUP BY ?prefix\n
"},{"location":"reference/sparql-reference/#counting-subclasses-in-a-namespace","title":"Counting subclasses in a namespace","text":"# this query counts the number of classes that are subclass of CL:0000003 (native cell) that are in the pcl namespace\n\n# adding prefixes used\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX CL: <http://purl.obolibrary.org/obo/CL_>\nPREFIX PCL: <http://purl.obolibrary.org/obo/PCL_>\n\n# count the number of unique term\nSELECT (COUNT (DISTINCT ?term) as ?pclcells)\nWHERE {\n # the variable term is a class\n ?term a owl:Class .\n # the variable term has to be a subclass of CL:0000003, including those that are subclassof by property path\n ?term rdfs:subClassOf* CL:0000003\n # only count the term if it is in the pcl namespace\n FILTER(isIRI(?term) && (STRSTARTS(str(?term), \"http://purl.obolibrary.org/obo/PCL_\")))\n}\n
"},{"location":"reference/sparql-reference/#removing","title":"Removing","text":""},{"location":"reference/sparql-reference/#removes-all-ro-terms","title":"Removes all RO terms","text":"adaptable for removing all terms of a particular namespace
# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\n# removing triples\nDELETE {\n ?s ?p ?o\n}\nWHERE\n{\n {\n # the variable p must be a rdfs:label\n VALUES ?p {\n rdfs:label\n }\n # the variable s is an object property\n ?s a owl:ObjectProperty ;\n # the other variables can be anything else (note the above value restriction of p)\n ?p ?o\n # filter out triples where ?s starts with \"http://purl.obolibrary.org/obo/RO_\"\n FILTER (isIRI(?s) && STRSTARTS(str(?s), \"http://purl.obolibrary.org/obo/RO_\"))\n }\n}\n
"},{"location":"reference/sparql-reference/#deleting-axiom-annotations-by-prefix","title":"Deleting axiom annotations by prefix","text":"# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\n\n# delete triples\nDELETE {\n ?anno ?property ?value .\n}\nWHERE {\n # the variable property is either synonym_type: or source:\n VALUES ?property { synonym_type: source: }\n # structure of variable value and variable anno\n ?anno a owl:Axiom ;\n owl:annotatedSource ?s ;\n owl:annotatedProperty ?p ;\n owl:annotatedTarget ?o ;\n ?property ?value .\n # filter out the variable value which start with \"ICD10EXP:\"\n FILTER(STRSTARTS(STR(?value),\"ICD10EXP:\"))\n}\n
"},{"location":"reference/sparql-reference/#replacing","title":"Replacing","text":""},{"location":"reference/sparql-reference/#replace-oboinowlsource-with-oboinowlhasdbxref-in-synonyms-annotations","title":"Replace oboInOwl:source with oboInOwl:hasDbXref in synonyms annotations","text":"adaptable for replacing annotations properties on particular axioms
# adding prefixes used\nprefix owl: <http://www.w3.org/2002/07/owl#>\nprefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nprefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\n# delete triples where the relation is oboInOwl:source\nDELETE {\n ?ax oboInOwl:source ?source .\n}\n# insert triples where the variables ax and source defined above are used, but using oboInOwl:hasDbXref instead\nINSERT {\n ?ax oboInOwl:hasDbXref ?source .\n}\nWHERE\n{\n # restricting to triples where the property variable is in this list\n VALUES ?property { oboInOwl:hasExactSynonym oboInOwl:hasNarrowSynonym oboInOwl:hasBroadSynonym oboInOwl:hasCloseSynonym oboInOwl:hasRelatedSynonym } .\n # order of the variables in the triple\n ?entity ?property ?value .\n # structure on which the variable ax and source applies\n ?ax rdf:type owl:Axiom ;\n owl:annotatedSource ?entity ;\n owl:annotatedTarget ?value ;\n owl:annotatedProperty ?property ;\n oboInOwl:source ?source .\n # filtering out triples where entity is an IRI\n FILTER (isIRI(?entity))\n}\n
"},{"location":"reference/synonyms-obo/","title":"Synonyms in OBO","text":"A synonym indicates an alternative name for a term. Terms can have multiple synonyms.
"},{"location":"reference/synonyms-obo/#the-scope-of-a-synonym-may-fall-into-one-of-four-categories","title":"The scope of a synonym may fall into one of four categories:","text":""},{"location":"reference/synonyms-obo/#exact","title":"Exact","text":"The definition of the synonym is exactly the same as primary term definition. This is used when the same class can have more than one name.
For example, hereditary Wilms' tumor has the exact synonoym familial Wilms' tumor.
Additionally, translations into other languages are listed as exact synonyms. For example, the Plant Ontology list both Spanish and Japanese translations as exact synonyms; e.g. anther wall has exact synonym \u2018pared de la antera\u2019 (Spanish) and \u2018\u846f\u58c1 \u2018(Japanese).
"},{"location":"reference/synonyms-obo/#narrow","title":"Narrow","text":"The definition of the synonym is the same as the primary definition, but has additional qualifiers.
For example, pod is a narrow synonym of fruit.
Note - when adding a narrow synonym, please first consider whether a new subclass should be added instead of a narrow synonym. If there is any uncertainty, start a discussion on the GitHub issue tracker.
"},{"location":"reference/synonyms-obo/#broad","title":"Broad","text":"The primary definition accurately describes the synonym, but the definition of the synonym may encompass other structures as well. In some cases where a broad synonym is given, it will be a broad synonym for more than one ontology term.
For example, Cyst of eyelid has the broad synonym Lesion of the eyelid.
Note - when adding a broad synonym, please first consider whether a new superclass should be added instead of a broad synonym. If there is any uncertainty, start a discussion on the GitHub issue tracker.
"},{"location":"reference/synonyms-obo/#related","title":"Related","text":"This scope is applied when a word of phrase has been used synonymously with the primary term name in the literature, but the usage is not strictly correct. That is, the synonym in fact has a slightly different meaning than the primary term name. Since users may not be aware that the synonym was being used incorrectly when searching for a term, related synonyms are included.
For example, Autistic behavior has the related synonym Autism spectrum disorder.
"},{"location":"reference/synonyms-obo/#synonym-types","title":"Synonym types","text":"Synonyms can also be classified by types. The default is no type. The synonym types vary in each ontology, but some commonly used synonym types include:
Whenever possible, database cross-references (dbxrefs) for synonyms should be provided, to indicate the publication that used the synonym. References to PubMed IDs should be in the format PMID:XXXXXXX (no space). However, dbxrefs for synonyms are not mandatory in most ontologies.
"},{"location":"reference/tables-and-triples/","title":"Tables and Triples","text":"Tables and triples seem very different. Tables are familiar and predictable. Triples are weird and floppy. SQL is normal, SPARQL is bizarre, at least at first.
Tables are great, and they're the right tool for a lot of jobs, but they have their limitations. Triples shine when it comes to merging heterogeneous data. But it turns out that there's a clear path from tables to triples, which should help make RDF make more sense.
"},{"location":"reference/tables-and-triples/#tables","title":"Tables","text":"Tables are great! Here's a table!
first_name last_name Luke Skywalker Leia Organa Darth Vader Han SoloYou won't be surprised to find out that tables have rows and columns. Often each row corresponds to some thing that we want to talk about, such as a fictional character from Star Wars. Each column usually corresponds to some sort of property that those things might have. Then the cells contain the values of those properties for their respective row. We take some sort of complex information about the world, and we break it down along two dimensions: the things (rows) and their properties (columns).
"},{"location":"reference/tables-and-triples/#primary-keys","title":"Primary Keys","text":"Tables are great! We can add another name to our table:
first_name last_name Luke Skywalker Leia Organa Darth Vader Han Solo Anakin SkywalkerHmm. That's a perfectly good table, but it's not capturing the information that we wanted. It turns out (Spoiler Alert!) that Anakin Skywalker is Darth Vader! We might have thought that the rows of our table were describing individual people, but it turns out that they're just describing individual names. A person can change their name or have more than one name.
We want some sort of identifier that lets us pick out the same person, and distinguish them from all the other people. Sometimes there's a \"natural key\" that we can use for this purpose: some bit of information that uniquely identifies a thing. When we don't have a natural key, we can generate an \"artificial key\". Random strings and number can be good artificial keys, but sometimes a simple incrementing integer is good enough.
The main problem with artificial keys is that it's our job to maintain the link between the thing and the identifier that we gave it. We prefer natural keys because we just have to inspect that thing (in some way) to figure out what to call it. Even when it's possible, sometimes that's too much work. Maybe we could use a DNA sequence as a natural key for a person, but it probably isn't practical. We do use fingerprints and facial recognition, for similar things, though.
(Do people in Star Wars even have DNA? Or just midichlorions?)
Let's add a column with an artificial key to our table:
sw_id first_name last_name 1 Luke Skywalker 2 Leia Organa 3 Darth Vader 4 Han Solo 3 Anakin SkywalkerThis is our table of names, allowing a given person to have multiple names. But what we thought we wanted was a person table with one row for each person, like this:
sw_id first_name last_name 1 Luke Skywalker 2 Leia Organa 3 Darth Vader 4 Han SoloIn SQL we could assert that the \"sw_id\" column of the person table is a PRIMARY KEY. This means it must be unique. (It probably shouldn't be NULL either!)
The names in the person table could be the primary names that we use in our Star Wars database system, and we could have another alternative_name table:
sw_id first_name last_name 3 Anakin Skywalker"},{"location":"reference/tables-and-triples/#holes","title":"Holes","text":"Tables are great! We can add more columns to our person table:
sw_id first_name last_name occupation 1 Luke Skywalker Jedi 2 Leia Organa princess 3 Darth Vader 4 Han Solo scoundrelThe 2D pattern of a table is a strong one. It not only provides a \"slot\" (cell) for every combination of row and column, it also makes it very obvious when one of those slots is empty. What does it mean for a slot to be empty? It could mean many things.
For example, in the previous table in the row for Darth Vader, the cell for the \"occupation\" column is empty. This could mean that:
I'm sure I haven't captured all the possibilities. The point is that there's lot of possible reasons why a cell would be blank. So what can we do about it?
If our table is stored in a SQL database, then we have the option of putting a NULL value in the cell. NULL is pretty strange. It isn't TRUE and it isn't FALSE. Usually NULL values are excluded from SQL query results unless you are careful to ask for them.
The way that NULL works in SQL eliminates some of the possibilities above. SQL uses the \"closed-world assumption\", which is the assumption that if a statement is true then it's known to be true, and conversely that if it's not known to be true then it's false. So if Anakin's occupation is NULL in a SQL database, then as far as SQL is concerned, we must know that he doesn't have an occupation. That might not be what you were expecting!
The Software Carpentry module on Missing Data has more information.
"},{"location":"reference/tables-and-triples/#multiple-values","title":"Multiple Values","text":"Tables are great! Let's add even more information to our table:
sw_id first_name last_name occupation enemy 1 Luke Skywalker Jedi 3 2 Leia Organa princess 3 3 Darth Vader 1,2,4 4 Han Solo scoundrel 3We're trying to say that Darth Vader is the enemy of everybody else in our table. We're using the primary key of the person in the enemy column, which is good, but we've ended up with multiple values in the \"enemy\" column for Darth Vader.
In any table or SQL database you could make the \"enemy\" column a string, pick a delimiter such as the comma, and concatenate your values into a comma-separated list. This works, but not very well.
In some SQL databases, such as Postgres, you could given the \"enemy\" column an array type, so it can contain multiple values. You get special operators for querying inside arrays. This can work pretty well.
The usual advice is to break this \"one to many\" information into a new \"enemy\" table:
sw_id enemy 1 3 2 3 3 1 3 2 3 4 4 1Then you can JOIN the person table to the enemy table as needed.
"},{"location":"reference/tables-and-triples/#sparse-tables","title":"Sparse Tables","text":"Tables are great! Let's add even more information to our table:
sw_id first_name last_name occupation father lightsaber_color ship 1 Luke Skywalker Jedi 3 green 2 Leia Organa princess 3 3 Darth Vader red 4 Han Solo scoundrel Millennium FalconA bunch of these columns only apply to a few rows. Now we've got a lot more NULLs to deal with. As the number of columns increases, this can become a problem.
"},{"location":"reference/tables-and-triples/#property-tables","title":"Property Tables","text":"Tables are great! If sparse tables are a problem, then let's try to apply the same solution that worked for the \"many to one\" problem in the previous section.
name table:
sw_id first_name last_name 1 Luke Skywalker 2 Leia Organa 3 Darth Vader 4 Han Solo 3 Anakin Skywalkeroccupation table:
sw_id occupation 1 Jedi 2 princess 4 scoundrelenemy table:
sw_id enemy 1 3 2 3 3 1 3 2 3 4 4 1father table:
sw_id father 1 3 2 3lightsaber_color table:
sw_id lightsaber_color 1 green 3 redship table:
sw_id ship 4 Millennium FalconHmm. Yeah, that will work. But every query we write will need some JOINs. It feels like we've lost something.
"},{"location":"reference/tables-and-triples/#entity-attribute-value","title":"Entity, Attribute, Value","text":"Tables are great! But there's such a thing as too many tables. We started out with a table with a bunch of rows and a bunch of columns, and ended up with a bunch of tables with a bunch of rows but just a few columns.
I have a brilliant idea! Let's combine all these property tables into just one table, by adding a \"property\" column!
sw_id property value 1 first_name Luke 2 first_name Leia 3 first_name Darth 4 first_name Han 5 first_name Anakin 1 last_name Skywalker 2 last_name Skywalker 3 last_name Vader 4 last_name Solo 5 last_name Skywalker 1 occupation Jedi 2 occupation princess 4 occupation scoundrel 1 enemy 3 2 enemy 3 3 enemy 1 3 enemy 2 3 enemy 4 4 enemy 1 1 father 3 2 father 3 1 lightsaber_color green 3 lightsaber_color red 4 ship Millenium FalconIt turns out that I'm not the first one to think of this idea. People call it \"Entity, Attribute, Value\" or \"EAV\". People also call it an \"anti-pattern\", in other words: a clear sign that you've made a terrible mistake.
There are lots of circumstances in which one big, extremely generic table is a bad idea. First of all, you can't do very much with the datatypes for the property and value columns. They kind of have to be strings. It's potentially difficult to index. And tables like this are miserable to query, because you end up with all sorts of self-joins to handle.
But there's at least one use case where it turns out to work quite well...
"},{"location":"reference/tables-and-triples/#merging-tables","title":"Merging Tables","text":"Tables are great! Until they're not.
The strong row and column structure of tables makes them great for lots of things, but not so great for merging data from different sources. Before you can merge two tables you need to know all about:
So you need to know the schemas of the two tables before you can start merging them together. But if you happen to have two EAV tables then, as luck would have it, they already have the same schema!
You also need to know that you're talking about the same things: the rows have to be about the same things, you need to be using the same property names for the same things, and the cell values also need to line up. If only there was an open standard for specifying globally unique identifiers...
Yes, you guessed it: URLs (and URNs and URIs and IRIs)! Let's assume that we use the same URLs for the same things across the two tables. Since we're a close-knit community, we've come to an agreement on a Star Wars data vocabulary.
URLs are annoyingly long to use in databases, so let's use standard \"sw\" prefix to shorten them. Now we have table 1:
sw_id property value sw:1 sw:first_name Luke sw:2 sw:first_name Leia sw:3 sw:first_name Darth sw:4 sw:first_name Han sw:5 sw:first_name Anakin sw:1 sw:last_name Skywalker sw:2 sw:last_name Skywalker sw:3 sw:last_name Vader sw:4 sw:last_name Solo sw:5 sw:last_name Skywalker sw:1 sw:occupation sw:Jedi sw:2 sw:occupation sw:princess sw:4 sw:occupation sw:scoundreland table 2:
sw_id property value sw:1 sw:enemy sw:3 sw:2 sw:enemy sw:3 sw:3 sw:enemy sw:1 sw:3 sw:enemy sw:2 sw:3 sw:enemy sw:4 sw:4 sw:enemy sw:1 sw:1 sw:father sw:3 sw:2 sw:father sw:3 sw:1 sw:lightsaber_color green sw:3 sw:lightsaber_color red sw:4 sw:ship Millenium FalconTo merge these two tables, we simple concatenate them. It couldn't be simpler.
Wait, this looks kinda familiar...
"},{"location":"reference/tables-and-triples/#rdf","title":"RDF","text":"These tables are pretty much in RDF format. You just have to squint a little!
Each row of the table is a subject-predicate-object triple. Our subjects, predicates, and some objects are URLs. We also have some literal objects. We could turn this table directly into Turtle format with a little SQL magic (basically just concatenating strings):
SELECT \"@prefix sw: <http://example.com/sw_> .\"\nUNION ALL\nSELECT \"\"\nUNION ALL\nSELECT\nsw_id\n|| \" \"\n|| property\n|| \" \"\n|| IF(\nINSTR(value, \":\"),\nvalue, -- CURIE\n\"\"\"\" || value || \"\"\"\" -- literal\n)\n|| \" .\"\nFROM triple_table;\n
The first few lines will look like this:
@prefix sw: <http://example.com/sw_> .\n\nsw:1 sw:first_name \"Luke\" .\nsw:2 sw:first_name \"Leia\" .\nsw:3 sw:first_name \"Darth\" .\nsw:4 sw:first_name \"Han\" .\n
Two things we're missing from RDF are language tagged literals and typed literals. We also haven't used any blank nodes in our triple table. These are easy enough to add.
The biggest thing that's different about RDF is that it uses the \"open-world assumption\", so something may be true even though we don't have a triple asserting that it's true. The open-world assumption is a better fit than the closed-world assumption when we're integrating data on the Web.
"},{"location":"reference/tables-and-triples/#conclusion","title":"Conclusion","text":"Tables are great! We use them all the time, they're strong and rigid, and we're comfortable with them.
RDF, on the other hand, looks strange at first. For most common data processing, RDF is too flexible. But sometimes flexiblity is the most important thing.
The greatest strength of tables is their rigid structure, but that's also their greatest weakness. We saw a number of problems with tables, and how they could be overcome by breaking tables apart into smaller tables, until we got down to the most basic pattern: subject-predicate-object. Step by step, we were pushed toward RDF.
Merging tables is particularly painful. When working with data on the Web, merging is one of the most common and important operations, and so it makes sense to use RDF for these tasks. If self-joins with SQL is the worst problem for EAV tables, then SPARQL solves it.
These examples show that it's not really very hard to convert tables to triples. And once you've seen SPARQL, the RDF query language, you've seen one good way to convert triples to tables: SPARQL SELECT results are just tables!
Since it's straightforward to convert tables to triples and back again, make sure to use the right tool for the right job. When you need to merge heterogeneous data, reach for triples. For most other data processing tasks, use tables. They're great!
"},{"location":"reference/troublehooting-robot/","title":"Lessons learned from troubleshooting ROBOT","text":""},{"location":"reference/troublehooting-robot/#prerequisites","title":"Prerequisites","text":"Learn common mistakes when using ROBOT and how to troubleshoot and fix them.
"},{"location":"reference/troublehooting-robot/#lessons-learned","title":"Lessons learned","text":""},{"location":"reference/troublehooting-robot/#copying-pasting-especially-in-google-docs-can-introduce-unexpected-format-changes-in-row-2-of-the-template","title":"Copying-pasting (especially in google docs) can introduce unexpected format changes in row 2 of the template:","text":"Optional.get() cannot be called on an absent value
Use the -vvv option to show the stack trace.
Use the --help option to see usage information.
make: *** [mondo.Makefile:454: merge_template] Error 1
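For example, a failing template build can be re-run with verbose logging to reveal the underlying stack trace (the file names here are placeholders, not from this lesson):
# -vvv prints the full Java stack trace behind errors such as \"Optional.get() cannot be called on an absent value\"\nrobot template -vvv --template my-template.tsv --output my-ontology.owl\n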
On Wikidata, the following license applies:
\"All structured data from the main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License\"
Adding non-CC0 licensed OBO ontologies in full might be problematic due to license stacking.
IANAL, but my understanding is that as long as only URI mappings are created to OBO ontology terms, no licenses are breached (even if the ontology is not CC0)
"},{"location":"reference/wikidata/#why-map-obo-uris-to-wikidata","title":"Why map OBO uris to Wikidata?","text":"This tutorial is based off https://ontology101tutorial.readthedocs.io/en/latest/DL_QueryTab.html Created by: Melissa Haendel, Chris Mungall, David Osumi-Sutherland, Matt Yoder, Carlo Torniai, and Simon Jupp
"},{"location":"tutorial/basic-dl-query/#dl-query-tab","title":"DL query tab","text":"The DL query tab shown below provides an interface for querying and searching an ontology. The ontology must be classified by a reasoner before it can be queried in the DL query tab.
For this tutorial, we will be using cc.owl which can be found here.
Open cc.owl in Protege (use Open from URL and enter the https://raw.githubusercontent.com/OHSUBD2K/BDK14-Ontologies-101/master/BDK14_exercises/basic-dl-query/cc.owl
). Run the reasoner. Navigate to the DL Query tab.
Type organelle
into the box, and make sure subclasses
and direct subclasses
are ticked.
You can type any valid OWL class expression into the DL query tab. For example, to find all classes whose members are part_of a membrane, type part_of some membrane
and click execute
. Note the linking underscore for this relation in this ontology. Some ontologies do not use underscores for relations, in which case you'd need single quotes (i.e. part of
).
The OWL keyword and
can be used to make a class expression that is the intersection of two class expressions. For example, to find the classes in the red area below, we want to find subclasses of the intersection of the class organelle
and the class endoplasmic reticulum part
Note that we do not need to use the part
grouping classes in the gene ontology (GO). The same results can be obtained by querying for the intersection of the class organelle
and the restriction part_of some ER
\u2013 try this and see.
We can also ask for superclasses by ticking the boxes as below:
The or
keyword is used to create a class expression that is the union of two class expressions. For example: (WARNING: or
is not supported by ELK reasoner)
This is illustrated by the red area in the following Venn diagram:
For further exercises, please see https://ontology101tutorial.readthedocs.io/en/latest/EXERCISE_BasicDL_Queries.html
"},{"location":"tutorial/custom-qc/","title":"Tutorial: How to add custom quality checks with ODK","text":"This tutorial explains adding quality checks not included in the ROBOT Report.
"},{"location":"tutorial/custom-qc/#prerequisites","title":"Prerequisites","text":"You have completed the tutorials:
oboInOwl:creation_date
to the root_node
in the CAT Ontology.oboInOwl:creation_date
. It will return the class with the annotation if it's not of type xsd:dateTime
.PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n\nSELECT ?cls WHERE\n{\n ?cls oboInOwl:creation_date ?date .\n FILTER(DATATYPE(?date) != xsd:dateTime)\n}\n
Save the SPARQL query in the src/sparql
folder and name it [violation name]-violation.sparql
. In the case of the tutorial, date-as-string-violation.sparql
Add the check to the ODK config file. In the previous tutorial, this is located at ~/cato/src/ontology/cato-odk.yaml
. Inside robot_report
, add custom_sparql_checks
robot_report:\nuse_labels: TRUE\nfail_on: ERROR\nreport_on:\n- edit\ncustom_sparql_checks:\n- date-as-string\n
sh run.sh make update_repo\n
sh run.sh make sparql_test\nFAIL Rule ../sparql/date-as-string-violation.sparql: 1 violation(s)\ncls\nhttp://purl.obolibrary.org/obo/CATO_0000000\n
To fix this issue, we need to change the annotation value to xsd:dateTime
, and run the test again to verify that everything is good this time. sh run.sh make sparql_test\nPASS Rule ../sparql/date-as-string-violation.sparql: 0 violation(s)\n
Push the changes to your repository, and the custom checks will run whenever creating a new Pull Request, as detailed here.
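If you ever want to run a single custom check by hand rather than through the make targets, something along these lines should work from src/ontology (a sketch only; this is roughly what the sparql_test target does for you, using the edit file from this tutorial):
# robot verify exits with an error if the violation query returns any rows\nsh run.sh robot verify --input cato-edit.owl --queries ../sparql/date-as-string-violation.sparql --output-dir reports/\n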
"},{"location":"tutorial/custom-qc/#custom-checks-available-in-odk","title":"Custom checks available in ODK","text":"There are several checks already available in the ODK. If you'd like to add them, add the validation name in your ODK config file.
owldef-self-reference
: verify if the term uses itself in its own equivalent class definitionredundant-subClassOf
: verify if there are redundant subclasses between three classestaxon-range
: verify if the annotations present_in_taxon
or never_in_taxon
always use classes from NCBITaxoniri-range
: verify if the value for the annotations never_in_taxon
, present_in_taxon
, foaf:depicted_by
, oboInOwl:inSubset
and dcterms:contributor
are not an IRIiri-range-advanced
: same as iri-range
plus check for rdfs:seeAlso
annotationlabel-with-iri
: verify if there is IRI in the labelmultiple-replaced_by
: verify if an obsolete term has multiple replaced_by
termsterm-tracker-uri
: verify if the value for the annotation term_tracker_item is not URIillegal-date
: verify if the value for the annotations dcterms:date
, dcterms:issued
and dcterms:created
are of type xsd:date
and use the pattern YYYY-MM-DD
ROBOT report can also have custom quality checks.
custom_profile: TRUE
, in the ODK config file. robot_report:\nuse_labels: TRUE\nfail_on: ERROR\ncustom_profile: TRUE\nreport_on:\n- edit\ncustom_sparql_checks:\n- date-as-string\n
2. Create a SPARQL query with your quality check and save it at src/sparql
. There isn't a restriction on the file name. However, it should return the variables ?entity ?property ?value
. SELECT DISTINCT ?entity ?property ?value \nWHERE {\n ...\n}\n
src/ontology/profile.txt
file.ERROR file:../sparql/<file name>.sparql\n
For more detail on the profile file, see here. src/ontology/reports/cato-edit.owl-obo-report.tsv
. The Rule Name will be the SPARQL file name.sh run.sh make test\n
"},{"location":"tutorial/custom-qc/#how-to-choose-between-custom-sparql-or-custom-robot-report","title":"How to choose between Custom SPARQL or Custom ROBOT report","text":"entity
, property
and value
-> ROBOT reportKeep in mind that after changing the profile.txt
, you won't get any upcoming updates, and you need to update manually.
This tutorial is based off https://ontology101tutorial.readthedocs.io/en/latest/Disjointness.html Created by: Melissa Haendel, Chris Mungall, David Osumi-Sutherland, Matt Yoder, Carlo Torniai, and Simon Jupp
For this exercise, we will be using the chromosome-parts-interim.owl file that can be found here
"},{"location":"tutorial/disjointness/#disjointness_1","title":"Disjointness","text":"In the chromosome-parts-interim.owl file, at the top of our class hierarchy we have cell, cell part, chromosomal part, intracellular part, organelle and organelle part. By default, OWL assumes that these classes can overlap, i.e. there are individuals who can be instances of more than one of these classes. We want to create a restriction on our ontology that states these classes are different and that no individual can be a member of more than one of these classes. We can say this in OWL by creating a disjoint classes axiom.
If you do not already have it open, load your previous ontology that was derived from the 'interim file'. Note: you can open a recent file by going to File-> Open Recent
We want to assert that organelle
and organelle part
are disjoint. To do this first select the organelle
class. In the class 'Description' view, scroll down and select the (+) button next to Disjoint With. You are presented with the now familiar window allowing you to select, or type, to choose a class. In the hierarchy panel, you can use CTRL to select multiple classes. Select 'organelle part' as disjoint with organelle.
Note that the directionality is irrelevant. Prove this to yourself by deleting the disjoint axiom, and adding it back from organelle part
.
We have introduced a deliberate mistake into the ontology. We previously asserted that intracellular organelle part
is a subclass of both organelle part
and organelle
. We have now added an axiom stating that organelle
and organelle part
are disjoint. We can use the reasoner to check the consistency of our ontology. The reasoner should detect our contradiction.
Prot\u00e9g\u00e9 comes with several reasoners, and more can be installed via the plugins mechanism (see plugins chapter). Select a reasoner from the Reasoner menu (Elk, HermiT, Pellet, or Fact++ will work - we mostly use ELK). Once a reasoner is highlighted, select 'Start reasoner' from the menu. Note: you may get several pop-boxes/warnings, ignore those.
The intracellular organelle part
class will have changed to red indicating that the class is now unsatisfiable.
You can also see unsatisfiable classes by switching to the inferred view.
Here you will see a special class called Nothing
. Recall that we previously said all OWL classes are subclasses of OWL Thing; OWL Nothing
is a leaf class or bottom class of your ontology. Any classes that are deemed unsatisfiable by the reasoner are shown as subclasses or equivalent to OWL Nothing. The inferred view will show you all subclasses of Nothing.
Once the ontology is classified, inferred statements or axioms are shown in the various panels with a light-yellow shading. The class description for intracellular organelle part
should look something like the screen shot below. You will see that the class has been asserted equivalent to the Nothing
class. Inside this statement, a small question mark icon appears; clicking it will show an explanation from the reasoner for this inconsistency.
Select the (?) icon to get an explanation for this inconsistency. The explanation shows the axioms involved. We can see that the disjoint classes axiom, together with the two subclass axioms, is causing the inconsistency. We can simply repair this ontology by removing the intracellular organelle part
subClassOf organelle
axiom.
Remove the Disjoint with axiom (click the (x) beside organelle
in the Description pane for intracellular organelle part
), and resynchronise the reasoner from the reasoner menu.
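The same check can also be performed outside Prot\u00e9g\u00e9 with ROBOT, which can be handy for automation. This is only a sketch (the input file name follows this exercise): robot reason fails and lists the unsatisfiable classes while the contradiction is still present, and succeeds once it has been repaired.
# classify with ELK; any unsatisfiable class makes this command exit with an error\nrobot reason --reasoner ELK --input chromosome-parts-interim.owl --output chromosome-parts-reasoned.owl\n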
This is a very unprofessional video below recorded as part of one of our trainings. It walks you through this tutorial here, with some additional examples being given and a bit of Q&A.
"},{"location":"tutorial/dosdp-odk/#glossary","title":"Glossary","text":"SC 'part of' some %
which can be instantiated by ROBOT
to be transformed into an OWL axiom: SubClassOf(CATO:001 ObjectSomeValuesFrom(BFO:0000051 UBERON:123))
. Similarly, DOSDP YAML files are often referred to as \"templates\" (which is appropriate). Unfortunately, we often refer to them as \"patterns\", which is not strictly the right way to name them: they are templates that encode patterns (and then only to a limited extent). We recommend referring to the DOSDP YAML files as \"templates\".equivalentTo
or subClassOf
field: It tells DOSDP tools how to generate an OWL axiom, with which variable slots (vars
).This tutorial assumes you have set up an ODK repo with this config:
id: cato\ntitle: \"Cat Anatomy Ontology\"\ngithub_org: obophenotype\ngit_main_branch: main\nrepo: cat_anatomy_ontology\nrelease_artefacts:\n - base\n - full\n - simple\nprimary_release: full\nexport_formats:\n - owl\n - obo\n - json\nimport_group:\n products:\n - id: ro\n - id: pato\n - id: omo\nrobot_java_args: '-Xmx8G'\n
"},{"location":"tutorial/dosdp-odk/#activate-dosdp-in-odk","title":"Activate DOSDP in ODK","text":"In your src/ontology/{yourontology}-odk.yaml
file, simply add the following:
use_dosdps: true\n
This flag activates DOSDP in ODK - without it, none of the DOSDP workflows in ODK can be used. Technically, this flag tells ODK the following things:
src/ontology/Makefile
is extended as follows:pipelines
, or workflows, for processing patterns, e.g. pattern_schema_checks
for validating all DOSDP templates,patterns
to regenerate all patterns.src/patterns
, is created with the following files:src/patterns/pattern.owl
: This is an ontology of your own patterns. This can be used to browse the your pattern in the form of a class hierarchy, which can help greatly to understand how they relate logically. There are some flaws in this system, like occasional unintended equivalencies between patterns, but for most uses, it is doing ok.src/patterns/definitions.owl
: This is the merged ontology of all your DOSDP generated classes. Basically, if you manage your classes across multiple DOSDP patterns and tables, their generated OWL axioms will all be added to this file.src/patterns/external.txt
: This file can be used to import external patterns. Just add the (p)URL to a pattern to the file, and the DOSDP pipeline will import it when you run it. We use this a lot when sharing DOSDP templates across ontologies.src/patterns/data/default/
) and one in the src/patterns
directory. The former points you to the place where you should put, by default, any DOSDP data tables. More about that in the next sections.To fully activate DOSDP in your ontology, please run:
sh run.sh make update_repo\n
This will:
v1.3
, for example)Makefile
in certain ways(1) Create a new file src/patterns/dosdp-patterns/haircoat_colour_pattern.yaml
and paste the following content:
pattern_name: haircoat_colour_pattern\npattern_iri: http://purl.obolibrary.org/obo/obo-academy/patterns/haircoat_colour_pattern.yaml\n\ndescription: \"\nCaptures the multicoloured characteristic of the fur, i.e. spotted, dotted, motley etc.\"\n\nclasses:\ncolour_pattern: PATO:0001533\ncoat_of_hair: UBERON:0010166\n\nrelations:\nhas_characteristic: RO:0000053\n\nvars:\ncolour_pattern: \"'colour_pattern'\"\n\nname:\ntext: \"%s coat of hair\"\nvars:\n- colour_pattern\n\ndef:\ntext: \"A coat of hair with a %s colour pattern.\"\nvars:\n- colour_pattern\n\nequivalentTo:\ntext: \"'coat_of_hair' and 'has_characteristic' some %s\"\nvars:\n- colour_pattern\n
(2) Let's also create a simple template table to capture traits for our ontology.
Note: the filename of the DOSDP template file (haircoat_colour_pattern.yaml
) excluding the extension must be identical to the filename of the template table (haircoat_colour_pattern.tsv
) excluding the extension.
Let's create the new file at src/patterns/data/default/haircoat_colour_pattern.tsv
.
defined_class colour_pattern\nCATO:0000001 PATO:0000333\n
We are creating a minimal table here with just two columns:
defined_class
refers to the ID for the term that is being modelled by the template (mandatory for all DOSDP templates)colour_pattern
refers to the variable of the same name specified in the vars:
section of the DOSDP template YAML file.Next, we will get a bit used to various commands that help us with DOSDP-based ontology development.
Lets first try to transform the simple table above to OWL using the ODK pipeline (we always use IMP=false
to skip refreshing imports, which can be a lengthy process):
sh run.sh make ../patterns/definitions.owl -B IMP=false\n
This process will create the ../patterns/definitions.owl
file, which is the file that contains all axioms generated by all templates you have configured. In our simple scenario, this means a simple single pattern. Let us look at definitions.owl in your favourite text editor first.
Tip: Remember, the `-B` tells `make` to run the make command no matter what - one of the advantages of `make` is that it only runs a command again if something changed, for example, you have added something to a DOSDP template table.\n
Tip: Looking at ontologies in text editors can be very useful, both to reviewing files and making changes! Do not be afraid, the ODK will ensure you wont break anything.\n
Let us look in particular at the following section of the definitions.owl file:
# Class: <http://purl.obolibrary.org/obo/CATO_0000001> (http://purl.obolibrary.org/obo/PATO_0000333 coat of hair)\n\nAnnotationAssertion(<http://purl.obolibrary.org/obo/IAO_0000115> <http://purl.obolibrary.org/obo/CATO_0000001> \"A coat of hair with a http://purl.obolibrary.org/obo/PATO_0000333 colour pattern.\"^^xsd:string)\nAnnotationAssertion(rdfs:label <http://purl.obolibrary.org/obo/CATO_0000001> \"http://purl.obolibrary.org/obo/PATO_0000333 coat of hair\"^^xsd:string)\nEquivalentClasses(<http://purl.obolibrary.org/obo/CATO_0000001> ObjectIntersectionOf(<http://purl.obolibrary.org/obo/UBERON_0010166> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/RO_0000053> <http://purl.obolibrary.org/obo/PATO_0000333>)))\n
These are the three axioms / annotation assertions that were created by the DOSDP pipeline. The first annotation is a simple automatically generated definition. What is odd at first glance, is that the definition reads \"A coat of hair with a http://purl.obolibrary.org/obo/PATO_0000333 colour pattern.\"
- what does the PATO:0000333
IRI do in the middle of our definition? Understanding this is fundamental to the DOSDP pattern workflow, because it is likely that you will have to fix cases like this from time to time.
The DOSDP workflow is about generating axioms automatically from existing terms. For example, in this tutorial we are trying to generate terms for different kinds of hair coats for our cats, using the colour pattern
(PATO:0001533) hierarchy in the PATO ontology as a basis. The only term we have added so far is spotted
(PATO:0000333). The problem is though, that dosdp-tools
, the tool which is part of the ODK and responsible for the DOSDP workflows, does not know anything about PATO:0000333 unless it is already imported into the ontology. In order to remedy this situation, lets import the term:
sh run.sh make refresh-pato\n
ODK will automatically see that you have used PATO:0000333 in your ontology, and import it for you. Next, let us make sure that our edit file has the correct import configured. Open your ontology in a text editor, and make sure you can find the following import statement:
Import(<http://purl.obolibrary.org/obo/cato/patterns/definitions.owl>)\n
Replace cato
in the PURL with whatever is the ID of your own ontology. Also, do not forget to update src/ontology/catalog-v001.xml
, by adding this line:
<group id=\"Folder Repository, directory=, recursive=false, Auto-Update=false, version=2\" prefer=\"public\" xml:base=\"\">\n...\n<uri name=\"http://purl.obolibrary.org/obo/cato/patterns/definitions.owl\" uri=\"../patterns/definitions.owl\"/>\n...\n</group>\n
Important: Remember that we have not yet told dosdp-tools about the freshly imported PATO:0000333 term. To do that, lets run the DOSDP pipeline again:
sh run.sh make ../patterns/definitions.owl -B IMP=false\n
A quick look at src/patterns/definitions.owl
would now reveal your correctly formatted definitions:
AnnotationAssertion(<http://purl.obolibrary.org/obo/IAO_0000115> <http://purl.obolibrary.org/obo/CATO_0000001> \"A coat of hair with a spotted colour pattern.\"^^xsd:string)\n
Now, we are ready to view our ontology (the edit file, i.e. src/ontology/cato-edit.owl
) in Protege:
Still a few things to iron out - there is a UBERON term that we still need to import, and our class is not a subclass of the CATO root node
, but we had a good start.
Re-using terms is at the heart of the OBO philosophy, but when it comes to re-using axiom patterns, such as the ones we can define as part of a ROBOT template, we are (as of 2022) still in the early stages. One thing we can do to facilitate re-use is to share DOSDP templates between different projects. We do that by simply adding the URL at which the pattern is located to src/patterns/dosdp-patterns/external.txt
. Note: if you are copying a URL from GitHub, make sure it is the raw
url, i.e.:
https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/patterns/dosdp-patterns/entity_attribute.yaml\n
Here, we randomly decided to import a pattern defined by the Ontology of Biological Attributes (an ontology of traits such as tail length
or head size
), for example to represent cat traits in our Cat Ontology. After adding the above URL to our external.txt
file, we can add it to our pipeline:
sh run.sh make update_patterns\n
You will now see the entity_attribute.yaml
template in src/patterns/dosdp-patterns
. We will not do anything with this template as part of this tutorial, so you can remove it again if you wish (by removing the URL from the external.txt
file and physically deleting the src/patterns/dosdp-patterns/entity_attribute.yaml
file).
Sometimes, we want to manage more than one DOSDP pipeline at once. For example, in more than one of our projects, we have some patterns that are automatically generated by software tools, and others that are manually curated by ontology developers. In other use cases, we sometimes want to restrict the pattern pipelines to generating only logical axioms. In either case, we can add new pipelines by adding the following to the src/ontology/youront-odk.yaml
file:
pattern_pipelines_group:\n products:\n - id: manual\n dosdp_tools_options: \"--obo-prefixes=true --restrict-axioms-to=logical\"\n - id: auto\n dosdp_tools_options: \"--obo-prefixes=true\"\n
This does the following: It tells the ODK that you want
"},{"location":"tutorial/dosdp-odk/#reference","title":"Reference","text":""},{"location":"tutorial/dosdp-odk/#a-full-example-odk-configuration","title":"A full example ODK configuration","text":"id: cato\ntitle: \"Cat Anatomy Ontology\"\ngithub_org: obophenotype\ngit_main_branch: main\nuse_dosdps: TRUE\nrepo: cat_anatomy_ontology\nrelease_artefacts:\n - base\n - full\n - simple\nprimary_release: full\nexport_formats:\n - owl\n - obo\n - json\nimport_group:\n products:\n - id: ro\n - id: pato\n - id: omo\nrobot_java_args: '-Xmx8G'\npattern_pipelines_group:\n products:\n - id: manual\n dosdp_tools_options: \"--obo-prefixes=true --restrict-axioms-to=logical\"\n - id: auto\n dosdp_tools_options: \"--obo-prefixes=true\"\n
"},{"location":"tutorial/dosdp-odk/#odk-configuration-reference-for-dosdp","title":"ODK configuration reference for DOSDP","text":"Flag Explanation use_dosdps: TRUE Activates DOSDP in your ODK repository setup pattern_pipelines_group:products: - id: manual dosdp_tools_options: \"--obo-prefixes=true --restrict-axioms-to=logical\" Adding a manual
pipeline to your DOSDP setup in which only logical axioms are generated."},{"location":"tutorial/dosdp-overview/","title":"Getting started with DOSDP templates","text":"Dead Simple OWL Design patterns (DOSDP) is a templating system for documenting and generating new OWL classes. The templates themselves are designed to be human readable and easy to author. Separate tables (TSV files) are used to specify individual classes.
The complete DOSDP documentation can be found here http://incatools.github.io/dead_simple_owl_design_patterns/.
For another DOSDP tutorial see here.
"},{"location":"tutorial/dosdp-overview/#anatomy-of-a-dosdp-file","title":"Anatomy of a DOSDP file:","text":"A DOSDP tempaltes are written in YAML) file, an easily editable format for encoding nested data structures. At the top level of nesting is a set of 'keys', which must match those specified in the DOSDP standard. The various types of key and their function are outlined below. Each key is followed by a colon and then a value, which may be a text string, a list or another set of keys. Lists items are indicated using a '-'. Nesting is achieved via indenting using some standard number of spaces (typically 3 or 4). Here's a little illustration:
key1: some text\nkey2:\n- first list item (text; note the indent)\n- second list item\nkey3:\nkey_under_key3: some text\nanother_key_under_key3:\n- first list item (text; note the indent)\n- second list item\nyet_another_key_under_key3:\nkey_under_yet_another_key_under_key3: some more text\n
In the following text, keys and values together are sometimes referred to as 'fields'.
"},{"location":"tutorial/dosdp-overview/#pattern-level-keys","title":"Pattern level keys","text":"Reference doc
A set of fields that specify general information about a pattern: name, description, IRI, contributors, examples etc
e.g.
pattern_name: abnormalAnatomicalEntity\npattern_iri: http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml\ndescription: \"Any unspecified abnormality of an anatomical entity.\"\n\ncontributors:\n- https://orcid.org/0000-0002-9900-7880\n
"},{"location":"tutorial/dosdp-overview/#dictionaries","title":"Dictionaries","text":"Reference doc
A major aim of the DOSDP system is to produce self-contained, human-readable templates. Templates need IDs in order to be reliably used programatically, but templates that only use IDs are not human readable. DOSDPs therefore include a set of dictionaries that map labels to IDs. Strictly any readable name can be used, but by convention we use class labels. IDs must be OBO curie style e.g. CL:0000001).
Separate dictionaries are required for classes, relations (object properties) & annotationProperties e.g.
classes:\nquality: PATO:0000001\nabnormal: PATO:0000460\nanatomical entity: UBERON:0001062\n\nrelations:\ninheres_in_part_of: RO:0002314\nhas_modifier: RO:0002573\nhas_part: BFO:0000051\n
"},{"location":"tutorial/dosdp-overview/#variables","title":"Variables","text":"Reference doc
These fields specify the names of pattern variables (TSV column names) and map these to a range. e.g. This specifies a variable called 'anatomy' with the range 'anatomical entity':
vars:\nanatomy: \"'anatomical entity'\"\n
The var name (anatomy) corresponds to a column name in the table (TSV file) used in combination with this template, to generate new terms based on the template. The range specifies what type of term is allowed in this column - in this case 'anatomical entity' (UBERON:0001062; as specified in the dictionary) or one of its subclasses, e.g.-
anatomy UBERON:0001154There are various types of variables:
vars
are used to specify OWL classes (see example above). data_vars and data_list_vars are used to specify single pieces or data lists respectively. The range of data_vars is specified using XSD types. e.g.
data_vars:\nnumber: xsd:int\n\ndata_list_vars:\nxrefs: xsd:string\n
A table used to specify classes following this pattern could have the following content. Note that in lists, multiple elements are separated by a '|'.
number xrefs 1 pubmed:123456|DOI:10.1016/j.cell.2016.07.054"},{"location":"tutorial/dosdp-overview/#template-fields","title":"Template fields","text":"Template fields are where the content of classes produced by the template is specified. These mostly follow printf format: A text
field has variable slots specified using %s (for strings), %d for integers and %f for floats (decimals). Variables slots are filled, in order of appearance in the text, with values coming from a list of variables in an associated vars
field e.g.
name:\ntext: \"%s of %s\"\nvars:\n- neuron\n- brain_region\n
If the value associated with the neuron var is (the class) 'glutamatergic neuron' and the value associated with the = 'brain region' var is 'primary motor cortext', this will generate a classes with the name (label) \"glutamatergic neuron of primary motor cortex\".
"},{"location":"tutorial/dosdp-overview/#obo-fields","title":"OBO fields","text":"Reference doc
DOSDPs include a set of convenience fields for annotation of classes that follow OBO conventions for field names and their mappings to OWL annotation properties. These include name
, def
, comment
, namespace
. When the value of a var is an OWL class, the name (label) of the var is used in the substitution. (see example above).
The annotation axioms generated by these template fields can be annotated. One OBO field exists for this purpose: xrefs
allows annotation with a list of references using the obo standard xref annotation property (curies)
e.g.
data_list_vars:\nxrefs: xsd:string\n\ndef:\ntext: \"Any %s that has a soma located in the %s\"\nvars:\n- neuron\n- brain_region\nxrefs: xrefs\n
"},{"location":"tutorial/dosdp-overview/#logical-axioms-convenience-fields","title":"Logical axioms convenience fields","text":"Reference doc
Where a single equivalent Class, subclassOf or GCI axiom is specified, you may use the keys 'EquivalentTo', 'subClassOf' or 'GCI' respectively. If multiple axioms of any type are needed, use the core field logical_axioms
.
annotations:\n- annotationProperty:\ntext:\nvars:\nannotations: ...\n- annotationProperty:\ntext:\nvars:\n\nlogical_axioms:\n- axiom_type: subClassOf\ntext:\nvars:\n-\n-\n- axiom_type: subClassOf\ntext:\nvars:\n-\n-\nannotations:\n- ...\n
"},{"location":"tutorial/dosdp-overview/#advanced-usage","title":"Advanced usage:","text":""},{"location":"tutorial/dosdp-overview/#optionals-and-multiples-0-many","title":"Optionals and multiples (0-many)","text":"TBA
"},{"location":"tutorial/dosdp-overview/#using-dosdp-templates-in-odk-workflows","title":"Using DOSDP templates in ODK Workflows","text":"The Ontology Development Kit (ODK) comes with a few pre-configured workflows involving DOSDP templates. For a detailed tutorial see here.
"},{"location":"tutorial/dosdp-template/","title":"Dead Simple Ontology Design Patterns (DOSDP)","text":"Note: This is an updated Version of Jim Balhoff's DOSDP tutorial here.
The main use case for dosdp-tools
(and the DOS-DP framework) is managing a set of ontology terms, which all follow a common logical pattern, by simply collecting the unique aspect of each term as a line in a spreadsheet. For example, we may be developing an ontology of environmental exposures. We would like to have terms in our ontology which represent exposure to a variety of stressors, such as chemicals, radiation, social stresses, etc.
To maximize reuse and facilitate data integration, we can build our exposure concepts by referencing terms from domain-specific ontologies, such as the Chemical Entities of Biological Interest Ontology (ChEBI) for chemicals. By modeling each exposure concept in the same way, we can use a reasoner to leverage the chemical classification provided by ChEBI to provide a classification for our exposure concepts. Since each exposure concept has a logical definition based on our data model for exposure, there is no need to manually manage the classification hierarchy. Let's say our model for exposure concepts holds that an \"exposure\" is an event with a particular input (the thing the subject is exposed to):
'exposure to X' EquivalentTo 'exposure event' and 'has input' some X
If we need an ontology class to represent 'exposure to sarin' (bad news!), we can simply use the term sarin from ChEBI, and create a logical definition:
'exposure to sarin' EquivalentTo 'exposure event' and 'has input' some sarin
We can go ahead and create some other concepts we need for our exposure data:
'exposure to asbestos' EquivalentTo 'exposure event' and 'has input' some asbestos
'exposure to chemical substance' EquivalentTo 'exposure event' and 'has input' some 'chemical substance'
These definitions again can reference terms provided by ChEBI: asbestos and chemical substance
"},{"location":"tutorial/dosdp-template/#classifying-our-concepts","title":"Classifying our concepts","text":"Since the three concepts we've created all follow the same logical model, their hierarchical relationship can be logically determined by the relationships of the chemicals they reference. ChEBI asserts this structure for those terms:
'chemical substance'\n |\n |\n --------------\n | |\n | |\nsarin asbestos\n
Based on this, an OWL reasoner can automatically tell us the relationships between our exposure concepts:
'exposure to chemical substance'\n |\n |\n --------------------------\n | |\n | |\n'exposure to sarin' 'exposure to asbestos'\n
To support this, we simply need to declare the ChEBI OWL file as an owl:import
in our exposure ontology, and use an OWL reasoner such as ELK.
Creating terms by hand like we just did works fine, and relying on the reasoner for the classification will save us a lot of trouble and maintain correctness as our ontology grows. But since all the terms use the same logical pattern, it would be nice to keep this in one place; this will help make sure we always follow the pattern correctly when we create new concepts. We really only need to store the list of inputs (e.g. chemicals) in order to create all our exposure concepts. As we will see later, we may also want to manage separate sets of terms that follow other, different, patterns. To do this with dosdp-tools
, we need three main files: a pattern template, a spreadsheet of pattern fillers, and a source ontology. You will also usually need a file of prefix definitions so that the tool knows how to expand your shortened identifiers into IRIs.
For our chemical exposures, getting the source ontology is easy: just download chebi.owl. Note\u2014it's about 450 MB.
For our pattern fillers spreadsheet, we just need to make a tab-delimited file containing the chemical stressors for which we need exposure concepts. The file needs a column for the term IRI to be used for the generated class (this column is always called defined_class
), and also a column for the chemical to reference (choose a label according to your data model). It should look like this:
defined_class input\nEXPOSO:1 CHEBI:75701\nEXPOSO:2 CHEBI:46661\nEXPOSO:3 CHEBI:59999\n
The columns should be tab-separated\u2014you can download a correctly formatted file to follow along. For now you will just maintain this file by hand, adding chemicals by looking up their ID in ChEBI, and manually choosing the next ID for your generated classes. In the future this may be simplified using the DOS-DP table editor, which is under development.
The trickiest part to DOS-DP is creating your pattern template (but it's not so hard). Pattern templates are written in YAML, a simple file format based on keys and values. The keys are text labels; values can be plain values, another key-value structure, or a list. The DOS-DP schema specifies the keys and values which can be used in a pattern file. We'll use most of the common entries in this example. Read the comments (lines starting with #) for explanation of the various fields:
# We can provide a name for this pattern here.\npattern_name: exposure_with_input\n\n# In 'classes', we define the terms we will use in this pattern.\n# In the OBO community the terms often have numeric IDs, so here\n# we can provide human-readable names we can use further in the pattern.\n# The key is the name to be used; the value is the ID in prefixed form (i.e. a CURIE).\nclasses:\nexposure event: ExO:0000002\nThing: owl:Thing\n\n# Use 'relations' the same way as 'classes',\n# but for the object properties used in the pattern.\nrelations:\nhas input: RO:0002233\n\n# The 'vars' section defines the various slots that can be\n# filled in for this pattern. We have only one, which we call 'input'.\n# The value is the range, meaning the class of things that are valid\n# values for this pattern. By specifying owl:Thing, we're allowing any\n# class to be provided as a variable filler. You need a column in your\n# spreadsheet for each variable defined here, in addition to the `defined class` column.\nvars:\ninput: \"Thing\"\n\n# We can provide a template for an `rdfs:label` value to generate\n# for our new term. dosdp-tools will search the source ontology\n# to find the label for the filler term, and fill it into the\n# name template in place of the %s.\nname:\ntext: \"exposure to %s\"\nvars:\n- input\n\n# This works the same as label generation, but instead creates\n# a definition annotation.\ndef:\ntext: \"A exposure event involving the interaction of an exposure receptor to %s. Exposure may be through a variety of means, including through the air or surrounding medium, or through ingestion.\"\nvars:\n- input\n\n# Here we can generate a logical axiom for our new concept. Create an\n# expression using OWL Manchester syntax. The expression can use any\n# of the terms defined at the beginning of the pattern. A reference\n# to the variable value will be inserted in place of the %s.\nequivalentTo:\ntext: \"'exposure event' and 'has input' some %s\"\nvars:\n- input\n
Download the pattern template file to follow along.
Now we only need one more file before we can run dosdp-tools
. A file of prefix definitions (also in YAML format) will specify how to expand the CURIEs we used in our spreadsheet and pattern files:
EXPOSO: http://example.org/exposure/\n
Here we are specifying how to expand our EXPOSO
prefix (used in our spreadsheet defined_class
column). To expand the others, we'll pass a convenience option to dosdp-tools
, --obo-prefixes
, which will activate some predefined prefixes such as owl:
, and handle any other prefixes using the standard expansion for OBO IDs: http://purl.obolibrary.org/obo/PREFIX_
. Here's a link to the prefixes file.
Now we're all set to run dosdp-tools
! If you've downloaded or created all the necessary files, run this command to generate your ontology of exposures (assuming you've added the dosdp-tools
to your Unix PATH):
dosdp-tools generate --obo-prefixes=true --prefixes=prefixes.yaml --infile=exposure_with_input.tsv --template=exposure_with_input.yaml --ontology=chebi.owl --outfile=exposure_with_input.owl\n
This will apply the pattern to each line in your spreadsheet, and save the result in an ontology saved at exposure_with_input.owl
(it should look something like this). If you take a look at this ontology in a text editor or in Prot\u00e9g\u00e9, you'll see that it contains three classes, each with a generated label, text definition, and equivalent class definition. You're done!
Well... you're sort of done. But wouldn't it be nice if your exposure ontology included some information about the chemicals you referenced? Without this our reasoner can't classify our exposure concepts. As we said above, we could add an owl:import
declaration and load all of ChEBI, but your exposure ontology has three classes and ChEBI has over 120,000 classes. Instead, we can use the ROBOT tool to extract a module of just the relevant axioms from ChEBI. Later, we will also see how to use ROBOT to merge the outputs from multiple DOS-DP patterns into one ontology. You can download ROBOT from its homepage.
ROBOT has a few different methods for extracting a subset from an ontology. We'll use the Syntactic Locality Module Extractor (SLME) to get a set of axioms relevant to the ChEBI terms we've referenced. ROBOT will need a file containing the list of terms. We can use a Unix command to get these out of our spreadsheet file:
sed '1d' exposure_with_input.tsv | cut -f 2 >inputs.txt\n
We'll end up with a simple list:
CHEBI:75701\nCHEBI:46661\nCHEBI:59999\n
Now we can use ROBOT to extract an SLME bottom module for those terms out of ChEBI:
robot extract --method BOT --input chebi.owl --term-file inputs.txt --output chebi_extract.owl\n
Our ChEBI extract only has 63 classes. Great! If you want, you can merge the ChEBI extract into your exposure ontology before releasing it to the public:
robot merge --input exposure_with_input.owl --input chebi_extract.owl --output exposo.owl\n
Now you can open exposo.owl
in Prot\u00e9g\u00e9, run the reasoner, and see a correct classification for your exposure concepts! You may notice that your ontology is missing labels for ExO:0000002
('exposure event') and RO:0002233
('has input'). If you want, you can use ROBOT to extract that information from ExO and RO.
You will often want to generate ontology modules using more than one DOS-DP pattern. For example, you may want to organize environmental exposures by an additional axis of classification, such as exposure to substances with various biological roles, based on information provided by ChEBI. This requires a slightly different logical expression, so we'll make a new pattern:
pattern_name: exposure_with_input_with_role\n\nclasses:\nexposure event: ExO:0000002\nThing: owl:Thing\n\nrelations:\nhas input: RO:0002233\nhas role: RO:0000087\n\nvars:\ninput: \"Thing\"\n\nname:\ntext: \"exposure to %s\"\nvars:\n- input\n\ndef:\ntext: \"A exposure event involving the interaction of an exposure receptor to a substance with %s role. Exposure may be through a variety of means, including through the air or surrounding medium, or through ingestion.\"\nvars:\n- input\n\nequivalentTo:\ntext: \"'exposure event' and 'has input' some ('has role' some %s)\"\nvars:\n- input\n
Let's create an input file for this pattern, with a single filler, neurotoxin:
defined_class input\nEXPOSO:4 CHEBI:50910\n
Now we can run dosdp-tools
for this pattern:
dosdp-tools generate --obo-prefixes --prefixes=prefixes.yaml --infile=exposure_with_input_with_role.tsv --template=exposure_with_input_with_role.yaml --ontology=chebi.owl --outfile=exposure_with_input_with_role.owl\n
We can re-run our ChEBI module extractor, first appending the terms used for this pattern to the ones we used for the first pattern:
sed '1d' exposure_with_input_with_role.tsv | cut -f 2 >>inputs.txt\n
And then run robot extract
exactly as before:
robot extract --method BOT --input chebi.owl --term-file inputs.txt --output chebi_extract.owl\n
Now we just want to merge both of our generated modules, along with our ChEBI extract:
robot merge --input exposure_with_input.owl --input exposure_with_input_with_role.owl --input chebi_extract.owl --output exposo.owl\n
If you open the new exposo.owl
in Prot\u00e9g\u00e9 and run the reasoner, you'll now see 'exposure to sarin' classified under both 'exposure to chemical substance' and also 'exposure to neurotoxin'.
By using dosdp-tools
and robot
together, you can effectively develop ontologies which compose parts of ontologies from multiple domains using standard patterns. You will probably want to orchestrate the types of commands used in this tutorial within a Makefile, so that you can automate this process for easy repeatability.
Exomiser is a Java program that ranks potential rare Mendelian disease-causing variants from whole-exome or whole-genome sequencing data. Starting from a patient's VCF file and a set of phenotypes encoded using the Human Phenotype Ontology (HPO), it will annotate, filter and prioritise likely causative variants. The program does this based on user-defined criteria such as a variant's predicted pathogenicity, frequency of occurrence in a population and also how closely the given patient's phenotype matches any known phenotype of genes from human disease and model organism data.
In this tutorial, we will learn how to install and run Exomiser with Docker, and how to interpret the results in various output formats detailing the predicted causative genes and variants. If you prefer to work locally, instructions are also provided below for Windows and Linux/Mac users.
The complete Exomiser documentation can be found here, including some relevant references here, and the Exomiser GitHub repository here.
Please note that this tutorial is up-to-date with the current latest release 13.2.0 and data version up to 2302 (Feb 2023).
"},{"location":"tutorial/exomiser-tutorial/#prerequisites","title":"PREREQUISITES","text":"You know:
You have:
Docker installed and running on your machine. Check out this simple guide to set up Docker for Windows or Docker for Mac.
We recommend to have Exomiser installed via Docker prior to the tutorial. Open a terminal and run the command below:
docker pull exomiser/exomiser-cli:13.2.0\n
Alternatively:
# download the data via\nwget https://github.com/iQuxLE/Exomiser-Tutorial/raw/main/Exomiser-Tutorial.zip\n# OR clone the repository\ngit clone https://github.com/iQuxLE/Exomiser-Tutorial.git\n\n# unzip\nunzip Exomiser-Tutorial.zip\n
Since the VCF files for parts of the example data are relatively large, you need to download the following separately and put it into the Exomiser-Tutorial
folder. # download\nwget https://github.com/iQuxLE/Exomiser-Tutorial/raw/main/pfeiffer-family-vcf.zip\n# unzip\nunzip pfeiffer-family-vcf.zip -d Exomiser-Tutorial/exomiser-config/\n
The Exomiser-Tutorial
folder contains a directory called exomiser-config
(with all the VCF and analysis files) and exomiser-overview
(with some introductory slides).
# create an empty directory for exomiser-data within the Exomiser-Tutorial folder:\ncd /path/to/Exomiser-Tutorial/\nmkdir exomiser-data\ncd exomiser-data\n# download the data\nwget https://data.monarchinitiative.org/exomiser/latest/2302_phenotype.zip # for the phenotype database\nwget https://data.monarchinitiative.org/exomiser/latest/2302_hg19.zip # for the hg19 variant database\n# unzip the data\nunzip \"2302_*.zip\"\n
Otherwise, visit the links and download the data in your own exomiser-data
directory:
2302 phenotype database
2302 hg19 variant database
Install 7-Zip for unzipping the database files. The built-in archiving software has issues extracting the zip files. Extract the database files (2302_phenotype.zip
, 2302_hg19.zip
) by right-clicking the archive and selecting 7-Zip > Extract files\u2026 into the exomiser-data
directory.
Your Exomiser-Tutorial
directory should now be structured as follows:
Exomiser-Tutorial\n \u251c\u2500\u2500 exomiser-config\n \u251c\u2500\u2500 exomiser-data\n \u2502 \u251c\u2500\u2500 2302_hg19\n \u2502 \u2514\u2500\u2500 2302_phenotype\n \u2514\u2500\u2500 exomiser-overview\n \u2514\u2500\u2500 exomiser-tutorial-slides\n
"},{"location":"tutorial/exomiser-tutorial/#outline-of-the-tutorial","title":"Outline of the tutorial","text":"For a quick overview of Exomiser take a look at the slides located in the Google Drive or GitHub repo.
"},{"location":"tutorial/exomiser-tutorial/#exomiser-installation","title":"Exomiser installation","text":""},{"location":"tutorial/exomiser-tutorial/#via-docker","title":"via Docker","text":"(recommended to be installed prior to the tutorial; if you run the command below again, you should receive the message \"Image is up to date for exomiser/exomiser-cli:13.2.0\")
docker pull exomiser/exomiser-cli:13.2.0\n
"},{"location":"tutorial/exomiser-tutorial/#via-windows","title":"via Windows","text":"exomiser-cli-13.2.0-distribution.zip
distribution from Monarch.2302_hg19.zip
and phenotype 2302_phenotype.zip
data files from Monarch.exomiser-cli-13.2.0-distribution.zip
and selecting 7-Zip > Extract Here2302_phenotype.zip
, 2302_hg19.zip
) by right-clicking the archive and selecting 7-Zip > Extract files\u2026 into the exomiser data directory. By default, Exomiser expects this to be \u2018exomiser-cli-13.2.0/data\u2019, but this can be changed in the application.properties.The following shell script should work:
# download the distribution (won't take long)\nwget https://data.monarchinitiative.org/exomiser/latest/exomiser-cli-13.2.0-distribution.zip\n# download the data (this is ~80GB and will take a while). If you only require a single assembly, only download the relevant files.\nwget https://data.monarchinitiative.org/exomiser/latest/2302_hg19.zip\nwget https://data.monarchinitiative.org/exomiser/latest/2302_phenotype.zip\n# unzip the distribution and data files - this will create a directory called 'exomiser-cli-13.2.0' in the current working directory (with examples and application.properties)\nunzip exomiser-cli-13.2.0-distribution.zip\nunzip '2302_*.zip' -d exomiser-cli-13.2.0/data\n
"},{"location":"tutorial/exomiser-tutorial/#configuring-the-applicationproperties","title":"Configuring the application.properties","text":"The application.properties file needs to be updated to point to the correct location of the Exomiser data. For the purpose of this tutorial, this is already sorted, pointing to the mounted directory inside the Docker container exomiser.data-directory=/exomiser-data
.
Also, you want to make sure to edit the file to use the correct data version (currently 2302):
exomiser.hg19.data-version=2302\nexomiser.phenotype.data-version=2302\n
"},{"location":"tutorial/exomiser-tutorial/#tutorials","title":"Tutorials","text":""},{"location":"tutorial/exomiser-tutorial/#monarch-obo-training-tutorial","title":"Monarch OBO Training Tutorial","text":""},{"location":"tutorial/exomiser-tutorial/#running-exomiser","title":"Running Exomiser","text":"For this tutorial, we will focus on running Exomiser on a single-sample (whole-exome) VCF file. Additional instructions for running Exomiser on multi-sample VCF data and large jobs are also provided below.
"},{"location":"tutorial/exomiser-tutorial/#using-phenopackets","title":"Using phenopackets","text":"It is recommended to provide Exomiser with the input sample as a Phenopacket. Exomiser will accept this in either JSON or YAML format. We will use the example pfeiffer-phenopacket.yml
below:
id: manuel\nsubject:\nid: manuel\nsex: MALE\nphenotypicFeatures:\n- type:\nid: HP:0001159\nlabel: Syndactyly\n- type:\nid: HP:0000486\nlabel: Strabismus\n- type:\nid: HP:0000327\nlabel: Hypoplasia of the maxilla\n- type:\nid: HP:0000520\nlabel: Proptosis\n- type:\nid: HP:0000316\nlabel: Hypertelorism\n- type:\nid: HP:0000244\nlabel: Brachyturricephaly\nhtsFiles:\n- uri: exomiser/Pfeiffer.vcf.gz\nhtsFormat: VCF\ngenomeAssembly: hg19\nmetaData:\ncreated: '2019-11-12T13:47:51.948Z'\ncreatedBy: julesj\nresources:\n- id: hp\nname: human phenotype ontology\nurl: http://purl.obolibrary.org/obo/hp.owl\nversion: hp/releases/2019-11-08\nnamespacePrefix: HP\niriPrefix: 'http://purl.obolibrary.org/obo/HP_'\nphenopacketSchemaVersion: 1.0\n
NOTE: This is an example of a v1.0 phenopacket, there is a more recent release of v2.0. Exomiser can run phenopackets built with either v1.0 or v2.0 schema. You can find out more about the v2.0 phenopacket schema and how to build one with Python or Java here. To convert a phenopacket v1.0 to v2.0, you can use phenopacket-tools.
"},{"location":"tutorial/exomiser-tutorial/#analysis-settings","title":"Analysis settings","text":"Below are the default analysis settings from pfeiffer-analysis.yml
that we will use in our tutorial:
---\nanalysis:\n#FULL or PASS_ONLY\nanalysisMode: PASS_ONLY\n# In cases where you do not want any cut-offs applied an empty map should be used e.g. inheritanceModes: {}\n# These are the default settings, with values representing the maximum minor allele frequency in percent (%) permitted for an\n# allele to be considered as a causative candidate under that mode of inheritance.\n# If you just want to analyse a sample under a single inheritance mode, delete/comment-out the others. For AUTOSOMAL_RECESSIVE\n# or X_RECESSIVE ensure *both* relevant HOM_ALT and COMP_HET modes are present.\ninheritanceModes: {\n AUTOSOMAL_DOMINANT: 0.1,\n AUTOSOMAL_RECESSIVE_COMP_HET: 2.0,\n AUTOSOMAL_RECESSIVE_HOM_ALT: 0.1,\n X_DOMINANT: 0.1,\n X_RECESSIVE_COMP_HET: 2.0,\n X_RECESSIVE_HOM_ALT: 0.1,\n MITOCHONDRIAL: 0.2\n}\n#Possible frequencySources:\n#Thousand Genomes project http://www.1000genomes.org/\n# THOUSAND_GENOMES,\n#ESP project http://evs.gs.washington.edu/EVS/\n# ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,\n#ExAC project http://exac.broadinstitute.org/about\n# EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,\n# EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,\n# EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,\n# EXAC_OTHER\n#Possible frequencySources:\n#Thousand Genomes project - http://www.1000genomes.org/ (THOUSAND_GENOMES)\n#TOPMed - https://www.nhlbi.nih.gov/science/precision-medicine-activities (TOPMED)\n#UK10K - http://www.uk10k.org/ (UK10K)\n#ESP project - http://evs.gs.washington.edu/EVS/ (ESP_)\n# ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,\n#ExAC project http://exac.broadinstitute.org/about (EXAC_)\n# EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,\n# EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,\n# EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,\n# EXAC_OTHER\n#gnomAD - http://gnomad.broadinstitute.org/ (GNOMAD_E, GNOMAD_G)\nfrequencySources: [\nTHOUSAND_GENOMES,\nTOPMED,\nUK10K,\n\nESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,\n\nEXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,\nEXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,\nEXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,\nEXAC_OTHER,\n\nGNOMAD_E_AFR,\nGNOMAD_E_AMR,\n# GNOMAD_E_ASJ,\nGNOMAD_E_EAS,\nGNOMAD_E_FIN,\nGNOMAD_E_NFE,\nGNOMAD_E_OTH,\nGNOMAD_E_SAS,\n\nGNOMAD_G_AFR,\nGNOMAD_G_AMR,\n# GNOMAD_G_ASJ,\nGNOMAD_G_EAS,\nGNOMAD_G_FIN,\nGNOMAD_G_NFE,\nGNOMAD_G_OTH,\nGNOMAD_G_SAS\n]\n# Possible pathogenicitySources: (POLYPHEN, MUTATION_TASTER, SIFT), (REVEL, MVP), CADD, REMM\n# REMM is trained on non-coding regulatory regions\n# *WARNING* if you enable CADD or REMM ensure that you have downloaded and installed the CADD/REMM tabix files\n# and updated their location in the application.properties. 
Exomiser will not run without this.\npathogenicitySources: [ REVEL, MVP ]\n#this is the standard exomiser order.\n#all steps are optional\nsteps: [\n#hiPhivePrioritiser: {},\n#priorityScoreFilter: {priorityType: HIPHIVE_PRIORITY, minPriorityScore: 0.500},\n#intervalFilter: {interval: 'chr10:123256200-123256300'},\n# or for multiple intervals:\n#intervalFilter: {intervals: ['chr10:123256200-123256300', 'chr10:123256290-123256350']},\n# or using a BED file - NOTE this should be 0-based, Exomiser otherwise uses 1-based coordinates in line with VCF\n#intervalFilter: {bed: /full/path/to/bed_file.bed},\n#genePanelFilter: {geneSymbols: ['FGFR1','FGFR2']},\nfailedVariantFilter: { },\n#qualityFilter: {minQuality: 50.0},\nvariantEffectFilter: {\n remove: [\nFIVE_PRIME_UTR_EXON_VARIANT,\nFIVE_PRIME_UTR_INTRON_VARIANT,\nTHREE_PRIME_UTR_EXON_VARIANT,\nTHREE_PRIME_UTR_INTRON_VARIANT,\nNON_CODING_TRANSCRIPT_EXON_VARIANT,\nUPSTREAM_GENE_VARIANT,\nINTERGENIC_VARIANT,\nREGULATORY_REGION_VARIANT,\nCODING_TRANSCRIPT_INTRON_VARIANT,\nNON_CODING_TRANSCRIPT_INTRON_VARIANT,\nDOWNSTREAM_GENE_VARIANT\n]\n},\n# removes variants represented in the database\n#knownVariantFilter: {},\nfrequencyFilter: {maxFrequency: 2.0},\npathogenicityFilter: {keepNonPathogenic: true},\n# inheritanceFilter and omimPrioritiser should always run AFTER all other filters have completed\ninheritanceFilter: {},\n# omimPrioritiser isn't mandatory.\nomimPrioritiser: {},\n#priorityScoreFilter: {minPriorityScore: 0.4},\n# Other prioritisers: Only combine omimPrioritiser with one of these.\n# Don't include any if you only want to filter the variants.\nhiPhivePrioritiser: {},\n# or run hiPhive in benchmarking mode:\n#hiPhivePrioritiser: {runParams: 'mouse'},\n#phivePrioritiser: {}\n#phenixPrioritiser: {}\n#exomeWalkerPrioritiser: {seedGeneIds: [11111, 22222, 33333]}\n]\noutputOptions:\noutputContributingVariantsOnly: false\n#numGenes options: 0 = all or specify a limit e.g. 500 for the first 500 results\nnumGenes: 0\n#minExomiserGeneScore: 0.7\n# Path to the desired output directory. Will default to the 'results' subdirectory of the exomiser install directory\noutputDirectory: results\n# Filename for the output files. Will default to {input-vcf-filename}-exomiser\noutputFileName: Pfeiffer-HIPHIVE-exome\n#out-format options: HTML, JSON, TSV_GENE, TSV_VARIANT, VCF (default: HTML)\noutputFormats: [HTML, JSON, TSV_GENE, TSV_VARIANT]\n
"},{"location":"tutorial/exomiser-tutorial/#running-via-docker","title":"Running via Docker","text":"docker run -it -v \"/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data\" \\\n-v \"/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser\" \\\n-v \"/path/to/Exomiser-Tutorial/exomiser-results:/results\" \\\nexomiser/exomiser-cli:13.2.0 \\\n--sample /exomiser/pfeiffer-phenopacket.yml \\\n--analysis /exomiser/pfeiffer-analysis.yml \\\n--spring.config.location=/exomiser/application.properties\n
This command will produce Pfeiffer-HIPHIVE-exome.html
, Pfeiffer-HIPHIVE-exome.json
, Pfeiffer-HIPHIVE-exome.genes.tsv
and Pfeiffer-HIPHIVE-exome.variants.tsv
in your exomiser-results
directory.
Assuming that you are within the exomiser-cli-13.2.0
distribution folder:
java -jar exomiser-cli-13.2.0.jar --sample examples/pfeiffer-phenopacket.yml \\\n--analysis examples/exome-analysis.yml --output examples/output-options.yml
"},{"location":"tutorial/exomiser-tutorial/#analysing-multi-sample-vcf-files","title":"Analysing multi-sample VCF files","text":"When analysing a multi-sample VCF file, you must detail the pedigree information in a phenopacket describing a Family object:
e.g. Exomiser-Tutorial/exomiser-config/pfeiffer-family.yml
id: ISDBM322017-family\nproband:\nsubject:\nid: ISDBM322017\nsex: FEMALE\nphenotypicFeatures:\n- type:\nid: HP:0001159\nlabel: Syndactyly\n- type:\nid: HP:0000486\nlabel: Strabismus\n- type:\nid: HP:0000327\nlabel: Hypoplasia of the maxilla\n- type:\nid: HP:0000520\nlabel: Proptosis\n- type:\nid: HP:0000316\nlabel: Hypertelorism\n- type:\nid: HP:0000244\nlabel: Brachyturricephaly\npedigree:\npersons:\n- individualId: ISDBM322017\npaternalId: ISDBM322016\nmaternalId: ISDBM322018\nsex: FEMALE\naffectedStatus: AFFECTED\n- individualId: ISDBM322015\npaternalId: ISDBM322016\nmaternalId: ISDBM322018\nsex: MALE\naffectedStatus: UNAFFECTED\n- individualId: ISDBM322016\nsex: MALE\naffectedStatus: UNAFFECTED\n- individualId: ISDBM322018\nsex: FEMALE\naffectedStatus: UNAFFECTED\nhtsFiles:\n- uri: exomiser/Pfeiffer-quartet.vcf.gz\nhtsFormat: VCF\ngenomeAssembly: GRCh37\nmetaData:\ncreated: '2019-11-12T13:47:51.948Z'\ncreatedBy: julesj\nresources:\n- id: hp\nname: human phenotype ontology\nurl: http://purl.obolibrary.org/obo/hp.owl\nversion: hp/releases/2019-11-08\nnamespacePrefix: HP\niriPrefix: 'http://purl.obolibrary.org/obo/HP_'\nphenopacketSchemaVersion: 1.0\n
Running via Docker:
docker run -it -v '/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data' \\\n-v '/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser' \\\n-v '/path/to/Exomiser-Tutorial/exomiser-results:/results' \\\nexomiser/exomiser-cli:13.2.0 \\\n--sample /exomiser/pfeiffer-family.yml \\\n--analysis /exomiser/pfeiffer-analysis.yml \\\n--spring.config.location=/exomiser/application.properties\n
Running locally:
Assuming that you are within the exomiser-cli-13.2.0 distribution folder:
java -jar exomiser-cli-13.2.0.jar --sample examples/pfeiffer-family.yml --analysis examples/exome-analysis.yml --output examples/output-options.yml\n
"},{"location":"tutorial/exomiser-tutorial/#running-large-jobs-batch","title":"Running large jobs (batch)","text":"The above commands can be added to a batch file for example in the file Exomiser-Tutorial/exomiser-config/test-analysis-batch-commands.txt
. When using Docker, we recommend creating a new directory for the batch files and mounting that directory to the Docker container.
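Each line of a batch file simply holds the command-line arguments for one run. As an illustration only (the file shipped with the tutorial may differ), a batch file used with the Docker command below might contain lines such as:

```
--sample /exomiser/pfeiffer-phenopacket.yml --analysis /exomiser/pfeiffer-analysis.yml
--sample /exomiser/pfeiffer-family.yml --analysis /exomiser/pfeiffer-analysis.yml
```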
Running via Docker:
docker run -it -v '/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data' \\\n-v '/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser' \\\n-v '/path/to/Exomiser-Tutorial/exomiser-results:/results' \\\n-v '/path/to/Exomiser-Tutorial/exomiser-batch-files:/batch-files' \\\nexomiser/exomiser-cli:13.2.0 \\\n--batch /batch-files/test-analysis-batch-commands.txt \\\n--spring.config.location=/exomiser/application.properties\n
Running locally:
Assuming that you are within the exomiser-cli-13.2.0 distribution folder:
java -jar exomiser-cli-13.2.0.jar --batch examples/test-analysis-batch-commands.txt\n
The advantage of this is that a single command can analyse many samples in far less time than starting a new JVM for each one, because there is no start-up penalty after the initial start and the Java JIT compiler can take advantage of the longer-running process to optimise the runtime code. For maximum throughput on a cluster, consider splitting your batch jobs over multiple nodes.
"},{"location":"tutorial/exomiser-tutorial/#results","title":"Results","text":"Depending on the output options provided, Exomiser will write out at least an HTML and JSON output file in the results
subdirectory of the Exomiser installation (by default) or a user-defined results directory as indicated in the output options.
As a general rule, all output files contain a ranked list of genes and variants, with the top-ranked gene/variant displayed first. The exception is the VCF output (if requested in the output options; not requested in this tutorial), which, since version 13.1.0, is sorted according to VCF convention and tabix-indexed.
In our tutorial, we requested the HTML, JSON, TSV_VARIANT and TSV_GENE output formats, which are briefly outlined below.
"},{"location":"tutorial/exomiser-tutorial/#html","title":"HTML","text":"A few relevant screenshots from Pfeiffer-HIPHIVE-exome.html
:
The JSON file is the most accurate representation of the results, as it is the format referenced internally by Exomiser. As such, we don\u2019t provide a schema for it, but it has been pretty stable, and breaking changes will only occur with major version changes to the software. Minor additions are to be expected for minor releases, as per the SemVer specification.
We recommend using Python or JQ to extract data from this file. To give you an idea of how you can extract some data with Python, we have provided examples below of how to iterate over the results. However, there is a lot more information in the JSON results file; the examples only provide a snippet of what you can do.
# import json library\nimport json\n\n# to load in the exomiser json result\nwith open(\"path/to/Exomiser-Tutorial/Pfeiffer-HIPHIVE-exome.json\") as exomiser_json_result:\n exomiser_result = json.load(exomiser_json_result)\nexomiser_json_result.close()\n\n# to retrieve all predicted genes and corresponding identifier (ENSEMBL)\ngene_results = []\nfor result in exomiser_result:\n gene_results.append({result[\"geneSymbol\"]: result[\"geneIdentifier\"][\"geneId\"]})\n\n# to retrieve all predicted variants\nvariant_results = []\nfor result in exomiser_result:\n for moi in result[\"geneScores\"]: # iterating over all modes of inheritance\n if \"contributingVariants\" in moi: # checking if there is evidence of contributing variants\n for cv in moi[\"contributingVariants\"]: # iterating over all contributing variants\n variant_results.append({\"chromosome\": cv[\"contigName\"],\n \"start_pos\": cv[\"start\"],\n \"end_pos\": cv[\"end\"],\n \"ref_allele\": cv[\"ref\"],\n \"alt_allele\": cv[\"alt\"]})\n
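Building on the example above, here is a minimal sketch that prints a quick ranked summary of the top genes and their contributing variants; it assumes the same Pfeiffer-HIPHIVE-exome.json file and uses only the fields shown in the example above.

```python
import json

# load the ranked Exomiser results (same file as in the example above)
with open("path/to/Exomiser-Tutorial/Pfeiffer-HIPHIVE-exome.json") as fh:
    exomiser_result = json.load(fh)

# the top-level list is already ranked, so its order gives the gene rank
for rank, result in enumerate(exomiser_result[:5], start=1):
    contributing = []
    for moi in result["geneScores"]:  # one entry per compatible mode of inheritance
        for cv in moi.get("contributingVariants", []):  # present only when variants contributed
            contributing.append(f'{cv["contigName"]}-{cv["start"]}-{cv["ref"]}-{cv["alt"]}')
    print(rank, result["geneSymbol"], sorted(set(contributing)))
```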
"},{"location":"tutorial/exomiser-tutorial/#tsv-variants","title":"TSV VARIANTS","text":"In the Pfeiffer-HIPHIVE-exome.variants.tsv
file, it is possible for a variant to appear multiple times, depending on the MOI it is compatible with. For example, in the excerpt of the file below, MUC6 has two variants ranked 7th under the AD model and two ranked 8th under an AR (compound heterozygous) model. In the AD case, the CONTRIBUTING_VARIANT column indicates whether the variant was (1) or was not (0) used for calculating the EXOMISER_GENE_COMBINED_SCORE and EXOMISER_GENE_VARIANT_SCORE.
#RANK ID GENE_SYMBOL ENTREZ_GENE_ID MOI P-VALUE EXOMISER_GENE_COMBINED_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE EXOMISER_VARIANT_SCORE CONTRIBUTING_VARIANT WHITELIST_VARIANT VCF_ID RS_ID CONTIG START END REF ALT CHANGE_LENGTH QUAL FILTER GENOTYPE FUNCTIONAL_CLASS HGVS EXOMISER_ACMG_CLASSIFICATION EXOMISER_ACMG_EVIDENCE EXOMISER_ACMG_DISEASE_ID EXOMISER_ACMG_DISEASE_NAME CLINVAR_ALLELE_ID CLINVAR_PRIMARY_INTERPRETATION CLINVAR_STAR_RATING GENE_CONSTRAINT_LOEUF GENE_CONSTRAINT_LOEUF_LOWER GENE_CONSTRAINT_LOEUF_UPPER MAX_FREQ_SOURCE MAX_FREQ ALL_FREQ MAX_PATH_SOURCE MAX_PATH ALL_PATH\n1 10-123256215-T-G_AD FGFR2 2263 AD 0.0000 0.9957 0.9187 1.0000 1.0000 1 1 rs121918506 10 123256215 123256215 T G 0 900.0000 PASS 0/1 missense_variant FGFR2:ENST00000346997.2:c.1688A>C:p.(Glu563Ala) PATHOGENIC PM2,PP3_Strong,PP4,PP5_Strong ORPHA:87 Apert syndrome 28333 PATHOGENIC_OR_LIKELY_PATHOGENIC 2 0.13692 0.074 0.27 REVEL 0.965 REVEL=0.965,MVP=0.9517972\n2 5-71755984-C-G_AD ZNF366 167465 AD 0.0018 0.9237 0.8195 0.7910 0.7910 1 0 rs375204168 5 71755984 71755984 C G 0 380.8900 PASS 0/1 splice_region_variant ZNF366:ENST00000318442.5:c.1332+8G>C:p.? UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.27437 0.155 0.515 EXAC_AMERICAN 0.07975895 THOUSAND_GENOMES=0.01997,TOPMED=0.01096,ESP_EUROPEAN_AMERICAN=0.0116,ESP_ALL=0.0077,EXAC_AMERICAN=0.07975895,EXAC_NON_FINNISH_EUROPEAN=0.010914307,GNOMAD_E_AMR=0.07153929,GNOMAD_E_NFE=0.010890082,GNOMAD_E_OTH=0.018328445\n3 16-2150254-G-A_AD PKD1 5310 AD 0.0050 0.8272 0.6597 0.8707 0.8707 1 0 rs147967021 16 2150254 2150254 G A 0 406.0800 PASS 0/1 missense_variant PKD1:ENST00000262304.4:c.9625C>T:p.(Arg3209Cys) UNCERTAIN_SIGNIFICANCE 1319391 UNCERTAIN_SIGNIFICANCE 1 0.12051 0.082 0.179 EXAC_AMERICAN 0.06979585 THOUSAND_GENOMES=0.01997,TOPMED=0.007934,EXAC_AMERICAN=0.06979585,EXAC_NON_FINNISH_EUROPEAN=0.0015655332,EXAC_SOUTH_ASIAN=0.012149192,GNOMAD_E_AFR=0.006708708,GNOMAD_E_AMR=0.05070389,GNOMAD_E_NFE=0.002718672,GNOMAD_E_SAS=0.013009822,GNOMAD_G_AFR=0.011462632 MVP 0.8792868 REVEL=0.346,MVP=0.8792868\n4 3-56653839-CTG-C_AD CCDC66 285331 AD 0.0051 0.8262 0.5463 0.9984 0.9984 1 0 rs751329549 3 56653839 56653841 CTG C -2 1872.9400 PASS 0/1 frameshift_truncation CCDC66:ENST00000326595.7:c.2572_2573del:p.(Val858Glnfs*6) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.9703 0.78 1.215 GNOMAD_E_AMR 0.011914691 TOPMED=7.556E-4,EXAC_EAST_ASIAN=0.01155535,EXAC_NON_FINNISH_EUROPEAN=0.0015023135,GNOMAD_E_AMR=0.011914691,GNOMAD_E_EAS=0.0057977736,GNOMAD_E_NFE=8.988441E-4\n5 13-110855918-C-G_AD COL4A1 1282 AD 0.0075 0.7762 0.5288 0.9838 0.9838 1 0 rs150182714 13 110855918 110855918 C G 0 1363.8700 PASS 0/1 missense_variant COL4A1:ENST00000375820.4:c.994G>C:p.(Gly332Arg) UNCERTAIN_SIGNIFICANCE PP3_Moderate OMIM:175780 Brain small vessel disease with or without ocular anomalies 333515 CONFLICTING_PATHOGENICITY_INTERPRETATIONS 1 0.065014 0.035 0.128 ESP_EUROPEAN_AMERICAN 0.0233 THOUSAND_GENOMES=0.01997,TOPMED=0.0068,ESP_EUROPEAN_AMERICAN=0.0233,ESP_ALL=0.0154,EXAC_AFRICAN_INC_AFRICAN_AMERICAN=0.009609841,EXAC_NON_FINNISH_EUROPEAN=0.007491759,GNOMAD_E_AFR=0.013068479,GNOMAD_E_NFE=0.0071611437,GNOMAD_G_NFE=0.013324451 MVP 0.9869305 REVEL=0.886,MVP=0.9869305\n6 6-132203615-G-A_AD ENPP1 5167 AD 0.0079 0.7695 0.5112 0.9996 0.9996 1 0 rs770775549 6 132203615 132203615 G A 0 922.9800 PASS 0/1 splice_donor_variant ENPP1:ENST00000360971.2:c.2230+1G>A:p.? 
UNCERTAIN_SIGNIFICANCE PVS1_Strong NOT_PROVIDED 0 0.41042 0.292 0.586 GNOMAD_E_SAS 0.0032486517 TOPMED=7.556E-4,EXAC_NON_FINNISH_EUROPEAN=0.0014985314,GNOMAD_E_NFE=0.0017907989,GNOMAD_E_SAS=0.0032486517\n7 11-1018088-TG-T_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.9990 1 0 rs765231061 11 1018088 1018089 TG T -1 441.8100 PASS 0/1 frameshift_variant MUC6:ENST00000421673.2:c.4712del:p.(Pro1571Hisfs*21) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.0070363074 GNOMAD_E_AMR=0.0030803352,GNOMAD_G_NFE=0.0070363074\n7 11-1018093-G-GT_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.9989 0 0 rs376177791 11 1018093 1018093 G GT 1 592.4500 PASS 0/1 frameshift_elongation MUC6:ENST00000421673.2:c.4707dup:p.(Pro1570Thrfs*136) NOT_AVAILABLE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.007835763 GNOMAD_G_NFE=0.007835763\n8 11-1018088-TG-T_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.9990 1 0 rs765231061 11 1018088 1018089 TG T -1 441.8100 PASS 0/1 frameshift_variant MUC6:ENST00000421673.2:c.4712del:p.(Pro1571Hisfs*21) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.0070363074 GNOMAD_E_AMR=0.0030803352,GNOMAD_G_NFE=0.0070363074\n8 11-1018093-G-GT_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.9989 1 0 rs376177791 11 1018093 1018093 G GT 1 592.4500 PASS 0/1 frameshift_elongation MUC6:ENST00000421673.2:c.4707dup:p.(Pro1570Thrfs*136) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.007835763 GNOMAD_G_NFE=0.007835763\n9 7-44610376-G-A_AD DDX56 54606 AD 0.0091 0.7545 0.5036 0.9992 0.9992 1 0 rs774566321 7 44610376 44610376 G A 0 586.6600 PASS 0/1 stop_gained DDX56:ENST00000258772.5:c.991C>T:p.(Arg331*) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.56071 0.379 0.852 EXAC_SOUTH_ASIAN 0.006114712 EXAC_SOUTH_ASIAN=0.006114712,GNOMAD_E_SAS=0.0032509754\n10 14-96730313-G-A_AD BDKRB1 623 AD 0.0093 0.7525 0.5018 1.0000 1.0000 1 0 14 96730313 96730313 G A 0 378.2200 PASS 0/1 stop_gained BDKRB1:ENST00000216629.6:c.294G>A:p.(Trp98*) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.52212 0.272 1.097 \n
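If you prefer working with the TSV output, the same kind of filtering can be done with pandas; a small sketch follows (pandas is not bundled with Exomiser, and the column names are taken from the excerpt above) that keeps only the variants flagged as contributing to a gene score.

```python
import pandas as pd

# read the ranked variants table; the header row starts with '#RANK'
variants = pd.read_csv("Pfeiffer-HIPHIVE-exome.variants.tsv", sep="\t")

# keep only the rows that contributed to the combined gene score
contributing = variants[variants["CONTRIBUTING_VARIANT"] == 1]

print(contributing[["#RANK", "GENE_SYMBOL", "MOI", "EXOMISER_GENE_COMBINED_SCORE"]].head(10))
```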
"},{"location":"tutorial/exomiser-tutorial/#tsv-genes","title":"TSV GENES","text":"In the Pfeiffer-HIPHIVE-exome.genes.tsv
file, all the various phenotypic scores and HPO matches from the HUMAN, MOUSE, FISH and PPI comparisons are reported for each gene. It is possible for a gene to appear multiple times, depending on the MOI it is compatible with, given the filtered variants. For example, in the excerpt below, MUC6 is ranked 7th under the AD model and 8th under an AR model.
#RANK ID GENE_SYMBOL ENTREZ_GENE_ID MOI P-VALUE EXOMISER_GENE_COMBINED_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE HUMAN_PHENO_SCORE MOUSE_PHENO_SCORE FISH_PHENO_SCORE WALKER_SCORE PHIVE_ALL_SPECIES_SCORE OMIM_SCORE MATCHES_CANDIDATE_GENE HUMAN_PHENO_EVIDENCE MOUSE_PHENO_EVIDENCE FISH_PHENO_EVIDENCE HUMAN_PPI_EVIDENCE MOUSE_PPI_EVIDENCE FISH_PPI_EVIDENCE\n1 FGFR2_AD FGFR2 2263 AD 0.0000 0.9957 0.9187 1.0000 0.8671 0.9187 0.0000 0.5057 0.9187 1.0000 0 Apert syndrome (ORPHA:87): Syndactyly (HP:0001159)-Toe syndactyly (HP:0001770), Strabismus (HP:0000486)-Strabismus (HP:0000486), Hypoplasia of the maxilla (HP:0000327)-Hypoplasia of the maxilla (HP:0000327), Proptosis (HP:0000520)-Proptosis (HP:0000520), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachyturricephaly (HP:0000244), Strabismus (HP:0000486)-ocular hypertelorism (MP:0001300), Hypoplasia of the maxilla (HP:0000327)-short maxilla (MP:0000097), Proptosis (HP:0000520)-exophthalmos (MP:0002750), Hypertelorism (HP:0000316)-ocular hypertelorism (MP:0001300), Brachyturricephaly (HP:0000244)-abnormal frontal bone morphology (MP:0000107), Proximity to FGF18 Syndactyly (HP:0001159)-abnormal metatarsal bone morphology (MP:0003072), Strabismus (HP:0000486)-abnormal neurocranium morphology (MP:0000074), Hypoplasia of the maxilla (HP:0000327)-maxilla hypoplasia (MP:0000457), Proptosis (HP:0000520)-abnormal neurocranium morphology (MP:0000074), Hypertelorism (HP:0000316)-abnormal neurocranium morphology (MP:0000074), Brachyturricephaly (HP:0000244)-abnormal neurocranium morphology (MP:0000074),\n2 ZNF366_AD ZNF366 167465 AD 0.0018 0.9237 0.8195 0.7910 0.0000 0.8195 0.0000 0.5015 0.8195 1.0000 0 Syndactyly (HP:0001159)-syndactyly (MP:0000564), Strabismus (HP:0000486)-microphthalmia (MP:0001297), Hypoplasia of the maxilla (HP:0000327)-micrognathia (MP:0002639), Proptosis (HP:0000520)-microphthalmia (MP:0001297), Hypertelorism (HP:0000316)-microphthalmia (MP:0001297), Brachyturricephaly (HP:0000244)-microphthalmia (MP:0001297), Proximity to CTBP1 associated with Wolf-Hirschhorn syndrome (ORPHA:280): Syndactyly (HP:0001159)-Arachnodactyly (HP:0001166), Strabismus (HP:0000486)-Strabismus (HP:0000486), Hypoplasia of the maxilla (HP:0000327)-Micrognathia (HP:0000347), Proptosis (HP:0000520)-Proptosis (HP:0000520), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Calvarial skull defect (HP:0001362),\n3 PKD1_AD PKD1 5310 AD 0.0050 0.8272 0.6597 0.8707 0.0000 0.6597 0.2697 0.5069 0.6597 1.0000 0 Strabismus (HP:0000486)-micrognathia (MP:0002639), Hypoplasia of the maxilla (HP:0000327)-micrognathia (MP:0002639), Proptosis (HP:0000520)-micrognathia (MP:0002639), Hypertelorism (HP:0000316)-micrognathia (MP:0002639), Brachyturricephaly (HP:0000244)-micrognathia (MP:0002639), Hypoplasia of the maxilla (HP:0000327)-mandibular arch skeleton malformed, abnormal (ZP:0001708), Proximity to IFT88 associated with Retinitis pigmentosa (ORPHA:791): Strabismus (HP:0000486)-Ophthalmoplegia (HP:0000602), Hypoplasia of the maxilla (HP:0000327)-Wide nasal bridge (HP:0000431), Proximity to IFT88 Syndactyly (HP:0001159)-polydactyly (MP:0000562), Strabismus (HP:0000486)-supernumerary molars (MP:0010773), Hypoplasia of the maxilla (HP:0000327)-supernumerary molars (MP:0010773), Proptosis (HP:0000520)-supernumerary molars (MP:0010773), Hypertelorism (HP:0000316)-supernumerary molars (MP:0010773), Brachyturricephaly (HP:0000244)-abnormal coronal suture morphology (MP:0003840),\n4 
CCDC66_AD CCDC66 285331 AD 0.0051 0.8262 0.5463 0.9984 0.0000 0.5463 0.0000 0.0000 0.5463 1.0000 0 Strabismus (HP:0000486)-abnormal cone electrophysiology (MP:0004022), Hypoplasia of the maxilla (HP:0000327)-abnormal rod electrophysiology (MP:0004021), Proptosis (HP:0000520)-abnormal rod electrophysiology (MP:0004021), Hypertelorism (HP:0000316)-abnormal rod electrophysiology (MP:0004021), Brachyturricephaly (HP:0000244)-abnormal retina photoreceptor layer morphology (MP:0003728),\n5 COL4A1_AD COL4A1 1282 AD 0.0075 0.7762 0.5288 0.9838 0.3882 0.5288 0.0000 0.5047 0.5288 1.0000 0 Brain small vessel disease with or without ocular anomalies (OMIM:175780): Strabismus (HP:0000486)-Exotropia (HP:0000577), Strabismus (HP:0000486)-buphthalmos (MP:0009274), Hypoplasia of the maxilla (HP:0000327)-abnormal cornea morphology (MP:0001312), Proptosis (HP:0000520)-abnormal cornea morphology (MP:0001312), Hypertelorism (HP:0000316)-abnormal cornea morphology (MP:0001312), Brachyturricephaly (HP:0000244)-abnormal retina morphology (MP:0001325), Proximity to COL7A1 associated with Localized dystrophic epidermolysis bullosa, pretibial form (ORPHA:79410): Syndactyly (HP:0001159)-Nail dystrophy (HP:0008404), Hypoplasia of the maxilla (HP:0000327)-Carious teeth (HP:0000670), Proximity to COL7A1 Syndactyly (HP:0001159)-abnormal digit morphology (MP:0002110), Strabismus (HP:0000486)-abnormal tongue morphology (MP:0000762), Hypoplasia of the maxilla (HP:0000327)-abnormal tongue morphology (MP:0000762), Proptosis (HP:0000520)-abnormal tongue morphology (MP:0000762), Hypertelorism (HP:0000316)-abnormal tongue morphology (MP:0000762),\n6 ENPP1_AD ENPP1 5167 AD 0.0079 0.7695 0.5112 0.9996 0.3738 0.5112 0.0000 0.5044 0.5112 1.0000 0 Autosomal recessive hypophosphatemic rickets (ORPHA:289176): Hypoplasia of the maxilla (HP:0000327)-Tooth abscess (HP:0030757), Brachyturricephaly (HP:0000244)-Craniosynostosis (HP:0001363), Syndactyly (HP:0001159)-abnormal elbow joint morphology (MP:0013945), Strabismus (HP:0000486)-abnormal retina morphology (MP:0001325), Hypoplasia of the maxilla (HP:0000327)-abnormal snout skin morphology (MP:0030533), Proptosis (HP:0000520)-abnormal retina morphology (MP:0001325), Hypertelorism (HP:0000316)-abnormal retina morphology (MP:0001325), Brachyturricephaly (HP:0000244)-abnormal retina morphology (MP:0001325), Proximity to DMP1 associated with Autosomal recessive hypophosphatemic rickets (ORPHA:289176): Hypoplasia of the maxilla (HP:0000327)-Tooth abscess (HP:0030757), Brachyturricephaly (HP:0000244)-Craniosynostosis (HP:0001363), Proximity to DMP1 Syndactyly (HP:0001159)-abnormal long bone hypertrophic chondrocyte zone (MP:0000165), Strabismus (HP:0000486)-abnormal dental pulp cavity morphology (MP:0002819), Hypoplasia of the maxilla (HP:0000327)-abnormal dental pulp cavity morphology (MP:0002819), Proptosis (HP:0000520)-abnormal dental pulp cavity morphology (MP:0002819), Hypertelorism (HP:0000316)-abnormal dental pulp cavity morphology (MP:0002819), Brachyturricephaly (HP:0000244)-abnormal dental pulp cavity morphology (MP:0002819),\n7 MUC6_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.0000 0.0000 0.0000 0.5046 0.5046 1.0000 0 Proximity to GALNT2 associated with Congenital disorder of glycosylation, type IIt (OMIM:618885): Syndactyly (HP:0001159)-Sandal gap (HP:0001852), Strabismus (HP:0000486)-Alternating exotropia (HP:0031717), Hypoplasia of the maxilla (HP:0000327)-Tented upper lip vermilion (HP:0010804), Proptosis (HP:0000520)-Hypertelorism (HP:0000316), Hypertelorism 
(HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachycephaly (HP:0000248),\n8 MUC6_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.0000 0.0000 0.0000 0.5046 0.5046 1.0000 0 Proximity to GALNT2 associated with Congenital disorder of glycosylation, type IIt (OMIM:618885): Syndactyly (HP:0001159)-Sandal gap (HP:0001852), Strabismus (HP:0000486)-Alternating exotropia (HP:0031717), Hypoplasia of the maxilla (HP:0000327)-Tented upper lip vermilion (HP:0010804), Proptosis (HP:0000520)-Hypertelorism (HP:0000316), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachycephaly (HP:0000248),\n9 DDX56_AD DDX56 54606 AD 0.0091 0.7545 0.5036 0.9992 0.0000 0.0000 0.3788 0.5036 0.5036 1.0000 0 Brachyturricephaly (HP:0000244)-head decreased width, abnormal (ZP:0000407), Proximity to PAK1IP1 Strabismus (HP:0000486)-abnormal maxilla morphology (MP:0000455), Hypoplasia of the maxilla (HP:0000327)-abnormal maxilla morphology (MP:0000455), Proptosis (HP:0000520)-abnormal maxilla morphology (MP:0000455), Hypertelorism (HP:0000316)-abnormal maxilla morphology (MP:0000455), Brachyturricephaly (HP:0000244)-decreased forebrain size (MP:0012138),\n10 BDKRB1_AD BDKRB1 623 AD 0.0093 0.7525 0.5018 1.0000 0.0000 0.0000 0.0000 0.5018 0.5018 1.0000 0 Proximity to OPN4 Strabismus (HP:0000486)-abnormal visual pursuit (MP:0006156), Hypoplasia of the maxilla (HP:0000327)-abnormal visual pursuit (MP:0006156), Proptosis (HP:0000520)-abnormal visual pursuit (MP:0006156), Hypertelorism (HP:0000316)-abnormal visual pursuit (MP:0006156), Brachyturricephaly (HP:0000244)-abnormal retina ganglion cell morphology (MP:0008056),\n
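The genes file can be explored in the same way; a short sketch (same assumptions as above) that lists the modes of inheritance reported for each gene, illustrating why a gene such as MUC6 can appear twice:

```python
import pandas as pd

genes = pd.read_csv("Pfeiffer-HIPHIVE-exome.genes.tsv", sep="\t")

# a gene appears once per compatible mode of inheritance (e.g. MUC6 under both AD and AR)
moi_per_gene = genes.groupby("GENE_SYMBOL")["MOI"].apply(list)
print(moi_per_gene.head(10))
```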
"},{"location":"tutorial/exomiser-tutorial/#docker-for-mac","title":"Docker for Mac","text":"Follow this link and download the Docker.dmg for your operating system.
The Docker.dmg will be found in your /Downloads directory.
After double-clicking on the Docker.dmg a new window will come up:
Drag and drop the Docker app into your /Applications folder. Double-click on the Docker symbol. Docker Desktop will start in the background after you allow it to be opened.
Additionally, this window will come up, asking you to accept the Docker subscription service agreement.
After running the installation, restart your terminal and check the Docker installation again with:
docker --version\n
If the output shows a version number and no error, you are ready to go. If you get an error and have not yet restarted your terminal, do that now and the error should be fixed.
In case you get an error message like this, please ensure you have downloaded the correct docker.dmg.
Now, whenever you want to pull images, make sure that Docker is running in the background. Otherwise, you may get an error stating that it is not able to connect to the Docker daemon.
"},{"location":"tutorial/exomiser-tutorial/#docker-for-windows","title":"Docker for Windows","text":"Follow this link and download the Docker installer for Windows.
Inside your /Downloads directory, search for the Installer and double-click.
To run on Windows, Docker requires a virtual machine. Docker recommends using WSL2. More information on this can be found here.
Click \u201cOk\u201d and wait a bit.
Now you will have to restart your computer.
After restarting, Docker should start automatically and the Service Agreement will come up, which you will have to accept in order to use Docker:
If the Docker desktop app is showing this warning upon start, do not click \u201cRestart\u201d yet. Instead, follow the link and install the kernel update.
The link should point you to an address with a separate download link.
Start and finish the installation for WSL.
If you still have the Docker Desktop dialog window open in the background, click on Restart. Otherwise, just restart your computer as you normally do.
If Docker Desktop did not start on its own, simply open it from the shortcut on your Desktop. You can do the initial orientation by clicking \"Start\".
After this, your Docker Desktop screen should look like this:
Now, whenever you want to pull images make sure that Docker is running in the background.
"},{"location":"tutorial/fhkb/","title":"Manchester Family History Advanced OWL","text":"This is a fork of the infamous Manchester Family History Advanced OWL Tutorial version 1.1, located at
http://owl.cs.manchester.ac.uk/publications/talks-and-tutorials/fhkbtutorial/
The translation to markdown is not without issues, but we are making a start at making the tutorial a bit more accessible. This reproduction is done with the kind permission of Robert Stevens.
"},{"location":"tutorial/fhkb/#original-credits-version-11-see-pdf","title":"Original credits (Version 1.1, see pdf):","text":"Authors:
Bio-Health Informatics Group\nSchool of Computer Science\nUniversity of Manchester\nOxford Road\nManchester\nUnited Kingdom\nM13 9PL\nrobert.stevens@manchester.ac.uk\n
"},{"location":"tutorial/fhkb/#contributors","title":"Contributors","text":"The University of Manchester\nCopyright\u00a9 The University of Manchester\nNovember 25, 2015\n
"},{"location":"tutorial/fhkb/#acknowledgements","title":"Acknowledgements","text":"This tutorial was realised as part of the Semantic Web Authoring Tool (SWAT) project (see http://www.swatproject.org), which is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/G032459/1, to the University of Manchester, the University of Sussex and the Open University.
"},{"location":"tutorial/fhkb/#dedication","title":"Dedication","text":"The Stevens family\u2014all my ancestors were necessary for this to happen. Also, for my Mum who gathered all the information.
"},{"location":"tutorial/fhkb/#contents","title":"Contents","text":"Preamble
1. Introduction
2. Adding some Individuals to the FHKB
3. Ancestors and Descendants
4. Modelling the Person Class
5. Siblings in the FHKB
6. Individuals in Class Expressions
7. Data Properties in the FHKB
8. Cousins in the FHKB
9. Marriage in the FHKB
10. Extending the TBox
11. Final remarks
A FHKB Family Data
"},{"location":"tutorial/fhkb/#preamble","title":"Preamble","text":""},{"location":"tutorial/fhkb/#01-licencing","title":"0.1 Licencing","text":"The \u2018Manchester Family History Advanced OWL Tutorial\u2019 by Robert Stevens, Margaret Stevens, Nicolas Matentzoglu, Simon Jupp is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
"},{"location":"tutorial/fhkb/#02-reporting-errors","title":"0.2 Reporting Errors","text":"This manual will almost certainly contain errors, defects and infelicities. Do report them to robert.stevens@manchester.ac.uk supplying chapter, section and some actual context in the form of words will help in fixing any of these issues.
"},{"location":"tutorial/fhkb/#03-acknowledgements","title":"0.3 Acknowledgements","text":"As well as the author list, many people have contributed to this work. Any contribution, such as reporting bugs etc., is rewarded by an acknowledgement of contribution (in alphabetical order) when the authors get around to adding them:
This tutorial introduces the tutee to many of the more advanced features of the Web Ontology Language (OWL). The topic of family history is used to take the tutee through various modelling issues and, in doing so, using many features of OWL 2 to build a Family History Knowledge Base (FHKB). The exercises are designed to maximise inference about family history through the use of an automated reasoner on an OWL knowledge base (KB) containing many members of the Stevens family.
The aim, therefore, is to enable people to learn advanced features of OWL 2 in a setting that involves both classes and individuals, while attempting to maximise the use of inference within the FHKB.
"},{"location":"tutorial/fhkb/#11-learning-outcomes","title":"1.1 Learning Outcomes","text":"By doing this tutorial, a tutee should be able to:
Building an FHKB enables us to meet our learning outcomes through a topic that is accessible to virtually everyone. Family history or genealogy is a good topic for a general tutorial on OWL 2 as it enables us to touch many features of the language and, importantly, it is a field that everyone knows. All people have a family and therefore a family history \u2013 even if they do not know their particular family history. A small caveat was put on the topic being accessible to everyone as some cultures differ, for instance, in the description of cousins and labels given to different siblings. Nevertheless, family history remains a topic that everyone can talk about.
Family history is a good topic for an OWL ontology as it obviously involves both individuals \u2013 the people involved \u2013 and classes of individuals \u2013 people, men and women, cousins, etc. Also, it is an area rich in inference; from only knowing parentage and sex of an individual, it is possible to work out all family relationships \u2013 for example, sharing parents implies a sibling relationship; one\u2019s parent\u2019s brothers are one\u2019s uncles; one\u2019s parent\u2019s parents are one\u2019s grandparents. So, we should be able to construct an ontology that allows us to both express family history, but also to infer family relationships between people from knowing relatively little about them.
As we will learn through the tutorial, OWL 2 cannot actually do all that is needed to create a FHKB. This is unfortunate, but we use it to our advantage to illustrate some of the limitations of OWL 2. We know that rule based systems can do family history with ease, but that is not the point here; we are not advocating OWL DL as an appropriate mechanism for doing family history, but we do use it as a good educational example.
We make the following assumptions about what people know:
We make some simplifying assumptions in this tutorial:
At the end of the tutorial, you should be able to produce a property hierarchy and a TBox or class hierarchy such as shown in Figure 1.1; all supported by use of the automated reasoner and a lot of OWL 2\u2019s features.
Figure 1.1: A part of the class and property hierarchy of the final FHKB.
"},{"location":"tutorial/fhkb/#13-how-to-use-this-tutorial","title":"1.3 How to use this Tutorial","text":"Here are some tips on using this manual to the best advantage:
The following resources are available at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial:
1 The image comes fromhttp://ancienthomeofdragon.homestead.com/May 2012.
"},{"location":"tutorial/fhkb/#chapter-2","title":"Chapter 2","text":""},{"location":"tutorial/fhkb/#adding-some-individuals-to-the-fhkb","title":"Adding some Individuals to the FHKB","text":"In this chapter we will start by creating a fresh OWL ontology and adding some individuals that will be surrogates for people in the FHKB. In particular you will:
The \u2018world\u20192 or field of interest we model in an ontology is made up of objects or individuals. Such objects include, but are not limited to:
2 we use \u2018world\u2019 as a synonym of \u2018field of interest\u2019 or \u2018domain\u2019. \u2018World\u2019 does not restrict us to modelling the physical world outside our consciousness.
We observe these objects, either outside lying around in the world or in our heads. OWL is all about modelling such individuals. Whenever we make a statement in OWL, when we write down an axiom, we are making statements about individuals. When thinking about the axioms in an ontology it is best to think about the individuals involved, even if OWL individuals do not actually appear in the ontology. All through this tutorial we will always be returning to the individuals being described in order to help us understand what we are doing and to help us make decisions about how to do it.
"},{"location":"tutorial/fhkb/#22-asserting-parentage-facts","title":"2.2 Asserting Parentage Facts","text":"Biologically, everyone has parents; a mother and a father3. The starting point for family history is parentage; we need to relate the family member objects by object properties. An object property relates two objects, in this case a child object with his or her mother or father object. To do this we need to create three object properties:
Task 1: Creating object properties for parentagehasMother
; isMotherOf
and give hasMother
the InverseOf: isMotherOf
; hasFather
; hasParent
; give it the obvious inverse; hasMother
and hasFather
sub-properties of hasParent
. Note how the reasoner has automatically completed the sub-hierarchy for isParentOf:
isMotherOf
and isFatherOf
are inferred to be sub-properties of isParentOf
.
The OWL snippet below shows some parentage fact assertions on an individual. Note that rather than being assertions to an anonymous individual via some class, we are giving an assertion to a named individual.
Individual: grant_plinth\nFacts: hasFather mr_plinth, hasMother mrs_plinth\n
3 Don\u2019t quibble; it\u2019s true enough here.
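For reference, after Task 1 the asserted parentage properties might look something like the following sketch in Manchester syntax (the names follow the task wording; domains, ranges and characteristics are added in later chapters):

```
ObjectProperty: hasMother
    SubPropertyOf: hasParent
    InverseOf: isMotherOf

ObjectProperty: hasFather
    SubPropertyOf: hasParent
    InverseOf: isFatherOf

ObjectProperty: hasParent
    InverseOf: isParentOf
```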
Task 2: Create the ABoxhasMother
and hasFather
properties in our fact assertions. You do not need to assert names and birth years yet. This exercise will require you to create an individual for every person we want to talk about, using the Firstname_Secondname_Familyname_Birthyear
pattern, as for example in Robert_David_Bright_1965
. While asserting facts about all individuals in the FHKB will be a bit tedious at times, it might be useful to at least do the task for a subset of the family members. For the impatient reader, there is a convenience snapshot of the ontology including the raw individuals available at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
If you are working with Prot\u00e9g\u00e9, you may want to look at the Matrix plugin for Prot\u00e9g\u00e9 at this point. The plugin allows you to add individuals quickly in the form of a regular table, and can significantly reduce the effort of adding any type of entity to the ontology. In order to install the Matrix plugin, open Prot\u00e9g\u00e9 and go to File \u00bb Check for plugins. Select the \u2018Matrix Views\u2019 plugin. Click install, wait until the installation is confirmed, then close and re-open Prot\u00e9g\u00e9; go to the \u2018Window\u2019 menu item, select \u2018Tabs\u2019 and add the \u2018Individuals matrix\u2019.\n
Now do the following:
Task 3: DL querieshasFather
value David_Bright_1934
and look at the answers (remember to check the respective checkbox in Prot\u00e9g\u00e9 to include individuals in your query results). isFatherOf
value Robert_David_Bright_1965
. Look at the answers. 4. Look at the entailed facts on Robert_David_Bright_1965
.You should find the following:
Since we have said that isFatherOf
has an inverse of hasFather
, and we have asserted that Robert_David_Bright_1965 hasFather David_Bright_1934
, we have a simple entailment that David_Bright_1934 isFatherOf Robert_David_Bright_1965
. So, without asserting the isFatherOf
facts, we have been able to ask and get answers for that DL query.
As we asserted that Robert_David_Bright_1965 hasFather David_Bright_1934
, we also infer that he hasParent
David_Bright_1934
; this is because hasParent
is the super-property of hasFather
and the sub-property implies the super-property. This works all the way up the property tree until topObjectProperty
, so all individuals are related by topObjectProperty
\u2014this is always true. This implication \u2018upwards\u2019 is the way to interpret how the property hierarchies work.
We have now covered the basics of dealing with individuals in OWL ontologies. We have set up some properties, but without domains, ranges, appropriate characteristics and then arranged them in a hierarchy. From only a few assertions in our FHKB, we can already infer many facts about an individual: Simple exploitation of inverses of properties and super-properties of the asserted properties.
We have also encountered some important principles:
hasFather
implies the hasParent
fact between individuals. This entailment of the super-property is very important and will drive much of the inference we do with the FHKB.The FHKB ontology at this stage of the tutorial has an expressivity of ALHI.\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.026 sec (0.00001 % of final), by Pellet\n2.2.0 0.144 sec (0.00116 % of final) and by FaCT++ 1.6.4 is approximately 0.\nsec (0.000 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-3","title":"Chapter 3","text":""},{"location":"tutorial/fhkb/#ancestors-and-descendants","title":"Ancestors and Descendants","text":"In this Chapter you will:
Find a snapshot of the ontology at this stage at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial.\n
"},{"location":"tutorial/fhkb/#31-ancestors-and-descendants","title":"3.1 Ancestors and Descendants","text":"The FHKB has parents established between individuals and we know that all people have two parents. A parent is an ancestor of its children; a person\u2019s parent\u2019s parents are its ancestors; and so on. So, in our FHKB, Robert\u2019s ancestors are David, Margaret, William, Iris, Charles, Violet, James, another Violet, another William, Sarah and so on. If my parent\u2019s parents are my ancestors, then what we need is a transitive version of the hasParent
property. Obviously we do not want hasParent
to be transitive, as Robert\u2019s grandparents (and so on) would become his parents (and that would be wrong).
We can easily achieve what is necessary. We need a hasAncestor
property that has a transitive characteristic. The trick is to make this a super-property of the hasParent
property. As explained before, a sub-property implies its super-property. So, if individual x holds a hasParent
property with an individual y , then it also holds an instance of its super-property hasAncestor
with the individual y. If individual y then holds a hasParent
property with another individual z , then there is also, by implication, a hasAncestor
property between y and z. As hasAncestor
is transitive, x and z also hold a hasAncestor
relationship between them.
The inverse of hasAncestor
can either be isAncestorOf
or hasDescendant
. We choose the isAncestorOf
option.
hasRelation
, make it symmetric. hasAncestor
. hasRelation
and a super-property of hasParent
. hasAncestor
transitive. isAncestorOf
. Do not \u2018stitch\u2019 it into the property hierarchy; the reasoner will sort it all out for you. hasAncestor value William_George_Bright_1901
. isAncestorOf value Robert_David_Bright_1965
.The hasAncestor
object property will look like this:
ObjectProperty: hasAncestor\nSubPropertyOf: hasRelation\nSuperPropertyOf: hasParent,\nCharacteristics: Transitive\nInverseOf: isAncestorOf\n
As usual, it is best to think of the objects or individuals involved in the relationships. Consider the three individuals \u2013 Robert, David and William. Each has a hasFather
property, linking Robert to David and then David to William. As hasFather
implies its super-property hasParent
, Robert also has a hasParent
property with David, and David has a hasParent
relation to William. Similarly, as hasParent
implies hasAncestor
, the Robert object has a hasAncestor
relation to the David object and the David object has one to the William object. As hasAncestor
is transitive, Robert not only holds this property to the David object, but also to the William object (and so on back through Robert\u2019s ancestors).
We also want to use a sort of restricted transitivity in order to infer grandparents, great grandparents and so on. My grandparents are my parent\u2019s parents; my grandfathers are my parent\u2019s fathers. My great grandparents are my parent\u2019s parent\u2019s parents. My great grandmothers are my parent\u2019s parent\u2019s mothers. This is sort of like transitivity, but we want to make the paths only a certain length and, in the case of grandfathers, we want to move along two relationships \u2013 hasParent
and then hasFather
.
We can do this with OWL 2\u2019s sub-property chains. The way to think about sub-property chains is: If we see property x followed by property y linking three objects, then it implies that property z is held between
Figure 3.1: Three blobs representing objects of the classPerson. The three objects are linked by a hasParent
property and this implies a hasGrandparent
property.
the first and third objects. Figure 3.1 shows this diagrammatically for the hasGrandfather
property.
For various grandparent object properties we need the following sets of implications:
Notice that we can trace the paths in several ways, some have more steps than others, though the shorter paths themselves employ paths. Tracing these paths is what OWL 2\u2019s sub-property chains achieve. For the new object property hasGrandparent
we write:
ObjectProperty: hasGrandparent SubPropertyChain: hasParent o hasParent\n
We read this as \u2018hasParent
followed by hasParent
implies hasGrandparent
\u2019. We also need to think where the hasGrandparent
property fits in our growing hierarchy of object properties. Think about the implications: Does holding a hasParent
property between two objects imply that they also hold a hasGrandparent
property? Of course the answer is \u2018no\u2019. So, this new property is not a super-property of hasParent
. Does the holding of a hasGrandparent
property between two objects imply that they also hold an hasAncestor
property? The answer is \u2018yes\u2019; so that should be a super-property of hasGrandparent
. We need to ask such questions of our existing properties to work out where we put it in the object property hierarchy. At the moment, our hasGrandparent
property will look like this:
ObjectProperty: hasGrandParent\nSubPropertyOf: hasAncestor\nSubPropertyChain: hasParent o hasParent\nSuperPropertyOf: hasGrandmother, hasGrandfather\nInverseOf: isGrandParentOf\n
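Following the same pattern, a sketch of how the hasGrandfather property described above might be written (one\u2019s parent\u2019s father is one\u2019s grandfather; names as used in the text):

```
ObjectProperty: hasGrandfather
    SubPropertyOf: hasGrandParent
    SubPropertyChain: hasParent o hasFather
    InverseOf: isGrandfatherOf
```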
Do the following task:
Task 5: Grandparents object propertieshasGrandparent
, hasGrandmother
and hasGrandfather
object properties and the obvious inverses (see OWL code above); Robert_David_Bright_1965
and his parents.Again, think of the objects involved. We can take the same three objects as before: Robert, David and William. Think about the properties that exist, both by assertion and implication, between these objects. We have asserted only hasFather
between these objects. The inverse can be inferred between the actual individuals (remember that this is not the case for class level restrictions \u2013 that all instances of a class hold a property does not mean that the filler objects at the other end hold the inverse; the quantification on the restriction tells us this). Remember that:
hasFather
property with David;hasFather
property with William;hasParent
super-property of hasFather
, Robert holds a hasParent
property with David, and the latter holds one with William;hasGrandfather
then implies that Robert holds a hasGrandfather
property to William. Use the diagram in figure 3.1 to trace the path; there is a hasParent
path from Robert to William via David and this implies the hasGrandfather
property between Robert and William.It is also useful to point out that the inverse of hasGrandfather
also has the implication of the sub-property chain of the inverses of hasParent
. That is, three objects linked by a path of two isParentOf
properties implies that an isGrandfatherOf
property is established between the first and third object, in this case William and Robert. As the inverses of hasFather
are established by the reasoner, all the inverse implications also hold.
It is important when dealing with property hierarchies to think in terms of properties between objects and of the implications \u2018up the hierarchy\u2019. A sub-property implies its super-property. So, in our FHKB, two person objects holding a hasParent
property between them, by implication also hold an hasAncestor
property between them. In turn, hasAncestor
has a super-property hasRelation
and the two objects in question also hold, by implication, this property between them as well.
We made hasAncestor
transitive. This means that my ancestor\u2019s ancestors are also my ancestors. That a sub-property is transitive does not imply that its super-property is transitive. We have seen that by manipulating the property hierarchy we can generate a lot of inferences without adding any more facts to the individuals in the FHKB. This will be a feature of the whole process \u2013 keep the work to the minimum (well, almost).
In OWL 2, we can also trace \u2018paths\u2019 around objects. Again, think of the objects involved in the path of properties that link objects together. We have done simple paths so far \u2013 Robert linked to David via hasParent
and David linked to William via hasFather
implies the link between Robert and William of hasGrandfather
. If this is true for all cases (for which you have to use your domain knowledge), one can capture this implication in the property hierarchy. Again, we are making our work easier by adding no new explicit facts, but making use of the implication that the reasoner works out for us.
The FHKB ontology at this stage of the tutorial has an expressivity ofALRI+.\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.262 sec (0.00014 % of final), by Pellet\n2.2.0 0.030 sec (0.00024 % of final) and by FaCT++ 1.6.4 is approximately 0.004\nsec (0.000 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-4","title":"Chapter 4","text":""},{"location":"tutorial/fhkb/#modelling-the-person-class","title":"Modelling the Person Class","text":"In this Chapter you will:
Person
class;Sex
classes;Man
and Woman
;These simple classes will form the structure for the whole FHKB.
"},{"location":"tutorial/fhkb/#41-the-class-of-person","title":"4.1 The Class of Person","text":"For the FHKB, we start by thinking about the objects involved
There is a class of Person
that we will use to represent all these people objects.
Person
class DomainEntity
; DomainEntity
called Person
.We use DomainEntity
as a house-keeping measure. All of our ontology goes underneath this class. We can put other classes \u2018outside\u2019 the ontology, as siblings of DomainEntity
, such as \u2018probe\u2019 classes we wish to use to test our ontology.
The main thing to remember about the Person
class is that we are using it to represent all \u2018people\u2019 individuals. When we make statements about the Person
class, we are making statements about all \u2018people\u2019 individuals.
What do we know about people? All members of the Person
class have:
There\u2019s a lot more we know about people, but we will not mention it here.
"},{"location":"tutorial/fhkb/#42-describing-sex-in-the-fhkb","title":"4.2 Describing Sex in the FHKB","text":"Each and every person object has a sex. In the FHKB we will take a simple view on sex \u2013 a person is either male or female, with no intersex or administrative sex and so on. Each person only has one sex.
We have two straight-forward options for modelling sex:
We will take the approach of having a class of Maleness objects and a class of Femaleness objects. These are qualities or attributes of self-standing objects such as a person. These two classes are disjoint, and each is a subclass of a class called Sex
. The disjointness means that any one instance of Sex
cannot be both an instance of Maleness
and an instance of Femaleness
at once. We also want to put in a covering axiom on the class Sex
, which means that any instance of Sex
must be either Maleness
or Femaleness
; there is no other kind of Sex
.
Again, notice that we have been thinking at the level of objects. We do the same when thinking about Person
and their Sex
. Each and every person is related to an instance of Sex
. Each Person
holds one relationship to a Sex
object. To do this we create an object property called hasSex
. We make this property functional, which means that any object can hold that property to only one distinct filler object.
We make the domain of hasSex
to be Person
and the range to be Sex
. The domain of Person
means that any object holding that property will be inferred to be a member of the class Person
. Putting the range of Sex
on the hasSex
property means that any object at the right-hand end of the hasSex
property will be inferred to be of the class Sex
. Again, think at the level of individuals or objects.
We now put a restriction on the Person
class to state that each and every instance of the class Person
holds a hasSex
property with an instance of the Sex
class. It has an existential operator \u2018some\u2019 in the axiom, but the functional characteristic means that each Person
object will hold only one hasSex
property to a distinct instance of a Sex
object4.
4 An individual could hold two hasSex
properties, as long as the sex objects at the right-hand end of the property are not different.
Sex
; DomainEntity
; Person
and Sex
disjoint; Sex
, Maleness
and Femaleness
; Maleness
and Femaleness
disjoint; Sex
such that it is equivalent to Maleness
or Femaleness
. hasSex
, with the domain Person
, the range Sex
and give it the characteristic of \u2018Functional\u2019; hasSex some Sex
to the class Person
.The hasSex
property looks like:
ObjectProperty: hasSex\nCharacteristics: Functional\nDomain: Person\nRange: Sex\n
The Person
class looks like:
Class: Person\nSubClassOf: DomainEntity,(hasSex some Sex)\nDisjointWith: Sex\n
"},{"location":"tutorial/fhkb/#43-defining-man-and-woman","title":"4.3 Defining Man and Woman","text":"We now have some of the foundations for the FHKB. We have the concept of Person
, but we also need to have the concepts of Man
and Woman
. Now we have Person
, together with Maleness
and Femaleness
, we have the necessary components to define Man
and Woman
. These two classes can be defined as: Any Person
object that has a male sex can be recognised to be a man; any Person
object that has a female sex can be recognised as a member of the class woman. Again, think about what conditions are sufficient for an object to be recognised to be a member of a class; this is how we create defined classes through the use of OWL equivalence axioms.
To make the Man
and Woman
classes do the following:
Man
; Person that hasSex some Maleness
; Femaleness
, to create the Woman
class; Person
class to indicate that man and woman are the only kinds of person that can exist. (This is not strictly true due to the way Sex
has been described.) Having run the reasoner, the Man
and Woman
classes should appear underneath Person
5.
5Actually in Prot\u00e9g\u00e9, this might happen without the need to run the reasoner.
The Man
and Woman
classes will be important for use as domain and range constraints on many of the properties used in the FHKB. To achieve our aim of maximising inference, we should be able to infer that individuals are members of Man
, Woman
or Person
by the properties held by an object. We should not have to state the type of an individual in the FHKB.
The classes for Man
and Woman
should look like:
Class: Man\nEquivalentTo: Person and (hasSex some Maleness)\n
Class: Woman\nEquivalentTo: Person and (hasSex some Femaleness)\n
"},{"location":"tutorial/fhkb/#44-describing-parentage-in-the-fhkb","title":"4.4 Describing Parentage in the FHKB","text":"To finish off the foundations of the FHKB we need to describe a person object\u2019s parentage. We know that each and every person has one mother and each and every person has one father. Here we are talking about biological mothers and fathers. The complexities of adoption and step parents are outside the scope of this FHKB tutorial.
Task 9: Describing ParentagePerson
and the range Woman
to the property hasMother
.hasFather
, but give it the range Man
;hasParent
domain and range of Person
;The (inferred) property hierarchy in the FHKB should look like that shown in Figure 4.1. Notice that we have asserted the sub-property axioms on one side of the property hierarchy. Having done so, the reasoner uses those axioms, together with the inverses, to work out the property hierarchy for the \u2018other side\u2019.
We make hasMother
functional, as any one person object can hold only one hasMother
property to a distinct Woman
object. The range of hasMother
is Woman
, as a mother has to be a woman. The Person
object holding the hasMother
property can be either a man or a woman, so we have the domain constraint as Person
; this means any object holding a hasMother
property will be inferred to be a Person
. Similarly, any object at the right-hand end of a hasMother
property will be inferred to be a Woman
, which is the result we need. The same reasoning goes for hasFather
and hasParent
, with the sex constraints on the latter being only Person
. The inverses of the two functional sub-properties of hasParent
are not themselves functional. After all, a Woman
can be the mother of many Person
objects, but each Person
object can have only one mother.
Figure 4.1: The property hierarchy with the hasSex
and the parentage properties
Figure 4.2: the core TBox for the FHKB with the Person
and Sex
classes.
Person
class as shown below.Class: Person\nSubClassOf: DomainEntity, (hasFather some Man), (hasMother some Woman),\n(hasSex some Sex)\nDisjointWith: Sex\n
Task 11: DL queries for people and sex Person
, Man
and Woman
; look at the answers and count the numbers in each class; which individuals have no sex and why? Man
or Woman
, but some are, as we will see below, only inferred to be Person
.The domain and range constraints on our properties have also driven some entailments. We have not asserted that David_Bright_1934
is a member of Man
, but the range constraint on hasFather
(or the inferred domain constraint on the isFatherOf
relation) has enabled this inference to be made. This goes for any individual that is the right-hand-side (either inferred or asserted) of either hasFather
or hasMother
(where the range is that of Woman
). For Robert David Bright, however, he is only the left-hand-side of an hasFather
or an hasMother
property, so we\u2019ve only entailed that this individual is a member of Person
.
In our description of the Person
class we have said that each and every instance of the class Person
has a father (the same goes for mothers). So, when we ask the query \u2018which individuals have a father\u2019, we get all the instances of Person
back, even though we have said nothing about the specific parentage of each Person
. We do not know who their mothers and fathers are, but we know that they have one of each. We know all the individuals so far entered are members of the Person
class; when asserting the type to be either Man
or Woman
(each of which is a subclass of Person
), we infer that each is a person. When asserting the type of each individual via the hasSex
property, we know each is a Person
, as the domain of hasSex
is the Person
class. As we have also given the right-hand side of hasSex
as either Maleness
or Femaleness
, we have given sufficient information to recognise each of these Person
instances to be members of either Man
or Woman
.
So far we have not systematically added domains and ranges to the properties in the FHKB. As a reminder, when a property has a domain of X
any object holding that property will be inferred to be a member of class X
. A domain doesn\u2019t add a constraint that only members of class X
hold that property; it is a strong implication of class membership. Similarly, a property holding a range implies that an object acting as right-hand-side to a property will be inferred to be of that class. We have already seen above that we can use domains and ranges to imply the sex of people within the FHKB.
Do the following:
Task 12: Domains and RangesPerson
, Man
and Woman
are domains and ranges for hasFather
, hasMother
and hasParent
. hasAncestor
, hasGrandparent
, hasUncle
and so on; look to see what domains and ranges are found. Add any domains and ranges explicitly as necessary.
Prot\u00e9g\u00e9 for example in its current version (November 2015) does not visualise\ninherited domains and ranges in the same way as it shows inferred inverse relations.\n
We typically assert more domains and ranges than strictly necessary. For example, if we say that hasParent
has the domain Person
, this means that every object x
that is connected to another object y
via the hasParent
relation must be a Person
. Let us assume the only thing we said about x
and y
is that they are connected by a hasMother
relation. Since this implies that x
and y
are also connected by a hasParent
relation (hasMother
is a sub-property of hasParent
) we do not have to assert that hasFather
has the domain of Person
; it is implied by what we know about the domain and range of hasParent
.
In order to remove as many assertions as possible, we may therefore choose to assert as much as we know starting from the top of the hierarchy, and only ever adding a domain if we want to constrain the already inferred domain even further (or range respectively). For example, in our case, we could have chosen to assert Person
to be the domain of hasRelation
. Since hasRelation
is symmetric, it will also infer Person
to be the range. We do not need to say anything for hasAncestor
or hasParent
, and only if we want to constrain the domain or range further (like in the case of hasFather
by making the range Man
) do we need to actually assert something. It is worth noting that because we have built the object property hierarchy from the bottom (hasMother
etc.) we have ended up asserting more than necessary.
From the Pizza Tutorial and other work with OWL you should have seen some unsatisfiabilities. In Prot\u00e9g\u00e9 this is highlighted by classes going \u2018red\u2019 and being subclasses ofNothing; that is, they can have no instances in that model.
Task 13: InconsistenciesRobert_David_Bright_1965 hasMother David_Bright_1934
. Robert_David_Bright_1965 hasMother Iris_Ellen_Archer_1907
After asserting the first fact it should be reported by the reasoner that the ontology is inconsistent. This means, in lay terms, that the model you\u2019ve provided in the ontology cannot accommodate the facts you\u2019ve provided in the fact assertions in your ABox\u2014that is, there is an inconsistency between the facts and the ontology... The ontology is inconsistent because David_Bright_1934
is being inferred to be a Man
and a Woman
at the same time which is inconsistent with what we have said in the FHKB.
When we, however, say that Robert David Bright
has two different mothers, nothing bad happens! Our domain knowledge says that the two women are different, but the reasoner does not yet know this; as far as it is concerned, Iris Ellen Archer and Margaret Grace Rever may be the same person, and we have to tell it explicitly that they are different. For the same reason the functional characteristic also has no effect until the reasoner 'knows' that the individuals are different. We will do this in Section 7.1.1 and live with this 'fault' for the moment.
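For reference, the way we will eventually tell the reasoner that the two women are distinct (Section 7.1.1) is a different-individuals axiom; a minimal sketch in Manchester syntax:

DifferentIndividuals: Iris_Ellen_Archer_1907, Margaret_Grace_Rever_1934

In practice we will make all individuals in the FHKB different in one go, rather than listing pairs by hand.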
Task 14: Ancestors and descendants: create defined classes for Ancestor
, MaleAncestor
, FemaleAncestor
; Descendant
, MaleDescendant
and FemaleDescendant
; The code for the classes looks like:
Class: Ancestor EquivalentTo: Person and isAncestorOf some Person\nClass: FemaleAncestor EquivalentTo: Woman and isAncestorOf some Person\nClass: Descendant EquivalentTo: Person and hasAncestor some Person\nClass: MaleDescendant EquivalentTo: Man and hasAncestor some Person\n
The TBox after reasoning can be seen in Figure 4.3. Notice that the reasoner has inferred that several of the classes are equivalent or \u2018the same\u2019. These are: Descendant
and Person
; MaleDescendant
and Man
, FemaleDescendant
and Woman
.
The reasoner has used the axioms within the ontology to infer that all the instances of Person
are also instances of the class Descendant
and that all the instances of Woman
are also the same instances as the class Female Descendant
. This is intuitively true; all people are descendants \u2013 they all have parents that have parents etc. and thus everyone is a descendant. All women are female people that have parents etc. As usual we should think about the objects within the classes and what we know about them. This time it is useful to think about the statements we have made about Person
in this Chapter \u2013 that all instances of Person
have a father and a mother; add to this the information from the property hierarchy and we know that all instances of Person
have parents and ancestors. We have repeated all of this in our new defined classes for Ancestor
and Descendant
and the reasoner has highlighted this information.
Figure 4.3: The defined classes from Section 4.8 in the FHKB\u2019s growing class hierarchy
Task 15: More AncestorsMaleDescendant
. You should get Man
back - they are equivalent (and this makes sense). hasAncestor
, but adding in, for instance, hasFather
as a sub-property of the transitive super-property hasForefather
and setting the domains and ranges appropriately (or working out if they'll be inferred appropriately). Here we interpret a forefather as one's father's father etc. This isn't quite right, as a forefather is any male ancestor, but we'll do it that way anyway. You might want to play around with DL queries. Because of the blowup in inferred relationships, we decided to not include this pattern in the tutorial version of the FHKB.
Most of what we have done in this chapter is straightforward OWL, all of which would have been met in the Pizza Tutorial. It is, however, a useful revision and it sets the stage for refining the FHKB. Figure 4.2 shows the basic set-up we have in the FHKB in terms of classes; we have a class to represent person, man and woman, all set up with a description of sex, maleness and femaleness. It is important to note, however, the approach we have taken: we have always thought in terms of the objects we are modelling.
Here are some things that should now be understood upon completing this chapter:
Person
have a mother, so any individual asserted to be a Person
must have a mother. We do not necessarily know who they are, but we know they have one.Person
, not that he is a Man
. This is because, so far, he only has the domain constraint of hasMother
and hasFather
to help out.Finally, we looked at some defined classes. We inferred equivalence between some classes where the extents of the classes were inferred to be the same \u2013 in this case the extents of Person
and Descendant
are the same. That is, all the objects that can appear in Person
will also be members of Descendant
. We can check this implication intuitively \u2013 all people are descendants of someone. Perhaps not the most profound inference of all time, but we did no real work to place this observation in the FHKB.
This last point is a good general observation. We can make the reasoner do work\nfor us. The less maintenance we have to do in the FHKB the better. This will be\na principle that works throughout the tutorial.\n
The FHKB ontology at this stage of the tutorial has an expressivity of SRIF.\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.884 sec (0.00047 % of final), by Pellet\n2.2.0 0.256 sec (0.00207 % of final) and by FaCT++ 1.6.4 is approximately 0.013\nsec (0.000 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-5","title":"Chapter 5","text":""},{"location":"tutorial/fhkb/#siblings-in-the-fhkb","title":"Siblings in the FHKB","text":"In this chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
"},{"location":"tutorial/fhkb/#51-blood-relations","title":"5.1 Blood relations","text":"Do the following first:
Task 16: The bloodrelation object propertyhasBloodrelation
object property, making it a sub-property of hasRelation
. hasAncestor
property a sub-property of hasBloodrelation
.Does a blood relation of Robert have the same relationship to Robert (symmetry)? Is a blood relation of Robert\u2019s blood relation a blood relation of Robert (transitivity)? Think of an aunt by marriage; her children are my cousins and blood relations via my uncle, but my aunt is not my blood relation. My siblings share parents; male siblings are brothers and female siblings are sisters. So far we have asserted parentage facts for the Person
in our ABox. Remember that our parentage properties have inverses, so if we have added an hasFather
property between a Person
and a Man
, we infer the isFatherOf
property between that Man
and that Person
.
We should have enough information within the FHKB to infer siblings. We could use a sub-property chain such as:
ObjectProperty: hasSibling\nSubPropertyOf: hasBloodrelation\nCharacteristics: Symmetric, Transitive\nSubPropertyChain: hasParent o isParentOf\n
We make a property of hasSibling
and make it a sub-property of hasBloodrelation
. Remember, think of the objects involved and the implications we want to follow; being a sibling implies being a blood relation, it does not imply any of the other relationships we have in the FHKB.
Note that we have made hasSibling
symmetric; if Robert is sibling of Richard, then Richard is sibling of Robert. We should also think about transitivity; if David is sibling of Peter and Peter is sibling of John, then David is sibling of John. So, we make hasSibling
symmetric and transitive (see Figure 5.1). However, we must take care of half-siblings: child 1 and child 2 share a mother, but not a father; child 2 and child 3 share the father, but not the mother \u2013 child 1 and child 3 are not even half-siblings. However, at least for the moment, we will simply ignore this inconvenience, largely so that we can explore what happens with different modelling options.
Figure 5.1: Showing the symmetry and transitivity of the hasSibling
(siblingof) property by looking at the brothers David, John and Peter
We also have the implication using three objects (see Figure 5.2):
hasParent
property with David;isFatherOf
property with Richard;hasSibling
property with Richard;hasSibling
is symmetric, Richard holds an hasSibling
property with Robert.Figure 5.2: Tracing out the sub-property chain for hasSibling
; note that Robert is a sibling of himself by this path
Do the following tasks:
Task 17: SiblingshasSibling
property as above; hasSibling
value Robert_David_Bright_1965
.From this last DL query you should get the answer that both Robert and Richard are siblings of Robert. Think about the objects involved in the sub-property chain: we go from Robert to David via the hasParent
and from David to Richard via the isParentOf
property; so this is OK. However, we also go from Robert to David and then we can go from David back to Robert again \u2013 so Robert is a sibling of Robert. We do not want this to be true.
We can add another characteristic to the hasSibling
property, the one of being irreflexive
. This means that an object cannot hold the property with itself.
Task 18: Irreflexive siblings: add the irreflexive characteristic to the hasSibling
property; Note that the reasoner claims you have an inconsistent ontology (or in some cases, you might get a message box saying \"Reasoner died\"). Looking at the hasSibling
property again, the reason might not be immediately obvious. The reason for the inconsistency lies in the fact that we create a logical contradiction: through the property chain, we say that every Person
is a sibling of him or herself, while at the same time disallowing exactly that by adding the irreflexive characteristic. A different explanation lies within the OWL specification itself: in order to maintain decidability, irreflexive properties must be simple – for example, they may not be implied by property chains6.
6 http://www.w3.org/TR/owl2-syntax/#The_Restrictions_on_the_Axiom_Closure
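To see the clash in one place, this sketch shows the combination that is not allowed in one frame:

ObjectProperty: hasSibling
    SubPropertyOf: hasBloodrelation
    Characteristics: Symmetric, Transitive, Irreflexive
    SubPropertyChain: hasParent o isParentOf

Because hasSibling is the super-property of a property chain it is non-simple, and OWL 2's global restrictions forbid non-simple properties from carrying characteristics such as Irreflexive; hence the inconsistency or reasoner error reported above.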
"},{"location":"tutorial/fhkb/#521-brothers-and-sisters","title":"5.2.1 Brothers and Sisters","text":"We have only done siblings, but we obviously need to account for brothers and sisters. In an analogous way to motherhood, fatherhood and parenthood, we can talk about sex specific sibling relationships implying the sex neutral hasSibling
; holding either a hasBrother
or an isSisterOf
between two objects would imply that a hasSibling
property is also held between those two objects. This means that we can place these two sex specific sibling properties below hasSibling
with ease. Note, however, that unlike the hasSibling
property, the brother and sister properties are not symmetric. Robert hasBrother
Richard and vice versa, but if Daisy hasBrother
William, we do not want William to hold an hasBrother
property with Daisy. Instead, we create an inverse of hasBrother
, isBrotherOf
, and then do the same for isSisterOf
.
We use similar, object based, thought processes to choose whether to have transitivity as a characteristic of hasBrother
. Think of some sibling objects or individuals and place hasBrother
properties between them. Make it transitive and see if you get the right answers. Put in a sister too and see if it still works. If David hasBrother
Peter and Peter hasBrother
John, then David hasBrother
John; so, transitivity works in this case. Think of another example. Daisy hasBrother
Frederick, and Frederick hasBrother
William, thus Daisy hasBrother
William. The inverses work in the same way; William isBrotherOf
Frederick and Frederick isBrotherOf
Daisy; thus William isBrotherOf
Daisy. All this seems reasonable.
Task 19: Brothers and sisters: add the hasBrother
object property as shown below; hasSister
in a similar manner; 3. Add appropriate inverses, domains and ranges.ObjectProperty: hasBrother\nSubPropertyOf: hasSibling\nCharacteristics: Transitive\nInverseOf: isBrotherOf\nRange: Man\n
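For the second item in the task, a possible sketch of the sister property, mirroring hasBrother (the names follow the pattern used so far; check them against your own ontology):

ObjectProperty: hasSister
    SubPropertyOf: hasSibling
    Characteristics: Transitive
    InverseOf: isSisterOf
    Range: Woman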
We have some hasSibling
properties (even if they are wrong). We also know the sex of many of the people in the FHKB through the domains and ranges of properties such as hasFather
, hasMother
and their inverses.
Can we use sub-property chains in the same way as we have used them in the hasSibling
property? The issue is that of sex; the property isFatherOf
is sex neutral at the child end, as is the inverse hasFather
(the same obviously goes for the mother properties). We could use a sub-property chain of the form:
ObjectProperty: hasBrother\nSubPropertyChain: hasParent o hasSon\n
A son is a male child and thus that object is a brother of his siblings. At the moment we do not have son or daughter properties. We can construct a property hierarchy as shown in Figure 5.3. This is made up from the following properties:
hasChild
and isChildOf
hasSon
(range Man
and domain Person
) and isSonOf
;hasDaughter
(range Woman
domain Person
) and isDaughterOf
Note that hasChild
is the equivalent of the existing property isParentOf
; if I have a child, then I am its parent. OWL 2 can accommodate this fact. We can add an equivalent property axiom in the following way:
ObjectProperty: isChildOf\nEquivalentTo: hasParent\n
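A possible sketch of the son and daughter properties described in the list above, assuming they sit under hasChild as Figure 5.3 suggests (a sketch only; adapt the names to your ontology):

ObjectProperty: hasSon
    SubPropertyOf: hasChild
    Domain: Person
    Range: Man
    InverseOf: isSonOf
ObjectProperty: hasDaughter
    SubPropertyOf: hasChild
    Domain: Person
    Range: Woman
    InverseOf: isDaughterOf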
We have no way of inferring the isSonOf
and isDaughterOf
from what already exists. What we want to happen is the implication of \u2018Man
and hasParent
Person
implies isSonOf
\u2019. OWL 2 and its reasoners cannot do this implication. It has been called the \u2018man man problem\u20197. Solutions for this have been developed [3], but are not part of OWL 2 and its reasoners.
Figure 5.3: The property hierarchy for isChildOf
and associated son/daughter properties
7 http://lists.w3.org/Archives/Public/public-owl-dev/2007JulSep/0177.html
| Child | Property | Parents |
|---|---|---|
| Robert David Bright 1965 | isSonOf | David Bright 1934, Margaret Grace Rever 1934 |
| Richard John Bright 1962 | isSonOf | David Bright 1934, Margaret Grace Rever 1934 |
| Mark Bright 1956 | isSonOf | John Bright 1930, Joyce Gosport |
| Ian Bright 1959 | isSonOf | John Bright 1930, Joyce Gosport |
| Janet Bright 1964 | isDaughterOf | John Bright 1930, Joyce Gosport |
| William Bright 1970 | isSonOf | John Bright 1930, Joyce Gosport |

Table 5.1: Child property assertions for the FHKB
Thus we must resort to hand assertions of properties to test out our new path:
Task 20: Sons and daughtersisChildOf
value David_Bright_1934
and you should have the answer of Richard and Robert; Of course, it works, but we see the same problem as above. As usual, think of the objects involved. Robert isSonOf
David and David isParentOf
Robert, so Robert is his own brother. Irreflexivity again causes problems as it does above (Task 18).
Our option one has lots of problems. So, we have an option of asserting the various levels of sibling. We can take the same basic structure of sibling properties as before, but just fiddle around a bit and rely on more assertion while still trying to infer as much as possible. We will take the following approach:
Table 5.2: The sibling relationships to add to the FHKB.
Do the following:
Task 21: Add sibling assertionsisChildOf
assertions as explained above. isBrotherOf
value Robert_David_Bright_1965
; isBrotherOf
value Richard_John_Bright_1962
; hasBrother
value Robert_David_Bright_1965
; hasBrother
value Richard_John_Bright_1962
;isSisterOf
value William_Bright_1970
; Man and hasSibling value Robert_David_Bright_1965
.We can see some problems with this option as well:
hasBrother
property to Robert. We would really like an isBrotherOf
to Robert to hold.Man
and hasSibling value Robert
only retrieves Robert himself. Because we only asserted that Robert is a brother of Richard, and the domain of isBrotherOf
is Man
we know that Robert is a Man
, but we do not know anything about the Sex
of Richard.Which of the two options gives the worse answers and which is the least effort? Option one is obviously the least effort; we only have to assert the same parentage facts as we already have; then the sub-property chains do the rest. It works OK for hasSibling
, but we cannot do brothers and sisters adequately; we need Man
and hasSibling
\u2290 isBrotherOf
and we cannot do that implication. This means we cannot ask the questions we need to ask.
So, we do option two, even though it is hard work and is still not perfect for query answering, even though we have gone for a sparse assertion mode. Doing full sibling assertion would work, but is a lot of effort.
We could start again and use the isSonOf and isDaughterOf
option, with the sub-property chains described above. This still has the problem of everyone being their own sibling. It can get the sex specific sibling relationships, but requires a wholesale re-assertion of parentage facts. We will continue with option two, largely because it highlights some nice problems later on.
In Section 5.2 we briefly talked about half-siblings. So far, we have assumed full-siblings (or, rather, just talked about siblings and made no distinction). Ideally, we would like to accommodate distinctions between full- and half-siblings; here we use half-siblings, where only one parent is in common between two individuals, as the example. The short-answer is, unfortunately, that OWL 2 cannot deal with half-siblings in the way that we want - that is, such that we can infer properties between named individuals indicating full- or half-sibling relationships.
It is possible to find sets of half-brothers in the FHKB by writing a defined class or DL query for a particular individual. The following fragment of OWL defines a class that looks for the half-brothers of an individual called 'Percival':
Class: HalfBrotherOfPercival\nEquivalentTo: Man and (((hasFather some (not (isFatherOf value Percival))) and\n(hasMother some (isMotherOf value Percival))) or ((hasFather some (isFatherOf\nvalue Percival)) and (hasMother some (not (isMotherOf value Percival)))))\n
Here we are asking for any man that either has Percival\u2019s father but not his mother, or his mother, but not his father. This works fine, but is obviously not a general solution. The OWL description is quite complex and the writing will not scale as the number of options (hypothetically, as the number of parents increases... ) increases; it is fine for man/woman, but go any higher and it will become very tedious to write all the combinations.
Another way of writing this half-brother class, to find the set of half-brothers of an individual, is to use cardinality constraints:
Class: HalfBrotherOfPercival\nEquivalentTo: Man and (hasParent exactly 1 (isParentOf value Percival))\n
This is more succinct. We are asking for a man that has exactly one parent from the class of individuals that are the class of Percival\u2019s parents. This works, but one more constraint has to be present in the FHKB. We need to make sure that there can be only two parents (or indeed, just a specified number of parents for a person). If we leave it open as to the number of parents a person has, the reasoner cannot work out that there is a man that shares exactly one parent, as there may be other parents. We added this constraint to the FHKB in Section 6.2; try out the classes to check that they work.
These two solutions have been about finding sets of half-brothers for an individual. What we really want in the FHKB is to find half-brothers between any given pair of individuals.
Unfortunately we cannot, without rules, ask OWL 2 to distinguish full- and half-siblings \u2013 we cannot count the number of routes taken between siblings via different distinct intermediate parent objects.
"},{"location":"tutorial/fhkb/#55-aunts-and-uncles","title":"5.5 Aunts and Uncles","text":"An uncle is a brother of either my mother or father. An aunt is a sister of either my mother or father. In common practice, wives and husbands of aunts and uncles are usually uncles and aunts respectively. Formally, these aunts and uncles are aunts-in-law and uncles-in-law. Whatever approach we take, we cannot fully account for aunts and uncles until we have information about marriages, which will not have until Chapter 9. We will, however, do the first part now.
Look at the objects and properties between them for the following facts:
As we are tracing paths or \u2018chains\u2019 of objects and properties we should use sub-property chains as a solution for the aunts and uncles. We can make an hasUncle
property as follows (see Figure 5.4):
ObjectProperty: hasUncle\nSubPropertyOf: hasBloodrelation\nDomain: Person\nRange: Man\nSubPropertyChain: hasParent o hasBrother\nInverseOf: isUncleOf\n
Figure 5.4: Tracing out the path between objects to get the hasUncle
sub-property chain.
Notice we have the domain of Person and the range of Man
. We also have an inverse. As usual, we can read this as \u2018an object that holds an hasParent
property, followed by an object holding a hasBrother
property, implies that the first object holds an hasUncle
property with the last object\u2019.
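The aunt property asked for in the task below can follow exactly the same pattern; a possible sketch (the names and the inverse are assumptions to check against your own ontology):

ObjectProperty: hasAunt
    SubPropertyOf: hasBloodrelation
    Domain: Person
    Range: Woman
    SubPropertyChain: hasParent o hasSister
    InverseOf: isAuntOf

hasGreatUncle and hasGreatAunt simply add another hasParent step to the front of the chain (hasParent o hasParent o hasBrother, for example), as the text below explains.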
Note also where the properties (including the ones for aunt) go in the object property hierarchy. Aunts and uncles are not ancestors that are in the direct blood line of a person, but they are blood relations (in the narrower definition that we are using). Thus the aunt and uncle properties go under the hasBloodrelation
property (see Figure 5.5). Again, think of the implications between objects holding a property between them; that two objects linked by a property implies that those two objects also hold all the property\u2019s super-properties as well. As long as all the super-properties are true, the place in the object property hierarchy is correct (think about the implications going up, rather than down).
Figure 5.5: The object property hierarchy with the aunt and uncle properties included. On the right side, we can see the hasUncle property as shown by Prot\u00e9g\u00e9.
Do the following tasks:
Task 22: Uncles and AuntshasUncle
property as above; hasAunt
property as well; Julie_Bright_1966
and for Mark_Bright_1956
; hasGreatUncle
and hasGreatAunt
and place them in the property hierarchy.We can see this works \u2013 unless we have any gaps in the sibling relationships (you may have to fix these). Great aunts and uncles are simply a matter of adding another \u2018parent\u2019 leg into the sub-property chain. We are not really learning anything new with aunts and uncles, except that we keep gaining a lot for
free through sub-property chains. We just add a new property with its sub-property chain and we get a whole lot more inferences on individuals. To see what we now know about Robert David Bright, do the following:
Task 23: What do we know? Run the reasoner and inspect the inferred property assertions on the individual Robert_David_Bright_1965.
You can now see lots of facts about Robert David Bright, with only a very few actual assertions directly on Robert David Bright.
"},{"location":"tutorial/fhkb/#56-summary","title":"5.6 Summary","text":"Siblings have revealed several things for us:
Man
and hasSibling
\u2283 isBrotherOf
, but OWL 2 doesn\u2019t do this implication;The FHKB ontology at this stage of the tutorial has an expressivity ofSRIF.\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 1355.614 sec (0.71682 % of final), by\nPellet 2.2.0 0.206 sec (0.00167 % of final) and by FaCT++ 1.6.4 is approximately\n0.039 sec (0.001 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-6","title":"Chapter 6","text":""},{"location":"tutorial/fhkb/#individuals-in-class-expressions","title":"Individuals in Class Expressions","text":"In this chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
"},{"location":"tutorial/fhkb/#61-richard-and-roberts-parents-and-ancestors","title":"6.1 Richard and Robert\u2019s Parents and Ancestors","text":"So far we have only used object properties between unspecified objects. We can, however, specify a specific individual to act at the right-hand-side of a class restriction or type assertion on an individual. The basic syntax for so-called nominals is:
Class: ParentOfRobert\nEquivalentTo: Person and isParentOf valueRobert_David_Bright_1965\n
This is an equivalence axiom that recognises any individual that is a Person
and a parent of Robert David Bright.
ParentOfRobert
as described above; Richard_John_Bright_1962
and classify; ParentOfRichardAndRobert
, defining it as Person and isParentOf some {Robert_David_Bright_1965 ,Richard_John_Bright_1962 }
; again see what happens on classification. Note that the expressions isMotherOf value Robert_David_Bright_1965
and isMotherOf some {Robert_David_Bright_1965 }
are practically identical. The only difference is that using value
, you can only specify one individual, while some
relates to a class (a set of individuals).We see that these queries work and that we can create more complex nominal based class expressions. The disjunction above is
isParentOf some {Robert_David_Bright_1965, Richard_John_Bright_1965}\n
The \u2018{\u2019 and \u2018}\u2019 are a bit of syntax that says \u2018here\u2019s a class of individual\u2019.
We also see that the classes for the parents of Robert David Bright and Richard John Bright have the same members according to the FHKB, but that the two classes are not inferred to be equivalent. Our domain knowledge indicates the two classes have the same extents (members) and thus the classes are equivalent, but the automated reasoner does not make this inference. As usual, this is because the FHKB has not given the automated reasoner enough information to make such an inference.
"},{"location":"tutorial/fhkb/#62-closing-down-what-we-know-about-parents-and-siblings","title":"6.2 Closing Down What we Know About Parents and Siblings","text":"The classes describing the parents of Richard and Robert are not equivalent, even though, as humans, we know their classes of parent are the same. We need more constraints so that it is known that the four parents are the only ones that exist. We can try this by closing down what we know about the immediate family of Robert David Bright.
In Chapter 4 we described that a Person
has exactly one Woman
and exactly one Man
as mother and father (by saying that the hasMother
and hasFather
properties are functional and thus only one of each may be held by any one individual to distinct individuals). The parent properties are defined in terms of hasParent
, hasMother
and hasFather
. The latter two imply hasParent
. The two sub-properties are functional, but there are no constraints on hasParent
, so an individual can hold many instances of this property. So, there is no information in the FHKB to say a Person
has only two parents (we say there is one mother and one father, but not that there are only two parents). Thus Robert and Richard could have other parents and other grandparents than those in the FHKB; we have to close down our descriptions so that only two parents are possible. There are two ways of doing this:
hasParent
in the same way as we did for Sex
in Chapter 4.hasParent
exactly 2 Person
to the classPerson
; ParentOfRobert
and ParentOfRichard
are placed and whether or not they are found to be equivalent; hasParent max 2 Person
to the class Person
; We find that these two classes are equivalent; we have supplied enough information to infer that these two classes are equivalent. So, we know that option one above works, but what about option two? This takes a bit of care to think through, but the basic thing is to think about how many ways there are to have a hasParent
relationship between two individuals. We know that we can have either a hasFather
or a hasMother
property between two individuals; we also know that we can have only one of each of these properties between an individual and a distinct individual. However, the open world assumption tells us that there may be other ways of having a hasParent
property between two individuals; we\u2019ve not closed the possibilities. By putting on the hasParent exactly 2 Person
restriction on the Person
class, we are effectively closing down the options for ways that a person can have parents; we know because of the functional characteristic on hasMother
and hasFather
that we can have only one of each of these and the two restrictions say that one of each must exist. So, we know we have two ways of having a parent on each Person
individual. So, when we say that there are exactly two parents (no more and no less) we have closed down the world of having parents\u2014thus these two classes can be inferred to be equivalent. It is also worth noting that this extra axiom on the Person
class will make the reasoner run much more slowly.
Finally, for option 2, we have no way of placing a covering axiom on a property. What we\u2019d like to be able to state is something like:
ObjectProperty: hasParent\nEquivalentTo: hasFather or hasMother\n
but we can\u2019t.
"},{"location":"tutorial/fhkb/#63-summary","title":"6.3 Summary","text":"For practice, do the following:
Task 26: Additional PracticeGrandparentOfRobert
and GrandparentOfRichard
and make them inferred to be equivalent.In this chapter we have seen the use of individuals within class expressions. It allows us to make useful queries and class definitions. The main things to note is that it can be done and that there is some syntax involved. More importantly, some inferences may not be as expected due to the open world assumption in OWL.
By now you might have noticed a significant increase in the time the reasoner needs\nto classify. Closing down what we know about family relationships takes its toll on\nthe reasoner performance, especially the usage of 'hasParent exactly 2 Person'. At\nthis point we recommend rewriting this axiom to 'hasParent max 2 Person'. It gives\nus most of what we need, but has a little less negative impact on the reasoning\ntime.\n
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ.\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 2067.273 sec (1.09313 % of final), by\nPellet 2.2.0 0.529 sec (0.00428 % of final) and by FaCT++ 1.6.4 is approximately\n0.147 sec (0.004 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-7","title":"Chapter 7","text":""},{"location":"tutorial/fhkb/#data-properties-in-the-fhkb","title":"Data Properties in the FHKB","text":"We now have some individuals with some basic object properties between individuals. OWL 2, however, also has data properties that can relate an object or individual to some item of data. There are data about a Person
, such as years of events and names etc. So, in this Chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial.\n
"},{"location":"tutorial/fhkb/#71-adding-some-data-properties-for-event-years","title":"7.1 Adding Some Data Properties for Event Years","text":"Everyone has a birth year; death year; and some have a marriage year and so on. We can model these simply with data properties and an integer as a filler. OWL 2 has a DateTime datatype, where it is possible to specify a precise time and date down to a second. 7 This proves cumbersome (see http://robertdavidstevens.wordpress.com/2011/05/05/using-the-datetime-data-type-to-describe-birthdays/ for details); all we need is a simple indication of the year in which a person was born. Of course, the integer type has a zero, which the Gregorian calendar for which we use integer as a proxy does not, but integer is sufficient to our needs. Also, there are various ontological treatments of time and information about people (this extends to names etc. as well), but we gloss over that here\u2014that\u2019s another tutorial.
7 http://www.w3.org/TR/2008/WD-owl2-quick-reference-20081202/#Built-in_Datatypes_and_Facets
We can have dates for birth, death and (eventually) marriage (see Chapter 9) and we can just think of these as event years. We can make a little hierarchy of event years as shown in Figure 7.1).
Task 27: Create a data property hierarchyhasEventYear
with range integer and domain Person
; hasBirthYear
and make it a sub-property of hasEventYear
(that way, the domain and range of hasEventYear
are inherited); hasDeathYear
and make it a sub-property of hasEventYear
; Again, asserting birth years for all individuals can be a bit tedious. The reader\ncan find a convenience snapshot of the ontology at this stage at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
We now have an ABox with individuals with fact assertions to data indicating a birth year. We can, if we wish, also add a class restriction to the Person
class saying that each and every instance of the class Person
holds a data property to an integer and that this property is called \u2018hasBirthYear\u2019. As usual when deciding whether to place such a restriction upon a class, ask whether it is true that each and every instance of the class holds that property; this is exactly the same as we did for the object properties in Chapter 4. Everyone does have a birth year, even if it is not known.
Once birth years have been added to our individuals, we can start asking some questions.
Task 28: DL queries 1. Use a DL query to ask:Person
born after 1960;Person
born in the 1960s;Person
born in the 1800s;Person
that has fewer than three children;Person
that has more than three children.The DL query for people born in the 1960s is:
Person and hasBirthYear some int[>= 1960, < 1970]\n
This kind of interval is known as a facet.
"},{"location":"tutorial/fhkb/#711-counting-numbers-of-children","title":"7.1.1 Counting Numbers of Children","text":"The last two queries in the list do not work as expected. We have asked, for instance, for Person
that have more than three children, but we get no members of Person
in the answer, though we know that there are some in the FHKB (e.g., John_Bright_1930
). This is because there is not enough information in the FHKB to tell that this person has more than three different people as children. As humans we can look at the four children of John Bright and know that they are different \u2013 for instance, they all have different birth years. The automated reasoner, however, does not know that a Person
can only have one birth year.
hasBirthYear
functional. Person
that has more than three children again.This time the query should work. All the other event year properties should be made functional, expect hasEventYear
, as one individual can have many event years. As the children have different birth year and an individual can only hold one hasBirthYear
property, then these people must be distinct entities.
Of course, making birth year functional is not a reliable way of ensuring that the automated reasoner knows that the individual are different. It is possible for two Person
to have the same birth year within the same family \u2013 twins and so on. Peter_William_Bright_1941
has three children, two of which are twins, so will not be a member of the class of people with at least three children. So, we use the different individuals axiom. Most tools, including Prot\u00e9g\u00e9, have a feature that allows all individuals to be made different.
From now on, every time you add individuals, make sure the different individuals axiom is updated.
"},{"location":"tutorial/fhkb/#72-the-open-world-assumption","title":"7.2 The Open World Assumption","text":"We have met again the open world assumption and its importance in the FHKB. In the use of the functional characteristic on the hasBirthYear
property, we saw one way of constraining the interpretation of numbers of children. We also introduced the \u2018different individuals\u2019 axiom as a way of making all individuals in a knowledge base distinct. There are more questions, however, for which we need more ways of closing down the openness of OWL 2.
Take the questions:
We can only answer these questions if we locally close the world.We have said that David and Margaret have two children, Richard and Robert, but we have not said that there are not any others. As usual, try not to apply your domain knowledge too much; ask yourself what the automated reasoner actually knows. As we have the open world assumption, the reasoner will assume, unless otherwise said, that there could be more children; it simply doesn\u2019t know.
Think of a railway journey enquiry system. If I ask a standard closed world system about the possible routes by rail, between Manchester and Buenos Aires, the answer will be \u2019none\u2019, as there are none described in the system. With the open world assumption, if there is no information in the system then the answer to the same question will simply be \u2018I don\u2019t know\u2019. We have to explicitly say that there is no railway route from Manchester to Buenos Aires for the right answer to come back.
We have to do the same thing in OWL. We have to say that David and Margaret have only two children. We do this with a type assertion on individuals. So far we have only used fact assertions. A type assertion to close down David Bright\u2019 parentage looks like this:
isParentOf only {Robert_David_Bright_1965,Richard_John_Bright_1962 }\n
This has the same meaning as the closure axioms that you should be familiar with on classes. We are saying that the only fillers that can appear on the right-hand-side of the isParentOf
property on this individual are the two individuals for Richard and Robert. We use the braces to represent the set of these two individuals.
isParentOf exactly 2 Person
.The last query should return the answer of David Bright. Closing down the whole FHKB ABox is a chore and would really have to be done programmatically. OWL scripting languages such as the Ontology Preprocessing Language8 (OPPL) [2] can help here. Also going directly to the OWL API [1]9, if you know what you are doing, is another route.
Adding all these closure type assertions can slow down the reasoner; so think about\nthe needs of your system \u2013 just adding it \u2018because it is right\u2019 is not necessarily the\nright route.\n
8 http://oppl2.sourceforge.net
9 http://owlapi.sourceforge.net/
"},{"location":"tutorial/fhkb/#73-adding-given-and-family-names","title":"7.3 Adding Given and Family Names","text":"We also want to add some other useful data facts to people \u2013 their names. We have been putting names as part of labels on individuals, but data fact assertions make sense to separate out family and given names so that we can ask questions such as \u2018give me all people with the family name Bright and the first given name of either James or William\u2019. A person\u2019s name is a fact about that person and is more, in this case, than just a label of the representation of that person. So, we want family names and given names. A person may have more than one given name \u2013 \u2018Robert David\u2019, for instance \u2013 and an arbitrary number of given names can be held. For the FHKB, we have simply created two data properties of hasFirstGivenName
and hasSecondGivenName
). Ideally, it would be good to have some index on the property to given name position, but OWL has no n-ary relationships. Otherwise, we could reify the hasGivenName
property into a class of objects, such as the following:
Class: GivenName\nSubClassOf:hasValue some String,\nhasPosition some Integer\n
but it is really rather too much trouble for the resulting query potential.
As already shown, we will use data properties relating instances of Person
to strings. We want to distinguish family and given names, and then different positions of given names through simple conflating of position into the property name. Figure 7.1 shows the intended data property hierarchy.
Figure 7.1: The event year and name data property hierarchies in the FHKB.
Do the following:
Task 32: Data propertieshasName
property the domain of Person
and the range of String
; The name data property hierarchy and the queries using those properties displays what now should be familiar. Sub-properties that imply the super-property. So, when we ask hasFirstGivenName
value \"William\"
and then the query hasGivenName value value \"William\"
we can expect different answers. There are people with \u2018William\u2019 as either first or second given name and asking the question with the super-property for given names will collect both first and second given names.
We have used data properties that link objects to data such as string, integer, floats and Booleans etc. OWL uses the XML data types. We have seen a simple use of data properties to simulate birth years. The full FHKB also uses them to place names (given and family) on individuals as strings. This means one can ask for the Person
with the given name \"James\", of which there are many in the FHKB.
Most importantly we have re-visited the open world assumption and its implications for querying an OWL ABox. We have looked at ways in which the ABox can be closed down \u2013 unreliably via the functional characteristic (in this particular case) and more generally via type assertions.
All the DL queries used in this chapter can also serve as defined classes in the TBox. It is a useful exercise to progressively add more defined classes to the FHKB TBox. Make more complex queries, make them into defined classes and inspect where they appear in the class hierarchy.
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 1891.157 sec (1.00000 % of final), by\nPellet 2.2.0 1.134 sec (0.00917 % of final) and by FaCT++ 1.6.4 is approximately\n0.201 sec (0.006 % of final). 0 sec indicates failure or timeout.\n
Note that we now cover the whole range of expressivity of OWL 2. HermiT at\nleast is impossibly slow by now. This may be because HermiT does more work\nthan the others. For now, we recommend to use either Pellet or FaCT++.\n
"},{"location":"tutorial/fhkb/#chapter-8","title":"Chapter 8","text":""},{"location":"tutorial/fhkb/#cousins-in-the-fhkb","title":"Cousins in the FHKB","text":"In this Chapter you will
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
Be warned; from here on the reasoner can start running slowly! Please see warning\nat the beginning of the last chapter for more information.\n
"},{"location":"tutorial/fhkb/#81-introducing-cousins","title":"8.1 Introducing Cousins","text":"Cousins can be confusing, but here is a brief summary:
Simply, my first cousins are my parent\u2019s sibling\u2019s children. As usual, we can think about the objects and put in place some sub-property chains.
"},{"location":"tutorial/fhkb/#82-first-cousins","title":"8.2 First Cousins","text":"Figure 8.1: Tracing out the sub-property chain for cousins going from a child to a parent, to its sibling, and down to its child, a cousin
Figure 8.1 shows the sub-property chain for first cousins. As usual, think at the object level; to get to the first cousins of Robert David Bright, we go to the parents of Robert David Bright, to their siblings and then to their children. We go up, along and down. The OWL for this could be:
ObjectProperty: hasFirstCousin\nSubPropertyOf: hasCousin\nSubPropertyChain: hasParent o hasSibling o hasChild\nCharacteristics: Symmetric\n
Note that we follow the definitions in Section 8.1 of first cousins sharing a grandparent, but not a parent. The sub-property chain goes up to children of a grandparent (a given person\u2019s parents), along to siblings and down to their children. We do not want this property to be transitive. One\u2019s cousins are not necessarily my cousins. The blood uncles of Robert David Bright have children that are his cousins. These first cousins, however, also have a mother that is not a blood relation of Robert David Bright and the mother\u2019s sibling\u2019s children are not cousins of Robert David Bright.
We do, however, want the property to be symmetric. One\u2019s cousins have one\u2019s-self as a cousin.
We need to place the cousin properties in the growing object property hierarchy. Cousins are obviously blood relations, but not ancestors, so they go off to one side, underneath hasBloodrelation
. We should group the different removes and degree of cousin underneath one hasCousin
property and this we will do.
Do the following:
Task 33: First cousinshasCousin
to the hierarchy underneath hasBloodrelation
;hasFirstCousin
underneath this property;You should see the following people as first cousins of Robert David Bright: Mark Anthony Heath, Nicholas Charles Heath, Mark Bright, Ian Bright, Janet Bright, William Bright, James Bright, Julie Bright, Clare Bright, Richard John Bright and Robert David Bright. The last two, as should be expected, are first cousins of Robert David Bright and this is not correct. As David Bright will be his own brother, his children are his own nieces and nephews and thus the cousins of his own children. Our inability to infer siblings correctly in the FHKB haunts us still and will continue to do so.
Although the last query for the cousins of Robert David Bright should return the\nsame results for every reasoner, we have had experiences where the results differ.\n
"},{"location":"tutorial/fhkb/#83-other-degrees-and-removes-of-cousin","title":"8.3 Other Degrees and Removes of Cousin","text":"Other degrees of cousins follow the same pattern as for first cousins; we go up, along and down. For second cousins we go up from a given individual to children of a great grandparent, along to their siblings and down to their grandchildren. The following object property declaration is for second cousins (note it uses the isGrandparentOf
and its inverse properties, though the parent properties could be used) :
ObjectProperty: hasSecondCousin\nSubPropertyOf: hasCousin\nSubPropertyChain: hasGrandParent o hasSibling o isGrandParentOf\nCharacteristics: Symmetric\n
\u2018 Removes \u2019 simply add in another \u2018leg\u2019 of either \u2018up\u2019 or \u2018down\u2019 either side of the \u2018along\u2019\u2014that is, think of the actual individuals involved and draw a little picture of blobs and lines\u2014then trace your finger up, along and down to work out the sub-property chain. The following object property declaration does it for first cousins once removed (note that this has been done by putting this extra \u2018leg\u2019 on to the hasFirstCousin
property; the symmetry of the property makes it work either way around so that a given person is the first cousin once removed of his/her first cousins once removed):
ObjectProperty: hasFirstCousinOnceRemoved\nSubPropertyOf: hasCousin\nSubPropertyChain: hasFirstCousin o hasChild\nCharacteristics: Symmetric\n
To exercise the cousin properties do the following:
Task 34: Cousin propertiesYou should see that we see some peculiar inferences about Robert David Bright\u2019 cousins \u2013 not only are his brother and himself his own cousins, but so are his father, mother, uncles and so on. This makes sense if we look at the general sibling problem, but also it helps to just trace the paths around. If we go up from one of Robert David Bright\u2019 true first cousins to a grandparent and down one parent relationship, we follow the first cousin once removed path and get to one of Robert David Bright\u2019 parents or uncles. This is not to be expected and we need a tighter definition that goes beyond sub-property chains so that we can exclude some implications from the FHKB.
"},{"location":"tutorial/fhkb/#84-doing-first-cousins-properly","title":"8.4 Doing First Cousins Properly","text":"As far as inferring first cousin facts for Robert David Bright, we have failed. More precisely, we have recalled all Robert David Bright\u2019s cousins, but the precision is not what we would desire. What we can do is ask for Robert David Bright\u2019 cousins, but then remove the children of Robert David Bright\u2019 parents. The following DL query achieves this:
Person that hasFirstCousin valueRobert_David_Bright_1965\nand (not (hasFather valueDavid_Bright_1934) or not (hasMother valueMar-\ngaret_Grace_Rever_1934)\n
This works, but only for a named individual. We could make a defined class for this query; we could also make a defined class FirstCousin
, but it is not of much utility. We would have to make sure that people whose parents are not known to have siblings with children are excluded. That is, people are not \u2018first cousins\u2019 whose only first cousins are themselves and their siblings. The following class does this:
Class: FirstCousin\nEquivalentTo: Person that hasFirstCousin some Person\n
Task 35: Roberts first cousins FirstCousin
as shown above;FirstCousinOfRobert
;Robert_David_Bright_1965
first cousins and takes away the children of Robert_David_Bright_1965
\u2019 parents as shown above.This gives some practice with negation. One is making a class and then \u2018taking\u2019 some of it away \u2013 \u2018these, but not those\u2019.
"},{"location":"tutorial/fhkb/#85-summary","title":"8.5 Summary","text":"We have now expanded the FHKB to include most blood relationships. We have also found that cousins are hard to capture just using object properties and sub-property chains. Our broken sibling inferences mean that we have too many cousins inferred at the instance level. We can get cousins right at the class level by using our inference based cousins, then excluding some using negation. Perhaps not neat, but it works.
We have reinforced that we can just add more and more relationships to individuals by just adding more properties to our FHKB object property hierarchy and adding more sub-property chains that use the object properties we have built up upon parentage and sibling properties; this is as it should be.
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet\n2.2.0 111.395 sec (0.90085 % of final) and by FaCT++ 1.6.4 is approximately 0.868\nsec (0.024 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-9","title":"Chapter 9","text":""},{"location":"tutorial/fhkb/#marriage-in-the-fhkb","title":"Marriage in the FHKB","text":"In this chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
Much of what is in this chapter is really revision; it is more of the same - making\nlots of properties and using lots of sub-property chains. However, it is worth it as\nit will test your growing skills and it also makes the reasoners and yourself work\nhard. There are also some good questions to ask of the FHKB as a result of adding\nmarriages.\n
"},{"location":"tutorial/fhkb/#91-marriage","title":"9.1 Marriage","text":"Marriage is a culturally complex situation to model. The FHKB started with a conservative model of a marriage involving only one man and one woman.10 Later versions are more permissive; a marriage simply has a minimum of two partners. This leaves it open to numbers and sex of the people involved. In fact, \u2018marriage\u2019 is probably not the right name for it. Using BreedingRelationship
as a label (the one favoured by the main author\u2019s mother) may be a little too stark and might be a little exclusive.... In any case, some more generic name is probably better and various subclasses of the FHKB\u2019s Marriage
class are probably necessary.
10 There being no funny stuff in the Stevens family.
To model marriage do the following:
Task 36: MarriageMarriage
, subclass of DomainEntity
;hasPartner
(domain Marriage
and range Person
) and isPartnerIn
hasFemalePartner
(domain Marriage
and range Woman
, sub-property of hasPartner
) and its inverse isFemalePartnerIn
;hasPartner
has MalePartner
(domain Marriage
and range Man
)and its inverse isMalePartnerIn
;hasMarriageYear
, making us a sub-property of hasEventYear
,make it functional;m001
with the label Marriage of David and Margaret
and add the facts: hasMalePartner David_Bright_1934
;hasFemalePartner Margaret_Grace_Rever_1934
hasMarriageYear 1958
;m002
with the label Marriage of John and Joyce
and add the facts:hasMalePartner
John_Bright_1930
;hasFemalePartner
Joyce_Gosport
(you may have to add Joyce if you did not already did that);hasMarriageYear
1955
;m003
with the label Marriage of Peter and Diana
and add the facts: hasMalePartner
Peter_William_Bright_1941
;hasFemalePartner
Diana_Pool
(you may have to add Diana if you did not already did that);hasMarriageYear
1964
;We have the basic infrastructure for marriages. We can ask the usual kinds of questions; try the following:
Task 37: DL queriesDL query: Marriage and hasMarriageYear some int[<= 1960]\n
"},{"location":"tutorial/fhkb/#911-spouses","title":"9.1.1 Spouses","text":"This marriage infrastructure can be used to infer some slightly more interesting things for actual people. While we want marriage objects so that we can talk about marriage years and even locations, should we want to, we also want to be able to have the straight-forward spouse relationships one would expect. We can use sub-property chains in the usual manner; do the following:
Task 38: Wifes and HusbandshasSpouse
with two sub-properties hasHusband
and hasWife
. isSpouseOf
, isWifeOf
and isHusbandOf
. hasWife
property, add the sub-property chain isMalePartnerIn o hasFemalePartner
. hasHusband
property. Figure 9.1 shows what is happening with the sub-property chains. Note that the domains and ranges of the spouse properties come from the elements of the sub-property chains. Note also that the hasSpouse
relationship will be implied from its sub-property chains.
The following questions can now be asked:
Figure 9.1: The sub-property chain path used to infer the spouse relationships via the marriage partnerships.
and many more. This is really a chance to explore your querying abilities and make some complex nested queries that involve going up and down the hierarchy and tracing routes through the graph of relationships between the individuals you\u2019ve inferred.
"},{"location":"tutorial/fhkb/#92-in-laws","title":"9.2 In-Laws","text":"Now we have spouses, we can also have in-laws. The path is simple: isSpouseOf o hasMother
implies hasMotherInLaw
. The path involved in mother-in-laws can be seen in Figure 9.2. The following OWL code establishes the sub-property chains for hasMotherInLaw
:
ObjectProperty: hasMotherInLaw\nSubPropertyOf: hasParentInLaw\nSubPropertyChain: isSpouseOf o hasMother\nDomain: Person\nRange: Woman\nInverseOf: isMotherInLawOf\n
Figure 9.2: Tracing out the path between objects to make the sub-property chain for mother-in-laws
Do the following to make the parent in-law properties:
Task 39: Parents in-lawhasParentInLaw
with two sub-properties of hasMotherInLaw
and hasFatherInLaw
; hasMotherInLaw
above; Brothers and sisters in law have the interesting addition of having more than one path between objects to establish a sister or brother in law relationship. The OWL code below establishes the relationships for \u2018is sister in law of\u2019:
ObjectProperty: hasSisterInLaw\nSubPropertyOf: hasSiblingInLaw\nSubPropertyChain: hasSpouse o hasSister\nSubPropertyChain: hasSibling o isWifeOf\n
A wife\u2019s husband\u2019s sister is a sister in law of the wife. Figure 9.3 shows the two routes to being a sister-in-law. In addition, the wife is a sister in law of the husband\u2019s siblings. One can add as many sub-property chains to a property as one needs. You should add the properties for hasSiblingInLawOf
and its obvious sub-properties following the inverse of the pattern above.
By now, chances are high that the realisation takes a long time. We recommend to\nremove the very computationally expensive restriction `hasParent` exactly 2 Person\non the `Person` class, if you have not done it so far.\n
Figure 9.3: The two routes to being a sister-in-law.
"},{"location":"tutorial/fhkb/#94-aunts-and-uncles-in-law","title":"9.4 Aunts and Uncles in-Law","text":"The uncle of Robert David Bright has a wife, but she is not the aunt of Robert David Bright, she is the aunt-in-law. This is another kith relationship, not a kin relationship. The pattern has a familiar feel:
ObjectProperty: isAuntInLawOf\nSubPropertyOf: isInLawOf\nSubPropertyChain: isWifeOf o isBrotherOf o isParentOf\n
Task 41: Uncles and aunts in-law hasAuntInLaw
and hasUncleInLaw
in the usual way; hasRelation
and two sub-properties of isBloodRelationOf
and isInLawOf
to establish the kith and kin relationships respectively; isInLawOf
.Figure 9.4: The object property hierarchy after adding the various in-law properties.
"},{"location":"tutorial/fhkb/#95-summary","title":"9.5 Summary","text":"This has really been a revision chapter; nothing new has really been introduced. We have added a lot of new object properties and one new data property. The latest object property hierarchy with the \u2018in-law\u2019 branch can be seen in Figure 9.4. Highlights have been:
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet\n2.2.0 123.655 sec (1.00000 % of final) and by FaCT++ 1.6.4 is approximately 1.618\nsec (0.046 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-10","title":"Chapter 10","text":""},{"location":"tutorial/fhkb/#extending-the-tbox","title":"Extending the TBox","text":"In this chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available\nat http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial\n
"},{"location":"tutorial/fhkb/#101-adding-defined-classes","title":"10.1 Adding Defined Classes","text":"Add the following defined classes:
Task 42: Adding defined classes. The three classes of Child
, Son
and Daughter
are of note. They are coded in the following way:
Class: Child EquivalentTo: Person that hasParent some Person\nClass: Son EquivalentTo: Man that hasParent some Person\nClass: Daughter EquivalentTo: Woman that hasParent some Person\n
After running the reasoner, you will find that Person
is found to be equivalent to Child
; Daughter
is equivalent to Woman
and that Son
is equivalent to Man
. This does, of course, make sense \u2013 each and every person is someone\u2019s child, each and every woman is someone\u2019s daughter. We will forget evolutionary time-scales where this might be thought to break down at some point \u2013 all Person
individuals are also Descendant
individuals, but do we expect some molecule in some prebiotic soup to be a member of this class?
Nevertheless, within the scope of the FHKB, such inferred equivalences are not unreasonable. They are also instructive; it is possible to have different intensional descriptions of a class and for them to have the same logical extents. You can see another example of this happening in the amino acids ontology, but for different reasons.
Taking Grandparent
as an example class, there are two ways of writing the defined class:
Class: Grandparent EquivalentTo: Person and isGrandparentOf some Person\nClass: Grandparent EquivalentTo: Person and (isParentOf some (Person and (isParentOf some Person)))\n
Each comes out at a different place in the class hierarchy. They both capture the right individuals as members (that is, those individuals in the ABox that are holding an isGrandparentOf
property), but the class hierarchy is not correct. By definition, all grandparents are also parents, but the way the object property hierarchy works means that the first way of writing the defined class (with the isGrandparentOf
property) is not subsumed by the class Parent
. We want this to happen in any sensible class hierarchy, so we have to use the second pattern for all the classes, spelling out the sub-property path that implies the property such as isGrandparentOf
within the equivalence axiom.
The reason for this need for the \u2018long-form\u2019 is that the isGrandparentOf
does not imply the isParentOf
property. As described in Chapter 3 if this implication were the case, being a grandparent of Robert David Bright, for instance, would also imply that the same Person
were a parent of Robert David Bright; an implication we do not want. Because these two properties (isParentOf
and isGrandparentOf
) do not subsume each other, the defined classes written according to pattern one above will not subsume each other in the class hierarchy. Thus we use the second pattern. If we look at the class for grandparents of Robert:
Class: GrandparentOfRobert\nEquivalentTo: Person that isParentOf some (Person that isParentOf value Robert\nDavid Bright)\n
If we make the equivalent class for Richard John Bright, apply the reasoner and look at the hierarchy, we see that the two classes are not logically equivalent, even though they have the same extents of William George Bright, Iris Ellen Archer, Charles Herbert Rever and Violet Sylvia Steward. We looked at this example in Section 6.2, where there is an explanation and solutions.
"},{"location":"tutorial/fhkb/#102-summary","title":"10.2 Summary","text":"We can add defined classes based on each property we have put into the object property hierarchy. We see the expected hierarchy; as can be seen from Figure 10.1 it has an obvious symmetry based on sex. We also see a lot of equivalences inferred \u2013 all women are daughters, as well as women descendants. Perhaps not the greatest insight ever gained, but it at least makes sense; all women must be daughters. It is instructive to use the explanation feature in Prot\u00e9g\u00e9 to look at why the reasoner has made these inferences. For example, take a look at the class hasGrandmother some Woman
\u2013 it is instructive to see how many there are.
Like the chapter on marriage and in-laws (Chapter 9), this chapter has largely been revision. One thing of note, however, is that we must not use the object properties that are inferred through sub-property chains as definitions in the TBox; we must spell out the sub-property chain in the definition, otherwise the implications do not work properly.
One thing is almost certain; the resulting TBox is rather complex and would be almost impossible to maintain by hand.
Figure 10.1: The full TBox hierarchy of the FHKB
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).\n
The time to reason with the FHKB at this point (in Prot\u00e9g\u00e9) on a typical desktop\nmachine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet\n2.2.0 0.000 sec (0.00000 % of final) and by FaCT++ 1.6.4 is approximately 35.438\nsec (1.000 % of final). 0 sec indicates failure or timeout.\n
"},{"location":"tutorial/fhkb/#chapter-11","title":"Chapter 11","text":""},{"location":"tutorial/fhkb/#final-remarks","title":"Final remarks","text":"If you have done all the tasks within this tutorial, then you will have touched most parts of OWL 2. Unusually for most uses of OWL we have concentrated on individuals, rather than just on the TBox. One note of warning \u2013 the full FHKB has some 450 members of the Bright family and takes a reasonably long time to classify, even on a sensible machine. The FHKB is not scalable in its current form.
One reason for this is that we have deliberately maximised inference. We have attempted not to explicitly type the individuals, but drive that through domain and range constraints. We are making the property hierarchy do lots of work. For the individual Robert David Bright, we only have a couple of assertions, but we infer some 1 500 facts between Robert David Bright and other named individuals in the FHKB\u2013displaying this in Prot\u00e9g\u00e9 causes problems. We have various complex classes in the TBox and so on.
We probably do not wish to drive a genealogical application using an FHKB in this form. Its purpose is educational. It touches most of OWL 2 and shows a lot of what it can do, but also a considerable amount of what it cannot do. As inference is maximised, the FHKB breaks most of the OWL 2 reasoners at the time of writing. However, it serves its role to teach about OWL 2.
OWL 2 on its own, used in this style, really does not work for family history. We have seen that siblings and cousins cause problems. Rules in various forms can do this kind of thing easily\u2014it is one of the primary examples for learning about Prolog. Nevertheless, the FHKB does show how much inference between named individuals can be driven from a few fact assertions and a property hierarchy. Assuming a powerful enough reasoner and the ability to deal with many individuals, it would be possible to make a family history application using the FHKB; as long as one hid the long and sometimes complex queries and manipulations that would be necessary to \u2018prune\u2019 some of the \u2018extra\u2019 facts found about individuals. However, the FHKB does usefully show the power of OWL 2, touch a great deal of the language and demonstrate some of its limitations.
"},{"location":"tutorial/fhkb/#appendix-a","title":"Appendix A","text":""},{"location":"tutorial/fhkb/#fhkb-family-data","title":"FHKB Family Data","text":"Table A.1: The list of individuals in the FHKB
Person First given name Second given name Family name Birth year Mother Father Alec John Archer 1927 Alec John Archer 1927 Violet Heath 1887 James Alexander Archer 1882 Charles Herbert Rever 1895 Charles Herbert Rever 1895 Elizabeth Frances Jessop 1869 William Rever 1870 Charlotte Caroline Jane Bright 1894 Charlotte Caroline Jane Bright 1894 Charlotte Hewett 1863 Henry Edmund Bright 1862 Charlotte Hewett 1863 Charlotte none Hewett 1863 not specified not specified Clare Bright 1966 Clare none Bright 1966 Diana Pool Peter William Bright 1941 Diana Pool Diana none Pool none not specified not specified David Bright 1934 David none Bright 1934 Iris Ellen Archer 1906 William George Bright 1901 Dereck Heath Dereck none Heath 1927 not specified not specified Eileen Mary Rever 1929 Eileen Mary Rever 1929 Violet Sylvia Steward 1894 Charles Herbert Rever 1895 Elizabeth Frances Jessop 1869 Elizabeth Frances Jessop 1869 not specified not specified Ethel Archer 1912 Ethel none Archer 1912 Violet Heath 1887 James Alexander Archer 1882 Frederick Herbert Bright 1889 Frederick Herbert Bright 1889 Charlotte Hewett 1863 Henry Edmund Bright 1862 Henry Edmund Bright 1862 Henry Edmund Bright 1862 not specified not specified Henry Edmund Bright 1887 Henry Edmund Bright 1887 Charlotte Hewett 1863 Henry Edmund Bright 1862 Ian Bright 1959 Ian none Bright 1959 Joyce Gosport John Bright 1930 Iris Ellen Archer 1906 Iris Ellen Archer 1906 Violet Heath 1887 James Alexander Archer 1882 James Alexander Archer 1882 James Alexander Archer 1882 not specified not specified James Bright 1964 James none Bright 1964 Diana Pool Peter William Bright 1941 James Frank Hayden Bright 1891 James Frank Bright 1891 Charlotte Hewett 1863 Henry Edmund Bright 1862 Janet Bright 1964 Janet none Bright 1964 Joyce Gosport John Bright 1930 John Bright 1930 John none Bright 1930 Iris Ellen Archer 1906 William George Bright 1901 John Tacey Steward 1873 John Tacey Steward 1873 not specified not specified Joyce Archer 1921 Joyce none Archer 1921 Violet Heath 1887 James Alexander Archer 1882 Joyce Gosport Joyce none Gosport not specified not specified not specified Julie Bright 1966 Julie none Bright 1966 Diana Pool Peter William Bright 1941 Kathleen Minnie Bright 1904 Kathleen Minnie Bright 1904 Charlotte Hewett 1863 Henry Edmund Bright 1862 Leonard John Bright 1890 Leonard John Bright 1890 Charlotte Hewett 1863 Henry Edmund Bright 1862 Lois Green 1871 Lois none Green 1871 not specified not specified Margaret Grace Rever 1934 Margaret Grace Rever 1934 Violet Sylvia Steward 1894 Charles Herbert Rever 1895 Mark Anthony Heath 1960 Mark Anthony Heath 1960 Eileen Mary Rever 1929 Dereck Heath Mark Bright 1956 Mark none Bright 1956 Joyce Gosport John Bright 1930 Nicholas Charles Heath 1964 Nicholas Charles Heath 1964 Eileen Mary Rever 1929 Dereck Heath Nora Ada Bright 1899 Nora Ada Bright 1899 Charlotte Hewett 1863 Henry Edmund Bright 1862 Norman James Archer 1909 Norman James Archer 1909 Violet Heath 1887 James Alexander Archer 1882 Peter William Bright 1941 Peter William Bright 1941 Iris Ellen Archer 1906 William George Bright 1901 Richard John Bright 1962 Richard John Bright 1962 Margaret Grace Rever 1934 David Bright 1934 Robert David Bright 1965 Robert David Bright 1965 Margaret Grace Rever 1934 David Bright 1934 Violet Heath 1887 Violet none Heath 1887 not specified not specified Violet Sylvia Steward 1894 Violet Sylvia Steward 1894 Lois Green 1871 John Tacey Steward 1873 William Bright 1970 William none Bright 1970 Joyce Gosport John Bright 1930 
William George Bright 1901 William George Bright 1901 Charlotte Hewett 1863 Henry Edmund Bright 1862 William Rever 1870 William none Rever 1870 not specified not specified"},{"location":"tutorial/fhkb/#bibliography","title":"Bibliography","text":"[1] M. Horridge and S. Bechhofer. The owl api: a java api for working with owl 2 ontologies. Proc. of OWL Experiences and Directions , 2009, 2009.
[2] Luigi Iannone, Alan Rector, and Robert Stevens. Embedding knowledge patterns into owl. In European Semantic Web Conference (ESWC09) , pages 218\u2013232, 2009.
[3] Dmitry Tsarkov, Uli Sattler, Margaret Stevens, and Robert Stevens. A Solution for the Man-Man Problem in the Family History Knowledge Base. In Sixth International Workshop on OWL: Experiences and Directions 2009 , 2009.
"},{"location":"tutorial/github-fundamentals/","title":"GitHub Fundamentals for OBO Engineers","text":""},{"location":"tutorial/github-fundamentals/#introduction-to-github","title":"Introduction to GitHub","text":""},{"location":"tutorial/github-fundamentals/#back-to-getting-started","title":"Back to Getting Started","text":""},{"location":"tutorial/github-fundamentals/#back-to-main-repo","title":"Back to Main Repo","text":""},{"location":"tutorial/github-fundamentals/#overview","title":"Overview:","text":"GitHub is increasingly used by software developers, programmers and project managers for uploading and sharing content, as well as basic project management. You build a profile, upload projects to share and connect with other users by \"following\" their accounts. Many users store programs and code projects, but you can also upload text documents or other file types in your project folders to share publicly (or privately). It is capable of storing any file type from text, to structured data, to software. And more features are being added by the day. The real power of Git, however, is less about individuals publishing content (many places can do that, including google docs etc). It is more about that content being easily shared, built upon, and credited in a way that is robust to the realities of distributed collaboration. You don't have to know how to code or use the command line. It is a powerful way to organize projects with multiple participants.
"},{"location":"tutorial/github-fundamentals/#organization","title":"Organization","text":"Git supports the following types of primary entities:
The relationships between any combination of these entities are many-to-many, with the nuanced exception of repositories. For our purposes today we will oversimplify by saying that a repository belongs either to a single organization or to a single individual.
"},{"location":"tutorial/github-fundamentals/#markdown","title":"Markdown","text":"Content in GitHub is written using Markdown, a text-to-HTML conversion tool for web writers (ref).
For more help with Markdown, see this GitHub guide.
Raw markup syntax As renderedHeader - use # for H1, ## for H2, etc.
# Header, ## Header (note, the header is not displaying properly in this table) Emphasis, aka italics, with *asterisks* or _underscores_.
Emphasis, aka italics, with asterisks or underscores. Strong emphasis, aka bold, with **asterisks** or __underscores__.
Strong emphasis, aka bold, with asterisks or underscores. Combined emphasis with **asterisks and _underscores_**.
Combined emphasis with asterisks and underscores. Strikethrough uses two tildes. ~~Scratch this.~~
Strikethrough uses two tildes. ~~Scratch this.~~ Lists: To introduce line breaks in markdown, add two spaces For a bulleted list, use * or - (followed by a space)
Here is an example of a list: One Two Three
Here is an example of a bulleted list:
GitHub can store any kind of content, provided it isn't too big. (And now even this is possible). However, it is more capable for some filetypes than it is for others. Certain filetypes can be viewed 'natively' within the GitHub interface. These are:
Adopted from CD2H MTIP tutorial
"},{"location":"tutorial/github-issues/","title":"GitHub Issue for OBO Engineers","text":""},{"location":"tutorial/github-issues/#intro-to-managing-and-tracking-issues-in-github","title":"Intro to managing and tracking issues in GitHub","text":""},{"location":"tutorial/github-issues/#overview","title":"Overview","text":"Back to top
Why: \"Issues are a great way to keep track of tasks, enhancements, and bugs for your projects or for anyone else's. As long as you are a registered GitHub user you can log an issue, or comment on an issue for any open repo on GitHub. Issues are a bit like email\u2014except they can be shared, intelligently organized, and discussed with the rest of your team. GitHub\u2019s tracker is called Issues, and has its own section in every repository.\" (From: https://guides.github.com/features/issues/)
How:
How to create an issue in GitHub:
- [ ]
markdown syntax before each bullet. Note, you can also add sub-tasks by clicking the 'add a task list' button in the tool bar. The status of the tasks in an issue (eg. https://github.com/nicolevasilevsky/c-path-practice/issues/1 will then be reflected in any summary view. Eg. https://github.com/nicolevasilevsky/c-path-practice/issues.Your turn:
Follow the instructions above to create a ticket about a hypothetical issue (such as an improvement to this tutorial) that includes a sub-task list.
"},{"location":"tutorial/github-issues/#assign-issues","title":"Assign issues","text":"Back to top
Assign issues to people
Add labels
New Labels
Your turn:
On the ticket you previously created:
Back to top
Comment on issues
Close issues
Use direct @ mentions
Link documents
You can link documents and files by:
Cross reference to another ticket
Before saving your changes, you can preview the comment to ensure the correct formatting.
Your turn:
Back to top
Milestones
Your turn
Create a new milestone, and add the milestone to an existing ticket.
Projects
To create project:
Your turn
Create a new project and add columns and add cards to the columns.
"},{"location":"tutorial/github-issues/#query-issues","title":"Query issues","text":"Back to top
Once you start using GitHub for lots of things it is easy to get overwhelmed by the number of issues. The query dashboard https://github.com/issues allows you to filter on tickets.
More complex queries are also possible.
Note, you must be signed in to GitHub to view the above links.
Further reading on Issue querys
"},{"location":"tutorial/github-issues/#nofifications","title":"Nofifications","text":"Back to top
Adopted from CD2H MTIP tutorial
"},{"location":"tutorial/intro-cli-1/","title":"Tutorial: Very (!) short introduction to the command line for ontology curators and semantic engineers: Part 1","text":"As a modern ontology curator, you are an engineer - you are curating computable knowledge, testing the integrity of your curation using quality control testing, and are responsible for critical components of modern knowledge systems that directly affect user experience - the ontologies.
Scientific computing is a big, scary world comprising many different tools, methodologies, training resources and philosophies, but nearly all modern workflows share one key aspect: the ability to execute commands that help you find and manipulate data with the command line. Some examples of that include:
sh run.sh make prepare_release
git
and committing changescurl
or wget
Here we are doing a basic hands on tutorial which will walk you through the must-know commands. For a more comprehensives introduction into thinking about automation please see our lesson on Automating Ontology Development Workflows: Make, Shell and Automation Thinking
The tutorial uses example tailored for users of UNIX systems, like Mac and Linux. Users of Windows generally have analogous steps - wherever we talk about an sh
file in the following there exists a corresponding bat
file that can be run in the windows powershell, or CMD.
You have:
Intro to Command Lind Interface Part 1
"},{"location":"tutorial/intro-cli-1/#tutorial","title":"Tutorial","text":"We are not going to discuss here in any detail what the command line is. We will focus on what you can do with it: for more information skip to the further reading section.
The basic idea behind the command line is that you run a command to achieve a goal. Among the most common goals relevant to you as a semantic engineer will be:
Most commands result in some kind of printed statement. Lets try one. Open your terminal (a terminal is the program you use to enter commands. For a nice overview of how shell, terminal, command line and console relate, see here). On Mac, you can type CMD+Space to search for programs and then type \"terminal\". For this tutorial we use the default Terminal.app, but there are many others, including iterm2. For this introduction, it does not matter which terminal you use. When first opening the terminal you will see something like this:
or
Note that your terminal window may look slightly different, depending on your configuration. More on that later.
Let's type our first command and hit enter:
whoami\n
On my machine I get
(base) matentzn@mbp.local:~ $ whoami\nmatentzn\n
This does not seem like a useful command, but sometimes, we forget who we are, and it is good to be reminded. So, what happened here? We ran a command, named whoami
and our command line executed that command which is implemented somewhere on our machine as a program. That program simply determined who I am in some way, and then printed the result again.
Ok so, lets lets look a bit closer at the command prompt itself:
matentzn@mbp.local:~ $\n
Two interesting things to not here for today:
~
. This universally (on all Unix systems) refers to your user directory on your computer. In this case here, it tells you that in your terminal, you are \"in your user directory\".$
sign. It simply denotes where your command line starts (everything before the $ is information provided to you, everything will be about your commands). Make sure that you do not accidentally copy based the $
sign from the examples on the web into your command prompt:(base) matentzn@mbp.local:~ $ $ whoami\n-bash: $: command not found\n(base) matentzn@mbp.local:~ $\n
whoami
did not do anything.
Ok, based on the ~
we know that we are \"in the user home directory\". Let as become a bit more confident about that and ask the command prompt where we are:
matentzn@mbp.local:~ $ pwd\n/Users/matentzn\n
The pwd
command prints out the full path of our current location in the terminal. As you can see, the default location when opening the command prompt is, indeed, the home director, located in /Users/matentzn
. We will use it later again.
A word about paths. /Users/matentzn
is what we call a path. On UNIX systems, /
separates one directory from another. So matentzn
is a directory inside of the Users
directory.
Let us now take a look what our current directory contains (type ls
and hit enter):
matentzn@mbp.local:~ $ ls\nApplications Library ...\n
This command will simply list all of the files in your directory as a big list. We can do this a bit nicer:
matentzn@mbp.local:~ $ ls -l\ntotal 80000\ndrwx------@ 4 matentzn staff 128 31 Jul 2020 Applications\ndrwx------@ 26 matentzn staff 832 12 Sep 2021 Desktop\n
-l
is a short command line option which allows you specify that you would like print the results in a different format (a long list). We will not go into any detail here what this means but a few things to not in the output: You can see some pieces of information that are interesting, like when the file or directory was last modified (i.e. 31. July 2020), who modified it (me) and, of course, the name e.g. Applications
.
Before we move on to the next section, let us clear
the current terminal from all the command outputs we ran:
clear\n
Your command prompt should now be empty again.
"},{"location":"tutorial/intro-cli-1/#working-with-files-and-directories","title":"Working with files and directories","text":"In the previous section we learned how to figure out who we are (whoami
), where we are (pwd
) and how to see what is inside the current directory (ls -l
) and how to clear all the output (clear
).
Let us know look at how we can programmatically create a new directory and change the location in our terminal.
First let us create a new directory:
mkdir tutorial-my\n
Now if we list the contents of our current directory again (ls -l
), we will see our newly created directory listed! Unfortunately, we just realised that we chose the wrong name for our directory! It should have been my-tutorial
instead of tutorial-my
! So, let us rename it. In the command prompt, rather than \"renaming\" files and directories, we \"move\" them (mv
).
mv tutorial-my my-tutorial\n
Now, lets enter our newly created directory using the _c_hange _d_irectory command (cd
), and create another sub-directory in my-tutorial
, called \"data\" (mkdir data
):
cd my-tutorial\nmkdir data\n
You can check again with ls -l
. If you see the data directory listed, we are all set! Feel free to run clear
again to get rid of all the current output on the command prompt.
Let us also enter this directory now: cd data
.
If we want to leave the directory again, feel free to do that like this:
cd ..\n
The two dots (..
) mean: \"parent directory.\" This is very important to remember during your command line adventures: ..
stands for \"parent directory\", and .
stands for \"current/this directory\" (see more on that below).
Now, let's get into something more advanced: downloading files.
"},{"location":"tutorial/intro-cli-1/#downloading-and-searching-files","title":"Downloading and searching files","text":"Our first task is to download the famous Human Phenotype Ontology Gene to Phenotype Annotations (aka HPOA). As you should already now, whenever we download ontologies, or ontology related files, we should always use a persistent URL, if available! This is the one for HPOA: http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt
.
There are two very popular commands for downloading content: curl
and wget
. I think most of my colleagues prefer curl
, but I like wget
because it simpler for beginners. So I will use it here. Lets us try downloading the file!
wget http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt -O genes_to_phenotype.txt\n
The -O
parameter is optional and specifies a filename. If you do not add the parameter, wget
will try to guess the filename from the URL. This does not always go so well with complex URLs, so I personally recommend basically always specifying the -O
parameter.
You can also use the curl equivalent of the wget command;
curl -L http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt --output genes_to_phenotype.txt\n
Try before reading on: Exercises!
genes_to_phenotype.txt
to the data directory you previously created.data
directory.Do not move on to the next step unless your data directory looks similar to this:
matentzn@mbp.local:~/my-tutorial/data $ pwd\n/Users/matentzn/my-tutorial/data\nmatentzn@mbp.local:~/my-tutorial/data $ ls -l\ntotal 53968\n-rw-r--r-- 1 matentzn staff 19788987 11 Jun 19:09 genes_to_phenotype.txt\n-rw-r--r-- 1 matentzn staff 7836327 27 Jun 22:50 hp.obo\n
Ok, let us look at the first 10 lines of genes_to_phenotype.txt using the head
command:
head genes_to_phenotype.txt\n
head
is a great command to familiarise yourself with a file. You can use a parameter to print more or less lines:
head -3 genes_to_phenotype.txt\n
This will print the first 3 lines of the genes_to_phenotype.txt file. There is another analogous command that allows us to look at the last lines off a file:
tail genes_to_phenotype.txt\n
head
, tail
. Easy to remember.
Next, we will learn the most important of all standard commands on the command line: grep
. grep
stands for \"Global regular expression print\" and allows us to search files, and print the search results to the command line. Let us try some simple commands first.
grep diabetes genes_to_phenotype.txt\n
You will see a list of hundreds of lines out output. Each line corresponds to a line in the genes_to_phenotype.txt
file which contains the word \"diabetes\".
grep is case sensitive. It wont find matches like Diabetes, with capital D!\n\nUse the `-i` parameter in the grep command to instruct grep to\nperform case insensitive matches.\n
There is a lot more to grep than we can cover here today, but one super cool thing is searching across an entire directory.
grep -r \"Elevated circulating follicle\" .\n
Assuming you are in the data
directory, you should see something like this:
./genes_to_phenotype.txt:190 NR0B1 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510\n./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - - mim2gene OMIM:273250\n...... # Removed other results\n./hp.obo:name: Elevated circulating follicle stimulating hormone level\n
There are two new aspects to the command here:
-r
option (\"recursive\") allows is to search a directory and all directories within in..
in the beginning. Remember, in the previous use of the grep
command we had the name of a file in the place where now the .
is. The .
means \"this directory\" - i.e. the directory you are in right now (if lost, remember pwd
).As you can see, grep
does not only list the line of the file in which the match was found, it also tells us which filename it was found in! We can make this somewhat more easy to read as well by only showing filenames using the -l
parameter:
matentzn@mbp.local:~/my-tutorial/data $ grep -r -l \"Elevated circulating follicle\" .\n./genes_to_phenotype.txt\n./hp.obo\n
"},{"location":"tutorial/intro-cli-1/#the-dark-art-of-piping-and-redirects","title":"The Dark Art of Piping and Redirects","text":"The final lesson for today is about one of the most powerful features of the command line: the ability to chain commands together. Let us start with a simple example (make sure you are inside the data directory):
grep -r \"Elevated circulating follicle\" . | head -3\n
This results in:
./genes_to_phenotype.txt:190 NR0B1 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510\n./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - - mim2gene OMIM:273250\n./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510\n
So, what is happening here? First, we use the grep
command to find \"Elevated circulating follicle\" in our data directory. As you may remember, there are more than 10 results for this command. So the grep command now wants to print these 10 results for you, but the |
pipe symbol intercepts the result from grep
and passes it on to the next command, which is head
. Remember head
and tail
from above? Its exactly the same thing, only that, rather than printing the first lines of a file, we print the first lines of the output of the previous command. You can do incredible things with pipes. Here a taster which is beyond this first tutorial, but should give you a sense:
grep \"Elevated circulating follicle\" genes_to_phenotype.txt | cut -f2 | sort | uniq | head -3\n
Output:
AR\nBNC1\nC14ORF39\n
What is happening here?
grep
is looking for \"Elevated circulating follicle\" in all files in the directory, then \"|\" is passing the output on tocut
, which extracts the second column of the table (how cool?), then \"|\" is passing the output on tosort
, which sorts the output, then \"|\" is passing the output on touniq
, which removes all duplicate values from the output, then \"|\" is passing the output on tohead
, which is printing only the first 3 rows of the result.Another super cool use of piping is searching your command history. Try running:
history\n
This will show you all the commands you have recently run. Now if you want to simply look for some very specific commands that you have run in the past you can combine history
with grep
:
history | grep follicle\n
This will print every command you ran in the last hour that contains the word \"follicle\". Super useful if you, like me, keep forgetting your commands!
The last critical feature of the command line we cover today is the \"file redirect\". Instead of printing the output to file, we may chose to redirect the results to a file instead:
matentzn@mbp.local:~/my-tutorial/data $ grep \"Elevated circulating follicle\" genes_to_phenotype.txt | cut -f2 | sort | uniq | head -3 > gene.txt\nmatentzn@mbp.local:~/my-tutorial/data $ head gene.txt\nAR\nBNC1\nC14ORF39\n
> gene.txt
basically tells the command line: instead of printing the results to the command line, \"print\" them into a file which is called gene.txt
.
Sam also did here PhD in and around ontologies but has moved entirely to data engineering since. I really liked her 1 hour introduction into the terminal, this should fill some of the yawning gaps in this introduction here.
"},{"location":"tutorial/intro-cli-1/#further-reading","title":"Further reading","text":"Today we will pick up where we left off after the first CLI tutorial, and discuss some more usages of the command line. In particular, we will:
You have:
~/.zshrc
file in case you have had any previous customisations you wish to preserve.Introduction to Command Line Interface Part 2
"},{"location":"tutorial/intro-cli-2/#preparation","title":"Preparation","text":"odk.bat
as instructed above in some directory on your machine (the path to the odk.bat file should have no spaces!).bash_profile
in the same directory as your odk.bat file.-v %cd%\\.bash_profile:/root/.bash_profile
to the odk.bat file (this is mounting the .bash_profile
file inside your ODK container). There is already a similar -v statement in this file, just copy it right afterodk.bat bash
on your CMD (first, cd
to the directory containing the odk.bat file).If you have not done so, install https://ohmyz.sh/. It is not strictly speaking necessary to use ohmyzsh to follow the rest of this tutorial, but it is a nice way to managing your Zsh (z-shell) configuration. Note that the ODK is using the much older bash
, but it should be fine for you to work with anyways.
As Semantic Engineers or Ontology Curators we frequently have to install custom tools like ROBOT, owltools, and more on our computer. These are frequently downloaded from the internet as \"binaries\", for example as Java \"jar\" files. In order for our shell to \"know\" about these downloaded programs, we have to \"add them to the path\".
Let us first look at what we currently have loaded in our path:
echo $PATH\n
What you see here is a list of paths. To read this list a bit more easily, let us remember our lesson on piping commands:
echo $PATH | tr ':' '\\n' | sort\n
What we do here:
echo
command to print the contents of the $PATH variable. In Unix systems, the $
signifies the beginning of a variable name (if you are curious about what other \"environment variables\" are currently active on your system, use theprintenv
command). The output of theecho
command is piped to the next command (tr
).tr \u2013 translate characters
command copies the input of the previous command to the next with substitution or deletion of selected characters. Here, we substitute the :
character, which is used to separate the different directory paths in the $PATH
variable, with \"\\n\", which is the all important character that denotes a \"new line\".So, how do we change the \"$PATH\"? Let's try and install ROBOT and see! Before we download ROBOT, let us think how we will organise our custom tools moving forward. Everyone has their own preferences, but I like to create a tools
directory right in my Users directory, and use this for all my tools moving forward. In this spirit, lets us first go to our user directory in the terminal, and then create a \"tools\" directory:
cd ~\nmkdir -p tools\n
The -p
parameter simply means: create the tools directory only if it does not exist. Now, let us go inside the tools directory (cd ~/tools
) and continue following the instructions provided here.
First, let us download the latest ROBOT release using the curl
command:
curl -L https://github.com/ontodev/robot/releases/latest/download/robot.jar > robot.jar\n
ROBOT is written in the Java programming language, and packaged up as an executable JAR file. It is still quite cumbersome to directly run a command with that JAR file, but for the hell of it, let us just do it (for fun):
java -jar robot.jar --version\n
If you have worked with ROBOT before, this looks quite a bit more ugly then simply writing:
robot --version\n
If you get this (or a similar) error:
zsh: permission denied: robot\n
You will have to run the following command as well, which makes the robot
wrapper script executable:
chmod +x ~/tools/robot\n
So, how can we achieve this? The answer is, we download a \"wrapper script\" and place it in the same folder as the Jar. Many tools provide such wrapper scripts, and they can sometimes do many more things than just \"running the jar file\". Let us know download the latest wrapper script:
curl https://raw.githubusercontent.com/ontodev/robot/master/bin/robot > robot\n
If everything went well, you should be able to print the contents of that file to the terminal using cat
:
cat robot\n
You should see something like:
#!/bin/sh\n\n## Check for Cygwin, use grep for a case-insensitive search\nIS_CYGWIN=\"FALSE\"\nif uname | grep -iq cygwin; then\n IS_CYGWIN=\"TRUE\"\nfi\n\n# Variable to hold path to this script\n# Start by assuming it was the path invoked.\nROBOT_SCRIPT=\"$0\"\n\n# Handle resolving symlinks to this script.\n# Using ls instead of readlink, because bsd and gnu flavors\n# have different behavior.\nwhile [ -h \"$ROBOT_SCRIPT\" ] ; do\n ls=`ls -ld \"$ROBOT_SCRIPT\"`\n # Drop everything prior to ->\n link=`expr \"$ls\" : '.*-> \\(.*\\)$'`\n if expr \"$link\" : '/.*' > /dev/null; then\n ROBOT_SCRIPT=\"$link\"\n else\n ROBOT_SCRIPT=`dirname \"$ROBOT_SCRIPT\"`/\"$link\"\n fi\ndone\n\n# Directory that contains the this script\nDIR=$(dirname \"$ROBOT_SCRIPT\")\n\nif [ $IS_CYGWIN = \"TRUE\" ]\nthen\n exec java $ROBOT_JAVA_ARGS -jar \"$(cygpath -w $DIR/robot.jar)\" \"$@\"\nelse\n exec java $ROBOT_JAVA_ARGS -jar \"$DIR/robot.jar\" \"$@\"\nfi\n
We are not getting into the details of what this wrapper script does, but note that, you can fine the actually call the the ROBOT jar file towards the end: java $ROBOT_JAVA_ARGS -jar \"$DIR/robot.jar\" \"$@\"
. The cool thing is, we do not need to ever worry about this script, but it is good for use to know, as Semantic Engineers, that it exists.
Now, we have downloaded the ROBOT jar file and the wrapper script into the ~/tools
directory. The last step remaining is to add the ~/tools
directory to your path. It makes sense to try to at least understand the basic idea behind environment variables: variables that are \"loaded\" or \"active\" in your environment (your shell). The first thing you could try to do is change the variable right here in your terminal. To do that, we can use the export
command:
export PATH=$PATH:~/tools\n
What you are doing here is using the export
command to set the PATH
variable to $PATH:~/tools
, which is the old path ($PATH
), a colon (:
) and the new directory we want to add (~/tools
). And, indeed, if we now look at our path again:
echo $PATH | tr ':' '\\n' | sort\n
We will see the path added. We can now move around to any directory on our machine and invoke the robot
command. Try it before moving on!
Unfortunately, the change we have now applied to the $PATH
variable is not persistent: if you open a new tab in your Terminal, your $PATH
variable is back to what it was. What we have to do in order to make this persistent is to add the export
command to a special script which is run every time the you open a new terminal: your shell profile.
There is a lot to say about your shell profiles, and we are taking a very simplistic view here that covers 95% of what we need: If you are using zsh
your profile is managed using the ~/.zshrc
file, and if you are using bash
, your profile is managed using the ~/.bash_profile
file. In this tutorial I will assume you are using zsh
, and, in particular, after installing \"oh-my-zsh\". Let us look at the first 5 lines of the ~/.zshrc
file:
head ~/.zshrc\n
If you have installed oh-my-zsh, the output will look something like:
# If you come from bash you might have to change your $PATH.\n# export PATH=$HOME/bin:/usr/local/bin:$PATH\n\n# Path to your oh-my-zsh installation.\nexport ZSH=\"$HOME/.oh-my-zsh\"\n\n# Set name of the theme to load --- if set to \"random\", it will\n# load a random theme each time oh-my-zsh is loaded, in which case,\n# to know which specific one was loaded, run: echo $RANDOM_THEME\n# See https://github.com/ohmyzsh/ohmyzsh/wiki/Themes\n
This ~/.zshrc
profile script is loaded every time you open up a new shell. What we want to do is add our export
command above to this script, so that it is running every time. That is the basic concept of a shell profile: providing a series of commands that is run every time a new shell (terminal window, tab) is opened.
For this tutorial, we use nano
to edit the file, but feel free to use your text editor of choice. For example, you can open the profile file using TextEdit
on Mac like this:
open -a TextEdit ~/.zshrc\n
We will proceed using nano
, but feel free to use any editor.
nano ~/.zshrc\n
Using terminal-based editors like nano or, even worse, vim, involves a bit of a learning curve. nano
is by far the least powerful and simple to use. If you typed the above command, you should see its contents on the terminal. The next step is to copy the following (remember, we already used it earlier)
export PATH=$PATH:~/tools\n
and paste it somewhere into the file. Usually, there is a specific section of the file that is concerned with setting up your path. Eventually, as you become more of an expert, you will start organising your profile according to your own preferences! Today we will just copy the command anywhere, for example:
# If you come from bash you might have to change your $PATH.\n# export PATH=$HOME/bin:/usr/local/bin:$PATH\nexport PATH=~/tutorial:$PATH\n# ..... other lines in the file\n
Note that the #
symbol denotes the beginning of a \"comment\" which is ignored by the shell/CLI. After you have pasted the above, you use the following keyboard key-combinations to safe and close the file:
control + O\n
This saves the file. Confirm with Enter.
control + x\n
This closes the file. Now, we need to tell the shell we are currently in that it should reload our profile we have just edited. We do that using the source
command.
source ~/.zshrc\n
Great! You should be able open a new tab in your terminal (with command+t on a Mac, for example) and run the following command:
robot --version\n
"},{"location":"tutorial/intro-cli-2/#managing-aliases-and-custom-commands-in-your-shell-profile","title":"Managing aliases and custom commands in your shell profile","text":"This section will only give a sense of the kinds of things you can do with your shell profile - in the end you will have to jump into the cold water and build your skills up yourself. Let us start with a very powerful concept: aliases. Aliases are short names for your commands you can use if you use them repeatedly but are annoyed typing them out every time. For example, tired of typing out long paths all the time to jump between your Cell Ontology and Human Phenotype Ontology directories? Instead of:
cd /Users/matentzn/ws/human-phenotype-ontology/src/ontology\n
wouldn't it be nice to be able to use, instead,
cdhp\n
or, if you are continuously checking git status
, why not implement a alias gits
? Or activating your python environment (source ~/.pyenv/versions/oak/bin/activate
) with a nice env-oak
? To achieve this we do the following:
(1) Open your profile in a text editor of your choice, e.g.
nano ~/.zshrc\n
add the following lines:
alias cdt='cd ~/tools'\nalias hg='history | grep'\n
Save (control+o) and close (control+x) the profile. Reload the profile:
source ~/.zshrc\n
(Alternatively, just open a new tab in your Terminal.) Now, lets try our new aliases:
cdt\n
Will bring you straight to your tools
directory you created in the previous lesson above.
hg robot\n
Will search your terminal command history for every command you have executed involving robot
.
In the following, we provide a list of aliases we find super useful:
alias cdt='cd ~/tools'
- add shortcuts to all directories you frequently visit!alias orcid='echo '\\''https://orcid.org/0000-0002-7356-1779'\\'' | tr -d '\\''\\n'\\'' | pbcopy'
- if you keep having to look up your ORCID, your favourite ontologies PURL or the your own zoom room, why not add a shortcut that copies it straight into your clipboard?alias opent='open ~/tools'
- why not open your favourite directory in finder without faving to search the User Interface? You can use the same idea to open your favourite ontology from wherever you are, i.e. alias ohp='open ~/ws/human-phenotype-ontology/src/ontology/hp-edit.owl'
.alias env-linkml='source ~/.pyenv/versions/linkml/bin/activate'
- use simple shortcuts to active your python environments. This will become more important if you learn to master special python tools like OAK.alias update_repo='sh run.sh make update_repo'
- for users of ODK - alias all your long ODK commands!The most advanced thought we want to cover today is \"functions\". You can not only manage simple aliases, but you can actually add proper functions into your shell profile. Here is an example of one that I use:
ols() {\n open https://www.ebi.ac.uk/ols/search?q=\"$1\"\n}\n
This is a simple function in my bash profile that I can use to search on OLS:
ols \"lung disorder\"\n
It will open this search straight in my browser.
rreport() {\n robot report -i \"$1\" --fail-on none -o /Users/matentzn/tmp_data/report_\"$(basename -- $1)\".tsv\n}\n
This allows me to quickly run a robot report on an ontology.
rreport cl.owl\n
Why not expand the function and have it open in my atom text editor right afterwards?
rreport() {\n robot report -i \"$1\" --fail-on none -o /Users/matentzn/tmp_data/report_\"$(basename -- $1)\".tsv && atom /Users/matentzn/tmp_data/report_\"$(basename -- $1)\".tsv\n}\n
The possibilities are endless. Some power-users have hundreds of such functions in their shell profiles, and they can do amazing things with them. Let us know about your own ideas for functions on the OBOOK issue tracker. Or, why not add a function to create a new, titled issue on OBOOK?
obook-issue() {\n open https://github.com/OBOAcademy/obook/issues/new?title=\"$1\"\n}\n
and from now on run:
obook-issue \"Add my awesome function\"\n
"},{"location":"tutorial/intro-cli-2/#further-reading","title":"Further reading","text":"In this tutorial, we will learn to use a very basic lexical matching tool (OAK Lexmatch). The goal is not only to enable the learner to design their own matching pipelines, but also to to think about how they fit into their mapping efforts. Note that this tutorial is not about how to do proper matching: the goal here is simply to introduce you to the general workflow. Proper ontology matching is a major discipline with many tools, preprocessing and tuning approaches and often intricate interplay between matching tools and human curators. Today, you will just get a sense of the general method.
"},{"location":"tutorial/lexmatch-tutorial/#pre-requisites","title":"Pre-requisites","text":"In this tutorial, you will learn how to match fruit juices in Wikidata with FOODON using a simple lexical matching tool (OAK). The idea is simple: We obtain the ontologies we like to match, ask OAK to generate the matches and then curate the results.
Makefile
to prepare your input ontology with ROBOT.Setting up oak
is described in its documentation. Note that, aside from oak
itself, you also need relation-graph
, rdftab
and riot
installed, see https://incatools.github.io/ontology-access-kit/intro/tutorial07.html#without-docker. This tutorial requires OAK version 0.1.59 or higher.
Note that if you are using the ODK docker image, oaklib
is already installed. In the following, we will use the ODK wrapper to ensure that everyone has a consistent experience. If you want to use the local (non-docker) setup, you have to follow the instructions above before continuing and ignore the sh odk.sh
part of the commands.
ODK 1.3.1, the version still active on the 8th December 2022, does not have the latest dependencies of OAK installed. To follow the tutorial you have to use the ODK development snapshot.
Install the ODK Development snapshot:
docker pull obolibrary/odkfull:dev\n
After downloading https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/resources/odk.sh into your local working directory, open it with a text editor and change:
docker ... obolibrary/odkfull ...\n
to
docker ... obolibrary/odkfull:dev ...\n
"},{"location":"tutorial/lexmatch-tutorial/#download-ontologies-and-extract-subsets","title":"Download Ontologies and extract subsets","text":"First, we download FOODON
ontology. You can do this in whatever way you want, for example with wget
:
sh odk.sh wget http://purl.obolibrary.org/obo/foodon.owl -O foodon.owl\n
Next, we extract the subset of FOODON that is relevant to our task at hand: relevant terms about fruit juices. The right method of subset extraction will differ from task to task. For this tutorial, we are using ROBOT extract to obtain a MIREOT
module containing all the fruit juices. We do this by selecting everything between fruit juice food product
as the upper-term
and fruit juices (apple juice
, orange juice
and grapefruit juice
) as the lower-term
of the FOODON
subset.
sh odk.sh robot extract --method MIREOT --input foodon.owl --upper-term \"FOODON:00001140\" --lower-term \"FOODON:00001277\" --lower-term \"FOODON:00001059\" --lower-term \"FOODON:03306174 \" --output fruit_juice_food_foodon.owl\n
If you open fruit_juice_food_foodon.owl
in Protege, you will see something similar to:
Next, we use OAK to extract juices and their labels from wikidata by selecting the descendants of juice
from wikidata
, store the result as a ttl
file and then convert it to OWL
using ROBOT
.
sh odk.sh runoak -i wikidata: descendants wikidata:Q8492 -p i,p -o juice_wd.ttl -O rdf\nsh odk.sh robot convert -i juice_wd.ttl -o juice_wd.owl\n
Note that you wont be able to see anything when opening juice_wd.owl
in wikidata, because it does not have any OWL types (class, individual assertions) attached to it. However, you can convince yourself all is well by opening juice_wd.owl
in a text editor, and see expressions such as:
<rdf:Description rdf:about=\"http://www.wikidata.org/entity/Q10374646\">\n <rdfs:label>cashew apple juice</rdfs:label>\n</rdf:Description>\n
The last preparation step is merging the two subsets (from FOODON and wikidata) into a single file using ROBOT
:
sh odk.sh robot merge -i fruit_juice_food_foodon.owl -i juice_wd.owl -o foodon_wd.owl\n
"},{"location":"tutorial/lexmatch-tutorial/#generate-the-matches-with-oak","title":"Generate the matches with OAK","text":"Now we are ready to create our first set of matches. First, let's run oak
's lexmatch
command to generate lexical matches between the contents of the merged file:
sh odk.sh runoak -i sqlite:foodon_wd.owl lexmatch -o foodon_wd_lexmatch.tsv\n
This will generate an SSSOM tsv file with the mapped contents as shown below:
# curie_map:\n# FOODON: http://purl.obolibrary.org/obo/FOODON_\n# owl: http://www.w3.org/2002/07/owl#\n# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n# rdfs: http://www.w3.org/2000/01/rdf-schema#\n# semapv: https://w3id.org/semapv/\n# skos: http://www.w3.org/2004/02/skos/core#\n# sssom: https://w3id.org/sssom/\n# wikidata: http://www.wikidata.org/entity/\n# license: https://w3id.org/sssom/license/unspecified\n# mapping_set_id: https://w3id.org/sssom/mappings/091390a2-6f64-436d-b2d1-309045ff150c\n
subject_id subject_label predicate_id object_id object_label mapping_justification mapping_tool confidence subject_match_field object_match_field match_string FOODON:00001059 apple juice skos:closeMatch wikidata:Q618355 apple juice semapv:LexicalMatching oaklib 0.5 rdfs:label rdfs:label apple juice FOODON:00001059 apple juice skos:closeMatch wikidata:Q618355 apple juice semapv:LexicalMatching oaklib 0.5 oio:hasExactSynonym rdfs:label apple juice FOODON:03301103 orange juice skos:closeMatch wikidata:Q219059 orange juice semapv:LexicalMatching oaklib 0.5 rdfs:label rdfs:label orange juice FOODON:03306174 grapefruit juice skos:closeMatch wikidata:Q1138468 grapefruit juice semapv:LexicalMatching oaklib 0.5 rdfs:label rdfs:label grapefruit juice wikidata:Q15823640 cherry juice skos:closeMatch wikidata:Q62030277 cherry juice semapv:LexicalMatching oaklib 0.5 rdfs:label rdfs:label cherry juice wikidata:Q18201657 must skos:closeMatch wikidata:Q278818 must semapv:LexicalMatching oaklib 0.5 rdfs:label rdfs:label must This is great - we get a few mappings without much work. If you need some help interpreting this table, please refer to the SSSOM tutorials for details.
Just eyeballing the labels in our ontology with OAK:
sh odk.sh runoak -i sqlite:foodon_wd.owl terms | grep juice\n
We notice rows like:
...\nFOODON:00001001 ! orange juice (liquid)\n...\n
It may be beneficial for us to pre-process the labels a bit before performing the matches, for example, by excluding comments in the labels provided in brackets (essentially removing (liquid)
).
To do this, we will define a few simple mapping rules in a file called matcher_rules.yaml
. OAK provides a standard for representing the matching rules. You can see an example here.
Here is an example file:
rules:\n- description: default\npostconditions:\npredicate_id: skos:closeMatch\nweight: 0.0\n\n- description: exact to exact\npreconditions:\nsubject_match_field_one_of:\n- oio:hasExactSynonym\n- rdfs:label\n- skos:prefLabel\nobject_match_field_one_of:\n- oio:hasExactSynonym\n- rdfs:label\n- skos:prefLabel\npostconditions:\npredicate_id: skos:exactMatch\nweight: 2.0\n\n- preconditions:\nsubject_match_field_one_of:\n- oio:hasExactSynonym\n- rdfs:label\nobject_match_field_one_of:\n- oio:hasBroadSynonym\npostconditions:\npredicate_id: skos:broadMatch\nweight: 2.0\n\n- synonymizer:\nthe_rule: Remove parentheses bound info from the label.\nmatch: r'\\([^)]*\\)'\nmatch_scope: \"*\"\nreplacement: \"\"\n\n- synonymizer:\nthe_rule: Replace \"'s\" by \"s\" in the label.\nmatch: r'\\'s'\nmatch_scope: \"*\"\nreplacement: \"s\"\n
As you can see, there are basically two kinds of rules: normal ones, and synonimizer
ones. The normal rules provide preconditions and postconditions. For example, the second rule says: if an exact synonym, preferred label or label of the subject matches an exact synonym, preferred label or label of the object, then assert a skos:exactMatch
. The synonimizer
rules are preprocessing rules which are applied to the labels and synonyms prior to matching. Let's now run the matcher again:
sh odk.sh runoak -i sqlite:foodon_wd.owl lexmatch -R matcher_rules.yaml -o foodon_wd_lexmatch_with_rules.tsv \n
This will generate an SSSOM tsv file with a few more matches than the previous output (the exact matches may differ from version to version):
# curie_map:\n# FOODON: http://purl.obolibrary.org/obo/FOODON_\n# IAO: http://purl.obolibrary.org/obo/IAO_\n# owl: http://www.w3.org/2002/07/owl#\n# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n# rdfs: http://www.w3.org/2000/01/rdf-schema#\n# semapv: https://w3id.org/semapv/\n# skos: http://www.w3.org/2004/02/skos/core#\n# sssom: https://w3id.org/sssom/\n# wikidata: http://www.wikidata.org/entity/\n# license: https://w3id.org/sssom/license/unspecified\n# mapping_set_id: https://w3id.org/sssom/mappings/6b9c727f-9fdc-4a78-bbda-a107b403e3a9\n
subject_id subject_label predicate_id object_id object_label mapping_justification mapping_tool confidence subject_match_field object_match_field match_string subject_preprocessing object_preprocessing FOODON:00001001 orange juice (liquid) skos:exactMatch FOODON:00001277 orange juice (unpasteurized) semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice semapv:RegularExpressionReplacement semapv:RegularExpressionReplacement FOODON:00001001 orange juice (liquid) skos:exactMatch FOODON:03301103 orange juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice semapv:RegularExpressionReplacement FOODON:00001001 orange juice (liquid) skos:exactMatch wikidata:Q219059 orange juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice semapv:RegularExpressionReplacement FOODON:00001059 apple juice skos:exactMatch wikidata:Q618355 apple juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label apple juice FOODON:00001059 apple juice skos:exactMatch wikidata:Q618355 apple juice semapv:LexicalMatching oaklib 0.8 oio:hasExactSynonym rdfs:label apple juice FOODON:00001277 orange juice (unpasteurized) skos:exactMatch FOODON:03301103 orange juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice semapv:RegularExpressionReplacement FOODON:00001277 orange juice (unpasteurized) skos:exactMatch wikidata:Q219059 orange juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice semapv:RegularExpressionReplacement FOODON:00002403 food material skos:exactMatch FOODON:03430109 food (liquid, low viscosity) semapv:LexicalMatching oaklib 0.8 oio:hasExactSynonym rdfs:label food semapv:RegularExpressionReplacement FOODON:00002403 food material skos:exactMatch FOODON:03430130 food (liquid) semapv:LexicalMatching oaklib 0.8 oio:hasExactSynonym rdfs:label food semapv:RegularExpressionReplacement FOODON:03301103 orange juice skos:exactMatch wikidata:Q219059 orange juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label orange juice FOODON:03306174 grapefruit juice skos:exactMatch wikidata:Q1138468 grapefruit juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label grapefruit juice FOODON:03430109 food (liquid, low viscosity) skos:exactMatch FOODON:03430130 food (liquid) semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label food semapv:RegularExpressionReplacement semapv:RegularExpressionReplacement wikidata:Q15823640 cherry juice skos:exactMatch wikidata:Q62030277 cherry juice semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label cherry juice wikidata:Q18201657 must skos:exactMatch wikidata:Q278818 must semapv:LexicalMatching oaklib 0.8497788951776651 rdfs:label rdfs:label must "},{"location":"tutorial/lexmatch-tutorial/#curate","title":"Curate","text":"As we have described in detail in our introduction to Semantic Matching, it is important to remember that matching in its raw form should not be understood to result in semantic mappings: they are better understood as mapping candidates. Therefore, it is always to plan for a review of false positives and false negatives:
for example, orange juice [wikidata:Q219059] and orange juice (unpasteurized) [FOODON:00001277] may not be considered the same thing in the sense of skos:exactMatch.
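Before that manual pass, it can help to rank and bucket the candidates. Here is a minimal sketch of such a triage step in Python, assuming the matcher output above was saved as mappings.sssom.tsv; the file name and the 0.85 cutoff are only illustrations, not part of this tutorial:
import pandas as pd\n\n# Illustrative triage of mapping candidates before manual curation.\n# Assumes the SSSOM TSV above was saved as mappings.sssom.tsv.\ndf = pd.read_csv('mappings.sssom.tsv', sep='\\t', comment='#')\n\n# Review the strongest candidates first.\ndf = df.sort_values('confidence', ascending=False)\n\n# Flag weaker candidates for extra scrutiny (0.85 is an arbitrary cutoff).\ndf['low_confidence'] = df['confidence'] < 0.85\n\nprint(df[['subject_id', 'object_id', 'confidence', 'low_confidence']].head())\n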
For a more detailed introduction to manual mapping curation with SSSOM, we recommend following this tutorial: https://mapping-commons.github.io/sssom/tutorial/.
"},{"location":"tutorial/linking-data/","title":"Tutorial: From Tables to Linked Data","text":"These are the kinds of things that I do when I need to work with a new dataset. My goal is to have data that makes good sense and that I can integrate with other data using standard technologies: Linked Data.
"},{"location":"tutorial/linking-data/#0-before","title":"0. Before","text":"The boss just sent me this new table to figure out:
| datetime | investigator | subject | species | strain | sex | group | protocol | organ | disease | qualifier | comment |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| 1/1/14 10:21 AM | JAO | 12 | RAT | F 344/N | FEMALE | 1 | HISTOPATHOLOGY | LUNG | ADENOCARCINOMA | SEVERE | |\n| 1/1/14 10:30 AM | JO | 31 | MOUSE | B6C3F1 | MALE | 2 | HISTOPATHOLOGY | NOSE | INFLAMMATION | MILD | |\n| 1/1/14 10:45 AM | JAO | 45 | RAT | F 344/N | MALE | 1 | HISTOPATHOLOGY | ADRENAL CORTEX | NECROSIS | MODERATE | |\n
It doesn't seem too bad, but there's lots of stuff that I don't quite understand. Where to start?
"},{"location":"tutorial/linking-data/#1-getting-organized","title":"1. Getting Organized","text":"Before I do anything else, I'm going to set up a new project for working with this data. Maybe I'll change my mind later and want to merge the new project with an existing project, but it never hurts to start from a nice clean state.
I'll make a new directory in a sensible place with a sensible name. In my case I have a ~/Repositories/
directory, with subdirectories for GitHub and various GitLab servers, a local
directory for projects I don't plan to share, and a temp
directory for projects that I don't need to keep. I'm not sure if I'm going to share this work, so it can go in a new subdirectory of local
. I'll call it \"linking-data-tutorial\" for now.
Then I'll run git init
to turn that directory into a git repository. For now I'm just going to work locally, but later I can make a repository on GitHub and push my local repository there.
Next I'll create a README.md
file where I'll keep notes for myself to read later. My preferred editor is Kakoune.
So I'll open a terminal and run these commands:
$ cd ~/Repositories/local/\n$ mkdir linking-data-tutorial\n$ cd linking-data-tutorial\n$ git init\n$ kak README.md\n
In the README I'll start writing something like this:
# Linking Data Tutorial\n\nAn example of how to convert a dataset to Linked Data.\n\nThe source data is available from\n<https://github.com/jamesaoverton/obook/tree/master/03-RDF/data.csv>\n
Maybe this information should go somewhere else eventually, but the README is a good place to start.
\"Commit early, commit often\" they say, so:
$ git add README.md\n$ git commit -m \"Initial commit\"\n
"},{"location":"tutorial/linking-data/#2-getting-copies","title":"2. Getting Copies","text":"Data has an annoying tendency to get changed. You don't want it changing out from under you while you're in the middle of something. So the next thing to do is get a copy of the data and store it locally. If it's big, you can store a compressed copy. If it's too big to fit on your local machine, well keep the best notes you can of how to get to the data, and what operations you're doing on it.
I'm going to make a cache
directory and store all my \"upstream\" data there. I'm going to fetch the data and that's it -- I'm not going to edit these files. When I want to change the data I'll make copies in another directory. I don't want git to track the cached data, so I'll add /cache/
to .gitignore
and tell git to track that. Then I'll use curl
to download the file.
$ mkdir cache\n$ echo \"/cache/\" >> .gitignore\n$ git add .gitignore\n$ git commit -m \"Ignore /cache/ directory\"\n$ cd cache\n$ curl -LO \"https://github.com/jamesaoverton/obook/raw/master/03-RDF/data.csv\"\n$ ls\ndata.csv\n$ cd ..\n$ ls -a\n.gitignore cache README.md\n
"},{"location":"tutorial/linking-data/#3-getting-my-bearings","title":"3. Getting My Bearings","text":"The first thing to do is look at the data. In this case I have just one table in CSV format, so I can use any number of tools to open the file and look around. I bet the majority of people would reach for Excel. My (idiosyncratic) preference is VisiData.
What am I looking for? A bunch of different things:
In my README file I'll make a list of the columns like this:
- datetime\n- investigator\n- subject\n- species\n- strain\n- sex\n- group\n- protocol\n- organ\n- disease\n- qualifier\n- comment\n
Then I'll make some notes for myself:
- datetime: American-style dates, D/M/Y or M/D/Y?\n- investigator: initials, ORCID?\n- subject: integer ID\n- species: common name for species, NCBITaxon?\n- strain: some sort of code with letters, numbers, spaces, some punctuation\n- sex: string female/male\n- group: integer ID\n- protocol: string, OBI?\n- organ: string, UBERON?\n- disease: string, DO/MONDO?\n- qualifier: string, PATO?\n- comment: ???\n
You can see that I'm trying to figure out what's in each column. I'm also thinking ahead to OBO ontologies that I know of that may have terms that I can use for each column.
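If you would rather sanity-check the columns with a quick script than eyeball them in a spreadsheet, a rough sketch like this works too (my illustration, not part of the original workflow; it reads the cached copy from step 2):
import csv\nfrom collections import Counter\n\n# Quick profile of the cached table: how many distinct values per column?\nwith open('cache/data.csv', newline='') as f:\n    rows = list(csv.DictReader(f))\n\nfor column in rows[0]:\n    counts = Counter(row[column] for row in rows)\n    print(column, len(counts), 'distinct values, e.g.', counts.most_common(3))\n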
"},{"location":"tutorial/linking-data/#4-getting-structured","title":"4. Getting Structured","text":"In the end, I want to have nice, clean Linked Data. But I don't have to get there in one giant leap. Instead I'll take a bunch of small, incremental steps.
There's lots of tools I can use, but this time I'll use SQLite.
First I'll set up some more directories. I'll create a build
directory where I'll store temporary files. I don't want git to track this directory, so I'll add it to .gitignore
.
$ mkdir build/\n$ echo \"/build/\" >> .gitignore\n$ git add .gitignore\n$ git commit -m \"Ignore /build/ directory\"\n
I'll also add a src
directory to store code. I do want to track src
with git.
$ mkdir src\n$ kak src/data.sql\n
In src/data.sql
I'll add just enough to import build/data.csv
:
-- import build/data.csv\n.mode csv\n.import build/data.csv data_csv\n
This will create a build/data.db
file and import build/data.csv
into a data_csv
table. Does it work?
$ sqlite3 build/data.db < src/data.sql\n$ sqlite3 build/data.db <<< \"SELECT * FROM data_csv LIMIT 1;\"\n2014-01-01 10:21:00-0500|JAO|12|RAT|F 344/N|FEMALE|1|HISTOPATHOLOGY|LUNG|ADENOCARCINOMA|SEVERE|\n
Nice!
Note that I didn't even specify a schema for data_csv
. It uses the first row as the column names, and the type of every column is TEXT
. Here's the schema I end up with:
$ sqlite3 build/data.db <<< \".schema data_csv\"\nCREATE TABLE data_csv(\n\"datetime\" TEXT,\n\"investigator\" TEXT,\n\"subject\" TEXT,\n\"species\" TEXT,\n\"strain\" TEXT,\n\"sex\" TEXT,\n\"group\" TEXT,\n\"protocol\" TEXT,\n\"organ\" TEXT,\n\"disease\" TEXT,\n\"qualifier\" TEXT,\n\"comment\" TEXT\n);\n
I'm going to want to update src/data.sql
then rebuild the database over and over. It's small, so this will only take a second. If it was big, then I would copy a subset into build/data.csv
for now so that the script still runs in a second or two and I can iterate quickly. I'll write a src/build.sh
script to make life a little easier:
#!/bin/sh\n\nrm -f build/*\ncp cache/data.csv build/data.csv\nsqlite3 build/data.db < src/data.sql\n
Does it work?
$ sh src/build.sh\n
Nice! Time to update the README:
## Requirements\n\n- [SQLite3](https://sqlite.org/index.html)\n\n## Usage\n\nRun `sh src/build.sh`\n
I'll commit my work in progress:
$ git add src/data.sql src/build.sh\n$ git add --update\n$ git commit -m \"Load data.csv into SQLite\"\n
Now I have a script that executes a SQL file that loads the source data into a new database. I'll modify the src/data.sql
file in a series of small steps until it has the structure that I want.
In the real world, data is always a mess. It takes real work to clean it up. And really, it's almost never perfectly clean.
It's important to recognize that cleaning data has diminishing returns. There's low hanging fruit: easy to clean, often with code, and bringing big benefits. Then there's tough stuff that requires an expert to work through the details, row by row.
The first thing to do is figure out the schema you want. I'll create a new data
table and start with the default schema from data_csv
. Notice that in the default schema all the column names are quoted. That's kind of annoying. But when I remove the quotation marks I realize that one of the column names is \"datetime\", but datetime
is a keyword in SQLite! You can't use it as a column name without quoting. I'll rename it to \"assay_datetime\". I have the same problem with \"group\". I'll rename \"group\" to \"group_id\" and \"subject\" to \"subject_id\". The rest of the column names seem fine.
I want \"assay_datetime\" to be in standard ISO datetime format, but SQLite stores these as TEXT. The \"subject\" and \"group\" columns are currently integers, but I plan to make them into URIs to CURIEs. So everything will still be TEXT.
CREATE TABLE data(\nassay_datetime TEXT,\ninvestigator TEXT,\nsubject_id TEXT,\nspecies TEXT,\nstrain TEXT,\nsex TEXT,\ngroup_id TEXT,\nprotocol TEXT,\norgan TEXT,\ndisease TEXT,\nqualifier TEXT,\ncomment TEXT\n);\n
The dates currently look like \"1/1/14 10:21 AM\". Say I know that they were done on Eastern Standard Time. How do I convert to ISO dates like \"2014-01-01 10:21:00-0500\"? Well SQLite isn't the right tool for this. The Unix date
command does a nice job, though:
$ date -d \"1/1/14 10:21 AM EST\" +\"%Y-%m-%d %H:%M:%S%z\"\n2014-01-01 10:21:00-0500\n
I can run that over each line of the file using awk
. So I update the src/build.sh
to rework the build/data.csv
before I import:
#!/bin/sh\n\nrm -f build/*\n\nhead -n1 cache/data.csv > build/data.csv\ntail -n+2 cache/data.csv \\\n| awk 'BEGIN{FS=\",\"; OFS=\",\"} {\n \"date -d \\\"\"$1\" EST\\\" +\\\"%Y-%m-%d %H:%M:%S%z\\\"\" | getline $1;\n print $0\n}' \\\n>> build/data.csv\n\nsqlite3 build/data.db < src/data.sql\n
One more problem I could clean up is that \"JO\" should really be \"JAO\" -- that's just a typo, and they should both refer to James A. Overton. I could make that change in src/build.sh
, but I'll do it in src/data.sql
instead. I'll write a query to copy all the rows of data_csv
into data
and then I'll update data
with some fixes.
-- copy from data_csv to data\nINSERT INTO data SELECT * FROM data_csv;\n\n-- clean data\nUPDATE data SET investigator='JAO' WHERE investigator='JO';\n
Honestly, it took me quite a while to write that awk
command. It's a very powerful tool, but I don't use it enough to remember how it works. You might prefer to write yourself a Python script, or some R code. You could use that instead of this SQL UPDATE as well. I just wanted to show you two of the thousands of ways to do this. If there's a lot of replacements like \"JO\", then you might also consider listing them in another table that you can read into your script.
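For what it's worth, here is a rough sketch of what that Python script could look like, folding the date conversion and the \"JO\" fix into a single pass over the file. It is my illustration under the same Eastern Standard Time assumption, not the version this tutorial commits:
import csv\nfrom datetime import datetime\n\n# Illustrative alternative to the awk + SQL UPDATE steps above:\n# rewrite cache/data.csv into build/data.csv with ISO datetimes\n# and the 'JO' -> 'JAO' replacement applied.\nreplacements = {'JO': 'JAO'}\n\nwith open('cache/data.csv', newline='') as source, open('build/data.csv', 'w', newline='') as target:\n    reader = csv.DictReader(source)\n    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)\n    writer.writeheader()\n    for row in reader:\n        parsed = datetime.strptime(row['datetime'], '%m/%d/%y %I:%M %p')\n        # Assumes Eastern Standard Time year-round (no DST handling).\n        row['datetime'] = parsed.strftime('%Y-%m-%d %H:%M:%S') + '-0500'\n        row['investigator'] = replacements.get(row['investigator'], row['investigator'])\n        writer.writerow(row)\n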
The important part is to automate your cleaning!
Why didn't I just edit cache/data.csv
in Excel? In step 2 I saved a copy of the data because I didn't want it to change while I was working on it, but I do expect it to change! By automating the cleaning process, I should be able to just update cache/data.csv
, run everything again, and the fixes will be applied again. I don't want to do all this work manually every time the upstream data is updated.
I'll commit my work in progress:
$ git add --update\n$ git commit -m \"Start cleaning data\"\n
Cleaning can take a lot of work. This example table is pretty clean already. The next hard part is sorting out your terminology.
"},{"location":"tutorial/linking-data/#6-getting-connected","title":"6. Getting Connected","text":"It's pretty easy to convert a table structure to triples. The hard part is converting the table contents. There are some identifiers in the table that would be better as URLs, and there's a bunch of terminology that would be better if it was linked to an ontology or other system.
I'll start with the identifiers that are local to this data: subject_id and group_id. I can convert them to URLs by defining a prefix and then just using that prefix. I'll use string concatenation to update the table:
-- update subject and group IDs\nUPDATE data SET subject_id='ex:subject-' || subject_id;\nUPDATE data SET group_id='ex:group-' || group_id;\n
Now I'll check my work:
$ sqlite3 build/data.db <<< \"SELECT * FROM data_csv LIMIT 1;\"\n2014-01-01 10:21:00-0500|JAO|ex:subject-12|RAT|F 344/N|FEMALE|ex:group-1|HISTOPATHOLOGY|LUNG|ADENOCARCINOMA|SEVERE|\n
I should take a moment to tell you that while I was writing the Turtle conversion code later in this essay, I had to come back here and change these identifiers. The thing is that Turtle is often more strict than I expect about identifier syntax. Turtle identifiers look like CURIEs, but they're actually QNames. CURIEs are pretty much just URLs shortened with a prefix, so almost anything goes. QNames come from XML, and Turtle identifiers have to be valid XML element names.
I always remember that I need to stick to alphanumeric characters, and that I have to replace whitespace and punctuation with a -
or _
. I didn't remember that the local part (aka \"suffix\", aka \"NCName\") can't start with a digit. So I tried to use \"subject:12\" and \"group:1\" as my identifiers. That worked fine until I generated Turtle. The Turtle looked fine, so it took me quite a while to figure out why it looked very wrong when I converted it into RDF/XML format.
This kind of thing happens to me all the time. I'm almost always using a mixture of technologies based on different sets of assumptions, and there are always things that don't line up. That's why I like to work in small iterations, checking my work as I go (preferably with automated tests), and keeping everything in version control. When I need to make a change like this one, I just circle back and iterate again.
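As an aside, a tiny helper like this captures the rule that bit me. It is only an illustration of the NCName constraints, not something this tutorial actually uses:
import re\n\ndef safe_local_name(value):\n    # Keep alphanumeric characters, collapsing everything else to '-'.\n    local = re.sub(r'[^A-Za-z0-9]+', '-', value.strip()).strip('-')\n    # The local part of a QName (an NCName) must not start with a digit.\n    if not local or local[0].isdigit():\n        local = 'x-' + local\n    return local\n\nprint(safe_local_name('F 344/N'))  # F-344-N\nprint(safe_local_name('12'))       # x-12\n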
The next thing is to tackle the terminology. First I'll just make a list of the terms I'm using from the relevant columns in build/term.tsv
:
```sh\n$ sqlite3 build/data.db << EOF > build/term.tsv\nSELECT investigator FROM data\nUNION SELECT species FROM data\nUNION SELECT strain FROM data\nUNION SELECT sex FROM data\nUNION SELECT protocol FROM data\nUNION SELECT organ FROM data\nUNION SELECT disease FROM data\nUNION SELECT qualifier FROM data;\nEOF\n```
It's a lot of work to go through all those terms and find good ontology terms. I'm going to do that hard work for you (just this once!) so we can keep moving. I'll add this table to `src/term.tsv`:\n\n| id | code | label |\n| ------------------------- | -------------- | ------------------ |\n| obo:NCBITaxon_10116 | RAT | Rattus norvegicus |\n| obo:NCBITaxon_10090 | MOUSE | Mus musculus |\n| ex:F344N | F 344/N | F 344/N |\n| ex:B6C3F1 | B6C3F1 | B6C3F1 |\n| obo:PATO_0000383 | FEMALE | female |\n| obo:PATO_0000384 | MALE | male |\n| obo:OBI_0600020 | HISTOPATHOLOGY | histology |\n| obo:UBERON_0002048 | LUNG | lung |\n| obo:UBERON_0007827 | NOSE | external nose |\n| obo:UBERON_0001235 | ADRENAL CORTEX | adrenal cortex |\n| obo:MPATH_268 | ADENOCARCINOMA | adenocarcinoma |\n| obo:MPATH_212 | INFLAMMATION | inflammation |\n| obo:MPATH_4 | NECROSIS | necrosis |\n| obo:PATO_0000396 | SEVERE | severe intensity |\n| obo:PATO_0000394 | MILD | mild intensity |\n| obo:PATO_0000395 | MODERATE | moderate intensity |\n| orcid:0000-0001-5139-5557 | JAO | James A. Overton |\n\nAnd I'll add these prefixes to `src/prefix.tsv`:\n\n| prefix | base |\n| ------- | ------------------------------------------- |\n| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |\n| rdfs | http://www.w3.org/2000/01/rdf-schema# |\n| xsd | http://www.w3.org/2001/XMLSchema# |\n| owl | http://www.w3.org/2002/07/owl# |\n| obo | http://purl.obolibrary.org/obo/ |\n| orcid | http://orcid.org/ |\n| ex | https://example.com/ |\n| subject | https://example.com/subject/ |\n| group | https://example.com/group/ |\n\nNow I can import these tables into SQL and use the term table in FOREIGN KEY constraints on data:\n\n```sql\n.mode tabs\n\nCREATE TABLE prefix (\n  prefix TEXT PRIMARY KEY,\n  base TEXT UNIQUE\n);\n.import --skip 1 src/prefix.tsv prefix\n\nCREATE TABLE term (\n  id TEXT PRIMARY KEY,\n  code TEXT UNIQUE,\n  label TEXT UNIQUE\n);\n.import --skip 1 src/term.tsv term\n\nCREATE TABLE data(\n  assay_datetime TEXT,\n  investigator TEXT,\n  subject_id TEXT,\n  species TEXT,\n  strain TEXT,\n  sex TEXT,\n  group_id TEXT,\n  protocol TEXT,\n  organ TEXT,\n  disease TEXT,\n  qualifier TEXT,\n  comment TEXT,\n  FOREIGN KEY(investigator) REFERENCES term(code),\n  FOREIGN KEY(species) REFERENCES term(code),\n  FOREIGN KEY(strain) REFERENCES term(code),\n  FOREIGN KEY(sex) REFERENCES term(code),\n  FOREIGN KEY(protocol) REFERENCES term(code),\n  FOREIGN KEY(organ) REFERENCES term(code),\n  FOREIGN KEY(disease) REFERENCES term(code),\n  FOREIGN KEY(qualifier) REFERENCES term(code)\n);\n\n-- copy from data_csv to data\nINSERT INTO data SELECT * FROM data_csv;\n\n-- clean data\nUPDATE data SET investigator='JAO' WHERE investigator='JO';\n\n-- update subject and group IDs\nUPDATE data SET subject_id='ex:subject-' || subject_id;\nUPDATE data SET group_id='ex:group-' || group_id;\n```
I'll update the README:
See `src/` for:\n\n- `prefix.tsv`: shared prefixes\n- `term.tsv`: terminology\n
I'll commit my work in progress:
$ git add src/prefix.tsv src/term.tsv\n$ git add --update\n$ git commit -m \"Add and apply prefix and term tables\"\n
Now all the terms are linked to controlled vocabularies of one sort or another. If I want to see the IDs for those links instead of the \"codes\" I can define a VIEW:
CREATE VIEW linked_data_id AS\nSELECT assay_datetime,\ninvestigator_term.id AS investigator,\nsubject_id,\nspecies_term.id AS species,\nstrain_term.id AS strain,\nsex_term.id AS sex,\ngroup_id,\nprotocol_term.id AS protocol,\norgan_term.id AS organ,\ndisease_term.id AS disease,\nqualifier_term.id AS qualifier\nFROM data\nJOIN term as investigator_term ON data.investigator = investigator_term.code\nJOIN term as species_term ON data.species = species_term.code\nJOIN term as strain_term ON data.strain = strain_term.code\nJOIN term as sex_term ON data.sex = sex_term.code\nJOIN term as protocol_term ON data.protocol = protocol_term.code\nJOIN term as organ_term ON data.organ = organ_term.code\nJOIN term as disease_term ON data.disease = disease_term.code\nJOIN term as qualifier_term ON data.qualifier = qualifier_term.code;\n
I'll check:
$ sqlite3 build/data.db <<< \"SELECT * FROM linked_ids LIMIT 1;\"\n2014-01-01 10:21:00-0500|orcid:0000-0001-5139-5557|ex:subject-12|obo:NCBITaxon_10116|ex:F344N|obo:PATO_0000383|ex:group-1|obo:OBI_0600020|obo:UBERON_0002048|obo:MPATH_268|obo:PATO_0000396\n
I can also define a similar view for their \"official\" labels:
CREATE VIEW linked_data_label AS\nSELECT assay_datetime,\ninvestigator_term.label AS investigator,\nsubject_id,\nspecies_term.label AS species,\nstrain_term.label AS strain,\nsex_term.label AS sex,\ngroup_id,\nprotocol_term.label AS protocol,\norgan_term.label AS organ,\ndisease_term.label AS disease,\nqualifier_term.label AS qualifier\nFROM data\nJOIN term as investigator_term ON data.investigator = investigator_term.code\nJOIN term as species_term ON data.species = species_term.code\nJOIN term as strain_term ON data.strain = strain_term.code\nJOIN term as sex_term ON data.sex = sex_term.code\nJOIN term as protocol_term ON data.protocol = protocol_term.code\nJOIN term as organ_term ON data.organ = organ_term.code\nJOIN term as disease_term ON data.disease = disease_term.code\nJOIN term as qualifier_term ON data.qualifier = qualifier_term.code;\n
I'll check:
$ sqlite3 build/data.db <<< \"SELECT * FROM linked_data_label LIMIT 1;\"\n2014-01-01 10:21:00-0500|James A. Overton|ex:subject-12|Rattus norvegicus|F 344/N|female|ex:group-1|histology|lung|adenocarcinoma|severe intensity\n
I'll commit my work in progress:
$ git add --update\n$ git commit -m \"Add linked_data tables\"\n
Now the tables use URLs and are connected to ontologies and stuff. But are we Linked yet?
"},{"location":"tutorial/linking-data/#7-getting-triples","title":"7. Getting Triples","text":"SQL tables aren't an official Linked Data format. Of all the RDF formats, I prefer Turtle. It's tedious but not difficult to get Turtle out of SQL. These query do what I need them to do, but note that if the literal data contained quotation marks (for instance) then I'd have to do more work to escape those. First I create a triple table:
CREATE TABLE triple (\nsubject TEXT,\npredicate TEXT,\nobject TEXT,\nliteral INTEGER -- 0 for object IRI, 1 for object literal\n);\n\n-- create triples from term table\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT id, 'rdfs:label', label, 1\nFROM term;\n\n-- create triples from data table\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-assay_datetime', assay_datetime, 1\nFROM data;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-investigator', term.id, 0\nFROM data\nJOIN term AS term ON data.investigator = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-subject_id', subject_id, 0\nFROM data;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-species', term.id, 0\nFROM data\nJOIN term AS term ON data.species = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-strain', term.id, 0\nFROM data\nJOIN term AS term ON data.strain = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-sex', term.id, 0\nFROM data\nJOIN term AS term ON data.sex = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-group_id', group_id, 0\nFROM data;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-protocol', term.id, 0\nFROM data\nJOIN term AS term ON data.protocol = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-organ',term.id, 0\nFROM data\nJOIN term AS term ON data.organ= term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-disease', term.id, 0\nFROM data\nJOIN term AS term ON data.disease = term.code;\n\nINSERT INTO triple(subject, predicate, object, literal)\nSELECT 'ex:assay-' || data.rowid, 'ex:column-qualifier', term.id, 0\nFROM data\nJOIN term AS term ON data.qualifier = term.code;\n
Then I can turn triples into Turtle using string concatenation:
SELECT '@prefix ' || prefix || ': <' || base || '> .'\nFROM prefix\nUNION ALL\nSELECT ''\nUNION ALL\nSELECT subject || ' ' ||\npredicate || ' ' ||\nCASE literal\nWHEN 1 THEN '\"' || object || '\"'\nELSE object\nEND\n|| ' . '\nFROM triple;\n
I can add this to the src/build.sh
:
sqlite3 build/data.db < src/turtle.sql > build/data.ttl\n
Here's just a bit of that build/data.ttl
file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\norcid:0000-0001-5139-5557 rdfs:label \"James A. Overton\" .\nex:assay-1 ex:column-assay_datetime \"2014-01-01 10:21:00-0500\" .\nex:assay-1 ex:column-investigator orcid:0000-0001-5139-5557 .\n
SQL is not a particularly expressive language. Building the triple table is straightforward but verbose. I could have done the same thing with much less Python code. (Or I could have been clever and generated some SQL to execute!)
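For comparison, here is a rough sketch of that shorter Python version, reading straight from build/data.db and writing the same kind of flat Turtle. It skips quote escaping and other niceties, and it is my illustration rather than the code this tutorial commits:
import sqlite3\n\nconn = sqlite3.connect('build/data.db')\nterm_iri = {code: iri for iri, code in conn.execute('SELECT id, code FROM term')}\ncoded = {'investigator', 'species', 'strain', 'sex',\n         'protocol', 'organ', 'disease', 'qualifier'}\n\nwith open('build/data.ttl', 'w') as turtle:\n    for prefix, base in conn.execute('SELECT prefix, base FROM prefix'):\n        turtle.write(f'@prefix {prefix}: <{base}> .\\n')\n    turtle.write('\\n')\n    for iri, label in conn.execute('SELECT id, label FROM term'):\n        turtle.write(f'{iri} rdfs:label \"{label}\" .\\n')\n    cursor = conn.execute('SELECT rowid, * FROM data')\n    columns = [c[0] for c in cursor.description][1:]  # skip rowid\n    for row in cursor:\n        assay = f'ex:assay-{row[0]}'\n        for column, value in zip(columns, row[1:]):\n            if not value:\n                continue\n            if column in coded:\n                turtle.write(f'{assay} ex:column-{column} {term_iri[value]} .\\n')\n            elif column in ('subject_id', 'group_id'):\n                turtle.write(f'{assay} ex:column-{column} {value} .\\n')\n            else:  # assay_datetime and comment stay plain literals\n                turtle.write(f'{assay} ex:column-{column} \"{value}\" .\\n')\n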
I'll commit my work in progress:
$ git add src/turtle.sql\n$ git add --update\n$ git commit -m \"Convert to Turtle\"\n
So technically I have a Turtle file. Linked Data! Right? Well, it's kind of \"flat\". It still looks more like a table than a graph.
"},{"location":"tutorial/linking-data/#8-getting-linked","title":"8. Getting Linked","text":"The table I started with is very much focused on the data: there was some sort of assay done, and this is the information that someone recorded about it. The Turtle I just ended up with is basically the same.
Other people may have assay data. They may have tables that they converted into Turtle. So can I just merge them? Technically yes: I can put all these triples in one graph together. But I'll still just have \"flat\" chunks of data representing rows sitting next to other rows, without really linking together.
The next thing I would do with this data is reorganize it based on the things it's talking about. I know that:
Most of these are things that I could point to in the world, or could have pointed to if I was in the right place at the right time.
By thinking about these things, I'm stepping beyond what it was convenient for someone to record, and thinking about what happened in the world. If somebody else has some assay data, then they might have recorded it differently for whatever reason, and so it wouldn't line up with my rows. I'm trying my best to use the same terms for the same things. I also want to use the same \"shapes\" for the same things. When trying to come to an agreement about what is connected to what, life is easier if I can point to the things I want to talk about: \"See, here is the person, and the mouse came from here, and he did this and this.\"
I could model the data in SQL by breaking the big table into smaller tables. I could have tables for:
Then I would convert each table to triples more carefully. That's a good idea. Actually it's a better idea than what I'm about to do...
Since we're getting near the end, I'm going to show you how you can do that modelling in SPARQL. SPARQL has a CONSTRUCT operation that you use to build triples. There's lots of tools that I could use to run SPARQL but I'll use ROBOT. I'll start with the \"flat\" triples in build/data.ttl
, select them with my WHERE clause, then CONSTRUCT better triples, and save them in build/model.ttl
.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\nPREFIX obo: <http://purl.obolibrary.org/obo/>\nPREFIX ex: <https://example.com/>\n\nCONSTRUCT {\n ?group\n rdfs:label ?group_label .\n ?subject\n rdf:type ?species ;\n rdfs:label ?subject_label ;\n ex:strain ?strain ;\n obo:RO_0000086 ?sex ; # has quality\n ex:group ?group .\n ?sex\n rdf:type ?sex_type ;\n rdfs:label ?sex_label .\n ?organ\n rdf:type ?organ_type ;\n rdfs:label ?organ_label ;\n obo:BFO_0000050 ?subject . # part of\n ?assay\n rdf:type ?assay_type ;\n rdfs:label ?assay_label ;\n obo:OBI_0000293 ?subject ; # has specified input\n obo:IAO_0000136 ?organ . # is about\n}\nWHERE {\n ?subject_row\n ex:column-assay_datetime ?datetime ;\n ex:column-investigator ?investigator ;\n ex:column-subject_id ?subject ;\n ex:column-species ?species ;\n ex:column-sex ?sex_type ;\n ex:column-group_id ?group ;\n ex:column-protocol ?assay_type ;\n ex:column-organ ?organ_type ;\n ex:column-disease ?disease ;\n ex:column-qualifier ?qualifier .\n\n ?assay_type\n rdfs:label ?assay_type_label .\n ?sex_type\n rdfs:label ?sex_type_label .\n ?organ_type\n rdfs:label ?organ_type_label .\n\n BIND (URI(CONCAT(STR(?subject), \"-assay\")) AS ?assay)\n BIND (URI(CONCAT(STR(?subject), \"-sex\")) AS ?sex)\n BIND (URI(CONCAT(STR(?subject), \"-organ\")) AS ?organ)\n BIND (CONCAT(\"subject \", REPLACE(STR(?subject), \"^.*-\", \"\")) AS ?subject_label)\n BIND (CONCAT(\"group \", REPLACE(STR(?group), \"^.*-\", \"\")) AS ?group_label)\n BIND (CONCAT(?subject_label, \" \", ?assay_type_label) AS ?assay_label)\n BIND (CONCAT(?subject_label, \" sex: \", ?sex_type_label) AS ?sex_label)\n BIND (CONCAT(?subject_label, \" \", ?organ_type_label) AS ?organ_label)\n}\n
I can add this to the src/build.sh
:
java -jar robot.jar query \\\n--input build/data.ttl \\\n--query src/model.rq build/model.ttl\n
Then I get build/model.ttl
that looks (in part) like this:
ex:subject-31 a obo:NCBITaxon_10090 ;\nrdfs:label \"subject 31\" ;\nobo:RO_0000086 ex:subject-31-sex ;\nex:group ex:group-2 .\n\nex:group-2 rdfs:label \"group 2\" .\n
Now that's what I call Linked Data!
I'll update the README:
## Modelling\n\nThe data refers to:\n\n- investigator\n- subject\n- group\n- assay\n- measurement data\n - subject organ\n - disease\n\nTODO: A pretty diagram.\n
I'll commit my work in progress:
$ git add src/model.rq\n$ git add --update\n$ git commit -m \"Build model.ttl\"\n
"},{"location":"tutorial/linking-data/#9-getting-it-done","title":"9. Getting It Done","text":"That was a lot of work for a small table. And I did all the hard work of mapping the terminology to ontology terms for you!
There's lots more I can do. The SPARQL is just one big chunk, but it would be better in smaller pieces. The modelling isn't all that great yet. Before changing that I want to run it past the boss and see what she thinks.
It's getting close to the end of the day. Before I quit I should update the README, clean up anything that's no longer relevant or correct, and make any necessary notes to my future self:
$ git add --update\n$ git commit -m \"Update README\"\n$ quit\n
"},{"location":"tutorial/managing-dynamic-imports-odk/","title":"Managing Dynamic Imports with the Ontology Development Kit","text":"In this tutorial, we discuss the general workflow of managing dynamic imports, i.e. importing terms from other ontologies which can be kept up to date.
"},{"location":"tutorial/managing-dynamic-imports-odk/#tutorial","title":"Tutorial","text":"Follow instructions for the PATO dynamic import process here.
"},{"location":"tutorial/managing-ontology-project/","title":"Tutorial on Managing OBO Ontology Projects","text":"This tutorial is not about editing ontologies and managing the evolution of its content (aka ontology curation), but the general process of managing an ontology project overall. In this lesson, we will cover the following:
It is important to understand that the following is just one good way of doing project management for OBO ontologies, and most projects will do it slightly differently. We do however believe that thinking about your project management process and the roles involved will benefit your work in the long term, and hope that the following will help you as a starting point.
"},{"location":"tutorial/managing-ontology-project/#roles-in-obo-ontology-project-management-activities","title":"Roles in OBO Ontology project management activities","text":"For an effective management of an ontology, the following criteria are recommended:
Without the above minimum criteria, the following recommendations will be very hard to implement.
"},{"location":"tutorial/managing-ontology-project/#the-project-management-toolbox","title":"The Project Management Toolbox","text":"We make use of three tools in the following recommendation:
Project boards: Project boards, sometimes referred to as Kanban boards, GitHub boards or agile boards, are a great way to organise outstanding tickets and help maintain a clear overview of what work needs to be done. They are usually realised with either GitHub projects or ZenHub. If you have not worked with project boards before, we highly recommend watching a quick tutorial on Youtube, such as:
GitHub teams. GitHub teams, alongside with organisations, are a powerfull too to organise collaborative workflows on GitHub. They allow you to communicate and organise permissions for editing your ontology in a transparent way. You can get a sense of GitHub teams by watching one of the the numerous tutorials on GitHub, such as:
Markdown-based documentation system. Writing great documentation is imperative for a sustainable project. Across many of our recent projects, were are using mkdocs, which we have also integrated with the Ontology Development Kit, but there are others to consider. We deeply recommend to complete a very short introduction to Markdown, this tutorial on YouTube.
"},{"location":"tutorial/managing-ontology-project/#what-do-you-need-for-your-project","title":"What do you need for your project?","text":"Every ontology or group of related ontologies (sometimes it is easier to manage multiple ontologies at once, because their scope or technical workflows are quite uniform or they are heavily interrelated) should have:
To Do
(issues that are important but not urgent), Priority
(issues that are important and urgent), In Progress
(issues that are being worked on) and Under review
(issues that need review). From years of experience with project boards, we recommend against the common practice of keeping a Backlog
column (issues that are neither important nor urgent nor likely to be addressed in the next 6 months), nor a Done
column (to keep track of closed issues) - they just clutter the view.mkdocs
in OBO projects) with a page listing the members of the team (example). This page should provide links to all related team pages from Github and their project boards, as well as a table listing all current team members with the following information:To Do
and Priority
columns of the Technical Team. The later is important: it is the job of the curation team to prioritise the technical issues. The Technical Team can add tickets to the To Do
and Priority
columns, but this usually happens only in response to a request from the Curation Team.Priority
tickets. The Technical Team is responsible toPriority
to the In Progress
and later to the Done
section.Priority
issues.To Do
issues should first be moved to the Priority
section before being addressed. This prevents focusing on easy to solve tickets in favour of important ones.Backlog
items are not added at all to the board - if they ever become important, they tend to resurface all by themselves.main
(formerly master
) branch should be write protected with suitable rules. For example, requiring QC to pass and 1 approving review as a minimum.In this tutorial, we discuss the general workflow of ontology releases.
"},{"location":"tutorial/managing-ontology-releases-odk/#tutorial","title":"Tutorial","text":"Follow instructions for the PATO release process here.
"},{"location":"tutorial/migrating-ontology-to-odk/","title":"Migrating your old Ontology Release System to the Ontology Development Kit","text":"Content TBP, recording exists on request.
"},{"location":"tutorial/monarch-kg-neo4j-basics/","title":"Neo4j tutorial","text":""},{"location":"tutorial/monarch-kg-neo4j-basics/#running-locally-your-very-own-monarch-graph","title":"Running locally (your very own Monarch Graph)","text":"The new Monarch Knowledge Graph has a more streamlined focus on the core Monarch data model, centering on Diseases, Phenotypes and Genes and the associations between them. This has the benefit of being a graph that can be build in 2 hours instead of 2 days, and that you can run locally on your laptop.
Note: As of the writing of this tutorial, (Feb 2023), the graph is just starting to move from its initial construction phrase into real use, and so there are still bugs to find. Some of which show up in this tutorial.
"},{"location":"tutorial/monarch-kg-neo4j-basics/#check-out-the-repository","title":"Check out the repository","text":"https://github.com/monarch-initiative/monarch-neo4j
"},{"location":"tutorial/monarch-kg-neo4j-basics/#download-data","title":"Download Data","text":"dumps
directorycopy dot_env_template to .env and edit the values to look like:
# This Environment Variable file is referenced by the docker-compose.yaml build\n\n# Set this variable to '1' to trigger an initial loading of a Neo4j dump\nDO_LOAD=1\n\n# Name of Neo4j dump file to load, assumed to be accessed from within\n# the 'dumps' internal Volume path within the Docker container\nNEO4J_DUMP_FILENAME=monarch-kg.neo4j.dump\n
That should mean uncommenting DO_LOAD and NEO4j_DUMP_FILENAME
"},{"location":"tutorial/monarch-kg-neo4j-basics/#optional-plugin-setup","title":"Optional Plugin Setup","text":"You may wish to install additional plugins#### Download plugins * Download the [APOC plugin jar file](https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.4.0.13/apoc-4.4.0.13-all.jar) and put in the `plugins` directory * Download, the [GDS plugin](https://graphdatascience.ninja/neo4j-graph-data-science-2.3.0.zip), unzip the download and copy jar file to the `plugins` directory #### Environment setup In addition to the changes above to .env, you will need to uncomment the following lines in the .env file:
NEO4J_apoc_export_file_enabled=true\nNEO4J_apoc_import_file_enabled=true\nNEO4J_apoc_import_file_use__neo4j__config=true\nNEO4JLABS_PLUGINS=\\[\\\"apoc\\\", \\\"graph-data-science\\\"\\]\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#tutorials","title":"Tutorials","text":""},{"location":"tutorial/monarch-kg-neo4j-basics/#monarch-obo-training-tutorials","title":"Monarch OBO training Tutorials","text":""},{"location":"tutorial/monarch-kg-neo4j-basics/#querying-the-monarch-kg-using-neo4j","title":"Querying the Monarch KG using Neo4J","text":""},{"location":"tutorial/monarch-kg-neo4j-basics/#start-neo4j","title":"Start Neo4j","text":"On the command line, from the root of the monarch-neo4j repository you can launch the neo4j with:
docker-compose up\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#querying","title":"Querying","text":""},{"location":"tutorial/monarch-kg-neo4j-basics/#return-details-for-a-single-disease","title":"Return details for a single disease","text":"Nodes in a cypher query are expressed with ()
and the basic form of a query is MATCH (n) RETURN n
. To limit the results to just our disease of interest, we can restrict by a property, in this case the id
property.
MATCH (d {id: 'MONDO:0007038'}) RETURN d\n
This returns a single bubble, but by exploring the controls just to the left of the returned query, you can see a json or table representation of the returned node.
{\n\"identity\": 480388,\n\"labels\": [\n\"biolink:Disease\",\n\"biolink:NamedThing\"\n],\n\"properties\": {\n\"name\": \"Achoo syndrome\",\n\"provided_by\": [\n\"phenio_nodes\"\n],\n\"id\": \"MONDO:0007038\",\n\"category\": [\n\"biolink:Disease\"\n]\n},\n\"elementId\": \"480388\"\n}\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#connections-out-from-our-disease","title":"Connections out from our disease","text":"Clicking back to the graph view, you can expand to see direct connections out from the node by clicking on the node and then clicking on the graph icon. This will return all nodes connected to the disease by a single edge.
Tip: the node images may not be labeled the way you expect. Clicking on the node reveals a panel on the right, clicking on that node label at the top of the panel will reveal a pop-up that lets you pick which property is used as the caption in the graph view.
"},{"location":"tutorial/monarch-kg-neo4j-basics/#querying-for-connections-out-from-our-disease","title":"Querying for connections out from our disease","text":"In cypher, nodes are represented by ()
and edges are represented by []
in the form of ()-[]-()
, and your query is a little chance to express yourself with ascii art. To get the same results as the expanded graph view, you can query for any edge connecting to any node. Note that the query also asks for the connected node to be returned.
MATCH (d {id: 'MONDO:0007038'})-[]-(n) RETURN d, n\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#expanding-out-further-and-restricting-the-relationship-direction","title":"Expanding out further and restricting the relationship direction","text":"It's possible to add another edge to the query to expand out further. In this case, we're adding a second edge to the query, and restricting the direction of the second edge to be outgoing. This will return all nodes connected to the disease by a single edge, and then all nodes connected to those nodes by a single outgoing edge. It's important to note that without limiting the direction of the association, this query will traverse up, and then back down the subclass tree.
MATCH (d {id: 'MONDO:0007038'})-[]->(n)-[]->(m) RETURN d,n,m\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#exploring-the-graph-schema","title":"Exploring the graph schema","text":"Sometimes, we don't know what kind of questions to ask without seeing the shape of the data. Neo4j provides a graph representation of the schema by calling a procedure
CALL db.schema.visualization\n
If you tug on nodes and zoom, you may find useful information, but it's not a practical way to explore the schema.
"},{"location":"tutorial/monarch-kg-neo4j-basics/#whats-connected-to-a-gene","title":"What's connected to a gene?","text":"We can explore the kinds of connections available for a given category of node. Using property restriction again, but this time instead of restricting by the ID, we'll restrict by the category. Also, instead of returning nodes themselves, we'll return the categories of those nodes.
MATCH (g:`biolink:Gene`)-[]->(n) RETURN DISTINCT labels(n)\n
Tip: the DISTINCT
keyword is used to remove duplicate results. In this case, we're only interested in the unique categories of nodes connected to genes.
Expanding on the query above, we can also return the type of relationship connecting the gene to the node.
MATCH (g:`biolink:Gene`)-[rel]->(n) RETURN DISTINCT type(rel), labels(n)\n
Which returns tabular data like:
\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502\"type(rel)\" \u2502\"labels(n)\" \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502\"biolink:located_in\" \u2502[\"biolink:NamedThing\",\"biolink:CellularComponent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:part_of\" \u2502[\"biolink:NamedThing\",\"biolink:MacromolecularComplexMixin\"]\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:enables\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:actively_involved_in\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:colocalizes_with\" \u2502[\"biolink:NamedThing\",\"biolink:CellularComponent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:active_in\" \u2502[\"biolink:NamedThing\",\"biolink:CellularComponent\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:actively_involved_in\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:contributes_to\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:orthologous_to\" \u2502[\"biolink:NamedThing\",\"biolink:Gene\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:participates_in\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:interacts_with\" \u2502[\"biolink:NamedThing\",\"biolink:Gene\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_phenotype\" \u2502[\"biolink:NamedThing\",\"biolink:GeneticInheritance\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_phenotype\" \u2502[\"biolink:NamedThing\",\"biolink:PhenotypicQuality\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:risk_affected_by\" \u2502[\"biolink:NamedThing\",\"biolink:Disease\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:gene_associated_with_condition\" \u2502[\"biolink:NamedThing\",\"biolink:Disease\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_phenotype\" \u2502[\"biolink:NamedThing\",\"biolink:ClinicalModifier\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_positive_effect\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:risk_affected_by\" \u2502[\"biolink:NamedThing\",\"biolink:Gene\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:gene_associated_with_condition\" \u2502[\"biolink:NamedThing\",\"biolink:Gene\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within_positive_effect\"\u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_mode_of_inheritance\" \u2502[\"biolink:NamedThing\",\"biolink:GeneticInheritance\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_negative_effect\" \u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_positive_effect\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within_negative_effect\"\u2502[\"biolink:NamedThing\",\"biolink:Occurrent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_phenotype\" \u2502[\"biolink:NamedThing\",\"biolink:PhenotypicFeature\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within_negative_effect\"\u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_or_within_positive_effect\"\u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\",\"biolink:GrossAnatomicalStructure\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\",\"biolink:AnatomicalEntity\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:acts_upstream_of_negative_effect\" \u2502[\"biolink:NamedThing\",\"biolink:Pathway\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\",\"biolink:Cell\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:located_in\" \u2502[\"biolink:NamedThing\",\"biolink:MacromolecularComplexMixin\"]\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\",\"biolink:CellularComponent\"] \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\",\"biolink:MacromolecularComplexMixin\"]\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:part_of\" \u2502[\"biolink:NamedThing\",\"biolink:CellularComponent\"] 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:expressed_in\" \u2502[\"biolink:NamedThing\"] \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Note: the DISTINCT keyword only removes duplicates when the entire result row is identical. In this case, we're interested in the unique combinations of relationship type and node category.
"},{"location":"tutorial/monarch-kg-neo4j-basics/#kinds-of-associations-between-two-entity-types","title":"Kinds of associations between two entity types","text":"Further constraining on the type of the connecting node, we can ask what kinds of associations exist between two entity types. For example, what kinds of associations exist between genes and diseases?
MATCH (g:`biolink:Gene`)-[rel]->(n:`biolink:Disease`) RETURN DISTINCT type(rel)\n
\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502\"type(rel)\" \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502\"biolink:gene_associated_with_condition\"\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:risk_affected_by\" \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#diseases-associated-with-a-gene","title":"Diseases associated with a gene","text":"MATCH (g:`biolink:Gene`{id:\"HGNC:1100\"})-[]-(d:`biolink:Disease`) RETURN g,d\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#phenotypes-associated-with-diseases-associated-with-a-gene","title":"Phenotypes associated with diseases associated with a gene","text":"MATCH (g:`biolink:Gene`{id:\"HGNC:1100\"})-[]->(d:`biolink:Disease`)-[]->(p:`biolink:PhenotypicFeature`) RETURN g,d,p\n
Why doesn't this return results? This is a great opportunity to track down an unexpected problem.
First, try a less constrained query, so that the 3rd node can be anything:
MATCH (g:`biolink:Gene`{id:\"HGNC:1100\"})-[]->(d:`biolink:Disease`)-[]->(p) RETURN g,d,p\n
With a little tugging and stretching, a good picture emerges, and clicking on the phenotype bubbles shows that they are typed as PhenotypicQuality rather than PhenotypicFeature. This is likely a bug, but a sensible alternative for the same intent might be:
MATCH (g:`biolink:Gene`{id:\"HGNC:1100\"})-[]->(d:`biolink:Disease`)-[:`biolink:has_phenotype`]->(p) RETURN g,d,p\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#recursive-traversal","title":"Recursive traversal","text":"Sometimes, we don't know the specific number of hops. What if we want to answer the question \"What genes affect the risk for an inherited auditory system disease?\"
First, let's find out how diseases are connected to one another. Name the relationship variable so that we can query for just the predicates.
MATCH (d:`biolink:Disease`)-[rel]-(d2:`biolink:Disease`) RETURN DISTINCT type(rel)\n
\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502\"type(rel)\" \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502\"biolink:subclass_of\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:related_to\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:associated_with\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:has_phenotype\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:gene_associated_with_condition\"\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:risk_affected_by\" \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
(Please ignore biolink:gene_associated_with_condition and biolink:risk_affected_by showing up here; those are due to a bug in our OMIM ingest.)
We'll construct a query that fixes the superclass disease, connects at any distance to any subclass of that disease, and then brings in the genes that affect risk for those diseases. To avoid returning a big hairball of a graph, we can return the results as a table showing the diseases and genes.
MATCH (d:`biolink:Disease`{id:\"MONDO:0002409\"})<-[:`biolink:subclass_of`*]-(d2:`biolink:Disease`)<-[`biolink:risk_affected_by`]-(g:`biolink:Gene`) RETURN d.id, d.name, d2.id, d2.name,g.symbol,g.id\n
Once you trust the query, you can also use the DISTINCT keyword again to focus on just the gene list:
MATCH (d:`biolink:Disease`{id:\"MONDO:0002409\"})<-[:`biolink:subclass_of`*]-(d2:`biolink:Disease`)<-[:`biolink:risk_affected_by`]-(g:`biolink:Gene`) RETURN DISTINCT g.id\n
"},{"location":"tutorial/monarch-kg-neo4j-basics/#gene-to-gene-associations","title":"Gene to Gene Associations","text":"First, we can ask what kind of associations we have between genes.
MATCH (g:`biolink:Gene`)-[rel]->(g2:`biolink:Gene`) RETURN DISTINCT type(rel)\n
\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502\"type(rel)\" \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502\"biolink:orthologous_to\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:interacts_with\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:risk_affected_by\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"biolink:gene_associated_with_condition\"\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Again, please ignore biolink:gene_associated_with_condition
and biolink:risk_affected_by
.
Let's say that from the list above, we're super interested in the DIABLO gene, because, obviously, it has a cool name. We can find its orthologues by querying through the biolink:orthologous_to
relationship.
MATCH (g {id:\"HGNC:21528\"})-[:`biolink:orthologous_to`]-(o:`biolink:Gene`) RETURN g,o
We can then make the question more interesting by finding phenotypes associated with these orthologues.
MATCH (g {id:\"HGNC:21528\"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:has_phenotype`]->(p) RETURN g,og,p\n
That was a dead end. What about gene expression?
MATCH (g {id:\"HGNC:21528\"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:expressed_in`]->(a) RETURN g,og,a\n
We can take this one step further by connecting our gene expression list to UBERON terms:
MATCH (g {id:\"HGNC:21528\"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:expressed_in`]->(a)-[`biolink:subclass_of`]-(u) WHERE u.id STARTS WITH 'UBERON:'\nRETURN distinct u.id, u.name\n
In particular, it's a nice confirmation to see that we started at the high level MONDO term \"inherited auditory system disease\", passed through subclass relationships to more specific diseases, connected to genes that affect risk for those diseases, focused on a single gene, and were able to find that it is expressed in the cochlea.
\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502\"u.id\" \u2502\"u.name\" \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502\"UBERON:0000044\"\u2502\"dorsal root ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0000151\"\u2502\"pectoral fin\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0000948\"\u2502\"heart\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0000961\"\u2502\"thoracic ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001017\"\u2502\"central nervous system\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001555\"\u2502\"digestive tract\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001675\"\u2502\"trigeminal ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001700\"\u2502\"geniculate ganglion\" 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001701\"\u2502\"glossopharyngeal ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001844\"\u2502\"cochlea\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001991\"\u2502\"cervical ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0002107\"\u2502\"liver\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0002441\"\u2502\"cervicothoracic ganglion\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0003060\"\u2502\"pronephric duct\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0003922\"\u2502\"pancreatic epithelial bud\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0004141\"\u2502\"heart tube\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0004291\"\u2502\"heart rudiment\" 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0005426\"\u2502\"lens vesicle\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0007269\"\u2502\"pectoral appendage musculature\"\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0019249\"\u2502\"2-cell stage embryo\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0000965\"\u2502\"lens of camera-type eye\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0001645\"\u2502\"trigeminal nerve\" \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502\"UBERON:0003082\"\u2502\"myotome\" \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
"},{"location":"tutorial/odk-toolbox/","title":"Using the ODK Toolbox","text":"This tutorial will show you how to use the tools that are made available by the ODK Docker images, independently of an ODK-generated repository and of ODK-managed workflows.
"},{"location":"tutorial/odk-toolbox/#prerequisites","title":"Prerequisites","text":"You have:
You know:
Let\u2019s check which Docker images, if any, are available in your Docker installation:
$ docker images\nREPOSITORY TAG IMAGE ID CREATED SIZE\n
Here, the listing comes up empty, meaning there are no images at all. This is what you would expect if you have just installed Docker and have yet to do anything with it.
Let\u2019s download the main ODK image:
$ docker pull obolibrary/odkfull\nUsing default tag: latest\nlatest: Pulling from obolibrary/odkfull\n[\u2026 Output truncated for brevity \u2026]\nDigest: sha256:272d3f788c18bc98647627f9e6ac7311ade22f35f0d4cd48280587c15843beee\nStatus: Downloaded newer image for obolibrary/odkfull:latest\ndocker.io/obolibrary/odkfull:latest\n
Let\u2019s see the images list again:
$ docker images\nREPOSITORY TAG IMAGE ID CREATED SIZE\nobolibrary/odkfull latest 0947360954dc 6 months ago 2.81GB\n
Docker images can exist in several versions, which are called tags in Docker parlance. In our pull
command, since we have not specified any tag, Docker automatically defaulted to the latest
tag, which by convention is the latest ODK release.
To download a specific version, append the tag after the image name (you can check which tags are available on DockerHub). For example, let\u2019s download the 1.3.1 release from June 2022:
$ docker pull obolibrary/odkfull:v1.3.1\nv1.3.1: Pulling from obolibrary/odkfull\nDigest: sha256:272d3f788c18bc98647627f9e6ac7311ade22f35f0d4cd48280587c15843beee\nStatus: Downloaded newer image for obolibrary/odkfull:v1.3.1\ndocker.io/obolibrary/odkfull:v1.3.1\n
Again, let\u2019s see the output of docker images
:
$ docker images\nREPOSITORY TAG IMAGE ID CREATED SIZE\nobolibrary/odkfull latest 0947360954dc 6 months ago 2.81GB\nobolibrary/odkfull v1.3.1 0947360954dc 6 months ago 2.81GB\n
Note how both the latest
and the v1.3.1
images have the same ID. This is because, at the time of this writing, the 1.3.1 release is the latest ODK release, so the latest
tag actually points to the same image as the v1.3.1
tag. This will change when the ODK v1.3.2 is released: then, using latest
(explicitly or by not specifying any tag at all) will point to the new release, while v1.3.1
will forever continue to point to the June 2022 release.
In the rest of this tutorial, we will always use the latest
image, and so we will dispense with the explicit tag. But remember that anywhere you see obolibrary/odkfull
in one of the commands below, you can always use obolibrary/odkfull:TAG
to force Docker to use a specific ODK version.
Now that we have the ODK image available, let\u2019s try to start it. The command for that is docker run
, which has the following syntax:
docker run [OPTIONS] <IMAGE> [COMMAND [ARGUMENTS...]]\n
where IMAGE
is the name of the image to use (in our case, always obolibrary/odkfull
).
With the ODK, you will always need the --rm
option. It instructs the Docker engine to automatically remove the container it creates to run a command, once that command terminates. (Not using the --rm
option would cause those \u201cspent\u201d containers to accumulate on your system, ultimately forcing you to manually remove them with the docker container rm
command.)
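If you ever do end up with such leftover containers, two standard Docker commands are enough to spot and remove them (shown here as a hint only; this is plain Docker, not something specific to the ODK):
$ docker container ls -a     # list all containers, including stopped ones
$ docker container prune     # remove all stopped containers (asks for confirmation)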
If we don\u2019t specify an explicit command, the simplest command line we can have is thus:
$ docker run --rm obolibrary/odkfull\nUsage: odk.py [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n create-dynfile For testing purposes\n create-makefile For testing purposes\n dump-schema Dumps the python schema as json schema.\n export-project For testing purposes\n seed Seeds an ontology project\n$\n
In the absence of an explicit command, the default command odk.py
is automatically invoked by Docker. Since it has been invoked without any argument, odk.py
does nothing but print its \u201cusage\u201d message before terminating. When it terminates, the Docker container terminates as well, and we are back at the terminal prompt.
To invoke one of the tools available in the toolbox (we\u2019ll see what those tools are later in this document), just complete the command line as needed. For example, to test that ROBOT is there (and to see which version we have):
$ docker run --rm obolibrary/odkfull robot --version\nROBOT version 1.9.0\n
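If you need this check to run against a specific ODK release rather than whatever latest currently points to, the same invocation works with an explicit tag (a minimal illustration, reusing the v1.3.1 tag pulled earlier):
$ docker run --rm obolibrary/odkfull:v1.3.1 robot --version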
"},{"location":"tutorial/odk-toolbox/#accessing-your-files-from-within-the-container","title":"Accessing your files from within the container","text":"Since we have ROBOT, let\u2019s use it. Move to a directory containing some ontology files (here, I\u2019ll use a file from the Drosophila Anatomy Ontology, because if you have to pick an ontology, why not picking an ontology that describes the One True Model Organism?).
$ ls\nfbbt.obo\n
We want to convert that OBO file to a file in, say, the OWL Functional Syntax. So we call ROBOT with the appropriate command and options:
$ docker run --rm obolibrary/odkfull robot convert -i fbbt.obo -f ofn -o fbbt.ofn\norg.semanticweb.owlapi.io.OWLOntologyInputSourceException: java.io.FileNotFoundException: fbbt.obo (No such file or directory)\nUse the -vvv option to show the stack trace.\nUse the --help option to see usage information.\n
Huh? Why the \u201cNo such file or directory\u201d error? We just checked that fbbt.obo
is present in the current directory, why can\u2019t ROBOT find it?
Because Docker containers run isolated from the rest of the system \u2013 that\u2019s kind of the entire point of such containers in general! From within a container, programs can, by default, only access files from the image from which the container has been started.
For the ODK Toolbox to be at all useful, we need to explicitly allow the container to access some parts of our machine. This is done with the -v
option, as in the following example:
$ docker run --rm -v /home/alice/fbbt:/work [\u2026rest of the command omitted for now\u2026]\n
This -v /home/alice/fbbt:/work
has the effect of binding the directory /home/alice/fbbt
from our machine to the directory /work
inside the container. This means that if a program that runs within the container tries to have a look at the /work
directory, what this program will actually see is the contents of the /home/alice/fbbt
directory. Figuratively, the -v
option opens a window in the container\u2019s wall, allowing programs inside the container to see parts of what\u2019s outside.
With that window, and assuming our fbbt.obo
file is within the /home/alice/fbbt
directory, we can try again invoking the conversion command:
$ docker run --rm -v /home/alice/fbbt:/work obolibrary/odkfull robot convert -i /work/fbbt.obo -f ofn -o /work/fbbt.ofn\n$ ls\nfbbt.obo\nfbbt.ofn\n
This time, ROBOT was able to find our fbbt.obo
file, and to convert it as we asked.
We can slightly simplify the last command line in two ways.
First, instead of explicitly specifying the full pathname to the current directory (/home/alice/fbbt
), we can use the shell variable $PWD
, which is automatically expanded to that pathname: -v $PWD:/work
.
Second, to avoid having to explicitly refer to the /work
directory in the command, we can ask the Docker engine to run our command as if the current directory, within the container, was already /work
. This is done with the -w /work
option.
The command above now becomes:
$ docker run --rm -v $PWD:/work -w /work obolibrary/odkfull robot convert -i fbbt.obo -f ofn -o fbbt.ofn\n
This is the typical method of invoking a tool from the ODK Toolbox to work on files from the current directory.
In fact, this is exactly how the src/ontology/run.sh
wrapper script, which is automatically created in an ODK-generated repository, works. If you work with an ODK-managed ontology, you can invoke an arbitrary ODK tool by using the run.sh
instead of calling docker run
yourself. Assuming for example that you already are in the src/ontology
directory of an ODK-managed ontology, you could use:
./run.sh robot convert -i fbbt.obo -f ofn -o fbbt.ofn\n
If you want to use the ODK toolbox with ontologies that are not managed by the ODK (so, where a run.sh
script is not readily available), you can set up an independent wrapper script, as explained in the Setting up the ODK tutorial.
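To give a rough idea, such a wrapper can be as small as the following sketch (the name odk.sh is arbitrary, and the linked tutorial remains the authoritative reference):
#!/bin/sh
# odk.sh -- run any ODK toolbox command against the files in the current directory,
# mirroring the docker run invocation shown above
docker run --rm -ti -v "$PWD":/work -w /work obolibrary/odkfull "$@"
You would then call it just like run.sh, for example ./odk.sh robot --version.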
If you have several commands to invoke in a row involving files from the same directory, you do not have to repeatedly invoke docker run
once for each command. Instead, you can invoke a shell, from which you will be able to run successively as many commands as you need:
$ docker run --rm -ti -v $PWD:/work -w /work obolibrary/odkfull bash\nroot@c1c2c80c491b:/work#
The -ti
options allow you to use your current terminal to control the shell that is started within the container. This is confirmed by the modified prompt that you can see above, which indicates that you are now \u201cin\u201d the container. You can now directly use all the tools that you need:
root@c1c2c80c491b:/work# robot convert -i fbbt.obo -f owx -o fbbt.owl\nroot@c1c2c80c491b:/work# Konclude consistency -i fbbt.owl\n{info} 18:21:14.543 >> Starting Konclude \u2026\n[\u2026]\n{info} 18:21:16.949 >> Ontology \u2018out.owl\u2019 is consistent.\nroot@c1c2c80c491b:/work#\n
When you are done, exit the shell by hitting Ctrl-D
or with the exit
command. The shell will terminate, and with it, the container will terminate as well, sending you back to your original terminal.
Now that you know how to invoke any tool from the ODK Toolbox, here\u2019s a quick overview of which tools are available.
For a definitive list, the authoritative source is the ODK repository, especially the Dockerfile
and requirements.txt.full
files. And if you miss a tool that you think should be present in the toolbox, don\u2019t hesitate to open a ticket to suggest that the tool be added in a future ODK release!
The goal of this tutorial is to quickly showcase key ODK workflows. It is not geared towards explaining individual steps in detail. For a much more detailed tutorial on creating a fresh ODK repo, see here for a tutorial on setting up your first workflow. We recommend completing that tutorial before attempting this one.
"},{"location":"tutorial/odk-tutorial-2/#tutorial","title":"Tutorial","text":"This is some useful background from the ICBO 2022 OBO Tutorial:
"},{"location":"tutorial/odk-tutorial-2/#seeding","title":"Seeding","text":"cato-odk.yaml
change github_org
to your GitHub username. If you dont do this, some ODK features wont work perfectly, like documentation. github_org: matentzn\nrepo: cat-ontology\n
curl https://raw.githubusercontent.com/INCATools/ontology-development-kit/v1.3.1/seed-via-docker.sh | bash -s -- --clean -C cato-odk.yaml\n
Let us now import planned process:
src/ontology/imports/cob_terms.txt
in your favourite text editorCOB:0000082
to the term file (this is the planned process
class in COB).src/ontology
directory, run sh run.sh make refresh-cob
.src/ontology/cato-odk.yaml
, locate the entry for importing cob
and switch it to a different module type: filter
. import_group:\n products: \n - id: ro\n - id: cob\n module_type: filter\n
sh run.sh make update_repo
to apply the changes. Check out the git diff to the Makefile
to convince yourself that the new extraction method has been applied.src/ontology
directory, run sh run.sh make refresh-cob
. Convince yourself that now only the planned process
term is imported.Makefile
, cato-odk.yaml
, imports/cob_terms.txt
and imports/cob_import.owl
.Great, we have done our change, now we are ready to make a release!
main
branch in git
.git pull
).src/ontology
execute the release workflow: sh run.sh make prepare_release_fast
(we are using fast
release here which skips refreshing imports again - we just did that).planned process
class has been added to all ontology release artefacts.v2022-09-01
. Note the leading v
. Select the correct date (the date you made the release, YYYY-MM-dd
). Fill in all the other form elements as you see fit. Click Publish release
.With our ODK setup, we also have a completely customisable documentation system installed. We just need to do a tiny change to the GitHub pages settings:
Build and deployment
select Deploy from branch
.gg-pages
as the branch (this is where ODK deploys to), and /(root)
as the directory. Save
.Actions
in the main menu to follow the build process).Pages
section in Settings
. You should see a button Visit site
. Click on it. If everything went correctly, you should see your new page: github_org
, see seeding). If you have not configured your repo, go to the GitHub front page of your repo, go into the docs
directory, click on index.md
and edit it from here. Make a small random edit.main
or do it properly, create a branch, PR, ask for reviews, merge.That's it! In about 20 minutes, we
A project ontology, sometimes and controversially referred to as an application ontology, is an ontology which is composed of other ontologies for a particular use case, such as Natural Language Processing applications, Semantic Search and Knowledge Graph integration. A defining feature of a project ontology is that it is not intended to be used as a domain ontology. Concretely, this means that content from project ontologies (such as terms or axioms) should never be re-used by domain ontologies. Project ontology developers have the freedom to slice & dice, delete and add relationships, change labels etc. as their use case demands. Such processing is usually minimal, and in a well-developed environment such as OBO, new project ontology-specific terms are kept to a minimum.
In this tutorial, we discuss the fundamental building blocks of application ontologies and show you how to build one using the Ontology Development Kit as one of several options.
"},{"location":"tutorial/project-ontology-development/#prerequisites","title":"Prerequisites","text":"There are a few reasons for developing project ontologies. Here are two that are popular in our domain:
Any application ontology will be concerned with at least 3 ingredients:
MONDO:123, MONDO:231
MONDO:123, incl. all children
MONDO:123, incl. all terms that are in some way logically related to MONDO:123
There are five phases of project ontology development, which we will discuss in detail in this section:
There are other concerns, like continuous integration (basically making sure that changes to the seed or project ontology pipelines do not break anything) and release workflows, which are no different from those of any other ontology.
"},{"location":"tutorial/project-ontology-development/#managing-the-seed","title":"Managing the seed","text":"As described above, the seed is the set of terms that should be extracted from the source ontologies into the project ontology. The seed comprises any of the following:
MONDO:0000001
all, children, descendants, ancestors, annotations
Users of ODK will be mostly familiar with term files located in the imports directory, such as src/ontology/imports/go_terms.txt
. Selectors are usually hidden from the user by the ODK build system, but they become much more important when building project ontologies.
Regardless of which system you use to build your project ontology, it makes sense to carefully plan your seed management. In the following, we will discuss some examples:
It makes sense to document your seed management plan. You should usually account for the possibility of changes (terms being added or removed) during the design phase.
"},{"location":"tutorial/project-ontology-development/#extracting-modules","title":"Extracting modules","text":"Module extraction is the process for selecting an appropriate subset from an ontology. There are many ways to extracting subsets from an ontology:
You can consult the ROBOT documentation for some details on module extraction.
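To make this concrete, here is roughly what two common approaches look like on the command line (a sketch only: mondo.owl and seed.txt are placeholder file names, and the right method and selectors depend on your use case):
# SLME module built around the terms listed in seed.txt
robot extract --method BOT --input mondo.owl --term-file seed.txt --output mondo_module.owl
# Simpler filter module: keep the seed terms plus their descendants and annotations
robot filter --input mondo.owl --term-file seed.txt --select "self descendants annotations" --output mondo_slim.owl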
Let's be honest - none of these module extraction techniques are really ideal for project ontologies. SLME modules are typically used for domain ontology development to ensure logical consistency with imported ontologies, but otherwise contain too much information (for most project ontology use cases). ROBOT filter has a hard time dealing with closures of existential restrictions: for example, you can't be sure that, if you import \"endocardial endothelium\" and \"heart\" using filter, the former is still a part of the latter (it is only indirectly a part) - a lot of research and work has been going on to make this easier. The next version of ROBOT (1.8.5) is going to contain a new module extraction command which will ensure that such links are not broken.
One of the design confusions in this part of the process is that most use cases of application ontologies really do not care at all about OWL. Remember, OWL really only matters for the design of domain ontologies, to ensure a consistent representation of the domain and enable reasoning-based classification. So it is, at least slightly, unsatisfactory that we have to use OWL tools to do something that may as well be done by something simpler, more akin to \"graph-walking\".
"},{"location":"tutorial/project-ontology-development/#managing-metadata-and-customisations","title":"Managing metadata and customisations","text":"Just like any other ontology, a project ontology should be well annotated according to the standards of FAIR Semantics, for example using the OBO Foundry conventions. In particular, project ontologies should be
Furthermore, it is often necessary to add additional terms to the ontology which are not covered by other upstream ontologies. Here we need to distinguish two cases:
With our OBO hat on, if you start adding terms \"quickly\", you should develop a procedure to get these terms into suitable upstream ontologies at a later stage. This is not so much a necessity as a matter of \"open data ethics\": if you use other people's work to make your life easier, it's good to give back!
Lastly, our use cases sometimes require us to add additional links between the terms in our ontologies. For example, we may have to add subClassOf links between classes of different ontologies that cover the same domain, or we may want to add additional information. As with \"quickly adding terms\", if the information is generally useful, you should consider adding it to the respective upstream source ontology (synonyms of disease terms from Mondo, for example). We often manage such axioms as ROBOT templates and curate them as simple-to-read tables.
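For instance, a handful of extra subClassOf links can be curated in a small table and turned into an OWL component with ROBOT's template command (a sketch; custom_links.tsv and the output path are placeholders, and the table follows the usual ROBOT template conventions, e.g. an ID column and an SC % column for the superclass):
# Turn the curated table of extra links into an OWL component for merging
robot template --template custom_links.tsv --output components/custom_links.owl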
"},{"location":"tutorial/project-ontology-development/#merging-and-post-processing","title":"Merging and post-processing","text":"Just like with most ontologies, the last part of the process is merging the various pieces (modules from external sources, customisations, metadata) together into a single whole. During this phase a few things can happen, but these are the most common ones:
One thing to remember is that you are not building a domain ontology. You are usually not concerned with typical issues in ontology engineering, such as logical consistency (or coherence, i.e. the absence of unsatisfiable classes). The key to validating an application ontology comes from its intended use case: can the ontology deliver the use case it promised? There are many approaches to ensure that, chief among them competency questions. What we usually do is express competency questions as SPARQL queries and ensure that there is at least one result. For example, for one of the project ontologies the author is involved with (CPONT), we have developed a synthetic data generator, which we combine with the ontology to ask questions such as: \"Give me all patients which have a recorded diagnosis of scoliosis\" (SPARQL). So the ontology does a \"good job\" if it is able to return, say, at least 100 patients in our synthetic data for which we know that they are diagnosed with scoliosis or one of its subtypes.
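Such checks are easy to automate: run the competency question with ROBOT and fail the build if it comes back empty (a sketch; the query file and ontology file names are placeholders):
# Run a competency question against the merged project ontology
robot query --input cpont-merged.owl --query cq_scoliosis_patients.sparql results.tsv
# Fail if nothing beyond the header row came back
[ "$(wc -l < results.tsv)" -gt 1 ] || { echo "Competency question returned no results"; exit 1; }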
"},{"location":"tutorial/project-ontology-development/#frameworks-for-building-project-ontologies","title":"Frameworks for building project ontologies","text":"The perfect framework for building project ontologies does not exist yet. The Ontology Development Kit (ODK) has all the tools you need set up a basic application ontology, but the absence of a \"perfect\" module extraction algorithm for this use case is still unsatisfactory. However, for many use cases, filter
modules like the ones described above are actually good enough. Here we will go through a simple example.
An alternative framework for application ontology development, based on a web user interface and tables for managing the seed, is being developed by James Overton at ontodev.
Another potential alternative is to go all the way to graph-land and build the application ontology with KGX and LinkML. See here for an example. Creating a project ontology this way feels more like a Knowledge Graph ETL task than building an ontology!
"},{"location":"tutorial/project-ontology-development/#example-application-ontology-with-odk","title":"Example application ontology with ODK","text":"Set up a basic ODK ontology. We are not covering this again in this tutorial, please refer to the tutorial on setting up your ODK repo.
"},{"location":"tutorial/project-ontology-development/#dealing-with-large-imports","title":"Dealing with large imports","text":"Many of the larger imports in application ontologies do not fit into the normal GitHub file size limit. In this cases it is better to attach them to a GitHub release rather than to check them into version control.
TBD
"},{"location":"tutorial/project-ontology-development/#additional-materials-and-resources","title":"Additional materials and resources","text":"Participants will need to have access to the following resources and tools prior to the training:
Description: How to create and manage pull requests to ontology files in GitHub.
"},{"location":"tutorial/pull-requests/#learning-objectives","title":"Learning objectives","text":"A pull request (PR) is an event in Git where a contributor (you!) asks a maintainer of a Git repository to review changes (e.g. edits to an ontology file) they want to merge into a project (e.g. the owl file) (see reference). A contributor creates a pull request to propose and collaborate on changes to a repository. These changes are proposed in a branch, which ensures that the default branch only contains finished and approved work. See more details here.
"},{"location":"tutorial/pull-requests/#how-to-write-a-great-descriptive-title","title":"How to write a great descriptive title","text":"When committing a pull request, you must include a title and a description (more details in the workflow below.) Tips below (adapted from Hugo Dias):
The title of the PR should be self-explanatory
Do: Describe what was changed in the pull request
Example: Add new term: MONDO:0100503 DPH5-related diphthamide-deficiency syndrome
Don't: write a vague title that has very little meaning.
Example: Add new term
Don't: use the branch name in the pull request (sometimes GitHub will offer this as a default name)
Example:
"},{"location":"tutorial/pull-requests/#general-tips","title":"General tips","text":"A video is below.
Example diffs:
Example 1 (Cell Ontology):
Example 2 (Mondo):
"},{"location":"tutorial/pull-requests/#write-a-good-commit-messages","title":"Write a good commit messages","text":"Commit message: Before Committing, you must add a commit message. In GitHub Desktop in the Commit field in the lower left, there is a subject line and a description.
Give a very descriptive title: Add a descriptive title in the subject line. For example: add new class ONTOLOGY:ID [term name] (e.g. add new class MONDO:0000006 heart disease)
Write a detailed summary of what the change is in the Description box, referring to the issue. The sentence should clearly state how the issue is addressed.
NOTE: You can use the word \u2018fixes\u2019 or \u2018closes\u2019 in the commit message - these are magic words in GitHub; when used in combination with the ticket number, they will automatically close the ticket. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
\u2018Fixes\u2019 and \u2018Closes\u2019 are case-insensitive and can be plural or singular (fixes, closes, fix, close).
If you don\u2019t want to close the ticket, just refer to the ticket # without the word \u2018fixes\u2019, or use \u2018addresses\u2019. The commit will be associated with the correct ticket, but the ticket will remain open.
Push: To incorporate the changes into the remote repository, click Commit to [branch name], then click Push.
Tips for finding reviewers:
An ontology repository should have an owner assigned. This may be described in the ReadMe file or on the OBO Foundry website. For example, the contact person for Mondo is Nicole Vasilevsky.
If you are assigned to review a pull request, you should receive an email notification. You can also check for PRs assigned to you by going to https://github.com/pulls/assigned.
"},{"location":"tutorial/pull-requests/#what-kind-of-person-do-we-need-for-what-kind-of-pull-request","title":"What kind of person do we need for what kind of pull request?","text":"It depends on what the pull request is addressing. Remember the QC checks will check for things like unsatisfiable classes and many other checks (that vary between ontologies). Your job as a reviewer is to check for things that the QC checks won't pick up and need human judgement.
If you don't know who to assign, we recommend assigning the ontology contact person and they can triage the request.
To review a PR, you should view the 'Files changed' and view the diff(s). You can review changes in a pull request one file at a time.
Example:
"},{"location":"tutorial/pull-requests/#things-to-look-out-for-when-reviewing-a-pr","title":"Things to look out for when reviewing a PR:","text":"Make sure the changes made address the ticket. In the example above, Sabrina addressed a ticket that requested adding a new term to Mondo, which is what she did on the PR (see https://github.com/monarch-initiative/mondo/pull/5078).
Examples of things to look for in content changes (like adding new terms or revising existing terms):
appropriate annotations
Make sure there are no unintended or unwanted changes in the PR. See the example below, where Protege reordered the location of a term in the file.
After reviewing the file(s), you can approve the pull request or request additional changes by submitting your review with a summary comment.
Comment (Submit general feedback without explicit approval)
Request changes (Submit feedback that must be addressed before merging)
In addition or instead of adding inline comments, you can leave comments on the Conversation page. The conversation page is a good place to discuss the PR, and for the original creator to respond to the reviewer comments.
GitHub added a 'Suggested changes' feature that allows a PR reviewer to suggest an exact change in a comment on a PR. You can add inline comments and commit your suggestion using 'inline commits'. Read more about it here.
If you review the PR and the changes properly address what was described in the description, then that should be sufficient. Not every PR needs comments; it can be approved without any comments or requests for changes. Feel free to ask for help with your review, and/or to assign additional reviewers.
Some of the content above was adapted from GitHub Docs.
"},{"location":"tutorial/pull-requests/#how-to-change-a-pull-request-in-response-to-review","title":"How to change a pull request in response to review","text":"Conflicts arise when edits are made on two separate branches to the same line in a file. (reference). When editing an ontology file (owl file or obo file), conflicts often arise when adding new terms to an ontology file on separate branches, or when there are a lot of open pull requests.
Conflicts in ontology files can be fixed either on the command line or using GitHub Desktop. In this lesson, we describe how to fix conflicts using GitHub Desktop.
"},{"location":"tutorial/pull-requests/#fix-conflicts-in-github-desktop","title":"Fix conflicts in GitHub desktop","text":"open [ontology file name]
(e.g.open mondo-edit.obo
) or open in Protege manually.Watch a video below with an example fixing a conflict in the Mondo ontology file.
Some examples of conflicts that Nicole fixed in Mondo are below:
"},{"location":"tutorial/pull-requests/#further-regarding","title":"Further regarding","text":""},{"location":"tutorial/pull-requests/#gene-ontology-daily-workflow","title":"Gene Ontology Daily Workflow","text":"
Gene Ontology Editing Guide
"},{"location":"tutorial/pull-requests/#github-merge-conflicts","title":"GitHub Merge Conflicts","text":"Blog post by Hugo Dias
"},{"location":"tutorial/pull-requests/#suggesting-changes-on-github-includes-description-of-how-to-make-inline-commits","title":"Suggesting Changes on GitHub - includes description of how to make inline commits","text":""},{"location":"tutorial/robot-tutorial-1/","title":"ROBOT Mini-Tutorial 1: Convert, Extract and Template","text":"This tutorial covers three ROBOT commands:
Before starting this tutorial, either:
We will be using the files from the Ontologies 101 Tutorial. In your terminal, navigate to the repository that you cloned and then into the BDK14_exercises
folder.
So far, we have been saving our ontologies in Protege using the default RDF/XML syntax, but there are many flavors of OWL. We will discuss each of these serializations in more detail during the class session, but ROBOT supports the following:
Navigate to the basic-subclass/
folder. Open chromosome-parts.owl
in your text editor and you will see it's in RDF/XML format. We're going to take this file and convert it to Turtle (ttl
) serialization. Return to your terminal and enter the following command:
robot convert --input chromosome-parts.owl --format ttl --output chromosome-parts.ttl\n
ROBOT convert is smart about detecting formats, so since the output file ends with .ttl
, the --format ttl
parameter isn't really required. If you wanted to use a different file ending, say .owl
, you will need to include the format flag to force ROBOT to write Turtle.
Now open chromosome-parts.ttl
in your text editor and see what's changed! RDF/XML and Turtle are very different serializations, but the actual data that is contained in these two files is exactly the same.
chromosome-parts.owl
into the following formats: obo
(OBO Format), ofn
(OWL Functional), and omn
(OWL Manchester).Sometimes we only want to browse or share a subset of an ontology, especially with some of the larger OBO Foundry ontologies. There are two main methods for creating subsets:
Right now, we will use use MIREOT and talk more about SLME in our class session. MIREOT makes sure that you have the minimal amount of information you need to reuse an existing ontology term. It allows us to extract a small portion of the class hierarchy by specifying upper and lower boundaries, which you will see in the example below. We need to know the identifiers (as CURIEs) of the terms that we want to set as our boundaries.
"},{"location":"tutorial/robot-tutorial-1/#lets-try-it_1","title":"Let's Try It!","text":"Open chromosome-parts.owl
in Protege and open the Class hierarchy. We are going to create a subset relevant to the term \"chromosome\". First, we will find the CURIE of our desired term. Search for \"chromosome\" and find the \"id\" annotation property. This will be our lower term. Right now, we won't set an upper boundary. That means this subset will go all the way up to the top-level ancestor of \"chromosome\".
Return to your terminal and enter the following command (where the --lower-term
is the CURIE that we just found):
robot extract --method MIREOT --input chromosome-parts.owl --lower-term GO:0005694 --output chromosome-full.owl\n
Now open chromosome-full.owl
in Protege and open the Class hierarchy. When you open the \"cellular_component\" node, you'll notice that most of the terms are gone! Both \"organelle\" and \"intracellular part\" remain because they are in the path between \"chromosome\" and the top-level \"cellular_component\". Keep clicking down and you'll find \"chromosome\" at the very bottom. Since \"chromosome\" has two named parents, both of those parents are included, which is why we ended up with \"organelle\" and \"intracellular part\".
Now let's try it with an upper term. This time, we want \"organelle\" to be the upper boundary. Find the CURIE for \"organelle\".
Return to your terminal and enter the following command (where the --upper-term
is the new CURIE we just found):
robot extract --method MIREOT \\\n --input chromosome-parts.owl \\\n --lower-term GO:0005694 \\\n --upper-term GO:0043226 \\\n --output chromosome.owl\n
Open chromosome.owl
and again return to the Class hierarchy. This time, we see \"organelle\" directly below owl:Thing
. \"intracellular part\" is also now missing because it does not fall under \"organelle\".
chromosome-parts.owl
file.Most of the knowledge encapsulated in ontologies comes from domain experts. Often, these domain experts are not computer scientists and are not familiar with the command line. Luckily, most domain experts are familiar with spreadsheets!
ROBOT provides a way to convert spreadsheets into OWL ontologies using template strings. We'll get more into these during the class session, but if you want to get a head start, they are all documented here. Essentially, the first row of a ROBOT template is a human-readable header. The second row is the ROBOT template string. Each row below that represents an entity to be created in the output ontology. We can create new entities by giving them new IDs, but we can also reference existing entities just by label. For now, we're going to create a new, small ontology with new terms using a template.
"},{"location":"tutorial/robot-tutorial-1/#lets-try-it_2","title":"Let's Try It!","text":"Download (or copy/paste) the animals.tsv file and move it to the basic-subclass/
folder (or whatever folder you would like to work in; we will not be using any of the Ontology 101 files anymore). This contains the following data:
In the first column, we use the special ID
keyword to say that this is our term's unique identifier. The second column contains the LABEL
keyword which is a shortcut for the rdfs:label
annotation property. The third column uses the SC
keyword to state that this column will be a subclass statement. The %
sign is replaced by the value in the cell. We'll talk more about this keyword and the %
symbol during the class session. Finally, the last column begins with A
to denote that this will be an annotation, and then is followed by the annotation property we're using.
Just looking at the template, you can begin to predict what a class hierarchy using these terms would look like in an ontology. We can turn this into reality!
In your terminal, enter the following command:
robot template --template animals.tsv --output animals.owl\n
Note that in this command, we don't use the --input
parameter. That parameter is reserved for input ontologies, and we are not using one right now. More on this later.
Open animals.owl
in Protege, and you'll be able to see the class hierarchy we defined in the template as an actual structure.
Now let's make another small ontology that reuses some terms from our animals.owl
file. Download (or copy/paste) animals2.tsv into the same folder. This contains the following:
You'll notice that we are referencing two terms from our other spreadsheet in this one.
In your terminal, enter the following command:
robot template --input animals.owl --template animals2.tsv --output animals2.owl\n
This time, we did use the --input
parameter and provided the animals ontology we just created. This allows us to use any term in the animals.owl
file in our animals2.tsv
template and ROBOT will know what we're talking about.
Go ahead and open animals2.owl
in Protege. What's missing? The parent classes for \"dog\" and \"cat\" don't have labels, and the \"animal\" term is missing entirely. This is because, even though ROBOT knew about these classes, we didn't ask for the original ontology to be included in the output, so no axioms from that ontology can be found in this newly-created one. Next week, we'll learn about combining ontologies with the Merge command.
For now, let's add the original animals.owl
file as an import:
animals.owl
, click Continue, and then click FinishProt\u00e9g\u00e9 will now load animals.owl
as an import. When you return to the Entities tab, you'll see all those upper-level terms. Note the difference in how the terms are displayed in the class hierarchy.
animals.tsv
template and regenerating animals.owl
.In week 6, we got some hands-on experience with ROBOT using convert
, extract
, and template
. This week, we will learn four new ROBOT commands:
The goal of these and previous commands is to build up to creating an ontology release workflow.
Before starting this tutorial, either:
To start, we will be working in the same folder as the first ROBOT Mini-Tutorial. Navigate to this folder in your terminal and list the contents of the current directory by running ls
. You should see catalog-v001.xml
listed as one of these files. We want to delete this so that we can fix the ontology IRI problem we ran into last week! Before going any further with this tutorial, do this by running either del catalog-v001.xml
for Windows or rm catalog-v001.xml
if you're using Docker, MacOS, or other Linux system.
The annotate
command allows you to attach metadata to your ontology in the form of IRIs and ontology annotations. Like the annotations on a term, ontology annotations help users to understand how they can use the ontology.
As we discussed during previous parts of the course, ontology IRIs are very important! We saw how importing an ontology without an IRI into another ontology without an IRI can cause some problems in the catalog-v001.xml
file. We're going to fix that problem by giving IRIs to both our animals.owl
and animals2.owl
files.
Let's start with animals.owl
:
robot annotate --input animals.owl \\\n --ontology-iri http://example.com/animals.owl \\\n --output animals.owl\n
You'll notice we gave the same file name as the input file; we're just updating our previous file so we don't need to do this in a separate OWL file.
On your own, give animals2.owl
the ontology IRI http://example.com/animals2.owl
. Remember that, in reality, we always want our ontology IRIs to be resolvable, so these would be pretty bad IRIs for an actual ontology.
Let's fix our import statement now. Open animals2.owl
in Prot\u00e9g\u00e9 and go to the Entities tab. You'll see that even though we still have the import statement in the Active ontology tab, the top-level terms are no longer labeled. Since we changed the ontology IRI, Prot\u00e9g\u00e9 can no longer resolve our local file (because the catalog-v001.xml
file was not updated). Go back to the Active ontology tab and click the X to the right of our original import. Then, re-add animals.owl
as an import using the same steps as last time. When you return to the Entities tab, you'll once again see the labels of the top-level terms.
When we release our ontologies, we want to make sure to include a version IRI. Like the ontology IRI, this should always resolve to the version of the ontology at the time of the release. For clarity, we usually use dates in our version IRIs in the OBO Foundry. That way, you know when you navigate to a specific version IRI, that's what the ontology looked like on that date. (Note: edit files don't usually have version IRIs as they are always changing, and we don't expect to be able to point to a stable version)
While you can add a version IRI in Prot\u00e9g\u00e9, if you're trying to create an automated release workflow, this is a manual step you don't want to have to include. Keeping it in your release workflow also makes sure that the verion IRIs are consistent (we'll see how to do this with make
later). For now, let's add a version IRI to animals.owl
(feel free to replace the 2021-05-20
with today's date):
robot annotate --input animals.owl \\\n --version-iri http://example.com/animals/2021-05-20/animals.owl \\\n --output animals.owl\n
Let's break down this version IRI. We have the host (http://example.com/
) followed by our ontology's namespace (animals
). Next, we provided the date in the format of YYYY-MM-DD
. Finally, we have the name of the file. This is standard for OBO Foundry, except with a different host. For example, you can find a release of OBI from April 6, 2021 at http://purl.obolibrary.org/obo/obi/2021-04-06/obi.owl
. In this case, the host is http://purl.obolibrary.org/obo/
. Of course, you may see different patterns in non-OBO-Foundry ontologies, but they should always resolve (hopefully!).
Go ahead and open or reload animals.owl
in Protege. You'll see in the Active Ontology tab that now both the ontology IRI and version IRI fields are filled out.
In addition to ontology and version IRIs, you may also want to add some other metadata to your ontology. For example, when we were introduced to report
, we added a description to the ontology to fix one of the report problems. The three ontology annotations that are required by the OBO Foundry are:
dc11:title
)dc:license
)dc11:description
)These three annotation properties all come from the Dublin Core, but they have slightly different namespaces. This is because DC is split into two parts: the /terms/
and /elements/1.1/
namespaces. Just remember to double check that you're using the correct namespace. If you click on the DC link, you can find the complete list of DC terms in their respective namespaces.
ROBOT contains some built-in prefixes, which can be found here. The prefix dc:
corresponds to the /terms/
namespace and dc11:
to /elements/1.1/
. You may see different prefixes used (for example, /terms/
is sometimes dcterms:
or just terms:
), but the full namespace is what really matters as long as the prefix is defined somewhere.
Let's go ahead and add a title and description to our animals.owl
file. We'll do this using the --annotation
option, which expects two arguments: (1) the CURIE of the annotation property, (2) the value of the annotation. The value of the annotation must be enclosed in double quotes if there are spaces. You can use any annotation property you want here, and include as many as you want! For now, we'll start with two:
robot annotate --input animals.owl \\\n --annotation dc11:title \"Animal Ontology\" \\\n --annotation dc11:description \"An ontology about animals\" \\\n --output animals.owl\n
--annotation
adds these as strings, but remember that an annotation can also point to an link or IRI. We want our license to be a link, so we'll use --link-annotation
instead to add that:
robot annotate --input animals.owl \\\n --link-annotation dc:license https://creativecommons.org/licenses/by/4.0/ \\\n --output animals.owl\n
OBO Foundry recommends using Creative Commons for all licenses. We just gave our ontology the most permissive of these, CC-BY.
When you open animals.owl
in Prot\u00e9g\u00e9 again, you'll see these annotations added to the Active ontology tab. You can also click on the CC-BY link!
We've already learned how to include external ontologies as imports. Usually, for the released version of an ontology, the imports are merged in so that all contents are in one file.
Another reason you may want to merge two ontologies is if you're adding new terms to an ontology using template
, like how we created new animal terms in animals2.tsv
last time. We're going to demonstrate two methods of merging now. The first involves merging two (or more!) separate files and the second involves merging all imports into the current input ontology.
First, copy animals2.owl
to animals-new.owl
. In Windows, this command is copy animals2.owl animals-new.owl
. For Docker and other Linux operating systems, this is cp animals2.owl animals-new.owl
. Open animals-new.owl
in Prot\u00e9g\u00e9 and remove the import we added last time. This is done in the Imported ontologies section of the Active ontology tab. Just click the X on the right side of the imported animals ontology. Don't forget to save!
Continuing with the animals.owl
file we created last week, now run the following command:
robot merge --input animals.owl --input animals-new.owl --output animals-full.owl\n
When you just import an external ontology into your ontology, you'll notice in the Prot\u00e9g\u00e9 class hierarchy that all terms from the external ontology are a less-bold text than internal terms. This can be seen when you open animals2.owl
, where we imported animals.owl
. This is simply Prot\u00e9g\u00e9's way of telling us that these terms are not part of your current ontology. Now that we've merged these two ontologies together, when you open animals-full.owl
in Prot\u00e9g\u00e9, you'll see that all the terms are bold.
By default, the output ontology will get the ontology IRI of the first input ontology. We picked animals.owl
as our first ontology here because this is the ontology that we're adding terms to, so we want our new output ontology to replace the original while keeping the same IRI. merge
will also copy over all the ontology annotations from animals.owl
(the first input) into the new file. The annotations from animals2.owl
are ignored, but we'll talk more about this in our class session.
If we were editing an ontology in the wild, we'd probably now replace the original with this new file using cp
or copy
. For now, don't replace animals.owl
because we'll need it for this next part.
IMPORTANT: Be very careful to check that the format is the same if you're replacing a file! Remember, you can always output OWL Functional syntax or another syntax by ending your output with .ofn
, for example: --output animals-full.ofn
.
When we want to merge all our imports into our working ontology, we call this collapsing the import closure. Luckily (since we're lazy), you don't need to type out each of your imports as an input to do this.
We already have animals.owl
imported into animals2.owl
. Let's collapse the import closure:
robot merge --input animals2.owl --collapse-import-closure true --output animals-full-2.owl\n
Even though we gave this a different file name, if you open animals-full-2.owl
in Prot\u00e9g\u00e9, you'll notice that it's exactly the same as animals-full.owl
! This is because we merged the same files together, just in a slightly different way. This time, though, the ontology IRI is the one for animals2.owl
, not animals.owl
. That is because that was our first input file.
As we saw in the prepwork for Week 5, running a reasoner in Prot\u00e9g\u00e9 creates an inferred class hierarchy. In the OBO Foundry, releases versions of ontologies usually have this inferred hierarchy asserted, so you see the full inferred hierarchy when you open the ontology without running the reasoner. ROBOT reason
allows us to output a version of the ontology with these inferences asserted.
As we discussed, ELK and HermiT are the two main reasoners you'll be using. Instead of using our example ontologies (the asserted and inferred hierarchies for these will look exactly the same), we're going to use another ontology from the Ontologies 101 tutorial from week 5. Navigate back to that directory and then navigate to BDK14_exercises/basic-classification
.
Like running the reasoner in Prot\u00e9g\u00e9, running reason
does three things:
Remember, when we run the reasoner in Prot\u00e9g\u00e9, if the ontology is inconsistent, reason
will fail. If there are unsatisfiable classes, these will be asserted as owl:Nothing
. ROBOT will always fail in both cases, but has some tools to help us figure out why. Let's introduce an unsatifiable class into our test and see what happens.
First, let's make a copy of ubiq-ligase-complex.owl
and call this new file unreasoned.owl
(copy
or cp
).
Open unreasoned.owl
in Prot\u00e9g\u00e9 and follow the steps below. These are things we've covered in past exercises, but if you get stuck, please don't hesitate to reach out.
Like we did in the Disjointness part of the Ontologies 101 tutorial, we've made 'intracellular organelle part' a subclass of two classes that should have no overlap based on the disjointness axiom. Save the ontology and return to your terminal. Now, we'll run reason
. The default reasoner is ELK, but you can specify the reasoner you want to use with the --reasoner
option. For now, we'll just use ELK.
robot reason --input unreasoned.owl --output unsatisfiable.owl\n
You'll notice that ROBOT printed an error message telling us that the term with the IRI http://purl.obolibrary.org/obo/GO_0044446
is unsatisfiable and ROBOT didn't create unsatisfiable.owl
. This is ideal for automated pipelines where we don't want to be releasing unsatisfiable classes.
We can still use ROBOT to investigate the issue, though. It already gave us the IRI, but we can get more details using the --dump-unsatisfiable
option. We won't provide an output this time because we know it won't succeed.
robot reason --input unreasoned.owl --dump-unsatisfiable unsatisfiable.owl\n
You can open unsatisfiable.owl
in Prot\u00e9g\u00e9 and see that 'intracellular organelle part' is not the only term included, even though it was the only unsatisfiable class. Like with the SLME method of extraction, all the terms used in unsatisfiable class or classes logic are included in this unsatisfiable module. We can then use Prot\u00e9g\u00e9 to dig a little deeper in this small module. This is especially useful when working with large ontologies and/or the HermiT reasoner, which both can take quite some time. By extracting a smaller module, we can run the reasoner again in Prot\u00e9g\u00e9 to get detailed explanations. In this case, we already know the problem, so we don't need to investigate any more.
Now let's reason over the original ubiq-ligase-complex.owl
and see what happens:
robot reason --input ubiq-ligase-complex.owl --output reasoned.owl\n
If you just open reasoned.owl
in Prot\u00e9g\u00e9, you won't really notice a different between this and the input file unless you do some digging. This takes us to our next command...
The diff
command can be used to compare the axioms in two ontologies to see what has been added and what has been removed. While the diffs on GitHub are useful for seeing what changed, it can be really tough for a human to read the raw OWL formats. Using ROBOT, we can output these diffs in a few different formats (using the --format
option):
plain
: plain text with just the added and removed axioms listed in OWL functional syntax (still tough for a human to read, but could be good for passing to other scripts)pretty
: similar to plain
, but the IRIs are replaced with CURIEs and labels where available (still hard to read)html
: a nice, sharable HTML file with the diffs sorted by termmarkdown
: like the HTML diff, but in markdown for easy sharing on platforms like GitHub (perfect for pull requests!)We're going to generate an HTML diff of ubiq-ligase-complex.owl
compared to the new reasoned.owl
file to see what inferences have been asserted. diff
takes a left (\"original\") and a right (\"new\") input to compare.
robot diff --left ubiq-ligase-complex.owl \\\n --right reasoned.owl \\\n --format html \\\n --output diff.html\n
Open diff.html
in your browser side-by-side with reasoned.owl
and you can see how the changes look in both.
Homework question: Running reason
should assert inferences, yet there are some removed axioms in our diff. Why do you think these axioms were removed?
In this tutorial you will learn how to set up your QC pipeline with ROBOT report
, verify
, validate-profile
and reason
.
Quality control is a very large concern in ontologies. For example, we want to make sure that our editors use the right annotation properties to attach metadata to terms (such as a date, or a label), or to make sure that our last edit did not accidentally introduce a logical error. In ROBOT, we have four commands that help us in particular to ensure the quality of our ontologies:
verify
to ensure they do not appear in your ontology.reason
to ensure that your ontology is consistent and coherent and test the \"unique name assumption\".In the following, we will learn about all of these and how they fit in the wider concerns of ontology quality control.
"},{"location":"tutorial/robot-tutorial-qc/#download-test-ontology","title":"Download test ontology","text":"Download example.owl
, or get it via the command line:
curl https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/robot_tutorial_qc/example.owl > example.owl\n
Let us ensure we are using the same ROBOT version:
robot --version\n
We see:
ROBOT version 1.8.3\n
"},{"location":"tutorial/robot-tutorial-qc/#robot-validate-profile","title":"ROBOT validate-profile","text":"ROBOT validate-profile: Ensures that your ontology is a syntactically valid OWL ontology. This is the absolute minimum check - some \"violations\" to OWL 2 DL validity cause the reasoner to behave in unexpected and wrong ways!
robot validate-profile --profile DL -i example.owl\n
Thankfully, our test ontology is in valid OWL DL:
OWL 2 DL Profile Report: [Ontology and imports closure in profile]\n
This check is overlooked by a lot of OWL Ontology developers despite its importance to ensure both a predictable behaviour of the reasoner and of parsing tools. See here for an example where an ontology was not in OWL DL profile, causing various problems for parsing and computation: https://github.com/Orphanet/ORDO/issues/32.
"},{"location":"tutorial/robot-tutorial-qc/#robot-report","title":"ROBOT report","text":"Let us generate a simple report:
robot report -i example.owl -o report.html\n
ROBOT report will do two things:
Violations: 11\n-----------------\nERROR: 5\nWARN: 4\nINFO: 2\nERROR Report failed!\n
Let us look at the file in a browser (simply double-click on the html file the way you would open a PDF). Your report should look similar to this:
While there are other formats you can export your report to, HTML is a great format which not only offers useful colour coding, but also allows us to click on the related classes and properties and, more importantly, the checks to find our what they mean (for an overview of all ROBOT report checks see here).
"},{"location":"tutorial/robot-tutorial-qc/#exercise","title":"Exercise","text":"We will leave it to the reader as an exercise to try and fix all the errors indicated by the report!
"},{"location":"tutorial/robot-tutorial-qc/#advanced-usage-of-robot-report","title":"Advanced usage of ROBOT report","text":""},{"location":"tutorial/robot-tutorial-qc/#customisation","title":"Customisation","text":"While by far the most widely spread usage of ROBOT report is to check for OBO best practices, it is possible to customise the report by removing certain OBO ontology checks and adding custom ones.
Lets first create a simple profile.txt
in our directory and add the following lines:
WARN annotation_whitespace\nERROR missing_ontology_description\nERROR missing_definition\nERROR missing_ontology_license\nERROR missing_ontology_title\nERROR misused_obsolete_label\nERROR multiple_labels\n
Now we tell ROBOT to run the command using our custom profile rather than the default ROBOT profile:
robot report -i example.owl --profile profile.txt -o report.html\n
The resulting report looks different:
In particular, some checks like missing_superclass
which we did not care about for our use case are not shown at all anymore, and others, such as missing_definition
are now considered ERROR
(red) rather than WARN
(warning, yellow) because for our use case, we have decided that definitions on terms are mandatory.
ROBOT verify allows us to define QC checks for undesirable situation (we sometimes call this \"anti-pattern\") using the SPARQL query language. The idea is simple: we write a SPARQL query for the thing we do not want. For example, we can use SPARQL to look for classes with more than one label. Then, we feed this query to ROBOT verify. ROBOT verify than ensures that the query has no answers, i.e the thing we do not want actually does not happen:
PREFIX owl: <http://www.w3.org/2002/07/owl#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT DISTINCT ?entity ?property ?value WHERE {\n VALUES ?property { rdfs:label }\n ?entity ?property ?value .\n ?entity ?property ?value2 .\n FILTER (?value != ?value2) .\n FILTER NOT EXISTS { ?entity owl:deprecated true }\n FILTER (!isBlank(?entity))\n}\nORDER BY ?entity\n
Let us safe this query now in our working directory as bad_labels.sparql
and run the following:
robot verify -i example.owl --queries bad_labels.sparql\n
ROBOT will output this to tell us which terms have violations:
FAIL Rule bad_labels.sparql: 2 violation(s)\nentity,property,value\nhttp://purl.obolibrary.org/obo/OBI_0002986,http://www.w3.org/2000/01/rdf-schema#label,CT scan\nhttp://purl.obolibrary.org/obo/OBI_0002986,http://www.w3.org/2000/01/rdf-schema#label,computed tomography imaging assay\n
Now the cool thing with verify
is that we can basically feed SPARQL SELECT queries in whatever shape or form we want. To make error messages more readable for curators, you can even encode a proper error message:
PREFIX owl: <http://www.w3.org/2002/07/owl#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT DISTINCT ?error WHERE {\n VALUES ?property { rdfs:label }\n ?entity ?property ?value .\n ?entity ?property ?value2 .\n FILTER (?value != ?value2) .\n FILTER NOT EXISTS { ?entity owl:deprecated true }\n FILTER (!isBlank(?entity))\n BIND(CONCAT(\"Entity \",STR(?entity),\" uses two different labels: (1) \",STR(?value),\" and (2) \",STR(?value2)) as ?error)\n}\nORDER BY ?entity\n
This time, when running the query, we get:
FAIL Rule bad_labels.sparql: 2 violation(s)\nerror\nEntity http://purl.obolibrary.org/obo/OBI_0002986 uses two different labels: (1) CT scan and (2) computed tomography imaging assay\nEntity http://purl.obolibrary.org/obo/OBI_0002986 uses two different labels: (1) computed tomography imaging assay and (2) CT scan\n
Which appears much more readable! You can tweak the output in whatever way you think is best. Two things about this:
ROBOT report
: despite the ability to include custom queries, all queries must start with:SELECT DISTINCT ?entity ?property ?value WHERE\n
This is one of the reasons we still like using ROBOT verify, despite the fact that ROBOT report can also be extended with custom checks.
verify
is that you can add the --output-dir results/
parameter to your query to get ROBOT to export the query results as TSV files. This can be useful if you have many QC queries and need to work with them independently of the checks.This is not an exhaustive tutorial for ROBOT reason (for more, see here). We only want to get across two checks that we feel absolutely every ontology developer should know about.
"},{"location":"tutorial/robot-tutorial-qc/#the-distinct-scope-assumption","title":"The \"distinct scope\" assumption","text":"In most cases, we do not want to define the exact same concept twice. There are some exceptions, for example when we align ontologies such as CHEBI and GO which may have overlapping scope, but in 99.9% of the cases, having the reasoner infer that two classes are the same points to a mistake in the axiomatisation. Let us check that we do not have any such unintended equivalencies in our ontology:
robot reason -i example.owl --equivalent-classes-allowed none\n
ROBOT will note that:
ERROR No equivalent class axioms are allowed\nERROR Equivalence: <http://purl.obolibrary.org/obo/TEST_0600047> == <http://purl.obolibrary.org/obo/OBI_0600047>\n
Further investigation in Protege will reveal that TEST_0600047 and OBI_0600047 are subclasses of each other, which causes the reasoner to infer that they are equivalent.
"},{"location":"tutorial/setting-up-project-odk/","title":"Tutorial: How to get started with your own ODK-style repository","text":"The tutorial uses example tailored for users of UNIX systems, like Mac and Linux. Users of Windows generally have analogous steps - wherever we talk about an sh
file in the following there exists a corresponding bat
file that can be run in the windows powershell, or CMD.
You have:
A recording of a demo of creating a ODK-repo is available here
"},{"location":"tutorial/setting-up-project-odk/#your-first-repository","title":"Your first repository","text":"On your machine, create a new folder somewhere:
cd ~\nmkdir odk_tutorial\ncd odk_tutorial\n
Now download the seed-my-repo wrapper script from the ODK GitHub repository. A detailed explanation of how to do that can be found here. For simplicity, we just use wget here to download the seed-my-repo file, but you can do it manually:
wget https://raw.githubusercontent.com/INCATools/ontology-development-kit/master/seed-via-docker.sh\n
The last ingredient we need is an ODK config file. While you can, in theory, create an empty repo entirely without a config file (one will be generated for you), we recommend to just start right with one. You can find many examples of configs here. For the sake of this tutorial, we will start with a simple config:
id: cato\ntitle: \"Cat Anatomy Ontology\"\ngithub_org: obophenotype\ngit_main_branch: main\nrepo: cat_anatomy_ontology\nrelease_artefacts:\n- base\n- full\n- simple\nprimary_release: full\nexport_formats:\n- owl\n- obo\n- json\nimport_group:\nproducts:\n- id: ro\n- id: pato\n- id: omo\nrobot_java_args: \"-Xmx8G\"\nrobot_report:\nuse_labels: TRUE\nfail_on: ERROR\ncustom_profile: TRUE\nreport_on:\n- edit\n
Safe this config file as in your temporary directory, e.g. ~/odk_tutorial/cato-odk.yaml
.
Most of your work managing your ODK in the future will involve editing this file. There are dozens of cool options that do magical things in there. For now, lets focus on the most essential:
"},{"location":"tutorial/setting-up-project-odk/#general-config","title":"General config:","text":"id: cato\ntitle: \"Cat Anatomy Ontology\"\n
The id is essential, as it will determine how files will be named, which default term IDs to assume, and many more. It should be a lowercase string which is, by convention at least 4 characters long - 5 is not unheard of. The title
field is used to generate various default values in the repository, like the README and others. There are other fields, like description
, but let's start minimal for now. A full list of elements can be found in this schema:
https://github.com/INCATools/ontology-development-kit/blob/master/schema/project-schema.json
"},{"location":"tutorial/setting-up-project-odk/#git-config","title":"Git config:","text":"github_org: obophenotype\ngit_main_branch: main\nrepo: cat_anatomy_ontology\n
The github_org
(the GitHub or GitLab organisation) and the repo
(repository name) will be used for some basic config of the git repo. Enter your own github_org
here rather than obophenotype
. Your default github_org
is your GitHub username. If you are not creating a new repo, but working on a repo that predates renaming the GitHub main branch from master
to main
, you may want to set the git_main_branch
as well.
release_artefacts:\n - base\n - full\n - simple\nprimary_release: full\nexport_formats:\n - owl\n - obo\n - json\n
With this configuration, we tell the ODK that we wish to automatically generate the base, full and simple release files for our ontology. We also say that we want the primary_release
to be the full
release (which is also the default). The primary release will be materialised as cato.owl
, and is what most users of your ontology will interact with. More information and what these are can be found here. We always want to create a base
, i.e. the release variant that contains all the axioms that belong to the ontology, and none of the imported ones, but we do not want to make it the primary_release
, because it will be unclassified and missing a lot of the important inferences.
We also configure export products: we always want to export to OWL
(owl
), but we can also chose to export to OBO
(obo
) format and OBOGraphs JSON
(json
).
import_group:\n products:\n - id: ro\n - id: pato\n - id: omo\n
This is a central part of the ODK, and the section of the config file you will interact with the most. Please see here for details. What we are asking the ODK here, in essence, to set us up for dynamically importing from the Relation Ontology (RO), the Phenotype And Trait Ontology (PATO) and the OBO Metadata Ontology (OMO).
"},{"location":"tutorial/setting-up-project-odk/#memory-management","title":"Memory management:","text":"robot_java_args: '-Xmx8G'\n
Here we say that we allow ROBOT to consume up to 8GB of memory. Make sure that your docker is set up to permit at least ~20% more memory than that, i.e. 9GB or 10GB, otherwise, some cryptic Docker errors may come up.
"},{"location":"tutorial/setting-up-project-odk/#robot-report","title":"ROBOT Report:","text":"robot_report:\n use_labels: TRUE\n fail_on: ERROR\n report_on:\n - edit\n
use_labels
: allows switching labels on and off in the ROBOT reportfail_on
: the report will fail if there is an ERROR-level violationreport_on
: specify which files to run the report over.With this configuration, we tell ODK we want to run a report to check the quality of the ontology. Check here the complete list of report queries.
"},{"location":"tutorial/setting-up-project-odk/#generate-the-repo","title":"Generate the repo","text":"Run the following:
cd ~/odk_tutorial\nsh seed-via-docker.sh -c -C cato-odk.yaml\n
This will create a basic layout of your repo under target/cato/*
Note: after this run, you wont need cato-odk.yaml
anymore as it will have been added to your ontology repo, which we will see later.
You can now move the target/cato
directory to a more suitable location. For the sake of this tutorial we will move it to the Home directory.
mv target/cato ~/\n
"},{"location":"tutorial/setting-up-project-odk/#using-github-desktop","title":"Using GitHub Desktop","text":"If you use GitHub Desktop, you can now simply add this repo by selecting File -> Add local repository
and select the directory you moved the repo to (as an aside, you should really have a nice workspace directory like ~/git
or ~/ws
or some such to organise your projects).
Then click Publish the repository
on
Follow the instructions you see on the Terminal (they are printed after your seed-my-repo run).
"},{"location":"tutorial/setting-up-project-odk/#finish","title":"Finish!","text":"Congratulations, you have successfully jump-started your very own ODK repository and can start developing.
"},{"location":"tutorial/setting-up-project-odk/#next-steps","title":"Next steps:","text":"~/cato/src/ontology/cato-edit.owl
using Protege.This tutorial will teach you how to create report tables using SPARQL and the ODK. Report tables are TSV files that can be viewed by programs such as Excel or Google Sheets.
For a tutorial on how to generate reports independent of ODK please see here.
"},{"location":"tutorial/sparql-report-odk/#preparation","title":"Preparation","text":"robot_report:\n custom_sparql_exports:\n - basic-report\n - my-cat-report\n
This will tell the ODK that you no longer wish to generate the ODK default reports (synonyms, xrefs, etc), but instead:
basic-report
)my-cat-report
.Now, we can apply these changes as usual:
sh run.sh make update_repo\n
"},{"location":"tutorial/sparql-report-odk/#adding-the-actual-table-report","title":"Adding the actual table report","text":"Similar to our ROBOT tutorial on queries, let us now add a simple table report for the terms and labels in our ontology. To do that, let us safe the following file in our src/sparql
directory (standard ODK setup), i.e. src/sparql/my-cat-report.sparql
(you must use the same name as the one you speciefied in your ODK yaml file above):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\n\nSELECT ?term ?property ?value\nWHERE {\n ?term a owl:Class ;\n rdfs:label ?value .\n}\n
Now, let's generate our report (you have to be, as always, in src/ontology/
):
sh run.sh make custom_reports\n
This will generate all custom reports you have configured in one go and save them in the src/ontology/reports
directory. reports/my-cat-report.tsv
looks probably something like this for you:
?term ?property ?value\n<http://purl.obolibrary.org/obo/CATO_0000000> \"root node\"@en\n...\n
That is all there is. You can configure as many reports as you want, and they will all be generated with the custom_reports
command above, or as part of your ontology releases.
Creating table outputs from your ontology helps with many issues, for example during ontology curation (it is often easier to look at tables of related ontology terms rather than a hierarchy), for data aggregation (you want to know how many synonyms there are, and which) and simply to share \"a list of all terms with labels\". There are two major tools to help here:
Download example.owl
, or get it via the command line:
curl https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/robot_tutorial_qc/example.owl > example.owl\n
Let us ensure we are using the same ROBOT version:
robot --version\n
We see:
ROBOT version 1.8.3\n
"},{"location":"tutorial/sparql-report-robot/#generating-a-simple-report","title":"Generating a simple report","text":"Very frequently, we wish need to create summary tables (for a more detailed motivation see here).
Here, lets generate a simple report table by specifying a query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT ?term ?property ?value\nWHERE {\n ?term rdfs:label ?value .\n}\n
Let us safe the query as labels.sparql
in our working directory.
Let's now generate the report:
robot query -i example.owl --query labels.sparql labels.tsv\n
When looking at labels.tsv (in a text editor, or Excel, or whatever table editor you prefer), we notice that some properties are included in our list and decide to change that by restricting the results to classes:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\n\nSELECT ?term ?property ?value\nWHERE {\n ?term a owl:Class ;\n rdfs:label ?value .\n}\n
Now, when running the robot query
command again, we see only the terms we want.
Note that you could have achieved all this with a simple ROBOT export command. However, there are many cool ways you can tweak your reports when you learn how to build them manually during SPARQL. Your only limit is essentially SPARQL itself, which gives you access too most things in your ontology, aside from perhaps complex logical axioms.
"},{"location":"tutorial/sparql/","title":"Basic SPARQL for OBO Engineers","text":"In this tutorial we introduce SPARQL, with a particular spin on how we use it across OBO ontologies. Following this tutorial should give you a sense of how we use SPARQL across OBO, without going too much into technical details. You can find concrete tutorials on how to generate reports or QC checks with ROBOT and ODK towards the end of this page.
"},{"location":"tutorial/sparql/#preparation","title":"Preparation","text":"--update
). ROBOT uses Jena internally to execute SPARQL queries.SPARQL has many uses in the OBO-sphere, but the following in particular:
We will discuss each of these in the following and give examples. An informal discussion of SPARQL in OBO can be followed in video below.
"},{"location":"tutorial/sparql/#quality-control-checking","title":"Quality control checking","text":"For us, ROBOT + SPARQL were a game changer for our quality control (QC) pipelines. This is how it works. First, we encode the error in the form of a SPARQL query (we sometimes call this \"anti-pattern\", i.e. an undesirable (anti-) representation). For example, the following check simply looks for entities that have more than one definition:
PREFIX obo: <http://purl.obolibrary.org/obo/>\nPREFIX owl: <http://www.w3.org/2002/07/owl#>\n\nSELECT DISTINCT ?entity ?property ?value WHERE {\n VALUES ?property { obo:IAO_0000115\n obo:IAO_0000600 }\n ?entity ?property ?value .\n ?entity ?property ?value2 .\n FILTER (?value != ?value2)\n FILTER NOT EXISTS { ?entity owl:deprecated true }\n FILTER (!isBlank(?entity))\n}\nORDER BY ?entity\n
This is a typical workflow. Think of an ontology editor working on an ontology. Often, that curator notices that the same problem happens repeatedly and tell us, the Ontology Pipeline Developer, that they would like a check to prevent the error. We then capture the erroneous situation as a SPARQL query. Then, we add it to our ontology repository, and execute it with ROBOT report or ROBOT verify (see above) in our CI pipelines, usually based on GitHub actions or Travis. Note that the Ontology Development Kit provides a built-in framework for for such queries build on ROBOT verify and report.
"},{"location":"tutorial/sparql/#creating-summary-tables-for-ontologies","title":"Creating summary tables for ontologies","text":"Many times, we need to create tabular reports of our ontologies to share with stakeholders or to help with internal reviews, e.g.:
Sometimes using Yasgui, for example in conjunction with the RENCI Ubergraph Endpoint, is enough, but often, using ROBOT query is the better choice, especially if you want to make sure the right version of the ontology is used (Ubergraph occasionally is out of date).
Using ROBOT in conjunction with a Workflows Automation system like Github actions helps with generating up-to-date reports. Here is an example of a GitHub action that generates a few reports with ROBOT and pushes them back to the repository.
"},{"location":"tutorial/sparql/#a-note-for-data-scientists","title":"A note for Data Scientists","text":"In many cases we are asked how to best \"load an ontology\" into a python notebook or similar. Very often the answer is that it is best to first extract the content of the ontology into a table form, and then load it using a CSV reader like pandas
. In this scenario, the workflow for interacting with ontologies is:
If combined with for example a Makefile, you can always ensure that the report generation process is fully reproducible as well.
"},{"location":"tutorial/sparql/#sophisticated-data-transformations-in-ontology-pipelines","title":"Sophisticated data transformations in ontology pipelines","text":"Lastly, we use ROBOT query to implement complex ontology transformation processes. For example the following complex query transforms related synonyms to exact synonyms if some complex condition is met:
prefix owl: <http://www.w3.org/2002/07/owl#>\nprefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>\nprefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nDELETE {\n ?term oboInOwl:hasRelatedSynonym ?related .\n ?relax a owl:Axiom ;\n owl:annotatedSource ?term ;\n owl:annotatedProperty oboInOwl:hasRelatedSynonym ;\n owl:annotatedTarget ?related ;\n oboInOwl:hasDbXref ?xref2 .\n}\n\nINSERT {\n ?relax a owl:Axiom ;\n owl:annotatedSource ?term ;\n owl:annotatedProperty oboInOwl:hasExactSynonym ;\n owl:annotatedTarget ?related ;\n oboInOwl:hasDbXref ?xref2 .\n}\nWHERE\n{\n {\n ?term oboInOwl:hasRelatedSynonym ?related ;\n oboInOwl:hasExactSynonym ?exact ;\n a owl:Class .\n ?exax a owl:Axiom ;\n owl:annotatedSource ?term ;\n owl:annotatedProperty oboInOwl:hasExactSynonym ;\n owl:annotatedTarget ?exact ;\n oboInOwl:hasDbXref ?xref1 .\n ?relax a owl:Axiom ;\n owl:annotatedSource ?term ;\n owl:annotatedProperty oboInOwl:hasRelatedSynonym ;\n owl:annotatedTarget ?related ;\n oboInOwl:hasDbXref ?xref2 .\n\n FILTER (str(?related)=str(?exact))\n FILTER (isIRI(?term) && regex(str(?term), \"^http://purl.obolibrary.org/obo/MONDO_\"))\n }\n}\n
This can be a very useful tool for bulk editing the ontology, in particular where it is difficult or impossible to achieve the same using regular expressions or other forms of \"replacement\"-techniques. Here are some example queries we collected to do such mass operations in Mondo.
"},{"location":"tutorial/sparql/#related-tutorials","title":"Related tutorials","text":"This tutorial is based off https://ontology101tutorial.readthedocs.io/en/latest/DL_QueryTab.html +Created by: Melissa Haendel, Chris Mungall, David Osumi-Sutherland, Matt Yoder, Carlo Torniai, and Simon Jupp
+The DL query tab shown below provides an interface for querying and searching an ontology. The ontology must be classified by a reasoner before it can be queried in the DL query tab.
+For this tutorial, we will be using cc.owl which can be found here.
+Open cc.owl in Protege (use Open from URL and enter the https://raw.githubusercontent.com/OHSUBD2K/BDK14-Ontologies-101/master/BDK14_exercises/basic-dl-query/cc.owl
). Run the reasoner. Navigate to the DL Query tab.
Type organelle
into the box, and make sure subclasses
and direct subclasses
are ticked.
You can type any valid OWL class expression into the DL query tab. For example, to find all classes whose members are part_of a membrane, type part_of some membrane
and click execute
. Note the linking underscore for this relation in this ontology. Some ontologies do not use underscores for relations, whereby you'd need single quotes (i.e. part of
).
The OWL keyword and
can be used to make a class expression that is the intersection of two class expressions. For example, to find the classes in the red area below, we want to find subclasses of the intersection of the class organelle
and the class endoplasmic reticulum part
Note that we do not need to use the part
grouping classes in the gene ontology (GO). The same results can be obtained by querying for the intersection of the class organelle
and the restriction part_of some ER
– try this and see.
We can also ask for superclasses by ticking the boxes as below:
+ +The or
keyword is used to create a class expression that is the union of two class expressions. For example:
+(WARNING: or
is not supported by ELK reasoner)
This is illustrated by the red area in the following Venn diagram:
+ +For further exercises, please see https://ontology101tutorial.readthedocs.io/en/latest/EXERCISE_BasicDL_Queries.html
+ + + + + + +This tutorial explains adding quality checks not included in the ROBOT Report.
+You have completed the tutorials:
+ +oboInOwl:creation_date
to the root_node
in the CAT Ontology.oboInOwl:creation_date
. It will return the class with the annotation if it's not of type xsd:dateTime
.PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+
+SELECT ?cls WHERE
+{
+ ?cls oboInOwl:creation_date ?date .
+ FILTER(DATATYPE(?date) != xsd:dateTime)
+}
+
Save the SPARQL query in the src/sparql
folder and name it [violation name]-violation.sparql
. In the case of the tutorial, date-as-string-violation.sparql
Add the check to the ODK config file. In the previous tutorial, this is located at ~/cato/src/ontology/cato-odk.yaml
. Inside robot_report
, add custom_sparql_checks
robot_report:
+ use_labels: TRUE
+ fail_on: ERROR
+ report_on:
+ - edit
+ custom_sparql_checks:
+ - date-as-string
+
sh run.sh make update_repo
+
sh run.sh make sparql_test
+FAIL Rule ../sparql/date-as-string-violation.sparql: 1 violation(s)
+cls
+http://purl.obolibrary.org/obo/CATO_0000000
+
xsd:dateTime
, and run the test again to certify everything is good this time.
+
+sh run.sh make sparql_test
+PASS Rule ../sparql/date-as-string-violation.sparql: 0 violation(s)
+
Push the changes to your repository, and the custom checks will run whenever creating a new Pull Request, as detailed here.
+There are several checks already available in the ODK. If you'd like to add them, add the validation name in your ODK config file.
+owldef-self-reference
: verify if the term uses its term as equivalentredundant-subClassOf
: verify if there are redundant subclasses between three classestaxon-range
: verify if the annotations present_in_taxon
or never_in_taxon
always use classes from NCBITaxoniri-range
: verify if the value for the annotations never_in_taxon
, present_in_taxon
, foaf:depicted_by
, oboInOwl:inSubset
and dcterms:contributor
are not an IRIiri-range-advanced
: same as iri-range
plus check for rdfs:seeAlso
annotationlabel-with-iri
: verify if there is IRI in the labelmultiple-replaced_by
: verify if an obsolete term has multiple replaced_by
termsterm-tracker-uri
: verify if the value for the annotation term_tracker_item is not URIillegal-date
: verify if the value for the annotations dcterms:date
, dcterms:issued
and dcterms:created
are of type xds:date
and use the pattern YYYY-MM-DD
ROBOT report can also have custom quality checks.
+custom_profile: TRUE
, in the ODK config file. robot_report:
+ use_labels: TRUE
+ fail_on: ERROR
+ custom_profile: TRUE
+ report_on:
+ - edit
+ custom_sparql_checks:
+ - date-as-string
+
src/sparql
. There isn't a restriction on the file name. However, it should return the variables ?entity ?property ?value
.
+SELECT DISTINCT ?entity ?property ?value
+WHERE {
+ ...
+}
+
src/ontology/profile.txt
file.ERROR file:../sparql/<file name>.sparql
+
src/ontology/reports/cato-edit.owl-obo-report.tsv
. The Rule Name will be the SPARQL file name.sh run.sh make test
+
entity
, property
and value
-> ROBOT reportKeep in mind that after changing the profile.txt
, you won't get any upcoming updates, and you need to update manually.
This tutorial is based off https://ontology101tutorial.readthedocs.io/en/latest/Disjointness.html +Created by: Melissa Haendel, Chris Mungall, David Osumi-Sutherland, Matt Yoder, Carlo Torniai, and Simon Jupp
+For this excercise, we will be using chromosome-parts-interim.owl file that can be found here
+In the chromosome-parts-interim.owl file, at the top of our class hierarchy we have cell, cell part, chromosomal part, intracellular part, organelle and organelle part. By default, OWL assumes that these classes can overlap, i.e. there are individuals who can be instances of more than one of these classes. We want to create a restriction on our ontology that states these classes are different and that no individual can be a member of more than one of these classes. We can say this in OWL by creating a disjoint classes axiom.
+If you do not already have it open, load your previous ontology that was derived from the 'interim file'. Note: you can open a recent file by going to File-> Open Recent
+We want to assert that organelle
and organelle part
are disjoint. To do this first select the organelle
class. In the class 'Description' view, scroll down and select the (+) button next to Disjoint With. You are presented with the now familiar window allowing you to select, or type, to choose a class. In the hierarchy panel, you can use CTRL to select multiple classes. Select 'organelle part' as disjoint with organelle.
Note that the directionality is irrelevant. Prove this to yourself by deleting the disjoint axiom, and adding it back from organelle part
.
We have introduced a deliberate mistake into the ontology. We previously asserted that intracellular organelle part
is a subclass of both organelle part
and organelle
. We have now added an axiom stating that organelle
and organelle part
are disjoint. We can use the reasoner to check the consistency of our ontology. The reasoner should detect our contradiction.
Protégé comes with several reasoners, and more can be installed via the plugins mechanism (see plugins chapter). Select a reasoner from the Reasoner menu (Elk, HermiT, Pellet, or Fact++ will work - we mostly use ELK). Once a reasoner is highlighted, select 'Start reasoner' from the menu. Note: you may get several pop-boxes/warnings, ignore those.
+The intracellular organelle part
class will have changed to red indicating that the class is now unsatisfiable.
You can also see unsatisfiable classes by switching to the inferred view.
+ +Here you will a special class called Nothing
. When we previously said that all OWL classes are subclasses of OWL Thing. OWL Nothing
is a leaf class or bottom class of your ontology. Any classes that are deemed unsatisfiable by the reasoner are shown as subclasses or equivalent to OWL Nothing. The inferred view will show you all subclasses of Nothing.
Once the ontology is classified, inferred statements or axioms are shown in the various panels with a light-yellow shading. The class description for intracellular organelle part
should look something like the screen shot below. You will see that the class has been asserted equivalent to the Nothing
class. Inside this statement, a small question mark icon appears, clicking this will get an explanation from the reasoner for this inconsistency.
Select the (?) icon to get an explanation for this inconsistency. The explanation shows the axioms involved. We see the disjoint class axiom alongside the two subclass axioms are causing the inconsistency. We can simply repair this ontology by removing the intracellular organelle part
subClassOf organelle
axiom.
Remove the Disjoint with axiom (click the (x) beside organelle
in the Description pane for intracellular organelle part
), and resynchronise the reasoner from the reasoner menu.
This is a very unprofessional video below recorded as part of one of our trainings. It walks you through this tutorial here, with some additional examples being given and a bit of Q&A.
+ + +SC 'part of' some %
which can be instantiated by ROBOT
to be transformed into an OWL axiom: SubClassOf(CATO:001 ObjectSomeValuesFrom(BFO:0000051 UBERON:123))
. Similarly, DOSDP YAML files are often referred to as "templates" (which is appropriate). Unfortunately, we often refer to them as "patterns" which is not strictly the right way to name them: they are templates that encode patterns (and that only to a limited extend). We recommend to refer to the DOSDP YAML files as "templates".equivalentTo
or subClassOf
field: It tells DOSDP tools how to generate an OWL axiom, with which variable slots (vars
).This tutorial assumes you have set up an ODK repo with this config:
+id: cato
+title: "Cat Anatomy Ontology"
+github_org: obophenotype
+git_main_branch: main
+repo: cat_anatomy_ontology
+release_artefacts:
+ - base
+ - full
+ - simple
+primary_release: full
+export_formats:
+ - owl
+ - obo
+ - json
+import_group:
+ products:
+ - id: ro
+ - id: pato
+ - id: omo
+robot_java_args: '-Xmx8G'
+
In your src/ontology/{yourontology}-odk.yaml
file, simply add the following:
use_dosdps: true
+
This flag activates DOSDP in ODK - without it, none of the DOSDP workflows in ODK can be used. Technically, this flag tells ODK the following things:
+src/ontology/Makefile
is extended as follows:pipelines
, or workflows, for processing patterns, e.g. pattern_schema_checks
for validating all DOSDP templates,patterns
to regenerate all patterns.src/patterns
, is created with the following files:src/patterns/pattern.owl
: This is an ontology of your own patterns. This can be used to browse the your pattern in the form of a class hierarchy, which can help greatly to understand how they relate logically. There are some flaws in this system, like occasional unintended equivalencies between patterns, but for most uses, it is doing ok.src/patterns/definitions.owl
: This is the merged ontology of all your DOSDP generated classes. Basically, if you manage your classes across multiple DOSDP patterns and tables, their generated OWL axioms will all be added to this file.src/patterns/external.txt
: This file can be used to import external patterns. Just add the (p)URL to a pattern to the file, and the DOSDP pipeline will import it when you run it. We use this a lot when sharing DOSDP templates across ontologies.src/patterns/data/default/
) and one in the src/patterns
directory. The former points you to the place where you should put, by default, any DOSDP data tables. More about that in the next sections.To fully activate DOSDP in your ontology, please run:
+sh run.sh make update_repo
+
This will:
+v1.3
, for example)Makefile
in certain ways(1) Create a new file src/patterns/dosdp-patterns/haircoat_colour_pattern.yaml
and paste the following content:
pattern_name: haircoat_colour_pattern
+pattern_iri: http://purl.obolibrary.org/obo/obo-academy/patterns/haircoat_colour_pattern.yaml
+
+description: "
+ Captures the multicoloured characteristic of the fur, i.e. spotted, dotted, motley etc."
+
+classes:
+ colour_pattern: PATO:0001533
+ coat_of_hair: UBERON:0010166
+
+relations:
+ has_characteristic: RO:0000053
+
+vars:
+ colour_pattern: "'colour_pattern'"
+
+name:
+ text: "%s coat of hair"
+ vars:
+ - colour_pattern
+
+def:
+ text: "A coat of hair with a %s colour pattern."
+ vars:
+ - colour_pattern
+
+equivalentTo:
+ text: "'coat_of_hair' and 'has_characteristic' some %s"
+ vars:
+ - colour_pattern
+
(2) Let's also create a simple template table to capture traits for our ontology.
+Note: the filename of the DOSDP template file (haircoat_colour_pattern.yaml
) excluding the extension must be identical
+to the filename of the template table (haircoat_colour_pattern.tsv
) excluding the extension.
Let's create the new file at src/patterns/data/default/haircoat_colour_pattern.tsv
.
defined_class colour_pattern
+CATO:0000001 PATO:0000333
+
We are creating a minimal table here with just two columns:
+defined_class
refers to the ID for the term that is being modelled by the template (mandatory for all DOSDP templates)colour_pattern
refers to the variable of the same name specified in the vars:
section of the DOSDP template YAML file.Next, we will get a bit used to various commands that help us with DOSDP-based ontology development.
+Lets first try to transform the simple table above to OWL using the ODK pipeline (we always use IMP=false
to skip refreshing imports, which can be a lengthy process):
sh run.sh make ../patterns/definitions.owl -B IMP=false
+
This process will will create the ../patterns/definitions.owl
file, which is the file that contains all axioms generated by all templates you have configured. In our simple scenario, this means a simple single pattern. Let us look at definitions.owl in your favourite text editor first.
Tip: Remember, the `-B` tells `make` to run the make command no matter what - one of the advantages of `make` is that it only runs a command again if something changed, for example, you have added something to a DOSDP template table.
+
Tip: Looking at ontologies in text editors can be very useful, both to reviewing files and making changes! Do not be afraid, the ODK will ensure you wont break anything.
+
Let us look in particular at the following section of the definitions.owl file:
+# Class: <http://purl.obolibrary.org/obo/CATO_0000001> (http://purl.obolibrary.org/obo/PATO_0000333 coat of hair)
+
+AnnotationAssertion(<http://purl.obolibrary.org/obo/IAO_0000115> <http://purl.obolibrary.org/obo/CATO_0000001> "A coat of hair with a http://purl.obolibrary.org/obo/PATO_0000333 colour pattern."^^xsd:string)
+AnnotationAssertion(rdfs:label <http://purl.obolibrary.org/obo/CATO_0000001> "http://purl.obolibrary.org/obo/PATO_0000333 coat of hair"^^xsd:string)
+EquivalentClasses(<http://purl.obolibrary.org/obo/CATO_0000001> ObjectIntersectionOf(<http://purl.obolibrary.org/obo/UBERON_0010166> ObjectSomeValuesFrom(<http://purl.obolibrary.org/obo/RO_0000053> <http://purl.obolibrary.org/obo/PATO_0000333>)))
+
These are the three axioms / annotation assertions that were created by the DOSDP pipeline. The first annotation is a simple automatically generated definition. What is odd at first glance, is that the definition reads "A coat of hair with a http://purl.obolibrary.org/obo/PATO_0000333 colour pattern."
- what does the PATO:0000333
IRI do in the middle of our definition? Understanding this is fundamental to the DODSP pattern workflow, because it is likely that you will have to fix cases like this from time to time.
The DOSDP workflow is about generating axioms automatically from existing terms. For example, in this tutorial we are trying to generate terms for different kinds of hair coats for our cats, using the colour pattern
(PATO:0001533) hierarchy in the PATO ontology as a basis. The only one term we have added so far is spotted
(PATO:0000333). The problem is though, that dosdp-tools
, the tool which is part of the ODK and responsible for the DOSDP workflows, does not know anything about PATO:0000333 unless it is already imported into the ontology. In order to remedy this situation, lets import the term:
sh run.sh make refresh-pato
+
ODK will automatically see that you have used PATO:0000333 in your ontology, and import it for you. Next, let us make sure that the our edit file has the correct import configured. Open your ontology in a text editor, and make sure you can find the following import statement:
+Import(<http://purl.obolibrary.org/obo/cato/patterns/definitions.owl>)
+
Replace cato
in the PURL with whatever is the ID of your own ontology. Also, do not forget to update src/ontology/catalog-v001.xml
, by adding this line:
<group id="Folder Repository, directory=, recursive=false, Auto-Update=false, version=2" prefer="public" xml:base="">
+...
+<uri name="http://purl.obolibrary.org/obo/cato/patterns/definitions.owl" uri="../patterns/definitions.owl"/>
+...
+</group>
+
Important: Remember that we have not yet told dosdp-tools about the freshly imported PATO:0000333 term. To do that, lets run the DOSDP pipeline again:
+sh run.sh make ../patterns/definitions.owl -B IMP=false
+
A quick look at src/patterns/definitions.owl
would now reveal your correctly formatted definitions:
AnnotationAssertion(<http://purl.obolibrary.org/obo/IAO_0000115> <http://purl.obolibrary.org/obo/CATO_0000001> "A coat of hair with a spotted colour pattern."^^xsd:string)
+
Now, we are ready to view our ontology (the edit file, i.e. src/ontology/cato-edit.owl
) in Protege:
Still a few things to iron out - there is an UBERON term that we still need to import, and our class is not a subclass of the CATO root node
, but we had a good start.
Re-using terms is at the heart of the OBO philosophy, but when it comes to re-using axiom patterns, such as the ones we can define as part of a ROBOT template, we are (as of 2022) still in the early stages. One thing we can do to facilitate re-use is to share DOSDP templates between different projects. We do that by simply adding the URL at which the pattern is located to src/patterns/dosdp-patterns/external.txt
. Note: if you are copying a URL from GitHub, make sure it is the raw
url, i.e.:
https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/patterns/dosdp-patterns/entity_attribute.yaml
+
Here, we randomly decided to import a pattern defined by the Ontology of Biological Attributes (an ontology of traits such as tail length
or head size
), for example to represent cat traits in our Cat Ontology. After adding the above URL to our the external.txt
file, we can add it to our pipeline:
sh run.sh make update_patterns
+
You will now see the entity_attribute.yaml
template in src/patterns/dosdp-patterns
. We will not do anything with this template as part of this tutorial, so you can remove it again if you wish (by removing the URL from the external.txt
file and physically deleting the src/patterns/dosdp-patterns/entity_attribute.yaml
file).
Sometimes, we want to manage more than one DOSDP pipeline at once. For example, in more than one of our projects, we have some patterns that are automatically generated by software tools, and others that are manually curated by ontology developers. In other use cases, we sometimes want to restrict the pattern pipelines to generating only logical axioms. In either case, we can add new pipelines by adding the following to the src/ontology/youront-odk.yaml
file:
pattern_pipelines_group:
+ products:
+ - id: manual
+ dosdp_tools_options: "--obo-prefixes=true --restrict-axioms-to=logical"
+ - id: auto
+ dosdp_tools_options: "--obo-prefixes=true"
+
This does the following: It tells the ODK that you want
+id: cato
+title: "Cat Anatomy Ontology"
+github_org: obophenotype
+git_main_branch: main
+use_dosdps: TRUE
+repo: cat_anatomy_ontology
+release_artefacts:
+ - base
+ - full
+ - simple
+primary_release: full
+export_formats:
+ - owl
+ - obo
+ - json
+import_group:
+ products:
+ - id: ro
+ - id: pato
+ - id: omo
+robot_java_args: '-Xmx8G'
+pattern_pipelines_group:
+ products:
+ - id: manual
+ dosdp_tools_options: "--obo-prefixes=true --restrict-axioms-to=logical"
+ - id: auto
+ dosdp_tools_options: "--obo-prefixes=true"
+
Flag | +Explanation | +
---|---|
use_dosdps: TRUE | +Activates DOSDP in your ODK repository setup | +
pattern_pipelines_group: products: - id: manual dosdp_tools_options: "--obo-prefixes=true --restrict-axioms-to=logical" |
+Adding a manual pipeline to your DOSDP setup in which only logical axioms are generated. |
+
Dead Simple OWL Design patterns (DOSDP) is a templating system for documenting and generating new OWL classes. The templates themselves are designed to be human readable and easy to author. Separate tables (TSV files) are used to specify individual classes.
+The complete DOSDP documentation can be found here http://incatools.github.io/dead_simple_owl_design_patterns/.
+For another DOSDP tutorial see here.
+A DOSDP tempaltes are written in YAML) file, an easily editable format for encoding nested data structures. At the top level of nesting is a set of 'keys', which must match those specified in the DOSDP standard. The various types of key and their function are outlined below. Each key is followed by a colon and then a value, which may be a text string, a list or another set of keys. Lists items are indicated using a '-'. Nesting is achieved via indenting using some standard number of spaces (typically 3 or 4). Here's a little illustration:
+key1: some text
+key2:
+ - first list item (text; note the indent)
+ - second list item
+key3:
+ key_under_key3: some text
+ another_key_under_key3:
+ - first list item (text; note the indent)
+ - second list item
+ yet_another_key_under_key3:
+ key_under_yet_another_key_under_key3: some more text
+
In the following text, keys and values together are sometimes referred to as 'fields'.
+A set of fields that specify general information about a pattern: name, description, IRI, contributors, examples etc
+e.g.
+pattern_name: abnormalAnatomicalEntity
+pattern_iri: http://purl.obolibrary.org/obo/upheno/patterns/abnormalAnatomicalEntity.yaml
+description: "Any unspecified abnormality of an anatomical entity."
+
+contributors:
+ - https://orcid.org/0000-0002-9900-7880
+
A major aim of the DOSDP system is to produce self-contained, human-readable templates. Templates need IDs in order to be reliably used programatically, but templates that only use IDs are not human readable. DOSDPs therefore include a set of dictionaries that map labels to IDs. Strictly any readable name can be used, but by convention we use class labels. IDs must be OBO curie style e.g. CL:0000001).
+Separate dictionaries are required for classes, relations (object properties) & annotationProperties +e.g.
+classes:
+ quality: PATO:0000001
+ abnormal: PATO:0000460
+ anatomical entity: UBERON:0001062
+
+relations:
+ inheres_in_part_of: RO:0002314
+ has_modifier: RO:0002573
+ has_part: BFO:0000051
+
These fields specify the names of pattern variables (TSV column names) and map these to a range. e.g. This specifies a variable called 'anatomy' with the range 'anatomical entity':
+vars:
+ anatomy: "'anatomical entity'"
+
The var name (anatomy) corresponds to a column name in the table (TSV file) used in combination with this template, to generate new terms based on the template. The range specifies what type of term is allowed in this column - in this case 'anatomical entity' (UBERON:0001062; as specified in the dictionary) or one of its subclasses, e.g.-
+anatomy | +
---|
UBERON:0001154 | +
There are various types of variables:
+vars
are used to specify OWL classes (see example above). data_vars and data_list_vars are used to specify single pieces or data lists respectively. The range of data_vars is specified using XSD types. e.g.
data_vars:
+ number: xsd:int
+
+data_list_vars:
+ xrefs: xsd:string
+
A table used to specify classes following this pattern could have the following content. Note that in lists, multiple elements are separated by a '|'.
+number | +xrefs | +
---|---|
1 | +pubmed:123456|DOI:10.1016/j.cell.2016.07.054 | +
Template fields are where the content of classes produced by the template is specified. These mostly follow printf format: A text
field has variable slots specified using %s (for strings), %d for integers and %f for floats (decimals). Variables slots are filled, in order of appearance in the text, with values coming from a list of variables in an associated vars
field e.g.
name:
+ text: "%s of %s"
+ vars:
+ - neuron
+ - brain_region
+
If the value associated with the neuron var is (the class) 'glutamatergic neuron' and the value associated with the = 'brain region' var is 'primary motor cortext', this will generate a classes with the name (label) "glutamatergic neuron of primary motor cortex".
+DOSDPs include a set of convenience fields for annotation of classes that follow OBO conventions for field names and their mappings to OWL annotation properties. These include name
, def
, comment
, namespace
. When the value of a var is an OWL class, the name (label) of the var is used in the substitution. (see example above).
The annotation axioms generated by these template fields can be annotated. One OBO field exists for this purpose: xrefs
allows annotation with a list of references using the obo standard xref annotation property (curies)
e.g.
+data_list_vars:
+ xrefs: xsd:string
+
+def:
+ text: "Any %s that has a soma located in the %s"
+ vars:
+ - neuron
+ - brain_region
+ xrefs: xrefs
+
Where a single equivalent Class, subclassOf or GCI axiom is specified, you may use the keys 'EquivalentTo', 'subClassOf' or 'GCI' respectively. If multiple axioms of any type are needed, use the core field logical_axioms
.
annotations:
+ - annotationProperty:
+ text:
+ vars:
+ annotations: ...
+ - annotationProperty:
+ text:
+ vars:
+
+logical_axioms:
+ - axiom_type: subClassOf
+ text:
+ vars:
+ -
+ -
+ - axiom_type: subClassOf
+ text:
+ vars:
+ -
+ -
+ annotations:
+ - ...
+
TBA
+The Ontology Development Kit (ODK) comes with a few pre-configured workflows involving DOSDP templates. For a detailed tutorial see here.
+ + + + + + +Note: This is an updated Version of Jim Balhoff's DOSDP tutorial here.
+The main use case for dosdp-tools
(and the DOS-DP framework) is managing a set of ontology terms, which all follow a common logical pattern, by simply collecting the unique aspect of each term as a line in a spreadsheet. For example, we may be developing an ontology of environmental exposures. We would like to have terms in our ontology which represent exposure to a variety of stressors, such as chemicals, radiation, social stresses, etc.
To maximize reuse and facilitate data integration, we can build our exposure concepts by referencing terms from domain-specific ontologies, such as the Chemical Entities of Biological Interest Ontology (ChEBI) for chemicals. By modeling each exposure concept in the same way, we can use a reasoner to leverage the chemical classification provided by ChEBI to provide a classification for our exposure concepts. Since each exposure concept has a logical definition based on our data model for exposure, there is no need to manually manage the classification hierarchy. Let's say our model for exposure concepts holds that an "exposure" is an event with a particular input (the thing the subject is exposed to):
+'exposure to X' EquivalentTo 'exposure event' and 'has input' some X
If we need an ontology class to represent 'exposure to sarin' (bad news!), we can simply use the term sarin from ChEBI, and create a logical definition:
+'exposure to sarin' EquivalentTo 'exposure event' and 'has input' some sarin
We can go ahead and create some other concepts we need for our exposure data:
+'exposure to asbestos' EquivalentTo 'exposure event' and 'has input' some asbestos
'exposure to chemical substance' EquivalentTo 'exposure event' and 'has input' some 'chemical substance'
These definitions again can reference terms provided by ChEBI: asbestos and chemical substance
+Since the three concepts we've created all follow the same logical model, their hierarchical relationship can be logically determined by the relationships of the chemicals they reference. ChEBI asserts this structure for those terms:
+'chemical substance'
+ |
+ |
+ --------------
+ | |
+ | |
+sarin asbestos
+
Based on this, an OWL reasoner can automatically tell us the relationships between our exposure concepts:
+ 'exposure to chemical substance'
+ |
+ |
+ --------------------------
+ | |
+ | |
+'exposure to sarin' 'exposure to asbestos'
+
To support this, we simply need to declare the ChEBI OWL file as an owl:import
in our exposure ontology, and use an OWL reasoner such as ELK.
Creating terms by hand like we just did works fine, and relying on the reasoner for the classification will save us a lot of trouble and maintain correctness as our ontology grows. But since all the terms use the same logical pattern, it would be nice to keep this in one place; this will help make sure we always follow the pattern correctly when we create new concepts. We really only need to store the list of inputs (e.g. chemicals) in order to create all our exposure concepts. As we will see later, we may also want to manage separate sets of terms that follow other, different, patterns. To do this with dosdp-tools
, we need three main files: a pattern template, a spreadsheet of pattern fillers, and a source ontology. You will also usually need a file of prefix definitions so that the tool knows how to expand your shortened identifiers into IRIs.
For our chemical exposures, getting the source ontology is easy: just download chebi.owl. Note—it's about 450 MB.
+For our pattern fillers spreadsheet, we just need to make a tab-delimited file containing the chemical stressors for which we need exposure concepts. The file needs a column for the term IRI to be used for the generated class (this column is always called defined_class
), and also a column for the chemical to reference (choose a label according to your data model). It should look like this:
defined_class input
+EXPOSO:1 CHEBI:75701
+EXPOSO:2 CHEBI:46661
+EXPOSO:3 CHEBI:59999
+
The columns should be tab-separated—you can download a correctly formatted file to follow along. For now you will just maintain this file by hand, adding chemicals by looking up their ID in ChEBI, and manually choosing the next ID for your generated classes. In the future this may be simplified using the DOS-DP table editor, which is under development.
+The trickiest part to DOS-DP is creating your pattern template (but it's not so hard). Pattern templates are written in YAML, a simple file format based on keys and values. The keys are text labels; values can be plain values, another key-value structure, or a list. The DOS-DP schema specifies the keys and values which can be used in a pattern file. We'll use most of the common entries in this example. Read the comments (lines starting with #) for explanation of the various fields:
+# We can provide a name for this pattern here.
+pattern_name: exposure_with_input
+
+# In 'classes', we define the terms we will use in this pattern.
+# In the OBO community the terms often have numeric IDs, so here
+# we can provide human-readable names we can use further in the pattern.
+# The key is the name to be used; the value is the ID in prefixed form (i.e. a CURIE).
+classes:
+ exposure event: ExO:0000002
+ Thing: owl:Thing
+
+# Use 'relations' the same way as 'classes',
+# but for the object properties used in the pattern.
+relations:
+ has input: RO:0002233
+
+# The 'vars' section defines the various slots that can be
+# filled in for this pattern. We have only one, which we call 'input'.
+# The value is the range, meaning the class of things that are valid
+# values for this pattern. By specifying owl:Thing, we're allowing any
+# class to be provided as a variable filler. You need a column in your
+# spreadsheet for each variable defined here, in addition to the `defined class` column.
+vars:
+ input: "Thing"
+
+# We can provide a template for an `rdfs:label` value to generate
+# for our new term. dosdp-tools will search the source ontology
+# to find the label for the filler term, and fill it into the
+# name template in place of the %s.
+name:
+ text: "exposure to %s"
+ vars:
+ - input
+
+# This works the same as label generation, but instead creates
+# a definition annotation.
+def:
+ text: "A exposure event involving the interaction of an exposure receptor to %s. Exposure may be through a variety of means, including through the air or surrounding medium, or through ingestion."
+ vars:
+ - input
+
+# Here we can generate a logical axiom for our new concept. Create an
+# expression using OWL Manchester syntax. The expression can use any
+# of the terms defined at the beginning of the pattern. A reference
+# to the variable value will be inserted in place of the %s.
+equivalentTo:
+ text: "'exposure event' and 'has input' some %s"
+ vars:
+ - input
+
Download the pattern template file to follow along.
+Now we only need one more file before we can run dosdp-tools
. A file of prefix definitions (also in YAML format) will specify how to expand the CURIEs we used in our spreadsheet and pattern files:
EXPOSO: http://example.org/exposure/
+
Here we are specifying how to expand our EXPOSO
prefix (used in our spreadsheet defined_class
column). To expand the others, we'll pass a convenience option to dosdp-tools
, --obo-prefixes
, which will activate some predefined prefixes such as owl:
, and handle any other prefixes using the standard expansion for OBO IDs: http://purl.obolibrary.org/obo/PREFIX_
. Here's a link to the prefixes file.
Now we're all set to run dosdp-tools
! If you've downloaded or created all the necessary files, run this command to generate your ontology of exposures (assuming you've added the dosdp-tools
to your Unix PATH):
dosdp-tools generate --obo-prefixes=true --prefixes=prefixes.yaml --infile=exposure_with_input.tsv --template=exposure_with_input.yaml --ontology=chebi.owl --outfile=exposure_with_input.owl
+
This will apply the pattern to each line in your spreadsheet, and save the result in an ontology saved at exposure_with_input.owl
(it should look something like this). If you take a look at this ontology in a text editor or in Protégé, you'll see that it contains three classes, each with a generated label, text definition, and equivalent class definition. You're done!
Well... you're sort of done. But wouldn't it be nice if your exposure ontology included some information about the chemicals you referenced? Without this our reasoner can't classify our exposure concepts. As we said above, we could add an owl:import
declaration and load all of ChEBI, but your exposure ontology has three classes and ChEBI has over 120,000 classes. Instead, we can use the ROBOT tool to extract a module of just the relevant axioms from ChEBI. Later, we will also see how to use ROBOT to merge the outputs from multiple DOS-DP patterns into one ontology. You can download ROBOT from its homepage.
ROBOT has a few different methods for extracting a subset from an ontology. We'll use the Syntactic Locality Module Extractor (SLME) to get a set of axioms relevant to the ChEBI terms we've referenced. ROBOT will need a file containing the list of terms. We can use a Unix command to get these out of our spreadsheet file:
+sed '1d' exposure_with_input.tsv | cut -f 2 >inputs.txt
+
We'll end up with a simple list:
+CHEBI:75701
+CHEBI:46661
+CHEBI:59999
+
Now we can use ROBOT to extract an SLME bottom module for those terms out of ChEBI:
+robot extract --method BOT --input chebi.owl --term-file inputs.txt --output chebi_extract.owl
+
Our ChEBI extract only has 63 classes. Great! If you want, you can merge the ChEBI extract into your exposure ontology before releasing it to the public:
+robot merge --input exposure_with_input.owl --input chebi_extract.owl --output exposo.owl
+
Now you can open exposo.owl
in Protégé, run the reasoner, and see a correct classification for your exposure concepts! You may notice that your ontology is missing labels for ExO:0000002
('exposure event') and RO:0002233
('has input'). If you want, you can use ROBOT to extract that information from ExO and RO.
You will often want to generate ontology modules using more than one DOS-DP pattern. For example, you may want to organize environmental exposures by an additional axis of classification, such as exposure to substances with various biological roles, based on information provided by ChEBI. This requires a slightly different logical expression, so we'll make a new pattern:
+pattern_name: exposure_with_input_with_role
+
+classes:
+ exposure event: ExO:0000002
+ Thing: owl:Thing
+
+relations:
+ has input: RO:0002233
+ has role: RO:0000087
+
+vars:
+ input: "Thing"
+
+name:
+ text: "exposure to %s"
+ vars:
+ - input
+
+def:
+ text: "A exposure event involving the interaction of an exposure receptor to a substance with %s role. Exposure may be through a variety of means, including through the air or surrounding medium, or through ingestion."
+ vars:
+ - input
+
+equivalentTo:
+ text: "'exposure event' and 'has input' some ('has role' some %s)"
+ vars:
+ - input
+
Let's create an input file for this pattern, with a single filler, neurotoxin:
+defined_class input
+EXPOSO:4 CHEBI:50910
+
Now we can run dosdp-tools
for this pattern:
dosdp-tools generate --obo-prefixes --prefixes=prefixes.yaml --infile=exposure_with_input_with_role.tsv --template=exposure_with_input_with_role.yaml --ontology=chebi.owl --outfile=exposure_with_input_with_role.owl
+
We can re-run our ChEBI module extractor, first appending the terms used for this pattern to the ones we used for the first pattern:
+sed '1d' exposure_with_input_with_role.tsv | cut -f 2 >>inputs.txt
+
And then run robot extract
exactly as before:
robot extract --method BOT --input chebi.owl --term-file inputs.txt --output chebi_extract.owl
+
Now we just want to merge both of our generated modules, along with our ChEBI extract:
+robot merge --input exposure_with_input.owl --input exposure_with_input_with_role.owl --input chebi_extract.owl --output exposo.owl
+
If you open the new exposo.owl
in Protégé and run the reasoner, you'll now see 'exposure to sarin' classified under both 'exposure to chemical substance' and also 'exposure to neurotoxin'.
By using dosdp-tools
and robot
together, you can effectively develop ontologies which compose parts of ontologies from multiple domains using standard patterns. You will probably want to orchestrate the types of commands used in this tutorial within a Makefile, so that you can automate this process for easy repeatability.
Exomiser is a Java program that ranks potential rare Mendelian disease-causing variants from whole-exome or whole-genome sequencing data. Starting from a patient's VCF file and a set of phenotypes encoded using the Human Phenotype Ontology (HPO), it will annotate, filter and prioritise likely causative variants. The program does this based on user-defined criteria such as a variant's predicted pathogenicity, frequency of occurrence in a population and also how closely the given patient's phenotype matches any known phenotype of genes from human disease and model organism data.
+In this tutorial, we will learn how to install and run Exomiser with Docker, and how to interpret the results in various output formats detailing the predicted causative genes and variants. If you prefer to work locally, instructions are also provided below for Windows and Linux/Mac users.
+The complete Exomiser documentation can be found here, including some relevant references here, and the Exomiser GitHub repository here.
+Please note that this tutorial is up-to-date with the current latest release 13.2.0 and data version up to 2302 (Feb 2023).
+You know:
+You have:
+Docker installed and running on your machine. Check out this simple guide to set up Docker for Windows or Docker for Mac.
+We recommend to have Exomiser installed via Docker prior to the tutorial. Open a terminal and run the command below:
+docker pull exomiser/exomiser-cli:13.2.0
+
Alternatively:
+# download the data via
+wget https://github.com/iQuxLE/Exomiser-Tutorial/raw/main/Exomiser-Tutorial.zip
+# OR clone the repository
+git clone https://github.com/iQuxLE/Exomiser-Tutorial.git
+
+# unzip
+unzip Exomiser-Tutorial.zip
+
Exomiser-Tutorial
folder.
+# download
+wget https://github.com/iQuxLE/Exomiser-Tutorial/raw/main/pfeiffer-family-vcf.zip
+# unzip
+unzip pfeiffer-family-vcf.zip -d Exomiser-Tutorial/exomiser-config/
+
The Exomiser-Tutorial
folder contains a directory called exomiser-config
(with all the VCF and analysis files)
+and exomiser-overview
(with some introductory slides).
# create an empty directory for exomiser-data within the Exomiser-Tutorial folder:
+cd /path/to/Exomiser-Tutorial/
+mkdir exomiser-data
+cd exomiser-data
+# download the data
+wget https://data.monarchinitiative.org/exomiser/latest/2302_phenotype.zip # for the phenotype database
+wget https://data.monarchinitiative.org/exomiser/latest/2302_hg19.zip # for the hg19 variant database
+# unzip the data
+unzip "2302_*.zip"
+
Otherwise, visit the links and download the data in your own exomiser-data
directory:
Install 7-Zip for unzipping the database files. The built-in archiving software has issues extracting the zip files. Extract the database files (2302_phenotype.zip
, 2302_hg19.zip
) by right-clicking the archive and selecting 7-Zip > Extract files… into the exomiser-data
directory.
Your Exomiser-Tutorial
directory should now be structured as follows:
Exomiser-Tutorial
+ ├── exomiser-config
+ ├── exomiser-data
+ │ ├── 2302_hg19
+ │ └── 2302_phenotype
+ └── exomiser-overview
+ └── exomiser-tutorial-slides
+
For a quick overview of Exomiser take a look at the slides located in +the Google Drive +or GitHub repo.
+(recommended to be installed prior to the tutorial; if you run the command below again, you should receive the message "Image is up to date for exomiser/exomiser-cli:13.2.0")
+docker pull exomiser/exomiser-cli:13.2.0
+
exomiser-cli-13.2.0-distribution.zip
distribution from Monarch.2302_hg19.zip
and phenotype 2302_phenotype.zip
data files from Monarch.exomiser-cli-13.2.0-distribution.zip
and selecting 7-Zip > Extract Here2302_phenotype.zip
, 2302_hg19.zip
) by right-clicking the archive and selecting 7-Zip > Extract files… into the exomiser data directory. By default, Exomiser expects this to be ‘exomiser-cli-13.2.0/data’, but this can be changed in the application.properties.The following shell script should work:
+# download the distribution (won't take long)
+wget https://data.monarchinitiative.org/exomiser/latest/exomiser-cli-13.2.0-distribution.zip
+# download the data (this is ~80GB and will take a while). If you only require a single assembly, only download the relevant files.
+wget https://data.monarchinitiative.org/exomiser/latest/2302_hg19.zip
+wget https://data.monarchinitiative.org/exomiser/latest/2302_phenotype.zip
+# unzip the distribution and data files - this will create a directory called 'exomiser-cli-13.2.0' in the current working directory (with examples and application.properties)
+unzip exomiser-cli-13.2.0-distribution.zip
+unzip '2302_*.zip' -d exomiser-cli-13.2.0/data
+
The application.properties file needs to be updated to point to the correct location of the Exomiser data. For the purpose of this tutorial, this is already sorted, pointing to the mounted directory inside the Docker container exomiser.data-directory=/exomiser-data
.
Also, you want to make sure to edit the file to use the correct data version (currently 2302):
+ exomiser.hg19.data-version=2302
+ exomiser.phenotype.data-version=2302
+
For this tutorial, we will focus on running Exomiser on a single-sample (whole-exome) VCF file. Additional instructions for running Exomiser on multi-sample VCF data and large jobs are also provided below.
+It is recommended to provide Exomiser with the input sample as a Phenopacket. Exomiser will accept this in either JSON
+or YAML format. We will use the example pfeiffer-phenopacket.yml
below:
id: manuel
+subject:
+ id: manuel
+ sex: MALE
+phenotypicFeatures:
+ - type:
+ id: HP:0001159
+ label: Syndactyly
+ - type:
+ id: HP:0000486
+ label: Strabismus
+ - type:
+ id: HP:0000327
+ label: Hypoplasia of the maxilla
+ - type:
+ id: HP:0000520
+ label: Proptosis
+ - type:
+ id: HP:0000316
+ label: Hypertelorism
+ - type:
+ id: HP:0000244
+ label: Brachyturricephaly
+htsFiles:
+ - uri: exomiser/Pfeiffer.vcf.gz
+ htsFormat: VCF
+ genomeAssembly: hg19
+metaData:
+ created: '2019-11-12T13:47:51.948Z'
+ createdBy: julesj
+ resources:
+ - id: hp
+ name: human phenotype ontology
+ url: http://purl.obolibrary.org/obo/hp.owl
+ version: hp/releases/2019-11-08
+ namespacePrefix: HP
+ iriPrefix: 'http://purl.obolibrary.org/obo/HP_'
+ phenopacketSchemaVersion: 1.0
+
++NOTE: This is an example of a v1.0 phenopacket, there is a more recent release of v2.0. Exomiser can run +phenopackets built with either v1.0 or v2.0 schema. You can find out more about the v2.0 phenopacket schema and how to +build one with Python or Java here. To convert a phenopacket +v1.0 to v2.0, you can use phenopacket-tools.
+
Below are the default analysis settings from pfeiffer-analysis.yml
that we will use in our tutorial:
---
+analysis:
+ #FULL or PASS_ONLY
+ analysisMode: PASS_ONLY
+ # In cases where you do not want any cut-offs applied an empty map should be used e.g. inheritanceModes: {}
+ # These are the default settings, with values representing the maximum minor allele frequency in percent (%) permitted for an
+ # allele to be considered as a causative candidate under that mode of inheritance.
+ # If you just want to analyse a sample under a single inheritance mode, delete/comment-out the others. For AUTOSOMAL_RECESSIVE
+ # or X_RECESSIVE ensure *both* relevant HOM_ALT and COMP_HET modes are present.
+ inheritanceModes: {
+ AUTOSOMAL_DOMINANT: 0.1,
+ AUTOSOMAL_RECESSIVE_COMP_HET: 2.0,
+ AUTOSOMAL_RECESSIVE_HOM_ALT: 0.1,
+ X_DOMINANT: 0.1,
+ X_RECESSIVE_COMP_HET: 2.0,
+ X_RECESSIVE_HOM_ALT: 0.1,
+ MITOCHONDRIAL: 0.2
+ }
+ #Possible frequencySources:
+ #Thousand Genomes project http://www.1000genomes.org/
+ # THOUSAND_GENOMES,
+ #ESP project http://evs.gs.washington.edu/EVS/
+ # ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
+ #ExAC project http://exac.broadinstitute.org/about
+ # EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
+ # EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
+ # EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
+ # EXAC_OTHER
+ #Possible frequencySources:
+ #Thousand Genomes project - http://www.1000genomes.org/ (THOUSAND_GENOMES)
+ #TOPMed - https://www.nhlbi.nih.gov/science/precision-medicine-activities (TOPMED)
+ #UK10K - http://www.uk10k.org/ (UK10K)
+ #ESP project - http://evs.gs.washington.edu/EVS/ (ESP_)
+ # ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
+ #ExAC project http://exac.broadinstitute.org/about (EXAC_)
+ # EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
+ # EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
+ # EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
+ # EXAC_OTHER
+ #gnomAD - http://gnomad.broadinstitute.org/ (GNOMAD_E, GNOMAD_G)
+ frequencySources: [
+ THOUSAND_GENOMES,
+ TOPMED,
+ UK10K,
+
+ ESP_AFRICAN_AMERICAN, ESP_EUROPEAN_AMERICAN, ESP_ALL,
+
+ EXAC_AFRICAN_INC_AFRICAN_AMERICAN, EXAC_AMERICAN,
+ EXAC_SOUTH_ASIAN, EXAC_EAST_ASIAN,
+ EXAC_FINNISH, EXAC_NON_FINNISH_EUROPEAN,
+ EXAC_OTHER,
+
+ GNOMAD_E_AFR,
+ GNOMAD_E_AMR,
+ # GNOMAD_E_ASJ,
+ GNOMAD_E_EAS,
+ GNOMAD_E_FIN,
+ GNOMAD_E_NFE,
+ GNOMAD_E_OTH,
+ GNOMAD_E_SAS,
+
+ GNOMAD_G_AFR,
+ GNOMAD_G_AMR,
+ # GNOMAD_G_ASJ,
+ GNOMAD_G_EAS,
+ GNOMAD_G_FIN,
+ GNOMAD_G_NFE,
+ GNOMAD_G_OTH,
+ GNOMAD_G_SAS
+ ]
+ # Possible pathogenicitySources: (POLYPHEN, MUTATION_TASTER, SIFT), (REVEL, MVP), CADD, REMM
+ # REMM is trained on non-coding regulatory regions
+ # *WARNING* if you enable CADD or REMM ensure that you have downloaded and installed the CADD/REMM tabix files
+ # and updated their location in the application.properties. Exomiser will not run without this.
+ pathogenicitySources: [ REVEL, MVP ]
+ #this is the standard exomiser order.
+ #all steps are optional
+ steps: [
+ #hiPhivePrioritiser: {},
+ #priorityScoreFilter: {priorityType: HIPHIVE_PRIORITY, minPriorityScore: 0.500},
+ #intervalFilter: {interval: 'chr10:123256200-123256300'},
+ # or for multiple intervals:
+ #intervalFilter: {intervals: ['chr10:123256200-123256300', 'chr10:123256290-123256350']},
+ # or using a BED file - NOTE this should be 0-based, Exomiser otherwise uses 1-based coordinates in line with VCF
+ #intervalFilter: {bed: /full/path/to/bed_file.bed},
+ #genePanelFilter: {geneSymbols: ['FGFR1','FGFR2']},
+ failedVariantFilter: { },
+ #qualityFilter: {minQuality: 50.0},
+ variantEffectFilter: {
+ remove: [
+ FIVE_PRIME_UTR_EXON_VARIANT,
+ FIVE_PRIME_UTR_INTRON_VARIANT,
+ THREE_PRIME_UTR_EXON_VARIANT,
+ THREE_PRIME_UTR_INTRON_VARIANT,
+ NON_CODING_TRANSCRIPT_EXON_VARIANT,
+ UPSTREAM_GENE_VARIANT,
+ INTERGENIC_VARIANT,
+ REGULATORY_REGION_VARIANT,
+ CODING_TRANSCRIPT_INTRON_VARIANT,
+ NON_CODING_TRANSCRIPT_INTRON_VARIANT,
+ DOWNSTREAM_GENE_VARIANT
+ ]
+ },
+ # removes variants represented in the database
+ #knownVariantFilter: {},
+ frequencyFilter: {maxFrequency: 2.0},
+ pathogenicityFilter: {keepNonPathogenic: true},
+ # inheritanceFilter and omimPrioritiser should always run AFTER all other filters have completed
+ inheritanceFilter: {},
+ # omimPrioritiser isn't mandatory.
+ omimPrioritiser: {},
+ #priorityScoreFilter: {minPriorityScore: 0.4},
+ # Other prioritisers: Only combine omimPrioritiser with one of these.
+ # Don't include any if you only want to filter the variants.
+ hiPhivePrioritiser: {},
+ # or run hiPhive in benchmarking mode:
+ #hiPhivePrioritiser: {runParams: 'mouse'},
+ #phivePrioritiser: {}
+ #phenixPrioritiser: {}
+ #exomeWalkerPrioritiser: {seedGeneIds: [11111, 22222, 33333]}
+ ]
+outputOptions:
+ outputContributingVariantsOnly: false
+ #numGenes options: 0 = all or specify a limit e.g. 500 for the first 500 results
+ numGenes: 0
+ #minExomiserGeneScore: 0.7
+ # Path to the desired output directory. Will default to the 'results' subdirectory of the exomiser install directory
+ outputDirectory: results
+ # Filename for the output files. Will default to {input-vcf-filename}-exomiser
+ outputFileName: Pfeiffer-HIPHIVE-exome
+ #out-format options: HTML, JSON, TSV_GENE, TSV_VARIANT, VCF (default: HTML)
+ outputFormats: [HTML, JSON, TSV_GENE, TSV_VARIANT]
+
docker run -it -v "/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data" \
+-v "/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser" \
+-v "/path/to/Exomiser-Tutorial/exomiser-results:/results" \
+exomiser/exomiser-cli:13.2.0 \
+--sample /exomiser/pfeiffer-phenopacket.yml \
+--analysis /exomiser/pfeiffer-analysis.yml \
+--spring.config.location=/exomiser/application.properties
+
This command will produce Pfeiffer-HIPHIVE-exome.html
, Pfeiffer-HIPHIVE-exome.json
, Pfeiffer-HIPHIVE-exome.genes.tsv
and Pfeiffer-HIPHIVE-exome.variants.tsv
in your exomiser-results
directory.
Assuming that you are within the exomiser-cli-13.2.0
distribution folder:
java -jar exomiser-cli-13.2.0.jar --sample examples/pfeiffer-phenopacket.yml \
+--analysis examples/exome-analysis.yml --output examples/output-options.yml
+
When analysing a multi-sample VCF file, you must detail the pedigree information in a phenopacket describing a Family +object:
+e.g. Exomiser-Tutorial/exomiser-config/pfeiffer-family.yml
id: ISDBM322017-family
+proband:
+ subject:
+ id: ISDBM322017
+ sex: FEMALE
+ phenotypicFeatures:
+ - type:
+ id: HP:0001159
+ label: Syndactyly
+ - type:
+ id: HP:0000486
+ label: Strabismus
+ - type:
+ id: HP:0000327
+ label: Hypoplasia of the maxilla
+ - type:
+ id: HP:0000520
+ label: Proptosis
+ - type:
+ id: HP:0000316
+ label: Hypertelorism
+ - type:
+ id: HP:0000244
+ label: Brachyturricephaly
+pedigree:
+ persons:
+ - individualId: ISDBM322017
+ paternalId: ISDBM322016
+ maternalId: ISDBM322018
+ sex: FEMALE
+ affectedStatus: AFFECTED
+ - individualId: ISDBM322015
+ paternalId: ISDBM322016
+ maternalId: ISDBM322018
+ sex: MALE
+ affectedStatus: UNAFFECTED
+ - individualId: ISDBM322016
+ sex: MALE
+ affectedStatus: UNAFFECTED
+ - individualId: ISDBM322018
+ sex: FEMALE
+ affectedStatus: UNAFFECTED
+htsFiles:
+ - uri: exomiser/Pfeiffer-quartet.vcf.gz
+ htsFormat: VCF
+ genomeAssembly: GRCh37
+metaData:
+ created: '2019-11-12T13:47:51.948Z'
+ createdBy: julesj
+ resources:
+ - id: hp
+ name: human phenotype ontology
+ url: http://purl.obolibrary.org/obo/hp.owl
+ version: hp/releases/2019-11-08
+ namespacePrefix: HP
+ iriPrefix: 'http://purl.obolibrary.org/obo/HP_'
+ phenopacketSchemaVersion: 1.0
+
Running via Docker:
+docker run -it -v '/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data' \
+-v '/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser' \
+-v '/path/to/Exomiser-Tutorial/exomiser-results:/results' \
+exomiser/exomiser-cli:13.2.0 \
+--sample /exomiser/pfeiffer-family.yml \
+--analysis /exomiser/pfeiffer-analysis.yml \
+--spring.config.location=/exomiser/application.properties
+
Running locally:
+Assuming that you are within the exomiser-cli-13.2.0
distribution folder
java -jar exomiser-cli-13.2.0.jar --sample examples/pfeiffer-family.yml --analysis examples/exome-analysis.yml --output examples/output-options.yml
+
The above commands can be added to a batch file for example in the
+file Exomiser-Tutorial/exomiser-config/test-analysis-batch-commands.txt
. Using it with Docker we recommend creating a
+new directory for the batch files and mounting that to the Docker container.
Running via Docker:
+docker run -it -v '/path/to/Exomiser-Tutorial/exomiser-data:/exomiser-data' \
+-v '/path/to/Exomiser-Tutorial/exomiser-config/:/exomiser' \
+-v '/path/to/Exomiser-Tutorial/exomiser-results:/results' \
+-v '/path/to/Exomiser-Tutorial/exomiser-batch-files:/batch-files' \
+exomiser/exomiser-cli:13.2.0 \
+--batch /batch-files/test-analysis-batch-commands.txt
+--spring.config.location=/exomiser/application.properties
+
Running locally:
+Assuming that you are within the exomiser-cli-13.2.0
distribution folder
java -jar exomiser-cli-13.2.0.jar --batch examples/test-analysis-batch-commands.txt
+
The advantage of this is that a single command will be able to analyse many samples in far less time than starting a new +JVM for each as there will be no start-up penalty after the initial start and the Java JIT compiler will be able to take +advantage of a longer-running process to optimise the runtime code. For maximum throughput on a cluster consider +splitting your batch jobs over multiple nodes.
+Depending on the output options provided, Exomiser will write out at least an HTML and JSON output file in the results
subdirectory of the Exomiser installation (by default) or a user-defined results directory as indicated in the output options.
As a general rule, all output files contain a ranked list of genes and variants with the top-ranked gene/variant displayed first. The exception being the VCF output (if requested in the output options; not requested in this tutorial) which, since version 13.1.0, is sorted according to VCF convention and tabix indexed.
+In our tutorial, we requested HTML, JSON, TSV_VARIANT and TSV_GENE output formats which are briefly outlined below.
+A few relevant screenshots from Pfeiffer-HIPHIVE-exome.html
:
+
+
+
The JSON file represents the most accurate representation of the results, as it is referenced internally by Exomiser. As +such, we don’t provide a schema for this, but it has been pretty stable and breaking changes will only occur with major +version changes to the software. Minor additions are to be expected for minor releases, as per the SemVer specification.
+We recommend using Python or JQ to extract data from this file. To give you an idea of how you can extract some data with Python, we have provided examples of how you can iterate over the results below. However, there is a lot more information content that you can pull out from the JSON results file, this only provides a snippet of what you can do.
+# import json library
+import json
+
+# to load in the exomiser json result
+with open("path/to/Exomiser-Tutorial/Pfeiffer-HIPHIVE-exome.json") as exomiser_json_result:
+ exomiser_result = json.load(exomiser_json_result)
+exomiser_json_result.close()
+
+# to retrieve all predicted genes and corresponding identifier (ENSEMBL)
+gene_results = []
+for result in exomiser_result:
+ gene_results.append({result["geneSymbol"]: result["geneIdentifier"]["geneId"]})
+
+# to retrieve all predicted variants
+variant_results = []
+for result in exomiser_result:
+ for moi in result["geneScores"]: # iterating over all modes of inheritance
+ if "contributingVariants" in moi: # checking if there is evidence of contributing variants
+ for cv in moi["contributingVariants"]: # iterating over all contributing variants
+ variant_results.append({"chromosome": cv["contigName"],
+ "start_pos": cv["start"],
+ "end_pos": cv["end"],
+ "ref_allele": cv["ref"],
+ "alt_allele": cv["alt"]})
+
In the Pfeiffer-HIPHIVE-exome.variants.tsv
file it is possible for a variant to appear multiple times, depending on the MOI it is compatible with. For example, in the excerpt of the file below, MUC6 has two variants ranked 7th under the AD model and two ranked 8th under an AR (compound heterozygous) model. In the AD case the CONTRIBUTING_VARIANT column indicates whether the variant was (1) or wasn't (0) used for calculating the EXOMISER_GENE_COMBINED_SCORE and EXOMISER_GENE_VARIANT_SCORE.
#RANK ID GENE_SYMBOL ENTREZ_GENE_ID MOI P-VALUE EXOMISER_GENE_COMBINED_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE EXOMISER_VARIANT_SCORE CONTRIBUTING_VARIANT WHITELIST_VARIANT VCF_ID RS_ID CONTIG START END REF ALT CHANGE_LENGTH QUAL FILTER GENOTYPE FUNCTIONAL_CLASS HGVS EXOMISER_ACMG_CLASSIFICATION EXOMISER_ACMG_EVIDENCE EXOMISER_ACMG_DISEASE_ID EXOMISER_ACMG_DISEASE_NAME CLINVAR_ALLELE_ID CLINVAR_PRIMARY_INTERPRETATION CLINVAR_STAR_RATING GENE_CONSTRAINT_LOEUF GENE_CONSTRAINT_LOEUF_LOWER GENE_CONSTRAINT_LOEUF_UPPER MAX_FREQ_SOURCE MAX_FREQ ALL_FREQ MAX_PATH_SOURCE MAX_PATH ALL_PATH
+1 10-123256215-T-G_AD FGFR2 2263 AD 0.0000 0.9957 0.9187 1.0000 1.0000 1 1 rs121918506 10 123256215 123256215 T G 0 900.0000 PASS 0/1 missense_variant FGFR2:ENST00000346997.2:c.1688A>C:p.(Glu563Ala) PATHOGENIC PM2,PP3_Strong,PP4,PP5_Strong ORPHA:87 Apert syndrome 28333 PATHOGENIC_OR_LIKELY_PATHOGENIC 2 0.13692 0.074 0.27 REVEL 0.965 REVEL=0.965,MVP=0.9517972
+2 5-71755984-C-G_AD ZNF366 167465 AD 0.0018 0.9237 0.8195 0.7910 0.7910 1 0 rs375204168 5 71755984 71755984 C G 0 380.8900 PASS 0/1 splice_region_variant ZNF366:ENST00000318442.5:c.1332+8G>C:p.? UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.27437 0.155 0.515 EXAC_AMERICAN 0.07975895 THOUSAND_GENOMES=0.01997,TOPMED=0.01096,ESP_EUROPEAN_AMERICAN=0.0116,ESP_ALL=0.0077,EXAC_AMERICAN=0.07975895,EXAC_NON_FINNISH_EUROPEAN=0.010914307,GNOMAD_E_AMR=0.07153929,GNOMAD_E_NFE=0.010890082,GNOMAD_E_OTH=0.018328445
+3 16-2150254-G-A_AD PKD1 5310 AD 0.0050 0.8272 0.6597 0.8707 0.8707 1 0 rs147967021 16 2150254 2150254 G A 0 406.0800 PASS 0/1 missense_variant PKD1:ENST00000262304.4:c.9625C>T:p.(Arg3209Cys) UNCERTAIN_SIGNIFICANCE 1319391 UNCERTAIN_SIGNIFICANCE 1 0.12051 0.082 0.179 EXAC_AMERICAN 0.06979585 THOUSAND_GENOMES=0.01997,TOPMED=0.007934,EXAC_AMERICAN=0.06979585,EXAC_NON_FINNISH_EUROPEAN=0.0015655332,EXAC_SOUTH_ASIAN=0.012149192,GNOMAD_E_AFR=0.006708708,GNOMAD_E_AMR=0.05070389,GNOMAD_E_NFE=0.002718672,GNOMAD_E_SAS=0.013009822,GNOMAD_G_AFR=0.011462632 MVP 0.8792868 REVEL=0.346,MVP=0.8792868
+4 3-56653839-CTG-C_AD CCDC66 285331 AD 0.0051 0.8262 0.5463 0.9984 0.9984 1 0 rs751329549 3 56653839 56653841 CTG C -2 1872.9400 PASS 0/1 frameshift_truncation CCDC66:ENST00000326595.7:c.2572_2573del:p.(Val858Glnfs*6) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.9703 0.78 1.215 GNOMAD_E_AMR 0.011914691 TOPMED=7.556E-4,EXAC_EAST_ASIAN=0.01155535,EXAC_NON_FINNISH_EUROPEAN=0.0015023135,GNOMAD_E_AMR=0.011914691,GNOMAD_E_EAS=0.0057977736,GNOMAD_E_NFE=8.988441E-4
+5 13-110855918-C-G_AD COL4A1 1282 AD 0.0075 0.7762 0.5288 0.9838 0.9838 1 0 rs150182714 13 110855918 110855918 C G 0 1363.8700 PASS 0/1 missense_variant COL4A1:ENST00000375820.4:c.994G>C:p.(Gly332Arg) UNCERTAIN_SIGNIFICANCE PP3_Moderate OMIM:175780 Brain small vessel disease with or without ocular anomalies 333515 CONFLICTING_PATHOGENICITY_INTERPRETATIONS 1 0.065014 0.035 0.128 ESP_EUROPEAN_AMERICAN 0.0233 THOUSAND_GENOMES=0.01997,TOPMED=0.0068,ESP_EUROPEAN_AMERICAN=0.0233,ESP_ALL=0.0154,EXAC_AFRICAN_INC_AFRICAN_AMERICAN=0.009609841,EXAC_NON_FINNISH_EUROPEAN=0.007491759,GNOMAD_E_AFR=0.013068479,GNOMAD_E_NFE=0.0071611437,GNOMAD_G_NFE=0.013324451 MVP 0.9869305 REVEL=0.886,MVP=0.9869305
+6 6-132203615-G-A_AD ENPP1 5167 AD 0.0079 0.7695 0.5112 0.9996 0.9996 1 0 rs770775549 6 132203615 132203615 G A 0 922.9800 PASS 0/1 splice_donor_variant ENPP1:ENST00000360971.2:c.2230+1G>A:p.? UNCERTAIN_SIGNIFICANCE PVS1_Strong NOT_PROVIDED 0 0.41042 0.292 0.586 GNOMAD_E_SAS 0.0032486517 TOPMED=7.556E-4,EXAC_NON_FINNISH_EUROPEAN=0.0014985314,GNOMAD_E_NFE=0.0017907989,GNOMAD_E_SAS=0.0032486517
+7 11-1018088-TG-T_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.9990 1 0 rs765231061 11 1018088 1018089 TG T -1 441.8100 PASS 0/1 frameshift_variant MUC6:ENST00000421673.2:c.4712del:p.(Pro1571Hisfs*21) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.0070363074 GNOMAD_E_AMR=0.0030803352,GNOMAD_G_NFE=0.0070363074
+7 11-1018093-G-GT_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.9989 0 0 rs376177791 11 1018093 1018093 G GT 1 592.4500 PASS 0/1 frameshift_elongation MUC6:ENST00000421673.2:c.4707dup:p.(Pro1570Thrfs*136) NOT_AVAILABLE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.007835763 GNOMAD_G_NFE=0.007835763
+8 11-1018088-TG-T_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.9990 1 0 rs765231061 11 1018088 1018089 TG T -1 441.8100 PASS 0/1 frameshift_variant MUC6:ENST00000421673.2:c.4712del:p.(Pro1571Hisfs*21) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.0070363074 GNOMAD_E_AMR=0.0030803352,GNOMAD_G_NFE=0.0070363074
+8 11-1018093-G-GT_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.9989 1 0 rs376177791 11 1018093 1018093 G GT 1 592.4500 PASS 0/1 frameshift_elongation MUC6:ENST00000421673.2:c.4707dup:p.(Pro1570Thrfs*136) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.79622 0.656 0.971 GNOMAD_G_NFE 0.007835763 GNOMAD_G_NFE=0.007835763
+9 7-44610376-G-A_AD DDX56 54606 AD 0.0091 0.7545 0.5036 0.9992 0.9992 1 0 rs774566321 7 44610376 44610376 G A 0 586.6600 PASS 0/1 stop_gained DDX56:ENST00000258772.5:c.991C>T:p.(Arg331*) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.56071 0.379 0.852 EXAC_SOUTH_ASIAN 0.006114712 EXAC_SOUTH_ASIAN=0.006114712,GNOMAD_E_SAS=0.0032509754
+10 14-96730313-G-A_AD BDKRB1 623 AD 0.0093 0.7525 0.5018 1.0000 1.0000 1 0 14 96730313 96730313 G A 0 378.2200 PASS 0/1 stop_gained BDKRB1:ENST00000216629.6:c.294G>A:p.(Trp98*) UNCERTAIN_SIGNIFICANCE NOT_PROVIDED 0 0.52212 0.272 1.097
+
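As a minimal sketch of working with this file programmatically (standard library only; the file is tab-separated and the column names are as shown in the header above), the rows flagged as contributing variants can be collected like this:

import csv

# keep only the rows whose variant contributed to the gene score
with open("path/to/Exomiser-Tutorial/Pfeiffer-HIPHIVE-exome.variants.tsv") as tsv:
    reader = csv.DictReader(tsv, delimiter="\t")
    contributing = [row for row in reader if row["CONTRIBUTING_VARIANT"] == "1"]

print(len(contributing), "contributing variants")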
In the Pfeiffer-HIPHIVE-exome.genes.tsv
file, all the various phenotypic scores and HPO matches from the HUMAN, MOUSE, FISH and PPI comparisons are reported for each gene. It is possible for a gene to appear multiple times, depending on the MOI it is compatible with, given the filtered variants. For example, in the excerpt below, MUC6 is ranked 7th under the AD model and 8th under an AR model.
+
#RANK ID GENE_SYMBOL ENTREZ_GENE_ID MOI P-VALUE EXOMISER_GENE_COMBINED_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE HUMAN_PHENO_SCORE MOUSE_PHENO_SCORE FISH_PHENO_SCORE WALKER_SCORE PHIVE_ALL_SPECIES_SCORE OMIM_SCORE MATCHES_CANDIDATE_GENE HUMAN_PHENO_EVIDENCE MOUSE_PHENO_EVIDENCE FISH_PHENO_EVIDENCE HUMAN_PPI_EVIDENCE MOUSE_PPI_EVIDENCE FISH_PPI_EVIDENCE
+1 FGFR2_AD FGFR2 2263 AD 0.0000 0.9957 0.9187 1.0000 0.8671 0.9187 0.0000 0.5057 0.9187 1.0000 0 Apert syndrome (ORPHA:87): Syndactyly (HP:0001159)-Toe syndactyly (HP:0001770), Strabismus (HP:0000486)-Strabismus (HP:0000486), Hypoplasia of the maxilla (HP:0000327)-Hypoplasia of the maxilla (HP:0000327), Proptosis (HP:0000520)-Proptosis (HP:0000520), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachyturricephaly (HP:0000244), Strabismus (HP:0000486)-ocular hypertelorism (MP:0001300), Hypoplasia of the maxilla (HP:0000327)-short maxilla (MP:0000097), Proptosis (HP:0000520)-exophthalmos (MP:0002750), Hypertelorism (HP:0000316)-ocular hypertelorism (MP:0001300), Brachyturricephaly (HP:0000244)-abnormal frontal bone morphology (MP:0000107), Proximity to FGF18 Syndactyly (HP:0001159)-abnormal metatarsal bone morphology (MP:0003072), Strabismus (HP:0000486)-abnormal neurocranium morphology (MP:0000074), Hypoplasia of the maxilla (HP:0000327)-maxilla hypoplasia (MP:0000457), Proptosis (HP:0000520)-abnormal neurocranium morphology (MP:0000074), Hypertelorism (HP:0000316)-abnormal neurocranium morphology (MP:0000074), Brachyturricephaly (HP:0000244)-abnormal neurocranium morphology (MP:0000074),
+2 ZNF366_AD ZNF366 167465 AD 0.0018 0.9237 0.8195 0.7910 0.0000 0.8195 0.0000 0.5015 0.8195 1.0000 0 Syndactyly (HP:0001159)-syndactyly (MP:0000564), Strabismus (HP:0000486)-microphthalmia (MP:0001297), Hypoplasia of the maxilla (HP:0000327)-micrognathia (MP:0002639), Proptosis (HP:0000520)-microphthalmia (MP:0001297), Hypertelorism (HP:0000316)-microphthalmia (MP:0001297), Brachyturricephaly (HP:0000244)-microphthalmia (MP:0001297), Proximity to CTBP1 associated with Wolf-Hirschhorn syndrome (ORPHA:280): Syndactyly (HP:0001159)-Arachnodactyly (HP:0001166), Strabismus (HP:0000486)-Strabismus (HP:0000486), Hypoplasia of the maxilla (HP:0000327)-Micrognathia (HP:0000347), Proptosis (HP:0000520)-Proptosis (HP:0000520), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Calvarial skull defect (HP:0001362),
+3 PKD1_AD PKD1 5310 AD 0.0050 0.8272 0.6597 0.8707 0.0000 0.6597 0.2697 0.5069 0.6597 1.0000 0 Strabismus (HP:0000486)-micrognathia (MP:0002639), Hypoplasia of the maxilla (HP:0000327)-micrognathia (MP:0002639), Proptosis (HP:0000520)-micrognathia (MP:0002639), Hypertelorism (HP:0000316)-micrognathia (MP:0002639), Brachyturricephaly (HP:0000244)-micrognathia (MP:0002639), Hypoplasia of the maxilla (HP:0000327)-mandibular arch skeleton malformed, abnormal (ZP:0001708), Proximity to IFT88 associated with Retinitis pigmentosa (ORPHA:791): Strabismus (HP:0000486)-Ophthalmoplegia (HP:0000602), Hypoplasia of the maxilla (HP:0000327)-Wide nasal bridge (HP:0000431), Proximity to IFT88 Syndactyly (HP:0001159)-polydactyly (MP:0000562), Strabismus (HP:0000486)-supernumerary molars (MP:0010773), Hypoplasia of the maxilla (HP:0000327)-supernumerary molars (MP:0010773), Proptosis (HP:0000520)-supernumerary molars (MP:0010773), Hypertelorism (HP:0000316)-supernumerary molars (MP:0010773), Brachyturricephaly (HP:0000244)-abnormal coronal suture morphology (MP:0003840),
+4 CCDC66_AD CCDC66 285331 AD 0.0051 0.8262 0.5463 0.9984 0.0000 0.5463 0.0000 0.0000 0.5463 1.0000 0 Strabismus (HP:0000486)-abnormal cone electrophysiology (MP:0004022), Hypoplasia of the maxilla (HP:0000327)-abnormal rod electrophysiology (MP:0004021), Proptosis (HP:0000520)-abnormal rod electrophysiology (MP:0004021), Hypertelorism (HP:0000316)-abnormal rod electrophysiology (MP:0004021), Brachyturricephaly (HP:0000244)-abnormal retina photoreceptor layer morphology (MP:0003728),
+5 COL4A1_AD COL4A1 1282 AD 0.0075 0.7762 0.5288 0.9838 0.3882 0.5288 0.0000 0.5047 0.5288 1.0000 0 Brain small vessel disease with or without ocular anomalies (OMIM:175780): Strabismus (HP:0000486)-Exotropia (HP:0000577), Strabismus (HP:0000486)-buphthalmos (MP:0009274), Hypoplasia of the maxilla (HP:0000327)-abnormal cornea morphology (MP:0001312), Proptosis (HP:0000520)-abnormal cornea morphology (MP:0001312), Hypertelorism (HP:0000316)-abnormal cornea morphology (MP:0001312), Brachyturricephaly (HP:0000244)-abnormal retina morphology (MP:0001325), Proximity to COL7A1 associated with Localized dystrophic epidermolysis bullosa, pretibial form (ORPHA:79410): Syndactyly (HP:0001159)-Nail dystrophy (HP:0008404), Hypoplasia of the maxilla (HP:0000327)-Carious teeth (HP:0000670), Proximity to COL7A1 Syndactyly (HP:0001159)-abnormal digit morphology (MP:0002110), Strabismus (HP:0000486)-abnormal tongue morphology (MP:0000762), Hypoplasia of the maxilla (HP:0000327)-abnormal tongue morphology (MP:0000762), Proptosis (HP:0000520)-abnormal tongue morphology (MP:0000762), Hypertelorism (HP:0000316)-abnormal tongue morphology (MP:0000762),
+6 ENPP1_AD ENPP1 5167 AD 0.0079 0.7695 0.5112 0.9996 0.3738 0.5112 0.0000 0.5044 0.5112 1.0000 0 Autosomal recessive hypophosphatemic rickets (ORPHA:289176): Hypoplasia of the maxilla (HP:0000327)-Tooth abscess (HP:0030757), Brachyturricephaly (HP:0000244)-Craniosynostosis (HP:0001363), Syndactyly (HP:0001159)-abnormal elbow joint morphology (MP:0013945), Strabismus (HP:0000486)-abnormal retina morphology (MP:0001325), Hypoplasia of the maxilla (HP:0000327)-abnormal snout skin morphology (MP:0030533), Proptosis (HP:0000520)-abnormal retina morphology (MP:0001325), Hypertelorism (HP:0000316)-abnormal retina morphology (MP:0001325), Brachyturricephaly (HP:0000244)-abnormal retina morphology (MP:0001325), Proximity to DMP1 associated with Autosomal recessive hypophosphatemic rickets (ORPHA:289176): Hypoplasia of the maxilla (HP:0000327)-Tooth abscess (HP:0030757), Brachyturricephaly (HP:0000244)-Craniosynostosis (HP:0001363), Proximity to DMP1 Syndactyly (HP:0001159)-abnormal long bone hypertrophic chondrocyte zone (MP:0000165), Strabismus (HP:0000486)-abnormal dental pulp cavity morphology (MP:0002819), Hypoplasia of the maxilla (HP:0000327)-abnormal dental pulp cavity morphology (MP:0002819), Proptosis (HP:0000520)-abnormal dental pulp cavity morphology (MP:0002819), Hypertelorism (HP:0000316)-abnormal dental pulp cavity morphology (MP:0002819), Brachyturricephaly (HP:0000244)-abnormal dental pulp cavity morphology (MP:0002819),
+7 MUC6_AD MUC6 4588 AD 0.0089 0.7563 0.5046 0.9990 0.0000 0.0000 0.0000 0.5046 0.5046 1.0000 0 Proximity to GALNT2 associated with Congenital disorder of glycosylation, type IIt (OMIM:618885): Syndactyly (HP:0001159)-Sandal gap (HP:0001852), Strabismus (HP:0000486)-Alternating exotropia (HP:0031717), Hypoplasia of the maxilla (HP:0000327)-Tented upper lip vermilion (HP:0010804), Proptosis (HP:0000520)-Hypertelorism (HP:0000316), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachycephaly (HP:0000248),
+8 MUC6_AR MUC6 4588 AR 0.0089 0.7562 0.5046 0.9990 0.0000 0.0000 0.0000 0.5046 0.5046 1.0000 0 Proximity to GALNT2 associated with Congenital disorder of glycosylation, type IIt (OMIM:618885): Syndactyly (HP:0001159)-Sandal gap (HP:0001852), Strabismus (HP:0000486)-Alternating exotropia (HP:0031717), Hypoplasia of the maxilla (HP:0000327)-Tented upper lip vermilion (HP:0010804), Proptosis (HP:0000520)-Hypertelorism (HP:0000316), Hypertelorism (HP:0000316)-Hypertelorism (HP:0000316), Brachyturricephaly (HP:0000244)-Brachycephaly (HP:0000248),
+9 DDX56_AD DDX56 54606 AD 0.0091 0.7545 0.5036 0.9992 0.0000 0.0000 0.3788 0.5036 0.5036 1.0000 0 Brachyturricephaly (HP:0000244)-head decreased width, abnormal (ZP:0000407), Proximity to PAK1IP1 Strabismus (HP:0000486)-abnormal maxilla morphology (MP:0000455), Hypoplasia of the maxilla (HP:0000327)-abnormal maxilla morphology (MP:0000455), Proptosis (HP:0000520)-abnormal maxilla morphology (MP:0000455), Hypertelorism (HP:0000316)-abnormal maxilla morphology (MP:0000455), Brachyturricephaly (HP:0000244)-decreased forebrain size (MP:0012138),
+10 BDKRB1_AD BDKRB1 623 AD 0.0093 0.7525 0.5018 1.0000 0.0000 0.0000 0.0000 0.5018 0.5018 1.0000 0 Proximity to OPN4 Strabismus (HP:0000486)-abnormal visual pursuit (MP:0006156), Hypoplasia of the maxilla (HP:0000327)-abnormal visual pursuit (MP:0006156), Proptosis (HP:0000520)-abnormal visual pursuit (MP:0006156), Hypertelorism (HP:0000316)-abnormal visual pursuit (MP:0006156), Brachyturricephaly (HP:0000244)-abnormal retina ganglion cell morphology (MP:0008056),
+
Follow this link and download the Docker.dmg for your operating +system.
+ +The Docker.dmg will be found in your /Downloads
directory.
After double-clicking on the Docker.dmg a new window will come up:
+ +Drag and drop the Docker app into your /Applications
folder.
+Double-click on the Docker symbol.
+Docker Desktop will start in the background, after you allow it to be opened.
Additionally, this window will come up asking you to accept the Docker subscription service agreement.
After running the installation, restart your terminal and check the Docker installation from inside your terminal with:
+docker --version
+
If the output gives you a version and no error, you are ready to go. If you see an error and have not yet restarted your terminal, do this now and the error should be fixed.
+In case you get an error message like this, please ensure you have downloaded the correct docker.dmg
.
Now, whenever you want to pull images, make sure that Docker is running in the background. Otherwise, you may get an error stating that it is not able to connect to the Docker daemon.
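As a quick check that the daemon is reachable (a minimal example; hello-world is Docker's own tiny test image), you can run:

docker run hello-world

If Docker is running, this pulls the image and prints a short welcome message.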
+Follow this link and download the Docker installer for +Windows.
+Inside your /Downloads
directory, search for the Installer and double-click.
To run on Windows, Docker requires a virtual machine. Docker recommends using WSL2. More information on this can be found here.
+ +Click “Ok” and wait a bit.
+ +Now you will have to restart your computer.
After restarting, Docker should start automatically and the Service Agreement will come up, which you will have to accept in order to use Docker:
+ +If the Docker desktop app is showing this warning upon start, do not click “Restart”, yet. Instead, follow the link and +install the kernel update.
+ +The link should point you to an address with a separate download link.
+ +Start and finish the installation for WSL.
+ +If you still have the Docker Desktop dialog window open in the background, click on Restart. Otherwise, just restart +your computer as you normally do.
+ +If Docker Desktop did not start on its own, simply open it from the shortcut on your Desktop. You can do the initial +orientation by clicking "Start".
+ +After this, your Docker Desktop screen should look like this:
+ +Now, whenever you want to pull images make sure that Docker is running in the background.
+ + + + + + +This is a fork of the infamous Manchester Family History Advanced OWL Tutorial version 1.1, located at
+http://owl.cs.manchester.ac.uk/publications/talks-and-tutorials/fhkbtutorial/
+The translation to markdown is not without issue, but we are making a start to making the tutorial a bit more accessible. +This reproduction is done with kind permission by Robert Stevens.
+Authors: Robert Stevens, Margaret Stevens, Nicolas Matentzoglu and Simon Jupp
+Bio-Health Informatics Group
+School of Computer Science
+University of Manchester
+Oxford Road
+Manchester
+United Kingdom
+M13 9PL
+robert.stevens@manchester.ac.uk
+
The University of Manchester
+Copyright© The University of Manchester
+November 25, 2015
+
This tutorial was realised as part of the Semantic Web Authoring Tool (SWAT) project (see http://www.swatproject.org), +which is supported by the UK Engineering and Physical Sciences Research +Council (EPSRC) grant EP/G032459/1, to the University of Manchester, the University of Sussex and +the Open University.
+The Stevens family—all my ancestors were necessary for this to happen. Also, for my Mum who gathered +all the information.
+2. Adding some Individuals to the FHKB
+6. Individuals in Class Expressions
+7. Data Properties in the FHKB
+The ‘Manchester Family History Advanced OWL Tutorial’ by Robert Stevens, Margaret Stevens, Nicolas +Matentzoglu, Simon Jupp is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported +License.
This manual will almost certainly contain errors, defects and infelicities. Do report them to robert.stevens@manchester.ac.uk; supplying the chapter, the section and some actual context in the form of words will help in fixing any of these issues.
+As well as the author list, many people have contributed to this work. Any contribution, such as reporting bugs etc., is rewarded by an acknowledgement of contribution (in alphabetical order) when the authors get around to adding them:
+This tutorial introduces the tutee to many of the more advanced features of the Web Ontology Language (OWL). The topic of family history is used to take the tutee through various modelling issues and, in doing so, using many features of OWL 2 to build a Family History Knowledge Base (FHKB). The exercises are designed to maximise inference about family history through the use of an automated reasoner on an OWL knowledge base (KB) containing many members of the Stevens family.
+The aim, therefore, is to enable people to learn advanced features of OWL 2 in a setting that involves both classes and individuals, while attempting to maximise the use of inference within the FHKB.
+By doing this tutorial, a tutee should be able to:
+Building an FHKB enables us to meet our learning outcomes through a topic that is accessible to virtually everyone. Family history or genealogy is a good topic for a general tutorial on OWL 2 as it enables us to touch many features of the language and, importantly, it is a field that everyone knows. All people have a family and therefore a family history – even if they do not know their particular family history. A small caveat was put on the topic being accessible to everyone as some cultures differ, for instance, in the description of cousins and labels given to different siblings. Nevertheless, family history remains a topic that everyone can talk about.
+Family history is a good topic for an OWL ontology as it obviously involves both individuals – the people involved – and classes of individuals – people, men and women, cousins, etc. Also, it is an area rich in inference; from only knowing parentage and sex of an individual, it is possible to work out all family relationships – for example, sharing parents implies a sibling relationship; one’s parent’s brothers are one’s uncles; one’s parent’s parents are one’s grandparents. So, we should be able to construct an ontology that allows us to both express family history, but also to infer family relationships between people from knowing relatively little about them.
+As we will learn through the tutorial, OWL 2 cannot actually do all that is needed to create a FHKB. This is unfortunate, but we use it to our advantage to illustrate some of the limitations of OWL 2. We know that rule based systems can do family history with ease, but that is not the point here; we are not advocating OWL DL as an appropriate mechanism for doing family history, but we do use it as a good educational example.
+We make the following assumptions about what people know:
+We make some simplifying assumptions in this tutorial:
+At the end of the tutorial, you should be able to produce a property hierarchy and a TBox or class hierarchy such as shown in Figure 1.1; all supported by use of the automated reasoner and a lot of OWL 2’s features.
+ +Figure 1.1: A part of the class and property hierarchy of the final FHKB.
+Here are some tips on using this manual to the best advantage:
+The following resources are available at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial:
1 The image comes from http://ancienthomeofdragon.homestead.com/, May 2012.
+In this chapter we will start by creating a fresh OWL ontology and adding some individuals that will be +surrogates for people in the FHKB. In particular you will:
+The ‘world’2 or field of interest we model in an ontology is made up of objects or individuals. Such objects include, but are not limited to:
+2 we use ‘world’ as a synonym of ‘field of interest’ or ‘domain’. ‘World’ does not restrict us to modelling the physical world outside our consciousness.
+We observe these objects, either outside lying around in the world or in our heads. OWL is all about modelling such individuals. Whenever we make a statement in OWL, when we write down an axiom, we are making statements about individuals. When thinking about the axioms in an ontology it is best to think about the individuals involved, even if OWL individuals do not actually appear in the ontology. All through this tutorial we will always be returning to the individuals being described in order to help us understand what we are doing and to help us make decisions about how to do it.
+Biologically, everyone has parents; a mother and a father3. The starting point for family history is parentage; we need to relate the family member objects by object properties. An object property relates two objects, in this case a child object with his or her mother or father object. To do this we need to create three object properties:
+Task 1: Creating object properties for parentage | +
---|
|
+
Note how the reasoner has automatically completed the sub-hierarchy for isParentOf:
isMotherOf
and isFatherOf
are inferred to be sub-properties of isParentOf
.
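As a rough sketch of what Task 1 produces (shown in Manchester syntax; only the asserted axioms are listed), the parentage properties look something like:

ObjectProperty: hasParent
InverseOf: isParentOf

ObjectProperty: hasMother
SubPropertyOf: hasParent
InverseOf: isMotherOf

ObjectProperty: hasFather
SubPropertyOf: hasParent
InverseOf: isFatherOf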
The OWL snippet below shows some parentage fact assertions on an individual. Note that rather than being assertions to an anonymous individual via some class, we are giving an assertion to a named individual.
+Individual: grant_plinth
+Facts: hasFather mr_plinth, hasMother mrs_plinth
+
3 Don’t quibble; it’s true enough here.
+Task 2: Create the ABox | +
---|
|
+
While asserting facts about all individuals in the FHKB will be a bit tedious at
+times, it might be useful to at least do the task for a subset of the family members.
+For the impatient reader, there is a convenience snapshot of the ontology including
+the raw individuals available at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
+
If you are working with Protégé, you may want to look at the Matrix plugin for
+Protégé at this point. The plugin allows you to add individuals quickly in the
+form of a regular table, and can significantly reduce the effort of adding any type
+of entity to the ontology. In order to install the matrix plugin, open Protégé and
+go to File » Check for plugins. Select the ‘Matrix Views’ plugin. Click install,
+wait until the installation is confirmed, close and re-open Protégé; go to the
+‘Window’ menu item, select ‘Tabs’ and add the ‘Individuals matrix’.
+
Now do the following:
+Task 3: DL queries | +
---|
|
+
You should find the following:
+Since we have said that isFatherOf
has an inverse of hasFather
, and we have asserted that Robert_David_Bright_1965 hasFather David_Bright_1934
, we have a simple entailment that David_Bright_1934 isFatherOf Robert_David_Bright_1965
. So, without asserting the isFatherOf
facts, we have been able to ask and get answers for that DL query.
As we asserted that Robert_David_Bright_1965 hasFather David_Bright_1934
, we also infer that he hasParent
David_Bright_1934
; this is because hasParent
is the super-property of hasFather
and the sub-property implies the super-property. This works all the way up the property tree until topObjectProperty
, so all individuals are related by topObjectProperty
—this is always true. This implication ‘upwards’ is the way to interpret how the property hierarchies work.
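As a concrete sketch (assuming the parentage facts above have been asserted), typing the following class expression into the DL Query tab with the reasoner running returns David_Bright_1934, even though no isFatherOf fact was asserted by hand; the answer comes purely from the asserted hasFather fact and the inverse axiom:

isFatherOf value Robert_David_Bright_1965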
We have now covered the basics of dealing with individuals in OWL ontologies. We have set up some properties, as yet without domains, ranges or appropriate characteristics, and arranged them in a hierarchy. From only a few assertions in our FHKB, we can already infer many facts about an individual: simple exploitation of inverses of properties and super-properties of the asserted properties.
+We have also encountered some important principles:
+hasFather
implies the hasParent
fact between individuals. This entailment of the super-property is very important and will drive much of the inference we do with the FHKB.The FHKB ontology at this stage of the tutorial has an expressivity of ALHI.
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.026 sec (0.00001 % of final), by Pellet
+2.2.0 0.144 sec (0.00116 % of final) and by FaCT++ 1.6.4 is approximately 0 sec (0.000 % of final). 0 sec indicates failure or timeout.
+
In this Chapter you will:
+Find a snapshot of the ontology at this stage at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial.
+
The FHKB has parents established between individuals and we know that all people have two parents. A parent is an ancestor of its children; a person’s parent’s parents are its ancestors; and so on. So, in our FHKB, Robert’s ancestors are David, Margaret, William, Iris, Charles, Violet, James, another Violet, another William, Sarah and so on. If my parent’s parents are my ancestors, then what we need is a transitive version of the hasParent
property. Obviously we do not want hasParent
to be transitive, as Robert’s grandparents (and so on) would become his parents (and that would be wrong).
We can easily achieve what is necessary. We need a hasAncestor
property that has a transitive characteristic. The trick is to make this a super-property of the hasParent
property. As explained before, a sub-property implies its super-property. So, if individual x holds a hasParent
property with an individual y , then it also holds an instance of its super-property hasAncestor
with the individual y. If individual y then holds a hasParent
property with another individual z , then there is also, by implication, a hasAncestor
property between y and z. As hasAncestor
is transitive, x and z also hold a hasAncestor
relationship between them.
The inverse of hasAncestor
can either be isAncestorOf
or hasDescendant
. We choose the isAncestorOf
option.
Task 4: Object properties: exploiting the semantics | +
---|
|
+
The hasAncestor
object property will look like this:
ObjectProperty: hasAncestor
+SubPropertyOf: hasRelation
+SuperPropertyOf: hasParent
+Characteristics: Transitive
+InverseOf: isAncestorOf
+
As usual, it is best to think of the objects or individuals involved in the relationships. Consider the three individuals – Robert, David and William. Each has a hasFather
property, linking Robert to David and then David to William. As hasFather
implies its super-property hasParent
, Robert also has a hasParent
property with David, and David has a hasParent
relation to William. Similarly, as hasParent
implies hasAncestor
, the Robert object has a hasAncestor
relation to the David object and the David object has one to the William object. As hasAncestor
is transitive, Robert not only holds this property to the David object, but also to the William object (and so on back through Robert’s ancestors).
We also want to use a sort of restricted transitivity in order to infer grandparents, great grandparents and so on. My grandparents are my parent’s parents; my grandfathers are my parent’s fathers. My great grandparents are my parent’s parent’s parents. My great grandmothers are my parent’s parent’s mothers. This is sort of like transitivity, but we want to make the paths only a certain length and, in the case of grandfathers, we want to move along two relationships – hasParent
and then hasFather
.
We can do this with OWL 2’s sub-property chains. The way to think about sub-property chains is: If we +see property x followed by property y linking three objects, then it implies that property z is held between
+ +Figure 3.1: Three blobs representing objects of the classPerson. The three objects are linked by a hasParent
+property and this implies a hasGrandparent
property.
the first and third objects. Figure 3.1 shows this diagrammatically for the hasGrandfather
property.
For various grandparent object properties we need the following sets of implications:
+Notice that we can trace the paths in several ways, some have more steps than others, though the shorter
+paths themselves employ paths. Tracing these paths is what OWL 2’s sub-property chains achieve. For
+the new object property hasGrandparent
we write:
ObjectProperty: hasGrandparent SubPropertyChain: hasParent o hasParent
+
We read this as ‘hasParent
followed by hasParent
implies hasGrandparent
’. We also need to think where the hasGrandparent
property fits in our growing hierarchy of object properties. Think about the implications: Does holding a hasParent
property between two objects imply that they also hold a hasGrandparent
property? Of course the answer is ‘no’. So, this new property is not a super-property of hasParent
. Does the holding of a hasGrandparent
property between two objects imply that they also hold an hasAncestor
property? The answer is ‘yes’; so that should be a super-property of hasGrandparent
. We need to ask such questions of our existing properties to work out where we put it in the object property hierarchy. At the moment, our hasGrandparent
property will look like this:
ObjectProperty: hasGrandParent
+SubPropertyOf: hasAncestor
+SubPropertyChain: hasParent o hasParent
+SuperPropertyOf: hasGrandmother, hasGrandfather
+InverseOf: isGrandParentOf
+
Do the following task:
+Task 5: Grandparents object properties | +
---|
|
+
Again, think of the objects involved. We can take the same three objects as before: Robert, David and William. Think about the properties that exist, both by assertion and implication, between these objects. We have asserted only hasFather
between these objects. The inverse can be inferred between the actual individuals (remember that this is not the case for class level restrictions – that all instances of a class hold a property does not mean that the filler objects at the other end hold the inverse; the quantification on the restriction tells us this). Remember that:
hasFather
property with David;hasFather
property with William;hasParent
super-property of hasFather
, Robert holds a hasParent
property with David, and the latter holds one with William;hasGrandfather
then implies that Robert holds a hasGrandfather
property to William. Use the diagram in figure 3.1 to trace the path; there is a hasParent
path from Robert to William via David and this implies the hasGrandfather
property between Robert and William.It is also useful to point out that the inverse of hasGrandfather
also has the implication of the sub-property chain of the inverses of hasParent
. That is, three objects linked by a path of two isParentOf
properties implies that an isGrandfatherOf
property is established between the first and third object, in
+this case William and Robert. As the inverses of hasFather
are established by the reasoner, all the inverse implications also hold.
It is important when dealing with property hierarchies to think in terms of properties between objects and of the implications ‘up the hierarchy’. A sub-property implies its super-property. So, in our FHKB, two person objects holding a hasParent
property between them, by implication also hold an hasAncestor
+property between them. In turn, hasAncestor
has a super-property hasRelation
and the two objects in
+question also hold, by implication, this property between them as well.
We made hasAncestor
transitive. This means that my ancestor’s ancestors are also my ancestors. That a sub-property is transitive does not imply that its super-property is transitive. We have seen that by manipulating the property hierarchy we can generate a lot of inferences without adding any more facts to the individuals in the FHKB. This will be a feature of the whole process – keep the work to the minimum (well, almost).
In OWL 2, we can also trace ‘paths’ around objects. Again, think of the objects involved in the path of properties that link objects together. We have done simple paths so far – Robert linked to David via hasParent
and David linked to William via hasFather
implies the link between Robert and William of hasGrandfather
. If this is true for all cases (for which you have to use your domain knowledge), one can capture this implication in the property hierarchy. Again, we are making our work easier by adding no new explicit facts, but making use of the implication that the reasoner works out for us.
The FHKB ontology at this stage of the tutorial has an expressivity of ALRI+.
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.262 sec (0.00014 % of final), by Pellet
+2.2.0 0.030 sec (0.00024 % of final) and by FaCT++ 1.6.4 is approximately 0.004
+sec (0.000 % of final). 0 sec indicates failure or timeout.
+
In this Chapter you will:
+Person
class;Sex
classes;Man
and Woman
;These simple classes will form the structure for the whole FHKB.
For the FHKB, we start by thinking about the objects involved: the people that make up the family history. There is a class of Person that we will use to represent all of these people objects.
Task 6: Create the Person class |
+
---|
|
+
We use DomainEntity
as a house-keeping measure. All of our ontology goes underneath this class. We can put other classes ‘outside’ the ontology, as siblings of DomainEntity
, such as ‘probe’ classes we wish to use to test our ontology.
The main thing to remember about the Person
class is that we are using it to represent all ‘people’ individuals. When we make statements about the Person
class, we are making statements about all ‘people’ individuals.
What do we know about people? All members of the Person class have:
a sex (and, in the simple view taken here, exactly one sex);
a mother and a father (one of each).
There’s a lot more we know about people, but we will not mention it here.
+Each and every person object has a sex. In the FHKB we will take a simple view on sex – a person is either male or female, with no intersex or administrative sex and so on. Each person only has one sex.
+We have two straight-forward options for modelling sex:
+We will take the approach of having a class of Maleness objects and a class of Femaleness objects. These are qualities or attributes of self-standing objects such as a person. These two classes are disjoint, and each is a subclass of a class called Sex
. The disjointness means that any one instance of Sex
cannot be both an instance of Maleness
and an instance of Femaleness
at once. We also want to put in a covering axiom on the class Sex
, which means that any instance of Sex
must be either Maleness
or Femaleness
; there is no other kind of Sex
.
Again, notice that we have been thinking at the level of objects. We do the same when thinking about Person
and their Sex
. Each and every person is related to an instance of Sex
. Each Person
holds one relationship to a Sex
object. To do this we create an object property called hasSex
. We make this property functional, which means that any object can hold that property to only one distinct filler object.
We make the domain of hasSex
to be Person
and the range to be Sex
. The domain of Person
means that any object holding that property will be inferred to be a member of the class Person
. Putting the range of Sex
on the hasSex
property means that any object at the right-hand end of the hasSex
property will be inferred to be of the class Sex
. Again, think at the level of individuals or objects.
We now put a restriction on the Person
class to state that each and every instance of the class Person
holds a hasSex
property with an instance of the Sex
class. It has an existential operator ‘some’ in the axiom, but the functional characteristic means that each Person
object will hold only one hasSex
property to a distinct instance of a Sex
object4.
4 An individual could hold two hasSex
properties, as long as the sex objects at the right-hand end of the property are not
+different.
Task 7: Modelling sex | +
---|
|
+
The hasSex
property looks like:
ObjectProperty: hasSex
+Characteristics: Functional
+Domain: Person
+Range: Sex
+
The Person
class looks like:
Class: Person
+SubClassOf: DomainEntity,(hasSex some Sex)
+DisjointWith: Sex
+
We now have some of the foundations for the FHKB. We have the concept of Person
, but we also need to have the concepts of Man
and Woman
. Now we have Person
, together with Maleness
and Femaleness
, we have the necessary components to define Man
and Woman
. These two classes can be defined as: Any Person
object that has a male sex can be recognised to be a man; any Person
object that has a female sex can be recognised as a member of the class woman. Again, think about what conditions are sufficient for an object to be recognised to be a member of a class; this is how we create defined classes through the use of OWL equivalence axioms.
To make the Man
and Woman
classes do the following:
Task 8: Describe men and women | +
---|
|
+
Having run the reasoner, the Man
and Woman
classes should appear underneath Person
5.
5 Actually, in Protégé this might happen without the need to run the reasoner.
+The Man
and Woman
classes will be important for use as domain and range constraints on many of the properties used in the FHKB. To achieve our aim of maximising inference, we should be able to infer that individuals are members of Man
, Woman
or Person
by the properties held by an object. We should not have to state the type of an individual in the FHKB.
The classes for Man
and Woman
should look like:
Class: Man
+EquivalentTo: Person and (hasSex some Maleness)
+
Class: Woman
+EquivalentTo: Person and (hasSex some Femaleness)
+
To finish off the foundations of the FHKB we need to describe a person object’s parentage. We know that each and every person has one mother and each and every person has one father. Here we are talking about biological mothers and fathers. The complexities of adoption and step parents are outside the scope of this FHKB tutorial.
+Task 9: Describing Parentage | +
---|
|
+
The (inferred) property hierarchy in the FHKB should look like that shown in Figure 4.1. Notice that we have asserted the sub-property axioms on one side of the property hierarchy. Having done so, the reasoner uses those axioms, together with the inverses, to work out the property hierarchy for the ‘other side’.
+We make hasMother
functional, as any one person object can hold only one hasMother
property to a distinct Woman
object. The range of hasMother
is Woman
, as a mother has to be a woman. The Person
object holding the hasMother
property can be either a man or a woman, so we have the domain constraint as Person
; this means any object holding a hasMother
property will be inferred to be a Person
. Similarly, any object at the right-hand end of a hasMother
property will be inferred to be a Woman
, which is the result we need. The same reasoning goes for hasFather
and hasParent
, with the sex constraints on the latter being only Person
. The inverses of the two functional sub-properties of hasParent
are not themselves functional. After all, a Woman
can be the mother of many Person
objects, but each Person
object can have only one mother.
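Putting the pieces just described together, a sketch of the hasMother property (hasFather follows the same pattern, with a range of Man) looks something like:

ObjectProperty: hasMother
SubPropertyOf: hasParent
Characteristics: Functional
Domain: Person
Range: Woman
InverseOf: isMotherOf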
Figure 4.1: The property hierarchy with the hasSex
and the parentage properties
Figure 4.2: the core TBox for the FHKB with the Person
and Sex
classes.
Task 10: Restrict Person class | +
---|
|
+
Class: Person
+SubClassOf: DomainEntity, (hasFather some Man), (hasMother some Woman),
+(hasSex some Sex)
+DisjointWith: Sex
+
Task 11: DL queries for people and sex | +
---|
|
+
The domain and range constraints on our properties have also driven some entailments. We have not asserted that David_Bright_1934
is a member of Man
, but the range constraint on hasFather
(or the inferred domain constraint on the isFatherOf
relation) has enabled this inference to be made. This goes for any individual that is the right-hand-side (either inferred or asserted) of either hasFather
or hasMother
(where the range is that of Woman
). For Robert David Bright, however, he is only the left-hand-side of an hasFather
or an hasMother
property, so we’ve only entailed that this individual is a member of Person
.
In our description of the Person
class we have said that each and every instance of the class Person
has a father (the same goes for mothers). So, when we ask the query ‘which individuals have a father’, we get all the instances of Person
back, even though we have said nothing about the specific parentage of each Person
. We do not know who their mothers and fathers are, but we know that they have one of each. We know all the individuals so far entered are members of the Person
class; when asserting the type to be either Man
or Woman
(each of which is a subclass of Person
), we infer that each is a person. When asserting the type of each individual via the hasSex
property, we know each is a Person
, as the domain of hasSex
is the Person
class. As we have also given the right-hand side of hasSex
as either Maleness
or Femaleness
, we have given sufficient information to recognise each of these Person
instances to be members of either Man
or Woman
.
So far we have not systematically added domains and ranges to the properties in the FHKB. As a reminder, when a property has a domain of X
any object holding that property will be inferred to be a member of class X
. A domain doesn’t add a constraint that only members of class X
hold that property; it is a strong implication of class membership. Similarly, a property holding a range implies that an object acting as right-hand-side to a property will be inferred to be of that class. We have already seen above that we can use domains and ranges to imply the sex of people within the FHKB.
Do the following:
+Task 12: Domains and Ranges | +
---|
|
+
+
Protégé for example in its current version (November 2015) does not visualise
+inherited domains and ranges in the same way as it shows inferred inverse relations.
+
We typically assert more domains and ranges than strictly necessary. For example, if we say that hasParent
has the domain Person
, this means that every object x
that is connected to another object y
via the hasParent
relation must be a Person
. Let us assume the only thing we said about x
and y
is that they are connected by a hasMother
relation. Since this implies that x
and y
are also connected by a hasParent
relation (hasMother
is a sub-property of hasParent
) we do not have to assert that hasMother
has the domain of Person
; it is implied by what we know about the domain and range of hasParent
.
In order to remove as many assertions as possible, we may therefore choose to assert as much as we know starting from the top of the hierarchy, and only ever adding a domain if we want to constrain the already inferred domain even further (or range respectively). For example, in our case, we could have chosen to assert Person
to be the domain of hasRelation
. Since hasRelation
is symmetric, it will also infer Person
to be the range. We do not need to say anything for hasAncestor
or hasParent
, and only if we want to constrain the domain or range further (like in the case of hasFather
by making the range Man
) do we need to actually assert something. It is worth noting that because we have built the object property hierarchy from the bottom (hasMother
etc.) we have ended up asserting more than necessary.
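As a sketch of this more economical style (only the frames relevant to domains and ranges are shown), asserting at the top of the hierarchy and constraining further only where needed might look like:

ObjectProperty: hasRelation
Characteristics: Symmetric
Domain: Person

ObjectProperty: hasFather
Range: Man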
From the Pizza Tutorial and other work with OWL you should have seen some unsatisfiabilities. In Protégé this is highlighted by classes going ‘red’ and being subclasses ofNothing; that is, they can have no instances in that model.
+Task 13: Inconsistencies | +
---|
|
+
After asserting the first fact, it should be reported by the reasoner that the ontology is inconsistent. This means, in lay terms, that the model you’ve provided in the ontology cannot accommodate the facts you’ve provided in the fact assertions in your ABox—that is, there is an inconsistency between the facts and the ontology. The ontology is inconsistent because David_Bright_1934
is being inferred to be a Man
and a Woman
at the same time which is inconsistent with what we have said in the FHKB.
When we, however, say that Robert David Bright
has two different mothers, nothing bad happens! Our domain knowledge says that the two women are different, but the reasoner does not know this as yet; Iris Ellen Archer and Margaret Grace Rever may, as far as the reasoner knows, be the same person; we have to tell the reasoner that they are different. For the same reason, the functional characteristic also has no effect until the reasoner ‘knows’ that the individuals are different. We will do this in Section 7.1.1 and live with this ‘fault’ for the moment.
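For reference, the kind of axiom needed is a different-individuals assertion; a minimal sketch in Manchester syntax (the individual names here are illustrative, and the FHKB individuals also carry birth years in their names) is:

DifferentIndividuals: Iris_Ellen_Archer, Margaret_Grace_Rever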
Task 14: Adding defined classes | +
---|
|
+
The code for the classes looks like:
+Class: Ancestor EquivalentTo: Person and isAncestorOf some Person
+Class: FemaleAncestor EquivalentTo: Woman and isAncestorOf some Person
+Class: Descendant EquivalentTo: Person and hasAncestor some Person
+Class: MaleDescendant EquivalentTo: Man and hasAncestor some Person
+
The TBox after reasoning can be seen in Figure 4.3. Notice that the reasoner has inferred that several of the classes are equivalent or ‘the same’. These are: Descendant
and Person
; MaleDescendant
and Man
, FemaleDescendant
and Woman
.
The reasoner has used the axioms within the ontology to infer that all the instances of Person
are also instances of the class Descendant
and that all the instances of Woman
are also the same instances as the class Female Descendant
. This is intuitively true; all people are descendants – they all have parents that have parents etc. and thus everyone is a descendant. All women are female people that have parents etc. As usual we should think about the objects within the classes and what we know about them. This time it is useful to think about the statements we have made about Person
in this Chapter – that all instances of Person
have a father and a mother; add to this the information from the property hierarchy and we know that all instances of Person
have parents and ancestors. We have repeated all of this in our new defined classes for Ancestor
and Descendant
and the reasoner has highlighted this information.
Figure 4.3: The defined classes from Section 4.8 in the FHKB’s growing class hierarchy
+Task 15: More Ancestors | +
---|
|
+
Most of what we have done in this chapter is straight-forward OWL, all of which would have been met in the pizza tutorial. It is, however, a useful revision and it sets the stage for refining the FHKB. Figure 4.2 shows the basic set-up we have in the FHKB in terms of classes; we have a class to represent person, man and woman, all set-up with a description of sex, maleness and femaleness. It is important to note, however, the approach we have taken: We have always thought in terms of the objects we are modelling.
+Here are some things that should now be understood upon completing this chapter:
+Person
have a mother, so any individual asserted to be a Person
must have a mother. We do not necessarily know who they are, but we know they have one.Person
, not that he is a Man
. This is because, so far, he only has the domain constraint of hasMother
and hasFather
to help out.Finally, we looked at some defined classes. We inferred equivalence between some classes where the extents of the classes were inferred to be the same – in this case the extents of Person
and Descendant
are the same. That is, all the objects that can appear in Person
will also be members of Descendant
. We can check this implication intuitively – all people are descendants of someone. Perhaps not the most profound inference of all time, but we did no real work to place this observation in the FHKB.
This last point is a good general observation. We can make the reasoner do work
+for us. The less maintenance we have to do in the FHKB the better. This will be
+a principle that works throughout the tutorial.
+
The FHKB ontology at this stage of the tutorial has an expressivity of SRIF.
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.884 sec (0.00047 % of final), by Pellet
+2.2.0 0.256 sec (0.00207 % of final) and by FaCT++ 1.6.4 is approximately 0.013
+sec (0.000 % of final). 0 sec indicates failure or timeout.
+
In this chapter you will:
+There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
Do the following first:
+Task 16: The bloodrelation object property | +
---|
|
+
Does a blood relation of Robert have the same relationship to Robert (symmetry)? Is a blood relation of Robert’s blood relation a blood relation of Robert (transitivity)? Think of an aunt by marriage; her children are my cousins and blood relations via my uncle, but my aunt is not my blood relation. My siblings share parents; male siblings are brothers and female siblings are sisters. So far we have asserted parentage facts for the Person
objects in our ABox. Remember that our parentage properties have inverses, so if we have added a hasFather
property between a Person
and a Man
, we infer the isFatherOf
property between that Man
and that Person
.
We should have enough information within the FHKB to infer siblings. We could use a sub-property chain such as:
+ObjectProperty: hasSibling
+SubPropertyOf: hasBloodrelation
+Characteristics: Symmetric, transitive
+SubPropertyChain: hasParent o isParentOf
+
We make a property of hasSibling
and make it a sub-property of hasBloodrelation
. Remember, think of the objects involved and the implications we want to follow; being a sibling implies being a blood relation, it does not imply any of the other relationships we have in the FHKB.
Note that we have made hasSibling
symmetric; if Robert is sibling of Richard, then Richard is sibling of Robert. We should also think about transitivity; if David is sibling of Peter and Peter is sibling of John, then David is sibling of John. So, we make hasSibling
symmetric and transitive (see Figure 5.1). However, we must take care of half-siblings: child 1 and child 2 share a mother, but not a father; child 2 and child 3 share the father, but not the mother – child 1 and child 3 are not even half-siblings. However, at least for the moment, we will simply ignore this inconvenience, largely so that we can explore what happens with different modelling options.
Figure 5.1: Showing the symmetry and transitivity of the hasSibling
(siblingof) property by looking at the brothers David, John and Peter
We also have the implication using three objects (see Figure 5.2):
+hasParent
property with David;isFatherOf
property with Richard;hasSibling
property with Richard;hasSibling
is symmetric, Richard holds an hasSibling
property with Robert.Figure 5.2: Tracing out the sub-property chain for hasSibling
; note that Robert is a sibling of himself by this
+path
Do the following tasks:
+Task 17: Siblings | +
---|
|
+
From this last DL query you should get the answer that both Robert and Richard are siblings of Robert. Think about the objects involved in the sub-property chain: we go from Robert to David via the hasParent
and from David to Richard via the isParentOf
property; so this is OK. However, we also go from Robert to David and then we can go from David back to Robert again – so Robert is a sibling of Robert. We do not want this to be true.
We can add another characteristic to the hasSibling
property, the one of being irreflexive
. This means that an object cannot hold the property with itself.
Task 18: More siblings | +
---|
|
+
Note that the reasoner claims you have an inconsistent ontology (or in some cases, you might get a message box saying "Reasoner died"). Looking at the hasSibling
property again, the reason might not be immediately obvious. The reason for the inconsistency lies in the fact that we create a logical contradiction: through the property chain, we say that every Person
is a sibling of him or herself, and then disallow exactly that by adding the irreflexive characteristic. A different explanation lies within the OWL specification itself: In order to maintain decidability, irreflexive properties must be simple - for example, they may not be defined by property chains6.
6 http://www.w3.org/TR/owl2-syntax/#The_Restrictions_on_the_Axiom_Closure
+We have only done siblings, but we obviously need to account for brothers and sisters. In an analogous way to motherhood, fatherhood and parenthood, we can talk about sex specific sibling relationships implying the sex neutral hasSibling
; holding either a hasBrother
or an isSisterOf
between two objects would imply that a hasSibling
property is also held between those two objects. This means that we can place these two sex specific sibling properties below hasSibling
with ease. Note, however, that unlike the hasSibling
property, the brother and sister properties are not symmetric. Robert hasBrother
Richard and vice versa , but if Daisy hasBrother
William, we do not want William to hold an hasBrother
property with Daisy. Instead, we create an inverse of hasBrother
, isBrotherOf
, and then do the same for isSisterOf
.
We use similar, object based, thought processes to choose whether to have transitivity as a characteristic of hasBrother
. Think of some sibling objects or individuals and place hasBrother
properties between them. Make it transitive and see if you get the right answers. Put in a sister too and see if it still works. If David hasBrother
Peter and Peter hasBrother
John, then David hasBrother
John; so, transitivity works in this case. Think of another example. Daisy hasBrother
Frederick, and Frederick hasBrother
William, thus Daisy hasBrother
William. The inverses work in the same way; William isBrotherOf
Frederick and Frederick isBrotherOf
Daisy; thus William isBrotherOf
Daisy. All this seems reasonable.
Task 19: Brothers and sisters | +
---|
|
+
ObjectProperty: hasBrother
+SubPropertyOf: hasSibling
+Characteristics: Transitive
+InverseOf: isBrotherOf
+Range: Man
+
We have some hasSibling
properties (even if they are wrong). We also know the sex of many of the people in the FHKB through the domains and ranges of properties such as hasFather
, hasMother
and their inverses.
Can we use sub-property chains in the same way as we have used them in the hasSibling
property? The issue is that of sex; the property isFatherOf
is sex neutral at the child end, as is the inverse hasFather
(the same obviously goes for the mother properties). We could use a sub-property chain of the form:
ObjectProperty: hasBrother
+SubPropertyChain: hasParent o hasSon
+
A son is a male child and thus that object is a brother of his siblings. At the moment we do not have son or daughter properties. We can construct a property hierarchy as shown in Figure 5.3. This is made up from the following properties:
+hasChild
and isChildOf
hasSon
(range Man
and domain Person
) and isSonOf
;hasDaughter
(range Woman
domain Person
) and isDaughterOf
Note that hasChild
is the equivalent of the existing property isParentOf
; if I have a child, then I am its
+parent. OWL 2 can accommodate this fact. We can add an equivalent property axiom in the following
+way:
ObjectProperty: isChildOf
+EquivalentTo: hasParent
+
We have no way of inferring the isSonOf
and isDaughterOf
from what already exists. What we want
+to happen is the implication of ‘Man
and hasParent
Person
implies isSonOf
’. OWL 2 and its reasoners
+cannot do this implication. It has been called the ‘man man problem’7. Solutions for this have been
+developed [3], but are not part of OWL 2 and its reasoners.
Figure 5.3: The property hierarchy for isChildOf
and associated son/daughter properties
7 http://lists.w3.org/Archives/Public/public-owl-dev/2007JulSep/0177.html
+Child | +property | +Parent | +
---|---|---|
Robert David Bright 1965 | +isSonOf | +David Bright 1934, Margaret Grace Rever 1934 | +
Richard John Bright 1962 | +isSonOf | +David Bright 1934, Margaret Grace Rever 1934 | +
Mark Bright 1956 | +isSonOf | +John Bright 1930, Joyce Gosport | +
Ian Bright 1959 | +isSonOf | +John Bright 1930, Joyce Gosport | +
Janet Bright 1964 | +isDaughterOf | +John Bright 1930, Joyce Gosport | +
William Bright 1970 | +isSonOf | +John Bright 1930, Joyce Gosport | +
Table 5.1: Child property assertions for the FHKB
+Thus we must resort to hand assertions of properties to test out our new path:
+Task 20: Sons and daughters | +
---|
|
+
Of course, it works, but we see the same problem as above. As usual, think of the objects involved.
+Robert isSonOf
David and David isParentOf
Robert, so Robert is his own brother. Irreflexivity again
+causes problems as it does above (Task 18).
Our option one has lots of problems. So, option two is to assert the various levels of sibling relationship. We can take the same basic structure of sibling properties as before, but just fiddle around a bit and rely on more assertion while still trying to infer as much as possible. We will take the following approach:
+Person | +Property | +Person | +
---|---|---|
Robert David Bright 1965 | +isBrotherOf | +Richard John Bright 1962 | +
David Bright 1934 | +isBrotherOf | +John Bright 1930 | +
David Bright 1934 | +isBrotherOf | +Peter William Bright 1941 | +
Janet Bright 1964 | +isSisterOf | +Mark Bright 1956 | +
Janet Bright 1964 | +isSisterOf | +Ian Bright 1959 | +
Janet Bright 1964 | +isSisterOf | +William Bright 1970 | +
Mark Bright 1956 | +isBrotherOf | +Ian Bright 1959 | +
Mark Bright 1956 | +isBrotherOf | +Janet Bright 1964 | +
Mark Bright 1956 | +isBrotherOf | +William Bright 1970 | +
Table 5.2: The sibling relationships to add to the FHKB.
+Do the following:
+Task 21: Add sibling assertions | +
---|
|
+
We can see some problems with this option as well:
+hasBrother
property to Robert. We would really like an isBrotherOf
to Robert to hold.Man
and hasSibling value Robert
only retrieves Robert himself. Because we only asserted that Robert is a brother of Richard, and the domain of isBrotherOf
is Man
we know that Robert is a Man
, but we do not know anything about the Sex
of Richard.Which of the two options gives the worse answers and which is the least effort? Option one is obviously the least effort; we only have to assert the same parentage facts as we already have; then the sub-property chains do the rest. It works OK for hasSibling
, but we cannot do brothers and sisters adequately; we need Man
and hasSibling
⊐ isBrotherOf
and we cannot do that implication. This means we cannot ask the questions we need to ask.
So, we do option two, even though it is hard work and is still not perfect for query answering, even though we have gone for a sparse assertion mode. Doing full sibling assertion would work, but is a lot of effort.
We could start again and use the isSonOf and isDaughterOf
option, with the sub-property chains described above. This still has the problem of everyone being their own sibling. It can get the sex specific sibling relationships, but requires a wholesale re-assertion of parentage facts. We will continue with option two, largely because it highlights some nice problems later on.
In Section 5.2 we briefly talked about half-siblings. So far, we have assumed full-siblings (or, rather, just talked about siblings and made no distinction). Ideally, we would like to accommodate distinctions between full- and half-siblings; here we use half-siblings, where only one parent is in common between two individuals, as the example. The short-answer is, unfortunately, that OWL 2 cannot deal with half-siblings in the way that we want - that is, such that we can infer properties between named individuals indicating full- or half-sibling relationships.
It is possible to find sets of half-brothers in the FHKB by writing a defined class or DL query for a particular individual. The following fragment of OWL defines a class that looks for the half-brothers of an individual called ‘Percival’:
+ +Class: HalfBrotherOfPercival
+EquivalentTo: Man and (((hasFather some (not (isFatherOf value Percival))) and
+(hasMother some (isMotherOf value Percival))) or ((hasFather some (isFatherOf
+value Percival)) and (hasMother some (not (isMotherOf value Percival)))))
+
Here we are asking for any man that either has Percival’s father but not his mother, or his mother, but not his father. This works fine, but is obviously not a general solution. The OWL description is quite complex and the writing will not scale as the number of options (hypothetically, as the number of parents increases... ) increases; it is fine for man/woman, but go any higher and it will become very tedious to write all the combinations.
+Another way of doing this half-brother class to find the set of half-brothers of a individual is to use cardinality constraints:
+Class: HalfBrotherOfPercival
+EquivalentTo: Man and (hasParent exactly 1 (isParentOf value Percival))
+
This is more succinct. We are asking for a man that has exactly one parent from the class of individuals that are the class of Percival’s parents. This works, but one more constraint has to be present in the FHKB. We need to make sure that there can be only two parents (or indeed, just a specified number of parents for a person). If we leave it open as to the number of parents a person has, the reasoner cannot work out that there is a man that shares exactly one parent, as there may be other parents. We added this constraint to the FHKB in Section 6.2; try out the classes to check that they work.
+These two solutions have been about finding sets of half-brothers for an individual. What we really want +in the FHKB is to find half-brothers between any given pair of individuals.
+Unfortunately we cannot, without rules, ask OWL 2 to distinguish full- and half-siblings – we cannot +count the number of routes taken between siblings via different distinct intermediate parent objects.
+An uncle is a brother of either my mother or father. An aunt is a sister of either my mother or father. In common practice, wives and husbands of aunts and uncles are usually uncles and aunts respectively. Formally, these aunts and uncles are aunts-in-law and uncles-in-law. Whatever approach we take, we cannot fully account for aunts and uncles until we have information about marriages, which will not have until Chapter 9. We will, however, do the first part now.
+Look at the objects and properties between them for the following facts:
+As we are tracing paths or ‘chains’ of objects and properties we should use sub-property chains as a solution for the aunts and uncles. We can make an hasUncle
property as follows (see Figure 5.4):
ObjectProperty: hasUncle
+SubPropertyOf: hasBloodrelation
+Domain: Man
+Range: Person
+SubPropertyChain: hasParent o hasBrother
+InverseOf: isUncleOf
+
Figure 5.4: Tracing out the path between objects to get the hasUncle
sub-property chain.
Notice we have the domain of Man
and range of Person
. We also have an inverse. As usual, we can read this as ‘an object that holds an hasParent
property, followed by an object holding a hasBrother
property, implies that the first object holds an hasUncle
property with the last object’.
Note also where the properties (including the ones for aunt) go in the object property hierarchy. Aunts and uncles are not ancestors that are in the direct blood line of a person, but they are blood relations (in the narrower definition that we are using). Thus the aunt and uncle properties go under the hasBloodrelation
property (see Figure 5.5). Again, think of the implications between objects holding a property between them; that two objects linked by a property implies that those two objects also hold all the property’s super-properties as well. As long as all the super-properties are true, the place in the object property hierarchy is correct (think about the implications going up, rather than down).
Figure 5.5: The object property hierarchy with the aunt and uncle properties included. On the right side, we +can see the hasUncle property as shown by Protégé.
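The aunt properties follow the same pattern as hasUncle; here is a sketch that simply mirrors the declaration above, so treat the exact axioms for hasAunt as an assumption:

ObjectProperty: hasAunt
 SubPropertyOf: hasBloodrelation
 Domain: Woman
 Range: Person
 SubPropertyChain: hasParent o hasSister
 InverseOf: isAuntOf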
+Do the following tasks:
+Task 22: Uncles and Aunts | +
---|
|
+
We can see this works – unless we have any gaps in the sibling relationships (you may have to fix these). Great aunts and uncles are simply a matter of adding another ‘parent’ leg into the sub-property chain. We are not really learning anything new with aunts and uncles, except that we keep gaining a lot for
+free through sub-property chains. We just add a new property with its sub-property chain and we get a whole lot more inferences on individuals. To see what we now know about Robert David Bright, do the following:
+Task 23: What do we know? | +
---|
|
+
You can now see lots of facts about Robert David Bright, with only a very few actual assertions directly on Robert David Bright.
+Siblings have revealed several things for us:
+Man
and hasSibling
⊃ isBrotherOf
, but OWL 2 doesn’t do this implication;The FHKB ontology at this stage of the tutorial has an expressivity ofSRIF.
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 1355.614 sec (0.71682 % of final), by
+Pellet 2.2.0 0.206 sec (0.00167 % of final) and by FaCT++ 1.6.4 is approximately
+0.039 sec (0.001 % of final). 0 sec indicates failure or timeout.
+
In this chapter you will:
+There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
So far we have only used object properties between unspecified objects. We can, however, specify a specific individual to act at the right-hand-side of a class restriction or type assertion on an individual. The basic syntax for so-called nominals is:
+Class: ParentOfRobert
+EquivalentTo: Person and isParentOf value Robert_David_Bright_1965
+
This is an equivalence axiom that recognises any individual that is a Person
and a parent of Robert David Bright.
Task 24: Robert's and Richard's parents |
---|
|
+
We see that these queries work and that we can create more complex nominal based class expressions. The disjunction above is
isParentOf some {Robert_David_Bright_1965, Richard_John_Bright_1962}
+
The ‘{’ and ‘}’ are a bit of syntax that says ‘here’s a class of individual’.
+We also see that the classes for the parents of Robert David Bright and Richard John Bright have the same members according to the FHKB, but that the two classes are not inferred to be equivalent. Our domain knowledge indicates the two classes have the same extents (members) and thus the classes are equivalent, but the automated reasoner does not make this inference. As usual, this is because the FHKB has not given the automated reasoner enough information to make such an inference.
+The classes describing the parents of Richard and Robert are not equivalent, even though, as humans, we know their classes of parent are the same. We need more constraints so that it is known that the four parents are the only ones that exist. We can try this by closing down what we know about the immediate family of Robert David Bright.
+In Chapter 4 we described that a Person
has exactly one Woman
and exactly one Man
as mother and father (by saying that the hasMother
and hasFather
properties are functional and thus only one of each may be held by any one individual to distinct individuals). The parent properties are defined in terms of hasParent
, hasMother
and hasFather
. The latter two imply hasParent
. The two sub-properties are functional, but there are no constraints on hasParent
, so an individual can hold many instances of this property. So, there is no information in the FHKB to say a Person
has only two parents (we say there is one mother and one father, but not that there are only two parents). Thus Robert and Richard could have other parents and other grandparents than those in the FHKB; we have to close down our descriptions so that only two parents are possible. There are two ways of doing this:
1. put a restriction on the Person class saying that each and every Person has exactly two parents (hasParent exactly 2 Person);
2. add a covering axiom to the hasParent property, in the same way as we did for Sex
in Chapter 4.Task 25: Closing the Person class | +
---|
|
+
We find that these two classes are equivalent; we have supplied enough information to infer that these two classes are equivalent. So, we know that option one above works, but what about option two? This takes a bit of care to think through, but the basic thing is to think about how many ways there are to have a hasParent
relationship between two individuals. We know that we can have either a hasFather
or a hasMother
property between two individuals; we also know that we can have only one of each of these properties between an individual and a distinct individual. However, the open world assumption tells us that there may be other ways of having a hasParent
property between two individuals; we’ve not closed the possibilities. By putting on the hasParent exactly 2 Person
restriction on the Person
class, we are effectively closing down the options for ways that a person can have parents; we know because of the functional characteristic on hasMother
and hasFather
that we can have only one of each of these and the two restrictions say that one of each must exist. So, we know we have two ways of having a parent on each Person
individual. So, when we say that there are exactly two parents (no more and no less) we have closed down the world of having parents—thus these two classes can be inferred to be equivalent. It is also worth noting that this extra axiom on the Person
class will make the reasoner run much more slowly.
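For reference, option one amounts to the following restriction on the Person class (a minimal sketch of the axiom discussed above):

Class: Person
 SubClassOf: hasParent exactly 2 Person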
Finally, for option 2, we have no way of placing a covering axiom on a property. What we’d like to be +able to state is something like:
+ObjectProperty: hasParent
+EquivalentTo: hasFather or hasMother
+
but we can’t.
+For practice, do the following:
+Task 26: Additional Practice | +
---|
|
+
In this chapter we have seen the use of individuals within class expressions. It allows us to make useful queries and class definitions. The main things to note is that it can be done and that there is some syntax involved. More importantly, some inferences may not be as expected due to the open world assumption in OWL.
+ +By now you might have noticed a significant increase in the time the reasoner needs
+to classify. Closing down what we know about family relationships takes its toll on
+the reasoner performance, especially the usage of 'hasParent exactly 2 Person'. At
+this point we recommend rewriting this axiom to 'hasParent max 2 Person'. It gives
+us most of what we need, but has a little less negative impact on the reasoning
+time.
+
The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ.
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 2067.273 sec (1.09313 % of final), by
+Pellet 2.2.0 0.529 sec (0.00428 % of final) and by FaCT++ 1.6.4 is approximately
+0.147 sec (0.004 % of final). 0 sec indicates failure or timeout.
+
We now have some individuals with some basic object properties between individuals. OWL 2, however, also has data properties that can relate an object or individual to some item of data. There are data about a Person
, such as years of events and names etc. So, in this Chapter you will:
There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial.
+
Everyone has a birth year and a death year, and some have a marriage year and so on. We can model these simply with data properties and an integer as a filler. OWL 2 has a DateTime datatype with which it is possible to specify a precise time and date, down to a second.7 This proves cumbersome (see http://robertdavidstevens.wordpress.com/2011/05/05/using-the-datetime-data-type-to-describe-birthdays/ for details); all we need is a simple indication of the year in which a person was born. Of course, the integer type has a zero, which the Gregorian calendar (for which we use integer as a proxy) does not, but integer is sufficient for our needs. Also, there are various ontological treatments of time and of information about people (this extends to names etc. as well), but we gloss over that here; that is another tutorial.
+7 http://www.w3.org/TR/2008/WD-owl2-quick-reference-20081202/#Built-in_Datatypes_and_Facets
+We can have dates for birth, death and (eventually) marriage (see Chapter 9) and we can just think of these as event years. We can make a little hierarchy of event years as shown in Figure 7.1).
+Task 27: Create a data property hierarchy | +
---|
|
+
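A minimal sketch of what the top of such a hierarchy could look like in Manchester syntax; the hasEventYear and hasBirthYear names are used later in this chapter, but the exact frames here are an assumption:

DataProperty: hasEventYear
 Domain: Person
 Range: xsd:integer

DataProperty: hasBirthYear
 SubPropertyOf: hasEventYear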
Again, asserting birth years for all individuals can be a bit tedious. The reader
+can find a convenience snapshot of the ontology at this stage at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
We now have an ABox with individuals with fact assertions to data indicating a birth year. We can, if we wish, also add a class restriction to the Person
class saying that each and every instance of the class Person
holds a data property to an integer and that this property is called ‘hasBirthYear’. As usual when deciding whether to place such a restriction upon a class, ask whether it is true that each and every instance of the class holds that property; this is exactly the same as we did for the object properties in Chapter 4. Everyone does have a birth year, even if it is not known.
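If we do choose to add it, such a restriction could look like this (a minimal sketch):

Class: Person
 SubClassOf: hasBirthYear some xsd:integer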
Once birth years have been added to our individuals, we can start asking some questions.
+Task 28: DL queries | +
---|
1. Use a DL query to ask:
|
+
The DL query for people born in the 1960s is:
+Person and hasBirthYear some int[>= 1960, < 1970]
+
This kind of interval is known as a facet.
+The last two queries in the list do not work as expected. We have asked, for instance, for Person
that have more than three children, but we get no members of Person
in the answer, though we know that there are some in the FHKB (e.g., John_Bright_1930
). This is because there is not enough information in the FHKB to tell that this person has more than three different people as children. As humans we can look at the four children of John Bright and know that they are different – for instance, they all have different birth years. The automated reasoner, however, does not know that a Person
can only have one birth year.
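For reference, one way of phrasing the ‘more than three children’ question as a DL query is the following; this exact form is an assumption, as the precise queries are given in the task list:

Person and isParentOf min 4 Person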
Task 29: Make a functional object property | +
---|
|
+
This time the query should work. All the other event year properties should be made functional, except hasEventYear
, as one individual can have many event years. As the children have different birth years and an individual can only hold one hasBirthYear
property, then these people must be distinct entities.
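For hasBirthYear, making the property functional is a one-line addition to its declaration; a minimal sketch:

DataProperty: hasBirthYear
 Characteristics: Functional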
Of course, making birth year functional is not a reliable way of ensuring that the automated reasoner knows that the individuals are different. It is possible for two Person
to have the same birth year within the same family – twins and so on. Peter_William_Bright_1941
has three children, two of which are twins, so will not be a member of the class of people with at least three children. So, we use the different individuals axiom. Most tools, including Protégé, have a feature that allows all individuals to be made different.
Task 30: Make all individuals different | +
---|
|
+
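Behind the scenes this adds a single axiom listing all the individuals as mutually different; in Manchester syntax it looks something like the following (only a few of the FHKB individuals are shown here for illustration):

DifferentIndividuals: Robert_David_Bright_1965, Richard_John_Bright_1962,
 David_Bright_1934, Margaret_Grace_Rever_1934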
From now on, every time you add individuals, make sure the different individuals axiom is updated.
+We have met again the open world assumption and its importance in the FHKB. In the use of the functional characteristic on the hasBirthYear
property, we saw one way of constraining the interpretation of numbers of children. We also introduced the ‘different individuals’ axiom as a way of making all individuals in a knowledge base distinct. There are more questions, however, for which we need more ways of closing down the openness of OWL 2.
Take the questions:
We can only answer these questions if we locally close the world. We have said that David and Margaret have two children, Richard and Robert, but we have not said that there are not any others. As usual, try not to apply your domain knowledge too much; ask yourself what the automated reasoner actually knows. As we have the open world assumption, the reasoner will assume, unless otherwise said, that there could be more children; it simply doesn’t know.
+Think of a railway journey enquiry system. If I ask a standard closed world system about the possible routes by rail, between Manchester and Buenos Aires, the answer will be ’none’, as there are none described in the system. With the open world assumption, if there is no information in the system then the answer to the same question will simply be ‘I don’t know’. We have to explicitly say that there is no railway route from Manchester to Buenos Aires for the right answer to come back.
+We have to do the same thing in OWL. We have to say that David and Margaret have only two children. We do this with a type assertion on individuals. So far we have only used fact assertions. A type assertion to close down David Bright’ parentage looks like this:
+isParentOf only {Robert_David_Bright_1965,Richard_John_Bright_1962 }
+
This has the same meaning as the closure axioms that you should be familiar with on classes. We are saying that the only fillers that can appear on the right-hand-side of the isParentOf
property on this individual are the two individuals for Richard and Robert. We use the braces to represent the set of these two individuals.
Task 31: Make a closure axiom | +
---|
|
+
The last query should return the answer of David Bright. Closing down the whole FHKB ABox is a chore and would really have to be done programmatically. OWL scripting languages such as the Ontology Preprocessing Language8 (OPPL) [2] can help here. Also going directly to the OWL API [1]9, if you know what you are doing, is another route.
+ +Adding all these closure type assertions can slow down the reasoner; so think about
+the needs of your system – just adding it ‘because it is right’ is not necessarily the
+right route.
+
8 http://oppl2.sourceforge.net
+9 http://owlapi.sourceforge.net/
+We also want to add some other useful data facts to people – their names. We have been putting names as part of labels on individuals, but data fact assertions make sense to separate out family and given names so that we can ask questions such as ‘give me all people with the family name Bright and the first given name of either James or William’. A person’s name is a fact about that person and is more, in this case, than just a label of the representation of that person. So, we want family names and given names. A person may have more than one given name – ‘Robert David’, for instance – and an arbitrary number of given names can be held. For the FHKB, we have simply created two data properties of hasFirstGivenName
and hasSecondGivenName
. Ideally, it would be good to have some index on the property to given name position, but OWL has no n-ary relationships. Otherwise, we could reify the hasGivenName
property into a class of objects, such as the following:
Class: GivenName
+SubClassOf: hasValue some String,
+hasPosition some Integer
+
but it is really rather too much trouble for the resulting query potential.
+As already shown, we will use data properties relating instances of Person
to strings. We want to distinguish family and given names, and then different positions of given names through simple conflating of position into the property name. Figure 7.1 shows the intended data property hierarchy.
Figure 7.1: The event year and name data property hierarchies in the FHKB.
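A sketch of the name part of that hierarchy; hasGivenName, hasFirstGivenName and hasSecondGivenName are used in the text, while the exact frames (and ranges) shown here are assumptions:

DataProperty: hasGivenName
 Domain: Person
 Range: xsd:string

DataProperty: hasFirstGivenName
 SubPropertyOf: hasGivenName

DataProperty: hasSecondGivenName
 SubPropertyOf: hasGivenName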
+Do the following:
+Task 32: Data properties | +
---|
|
+
The name data property hierarchy and the queries using those properties display what should now be familiar: sub-properties imply the super-property. So, when we ask hasFirstGivenName
value "William"
and then the query hasGivenName value "William"
we can expect different answers. There are people with ‘William’ as either first or second given name and asking the question with the super-property for given names will collect both first and second given names.
We have used data properties that link objects to data such as string, integer, floats and Booleans etc. OWL uses the XML data types. We have seen a simple use of data properties to simulate birth years. The full FHKB also uses them to place names (given and family) on individuals as strings. This means one can ask for the Person
with the given name "James", of which there are many in the FHKB.
Most importantly we have re-visited the open world assumption and its implications for querying an OWL ABox. We have looked at ways in which the ABox can be closed down – unreliably via the functional characteristic (in this particular case) and more generally via type assertions.
+All the DL queries used in this chapter can also serve as defined classes in the TBox. It is a useful exercise to progressively add more defined classes to the FHKB TBox. Make more complex queries, make them into defined classes and inspect where they appear in the class hierarchy.
+ +The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 1891.157 sec (1.00000 % of final), by
+Pellet 2.2.0 1.134 sec (0.00917 % of final) and by FaCT++ 1.6.4 is approximately
+0.201 sec (0.006 % of final). 0 sec indicates failure or timeout.
+
Note that we now cover the whole range of expressivity of OWL 2. HermiT at
+least is impossibly slow by now. This may be because HermiT does more work
+than the others. For now, we recommend to use either Pellet or FaCT++.
+
In this Chapter you will
+There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
Be warned; from here on the reasoner can start running slowly! Please see warning
+at the beginning of the last chapter for more information.
+
Cousins can be confusing, but here is a brief summary:
+Simply, my first cousins are my parent’s sibling’s children. As usual, we can think about the objects and put in place some sub-property chains.
+Figure 8.1: Tracing out the sub-property chain for cousins going from a child to a parent, to its sibling, and +down to its child, a cousin
+Figure 8.1 shows the sub-property chain for first cousins. As usual, think at the object level; to get to the first cousins of Robert David Bright, we go to the parents of Robert David Bright, to their siblings and then to their children. We go up, along and down. The OWL for this could be:
+ObjectProperty: hasFirstCousin
+SubPropertyOf: hasCousin
+SubPropertyChain: hasParent o hasSibling o hasChild
+Characteristics: Symmetric
+
Note that we follow the definitions in Section 8.1 of first cousins sharing a grandparent, but not a parent. The sub-property chain goes up to children of a grandparent (a given person’s parents), along to siblings and down to their children. We do not want this property to be transitive. The cousins of one’s cousins are not necessarily one’s own cousins. The blood uncles of Robert David Bright have children that are his cousins. These first cousins, however, also have a mother that is not a blood relation of Robert David Bright and the mother’s sibling’s children are not cousins of Robert David Bright.
+We do, however, want the property to be symmetric. One’s cousins have one’s-self as a cousin.
+We need to place the cousin properties in the growing object property hierarchy. Cousins are obviously blood relations, but not ancestors, so they go off to one side, underneath hasBloodrelation
. We should group the different removes and degree of cousin underneath one hasCousin
property and this we will do.
Do the following:
+Task 33: First cousins | +
---|
|
+
You should see the following people as first cousins of Robert David Bright: Mark Anthony Heath, Nicholas Charles Heath, Mark Bright, Ian Bright, Janet Bright, William Bright, James Bright, Julie Bright, Clare Bright, Richard John Bright and Robert David Bright. The last two, as we have come to expect, appear as first cousins of Robert David Bright, and this is not correct. As David Bright will be his own brother, his children are his own nieces and nephews and thus the cousins of his own children. Our inability to infer siblings correctly in the FHKB haunts us still and will continue to do so.
+ +Although the last query for the cousins of Robert David Bright should return the
+same results for every reasoner, we have had experiences where the results differ.
+
Other degrees of cousins follow the same pattern as for first cousins; we go up, along and down. For second cousins we go up from a given individual to children of a great grandparent, along to their siblings and down to their grandchildren. The following object property declaration is for second cousins (note it uses the isGrandparentOf
and its inverse properties, though the parent properties could be used) :
ObjectProperty: hasSecondCousin
+SubPropertyOf: hasCousin
+SubPropertyChain: hasGrandParent o hasSibling o isGrandParentOf
+Characteristics: Symmetric
+
‘Removes’ simply adds in another ‘leg’ of either ‘up’ or ‘down’ on either side of the ‘along’—that is, think of the actual individuals involved and draw a little picture of blobs and lines—then trace your finger up, along and down to work out the sub-property chain. The following object property declaration does it for first cousins once removed (note that this has been done by putting this extra ‘leg’ on to the hasFirstCousin
property; the symmetry of the property makes it work either way around so that a given person is the first cousin once removed of his/her first cousins once removed):
ObjectProperty: hasFirstCousinOnceRemoved
+SubPropertyOf: hasCousin
+SubPropertyChain: hasFirstCousin o hasChild
+Characteristics: Symmetric
+
To exercise the cousin properties do the following:
+Task 34: Cousin properties | +
---|
|
+
You should see that we see some peculiar inferences about Robert David Bright’ cousins – not only are his brother and himself his own cousins, but so are his father, mother, uncles and so on. This makes sense if we look at the general sibling problem, but also it helps to just trace the paths around. If we go up from one of Robert David Bright’ true first cousins to a grandparent and down one parent relationship, we follow the first cousin once removed path and get to one of Robert David Bright’ parents or uncles. This is not to be expected and we need a tighter definition that goes beyond sub-property chains so that we can exclude some implications from the FHKB.
+As far as inferring first cousin facts for Robert David Bright, we have failed. More precisely, we have recalled all Robert David Bright’s cousins, but the precision is not what we would desire. What we can do is ask for Robert David Bright’ cousins, but then remove the children of Robert David Bright’ parents. The following DL query achieves this:
Person that hasFirstCousin value Robert_David_Bright_1965
and (not (hasFather value David_Bright_1934) or not (hasMother value Margaret_Grace_Rever_1934))
+
This works, but only for a named individual. We could make a defined class for this query; we could also make a defined class FirstCousin
, but it is not of much utility. We would have to make sure that people whose parents are not known to have siblings with children are excluded. That is, people are not ‘first cousins’ whose only first cousins are themselves and their siblings. The following class does this:
Class: FirstCousin
+EquivalentTo: Person that hasFirstCousin some Person
+
Task 35: Robert’s first cousins |
---|
|
+
This gives some practice with negation. One is making a class and then ‘taking’ some of it away – ‘these, but not those’.
+We have now expanded the FHKB to include most blood relationships. We have also found that cousins are hard to capture just using object properties and sub-property chains. Our broken sibling inferences mean that we have too many cousins inferred at the instance level. We can get cousins right at the class level by using our inference based cousins, then excluding some using negation. Perhaps not neat, but it works.
+We have reinforced that we can just add more and more relationships to individuals by just adding more properties to our FHKB object property hierarchy and adding more sub-property chains that use the object properties we have built up upon parentage and sibling properties; this is as it should be.
+ +The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet
+2.2.0 111.395 sec (0.90085 % of final) and by FaCT++ 1.6.4 is approximately 0.868
+sec (0.024 % of final). 0 sec indicates failure or timeout.
+
In this chapter you will:
+There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
Much of what is in this chapter is really revision; it is more of the same - making
+lots of properties and using lots of sub-property chains. However, it is worth it as
+it will test your growing skills and it also makes the reasoners and yourself work
+hard. There are also some good questions to ask of the FHKB as a result of adding
+marriages.
+
Marriage is a culturally complex situation to model. The FHKB started with a conservative model of a marriage involving only one man and one woman.10 Later versions are more permissive; a marriage simply has a minimum of two partners. This leaves it open to numbers and sex of the people involved. In fact, ‘marriage’ is probably not the right name for it. Using BreedingRelationship
as a label (the one favoured by the main author’s mother) may be a little too stark and might be a little exclusive.... In any case, some more generic name is probably better and various subclasses of the FHKB’s Marriage
class are probably necessary.
10 There being no funny stuff in the Stevens family.
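As a rough illustration, a permissive marriage model could be sketched along these lines; the hasMarriageYear property appears later in this chapter, but the class shape and the hasPartner name are assumptions, not the FHKB’s actual axioms:

Class: Marriage
 SubClassOf: hasPartner min 2 Person,
  hasMarriageYear exactly 1 xsd:integer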
+To model marriage do the following:
+Task 36: Marriage | +
---|
|
+
We have the basic infrastructure for marriages. We can ask the usual kinds of questions; try the following:
+Task 37: DL queries | +
---|
|
+
DL query: Marriage and hasMarriageYear some int[<= 1960]
+
This marriage infrastructure can be used to infer some slightly more interesting things for actual people. While we want marriage objects so that we can talk about marriage years and even locations, should we want to, we also want to be able to have the straight-forward spouse relationships one would expect. We can use sub-property chains in the usual manner; do the following:
Task 38: Wives and Husbands |
---|
|
+
Figure 9.1 shows what is happening with the sub-property chains. Note that the domains and ranges of the spouse properties come from the elements of the sub-property chains. Note also that the hasSpouse
relationship will be implied from its sub-property chains.
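To make the idea concrete, here is a hedged sketch of one of the spouse properties; the partner property names below (isFemalePartnerIn, hasMalePartner) are placeholders for whatever names you used when building the marriage model in Task 36:

ObjectProperty: isWifeOf
 SubPropertyOf: isSpouseOf
 Domain: Woman
 Range: Man
 SubPropertyChain: isFemalePartnerIn o hasMalePartner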
The following questions can now be asked:
+Figure 9.1: The sub-property chain path used to infer the spouse relationships via the marriage partnerships.
+and many more. This is really a chance to explore your querying abilities and make some complex +nested queries that involve going up and down the hierarchy and tracing routes through the graph of +relationships between the individuals you’ve inferred.
+Now we have spouses, we can also have in-laws. The path is simple: isSpouseOf o hasMother
implies hasMotherInLaw
. The path involved in mothers-in-law can be seen in Figure 9.2. The following OWL code establishes the sub-property chain for hasMotherInLaw
:
ObjectProperty: hasMotherInLaw
+SubPropertyOf: hasParentInLaw
+SubPropertyChain: isSpouseOf o hasMother
+Domain: Person
+Range: Woman
+InverseOf: isMotherInLawOf
+
Figure 9.2: Tracing out the path between objects to make the sub-property chain for mothers-in-law
+Do the following to make the parent in-law properties:
+Task 39: Parents in-law | +
---|
|
+
Brothers and sisters in law have the interesting addition of having more than one path between objects to establish a sister or brother in law relationship. The OWL code below establishes the relationships for ‘is sister in law of’:
+ObjectProperty: hasSisterInLaw
+SubPropertyOf: hasSiblingInLaw
+SubPropertyChain: hasSpouse o hasSister
+SubPropertyChain: hasSibling o isWifeOf
+
A wife’s husband’s sister is a sister in law of the wife. Figure 9.3 shows the two routes to being a sister-in-law. In addition, the wife is a sister in law of the husband’s siblings. One can add as many sub-property chains to a property as one needs. You should add the properties for hasSiblingInLawOf
and its obvious sub-properties following the inverse of the pattern above.
Task 40: Siblings in-law | +
---|
|
+
By now, chances are high that the realisation takes a long time. We recommend removing
the very computationally expensive restriction `hasParent exactly 2 Person`
on the `Person` class, if you have not done so already.
+
Figure 9.3: The two routes to being a sister-in-law.
+The uncle of Robert David Bright has a wife, but she is not the aunt of Robert David Bright, she is the aunt-in-law. This is another kith relationship, not a kin relationship. The pattern has a familiar feel:
+ObjectProperty: isAuntInLawOf
+SubPropertyOf: isInLawOf
+SubPropertyChain: isWifeOf o isBrotherOf o isParentOf
+
Task 41: Uncles and aunts in-law | +
---|
|
+
Figure 9.4: The object property hierarchy after adding the various in-law properties.
+This has really been a revision chapter; nothing new has really been introduced. We have added a lot of new object properties and one new data property. The latest object property hierarchy with the ‘in-law’ branch can be seen in Figure 9.4. Highlights have been:
+The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet
+2.2.0 123.655 sec (1.00000 % of final) and by FaCT++ 1.6.4 is approximately 1.618
+sec (0.046 % of final). 0 sec indicates failure or timeout.
+
In this chapter you will:
+There is a snapshot of the ontology as required at this point in the tutorial available
+at http://owl.cs.manchester.ac.uk/tutorials/fhkbtutorial
+
Add the following defined classes:
+Task 42: Adding defined classes | +
---|
|
+
The three classes of Child
, Son
and Daughter
are of note. They are coded in the following way:
Class: Child EquivalentTo: Person that hasParent some Person
Class: Son EquivalentTo: Man that hasParent some Person
Class: Daughter EquivalentTo: Woman that hasParent some Person
+
After running the reasoner, you will find that Person
is found to be equivalent to Child
; Daughter
is equivalent to Woman
and that Son
is equivalent to Man
. This does, of course, make sense – each and every person is someone’s child, each and every woman is someone’s daughter. We will forget evolutionary time-scales where this might be thought to break down at some point – all Person
individuals are also Descendant
individuals, but do we expect some molecule in some prebiotic soup to be a member of this class?
Nevertheless, within the scope of the FHKB, such inferred equivalences are not unreasonable. They are also instructive; it is possible to have different intensional descriptions of a class and for them to have the same logical extents. You can see another example of this happening in the amino acids ontology, but for different reasons.
+Taking Grandparent
as an example class, there are two ways of writing the defined class:
Class: Grandparent EquivalentTo: Person and isGrandparentOf some Person
+Class: Grandparent EquivalentTo: Person and (isParentOf some (Person and (isParentOf some Person)))
+
Each comes out at a different place in the class hierarchy. They both capture the right individuals as members (that is, those individuals in the ABox that are holding a isGrandparentOf
property), but the class hierarchy is not correct. By definition, all grandparents are also parents, but the way the object property hierarchy works means that the first way of writing the defined class (with the isGrandparentOf
property) is not subsumed by the class Parent
. We want this to happen in any sensible class hierarchy, so we have to use the second pattern for all the classes, spelling out the sub-property path that implies the property such as isGrandparentOf
within the equivalence axiom.
The reason for this need for the ‘long-form’ is that the isGrandparentOf
does not imply the isParentOf
property. As described in Chapter 3 if this implication were the case, being a grandparent of Robert David Bright, for instance, would also imply that the same Person
were a parent of Robert David Bright; an implication we do not want. As these two properties (isParentOf
and isGrandparentOf
) do not subsume each other means that the defined classes written according to pattern one above will not subsume each other in the class hierarchy. Thus we use the second pattern. If we look at the class for grandparents of Robert:
Class: GrandparentOfRobert
+EquivalentTo: Person that isParentOf some (Person that isParentOf value Robert_David_Bright_1965)
+
If we make the equivalent class for Richard John Bright, apply the reasoner and look at the hierarchy, we see that the two classes are not logically equivalent, even though they have the same extents of William George Bright, Iris Ellen Archer, Charles Herbert Rever and Violet Sylvia Steward. We looked at this example in Section 6.2, where there is an explanation and solutions.
+We can add defined classes based on each property we have put into the object property hierarchy. We see the expected hierarchy; as can be seen from Figure 10.1 it has an obvious symmetry based on sex. We also see a lot of equivalences inferred – all women are daughters, as well as women descendants. Perhaps not the greatest insight ever gained, but it at least makes sense; all women must be daughters. It is instructive to use the explanation feature in Protégé to look at why the reasoner has made these inferences. For example, take a look at the class hasGrandmother some Woman
– it is instructive to see how many there are.
Like the Chapter on marriage and in-law (Chapter 9), this chapter has largely been revision. One thing of note is, however, that we must not use the object properties that are inferred through sub-property chains as definitions in the TBox; we must spell out the sub-property chain in the definition, otherwise the implications do not work properly.
+One thing is almost certain; the resulting TBox is rather complex and would be almost impossible to maintain by hand.
+ +Figure 10.1: The full TBox hierarchy of the FHKB
+ +The FHKB ontology at this stage of the tutorial has an expressivity of SROIQ(D).
+
The time to reason with the FHKB at this point (in Protégé) on a typical desktop
+machine by HermiT 1.3.8 is approximately 0.000 sec (0.00000 % of final), by Pellet
+2.2.0 0.000 sec (0.00000 % of final) and by FaCT++ 1.6.4 is approximately 35.438
+sec (1.000 % of final). 0 sec indicates failure or timeout.
+
If you have done all the tasks within this tutorial, then you will have touched most parts of OWL 2. Unusually for most uses of OWL we have concentrated on individuals, rather than just on the TBox. One note of warning – the full FHKB has some 450 members of the Bright family and takes a reasonably long time to classify, even on a sensible machine. The FHKB is not scalable in its current form.
+One reason for this is that we have deliberately maximised inference. We have attempted not to explicitly type the individuals, but drive that through domain and range constraints. We are making the property hierarchy do lots of work. For the individual Robert David Bright, we only have a couple of assertions, but we infer some 1 500 facts between Robert David Bright and other named individuals in the FHKB–displaying this in Protégé causes problems. We have various complex classes in the TBox and so on.
We probably do not wish to drive a genealogical application using an FHKB in this form. Its purpose is educational. It touches most of OWL 2 and shows a lot of what it can do, but also a considerable amount of what it cannot do. As inference is maximised, the FHKB breaks most of the OWL 2 reasoners at the time of writing. However, it serves its role to teach about OWL 2.
OWL 2 on its own, used in this style, really does not work for family history. We have seen that siblings and cousins cause problems. Rules in various forms can do this kind of thing easily—it is one of the primary examples for learning about Prolog. Nevertheless, the FHKB does show how much inference between named individuals can be driven from a few fact assertions and a property hierarchy. Assuming a powerful enough reasoner and the ability to deal with many individuals, it would be possible to make a family history application using the FHKB, as long as one hid the long and sometimes complex queries and manipulations that would be necessary to ‘prune’ some of the ‘extra’ facts found about individuals. However, the FHKB does usefully show the power of OWL 2, touch a great deal of the language and demonstrate some of its limitations.
+Table A.1: The list of individuals in the FHKB
+Person | +First given name | +Second given name | +Family name | +Birth year | +Mother | +Father | +
---|---|---|---|---|---|---|
Alec John Archer 1927 | +Alec | +John | +Archer | +1927 | +Violet Heath 1887 | +James Alexander Archer 1882 | +
Charles Herbert Rever 1895 | +Charles | +Herbert | +Rever | +1895 | +Elizabeth Frances Jessop 1869 | +William Rever 1870 | +
Charlotte Caroline Jane Bright 1894 | +Charlotte | +Caroline Jane | +Bright | +1894 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Charlotte Hewett 1863 | +Charlotte | +none | +Hewett | +1863 | +not specified | +not specified | +
Clare Bright 1966 | +Clare | +none | +Bright | +1966 | +Diana Pool | +Peter William Bright 1941 | +
Diana Pool | +Diana | +none | +Pool | +none | +not specified | +not specified | +
David Bright 1934 | +David | +none | +Bright | +1934 | +Iris Ellen Archer 1906 | +William George Bright 1901 | +
Dereck Heath | +Dereck | +none | +Heath | +1927 | +not specified | +not specified | +
Eileen Mary Rever 1929 | +Eileen | +Mary | +Rever | +1929 | +Violet Sylvia Steward 1894 | +Charles Herbert Rever 1895 | +
Elizabeth Frances Jessop 1869 | +Elizabeth | +Frances | +Jessop | +1869 | +not specified | +not specified | +
Ethel Archer 1912 | +Ethel | +none | +Archer | +1912 | +Violet Heath 1887 | +James Alexander Archer 1882 | +
Frederick Herbert Bright 1889 | +Frederick | +Herbert | +Bright | +1889 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Henry Edmund Bright 1862 | +Henry | +Edmund | +Bright | +1862 | +not specified | +not specified | +
Henry Edmund Bright 1887 | +Henry | +Edmund | +Bright | +1887 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Ian Bright 1959 | +Ian | +none | +Bright | +1959 | +Joyce Gosport | +John Bright 1930 | +
Iris Ellen Archer 1906 | +Iris | +Ellen | +Archer | +1906 | +Violet Heath 1887 | +James Alexander Archer 1882 | +
James Alexander Archer 1882 | +James | +Alexander | +Archer | +1882 | +not specified | +not specified | +
James Bright 1964 | +James | +none | +Bright | +1964 | +Diana Pool | +Peter William Bright 1941 | +
James Frank Hayden Bright 1891 | +James | +Frank | +Bright | +1891 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Janet Bright 1964 | +Janet | +none | +Bright | +1964 | +Joyce Gosport | +John Bright 1930 | +
John Bright 1930 | +John | +none | +Bright | +1930 | +Iris Ellen Archer 1906 | +William George Bright 1901 | +
John Tacey Steward 1873 | +John | +Tacey | +Steward | +1873 | +not specified | +not specified | +
Joyce Archer 1921 | +Joyce | +none | +Archer | +1921 | +Violet Heath 1887 | +James Alexander Archer 1882 | +
Joyce Gosport | +Joyce | +none | +Gosport | +not specified | +not specified | +not specified | +
Julie Bright 1966 | +Julie | +none | +Bright | +1966 | +Diana Pool | +Peter William Bright 1941 | +
Kathleen Minnie Bright 1904 | +Kathleen | +Minnie | +Bright | +1904 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Leonard John Bright 1890 | +Leonard | +John | +Bright | +1890 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Lois Green 1871 | +Lois | +none | +Green | +1871 | +not specified | +not specified | +
Margaret Grace Rever 1934 | +Margaret | +Grace | +Rever | +1934 | +Violet Sylvia Steward 1894 | +Charles Herbert Rever 1895 | +
Mark Anthony Heath 1960 | +Mark | +Anthony | +Heath | +1960 | +Eileen Mary Rever 1929 | +Dereck Heath | +
Mark Bright 1956 | +Mark | +none | +Bright | +1956 | +Joyce Gosport | +John Bright 1930 | +
Nicholas Charles Heath 1964 | +Nicholas | +Charles | +Heath | +1964 | +Eileen Mary Rever 1929 | +Dereck Heath | +
Nora Ada Bright 1899 | +Nora | +Ada | +Bright | +1899 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
Norman James Archer 1909 | +Norman | +James | +Archer | +1909 | +Violet Heath 1887 | +James Alexander Archer 1882 | +
Peter William Bright 1941 | +Peter | +William | +Bright | +1941 | +Iris Ellen Archer 1906 | +William George Bright 1901 | +
Richard John Bright 1962 | +Richard | +John | +Bright | +1962 | +Margaret Grace Rever 1934 | +David Bright 1934 | +
Robert David Bright 1965 | +Robert | +David | +Bright | +1965 | +Margaret Grace Rever 1934 | +David Bright 1934 | +
Violet Heath 1887 | +Violet | +none | +Heath | +1887 | +not specified | +not specified | +
Violet Sylvia Steward 1894 | +Violet | +Sylvia | +Steward | +1894 | +Lois Green 1871 | +John Tacey Steward 1873 | +
William Bright 1970 | +William | +none | +Bright | +1970 | +Joyce Gosport | +John Bright 1930 | +
William George Bright 1901 | +William | +George | +Bright | +1901 | +Charlotte Hewett 1863 | +Henry Edmund Bright 1862 | +
William Rever 1870 | +William | +none | +Rever | +1870 | +not specified | +not specified | +
[1] M. Horridge and S. Bechhofer. The owl api: a java api for working with owl 2 ontologies. Proc. of +OWL Experiences and Directions , 2009, 2009.
+[2] Luigi Iannone, Alan Rector, and Robert Stevens. Embedding knowledge patterns into owl. In European +Semantic Web Conference (ESWC09) , pages 218–232, 2009.
+[3] Dmitry Tsarkov, Uli Sattler, Margaret Stevens, and Robert Stevens. A Solution for the Man-Man +Problem in the Family History Knowledge Base. In Sixth International Workshop on OWL: Experiences and Directions 2009 , 2009.
+ + + + + + +GitHub is increasingly used by software developers, programmers and project managers for uploading and sharing content, as well as basic project management. You build a profile, upload projects to share and connect with other users by "following" their accounts. Many users store programs and code projects, but you can also upload text documents or other file types in your project folders to share publicly (or privately). It is capable of storing any file type from text, to structured data, to software. And more features are being added by the day. The real power of Git, however, is less about individuals publishing content (many places can do that, including google docs etc). It is more about that content being easily shared, built upon, and credited in a way that is robust to the realities of distributed collaboration. You don't have to know how to code or use the command line. It is a powerful way to organize projects with multiple participants.
+Git supports the following types of primary entities:
+The relationships between any combination of these entities is many-to-many, with the nuanced exception of repositories. +For our purposes today we will oversimplify by saying that a repositoy belongs either to a single organization or to a single individual.
+ +Content in GitHub is written using Markdown, a text-to-HTML conversion tool for web writers (ref).
+For more help with Markdown, see this GitHub guide.
+Raw markup syntax | +As rendered | +
---|---|
Header - use # for H1, ## for H2, etc. |
+# Header, ## Header (note, the header is not displaying properly in this table) | +
Emphasis, aka italics, with *asterisks* or _underscores_. |
+Emphasis, aka italics, with asterisks or underscores. | +
Strong emphasis, aka bold, with **asterisks** or __underscores__. |
+Strong emphasis, aka bold, with asterisks or underscores. | +
Combined emphasis with **asterisks and _underscores_**. |
+Combined emphasis with asterisks and underscores. | +
Strikethrough uses two tildes. ~~Scratch this.~~ |
+Strikethrough uses two tildes. ~~Scratch this.~~ | +
Lists:
+To introduce line breaks in markdown, add two spaces
+For a bulleted list, use * or - (followed by a space)
Here is an example of a list:
+One
+Two
+Three
Here is an example of a bulleted list:
+GitHub can store any kind of content, provided it isn't too big. (And now even this is possible). +However, it is more capable for some filetypes than it is for others. Certain filetypes can be viewed 'natively' within the GitHub interface. These are:
+Adopted from CD2H MTIP tutorial
+ + + + + + +Why: +"Issues are a great way to keep track of tasks, enhancements, and bugs for your projects or for anyone else's. As long as you are a registered GitHub user you can log an issue, or comment on an issue for any open repo on GitHub. Issues are a bit like email—except they can be shared, intelligently organized, and discussed with the rest of your team. GitHub’s tracker is called Issues, and has its own section in every repository." (From: https://guides.github.com/features/issues/)
+How:
+How to create an issue in GitHub:
+- [ ]
markdown syntax before each bullet. Note, you can also add sub-tasks by clicking the 'add a task list' button in the tool bar. The status of the tasks in an issue (eg. https://github.com/nicolevasilevsky/c-path-practice/issues/1 will then be reflected in any summary view. Eg. https://github.com/nicolevasilevsky/c-path-practice/issues.Your turn:
+Follow the instructions above to create a ticket about a hypothetical issue (such as an improvement to this tutorial) that includes a sub-task list.
+Assign issues to people
+Add labels
+New Labels
+Your turn:
+On the ticket you previously created:
+Comment on issues
+Close issues
+Use direct @ mentions
+Link documents
+You can link documents and files by:
+Cross reference to another ticket
+Before saving your changes, you can preview the comment to ensure the correct formatting.
+Your turn:
+Milestones
+Your turn
+Create a new milestone, and add the milestone to an existing ticket.
+Projects
+To create project:
+Your turn
+Create a new project and add columns and add cards to the columns.
+Once you start using GitHub for lots of things it is easy to get overwhelmed by the number of issues. The query dashboard https://github.com/issues allows you to filter on tickets.
+More complex queries are also possible.
+Note, you must be signed in to GitHub to view the above links.
+Further reading on Issue querys
+Adopted from CD2H MTIP tutorial
+ + + + + + +As a modern ontology curator, you are an engineer - you are curating computable knowledge, testing the integrity of your curation using quality control testing, and are responsible for critical components of modern knowledge systems that directly affect user experience - the ontologies.
+Scientific computing is a big, scary world comprising many different tools, methodologies, training resources and philosophies, but nearly all modern workflows share one key aspect: the ability to execute commands that help you find and manipulate data with the command line. Some examples of that include:
+sh run.sh make prepare_release
git
and committing changescurl
or wget
Here we are doing a basic hands on tutorial which will walk you through the must-know commands. For a more comprehensives introduction into thinking about automation please see our lesson on Automating Ontology Development Workflows: Make, Shell and Automation Thinking
+The tutorial uses example tailored for users of UNIX systems, like Mac and Linux.
+Users of Windows generally have analogous steps - wherever we talk about an sh
file in the following
+there exists a corresponding bat
file that can be run in the Windows PowerShell, or CMD.
You have:
Intro to Command Line Interface Part 1
We are not going to discuss here in any detail what the command line is. We will focus on what you can do with it: for more information skip to the further reading section.
+The basic idea behind the command line is that you run a command to achieve a goal. Among the most common goals relevant to you as a semantic engineer will be:
Most commands result in some kind of printed statement. Let's try one. Open your terminal (a terminal is the program you use to enter commands. For a nice overview of how shell, terminal, command line and console relate, see here). On Mac, you can type CMD+Space to search for programs and then type "terminal". For this tutorial we use the default Terminal.app, but there are many others, including iterm2. For this introduction, it does not matter which terminal you use. When first opening the terminal you will see something like this:
Note that your terminal window may look slightly different, depending on your configuration. More on that later.
+Let's type our first command and hit enter:
+whoami
+
On my machine I get
+(base) matentzn@mbp.local:~ $ whoami
+matentzn
+
This does not seem like a useful command, but sometimes, we forget who we are, and it is good to be reminded. So, what happened here? We ran a command, named whoami
and our command line executed that command, which is implemented somewhere on our machine as a program. That program simply determined who I am in some way, and then printed the result.
Ok so, let's look a bit closer at the command prompt itself:
+matentzn@mbp.local:~ $
+
Two interesting things to note here for today:
+~
. This universally (on all Unix systems) refers to your user directory on your computer. In this case here, it tells you that in your terminal, you are "in your user directory".$
sign. It simply denotes where your command line starts (everything before the $ is information provided to you, everything will be about your commands). Make sure that you do not accidentally copy based the $
sign from the examples on the web into your command prompt:(base) matentzn@mbp.local:~ $ $ whoami
+-bash: $: command not found
+(base) matentzn@mbp.local:~ $
+
As you can see, the $ was interpreted as a (non-existent) command, and the whoami after it did not do anything.
Ok, based on the ~
we know that we are "in the user home directory". Let us become a bit more confident about that and ask the command prompt where we are:
matentzn@mbp.local:~ $ pwd
+/Users/matentzn
+
The pwd
command prints out the full path of our current location in the terminal. As you can see, the default location when opening the command prompt is, indeed, the home directory, located in /Users/matentzn
. We will use it later again.
A word about paths. /Users/matentzn
is what we call a path. On UNIX systems, /
separates one directory from another. So matentzn
is a directory inside of the Users
directory.
Let us now take a look what our current directory contains (type ls
and hit enter):
matentzn@mbp.local:~ $ ls
+Applications Library ...
+
This command will simply list all of the files in your directory as a big list. We can do this a bit nicer:
+matentzn@mbp.local:~ $ ls -l
+total 80000
+drwx------@ 4 matentzn staff 128 31 Jul 2020 Applications
+drwx------@ 26 matentzn staff 832 12 Sep 2021 Desktop
+
-l
is a short command line option which allows you to specify that you would like to print the results in a different format (a long list). We will not go into any detail here about what this means, but a few things to note in the output: you can see some pieces of information that are interesting, like when the file or directory was last modified (i.e. 31 July 2020), who modified it (me) and, of course, the name, e.g. Applications
.
Before we move on to the next section, let us clear
the current terminal from all the command outputs we ran:
clear
+
Your command prompt should now be empty again.
+ +In the previous section we learned how to figure out who we are (whoami
), where we are (pwd
) and how to see what is inside the current directory (ls -l
) and how to clear all the output (clear
).
Let us now look at how we can programmatically create a new directory and change the location in our terminal.
+First let us create a new directory:
+mkdir tutorial-my
+
Now if we list the contents of our current directory again (ls -l
), we will see our newly created directory listed! Unfortunately, we just realised that we chose the wrong name for our directory! It should have been my-tutorial
instead of tutorial-my
! So, let us rename it. In the command prompt, rather than "renaming" files and directories, we "move" them (mv
).
mv tutorial-my my-tutorial
+
Now, let's enter our newly created directory using the change directory command (cd
), and create another sub-directory in my-tutorial
, called "data" (mkdir data
):
cd my-tutorial
+mkdir data
+
You can check again with ls -l
. If you see the data directory listed, we are all set! Feel free to run clear
again to get rid of all the current output on the command prompt.
Let us also enter this directory now: cd data
.
If we want to leave the directory again, feel free to do that like this:
+cd ..
+
The two dots (..
) mean: "parent directory." This is very important to remember during your command line adventures: ..
stands for "parent directory", and .
stands for "current/this directory" (see more on that below).
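As a quick illustration (a sketch that assumes you are currently inside my-tutorial/data):

```sh
pwd      # prints something like /Users/matentzn/my-tutorial/data
cd ..    # .. takes you up to the parent directory, my-tutorial
ls .     # . refers to the current directory, so this lists its contents
```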
Now, let's get into something more advanced: downloading files.
Our first task is to download the famous Human Phenotype Ontology Gene to Phenotype Annotations (aka HPOA). As you should already know, whenever we download ontologies, or ontology related files, we should always use a persistent URL, if available! This is the one for HPOA: http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt
.
There are two very popular commands for downloading content: curl
and wget
. I think most of my colleagues prefer curl
, but I like wget
because it is simpler for beginners. So I will use it here. Let us try downloading the file!
wget http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt -O genes_to_phenotype.txt
+
The -O
parameter is optional and specifies a filename. If you do not add the parameter, wget
will try to guess the filename from the URL. This does not always go so well with complex URLs, so I personally recommend basically always specifying the -O
parameter.
You can also use the curl equivalent of the wget command:
+curl -L http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt --output genes_to_phenotype.txt
+
Try before reading on: Exercises!
+genes_to_phenotype.txt
to the data directory you previously created.data
directory.Do not move on to the next step unless your data directory looks similar to this:
+matentzn@mbp.local:~/my-tutorial/data $ pwd
+/Users/matentzn/my-tutorial/data
+matentzn@mbp.local:~/my-tutorial/data $ ls -l
+total 53968
+-rw-r--r-- 1 matentzn staff 19788987 11 Jun 19:09 genes_to_phenotype.txt
+-rw-r--r-- 1 matentzn staff 7836327 27 Jun 22:50 hp.obo
+
Ok, let us look at the first 10 lines of genes_to_phenotype.txt using the head
command:
head genes_to_phenotype.txt
+
head
is a great command to familiarise yourself with a file. You can use a parameter to print more or fewer lines:
head -3 genes_to_phenotype.txt
+
This will print the first 3 lines of the genes_to_phenotype.txt file. There is another analogous command that allows us to look at the last lines of a file:
+tail genes_to_phenotype.txt
+
head
, tail
. Easy to remember.
Next, we will learn the most important of all standard commands on the command line: grep
. grep
stands for "Global regular expression print" and allows us to search files, and print the search results to the command line. Let us try some simple commands first.
grep diabetes genes_to_phenotype.txt
+
You will see a list of hundreds of lines of output. Each line corresponds to a line in the genes_to_phenotype.txt
file which contains the word "diabetes".
grep is case sensitive. It won't find matches like Diabetes, with a capital D! Use the `-i` parameter in the grep command to instruct grep to perform case-insensitive matches.
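For example, the following matches "diabetes", "Diabetes" and any other capitalisation:

```sh
grep -i diabetes genes_to_phenotype.txt
```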
There is a lot more to grep than we can cover here today, but one super cool thing is searching across an entire directory.
+grep -r "Elevated circulating follicle" .
+
Assuming you are in the data
directory, you should see something like this:
./genes_to_phenotype.txt:190 NR0B1 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510
+./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - - mim2gene OMIM:273250
+...... # Removed other results
+./hp.obo:name: Elevated circulating follicle stimulating hormone level
+
There are two new aspects to the command here:
+-r
option ("recursive") allows us to search a directory and all directories within it.
in the beginning. Remember, in the previous use of the grep
command we had the name of a file in the place where now the .
is. The .
means "this directory" - i.e. the directory you are in right now (if lost, remember pwd
).As you can see, grep
does not only list the line of the file in which the match was found, it also tells us which filename it was found in!
+We can make this somewhat more easy to read as well by only showing filenames using the -l
parameter:
matentzn@mbp.local:~/my-tutorial/data $ grep -r -l "Elevated circulating follicle" .
+./genes_to_phenotype.txt
+./hp.obo
+
The final lesson for today is about one of the most powerful features of the command line: the ability to chain commands together. Let us start with a simple example (make sure you are inside the data directory):
+grep -r "Elevated circulating follicle" . | head -3
+
This results in:
+./genes_to_phenotype.txt:190 NR0B1 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510
+./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - - mim2gene OMIM:273250
+./genes_to_phenotype.txt:57647 DHX37 HP:0008232 Elevated circulating follicle stimulating hormone level - HP:0040281 orphadata ORPHA:251510
+
So, what is happening here? First, we use the grep
command to find "Elevated circulating follicle" in our data directory. As you may remember, there are more than 10 results for this command. So the grep command now wants to print all of these results for you, but the |
pipe symbol intercepts the result from grep
and passes it on to the next command, which is head
. Remember head
and tail
from above? It's exactly the same thing, only that, rather than printing the first lines of a file, we print the first lines of the output of the previous command. You can do incredible things with pipes. Here is a taster which is beyond this first tutorial, but should give you a sense:
grep "Elevated circulating follicle" genes_to_phenotype.txt | cut -f2 | sort | uniq | head -3
+
Output:
+AR
+BNC1
+C14ORF39
+
What is happening here?
+grep
is looking for "Elevated circulating follicle" in all files in the directory, then "|" is passing the output on tocut
, which extracts the second column of the table (how cool?), then "|" is passing the output on tosort
, which sorts the output, then "|" is passing the output on touniq
, which removes all duplicate values from the output, then "|" is passing the output on tohead
, which is printing only the first 3 rows of the result.Another super cool use of piping is searching your command history. Try running:
+history
+
This will show you all the commands you have recently run. Now if you want to simply look for some very specific commands that you have run in the past you can combine history
with grep
:
history | grep follicle
+
This will print every command in your recent history that contains the word "follicle". Super useful if you, like me, keep forgetting your commands!
The last critical feature of the command line we cover today is the "file redirect". Instead of printing the output to the terminal, we may choose to redirect the results to a file:
+matentzn@mbp.local:~/my-tutorial/data $ grep "Elevated circulating follicle" genes_to_phenotype.txt | cut -f2 | sort | uniq | head -3 > gene.txt
+matentzn@mbp.local:~/my-tutorial/data $ head gene.txt
+AR
+BNC1
+C14ORF39
+
> gene.txt
basically tells the command line: instead of printing the results to the command line, "print" them into a file which is called gene.txt
.
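The same trick works with any of the commands above. For example, a quick sketch that saves all the recursive search results to a file and then looks at the first lines:

```sh
grep -r "Elevated circulating follicle" . > follicle_matches.txt
head -3 follicle_matches.txt
```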
Sam also did her PhD in and around ontologies but has moved entirely to data engineering since. I really liked her 1-hour introduction to the terminal; it should fill some of the yawning gaps in this introduction here.
Today we will pick up where we left off after the first CLI tutorial, and discuss some more uses of the command line. In particular, we will:
+You have:
+~/.zshrc
file in case you have had any previous customisations you wish to preserve.Introduction to Command Line Interface Part 2
+ + +odk.bat
as instructed above in some directory on your machine (the path to the odk.bat file should have no spaces!).bash_profile
in the same directory as your odk.bat file.-v %cd%\.bash_profile:/root/.bash_profile
to the odk.bat file (this is mounting the .bash_profile
file inside your ODK container). There is already a similar -v statement in this file, just copy it right afterodk.bat bash
on your CMD (first, cd
to the directory containing the odk.bat file).If you have not done so, install https://ohmyz.sh/. It is not strictly speaking necessary to use ohmyzsh to follow the rest of this tutorial, but it is a nice way to managing your Zsh (z-shell) configuration. Note that the ODK is using the much older bash
, but it should be fine for you to work with anyways.
As Semantic Engineers or Ontology Curators we frequently have to install custom tools like ROBOT, owltools, and more on our computer. These are frequently downloaded from the internet as "binaries", for example as Java "jar" files. In order for our shell to "know" about these downloaded programs, we have to "add them to the path".
+Let us first look at what we currently have loaded in our path:
+echo $PATH
+
What you see here is a list of paths. To read this list a bit more easily, let us remember our lesson on piping commands:
+echo $PATH | tr ':' '\n' | sort
+
What we do here:
+echo
command to print the contents of the $PATH variable. In Unix systems, the $
signifies the beginning of a variable name (if you are curious about what other "environment variables" are currently active on your system, use theprintenv
command). The output of theecho
command is piped to the next command (tr
).tr – translate characters
command copies the input of the previous command to the next with substitution or deletion of selected characters. Here, we substitute the :
character, which is used to separate the different directory paths in the $PATH
variable, with "\n", which is the all important character that denotes a "new line".So, how do we change the "$PATH"? Let's try and install ROBOT and see! Before we download ROBOT, let us think how we will organise our custom tools moving forward. Everyone has their own preferences, but I like to create a tools
directory right in my Users directory, and use this for all my tools moving forward. In this spirit, lets us first go to our user directory in the terminal, and then create a "tools" directory:
cd ~
+mkdir -p tools
+
The -p
parameter simply means: create the tools directory only if it does not exist. Now, let us go inside the tools directory (cd ~/tools
) and continue following the instructions provided here.
First, let us download the latest ROBOT release using the curl
command:
curl -L https://github.com/ontodev/robot/releases/latest/download/robot.jar > robot.jar
+
ROBOT is written in the Java programming language, and packaged up as an executable JAR file. It is still quite cumbersome to directly run a command with that JAR file, but for the hell of it, let us just do it (for fun):
+java -jar robot.jar --version
+
If you have worked with ROBOT before, this looks quite a bit uglier than simply writing:
+robot --version
+
If you get this (or a similar) error:
+zsh: permission denied: robot
+
You will have to run the following command as well, which makes the robot
wrapper script executable:
chmod +x ~/tools/robot
+
So, how can we achieve this? The answer is, we download a "wrapper script" and place it in the same folder as the Jar. Many tools provide such wrapper scripts, and they can sometimes do many more things than just "running the jar file". Let us now download the latest wrapper script:
+curl https://raw.githubusercontent.com/ontodev/robot/master/bin/robot > robot
+
If everything went well, you should be able to print the contents of that file to the terminal using cat
:
cat robot
+
You should see something like:
+#!/bin/sh
+
+## Check for Cygwin, use grep for a case-insensitive search
+IS_CYGWIN="FALSE"
+if uname | grep -iq cygwin; then
+ IS_CYGWIN="TRUE"
+fi
+
+# Variable to hold path to this script
+# Start by assuming it was the path invoked.
+ROBOT_SCRIPT="$0"
+
+# Handle resolving symlinks to this script.
+# Using ls instead of readlink, because bsd and gnu flavors
+# have different behavior.
+while [ -h "$ROBOT_SCRIPT" ] ; do
+ ls=`ls -ld "$ROBOT_SCRIPT"`
+ # Drop everything prior to ->
+ link=`expr "$ls" : '.*-> \(.*\)$'`
+ if expr "$link" : '/.*' > /dev/null; then
+ ROBOT_SCRIPT="$link"
+ else
+ ROBOT_SCRIPT=`dirname "$ROBOT_SCRIPT"`/"$link"
+ fi
+done
+
+# Directory that contains the this script
+DIR=$(dirname "$ROBOT_SCRIPT")
+
+if [ $IS_CYGWIN = "TRUE" ]
+then
+ exec java $ROBOT_JAVA_ARGS -jar "$(cygpath -w $DIR/robot.jar)" "$@"
+else
+ exec java $ROBOT_JAVA_ARGS -jar "$DIR/robot.jar" "$@"
+fi
+
We are not getting into the details of what this wrapper script does, but note that you can find the actual call to the ROBOT jar file towards the end: java $ROBOT_JAVA_ARGS -jar "$DIR/robot.jar" "$@". The cool thing is, we do not need to ever worry about this script, but it is good for us to know, as Semantic Engineers, that it exists.
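Before we touch the PATH, we can already check that the download worked by calling the wrapper with its full path (a quick sketch; the version printed will differ):

```sh
chmod +x ~/tools/robot     # make the wrapper script executable (see the note above)
~/tools/robot --version    # run ROBOT through the wrapper without changing the PATH
```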
Now, we have downloaded the ROBOT jar file and the wrapper script into the ~/tools
directory. The last step remaining is to add the ~/tools
directory to your path. It makes sense to try to at least understand the basic idea behind environment variables: variables that are "loaded" or "active" in your environment (your shell). The first thing you could try to do is change the variable right here in your terminal. To do that, we can use the export
command:
export PATH=$PATH:~/tools
+
What you are doing here is using the export
command to set the PATH
variable to $PATH:~/tools
, which is the old path ($PATH
), a colon (:
) and the new directory we want to add (~/tools
). And, indeed, if we now look at our path again:
echo $PATH | tr ':' '\n' | sort
+
We will see the path added. We can now move around to any directory on our machine and invoke the robot
command. Try it before moving on!
Unfortunately, the change we have now applied to the $PATH
variable is not persistent: if you open a new tab in your Terminal, your $PATH
variable is back to what it was. What we have to do in order to make this persistent is to add the export
command to a special script which is run every time the you open a new terminal: your shell profile.
There is a lot to say about your shell profiles, and we are taking a very simplistic view here that covers 95% of what we need: If you are using zsh
your profile is managed using the ~/.zshrc
file, and if you are using bash
, your profile is managed using the ~/.bash_profile
file. In this tutorial I will assume you are using zsh
, and, in particular, after installing "oh-my-zsh". Let us look at the first 5 lines of the ~/.zshrc
file:
head ~/.zshrc
+
If you have installed oh-my-zsh, the output will look something like:
+# If you come from bash you might have to change your $PATH.
+# export PATH=$HOME/bin:/usr/local/bin:$PATH
+
+# Path to your oh-my-zsh installation.
+export ZSH="$HOME/.oh-my-zsh"
+
+# Set name of the theme to load --- if set to "random", it will
+# load a random theme each time oh-my-zsh is loaded, in which case,
+# to know which specific one was loaded, run: echo $RANDOM_THEME
+# See https://github.com/ohmyzsh/ohmyzsh/wiki/Themes
+
This ~/.zshrc
profile script is loaded every time you open up a new shell. What we want to do is add our export
command above to this script, so that it is running every time. That is the basic concept of a shell profile: providing a series of commands that is run every time a new shell (terminal window, tab) is opened.
For this tutorial, we use nano
to edit the file, but feel free to use your text editor of choice. For example, you can open the profile file using TextEdit
on Mac like this:
open -a TextEdit ~/.zshrc
+
We will proceed using nano
, but feel free to use any editor.
nano ~/.zshrc
+
Using terminal-based editors like nano or, even worse, vim, involves a bit of a learning curve. nano
is by far the least powerful, but also the simplest to use. If you typed the above command, you should see its contents on the terminal. The next step is to copy the following (remember, we already used it earlier)
export PATH=$PATH:~/tools
+
and paste it somewhere into the file. Usually, there is a specific section of the file that is concerned with setting up your path. Eventually, as you become more of an expert, you will start organising your profile according to your own preferences! Today we will just copy the command anywhere, for example:
+# If you come from bash you might have to change your $PATH.
+# export PATH=$HOME/bin:/usr/local/bin:$PATH
+export PATH=$PATH:~/tools
+# ..... other lines in the file
+
Note that the #
symbol denotes the beginning of a "comment" which is ignored by the shell/CLI. After you have pasted the above, use the following keyboard key-combinations to save and close the file:
control + O
+
This saves the file. Confirm with Enter.
+control + x
+
This closes the file. Now, we need to tell the shell we are currently in that it should reload the profile we have just edited. We do that using the source
command.
source ~/.zshrc
+
Great! You should be able to open a new tab in your terminal (with command+t on a Mac, for example) and run the following command:
+robot --version
+
This section will only give a sense of the kinds of things you can do with your shell profile - in the end you will have to jump into the cold water and build your skills up yourself. Let us start with a very powerful concept: aliases. Aliases are short names for commands that you use repeatedly but are annoyed to type out in full every time. For example, tired of typing out long paths all the time to jump between your Cell Ontology and Human Phenotype Ontology directories? Instead of:
+cd /Users/matentzn/ws/human-phenotype-ontology/src/ontology
+
wouldn't it be nice to be able to use, instead,
+cdhp
+
or, if you are continuously checking git status
, why not implement a alias gits
? Or activating your python environment (source ~/.pyenv/versions/oak/bin/activate
) with a nice env-oak
? To achieve this we do the following:
(1) Open your profile in a text editor of your choice, e.g.
+nano ~/.zshrc
+
add the following lines:
+alias cdt='cd ~/tools'
+alias hg='history | grep'
+
Save (control+o) and close (control+x) the profile. Reload the profile:
+source ~/.zshrc
+
(Alternatively, just open a new tab in your Terminal.) Now, let's try our new aliases:
+cdt
+
Will bring you straight to your tools
directory you created in the previous lesson above.
hg robot
+
Will search your terminal command history for every command you have executed involving robot
.
In the following, we provide a list of aliases we find super useful:
+alias cdt='cd ~/tools'
- add shortcuts to all directories you frequently visit!alias orcid='echo '\''https://orcid.org/0000-0002-7356-1779'\'' | tr -d '\''\n'\'' | pbcopy'
- if you keep having to look up your ORCID, your favourite ontologies PURL or the your own zoom room, why not add a shortcut that copies it straight into your clipboard?alias opent='open ~/tools'
- why not open your favourite directory in finder without faving to search the User Interface? You can use the same idea to open your favourite ontology from wherever you are, i.e. alias ohp='open ~/ws/human-phenotype-ontology/src/ontology/hp-edit.owl'
.alias env-linkml='source ~/.pyenv/versions/linkml/bin/activate'
- use simple shortcuts to active your python environments. This will become more important if you learn to master special python tools like OAK.alias update_repo='sh run.sh make update_repo'
- for users of ODK - alias all your long ODK commands!The most advanced thought we want to cover today is "functions". You can not only manage simple aliases, but you can actually add proper functions into your shell profile. Here is an example of one that I use:
+ols() {
+ open https://www.ebi.ac.uk/ols/search?q="$1"
+}
+
This is a simple function in my bash profile that I can use to search on OLS:
+ols "lung disorder"
+
It will open this search straight in my browser.
+rreport() {
+ robot report -i "$1" --fail-on none -o /Users/matentzn/tmp_data/report_"$(basename -- $1)".tsv
+}
+
This allows me to quickly run a robot report on an ontology.
+rreport cl.owl
+
Why not expand the function and have it open in my atom text editor right afterwards?
+rreport() {
+ robot report -i "$1" --fail-on none -o /Users/matentzn/tmp_data/report_"$(basename -- $1)".tsv && atom /Users/matentzn/tmp_data/report_"$(basename -- $1)".tsv
+}
+
The possibilities are endless. Some power-users have hundreds of such functions in their shell profiles, and they can do amazing things with them. Let us know about your own ideas for functions on the OBOOK issue tracker. Or, why not add a function to create a new, titled issue on OBOOK?
+obook-issue() {
+ open https://github.com/OBOAcademy/obook/issues/new?title="$1"
+}
+
and from now on run:
+obook-issue "Add my awesome function"
+
In this tutorial, we will learn to use a very basic lexical matching tool (OAK Lexmatch). The goal is not only to enable the learner to design their own matching pipelines, but also to think about how they fit into their mapping efforts. Note that this tutorial is not about how to do proper matching: the goal here is simply to introduce you to the general workflow. Proper ontology matching is a major discipline with many tools, preprocessing and tuning approaches and often intricate interplay between matching tools and human curators. Today, you will just get a sense of the general method.
+In this tutorial, you will learn how to match fruit juices in Wikidata with FOODON using a simple lexical matching tool (OAK). The idea is simple: We obtain the ontologies we like to match, ask OAK to generate the matches and then curate the results.
+Makefile
to prepare your input ontology with ROBOT.Setting up oak
is described in its documentation. Note that, aside from oak
itself, you also need relation-graph
, rdftab
and riot
installed, see https://incatools.github.io/ontology-access-kit/intro/tutorial07.html#without-docker.
+This tutorial requires OAK version 0.1.59 or higher.
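If you prefer a local (non-Docker) setup, a minimal starting point is to install oaklib from PyPI (a sketch; see the OAK documentation linked above for the additional dependencies such as relation-graph, rdftab and riot):

```sh
pip install "oaklib>=0.1.59"
runoak --help
```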
Note that if you are using the ODK docker image, oaklib
is already installed. In the following, we will use the ODK wrapper to ensure that everyone has a consistent experience. If you want to use the local (non-docker) setup, you have to follow the instructions above before continuing and ignore the sh odk.sh
part of the commands.
ODK 1.3.1, the version still active on the 8th December 2022, does not have the latest dependencies of OAK installed. +To follow the tutorial you have to use the ODK development snapshot.
+Install the ODK Development snapshot:
+docker pull obolibrary/odkfull:dev
+
After downloading https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/resources/odk.sh into your local working directory, open it with a text editor and change:
+docker ... obolibrary/odkfull ...
+
to
+docker ... obolibrary/odkfull:dev ...
+
First, we download the FOODON ontology. You can do this in whatever way you want, for example with wget
:
sh odk.sh wget http://purl.obolibrary.org/obo/foodon.owl -O foodon.owl
+
Next, we extract the subset of FOODON that is relevant to our task at hand: relevant terms about fruit juices. The right method of subset extraction will differ from task to task. For this tutorial, we are using ROBOT extract to obtain a MIREOT
module containing all the fruit juices. We do this by selecting everything between fruit juice food product
as the upper-term
and fruit juices (apple juice
, orange juice
and grapefruit juice
) as the lower-term
of the FOODON
subset.
sh odk.sh robot extract --method MIREOT --input foodon.owl --upper-term "FOODON:00001140" --lower-term "FOODON:00001277" --lower-term "FOODON:00001059" --lower-term "FOODON:03306174" --output fruit_juice_food_foodon.owl
+
If you open fruit_juice_food_foodon.owl
in Protege, you will see something similar to:
Next, we use OAK to extract juices and their labels from wikidata by selecting the descendants of juice
from wikidata
, store the result as a ttl
file and then convert it to OWL
using ROBOT
.
sh odk.sh runoak -i wikidata: descendants wikidata:Q8492 -p i,p -o juice_wd.ttl -O rdf
+sh odk.sh robot convert -i juice_wd.ttl -o juice_wd.owl
+
Note that you won't be able to see anything when opening juice_wd.owl in Protege, because it does not have any OWL types (class, individual assertions) attached to it. However, you can convince yourself all is well by opening juice_wd.owl
in a text editor, and see expressions such as:
<rdf:Description rdf:about="http://www.wikidata.org/entity/Q10374646">
+ <rdfs:label>cashew apple juice</rdfs:label>
+</rdf:Description>
+
The last preparation step is merging the two subsets (from FOODON and wikidata) into a single file using ROBOT
:
sh odk.sh robot merge -i fruit_juice_food_foodon.owl -i juice_wd.owl -o foodon_wd.owl
+
Now we are ready to create our first set of matches. First, let's run oak
's lexmatch
command to generate lexical matches between the contents of the merged file:
sh odk.sh runoak -i sqlite:foodon_wd.owl lexmatch -o foodon_wd_lexmatch.tsv
+
This will generate an SSSOM tsv file with the mapped contents as shown below:
+# curie_map:
+# FOODON: http://purl.obolibrary.org/obo/FOODON_
+# owl: http://www.w3.org/2002/07/owl#
+# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
+# rdfs: http://www.w3.org/2000/01/rdf-schema#
+# semapv: https://w3id.org/semapv/
+# skos: http://www.w3.org/2004/02/skos/core#
+# sssom: https://w3id.org/sssom/
+# wikidata: http://www.wikidata.org/entity/
+# license: https://w3id.org/sssom/license/unspecified
+# mapping_set_id: https://w3id.org/sssom/mappings/091390a2-6f64-436d-b2d1-309045ff150c
+
subject_id | +subject_label | +predicate_id | +object_id | +object_label | +mapping_justification | +mapping_tool | +confidence | +subject_match_field | +object_match_field | +match_string | +
---|---|---|---|---|---|---|---|---|---|---|
FOODON:00001059 | +apple juice | +skos:closeMatch | +wikidata:Q618355 | +apple juice | +semapv:LexicalMatching | +oaklib | +0.5 | +rdfs:label | +rdfs:label | +apple juice | +
FOODON:00001059 | +apple juice | +skos:closeMatch | +wikidata:Q618355 | +apple juice | +semapv:LexicalMatching | +oaklib | +0.5 | +oio:hasExactSynonym | +rdfs:label | +apple juice | +
FOODON:03301103 | +orange juice | +skos:closeMatch | +wikidata:Q219059 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.5 | +rdfs:label | +rdfs:label | +orange juice | +
FOODON:03306174 | +grapefruit juice | +skos:closeMatch | +wikidata:Q1138468 | +grapefruit juice | +semapv:LexicalMatching | +oaklib | +0.5 | +rdfs:label | +rdfs:label | +grapefruit juice | +
wikidata:Q15823640 | +cherry juice | +skos:closeMatch | +wikidata:Q62030277 | +cherry juice | +semapv:LexicalMatching | +oaklib | +0.5 | +rdfs:label | +rdfs:label | +cherry juice | +
wikidata:Q18201657 | +must | +skos:closeMatch | +wikidata:Q278818 | +must | +semapv:LexicalMatching | +oaklib | +0.5 | +rdfs:label | +rdfs:label | +must | +
This is great - we get a few mappings without much work. If you need some help interpreting this table, please refer to the SSSOM tutorials for details.
+Just eyeballing the labels in our ontology with OAK:
+sh odk.sh runoak -i sqlite:foodon_wd.owl terms | grep juice
+
We notice rows like:
+...
+FOODON:00001001 ! orange juice (liquid)
+...
+
It may be beneficial for us to pre-process the labels a bit before performing the matches, for example, by excluding comments in the labels provided in brackets (essentially removing (liquid)
).
To do this, we will define a few simple mapping rules in a file called matcher_rules.yaml
. OAK provides a standard for representing the matching rules. You can see an example here.
Here is an example file:
+rules:
+ - description: default
+ postconditions:
+ predicate_id: skos:closeMatch
+ weight: 0.0
+
+ - description: exact to exact
+ preconditions:
+ subject_match_field_one_of:
+ - oio:hasExactSynonym
+ - rdfs:label
+ - skos:prefLabel
+ object_match_field_one_of:
+ - oio:hasExactSynonym
+ - rdfs:label
+ - skos:prefLabel
+ postconditions:
+ predicate_id: skos:exactMatch
+ weight: 2.0
+
+ - preconditions:
+ subject_match_field_one_of:
+ - oio:hasExactSynonym
+ - rdfs:label
+ object_match_field_one_of:
+ - oio:hasBroadSynonym
+ postconditions:
+ predicate_id: skos:broadMatch
+ weight: 2.0
+
+ - synonymizer:
+ the_rule: Remove parentheses bound info from the label.
+ match: r'\([^)]*\)'
+ match_scope: "*"
+ replacement: ""
+
+ - synonymizer:
+ the_rule: Replace "'s" by "s" in the label.
+ match: r'\'s'
+ match_scope: "*"
+ replacement: "s"
+
As you can see, there are basically two kinds of rules: normal ones, and synonimizer
ones. The normal rules provide preconditions and postconditions. For example, the second rule says: if an exact synonym, preferred label or label of the subject matches an exact synonym, preferred label or label of the object, then assert a skos:exactMatch
. The synonimizer
rules are preprocessing rules which are applied to the labels and synonyms prior to matching. Let's now run the matcher again:
sh odk.sh runoak -i sqlite:foodon_wd.owl lexmatch -R matcher_rules.yaml -o foodon_wd_lexmatch_with_rules.tsv
+
This will generate an SSSOM tsv file with a few more matches than the previous output (the exact matches may differ from version to version):
+# curie_map:
+# FOODON: http://purl.obolibrary.org/obo/FOODON_
+# IAO: http://purl.obolibrary.org/obo/IAO_
+# owl: http://www.w3.org/2002/07/owl#
+# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
+# rdfs: http://www.w3.org/2000/01/rdf-schema#
+# semapv: https://w3id.org/semapv/
+# skos: http://www.w3.org/2004/02/skos/core#
+# sssom: https://w3id.org/sssom/
+# wikidata: http://www.wikidata.org/entity/
+# license: https://w3id.org/sssom/license/unspecified
+# mapping_set_id: https://w3id.org/sssom/mappings/6b9c727f-9fdc-4a78-bbda-a107b403e3a9
+
subject_id | +subject_label | +predicate_id | +object_id | +object_label | +mapping_justification | +mapping_tool | +confidence | +subject_match_field | +object_match_field | +match_string | +subject_preprocessing | +object_preprocessing | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
FOODON:00001001 | +orange juice (liquid) | +skos:exactMatch | +FOODON:00001277 | +orange juice (unpasteurized) | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | +semapv:RegularExpressionReplacement | +semapv:RegularExpressionReplacement | +
FOODON:00001001 | +orange juice (liquid) | +skos:exactMatch | +FOODON:03301103 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | +semapv:RegularExpressionReplacement | ++ |
FOODON:00001001 | +orange juice (liquid) | +skos:exactMatch | +wikidata:Q219059 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | +semapv:RegularExpressionReplacement | ++ |
FOODON:00001059 | +apple juice | +skos:exactMatch | +wikidata:Q618355 | +apple juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +apple juice | ++ | + |
FOODON:00001059 | +apple juice | +skos:exactMatch | +wikidata:Q618355 | +apple juice | +semapv:LexicalMatching | +oaklib | +0.8 | +oio:hasExactSynonym | +rdfs:label | +apple juice | ++ | + |
FOODON:00001277 | +orange juice (unpasteurized) | +skos:exactMatch | +FOODON:03301103 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | +semapv:RegularExpressionReplacement | ++ |
FOODON:00001277 | +orange juice (unpasteurized) | +skos:exactMatch | +wikidata:Q219059 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | +semapv:RegularExpressionReplacement | ++ |
FOODON:00002403 | +food material | +skos:exactMatch | +FOODON:03430109 | +food (liquid, low viscosity) | +semapv:LexicalMatching | +oaklib | +0.8 | +oio:hasExactSynonym | +rdfs:label | +food | +semapv:RegularExpressionReplacement | ++ |
FOODON:00002403 | +food material | +skos:exactMatch | +FOODON:03430130 | +food (liquid) | +semapv:LexicalMatching | +oaklib | +0.8 | +oio:hasExactSynonym | +rdfs:label | +food | +semapv:RegularExpressionReplacement | ++ |
FOODON:03301103 | +orange juice | +skos:exactMatch | +wikidata:Q219059 | +orange juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +orange juice | ++ | + |
FOODON:03306174 | +grapefruit juice | +skos:exactMatch | +wikidata:Q1138468 | +grapefruit juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +grapefruit juice | ++ | + |
FOODON:03430109 | +food (liquid, low viscosity) | +skos:exactMatch | +FOODON:03430130 | +food (liquid) | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +food | +semapv:RegularExpressionReplacement | +semapv:RegularExpressionReplacement | +
wikidata:Q15823640 | +cherry juice | +skos:exactMatch | +wikidata:Q62030277 | +cherry juice | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +cherry juice | ++ | + |
wikidata:Q18201657 | +must | +skos:exactMatch | +wikidata:Q278818 | +must | +semapv:LexicalMatching | +oaklib | +0.8497788951776651 | +rdfs:label | +rdfs:label | +must | ++ | + |
As we have described in detail in our introduction to Semantic Matching, it is important to remember that matching in its raw form should not be understood to result in semantic mappings: they are better understood as mapping candidates. Therefore, it is always good to plan for a review of false positives and false negatives:
+orange juice [wikidata:Q219059]
and orange juice (unpasteurized) [FOODON:00001277]
may not be considered as the same thing in the sense of skos:exactMatch
. For a more detailed introduction into manual mapping curation with SSSOM we recommend following this tutorial: https://mapping-commons.github.io/sssom/tutorial/.
These are the kinds of things that I do +when I need to work with a new dataset. +My goal is to have data that makes good sense +and that I can integrate with other data +using standard technologies: +Linked Data.
+The boss just sent me this new table to figure out:
+datetime | +investigator | +subject | +species | +strain | +sex | +group | +protocol | +organ | +disease | +qualifier | +comment | +
---|---|---|---|---|---|---|---|---|---|---|---|
1/1/14 10:21 AM | +JAO | +12 | +RAT | +F 344/N | +FEMALE | +1 | +HISTOPATHOLOGY | +LUNG | +ADENOCARCINOMA | +SEVERE | ++ |
1/1/14 10:30 AM | +JO | +31 | +MOUSE | +B6C3F1 | +MALE | +2 | +HISTOPATHOLOGY | +NOSE | +INFLAMMATION | +MILD | ++ |
1/1/14 10:45 AM | +JAO | +45 | +RAT | +F 344/N | +MALE | +1 | +HISTOPATHOLOGY | +ADRENAL CORTEX | +NECROSIS | +MODERATE | ++ |
It doesn't seem too bad, +but there's lots of stuff that I don't quite understand. +Where to start?
+Before I do anything else, +I'm going to set up a new project for working with this data. +Maybe I'll change my mind later +and want to merge the new project with an existing project, +but it never hurts to start from a nice clean state.
+I'll make a new directory in a sensible place
+with a sensible name.
+In my case I have a ~/Repositories/
directory,
+with subdirectories for GitHub and various GitLab servers,
+a local
directory for projects I don't plan to share,
+and a temp
directory for projects that I don't need to keep.
+I'm not sure if I'm going to share this work,
+so it can go in a new subdirectory of local
.
+I'll call it "linking-data-tutorial" for now.
Then I'll run git init
to turn that directory into a git repository.
+For now I'm just going to work locally,
+but later I can make a repository on GitHub
+and push my local repository there.
Next I'll create a README.md
file
+where I'll keep notes for myself to read later.
+My preferred editor is Kakoune.
So I'll open a terminal and run these commands:
+$ cd ~/Repositories/local/
+$ mkdir linking-data-tutorial
+$ cd linking-data-tutorial
+$ git init
+$ kak README.md
+
In the README I'll start writing something like this:
+# Linking Data Tutorial
+
+An example of how to convert a dataset to Linked Data.
+
+The source data is available from
+<https://github.com/jamesaoverton/obook/tree/master/03-RDF/data.csv>
+
Maybe this information should go somewhere else eventually, +but the README is a good place to start.
+"Commit early, commit often" they say, so:
+$ git add README.md
+$ git commit -m "Initial commit"
+
Data has an annoying tendency to get changed. +You don't want it changing out from under you +while you're in the middle of something. +So the next thing to do is get a copy of the data +and store it locally. +If it's big, you can store a compressed copy. +If it's too big to fit on your local machine, +well keep the best notes you can of how to get to the data, +and what operations you're doing on it.
+I'm going to make a cache
directory
+and store all my "upstream" data there.
+I'm going to fetch the data and that's it --
+I'm not going to edit these files.
+When I want to change the data I'll make copies in another directory.
+I don't want git to track the cached data,
+so I'll add /cache/
to .gitignore
+and tell git to track that.
+Then I'll use curl
to download the file.
$ mkdir cache
+$ echo "/cache/" >> .gitignore
+$ git add .gitignore
+$ git commit -m "Ignore /cache/ directory"
+$ cd cache
+$ curl -LO "https://github.com/jamesaoverton/obook/raw/master/03-RDF/data.csv"
+$ ls
+data.csv
+$ cd ..
+$ ls -a
+.gitignore data README.md
+
The first thing to do is look at the data. +In this case I have just one table in CSV format, +so I can use any number of tools to open the file and look around. +I bet the majority of people would reach for Excel. +My (idiosyncratic) preference is VisiData.
+What am I looking for? +A bunch of different things:
+In my README file I'll make a list of the columns +like this:
+- datetime
+- investigator
+- subject
+- species
+- strain
+- sex
+- group
+- protocol
+- organ
+- disease
+- qualifier
+- comment
+
Then I'll make some notes for myself:
+- datetime: American-style dates, D/M/Y or M/D/Y?
+- investigator: initials, ORCID?
+- subject: integer ID
+- species: common name for species, NCBITaxon?
+- strain: some sort of code with letters, numbers, spaces, some punctuation
+- sex: string female/male
+- group: integer ID
+- protocol: string, OBI?
+- organ: string, UBERON?
+- disease: string, DO/MONDO?
+- qualifier: string, PATO?
+- comment: ???
+
You can see that I'm trying to figure out what's in each column. +I'm also thinking ahead to OBO ontologies that I know of +that may have terms that I can use for each column.
+In the end, I want to have nice, clean Linked Data. +But I don't have to get there in one giant leap. +Instead I'll take a bunch of small, incremental steps.
+There's lots of tools I can use, +but this time I'll use SQLite.
+First I'll set up some more directories.
+I'll create a build
directory
+where I'll store temporary files.
+I don't want git to track this directory,
+so I'll add it to .gitignore
.
$ mkdir build/
+$ echo "/build/" >> .gitignore
+$ git add .gitignore
+$ git commit -m "Ignore /build/ directory"
+
I'll also add a src
directory to store code.
+I do want to track src
with git.
$ mkdir src
+$ kak src/data.sql
+
In src/data.sql
I'll add just enough to import build/data.csv
:
-- import build/data.csv
+.mode csv
+.import build/data.csv data_csv
+
This will create a build/data.db
file
+and import build/data.csv
into a data_csv
table.
+Does it work?
$ sqlite3 build/data.db < src/data.sql
+$ sqlite3 build/data.db <<< "SELECT * FROM data_csv LIMIT 1;"
+2014-01-01 10:21:00-0500|JAO|12|RAT|F 344/N|FEMALE|1|HISTOPATHOLOGY|LUNG|ADENOCARCINOMA|SEVERE|
+
Nice!
+Note that I didn't even specify a schema for data_csv
.
+It uses the first row as the column names,
+and the type of every column is TEXT
.
+Here's the schema I end up with:
$ sqlite3 build/data.db <<< ".schema data_csv"
+CREATE TABLE data_csv(
+ "datetime" TEXT,
+ "investigator" TEXT,
+ "subject" TEXT,
+ "species" TEXT,
+ "strain" TEXT,
+ "sex" TEXT,
+ "group" TEXT,
+ "protocol" TEXT,
+ "organ" TEXT,
+ "disease" TEXT,
+ "qualifier" TEXT,
+ "comment" TEXT
+);
+
I'm going to want to update src/data.sql
+then rebuild the database over and over.
+It's small, so this will only take a second.
+If it was big,
+then I would copy a subset into build/data.csv
for now
+so that the script still runs in a second or two
+and I can iterate quickly.
+I'll write a src/build.sh
script to make life a little easier:
#!/bin/sh
+
+rm -f build/*
+cp cache/data.csv build/data.csv
+sqlite3 build/data.db < src/data.sql
+
Does it work?
+$ sh src/build.sh
+
Nice! +Time to update the README:
+## Requirements
+
+- [SQLite3](https://sqlite.org/index.html)
+
+## Usage
+
+Run `sh src/build.sh`
+
I'll commit my work in progress:
+$ git add src/data.sql src/build.sh
+$ git add --update
+$ git commit -m "Load data.csv into SQLite"
+
Now I have a script
+that executes a SQL file
+that loads the source data into a new database.
+I'll modify the src/data.sql
file
+in a series of small steps
+until it has the structure that I want.
In the real world, data is always a mess. +It takes real work to clean it up. +And really, it's almost never perfectly clean.
+It's important to recognize that cleaning data has diminishing returns. +There's low hanging fruit: +easy to clean, often with code, and bringing big benefits. +Then there's tough stuff +that requires an expert to work through the details, +row by row.
+The first thing to do is figure out the schema you want.
+I'll create a new data
table
+and start with the default schema from data_csv
.
+Notice that in the default schema all the column names are quoted.
+That's kind of annoying.
+But when I remove the quotation marks
+I realize that one of the column names is "datetime",
+but datetime
is a keyword in SQLite!
+You can't use it as a column name without quoting.
+I'll rename it to "assay_datetime".
+I have the same problem with "group".
+I'll rename "group" to "group_id"
+and "subject" to "subject_id".
+The rest of the column names seem fine.
I want "assay_datetime" to be in standard ISO datetime format, +but SQLite stores these as TEXT. +The "subject" and "group" columns are currently integers, +but I plan to make them into URIs to CURIEs. +So everything will still be TEXT.
+CREATE TABLE data(
+ assay_datetime TEXT,
+ investigator TEXT,
+ subject_id TEXT,
+ species TEXT,
+ strain TEXT,
+ sex TEXT,
+ group_id TEXT,
+ protocol TEXT,
+ organ TEXT,
+ disease TEXT,
+ qualifier TEXT,
+ comment TEXT
+);
+
The dates currently look like "1/1/14 10:21 AM".
+Say I know that they were done on Eastern Standard Time.
+How do I convert to ISO dates like "2014-01-01 10:21:00-0500"?
+Well SQLite isn't the right tool for this.
+The Unix date
command does a nice job, though:
$ date -d "1/1/14 10:21 AM EST" +"%Y-%m-%d %H:%M:%S%z"
+2014-01-01 10:21:00-0500
+
I can run that over each line of the file using awk
.
+So I update the src/build.sh
+to rework the build/data.csv
before I import:
#!/bin/sh
+
+rm -f build/*
+
+head -n1 cache/data.csv > build/data.csv
+tail -n+2 cache/data.csv \
+| awk 'BEGIN{FS=","; OFS=","} {
+ "date -d \""$1" EST\" +\"%Y-%m-%d %H:%M:%S%z\"" | getline $1;
+ print $0
+}' \
+>> build/data.csv
+
+sqlite3 build/data.db < src/data.sql
+
One more problem I could clean up
+is that "JO" should really be "JAO" --
+that's just a typo,
+and they should both refer to James A. Overton.
+I could make that change in src/build.sh
,
+but I'll do it in src/data.sql
instead.
+I'll write a query to copy all the rows of data_csv
into data
+and then I'll update data
with some fixes.
-- copy from data_csv to data
+INSERT INTO data SELECT * FROM data_csv;
+
+-- clean data
+UPDATE data SET investigator="JAO" WHERE investigator="JO";
+
Honestly, it took me quite a while to write that awk
command.
+It's a very powerful tool,
+but I don't use it enough to remember how it works.
+You might prefer to write yourself a Python script, or some R code.
+You could use that instead of this SQL UPDATE as well.
+I just wanted to show you two of the thousands of ways to do this.
+If there's a lot of replacements like "JO",
+then you might also consider listing them in another table
+that you can read into your script.
The important part is to automate your cleaning!
+Why didn't I just edit cache/data.csv
in Excel?
+In step 2 I saved a copy of the data
+because I didn't want it to change while I was working on it,
+but I do expect it to change!
+By automating the cleaning process,
+I should be able to just update cache/data.csv
+run everything again,
+and the fixes will be applied again.
+I don't want to do all this work manually
+every time the upstream data is updated.
I'll commit my work in progress:
+$ git add --update
+$ git commit -m "Start cleaning data"
+
Cleaning can take a lot of work. +This is example table is pretty clean already. +The next hard part is sorting out your terminology.
+It's pretty easy to convert a table structure to triples. +The hard part is converting the table contents. +There are some identifiers in the table that would be better as URLs, +and there's a bunch of terminology that would be better +if it was linked to an ontology or other system.
+I'll start with the identifiers that are local to this data: +subject_id and group_id. +I can convert them to URLs by defining a prefix +and then just using that prefix. +I'll use string concatenation to update the table:
-- update subject and group IDs
+UPDATE data SET subject_id='ex:subject-' || subject_id;
+UPDATE data SET group_id='ex:group-' || group_id;
+
Now I'll check my work:
+$ sqlite3 build/data.db <<< "SELECT * FROM data LIMIT 1;"
+2014-01-01 10:21:00-0500|JAO|ex:subject-12|RAT|F 344/N|FEMALE|ex:group-1|HISTOPATHOLOGY|LUNG|ADENOCARCINOMA|SEVERE|
+
I should take a moment to tell you, +that while I was writing the Turtle conversion code later in this essay, +I had to come back here and change these identifiers. +The thing is that Turtle is often more strict than I expect +about identifier syntax. +Turtle identifiers look like +CURIEs, +but they're actually +QNames. +CURIEs are pretty much just just URLs shortened with a prefix, +so almost anything goes. +QNames come from XML, +and Turtle identifiers have to be valid XML element names.
+I always remember that I need to stick to alphanumeric characters,
+and that I have to replace whitespace and punctuation with a -
or _
.
+I didn't remember that the local part (aka "suffix", aka "NCName")
+can't start with a digit.
+So I tried to use "subject:12" and "group:1" as my identifiers.
+That worked fine until I generated Turtle.
+The Turtle looked fine,
+so it took me quite a while to figure out why
+it looked very wrong when I converted it into RDXML format.
This kind of thing happens to me all the time. +I'm almost always using a mixture of technologies +based on different sets of assumptions, +and there are always things that don't line up. +That's why I like to work in small iterations, +checking my work as I go +(preferrably with automated tests), +and keeping everything in version control. +When I need to make a change like this one, +I just circle back and iterate again.
+The next thing is to tackle the terminology.
+First I'll just make a list of the terms I'm using
+from the relevant columns in build/term.tsv
:
```sh #collect +$ sqlite3 build/data.db << EOF > build/term.tsv +SELECT investigator FROM data +UNION SELECT species FROM data +UNION SELECT strain FROM data +UNION SELECT strain FROM data +UNION SELECT sex FROM data +UNION SELECT protocol FROM data +UNION SELECT organ FROM data +UNION SELECT disease FROM data +UNION SELECT qualifier FROM data; +EOF +
It's a lot of work to go through all those terms
+and find good ontology terms.
+I'm going to do that hard work for you
+(just this once!)
+so we can keep moving.
+I'll add this table to `src/term.tsv`
+
+| id | code | label |
+| ------------------------- | -------------- | ------------------ |
+| obo:NCBITaxon_10116 | RAT | Rattus norvegicus |
+| obo:NCBITaxon_10090 | MOUSE | Mus musculus |
+| ex:F344N | F 344/N | F 344/N |
+| ex:B6C3F1 | B6C3F1 | B6C3F1 |
+| obo:PATO_0000383 | FEMALE | female |
+| obo:PATO_0000384 | MALE | male |
+| obo:OBI_0600020 | HISTOPATHOLOGY | histology |
+| obo:UBERON_0002048 | LUNG | lung |
+| obo:UBERON_0007827 | NOSE | external nose |
+| obo:UBERON_0001235 | ADRENAL CORTEX | adrenal cortex |
+| obo:MPATH_268 | ADENOCARCINOMA | adenocarcinoma |
+| obo:MPATH_212 | INFLAMMATION | inflammation |
+| obo:MPATH_4 | NECROSIS | necrosis |
+| obo:PATO_0000396 | SEVERE | severe intensity |
+| obo:PATO_0000394 | MILD | mild intensity |
+| obo:PATO_0000395 | MODERATE | moderate intensity |
+| orcid:0000-0001-5139-5557 | JAO | James A. Overton |
+
+And I'll add these prefixes to `src/prefix.tsv`:
+
+| prefix | base |
+| ------- | ------------------------------------------- |
+| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
+| rdfs | http://www.w3.org/2000/01/rdf-schema# |
+| xsd | http://www.w3.org/2001/XMLSchema# |
+| owl | http://www.w3.org/2002/07/owl# |
+| obo | http://purl.obolibrary.org/obo/ |
+| orcid | http://orcid.org/ |
+| ex | https://example.com/ |
+| subject | https://example.com/subject/ |
+| group | https://example.com/group/ |
+
+Now I can import these tables into SQL
+and use the term table as a FOREIGN KEY constraint
+on data:
+
+```sql
+.mode tabs
+
+CREATE TABLE prefix (
+ prefix TEXT PRIMARY KEY,
+ base TEXT UNIQUE
+);
+.import --skip 1 src/prefix.tsv prefix
+
+CREATE TABLE term (
+ id TEXT PRIMARY KEY,
+ code TEXT UNIQUE,
+ label TEXT UNIQUE
+);
+.import --skip 1 src/term.tsv term
+
+CREATE TABLE data(
+ assay_datetime TEXT,
+ investigator TEXT,
+ subject_id TEXT,
+ species TEXT,
+ strain TEXT,
+ sex TEXT,
+ group_id TEXT,
+ protocol TEXT,
+ organ TEXT,
+ disease TEXT,
+ qualifier TEXT,
+ comment TEXT,
+ FOREIGN KEY(investigator) REFERENCES term(code),
+ FOREIGN KEY(species) REFERENCES term(code),
+ FOREIGN KEY(strain) REFERENCES term(code),
+ FOREIGN KEY(sex) REFERENCES term(code),
+ FOREIGN KEY(protocol) REFERENCES term(code),
+ FOREIGN KEY(organ) REFERENCES term(code),
+ FOREIGN KEY(disease) REFERENCES term(code),
+ FOREIGN KEY(qualifier) REFERENCES term(code)
+);
+
+-- copy from data_csv to data
+INSERT INTO data SELECT * FROM data_csv;
+
+-- clean data
+UPDATE data SET investigator='JAO' WHERE investigator='JO';
+
+-- update subject and group IDs
+UPDATE data SET subject_id='ex:subject-' || subject_id;
+UPDATE data SET group_id='ex:group-' || group_id;
+
I'll update the README:
+See `src/` for:
+
+- `prefix.tsv`: shared prefixes
+- `term.tsv`: terminology
+
I'll commit my work in progress:
+$ git add src/prefix.tsv src/term.tsv
+$ git add --update
+$ git commit -m "Add and apply prefix and term tables"
+
Now all the terms are linked to controlled vocabularies +of one sort or another. +If I want to see the IDs for those links instead of the "codes" +I can define a VIEW:
+CREATE VIEW linked_data_id AS
+SELECT assay_datetime,
+ investigator_term.id AS investigator,
+ subject_id,
+ species_term.id AS species,
+ strain_term.id AS strain,
+ sex_term.id AS sex,
+ group_id,
+ protocol_term.id AS protocol,
+ organ_term.id AS organ,
+ disease_term.id AS disease,
+ qualifier_term.id AS qualifier
+FROM data
+JOIN term as investigator_term ON data.investigator = investigator_term.code
+JOIN term as species_term ON data.species = species_term.code
+JOIN term as strain_term ON data.strain = strain_term.code
+JOIN term as sex_term ON data.sex = sex_term.code
+JOIN term as protocol_term ON data.protocol = protocol_term.code
+JOIN term as organ_term ON data.organ = organ_term.code
+JOIN term as disease_term ON data.disease = disease_term.code
+JOIN term as qualifier_term ON data.qualifier = qualifier_term.code;
+
I'll check:
+$ sqlite3 build/data.db <<< "SELECT * FROM linked_data_id LIMIT 1;"
+2014-01-01 10:21:00-0500|orcid:0000-0001-5139-5557|ex:subject-12|obo:NCBITaxon_10116|ex:F344N|obo:PATO_0000383|ex:group-1|obo:OBI_0600020|obo:UBERON_0002048|obo:MPATH_268|obo:PATO_0000396
+
I can also define a similar view for their "official" labels:
+CREATE VIEW linked_data_label AS
+SELECT assay_datetime,
+ investigator_term.label AS investigator,
+ subject_id,
+ species_term.label AS species,
+ strain_term.label AS strain,
+ sex_term.label AS sex,
+ group_id,
+ protocol_term.label AS protocol,
+ organ_term.label AS organ,
+ disease_term.label AS disease,
+ qualifier_term.label AS qualifier
+FROM data
+JOIN term as investigator_term ON data.investigator = investigator_term.code
+JOIN term as species_term ON data.species = species_term.code
+JOIN term as strain_term ON data.strain = strain_term.code
+JOIN term as sex_term ON data.sex = sex_term.code
+JOIN term as protocol_term ON data.protocol = protocol_term.code
+JOIN term as organ_term ON data.organ = organ_term.code
+JOIN term as disease_term ON data.disease = disease_term.code
+JOIN term as qualifier_term ON data.qualifier = qualifier_term.code;
+
I'll check:
+$ sqlite3 build/data.db <<< "SELECT * FROM linked_data_label LIMIT 1;"
+2014-01-01 10:21:00-0500|James A. Overton|ex:subject-12|Rattus norvegicus|F 344/N|female|ex:group-1|histology|lung|adenocarcinoma|severe intensity
+
I'll commit my work in progress:
+$ git add --update
+$ git commit -m "Add linked_data tables"
+
Now the tables use URLs and are connected to ontologies and such. But are we Linked yet?
SQL tables aren't an official Linked Data format. Of all the RDF formats, I prefer Turtle. It's tedious but not difficult to get Turtle out of SQL. These queries do what I need them to do, but note that if the literal data contained quotation marks (for instance) then I'd have to do more work to escape those. First I create a triple table:
+CREATE TABLE triple (
+ subject TEXT,
+ predicate TEXT,
+ object TEXT,
+ literal INTEGER -- 0 for object IRI, 1 for object literal
+);
+
+-- create triples from term table
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT id, 'rdfs:label', label, 1
+FROM term;
+
+-- create triples from data table
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-assay_datetime', assay_datetime, 1
+FROM data;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-investigator', term.id, 0
+FROM data
+JOIN term AS term ON data.investigator = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-subject_id', subject_id, 0
+FROM data;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-species', term.id, 0
+FROM data
+JOIN term AS term ON data.species = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-strain', term.id, 0
+FROM data
+JOIN term AS term ON data.strain = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-sex', term.id, 0
+FROM data
+JOIN term AS term ON data.sex = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-group_id', group_id, 0
+FROM data;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-protocol', term.id, 0
+FROM data
+JOIN term AS term ON data.protocol = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-organ', term.id, 0
+FROM data
+JOIN term AS term ON data.organ = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-disease', term.id, 0
+FROM data
+JOIN term AS term ON data.disease = term.code;
+
+INSERT INTO triple(subject, predicate, object, literal)
+SELECT 'ex:assay-' || data.rowid, 'ex:column-qualifier', term.id, 0
+FROM data
+JOIN term AS term ON data.qualifier = term.code;
+
Then I can turn triples into Turtle using string concatenation:
+SELECT '@prefix ' || prefix || ': <' || base || '> .'
+FROM prefix
+UNION ALL
+SELECT ''
+UNION ALL
+SELECT subject || ' ' ||
+ predicate || ' ' ||
+ CASE literal
+ WHEN 1 THEN '"' || object || '"'
+ ELSE object
+ END
+ || ' . '
+FROM triple;
+
I can add this to `src/build.sh`:
sqlite3 build/data.db < src/turtle.sql > build/data.ttl
+
Here's just a bit of that `build/data.ttl` file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
+
+orcid:0000-0001-5139-5557 rdfs:label "James A. Overton" .
+ex:assay-1 ex:column-assay_datetime "2014-01-01 10:21:00-0500" .
+ex:assay-1 ex:column-investigator orcid:0000-0001-5139-5557 .
+
SQL is not a particularly expressive language. Building the triple table is straightforward but verbose. I could have done the same thing with much less Python code. (Or I could have been clever and generated some SQL to execute!)
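For example (just a sketch of that "generate some SQL" idea, not something the build above actually does), a small shell loop could write out the near-identical INSERT statements for every ontology-linked column:

```sh
# generate one INSERT-with-JOIN per ontology-linked column
for col in investigator species strain sex protocol organ disease qualifier; do
  cat <<EOF
INSERT INTO triple(subject, predicate, object, literal)
SELECT 'ex:assay-' || data.rowid, 'ex:column-$col', term.id, 0
FROM data
JOIN term ON data.$col = term.code;
EOF
done > build/triple_inserts.sql
```

Feeding that generated file to sqlite3 would produce the same triples as the hand-written statements.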
+I'll commit my work in progress:
+$ git add src/turtle.sql
+$ git add --update
+$ git commit -m "Convert to Turtle"
+
So technically I have a Turtle file. Linked Data! Right? Well, it's kind of "flat". It still looks more like a table than a graph.
The table I started with is very much focused on the data: there was some sort of assay done, and this is the information that someone recorded about it. The Turtle I just ended up with is basically the same.
Other people may have assay data. They may have tables that they converted into Turtle. So can I just merge them? Technically yes: I can put all these triples in one graph together. But I'll still just have "flat" chunks of data representing rows sitting next to other rows, without really linking together.
The next thing I would do with this data is reorganize it based on the things it's talking about. I know that:
Most of these are things that I could point to in the world, or could have pointed to if I was in the right place at the right time.
By thinking about these things, I'm stepping beyond what it was convenient for someone to record, and thinking about what happened in the world. If somebody else has some assay data, then they might have recorded it differently for whatever reason, and so it wouldn't line up with my rows. I'm trying my best to use the same terms for the same things. I also want to use the same "shapes" for the same things. When trying to come to an agreement about what is connected to what, life is easier if I can point to the things I want to talk about: "See, here is the person, and the mouse came from here, and he did this and this."
I could model the data in SQL by breaking the big table into smaller tables. I could have tables for the investigators, the subjects, the groups, the assays, and so on.
Then I would convert each table to triples more carefully. That's a good idea. Actually it's a better idea than what I'm about to do...
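Just to sketch what that could look like (an illustration only, not part of this build), one of those smaller tables might be pulled out of the wide table like this:

```sh
sqlite3 build/data.db <<'EOF'
-- one row per subject, extracted from the wide assay table
CREATE TABLE subject AS
SELECT DISTINCT subject_id AS id, species, strain, sex, group_id
FROM data;
EOF
```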
Since we're getting near the end, I'm going to show you how you can do that modelling in SPARQL. SPARQL has a CONSTRUCT operation that you use to build triples. There are lots of tools that I could use to run SPARQL, but I'll use ROBOT. I'll start with the "flat" triples in `build/data.ttl`, select them with my WHERE clause, then CONSTRUCT better triples, and save them in `build/model.ttl`.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
+PREFIX owl: <http://www.w3.org/2002/07/owl#>
+PREFIX obo: <http://purl.obolibrary.org/obo/>
+PREFIX ex: <https://example.com/>
+
+CONSTRUCT {
+ ?group
+ rdfs:label ?group_label .
+ ?subject
+ rdf:type ?species ;
+ rdfs:label ?subject_label ;
+ ex:strain ?strain ;
+ obo:RO_0000086 ?sex ; # has quality
+ ex:group ?group .
+ ?sex
+ rdf:type ?sex_type ;
+ rdfs:label ?sex_label .
+ ?organ
+ rdf:type ?organ_type ;
+ rdfs:label ?organ_label ;
+ obo:BFO_0000050 ?subject . # part of
+ ?assay
+ rdf:type ?assay_type ;
+ rdfs:label ?assay_label ;
+ obo:OBI_0000293 ?subject ; # has specified input
+ obo:IAO_0000136 ?organ . # is about
+}
+WHERE {
+ ?subject_row
+ ex:column-assay_datetime ?datetime ;
+ ex:column-investigator ?investigator ;
+ ex:column-subject_id ?subject ;
+    ex:column-species ?species ;
+    ex:column-strain ?strain ;
+ ex:column-sex ?sex_type ;
+ ex:column-group_id ?group ;
+ ex:column-protocol ?assay_type ;
+ ex:column-organ ?organ_type ;
+ ex:column-disease ?disease ;
+ ex:column-qualifier ?qualifier .
+
+ ?assay_type
+ rdfs:label ?assay_type_label .
+ ?sex_type
+ rdfs:label ?sex_type_label .
+ ?organ_type
+ rdfs:label ?organ_type_label .
+
+ BIND (URI(CONCAT(STR(?subject), "-assay")) AS ?assay)
+ BIND (URI(CONCAT(STR(?subject), "-sex")) AS ?sex)
+ BIND (URI(CONCAT(STR(?subject), "-organ")) AS ?organ)
+ BIND (CONCAT("subject ", REPLACE(STR(?subject), "^.*-", "")) AS ?subject_label)
+ BIND (CONCAT("group ", REPLACE(STR(?group), "^.*-", "")) AS ?group_label)
+ BIND (CONCAT(?subject_label, " ", ?assay_type_label) AS ?assay_label)
+ BIND (CONCAT(?subject_label, " sex: ", ?sex_type_label) AS ?sex_label)
+ BIND (CONCAT(?subject_label, " ", ?organ_type_label) AS ?organ_label)
+}
+
I can add this to `src/build.sh`:
java -jar robot.jar query \
+ --input build/data.ttl \
+ --query src/model.rq build/model.ttl
+
Then I get a `build/model.ttl` that looks (in part) like this:
ex:subject-31 a obo:NCBITaxon_10090 ;
+ rdfs:label "subject 31" ;
+ obo:RO_0000086 ex:subject-31-sex ;
+ ex:group ex:group-2 .
+
+ex:group-2 rdfs:label "group 2" .
+
Now that's what I call Linked Data!
+I'll update the README:
+## Modelling
+
+The data refers to:
+
+- investigator
+- subject
+- group
+- assay
+- measurement data
+ - subject organ
+ - disease
+
+TODO: A pretty diagram.
+
I'll commit my work in progress:
+$ git add src/model.rq
+$ git add --update
+$ git commit -m "Build model.ttl"
+
That was a lot of work for a small table. And I did all the hard work of mapping the terminology to ontology terms for you!
There's lots more I can do. The SPARQL is just one big chunk, but it would be better in smaller pieces. The modelling isn't all that great yet. Before changing that I want to run it past the boss and see what she thinks.
It's getting close to the end of the day. Before I quit I should update the README, clean up anything that's no longer relevant or correct, and make any necessary notes to my future self:
+$ git add --update
+$ git commit -m "Update README"
+$ quit
+
In this tutorial, we discuss the general workflow of managing dynamic imports, i.e. importing terms from other ontologies which can be kept up to date.
+Follow instructions for the PATO dynamic import process here.
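For orientation, with an ODK-managed repository the refresh itself is usually a one-liner (this sketch assumes PATO is configured as one of your import products):

```sh
cd src/ontology
# re-extract the PATO module against the latest upstream release
sh run.sh make refresh-pato
# or refresh every configured import at once
sh run.sh make refresh-imports
```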
This tutorial is not about editing ontologies and managing the evolution of their content (aka ontology curation), but the general process of managing an ontology project overall. In this lesson, we will cover the following:
+It is important to understand that the following is just one good way of doing project management for OBO ontologies, and most projects will do it slightly differently. We do however believe that thinking about your project management process and the roles involved will benefit your work in the long term, and hope that the following will help you as a starting point.
+For an effective management of an ontology, the following criteria are recommended:
+Without the above minimum criteria, the following recommendations will be very hard to implement.
+We make use of three tools in the following recommendation:
Project boards: Project boards, sometimes referred to as Kanban boards, GitHub boards or agile boards, are a great way to organise outstanding tickets and help maintain a clear overview of what work needs to be done. They are usually realised with either GitHub projects or ZenHub. If you have not worked with project boards before, we highly recommend watching a quick tutorial on Youtube, such as:
GitHub teams. GitHub teams, alongside organisations, are a powerful tool for organising collaborative workflows on GitHub. They allow you to communicate and organise permissions for editing your ontology in a transparent way. You can get a sense of GitHub teams by watching one of the numerous tutorials on GitHub, such as:
Markdown-based documentation system. Writing great documentation is imperative for a sustainable project. Across many of our recent projects, we are using mkdocs, which we have also integrated with the Ontology Development Kit, but there are others to consider. We strongly recommend completing a very short introduction to Markdown, such as this tutorial on YouTube.
+Every ontology or group of related ontologies (sometimes it is easier to manage multiple ontologies at once, because their scope or technical workflows are quite uniform or they are heavily interrelated) should have:
- One or more project boards (for example, one for the Curation Team and one for the Technical Team) with the following columns: `To Do` (issues that are important but not urgent), `Priority` (issues that are important and urgent), `In Progress` (issues that are being worked on) and `Under review` (issues that need review). From years of experience with project boards, we recommend against the common practice of keeping a `Backlog` column (issues that are neither important nor urgent nor likely to be addressed in the next 6 months) or a `Done` column (to keep track of closed issues) - they just clutter the view.
- A documentation system (such as `mkdocs` in OBO projects) with a page listing the members of the team (example). This page should provide links to all related team pages from GitHub and their project boards, as well as a table listing all current team members with the following information:
- The Curation Team adds tickets to the `To Do` and `Priority` columns of the Technical Team. The latter is important: it is the job of the curation team to prioritise the technical issues. The Technical Team can add tickets to the `To Do` and `Priority` columns, but this usually happens only in response to a request from the Curation Team.
- The Technical Team works on `Priority` tickets and is responsible for moving them from `Priority` to the `In Progress` and later to the `Done` section.
- `To Do` issues should first be moved to the `Priority` section before being addressed. This prevents focusing on easy-to-solve tickets in favour of important ones.
- `Backlog` items are not added to the board at all - if they ever become important, they tend to resurface all by themselves.
- The `main` (formerly `master`) branch should be write-protected with suitable rules - for example, requiring QC to pass and 1 approving review as a minimum.
+++Note: As of the writing of this tutorial, (Feb 2023), the graph is just starting to move from its initial construction phrase into real use, and so there are still bugs to find. Some of which show up in this tutorial.
+
https://github.com/monarch-initiative/monarch-neo4j
+Put the Neo4j dump file (`monarch-kg.neo4j.dump`) in the `dumps` directory, then copy `dot_env_template` to `.env` and edit the values to look like:
+# This Environment Variable file is referenced by the docker-compose.yaml build
+
+# Set this variable to '1' to trigger an initial loading of a Neo4j dump
+DO_LOAD=1
+
+# Name of Neo4j dump file to load, assumed to be accessed from within
+# the 'dumps' internal Volume path within the Docker container
+NEO4J_DUMP_FILENAME=monarch-kg.neo4j.dump
+
That should mean uncommenting `DO_LOAD` and `NEO4J_DUMP_FILENAME`.
#### Download plugins

* Download the [APOC plugin jar file](https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.4.0.13/apoc-4.4.0.13-all.jar) and put it in the `plugins` directory
* Download the [GDS plugin](https://graphdatascience.ninja/neo4j-graph-data-science-2.3.0.zip), unzip the download and copy the jar file to the `plugins` directory

#### Environment setup

In addition to the changes above to `.env`, you will need to uncomment the following lines in the `.env` file:
NEO4J_apoc_export_file_enabled=true
+NEO4J_apoc_import_file_enabled=true
+NEO4J_apoc_import_file_use__neo4j__config=true
+NEO4JLABS_PLUGINS=\[\"apoc\", \"graph-data-science\"\]
+
On the command line, from the root of the monarch-neo4j repository, you can launch Neo4j with:
+docker-compose up
+
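Once the containers are up, you can sanity-check the deployment from a second terminal (7474 is Neo4j's default browser port; adjust if your docker-compose.yaml maps it differently):

```sh
# list the running services and the ports they publish
docker-compose ps
# the Neo4j browser should now answer on http://localhost:7474
curl -I http://localhost:7474
```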
Nodes in a Cypher query are expressed with `()` and the basic form of a query is `MATCH (n) RETURN n`. To limit the results to just our disease of interest, we can restrict by a property, in this case the `id` property.
MATCH (d {id: 'MONDO:0007038'}) RETURN d
+
This returns a single bubble, but by exploring the controls just to the left of the returned query, you can see a json or table representation of the returned node.
+{
+ "identity": 480388,
+ "labels": [
+ "biolink:Disease",
+ "biolink:NamedThing"
+ ],
+ "properties": {
+ "name": "Achoo syndrome",
+ "provided_by": [
+ "phenio_nodes"
+ ],
+ "id": "MONDO:0007038",
+ "category": [
+ "biolink:Disease"
+ ]
+ },
+ "elementId": "480388"
+}
+
Clicking back to the graph view, you can expand to see direct connections out from the node by clicking on the node and then clicking on the graph icon. This will return all nodes connected to the disease by a single edge.
+++Tip: the node images may not be labeled the way you expect. Clicking on the node reveals a panel on the right, clicking on that node label at the top of the panel will reveal a pop-up that lets you pick which property is used as the caption in the graph view.
+
In Cypher, nodes are represented by `()` and edges are represented by `[]`, in the form of `()-[]-()`, and your query is a little chance to express yourself with ascii art. To get the same results as the expanded graph view, you can query for any edge connecting to any node. Note that the query also asks for the connected node to be returned.
MATCH (d {id: 'MONDO:0007038'})-[]-(n) RETURN d, n
+
It's possible to add another edge to the query to expand out further. In this case, we're adding a second edge to the query, and restricting the direction of the second edge to be outgoing. This will return all nodes connected to the disease by a single edge, and then all nodes connected to those nodes by a single outgoing edge. It's important to note that without limiting the direction of the association, this query will traverse up, and then back down the subclass tree.
+MATCH (d {id: 'MONDO:0007038'})-[]->(n)-[]->(m) RETURN d,n,m
+
Sometimes, we don't know what kind of questions to ask without seeing the shape of the data. Neo4j provides a graph representation of the schema by calling a procedure
+CALL db.schema.visualization
+
If you tug on nodes and zoom, you may find useful information, but it's not a practical way to explore the schema.
+We can explore the kinds of connections available for a given category of node. Using property restriction again, but this time instead of restricting by the ID, we'll restrict by the category. Also, instead of returning nodes themselves, we'll return the categories of those nodes.
+MATCH (g:`biolink:Gene`)-[]->(n) RETURN DISTINCT labels(n)
+
++Tip: the `DISTINCT` keyword is used to remove duplicate results. In this case, we're only interested in the unique categories of nodes connected to genes.
Expanding on the query above, we can also return the type of relationship connecting the gene to the node.
+MATCH (g:`biolink:Gene`)-[rel]->(n) RETURN DISTINCT type(rel), labels(n)
+
Which returns tabular data like:
+╒════════════════════════════════════════════════════╤═══════════════════════════════════════════════════════════╕
+│"type(rel)" │"labels(n)" │
+╞════════════════════════════════════════════════════╪═══════════════════════════════════════════════════════════╡
+│"biolink:located_in" │["biolink:NamedThing","biolink:CellularComponent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:part_of" │["biolink:NamedThing","biolink:MacromolecularComplexMixin"]│
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:enables" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:actively_involved_in" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:colocalizes_with" │["biolink:NamedThing","biolink:CellularComponent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:active_in" │["biolink:NamedThing","biolink:CellularComponent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:actively_involved_in" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:contributes_to" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:orthologous_to" │["biolink:NamedThing","biolink:Gene"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:participates_in" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:interacts_with" │["biolink:NamedThing","biolink:Gene"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:has_phenotype" │["biolink:NamedThing","biolink:GeneticInheritance"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:has_phenotype" │["biolink:NamedThing","biolink:PhenotypicQuality"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:risk_affected_by" │["biolink:NamedThing","biolink:Disease"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:gene_associated_with_condition" │["biolink:NamedThing","biolink:Disease"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:has_phenotype" │["biolink:NamedThing","biolink:ClinicalModifier"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_positive_effect" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:risk_affected_by" │["biolink:NamedThing","biolink:Gene"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:gene_associated_with_condition" │["biolink:NamedThing","biolink:Gene"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within_positive_effect"│["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:has_mode_of_inheritance" │["biolink:NamedThing","biolink:GeneticInheritance"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_negative_effect" │["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_positive_effect" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within_negative_effect"│["biolink:NamedThing","biolink:Occurrent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:has_phenotype" │["biolink:NamedThing","biolink:PhenotypicFeature"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within_negative_effect"│["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_or_within_positive_effect"│["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing","biolink:GrossAnatomicalStructure"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing","biolink:AnatomicalEntity"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:acts_upstream_of_negative_effect" │["biolink:NamedThing","biolink:Pathway"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing","biolink:Cell"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:located_in" │["biolink:NamedThing","biolink:MacromolecularComplexMixin"]│
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing","biolink:CellularComponent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing","biolink:MacromolecularComplexMixin"]│
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:part_of" │["biolink:NamedThing","biolink:CellularComponent"] │
+├────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
+│"biolink:expressed_in" │["biolink:NamedThing"] │
+└────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
+
++Note: the DISTINCT keyword will only remove duplicate results if the entire result is the same. In this case, we're interested in the unique combinations of relationship type and node category.
+
Further constraining on the type of the connecting node, we can ask what kinds of associations exist between two entity types. For example, what kinds of associations exist between genes and diseases?
+MATCH (g:`biolink:Gene`)-[rel]->(n:`biolink:Disease`) RETURN DISTINCT type(rel)
+
╒════════════════════════════════════════╕
+│"type(rel)" │
+╞════════════════════════════════════════╡
+│"biolink:gene_associated_with_condition"│
+├────────────────────────────────────────┤
+│"biolink:risk_affected_by" │
+└────────────────────────────────────────┘
+
MATCH (g:`biolink:Gene`{id:"HGNC:1100"})-[]-(d:`biolink:Disease`) RETURN g,d
+
MATCH (g:`biolink:Gene`{id:"HGNC:1100"})-[]->(d:`biolink:Disease`)-[]->(p:`biolink:PhenotypicFeature`) RETURN g,d,p
+
Why doesn't this return results? This is a great opportunity to track down an unexpected problem.
+First, try a less constrained query, so that the 3rd node can be anything:
+MATCH (g:`biolink:Gene`{id:"HGNC:1100"})-[]->(d:`biolink:Disease`)-[]->(p) RETURN g,d,p
+
With a little tugging and stretching, a good picture emerges, and by clicking our phenotype bubbles, they look like they're showing as PhenotypicQuality rather than PhenotypicFeature. This is likely a bug, but a sensible alternative for this same intent might be:
+MATCH (g:`biolink:Gene`{id:"HGNC:1100"})-[]->(d:`biolink:Disease`)-[:`biolink:has_phenotype`]->(p) RETURN g,d,p
+
Sometimes, we don't know the specific number of hops. What if we want to answer the question "What genes affect the risk for an inherited auditory system disease?"
First, let's find out how diseases are connected to one another. Name the relationship to query for just the predicates:
MATCH (d:`biolink:Disease`)-[rel]-(d2:`biolink:Disease`) RETURN DISTINCT type(rel)
+
╒════════════════════════════════════════╕
+│"type(rel)" │
+╞════════════════════════════════════════╡
+│"biolink:subclass_of" │
+├────────────────────────────────────────┤
+│"biolink:related_to" │
+├────────────────────────────────────────┤
+│"biolink:associated_with" │
+├────────────────────────────────────────┤
+│"biolink:has_phenotype" │
+├────────────────────────────────────────┤
+│"biolink:gene_associated_with_condition"│
+├────────────────────────────────────────┤
+│"biolink:risk_affected_by" │
+└────────────────────────────────────────┘
+
(Please ignore `biolink:gene_associated_with_condition` and `biolink:risk_affected_by` showing up here; those are due to a bug in our OMIM ingest.)
We'll construct a query that fixes the super class disease, then connects at any distance to any subclass of that disease, and then brings genes that affect risk for those diseases. To avoid a big hairball graph being returned, we can return the results as a table showing the diseases and genes.
+MATCH (d:`biolink:Disease`{id:"MONDO:0002409"})<-[:`biolink:subclass_of`*]-(d2:`biolink:Disease`)<-[:`biolink:risk_affected_by`]-(g:`biolink:Gene`) RETURN d.id, d.name, d2.id, d2.name, g.symbol, g.id
+
Once you trust the query, you can also use the DISTINCT keyword again to focus in on just the gene list:
+MATCH (d:`biolink:Disease`{id:"MONDO:0002409"})<-[:`biolink:subclass_of`*]-(d2:`biolink:Disease`)<-[:`biolink:risk_affected_by`]-(g:`biolink:Gene`) RETURN DISTINCT g.id
+
First, we can ask what kind of associations we have between genes.
+MATCH (g:`biolink:Gene`)-[rel]->(g2:`biolink:Gene`) RETURN DISTINCT type(rel)
+
╒════════════════════════════════════════╕
+│"type(rel)" │
+╞════════════════════════════════════════╡
+│"biolink:orthologous_to" │
+├────────────────────────────────────────┤
+│"biolink:interacts_with" │
+├────────────────────────────────────────┤
+│"biolink:risk_affected_by" │
+├────────────────────────────────────────┤
+│"biolink:gene_associated_with_condition"│
+└────────────────────────────────────────┘
+
++Again, please ignore
+biolink:gene_associated_with_condition
andbiolink:risk_affected_by
.
Let's say that from the list above, we're super interested in the DIABLO gene, because, obviously, it has a cool name. We can find its orthologues by querying through the `biolink:orthologous_to` relationship.
MATCH (g {id:"HGNC:21528"})-[:`biolink:orthologous_to`]-(o:`biolink:Gene`) RETURN g,o
+
We can then make the question more interesting, by finding phenotypes associated with these orthologues.
+MATCH (g {id:"HGNC:21528"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:has_phenotype`]->(p) RETURN g,og,p
+
That was a dead end. What about gene expression?
+MATCH (g {id:"HGNC:21528"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:expressed_in`]->(a) RETURN g,og,a
+
We can take this one step further by connecting our gene expression list to UBERON terms:
+MATCH (g {id:"HGNC:21528"})-[:`biolink:orthologous_to`]-(og:`biolink:Gene`)-[:`biolink:expressed_in`]->(a)-[`biolink:subclass_of`]-(u)
+WHERE u.id STARTS WITH 'UBERON:'
+RETURN distinct u.id, u.name
+
In particular, it's a nice confirmation to see that we started at the high level MONDO term "inherited auditory system disease", passed through subclass relationships to more specific diseases, connected to genes that affect risk for those diseases, focused on a single gene, and were able to find that it is expressed in the cochlea.
+╒════════════════╤════════════════════════════════╕
+│"u.id" │"u.name" │
+╞════════════════╪════════════════════════════════╡
+│"UBERON:0000044"│"dorsal root ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0000151"│"pectoral fin" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0000948"│"heart" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0000961"│"thoracic ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001017"│"central nervous system" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001555"│"digestive tract" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001675"│"trigeminal ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001700"│"geniculate ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001701"│"glossopharyngeal ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001844"│"cochlea" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001991"│"cervical ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0002107"│"liver" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0002441"│"cervicothoracic ganglion" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0003060"│"pronephric duct" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0003922"│"pancreatic epithelial bud" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0004141"│"heart tube" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0004291"│"heart rudiment" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0005426"│"lens vesicle" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0007269"│"pectoral appendage musculature"│
+├────────────────┼────────────────────────────────┤
+│"UBERON:0019249"│"2-cell stage embryo" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0000965"│"lens of camera-type eye" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0001645"│"trigeminal nerve" │
+├────────────────┼────────────────────────────────┤
+│"UBERON:0003082"│"myotome" │
+└────────────────┴────────────────────────────────┘
+
This tutorial will show you how to use the tools that are made available by +the ODK Docker images, independently of an ODK-generated repository and of +ODK-managed workflows.
+You have:
+You know:
+Let’s check which Docker images, if any, are available in your Docker +installation:
+$ docker images
+REPOSITORY TAG IMAGE ID CREATED SIZE
+
Here, the listing comes up empty, meaning there are no images at all. This is +what you would expect if you have just installed Docker and have yet to do +anything with it.
+Let’s download the main ODK image:
+$ docker pull obolibrary/odkfull
+Using default tag: latest
+latest: Pulling from obolibrary/odkfull
+[… Output truncated for brevity …]
+Digest: sha256:272d3f788c18bc98647627f9e6ac7311ade22f35f0d4cd48280587c15843beee
+Status: Downloaded newer image for obolibrary/odkfull:latest
+docker.io/obolibrary/odkfull:latest
+
Let’s see the images list again:
+$ docker images
+REPOSITORY TAG IMAGE ID CREATED SIZE
+obolibrary/odkfull latest 0947360954dc 6 months ago 2.81GB
+
Docker images can exist in several versions, which are called tags in Docker
+parlance. In our pull
command, since we have not specified any tag, Docker
+had automatically defaulted to the latest
tag, which by convention is the
+latest ODK release.
To download a specific version, append the tag after the image name (you can +check which tags are available on +DockerHub). For example, +let’s download the 1.3.1 release from June 2022:
+$ docker pull obolibrary/odkfull:v1.3.1
+v1.3.1: Pulling from obolibrary/odkfull
+Digest: sha256:272d3f788c18bc98647627f9e6ac7311ade22f35f0d4cd48280587c15843beee
+Status: Downloaded newer image for obolibrary/odkfull:v1.3.1
+docker.io/obolibrary/odkfull:v1.3.1
+
Again, let’s see the output of docker images
:
$ docker images
+REPOSITORY TAG IMAGE ID CREATED SIZE
+obolibrary/odkfull latest 0947360954dc 6 months ago 2.81GB
+obolibrary/odkfull v1.3.1 0947360954dc 6 months ago 2.81GB
+
Note how both the latest
and the v1.3.1
images have the same ID. This is
+because, at the time of this writing, the 1.3.1 release is the latest ODK
+release, so the latest
tag actually points to the same image as the v1.3.1
+tag. This will change when the ODK v1.3.2 is released: then, using latest
+(explicitly or by not specifying any tag at all) will point to the new
+release, while v1.3.1
will forever continue to point to the June 2022
+release.
In the rest of this tutorial, we will always use the latest
image, and so we
+will dispense with the explicit tag. But remember that anywhere you see
+obolibrary/odkfull
in one of the commands below, you can always use
+obolibrary/odkfull:TAG
to force Docker to use a specific ODK version.
Now that we have the ODK image available, let’s try to start it. The command
+for that is docker run
, which has the following syntax:
docker run [OPTIONS] <IMAGE> [COMMAND [ARGUMENTS...]]
+
where IMAGE
is the name of the image to use (in our case, always
+obolibrary/odkfull
).
With the ODK, you will always need the --rm
option. It instructs the Docker
+engine to automatically remove the container it creates to run a command, once
+that command terminates. (Not using the --rm
option would lead to those
+“spent” containers to accumulate on your system, ultimately forcing you to
+manually remove them with the docker container rm
command.)
If we don’t specify an explicit command, the simplest command line we can have +is thus:
+$ docker run --rm obolibrary/odkfull
+Usage: odk.py [OPTIONS] COMMAND [ARGS]...
+
+Options:
+ --help Show this message and exit.
+
+Commands:
+ create-dynfile For testing purposes
+ create-makefile For testing purposes
+ dump-schema Dumps the python schema as json schema.
+ export-project For testing purposes
+ seed Seeds an ontology project
+$
+
In the absence of an explicit command, the default command odk.py
is
+automatically invoked by Docker. Since it has been invoked without any
+argument, odk.py
does nothing but printing its “usage” message before
+terminating. When it terminates, the Docker container terminates as well, and
+we are back at the terminal prompt.
To invoke one of the tools available in the toolbox (we’ll see what those +tools are later in this document), just complete the command line as needed. +For example, to test that ROBOT is there (and to see which version we have):
+$ docker run --rm obolibrary/odkfull robot --version
+ROBOT version 1.9.0
+
Since we have ROBOT, let’s use it. Move to a directory containing some +ontology files (here, I’ll use a file from the Drosophila Anatomy Ontology, +because if you have to pick an ontology, why not picking an ontology that +describes the One True Model Organism?).
+$ ls
+fbbt.obo
+
We want to convert that OBO file to a file in, say, the OWL Functional Syntax. +So we call ROBOT with the appropriate command and options:
+$ docker run --rm obolibrary/odkfull robot convert -i fbbt.obo -f ofn -o fbbt.ofn
+org.semanticweb.owlapi.io.OWLOntologyInputSourceException: java.io.FileNotFoundException: fbbt.obo (No such file or directory)
+Use the -vvv option to show the stack trace.
+Use the --help option to see usage information.
+
Huh? Why the "No such file or directory" error? We just checked that `fbbt.obo` is present in the current directory, why can't ROBOT find it?
Because Docker containers run isolated from the rest of the system – that’s +kind of the entire point of such containers in general! From within a +container, programs can, by default, only access files from the image from +which the container has been started.
+For the ODK Toolbox to be at all useful, we need to explicitly allow the
+container to access some parts of our machine. This is done with the -v
+option, as in the following example:
$ docker run --rm -v /home/alice/fbbt:/work […rest of the command omitted for now…]
+
This -v /home/alice/fbbt:/work
has the effect of binding the directory
+/home/alice/fbbt
from our machine to the directory /work
inside the
+container. This means that if a program that runs within the container tries
+to have a look at the /work
directory, what this program will actually see
+is the contents of the /home/alice/fbbt
directory. Figuratively, the -v
+option opens a window in the container’s wall, allowing to see parts of what’s
+outside from within the container.
With that window, and assuming our fbbt.obo
file is within the
+/home/alice/fbbt
directory, we can try again invoking the conversion
+command:
$ docker run --rm -v /home/alice/fbbt:/work obolibrary/odkfull robot convert -i /work/fbbt.obo -f ofn -o /work/fbbt.ofn
+$ ls
+fbbt.obo
+fbbt.ofn
+
This time, ROBOT was able to find our `fbbt.obo` file, and to convert it as we asked.
We can slightly simplify the last command line in two ways.
+First, instead of explicitly specifying the full pathname to the current
+directory (/home/alice/fbbt
), we can use the shell variable $PWD
, which is
+automatically expanded to that pathname: -v $PWD:/work
.
Second, to avoid having to explicitly refer to the /work
directory in the
+command, we can ask the Docker engine to run our command as if the current
+directory, within the container, was already /work
. This is done with the
+-w /work
option.
The command above now becomes:
+$ docker run --rm -v $PWD:/work -w /work obolibrary/odkfull robot convert -i fbbt.obo -f ofn -o fbbt.ofn
+
This is the typical method of invoking a tool from the ODK Toolbox to work on +files from the current directory.
+In fact, this is exactly how the src/ontology/run.sh
wrapper script, that is
+automatically created in an ODK-generated repository, works. If you work with
+an ODK-managed ontology, you can invoke an arbitrary ODK tool by using the
+run.sh
instead of calling docker run
yourself. Assuming for example that
+you already are in the src/ontology
directory of an ODK-managed ontology,
+you could use:
./run.sh robot convert -i fbbt.obo -f ofn -o fbbt.ofn
+
If you want to use the ODK toolbox with ontologies that are not managed by
+the ODK (so, where a run.sh
script is not readily available), you can set up
+an independent wrapper script, as explained in the Setting up the
+ODK tutorial.
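For illustration, such a wrapper can be as small as this (the file name `odkrun.sh` is just an example, not something the ODK prescribes):

```sh
#!/bin/sh
# run any ODK toolbox command against the current directory
exec docker run --rm -ti -v "$PWD":/work -w /work obolibrary/odkfull "$@"
```

After `chmod +x odkrun.sh`, the conversion above becomes `./odkrun.sh robot convert -i fbbt.obo -f ofn -o fbbt.ofn`.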
If you have several commands to invoke in a row involving files from the same
+directory, you do not have to repeatedly invoke docker run
once for each
+command. Instead, you can invoke a shell, from which you will be able to run
+successively as many commands as you need:
$ docker run --rm -ti -v $PWD:/work -w /work obolibrary/odkfull bash
+root@c1c2c80c491b:/work#
+
The -ti
options allow you to use your current terminal to control the shell that
+is started within the container. This is confirmed by the modified prompt that
+you can see above, which indicates that you are now “in” the container. You
+can now directly use all the tools that you need:
root@c1c2c80c491b:/work# robot convert -i fbbt.obo -f owx -o fbbt.owl
+root@c1c2c80c491b:/work# Konclude consistency -i fbbt.owl
+{info} 18:21:14.543 >> Starting Konclude …
+[…]
+{info} 18:21:16.949 >> Ontology ‘out.owl’ is consistent.
+root@c1c2c80c491b:/work#
+
When you are done, exit the shell by hitting Ctrl-D
or with the exit
+command. The shell will terminate, and with it, the container will terminate
+as well, sending you back to your original terminal.
Now that you know how to invoke any tool from the ODK Toolbox, here’s a quick +overview of which tools are available.
+For a definitive list, the authoritative source is the ODK
+repository, especially
+the Dockerfile
and requirements.txt.full
files. And if you miss a tool
+that you think should be present in the toolbox, don’t hesitate to open a
+ticket to
+suggest that the tool be added in a future ODK release!
The goal of this tutorial is to quickly showcase key ODK workflows. It is not geared at explaining individual steps in detail. For a much more detailed walkthrough of creating a fresh ODK repo, see the tutorial on setting up your first workflow. We recommend completing that tutorial before attempting this one.
+This is some useful background from the ICBO 2022 OBO Tutorial:
+ + + +cato-odk.yaml
change github_org
to your GitHub username. If you dont do this, some ODK features wont work perfectly, like documentation.
+ github_org: matentzn
+repo: cat-ontology
+
curl https://raw.githubusercontent.com/INCATools/ontology-development-kit/v1.3.1/seed-via-docker.sh | bash -s -- --clean -C cato-odk.yaml
+
Let us now import planned process:
+src/ontology/imports/cob_terms.txt
in your favourite text editorCOB:0000082
to the term file (this is the planned process
class in COB).src/ontology
directory, run sh run.sh make refresh-cob
.src/ontology/cato-odk.yaml
, locate the entry for importing cob
and switch it to a different module type: filter
.
+ import_group:
+ products:
+ - id: ro
+ - id: cob
+ module_type: filter
+
sh run.sh make update_repo
to apply the changes. Check out the git diff to the Makefile
to convince yourself that the new extraction method has been applied.src/ontology
directory, run sh run.sh make refresh-cob
. Convince yourself that now only the planned process
term is imported.Makefile
, cato-odk.yaml
, imports/cob_terms.txt
and imports/cob_import.owl
.Great, we have done our change, now we are ready to make a release!
+main
branch in git
.git pull
).src/ontology
execute the release workflow: sh run.sh make prepare_release_fast
(we are using fast
release here which skips refreshing imports again - we just did that).planned process
class has been added to all ontology release artefacts.v2022-09-01
. Note the leading v
. Select the correct date (the date you made the release, YYYY-MM-dd
). Fill in all the other form elements as you see fit. Click Publish release
.With our ODK setup, we also have a completely customisable documentation system installed. We just need to do a tiny change to the GitHub pages settings:
+Build and deployment
select Deploy from branch
.gg-pages
as the branch (this is where ODK deploys to), and /(root)
as the directory.
+ Save
.Actions
in the main menu to follow the build process).Pages
section in Settings
. You should see a button Visit site
. Click on it. If everything went correctly, you should see your new page:
+ github_org
, see seeding). If you have not configured your repo, go to the GitHub front page of your repo, go into the docs
directory, click on
+index.md
and edit it from here. Make a small random edit.main
or do it properly, create a branch, PR, ask for reviews, merge.That's it! In about 20 minutes, we
+ + + + + + + +A project ontology, sometimes and controversially referred to as an application ontology, is an ontology which is composed of other ontologies for a particular use case, such as Natural Language Processing applications, Semantic Search and Knowledge Graph integration. A defining feature of a project ontology is that it is not intended to be used as a domain ontology. Concretely, this means that content from project ontologies (such as terms or axioms) is not to be re-used by domain ontologies (under no circumstances). Project ontology developers have the freedom to slice & dice, delete and add relationships, change labels etc as their use case demands it. Usually, such processing is minimal, and in a well developed environment such as OBO, new project ontology-specific terms are usually kept at a minimum.
+In this tutorial, we discuss the fundamental building blocks of application ontologies and show you how to build one using the Ontology Development Kit as one of several options.
+There are a few reasons for developing project ontologies. Here are two that are popular in our domain:
+Any application ontology will be concerned with at least 3 ingredients:
+MONDO:123, MONDO:231
MONDO:123, incl. all children
MONDO:123, incl. all terms that are in some way logically related to MONDO:123
There are five phases on project ontology development which we will discuss in detail in this section:
+There are other concerns, like continuous integration (basically making sure that changes to the seed or project ontology pipelines do not break anything) and release workflows which are not different from any other ontology.
+ +As described above, the seed is the set of terms that should be extracted from the source ontologies into the project ontology. The seed comprises any of the following:
+MONDO:0000001
all, children, descendants, ancestors, annotations
Users of ODK will be mostly familiar with term files located in the imports directory, such as src/ontology/imports/go_terms.txt
. Selectors are usually hidden from the user by the ODK build system, but they are much more important now when building project ontologies.
Regardless of which system you use to build your project ontology, it makes sense to carefully plan your seed management. In the following, we will discuss some examples:
+It makes sense to document your seed management plan. You should usually account for the possibility of changes (terms being added or removed) during the design phase.
+ +Module extraction is the process for selecting an appropriate subset from an ontology. There are many ways to extracting subsets from an ontology:
+You can consult the ROBOT documentation for some details on module extraction.
+Let's be honest - none of these module extraction techniques are really ideal for project ontologies. SLME modules are typically used for domain ontology development to ensure logical consistency with imported ontologies, but otherwise contain too much information (for most project ontology use cases). ROBOT filter has a hard time with dealing with closures of existential restrictions: for example you cant be sure that, if you import "endocardial endothelium" and "heart" using filter, that the one former is still part of the latter (it is only indirectly a part) - a lot of research and work has being going on to make this easier. The next version of ROBOT (1.8.5) is going to contain a new module extraction command which will ensure that such links are not broken.
+One of the design confusions in this part of the process is that most use cases of application ontologies really do not care at all about OWL. Remember, OWL really only matters for the design of domain ontologies, to ensure a consistent representation of the domain and enable reasoning-based classification. So it is, at least slightly, unsatisfactory that we have to use OWL tools to do something that may as well be done by something simpler, more akin to "graph-walking".
+ +Just like any other ontology, a project ontology should be well annotated according to the standards of FAIR Semantics, for example using the OBO Foundry conventions. In particular, project ontologies should be
+Furthermore, it is often necessary to add additional terms to the ontology which are not covered by other upstream ontologies. Here we need to distinguish two cases:
+With our OBO hat on, if you start adding terms "quickly", you should develop a procedure to get these terms into suitable upstream ontologies at a later stage. This is not so much a necessity as a matter of "open data ethics": if you use other people's work to make your life easier, its good to give back!
+Lastly, our use cases sometimes require us to add additional links between the terms in our ontologies. For example, we may have to add subClassOf links between classes of different ontologies that cover the same domain. Or we want to add additional information. As with "quickly adding terms", if the information is generally useful, you should consider to add them to the respective upstream source ontology (synonyms of disease terms from Mondo, for example). We often manage such axioms as ROBOT templates and curate them as simple to read tables.
+ +Just like with most ontologies, the last part of the process is merging the various pieces (modules from external sources, customisations, metadata) together into a single whole. During this phase a few things can happen, but these are the most common ones:
+One thing to remember is that you are not building a domain ontology. You are usually not concerned with typical issues in ontology engineering, such as logical consistency (or coherence, i.e. the absence of unsatisfiable classes). The key for validating an application ontology comes from its intended use case: Can the ontology deliver the use case it promised? There are many approaches to ensure that, chief among them competency questions. What we usually do is try to express competency questions as SPARQL queries, and ensure that there is at least one result. For example, for one of the project ontologies the author is involved with (CPONT), we have developed a synthetic data generator, which we combine with the ontology to ask questions such as: "Give me all patients which has a recorded diagnosis of scoliosis" (SPARQL). So the ontology does a "good job" if it is able to return, say, at least 100 patients in our synthetic data for which we know that they are diagnoses with scoliosis or one of its subtypes.
+The perfect framework for building project ontologies does not exist yet. The Ontology Development Kit (ODK) has all the tools you need set up a basic application ontology, but the absence of a "perfect" module extraction algorithm for this use case is still unsatisfactory. However, for many use cases, filter
modules like the ones described above are actually good enough. Here we will go through a simple example.
An alternative framework for application ontology development based on a Web User Interface and tables for managing the seed is developed by James Overton at (ontodev).
+Another potential alternative is to go all the way to graph-land and build the application ontology with KGX and LinkML. See here for an example. Creating a project ontology this way feels more like a Knowledge Graph ETL task than building an ontology!
+Set up a basic ODK ontology. We are not covering this again in this tutorial, please refer to the tutorial on setting up your ODK repo.
+Many of the larger imports in application ontologies do not fit into the normal GitHub file size limit. In this cases it is better to attach them to a GitHub release rather than to check them into version control.
+TBD
+Participants will need to have access to the following resources and tools prior to the training:
+Description: How to create and manage pull requests to ontology files in GitHub.
+A pull request (PR) is an event in Git where a contributor (you!) asks a maintainer of a Git repository to review changes (e.g. edits to an ontology file) they want to merge into a project (e.g. the owl file) (see reference). A contributor creates a pull request to propose and collaborate on changes to a repository. These changes are proposed in a branch, which ensures that the default branch only contains finished and approved work. See more details here.
+ +When committing a pull request, you must include a title and a description (more details in the workflow below.) Tips below (adapted from Hugo Dias):
+The title of the PR should be self-explanatory
+Do: Describe what was changed in the pull request
+Example: Add new term: MONDO:0100503 DPH5-related diphthamide-deficiency syndrome`
+Don't: write a vague title that has very little meaning.
+Example: Add new term
+Don't: use the branch name in the pull request (sometimes GitHub will offer this as a default name)
+Example:
+ +A video is below.
+ + +Example diffs:
+Example 1 (Cell Ontology):
+ +Example 2 (Mondo):
+Commit message: Before committing, you must add a commit message. In GitHub Desktop, the Commit field in the lower left has a subject line and a description.
+Give a very descriptive title: Add a descriptive title in the subject line. For example: add new class ONTOLOGY:ID [term name] (e.g. add new class MONDO:0000006 heart disease)
+Write a detailed summary of what the change is in the Description box, referring to the issue. The sentence should clearly state how the issue is addressed.
+NOTE: You can use the word ‘fixes’ or ‘closes’ in the commit message - these are magic words in GitHub; when used in combination with the ticket number, the ticket will be closed automatically once the change is merged. Learn more on this GitHub Help Documentation page about Closing issues via commit messages.
+‘Fixes’ and ‘Closes’ are case-insensitive and can be plural or singular (fixes, closes, fix, close).
+If you don’t want to close the ticket, just refer to the ticket # without the word ‘fixes’, or use ‘addresses’ instead. The commit will be associated with the correct ticket but the ticket will remain open.
+Push: To incorporate the changes into the remote repository, click Commit to [branch name], then click Push.
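+If you work on the command line rather than in GitHub Desktop, the equivalent steps look roughly like this (branch name, file name, and ticket number are all illustrative):

```
# Stage and commit the edited ontology file with a descriptive subject line;
# "fixes #1234" will close ticket 1234 automatically once the change is merged
git add src/ontology/mondo-edit.obo
git commit -m "add new class MONDO:0000006 heart disease" \
           -m "Adds the requested class with definition and synonyms; fixes #1234"

# Push the branch to the remote repository
git push origin my-new-term-branch
```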
+Tips for finding reviewers:
+An ontology repository should have an owner assigned. This may be described in the README file or on the OBO Foundry website. For example, the contact person for Mondo is Nicole Vasilevsky.
+If you are assigned to review a pull request, you should receive an email notification. You can also check for PRs assigned to you by going to https://github.com/pulls/assigned.
+It depends on what the pull request is addressing. Remember that the automated QC checks will catch things like unsatisfiable classes and many other issues (these vary between ontologies). Your job as a reviewer is to check for things that the QC checks won't pick up and that need human judgement.
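+If you want to re-run those automated checks yourself before approving, many OBO ontologies that use the ODK let you do so locally; a sketch (the exact make targets vary between repositories):

```
# From the src/ontology directory of an ODK-managed repository
sh run.sh make test

# Or run a standalone ROBOT report on the edit file (file name is an example)
robot report --input mondo-edit.obo --output reports/edit-report.tsv
```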
+If you don't know who to assign, we recommend assigning the ontology contact person and they can triage the request.
+To review a PR, you should open the 'Files changed' tab and examine the diff(s). You can review changes in a pull request one file at a time.
+Example:
+Make sure the changes made address the ticket. In the example above, Sabrina addressed a ticket that requested adding a new term to Mondo, which is what she did on the PR (see https://github.com/monarch-initiative/mondo/pull/5078).
+Examples of things to look for in content changes (like adding new terms or revising existing terms):
+appropriate annotations
+Make sure there are no unintended or unwanted changes on the PR. See the example below, where Protege reordered the location of a term in the file.
+After reviewing the file(s), you can approve the pull request or request additional changes by submitting your review with a summary comment.
+Comment (Submit general feedback without explicit approval)
+Approve (Submit feedback and approve merging the changes)
+Request changes (Submit feedback that must be addressed before merging)
+In addition to, or instead of, adding inline comments, you can leave comments on the Conversation page. The Conversation page is a good place to discuss the PR, and for the original creator to respond to the reviewer's comments.
+GitHub has a 'suggested changes' feature that allows a PR reviewer to suggest an exact change in a comment on a PR. You can add inline comments, and the suggested change can then be committed directly from the PR (an 'inline commit'). Read more about it here.
+If you review the PR and the changes properly address what was described in the description, then it should be sufficient. Not every PR needs comments, it can be approved without any comments or requests for changes. Feel free to ask for help with your review, and/or assign additional reviewers.
+Some of the content above was adapted from GitHub Docs.
+Conflicts arise when edits are made on two separate branches to the same line in a file. (reference). When editing an ontology file (owl file or obo file), conflicts often arise when adding new terms to an ontology file on separate branches, or when there are a lot of open pull requests.
+Conflicts in ontology files can be fixed either on the command line or using GitHub Desktop. In this lesson, we describe how to fix conflicts using GitHub Desktop.
+Open the ontology file from the command line with open [ontology file name] (e.g. open mondo-edit.obo), or open it in Protege manually.
+Watch the video below for an example of fixing a conflict in the Mondo ontology file.
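+For reference, the command-line version of the same fix looks roughly like this; branch and file names are illustrative, and the rest of this lesson sticks to GitHub Desktop:

```
# Merge the default branch into your working branch; Git will report the
# conflicting ontology file
git checkout my-new-term-branch
git merge main   # or master, depending on the repository

# Edit the conflicted file to remove the <<<<<<< / ======= / >>>>>>> markers,
# check that it still opens in Protege, then finish the merge
git add src/ontology/mondo-edit.obo
git commit
git push origin my-new-term-branch
```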
+ + +Some examples of conflicts that Nicole fixed in Mondo are below: