Standardize how metadata supporting text mined results is represented #399

mbrush · 2023-02-28T19:38:57Z

Translator uses two main sources for text-mined knowledge: TMKP and SemMedDB.

These sources want to report metadata supporting a text-mined edge, including the sentence(s) mined, metrics/scores reflecting confidence in accurate extraction of concepts and relationships from each sentence, and information about the context in which the sentence is found (e.g. what section of an article).

Often, a given edge is supported by mining of multiple sentences/spans of text, each of which comes with its own set of such metadata.

Precise representation of this information requires a way to group metadata for each NLP-based sentence analysis together in a TRAPI message.

The modeling team worked with TMKP to define a way to do this using Biolink StudyResult objects, and leveraging nested Attributes in the TRAPI structure. Details and examples of this model are here.
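To make the grouping idea concrete, here is a minimal Python sketch of how one edge could carry one StudyResult-style attribute per mined sentence, with that sentence's metadata nested inside it. It assumes only that a TRAPI Attribute can hold child Attributes in an `attributes` list. The `attribute_type_id` strings, the helper names, the sentences, and the PMIDs are illustrative placeholders, not the authoritative TMKP model, and should be checked against the linked details.

```python
from typing import Any, Dict, List, Optional


def make_attribute(type_id: str, value: Any,
                   nested: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build a TRAPI-style Attribute dict; `attributes` holds nested Attributes."""
    attr: Dict[str, Any] = {"attribute_type_id": type_id, "value": value}
    if nested:
        attr["attributes"] = nested
    return attr


def study_result_attribute(result_id: str, sentence: str, pmid: str,
                           section: str, score: float) -> Dict[str, Any]:
    """Group all metadata for ONE mined sentence under one organizing attribute.

    The type ids below are assumed for illustration only.
    """
    return make_attribute(
        "biolink:has_supporting_study_result",
        result_id,
        nested=[
            make_attribute("biolink:supporting_text", sentence),
            make_attribute("biolink:publications", pmid),
            make_attribute("biolink:supporting_text_located_in", section),
            make_attribute("biolink:extraction_confidence_score", score),
        ],
    )


# Two supporting sentences -> two self-contained groups on the same edge.
# All ids, sentences, and scores here are made up.
edge_attributes = [
    study_result_attribute("example:result-1", "Drug A was effective in treating disease B.",
                           "PMID:0000001", "abstract", 0.99),
    study_result_attribute("example:result-2", "Patients with disease B improved on drug A.",
                           "PMID:0000002", "results", 0.87),
]
```

Because each sentence's score and location sit inside its own group, a consumer never has to guess which score belongs to which sentence.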

This modeling structure is reflected in how edge metadata is returned in the ARAX-ARS interface. Below I show a subset of the metadata on an 'is treated_by' edge from TMKP, which shows up in the KG supporting ARAGORN's 'Netarsudil' result for this query:

However, other KPs that provide text-mined edges from SemMedDB (BTE, RTX-KG2) return less detailed metadata . . .

(from https://arax.ncats.io/?r=623df483-e0c8-45b5-80bb-38f15627c93c, specifically a 'treated by' edge in ARAGORN's TOFACITINIB result)

. . . and when more detail is provided, a very different structure is used. In the rtx2-semmed example below, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute:

(from https://arax.transltr.io/?r=9360c5c9-cb10-47d2-9910-535fc4cbbf05, specifically a 'treats' edge in ARAGORN's biguanide result)

In summary, SemMedDB edge metadata describing source publications, sentences, and details about these (dates, scores, etc.) is inconsistently provided and represented across KPs, and does not use the same detailed structure as TMKP.

The TMKP model provides rich metadata using Study Result objects as organizing nodes in a two-level structure.
In the bte-semmeddb example, the KP doesn’t include sentence or other metadata at all.
In the rtx2-semmed example, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute. (Each object in this blob is analogous to a Study Result in the TMKP model.)

We should try to use a similar structure in all cases, aligning where possible with that defined by TMKP.
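As a rough illustration of what that alignment could look like, the Python sketch below regroups a per-publication sentence blob into one nested attribute per sentence. The input shape (a dict keyed by PMID with `sentence`, `publication_date`, and `score` fields), the function name, and every `attribute_type_id` string are assumptions made for this example. They are not the actual rtx2-semmed payload or the agreed TMKP vocabulary.

```python
from typing import Any, Dict, List

# Hypothetical mapping from blob field names to attribute type ids.
FIELD_MAP = {
    "sentence": "biolink:supporting_text",
    "publication_date": "biolink:supporting_document_year",
    "score": "biolink:extraction_confidence_score",
}


def regroup_sentence_blob(blob: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn {pmid: {field: value}} into one StudyResult-like attribute per entry."""
    grouped = []
    for pmid, record in blob.items():
        # The publication id always leads the nested list.
        nested = [{"attribute_type_id": "biolink:publications", "value": pmid}]
        # Copy over only the fields that are present; unknown fields are ignored.
        for field, type_id in FIELD_MAP.items():
            if field in record:
                nested.append({"attribute_type_id": type_id, "value": record[field]})
        grouped.append({
            "attribute_type_id": "biolink:has_supporting_study_result",
            "value": f"example:{pmid}",  # placeholder identifier for the group
            "attributes": nested,
        })
    return grouped


# Made-up input in the assumed shape.
sample_blob = {
    "PMID:0000001": {
        "sentence": "Drug A treats disease B.",
        "publication_date": "2020",
        "score": 1000,
    },
    "PMID:0000002": {"sentence": "Drug A was given for disease B."},
}
regrouped = regroup_sentence_blob(sample_blob)
```

A KP that has no sentence-level metadata would simply emit no such groups, so consumers could handle all three cases with the same parsing logic.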

edeutsch · 2023-06-22T17:47:21Z

@mbrush do we need to keep this issue open still or is it being tracked elsewhere?

mbrush mentioned this issue Feb 28, 2023

Align and connect 'supporting document' and 'publications' edge properties biolink/biolink-model#1107

Closed

andrewsu mentioned this issue Mar 1, 2023

adjust SemMedDB SmartAPI annotation to comply with new NLP metadata standard biothings/biothings_explorer#570

Closed

mbrush mentioned this issue Mar 1, 2023

Standard/conventions for representing supporting publications #398

Closed
