Add dataset, FHIR stores and annotations stores and related resources #71

tschaffter · 2020-11-04T20:03:24Z

Contributions of this PR

Summary

This PR introduces a major redesign of the Data Node API.
The Date Annotator API has also been slightly updated.

New Date Node API

The new API is inspired from Google Healthcare API that organizes data into three hierarchical levels: Project, Dataset, and Store. The Date Node API skips the Project level for simplicity and only keeps Dataset and Stores.

The paragraphs below describe how to push and get data, namely notes and annotations, using the Date Node API. In order to better understand the steps, we recommend that you clone the content of this PR and start the server used to preview the OpenAPI specification in HTML format. Once you have cloned this repo:

git fetch
git checkout -b add-text-schema origin/add-text-schema
npm run start openapi/data-node/openapi.yaml

Create a Dataset

First, you need to create a dataset that has a unique name in the entire data node.
http://localhost:8080/#operation/createDataset

Add a NoteStore to the Dataset

The next step is to add a data store to the dataset. Two types of stores are available: NoteStore and AnnotationStore.

Let's start by adding a NoteStore to our Dataset using the endpoint below. This endpoint takes one path parameter named datasetId. The value of this parameter must be set with the name of the dataset to use. The payload contains one property named name, which is the name of the store. The name of a store must be unique for a given dataset.
http://localhost:8080/#operation/createNoteStore

Add Notes to the NoteStore

Notes are added to a NoteStore using the endpoint below. This endpoints takes two path parameters: datasetId and storeId that take the name of the dataset and note store to use. The payload is a Note object with two properties: noteId (unique in a given note store) and text.
http://localhost:8080/#operation/createNote

Get notes from the NoteStore

You can retrieve the notes stored in the NoteStore with this endpoint. The results returned are paginated.
http://localhost:8080/#operation/listNotes

Push and pull annotations

The Data Node API currently has a second type of store: DataStore. So far two types of annotations can be pushed to an instance of this store: Annotation and DateAnnotation (extends Annotation).

Add an AnnotationStore to your dataset.
http://localhost:8080/#operation/createAnnotationStore

Add Annotation or DateAnnotation to this store.
http://localhost:8080/#operation/createAnnotation

Get all annotations (filters will be added soon). The results returned are paginated.
http://localhost:8080/#operation/listAnnotations

Date Annotator API Update

Previously, the input was an array of Notes. The input is now a Text object. It is still possible to pass a Note object because Note extends the Text object.
The output is an array of DateAnnotation objects.

TODO: Decide what to do with Annotation.noteId. Rename textId? :/

Initial PR description

This is an attempt to simplify the note schema and add a level of abstraction to target one day text that are not clinical notes.

First, this PR proposes to add a Text schema that has one string property called text. The idea is to pass to the date annotator a text object rather than a Note object. The motivation is that the Note object does not provide additional useful information that can help the date annotator in its task as of now. The second motivation is for the date annotator to be used with texts that are not clinical notes. Hence, the input to the date annotator becomes very similar to the input of Amazon Comprehend deid service.

Source

Another proposal is to temporarily at least get ride of the fields like "createdBy", "createdAt", "updatedBy", etc. These fields are meaningful for storing the record in a DB, but there is currently no need to expose this information to the user of most API. The data node should ideally continue to store "createdBy" value in the DB but API does not need to document this at this point. This will be revisited later.

Once again, the motivation is to simplify the API spec of the NLP tool. In the API documentation, it is more friendly to ask the input:

{
    "text": "This is a text."
}

Rather than the current large Note object that comes with "createdBy" etc. that are not useful to the tool developer.

Regarding the type of clinical note, I proposed to use the LOINC concept used by FHIR mentioned in this page:
FHIR Clinical Notes Guidance

Finally, one proposed change is give as input to the annotators ONE text or note instead of an array. Here are the motivations:

Simplification: it takes less lines of codes in the NLP tool to process one item than an array (we same the iteration code)
Closer to standard: It is possible to pass as input an array of object because we currently do it and this works. However, I believe that a best practice is that the body of a request should be one (json) object. A hint towards that direction is that 1) I never saw a requestBody that is an array in the OpenAPI doc, and 2) the openapi-generator uses the singular form for the name of the input parameter. For example, the parameter is named note in this function of the data annotator in both the Python and java implementation, while we currently treat it as an array.

Python: https://github.com/Sage-Bionetworks/nlp-sandbox-date-annotator-example/blob/develop/server/openapi_server/controllers/date_controller.py#L9
Java: https://github.com/nlpsandbox/date-annotator-example-java/blob/develop/server/src/main/java/org/openapitools/api/DatesApi.java#L60

If this PR is accepted

I will update the date annotator (easy).
We will not update the date node immediately. I would like to create another PR in this repo to define the concept of "Dataset" in data nodes and the endpoints needed to manage clinical notes and gold standard annotations (POST, GET).

…ut one note

tschaffter · 2020-11-04T20:20:01Z

Finally, one proposed change is give as input to the annotators ONE text or note instead of an array. Here are the motivations:

Stability: the input is one object rather than a potentially large array. Also, requests with a large body (array) can block / slow down the service.
Performance: ideally multiple instances of the service or "workers" will be spin up and multiple requests should be sent in parallel to them.

gkowalski · 2020-11-04T20:25:57Z

"Text" Implies a written source , https://www.dictionary.com/browse/text . Wouldn't it be more generic to call it "data" https://www.dictionary.com/browse/data? if we really need to change this? Also gets us into the arena of what is the character encoding supported for this data ? ASCII , UTF-8 , UTF-16 , etc. You mention this naming conversion is "The second motivation is for the date annotator to be used with texts that are not clinical notes" , but yet then go on to add LOINC codes ( a medical Ontology ) as the type of note ? Will these types support other non-medical data ?

As for the created_by and created_at fields . One may not see use for them as a tool developer , but would be thankful to have them when your looking downstream at a final result that does not make sense and then need to track the data back to the source system. We get many notes from various systems and pipelines into a Data Warehouse that are constantly being created and updated . They all have various note id's that overlap and conflict. Only want to keep track of this is a "created_by" name for the system / it came from and created_date for when it was last updated.

tschaffter · 2020-11-04T21:29:48Z

@gkowalski Here is feedback on the first part of your questions.

"Text" Implies a written source , https://www.dictionary.com/browse/text . Wouldn't it be more generic to call it "data" https://www.dictionary.com/browse/data?

I don't see the need to go down to "data" as so far we only consider "text". Data could be useful the day we support DICOM image, for example if one day we decide to add the deid of DICOM image metadata. Stopping at text for now seems reasonable.

Also gets us into the arena of what is the character encoding supported for this data ? ASCII , UTF-8 , UTF-16 , etc.

Adding an encoding property to the Text schema is something that I can definitively agree with.

but yet then go on to add LOINC codes ( a medical Ontology ) as the type of note ? Will these types support other non-medical data ?

I included this example in the spec: "loinc:LP29684-5". The idea behind this formatting is that it would allow us to support non-medical notes by replacing the prefix "loinc:".

tschaffter · 2020-11-04T23:08:35Z

The data node now has all the endpoints needed for now to manage Datasets.

npm run start openapi/data-node/openapi.yaml

tschaffter · 2020-11-04T23:12:37Z

Using Google Healthcare API as reference:
https://cloud.google.com/healthcare/docs/concepts/projects-datasets-data-stores#datasets_and_data_stores

tschaffter · 2020-11-05T00:08:07Z

The data node now has all the endpoints needed for now to manage AnnotationStores linked to a Dataset ID.

npm run start openapi/data-node/openapi.yaml

tschaffter · 2020-11-05T17:26:16Z

Issue regarding annotation and noteId

Background

The current Annotation schema has a property noteId. This property is important when storing annotation objects in a DB to be able to link an Annotation and Note object. This noteId was also used by an annotator to link an input note and an output annotation object. Having the annotator setting the noteId of the annotation objects is important because the input is an array of notes.

Issue

We are moving away from giving an array as input to the annotator. Therefore, the is no longer the need for the annotator to link the output annotation to the note. Instead, it is the responsibility of the user to link the note sent to the annotator and the annotations received from it.

Most importantly, we could not trust the value of Annotation.noteId set by the annotator. While I don't see how an incorrect noteId value could mess up with our system, this just add one degree of liberty for the annotator to do something wrong. By making the "user" / client responsible for linking the note sent and the received annotations, we lift a burden of the developers of the annotators.

tschaffter · 2020-11-05T17:48:58Z

Get all annotations in a store

Currently I defined the spec to returned an array of object that can be a mix of Annotation objects and DateAnnotation object. The way I do it seems correct according to this comment, however the HTML doc does not seem to be able to render this yet.

swagger-api/swagger-ui#3803

If this is really an issue, we could define specific endpoints to get only annotation of a given type.

Also it will be interesting to see how the openapi-generator handles this spec.

EDIT: The above is actually incorrect. The HTML documentation generated by redocly-cli does actually show that the output can be of mixed types.

…geMetadata

…dbox-schemas into add-text-schema

Health endpoint was deleted from commons in PR nlpsandbox#71

github-actions · 2021-11-19T00:59:19Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Add Text, make Note extend Text, update date-annotator to take as inp…

e072737

…ut one note

tschaffter self-assigned this Nov 4, 2020

The date annotator takes as input Text instead of Notes

eb8b578

tschaffter requested review from gkowalski, thomasyu888, yy6linda and boyleconnor November 4, 2020 20:10

tschaffter marked this pull request as draft November 4, 2020 22:38

tschaffter added 2 commits November 4, 2020 14:59

Add post and get all datasets

5977c5b

Add get one dataset by id and delete one dataset

13668f2

tschaffter removed request for gkowalski, boyleconnor, thomasyu888 and yy6linda November 4, 2020 23:07

Add annotation store

deb638b

tschaffter added 4 commits November 4, 2020 20:16

Add endpoints to manage annotations

8560b9e

Shave Annotation and DateAnnotation schemas from DB-related fields

7946afe

Add note stores and notes

5c1dcc4

Date annotator now takes one Text object as input

e42fac5

Add StoredAnnotation and StoredDateAnnotation schemas

12d1976

tschaffter marked this pull request as ready for review November 5, 2020 18:54

tschaffter added 22 commits November 12, 2020 20:49

Array in page now named according to the type of objects

09be54d

Successful post request now returns 201

1fe4a45

Note path comes ../fhir/Note

2df4b06

Now returning paginated annotation stores

8f8e60a

Update annotation schemas and endpoints

793b8fb

Add option to filter annotations to be returned by annotationType

c959791

Enable poster to set annotation id; rename PageResponse to ResponsePa…

327ae45

…geMetadata

Text annotation object is now a property of the annotation record

b36faab

Merge branch 'develop' into add-text-schema

33f80c7

Remove health endpoint from date annotator

fe0260d

Remove commons path health

1772729

Merge branch 'develop' into add-text-schema

f54ba33

Fix case sensitive issue

4abc830

Fix date annotator API

2462111

Use Google bundle schema approach for Annotation

0d6f1d9

Make name of dataset and stores readOnly

3fb838d

The poster no longer specify an annotationId

94bc009

Merge branch 'develop' into add-text-schema

0c3f4bf

Fix PageOfAnnotations

86d9e23

Add annotation examples

311e447

Merge branch 'add-text-schema' of github.com:Sage-Bionetworks/nlp-san…

0a6c76d

…dbox-schemas into add-text-schema

Replace PageOfAnnotations.items by PageOfAnnotations.annotations

7c4bae4

tschaffter changed the title ~~Add Text, make Note extend Text, update date-annotator to take as input one note~~ Add dataset, FHIR stores and annotations stores and related resources Nov 18, 2020

tschaffter merged commit 1c429ca into develop Nov 18, 2020

tschaffter deleted the add-text-schema branch November 18, 2020 01:30

boyleconnor pushed a commit to boyleconnor/nlp-sandbox-schemas that referenced this pull request Nov 18, 2020

Remove "Health" Endpoint

3a591f8

Health endpoint was deleted from commons in PR nlpsandbox#71

github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dataset, FHIR stores and annotations stores and related resources #71

Add dataset, FHIR stores and annotations stores and related resources #71

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 4, 2020

gkowalski commented Nov 4, 2020

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 4, 2020

tschaffter commented Nov 5, 2020 •

edited

Loading

tschaffter commented Nov 5, 2020

tschaffter commented Nov 5, 2020 •

edited

Loading

github-actions bot commented Nov 19, 2021

Add dataset, FHIR stores and annotations stores and related resources #71

Add dataset, FHIR stores and annotations stores and related resources #71

Conversation

tschaffter commented Nov 4, 2020 • edited Loading

Contributions of this PR

Summary

New Date Node API

Create a Dataset

Add a NoteStore to the Dataset

Add Notes to the NoteStore

Get notes from the NoteStore

Push and pull annotations

Date Annotator API Update

Initial PR description

If this PR is accepted

tschaffter commented Nov 4, 2020

gkowalski commented Nov 4, 2020

tschaffter commented Nov 4, 2020 • edited Loading

tschaffter commented Nov 4, 2020 • edited Loading

tschaffter commented Nov 4, 2020

tschaffter commented Nov 5, 2020 • edited Loading

tschaffter commented Nov 5, 2020

Issue regarding annotation and noteId

Background

Issue

tschaffter commented Nov 5, 2020 • edited Loading

Get all annotations in a store

github-actions bot commented Nov 19, 2021

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 4, 2020 •

edited

Loading

tschaffter commented Nov 5, 2020 •

edited

Loading

tschaffter commented Nov 5, 2020 •

edited

Loading