Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dataset, FHIR stores and annotations stores and related resources #71

Merged
merged 50 commits into from
Nov 18, 2020

Conversation

tschaffter
Copy link
Member

@tschaffter tschaffter commented Nov 4, 2020

Contributions of this PR

Summary

  • This PR introduces a major redesign of the Data Node API.
  • The Date Annotator API has also been slightly updated.

New Date Node API

The new API is inspired from Google Healthcare API that organizes data into three hierarchical levels: Project, Dataset, and Store. The Date Node API skips the Project level for simplicity and only keeps Dataset and Stores.

The paragraphs below describe how to push and get data, namely notes and annotations, using the Date Node API. In order to better understand the steps, we recommend that you clone the content of this PR and start the server used to preview the OpenAPI specification in HTML format. Once you have cloned this repo:

git fetch
git checkout -b add-text-schema origin/add-text-schema
npm run start openapi/data-node/openapi.yaml

Create a Dataset

First, you need to create a dataset that has a unique name in the entire data node.
http://localhost:8080/#operation/createDataset

Add a NoteStore to the Dataset

The next step is to add a data store to the dataset. Two types of stores are available: NoteStore and AnnotationStore.

Let's start by adding a NoteStore to our Dataset using the endpoint below. This endpoint takes one path parameter named datasetId. The value of this parameter must be set with the name of the dataset to use. The payload contains one property named name, which is the name of the store. The name of a store must be unique for a given dataset.
http://localhost:8080/#operation/createNoteStore

Add Notes to the NoteStore

Notes are added to a NoteStore using the endpoint below. This endpoints takes two path parameters: datasetId and storeId that take the name of the dataset and note store to use. The payload is a Note object with two properties: noteId (unique in a given note store) and text.
http://localhost:8080/#operation/createNote

Get notes from the NoteStore

You can retrieve the notes stored in the NoteStore with this endpoint. The results returned are paginated.
http://localhost:8080/#operation/listNotes

Push and pull annotations

The Data Node API currently has a second type of store: DataStore. So far two types of annotations can be pushed to an instance of this store: Annotation and DateAnnotation (extends Annotation).

Add an AnnotationStore to your dataset.
http://localhost:8080/#operation/createAnnotationStore

Add Annotation or DateAnnotation to this store.
http://localhost:8080/#operation/createAnnotation

Get all annotations (filters will be added soon). The results returned are paginated.
http://localhost:8080/#operation/listAnnotations

Date Annotator API Update

  • Previously, the input was an array of Notes. The input is now a Text object. It is still possible to pass a Note object because Note extends the Text object.
  • The output is an array of DateAnnotation objects.

TODO: Decide what to do with Annotation.noteId. Rename textId? :/

Initial PR description

This is an attempt to simplify the note schema and add a level of abstraction to target one day text that are not clinical notes.

First, this PR proposes to add a Text schema that has one string property called text. The idea is to pass to the date annotator a text object rather than a Note object. The motivation is that the Note object does not provide additional useful information that can help the date annotator in its task as of now. The second motivation is for the date annotator to be used with texts that are not clinical notes. Hence, the input to the date annotator becomes very similar to the input of Amazon Comprehend deid service.

image

Source

Another proposal is to temporarily at least get ride of the fields like "createdBy", "createdAt", "updatedBy", etc. These fields are meaningful for storing the record in a DB, but there is currently no need to expose this information to the user of most API. The data node should ideally continue to store "createdBy" value in the DB but API does not need to document this at this point. This will be revisited later.

Once again, the motivation is to simplify the API spec of the NLP tool. In the API documentation, it is more friendly to ask the input:

{
    "text": "This is a text."
}

Rather than the current large Note object that comes with "createdBy" etc. that are not useful to the tool developer.

Regarding the type of clinical note, I proposed to use the LOINC concept used by FHIR mentioned in this page:
FHIR Clinical Notes Guidance

Finally, one proposed change is give as input to the annotators ONE text or note instead of an array. Here are the motivations:

  • Simplification: it takes less lines of codes in the NLP tool to process one item than an array (we same the iteration code)
  • Closer to standard: It is possible to pass as input an array of object because we currently do it and this works. However, I believe that a best practice is that the body of a request should be one (json) object. A hint towards that direction is that 1) I never saw a requestBody that is an array in the OpenAPI doc, and 2) the openapi-generator uses the singular form for the name of the input parameter. For example, the parameter is named note in this function of the data annotator in both the Python and java implementation, while we currently treat it as an array.

Python: https://github.com/Sage-Bionetworks/nlp-sandbox-date-annotator-example/blob/develop/server/openapi_server/controllers/date_controller.py#L9
Java: https://github.com/nlpsandbox/date-annotator-example-java/blob/develop/server/src/main/java/org/openapitools/api/DatesApi.java#L60

If this PR is accepted

  1. I will update the date annotator (easy).
  2. We will not update the date node immediately. I would like to create another PR in this repo to define the concept of "Dataset" in data nodes and the endpoints needed to manage clinical notes and gold standard annotations (POST, GET).

@tschaffter tschaffter self-assigned this Nov 4, 2020
@tschaffter
Copy link
Member Author

Finally, one proposed change is give as input to the annotators ONE text or note instead of an array. Here are the motivations:

  • Stability: the input is one object rather than a potentially large array. Also, requests with a large body (array) can block / slow down the service.
  • Performance: ideally multiple instances of the service or "workers" will be spin up and multiple requests should be sent in parallel to them.

@gkowalski
Copy link
Member

"Text" Implies a written source , https://www.dictionary.com/browse/text . Wouldn't it be more generic to call it "data" https://www.dictionary.com/browse/data? if we really need to change this? Also gets us into the arena of what is the character encoding supported for this data ? ASCII , UTF-8 , UTF-16 , etc. You mention this naming conversion is "The second motivation is for the date annotator to be used with texts that are not clinical notes" , but yet then go on to add LOINC codes ( a medical Ontology ) as the type of note ? Will these types support other non-medical data ?

As for the created_by and created_at fields . One may not see use for them as a tool developer , but would be thankful to have them when your looking downstream at a final result that does not make sense and then need to track the data back to the source system. We get many notes from various systems and pipelines into a Data Warehouse that are constantly being created and updated . They all have various note id's that overlap and conflict. Only want to keep track of this is a "created_by" name for the system / it came from and created_date for when it was last updated.

@tschaffter
Copy link
Member Author

tschaffter commented Nov 4, 2020

@gkowalski Here is feedback on the first part of your questions.

"Text" Implies a written source , https://www.dictionary.com/browse/text . Wouldn't it be more generic to call it "data" https://www.dictionary.com/browse/data?

I don't see the need to go down to "data" as so far we only consider "text". Data could be useful the day we support DICOM image, for example if one day we decide to add the deid of DICOM image metadata. Stopping at text for now seems reasonable.

Also gets us into the arena of what is the character encoding supported for this data ? ASCII , UTF-8 , UTF-16 , etc.

Adding an encoding property to the Text schema is something that I can definitively agree with.

but yet then go on to add LOINC codes ( a medical Ontology ) as the type of note ? Will these types support other non-medical data ?

I included this example in the spec: "loinc:LP29684-5". The idea behind this formatting is that it would allow us to support non-medical notes by replacing the prefix "loinc:".

@tschaffter tschaffter marked this pull request as draft November 4, 2020 22:38
@tschaffter
Copy link
Member Author

tschaffter commented Nov 4, 2020

The data node now has all the endpoints needed for now to manage Datasets.

npm run start openapi/data-node/openapi.yaml

image

@tschaffter
Copy link
Member Author

@tschaffter
Copy link
Member Author

tschaffter commented Nov 5, 2020

The data node now has all the endpoints needed for now to manage AnnotationStores linked to a Dataset ID.

npm run start openapi/data-node/openapi.yaml

image

@tschaffter
Copy link
Member Author

Issue regarding annotation and noteId

Background

The current Annotation schema has a property noteId. This property is important when storing annotation objects in a DB to be able to link an Annotation and Note object. This noteId was also used by an annotator to link an input note and an output annotation object. Having the annotator setting the noteId of the annotation objects is important because the input is an array of notes.

Issue

We are moving away from giving an array as input to the annotator. Therefore, the is no longer the need for the annotator to link the output annotation to the note. Instead, it is the responsibility of the user to link the note sent to the annotator and the annotations received from it.

Most importantly, we could not trust the value of Annotation.noteId set by the annotator. While I don't see how an incorrect noteId value could mess up with our system, this just add one degree of liberty for the annotator to do something wrong. By making the "user" / client responsible for linking the note sent and the received annotations, we lift a burden of the developers of the annotators.

@tschaffter
Copy link
Member Author

tschaffter commented Nov 5, 2020

Get all annotations in a store

Currently I defined the spec to returned an array of object that can be a mix of Annotation objects and DateAnnotation object. The way I do it seems correct according to this comment, however the HTML doc does not seem to be able to render this yet.

swagger-api/swagger-ui#3803

If this is really an issue, we could define specific endpoints to get only annotation of a given type.

Also it will be interesting to see how the openapi-generator handles this spec.

EDIT: The above is actually incorrect. The HTML documentation generated by redocly-cli does actually show that the output can be of mixed types.

image

@tschaffter tschaffter marked this pull request as ready for review November 5, 2020 18:54
@tschaffter tschaffter changed the title Add Text, make Note extend Text, update date-annotator to take as input one note Add dataset, FHIR stores and annotations stores and related resources Nov 18, 2020
@tschaffter tschaffter merged commit 1c429ca into develop Nov 18, 2020
@tschaffter tschaffter deleted the add-text-schema branch November 18, 2020 01:30
boyleconnor pushed a commit to boyleconnor/nlp-sandbox-schemas that referenced this pull request Nov 18, 2020
Health endpoint was deleted from commons in PR nlpsandbox#71
@github-actions
Copy link

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation Data Node OpenAPI specification with IBM OpenAPI Validator
3 participants