MT-CNN is a multitask convolutional neural network (CNN) for natural language processing (NLP) and information extraction from free-form text. This model extracts information from cancer pathology reports.
The intended users are data scientists interested in classifying free-form text (such as pathology reports, clinical trials, abstracts, and so on).
Data scientists can train the provided untrained model on their own data, or use the trained model to classify the provided test samples. The provided scripts use pathology reports that have been downloaded from the Genomic Data Commons (GDC), converted to text format, cleaned, and preprocessed. Here is an example report.
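For readers preparing their own reports, the sketch below shows one common way to turn cleaned text into fixed-length integer sequences that a CNN can consume. It is an illustration only: the tokenization rule, the reserved ids `PAD` and `UNK`, and the function names `tokenize`, `build_vocab`, and `encode` are assumptions made for this example and are not taken from the provided scripts.

```python
import re
from collections import Counter

# Reserved token ids (an assumed convention, not from the provided scripts).
PAD, UNK = 0, 1


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(reports, min_count=1):
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for report in reports for tok in tokenize(report))
    vocab = {}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(text, vocab, max_len):
    """Convert text to exactly max_len ids, truncating or right-padding."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Two made-up snippets standing in for cleaned report text.
    reports = [
        "Invasive ductal carcinoma, left breast.",
        "Adenocarcinoma of the colon.",
    ]
    vocab = build_vocab(reports)
    print(encode(reports[0], vocab, max_len=8))
```

Padding every report to the same length keeps batches rectangular; in practice `max_len` would be chosen from the length distribution of the training reports.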
Classification of unstructured text is a classical problem in natural language processing. The community has developed state-of-the-art models such as BERT, BioBERT, and the Transformer. This model has the advantage of handling relatively long reports (that is, over 400 words) and shows robustness in terms of accuracy and speed with a relatively small number of unstructured pathology reports.
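To make the multitask idea concrete, here is a minimal pure-Python sketch of a forward pass through a text CNN with one shared encoder and one classification head per task. It is not the released model: the filter widths, layer sizes, random weights, and the task names `site` and `laterality` are hypothetical values chosen for illustration, and the actual topology is defined by the files in MoDaC.

```python
import math
import random


def conv_maxpool(embedded, filters, width):
    """Slide each filter over windows of `width` tokens, apply ReLU, max over time."""
    n_windows = len(embedded) - width + 1
    features = []
    for filt in filters:  # filt is a flat list of width * emb_dim weights
        best = 0.0  # ReLU floor: max over time of ReLU(x) equals max(0, max x)
        for start in range(n_windows):
            window = [x for vec in embedded[start:start + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(filt, window)))
        features.append(best)
    return features


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mtcnn_forward(token_ids, params):
    """Shared embedding and convolutions, then one softmax head per task."""
    embedded = [params["embedding"][i] for i in token_ids]
    shared = []
    for width, filters in params["convs"].items():
        shared.extend(conv_maxpool(embedded, filters, width))
    return {
        task: softmax([sum(w * f for w, f in zip(row, shared)) for row in head])
        for task, head in params["heads"].items()
    }


def init_params(vocab_size, emb_dim, widths, n_filters, tasks, seed=0):
    """Random weights for the sketch; a real model would load trained weights."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    n_shared = n_filters * len(widths)
    return {
        "embedding": [rand(emb_dim) for _ in range(vocab_size)],
        "convs": {w: [rand(w * emb_dim) for _ in range(n_filters)] for w in widths},
        "heads": {t: [rand(n_shared) for _ in range(n_cls)] for t, n_cls in tasks.items()},
    }


if __name__ == "__main__":
    # Hypothetical tasks and class counts, for illustration only.
    params = init_params(vocab_size=20, emb_dim=4, widths=(3, 4, 5),
                         n_filters=2, tasks={"site": 3, "laterality": 2})
    print(mtcnn_forward([2, 3, 4, 5, 6, 7, 0, 0], params))
```

Because max-over-time pooling reduces each filter to a single value, the size of the shared feature vector does not depend on report length, which is one reason this style of model copes with long documents at modest cost.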
The following components are in the Model and Data Clearinghouse (MoDaC):
- The ML Ready Pathology Reports dataset contains the original data used for training, validation, and testing.
- The MultiTask Convolutional Neural Network (MT-CNN) dataset contains the trained model weights and topology to be used in inference.
Refer to this README.
Biomedical Sciences, Engineering, and Computing (BSEC) Group; Computer Sciences and Engineering Division; Oak Ridge National Laboratory