# Replication, Retraining, and Improving the Model Training

In this section we explain the steps and tools used to choose hyperparameters, train the model, and generate the database. If you'd like to skip the details, see the Quick Start subsection below, which covers the minimal steps necessary to recreate OnSIDES from scratch.

## Prerequisites

In addition to the cloned repository, a `data` subdirectory is required that contains three pieces of data:

1. A file that maps MedDRA preferred terms to lower level terms.
2. The manual annotations from the Demner-Fushman, et al. paper and TAC.
3. The TAC SPL labels in XML format with the Adverse Reactions, Boxed Warnings, and Warnings and Precautions sections parsed.

For your convenience, an example data directory containing the minimum requirements is available for download.
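As a purely illustrative sketch, that directory might be organized like this; the file and folder names below are hypothetical assumptions, and the actual contents of the release archive are authoritative:

```
data/
├── meddra_llt_pt_map.txt    # hypothetical: MedDRA preferred term -> lower level term map
├── manual_annotations/      # hypothetical: Demner-Fushman, et al. and TAC annotations
└── tac_labels/              # hypothetical: parsed TAC SPL label XML files
```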

Model training and evaluation are handled through a helper script named `experiment_tracker.py`. There are several steps in the model training and evaluation pipeline, each with its own set of parameter options. The Experiment Tracker makes it straightforward to manage this process.

## Quick Start

```bash
# Setup
wget https://github.com/tatonetti-lab/onsides/archive/refs/tags/v2.0.0.tar.gz
tar -xvzf v2.0.0.tar.gz
cd onsides-2.0.0
wget https://github.com/tatonetti-lab/onsides/releases/download/v2.0.0/data.zip
unzip data.zip
python3 -m pip install -r requirements.txt

# Train model for ADVERSE REACTIONS section
python3 src/experiment_tracker.py --id v2.0.0-AR | bash

# Train model for BOXED WARNINGS section
# The BW section uses a model pre-trained on ALL sections, which is built in Experiment 4B
python3 src/experiment_tracker.py --id 4B | bash
# Fine-tune the pre-trained model for BOXED WARNINGS
python3 src/experiment_tracker.py --id v2.0.0-BW | bash

# Download all available prescription Structured Product Labels (SPLs)
python3 src/spl_processor.py --full

# Apply model to downloaded labels to identify ADRs from ADVERSE REACTIONS sections
python3 src/deployment_tracker.py --release v2.0.0-AR | bash

# Apply model to downloaded labels to identify ADRs from BOXED WARNINGS sections
python3 src/deployment_tracker.py --release v2.0.0-BW | bash

# Build database files
python3 src/build_onsides.py --vocab ./data/omop/vocab_5.4 --release v2.0.0
```

## Replication of hyperparameter optimization experiments

Model training consists of four steps:

1. constructing the training data (`construct_training_data.py`),
2. fitting the BERT model (`fit_clinicalbert.py`),
3. generating probabilities for the example sentence fragments (`analyze_results.py`), and
4. aggregating the probabilities across sentence fragments at the adverse event term level (`compile_results.py`).

The Experiment Tracker (`experiment_tracker.py`) keeps track of this entire process and of which commands still need to be run to complete an experiment. Experiments are managed by editing the experiment entries in the `experiments.json` file. Each entry specifies the parameters to be explored; any parameters not specified are assumed to take their default values.
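For illustration only, an entry might look like the sketch below. The field and parameter names here are assumptions rather than the repository's actual schema; consult the existing entries in `experiments.json` for the authoritative format.

```json
{
    "11": {
        "name": "Hypothetical example experiment",
        "description": "Illustrative only: sweep parameters across two pipeline steps.",
        "construct_training_data": {
            "refset": [1, 2]
        },
        "fit_clinicalbert": {
            "max-length": [128, 256]
        }
    }
}
```

In this sketch, lists indicate values to explore, and any omitted parameters fall back to their defaults.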

To check the status of an experiment, run the script with the experiment identifier. For example:

```bash
python3 src/experiment_tracker.py --id 0
```

If any steps of the experiment are incomplete, the script will print a list of bash commands to standard output that can be used to complete the experiment. For example, you could pipe those commands directly to bash:

```bash
python3 src/experiment_tracker.py --id 0 | bash
```

If running on a GPU-enabled machine, it may be beneficial to specify which GPU to use. The experiment tracker can take care of this for you automatically through the `CUDA_VISIBLE_DEVICES` environment variable, which is set with the `--gpu` flag. For example:

```bash
python3 src/experiment_tracker.py --id 0 --gpu 1
```
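You can achieve the same effect manually by prefixing a command with the environment variable yourself. A minimal sketch (the actual training command and its flags are elided):

```bash
# Restrict the process to GPU 1; CUDA renumbers the visible devices,
# so the process sees exactly one GPU (as device 0).
CUDA_VISIBLE_DEVICES=1 python3 src/fit_clinicalbert.py ...
```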

You can monitor the status of all experiments using the `--all` flag:

```bash
python3 src/experiment_tracker.py --all
```

## Training the Deployment Models for release

The `experiments.json` file can also be used to manage deployments. A deployment entry works the same way as an experiment entry, except that only one set of parameters is used (see the sketch below). We used the results of Experiments 1 through 10 to decide the deployment parameters. Each experiment has a corresponding notebook in `notebooks` with a description of the experiment and an interpretation of the results.
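Continuing the same hypothetical schema as the experiment sketch above, a deployment entry would pin each parameter to a single value rather than a list of values to explore:

```json
{
    "v2.0.0-AR": {
        "name": "Hypothetical deployment entry for the ADVERSE REACTIONS model",
        "construct_training_data": {
            "refset": 1
        },
        "fit_clinicalbert": {
            "max-length": 128
        }
    }
}
```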

## Evaluation

Each experiment has a corresponding Jupyter notebook for evaluation (see the `notebooks` subdirectory). The files and parameters necessary to run a notebook are saved in the `analysis.json` file, which is generated automatically by the experiment tracker once an experiment is complete and should not be edited directly. The notebooks are essentially identical; the only difference is which experiment is being evaluated. Therefore, to add a notebook for a new experiment, copy an existing notebook, rename it, and edit the experiment ID in the third code block (see the sketch below). Each notebook prints ROC and PR curves as well as a table of summary performance statistics.
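A minimal sketch of the copy step, assuming hypothetical notebook file names (check the `notebooks` directory for the actual naming convention):

```bash
# Hypothetical file names: copy an existing evaluation notebook for a new
# experiment 11, then edit the experiment ID in its third code block.
cp notebooks/Experiment-10.ipynb notebooks/Experiment-11.ipynb
```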