
Training and Testing

Arya Massarat edited this page Apr 1, 2021 · 9 revisions

Introduction

Based on the input given to it in config.yml, the pipeline can be used to either

  1. Train a classifier to recognize plant species
  2. Apply a pre-trained classification model
  3. Test a pre-trained classification model
  4. Do both #2 and #3 at the same time
  5. Do #2 using the model from #1
  6. Do #3 using the model from #1
  7. Do both #5 and #6 at the same time

This is all done through two different configuration parameters in config.yml: truth and model.

Truth sets

What is a truth set?

A "truth set" is a set of manually created labels for each segment in a dataset. More specifically, a truth set for a dataset of segmented plants would consist of an integer for each segmented plant indicating its species type (called a species label). For example, if buckwheat is the only species of flowering plant in a dataset of 30 segmented plants, then the truth set for that dataset would consist of 30 species labels, each either 0 for "non-flowering" or 1 for "buckwheat". You will need to create a truth set for each dataset that you would like to train or test.

What is the format of a truth set?

A truth set is a tab-delimited file containing only two columns: 1) the ID of each segment and 2) the species label. Both columns should be integers. The segment IDs should correspond with those in the json segments file or in the segments map.
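
For example, a truth set for three segments, where only segment 2 is buckwheat, might look like this (the IDs and labels here are purely illustrative; columns are separated by tabs):

```
1	0
2	1
3	0
```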

the truth config option

You can specify truth sets for one or more of your datasets via the truth config option in config.yml.

The truth config option should be a dictionary of key-value pairs, where the keys can be any dataset IDs from the samples.tsv file and the values are themselves dictionaries containing two key-value pairs:

  1. path (required) - the path to the truth set file
  2. train_all (optional) - a boolean denoting whether to use all of the data for training or half for training and the other half for testing
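
For example, the truth option for two hypothetical datasets, region1 and region2, might look roughly like this in config.yml (the dataset IDs and paths are illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
    train_all: true
  region2:
    path: data/region2-truth.tsv
```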

Creating a truth set

Assuming your output directory is out and you'd like to create a truth set for a dataset with ID region1:

  1. Create a map of the segmented plants in region1:
    run.bash -U out/region1/segments-map-exp.tiff
    
    Or, if you are running the traditional strategy instead of the experimental one:
    run.bash -U out/region1/segments-map.tiff
    
    Each segmented plant in this segments map should be labeled by its segment ID from the json segments file.
  2. Open the map in an image viewer and write down the species label of each segment in a truth set file:
    vim out/region1/truth.tsv
    
  3. Provide the path to the truth set file in your config.yml

Models

What is a model?

A model.rda file contains a pre-trained random forest classifier. This classifier is used by the pipeline to classify plants by their species. It can be applied to (or tested on) any other dataset so long as the features are the same.

the model config option

This is the path to a pre-trained model, which will be used by the pipeline to predict the species of the segments in your datasets.
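
For example, assuming you have copied a trained model into the recommended data/ directory (the file name here is illustrative):

```yaml
model: data/model.rda
```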

Creating a model

Follow these directions to create a trained classifier. The model will appear in the output. You should copy the model somewhere else (like the recommended data/ directory) before using it in your config file.

Scenarios

1. Train a classifier to recognize plant species for the region1 dataset

  1. Create a truth set for region1
  2. Set region1's train_all config option to true
  3. Comment out the model config option
  4. Tell the pipeline to create the model
    run.bash -U out/region1/train-exp/model.rda
    
    Or, if you are running the traditional strategy instead of the experimental one:
    run.bash -U out/region1/train/model.rda
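
Steps 1–3 above correspond to a config.yml that looks roughly like this (the truth set path is illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
    train_all: true
# model: data/model.rda  # commented out so the pipeline trains a new model
```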
    

2. Apply a pre-trained classification model to the region1 dataset

  1. Specify the path to the model in the model config option
  2. Comment out the truth config option for region1
  3. Run the pipeline
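
A config.yml for this scenario might look roughly like this (the model path is illustrative):

```yaml
model: data/model.rda
# truth:              # commented out for region1, so no testing is performed
#   region1:
#     path: data/region1-truth.tsv
```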

3. Test a pre-trained classification model on the region1 dataset

  1. Create a truth set for region1
  2. Provide the pre-trained model in the model config option
  3. Run the pipeline
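
A config.yml for this scenario might look roughly like this (both paths are illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
model: data/model.rda
```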

Outputs

Training

The output of the training steps will appear in train/ (or train-exp/ if running the experimental strategy). There are a few files in this directory:

  1. model.rda - the trained classifier
  2. training_data.tsv - the data that was used to train the model
  3. variable_importance.tsv - the random forest importance of each feature in the training data

Testing

The output of the test steps will appear in test/ (or test-exp/ if running the experimental strategy). There are a few files in this directory:

  1. results.pdf - a precision/recall curve for the classifier
  2. statistics.tsv - the points of the precision/recall curve
  3. metrics.tsv - some performance metrics for the classifier: precision, recall, F-beta score, AUROC, average precision, and support
  4. testing_data - the data that was used to test the model
  5. classify/ - the species labels and probabilities of every segment in the test data