
Training and Testing

Arya Massarat edited this page Apr 1, 2021 · 9 revisions

Introduction

Based on the input given to it in config.yml, the pipeline can be used to either

  1. Train a classifier to recognize plant species
  2. Apply a pre-trained classification model
  3. Test a pre-trained classification model
  4. Do both #2 and #3 at the same time
  5. Do #2 using the model from #1
  6. Do #3 using the model from #1
  7. Do both #5 and #6 at the same time

This is all done through two different configuration parameters in config.yml: truth and model.

Truth sets

What is a truth set?

A "truth set" is a set of manually created labels for each segment in a dataset. More specifically, a truth set for a dataset of segmented plants would consist of an integer for each segmented plant indicating its species type (called a species label). For example, if buckwheat is the only species of flowering plant in a dataset of 30 segmented plants, then the truth set for that dataset would consist of 30 species labels, each either 0 for "non-flowering" or 1 for "buckwheat". You will need to create a truth set for each dataset that you would like to train or test.

What is the format of a truth set?

A truth set is a tab-delimited file containing only two columns: 1) the ID of each segment and 2) the species label. Both columns should be integers. The segment IDs should correspond with those in the json segments file or in the segments map.
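
For example, a truth set for three segments, where only segment 2 is buckwheat, might look like this (the IDs and labels here are purely illustrative; columns are separated by tabs):

```
1	0
2	1
3	0
```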

the truth config option

You can specify truth sets for one or more of your datasets via the truth config option in config.yml.

The truth config option should be a dictionary of key-value pairs, where the keys can be any dataset IDs from the samples.tsv file and the values are themselves dictionaries containing two key-value pairs:

  1. path (required) - the path to the truth set file
  2. train_all (optional) - a boolean denoting whether to use all of the data for training or half for training and the other half for testing
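
For example, the truth option for two hypothetical datasets, region1 and region2, might look roughly like this in config.yml (the dataset IDs and paths are illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
    train_all: true
  region2:
    path: data/region2-truth.tsv
```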

Creating a truth set

Assuming your output directory is out and you'd like to create a truth set for a dataset with ID region1:

  1. Create a map of the segmented plants in region1:
    run.bash -U out/region1/segments-map-exp.tiff
    
    Or, if you are running the traditional strategy instead of the experimental one:
    run.bash -U out/region1/segments-map.tiff
    
    Each segmented plant in this segments map should be labeled by its segment ID from the json segments file.
  2. Open the map in an image viewer and write down the species label of each segment in a truth set file:
    vim out/region1/truth.tsv
    
  3. Provide the path to the truth set file in your config.yml

Models

What is a model?

A model.rda file contains a pre-trained random forest classifier. This classifier is used by the pipeline to classify plants by their species. It can be applied to (or tested on) any other dataset so long as the features are the same.

the model config option

This is the path to a pre-trained model, which will be used by the pipeline to predict the species of the segments in your datasets.
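
For example, assuming you have copied a trained model into the recommended data/ directory (the file name here is illustrative):

```yaml
model: data/model.rda
```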

Creating a model

Follow these directions to create a trained classifier. The model will appear in the output. You should copy the model somewhere else (like the recommended data/ directory) before using it in your config file.

Scenarios

1. Train a classifier to recognize plant species for the region1 dataset

  1. Create a truth set for region1
  2. Set region1's train_all config option to true
  3. Comment out the model config option
  4. Tell the pipeline to create the model
    run.bash -U out/region1/train-exp/model.rda
    
    Or, if you are running the traditional strategy instead of the experimental one:
    run.bash -U out/region1/train/model.rda
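
Steps 1–3 above correspond to a config.yml that looks roughly like this (the truth set path is illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
    train_all: true
# model: data/model.rda  # commented out so the pipeline trains a new model
```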
    

2. Apply a pre-trained classification model to the region1 dataset

  1. Specify the path to the model in the model config option
  2. Comment out the truth config option for region1
  3. Run the pipeline
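
A config.yml for this scenario might look roughly like this (the model path is illustrative):

```yaml
model: data/model.rda
# truth:              # commented out for region1, so no testing is performed
#   region1:
#     path: data/region1-truth.tsv
```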

3. Test a pre-trained classification model on the region1 dataset

  1. Create a truth set for region1
  2. Provide the pre-trained model in the model config option
  3. Run the pipeline
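
A config.yml for this scenario might look roughly like this (both paths are illustrative):

```yaml
truth:
  region1:
    path: data/region1-truth.tsv
model: data/model.rda
```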

Outputs

Training

The output of the training steps will appear in train/ (or train-exp/ if running the experimental strategy). There are a few files in this directory:

  1. model.rda - the trained classifier
  2. training_data.tsv - the data that was used to train the model
  3. variable_importance.tsv - the random forest importance of each feature in the training data

Testing

The output of the test steps will appear in test/ (or test-exp/ if running the experimental strategy). There are a few files in this directory:

  1. results.pdf - a precision/recall curve for the classifier
  2. statistics.tsv - the points of the precision/recall curve
  3. metrics.tsv - some performance metrics for the classifier: precision, recall, F-beta score, AUROC, average precision, and support
  4. testing_data - the data that was used to test the model
  5. classify/ - the species labels and probabilities of every segment in the test data