Training and Testing
Based on the input given in `config.yml`, the pipeline can be used to do any of the following:

1. Train a classifier to recognize plant species
2. Apply a pre-trained classification model
3. Test a pre-trained classification model
4. Do both #2 and #3 at the same time
5. Do #2 using the model from #1
6. Do #3 using the model from #1
7. Do both #5 and #6 at the same time
This is all done through two different configuration parameters in `config.yml`: `truth` and `model`.
A "truth set" is a set of manually created labels for each segment in a dataset. More specifically, a truth set for a dataset of segmented plants would consist of an integer for each segmented plant indicating its species type (called a species label). For example, if buckwheat is the only species of flowering plant in a dataset of 30 segmented plants, then the truth set for that dataset would consist of 30 species labels, each either 0
for "non-flowering" or 1
for "buckwheat". You will need to create a truth set for each dataset that you would like to train or test.
A truth set is a tab-delimited file containing only two columns: 1) the ID of each segment and 2) the species label. Both columns should be integers. The segment IDs should correspond with those in the json segments file or in the segments map.
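For illustration, a truth set for three segments might look like the following (the segment IDs and labels here are made up, and the two columns are separated by a tab):

```
1	0
2	1
3	0
```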
You can specify truth sets for one or more of your datasets via the `truth` config option in `config.yml`.
The `truth` config option should be a dictionary of key-value pairs, where the keys can be any dataset IDs from the `samples.tsv` file and the values are themselves dictionaries containing two key-value pairs (see the example below):

- `path` (required) - the path to the truth set file
- `train_all` (optional) - a boolean denoting whether to use all of the data for training or half for training and the other half for testing
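As a rough sketch (the dataset ID and file path are only examples), the `truth` option might look something like this in `config.yml`:

```yaml
truth:
  region1:
    path: data/truth/region1.tsv   # tab-delimited segment IDs and species labels
    train_all: false               # train on half of the data, test on the other half
```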
Assuming your output directory is `out` and you'd like to create a truth set for a dataset with ID `region1`:

- Create a map of the segmented plants in `region1`:

  `run.bash -U out/region1/segments-map-exp.tiff`

  Or, if you are running the traditional strategy instead of the experimental one:

  `run.bash -U out/region1/segments-map.tiff`

  Each segmented plant in this segments map should be labeled by its segment ID from the json segments file.
- Open the map in an image viewer and write down the species label of each segment in a truth set file:

  `vim out/region1/truth.tsv`

- Provide the path to the truth set file in your `config.yml`
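For this walkthrough, the last step might end up looking like the following in `config.yml` (a sketch; the path is just the file created above):

```yaml
truth:
  region1:
    path: out/region1/truth.tsv
```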
A `model.rda` file contains a pre-trained random forest classifier. This classifier is used by the pipeline to classify plants by their species. It can be applied to (or tested on) any other dataset so long as the features are the same.
The `model` config option is the path to a pre-trained model, which will be used by the pipeline to predict the species of the segments in your datasets.
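As a sketch (the filename and location are only examples), this is a single path in `config.yml`:

```yaml
model: data/model.rda   # pre-trained random forest classifier
```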
Follow the directions below to create a trained classifier. The model will appear in the output. You should copy the model somewhere else (like the recommended `data/` directory) before using it in your config file.
- Create a truth set for `region1`
- Set `region1`'s `train_all` config option to true
- Comment out the `model` config option
- Tell the pipeline to create the model:

  `run.bash -U out/region1/train-exp/model.rda`

  Or, if you are running the traditional strategy instead of the experimental one:

  `run.bash -U out/region1/train/model.rda`
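Putting those steps together, the relevant portion of `config.yml` for a training run might look something like this (the paths are hypothetical, carried over from the earlier examples):

```yaml
truth:
  region1:
    path: out/region1/truth.tsv
    train_all: true       # use all of region1's data for training
# model: data/model.rda   # commented out so the pipeline trains a new model
```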
To apply the trained model to a dataset:

- Specify the path to the model in the `model` config option
- Comment out the `truth` config option for `region1`
- Run the pipeline
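In other words, the config for this mode might look roughly like this (again, a sketch with hypothetical paths):

```yaml
# truth:
#   region1:
#     path: out/region1/truth.tsv   # commented out: region1 is only classified
model: data/model.rda               # the model copied from the training output
```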
To test the trained model:

- Create a truth set for `region1`
- Provide the pre-trained model in the `model` config option
- Run the pipeline
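A sketch of the corresponding config (with hypothetical paths):

```yaml
truth:
  region1:
    path: out/region1/truth.tsv   # labels to compare predictions against
model: data/model.rda             # the pre-trained classifier to evaluate
```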
The output of the training steps will appear in `train/` (or `train-exp/` if running the experimental strategy). There are a few files in this directory:

- `model.rda` - the trained classifier
- `training_data.tsv` - the data that was used to train the model
- `variable_importance.tsv` - the random forest importance of each feature in the training data
The output of the test steps will appear in `test/` (or `test-exp/` if running the experimental strategy). There are a few files in this directory:

- `results.pdf` - a precision/recall curve for the classifier
- `statistics.tsv` - the points of the precision/recall curve
- `metrics.tsv` - some performance metrics for the classifier: precision, recall, f-beta, auroc, avg pr, support
- `testing_data` - the data that was used to test the model
- `classify/` - the species labels and probabilities of every segment in the test data