[onert] Python Bindings for Training #14505
Comments
Python Package Structure
The Training API, as an experimental API, needs to be provided as a separate module from the existing Python module.
- Package separation
- Binding modules
- Provided Python API
- Expected package structure
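The separation described above could be sketched as follows: a stable top-level package plus an experimental subpackage that isolates the training API. All module and attribute names here are assumptions for illustration, not the actual onert package layout.

```python
import types

# Hypothetical layout: a stable top-level package plus an
# `experimental` subpackage that isolates the training API.
onert = types.ModuleType("onert")                      # stable inference API
experimental = types.ModuleType("onert.experimental")  # unstable training API
onert.experimental = experimental

# Users of the stable API never import experimental names, so the
# training API can evolve without breaking inference users.
experimental.train_session = lambda model_path: f"training session for {model_path}"

session = onert.experimental.train_session("mobilenetv2.circle")
```

The point of the split is that nothing under the stable namespace references the experimental module, so the training bindings can change freely while inference stays compatible.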
I found out that an error occurs when setting I/O with a batch size different from the previously configured batch size.
$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Batch 6: Loss=0.0012
Error during nnfw_session::train_set_input : not supporeted to change tensorinfo
[ERROR] NNFW_STATUS_ERROR ONE/runtime/onert/api/nnfw/src/nnfw_session.cc Lines 1451 to 1456 in 5bffe08
To provide the functionality of training with a dataset, there are cases where the remaining (smaller) batch of the dataset needs to be processed. Therefore, we need to support cases where the batch size of the I/O differs. I think there are three approaches, including alternative solutions:
batched_inputs = []
for batch_start in range(0, self.num_samples, self.batch_size):
    batch_end = min(batch_start + self.batch_size, self.num_samples)
    # Collect batched inputs
    inputs_batch = [
        input_array[batch_start:batch_end] for input_array in self.inputs
    ]
    if batch_end - batch_start < self.batch_size:
        # Resize (pad) the last batch to match batch_size
        inputs_batch = [
            np.resize(batch, (self.batch_size, *batch.shape[1:]))
            for batch in inputs_batch
        ]
    batched_inputs.append(inputs_batch)
I prefer to use option 1.
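To make the padding behavior above concrete, here is a self-contained sketch with toy data (not the real loader): with 10 samples and a batch size of 4, the last batch holds only 2 samples, and np.resize fills it by repeating the array cyclically.

```python
import numpy as np

# Toy dataset: 10 samples, 3 features each; batch size 4 leaves a
# final partial batch of 2 samples.
num_samples, batch_size = 10, 4
inputs = [np.arange(num_samples * 3, dtype=np.float32).reshape(num_samples, 3)]

batched_inputs = []
for batch_start in range(0, num_samples, batch_size):
    batch_end = min(batch_start + batch_size, num_samples)
    inputs_batch = [a[batch_start:batch_end] for a in inputs]
    if batch_end - batch_start < batch_size:
        # np.resize repeats the data cyclically, so the last partial
        # batch is padded with wrapped-around samples.
        inputs_batch = [
            np.resize(b, (batch_size, *b.shape[1:])) for b in inputs_batch
        ]
    batched_inputs.append(inputs_batch)
```

After this loop every batch has the full shape (4, 3), so the session never sees a batch size different from the configured one; the trade-off is that the padded rows are duplicate samples, which slightly biases the last gradient step.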
I found that the accuracy measurement was not working, and I confirmed that obtaining output results during training does not work.
$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Train Loss: 0.0012
Batch 1: Loss=0.0012
Batch 2: Loss=0.0013
Validation Loss: 0.0012
CategoricalAccuracy: 0.0000
$ ./Product/x86_64-linux.release/out/bin/onert_train mobilenetv2.circle --load_expected:raw test-models/imagenet_a/test.output.100.bin --load_input:raw test-models/imagenet_a/test.input.100.bin --loss 1 --loss_reduction_type 1 --optimizer 1 --learning_rate 0.01 --epoch 5 --batch_size 10 --num_of_trainable_ops -1 --validation_split 0.2 --metric 0
Model Filename mobilenetv2.circle
== training parameter ==
- learning_rate = 0.01
- batch_size = 10
- loss_info = {loss = mean squared error, reduction = sum over batch size}
- optimizer = sgd
- num_of_trainable_ops = -1
========================
Epoch 1/5 - time: 769.609ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 2/5 - time: 795.052ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 3/5 - time: 724.606ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 4/5 - time: 712.029ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 5/5 - time: 721.155ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
===================================
MODEL_LOAD takes 7.5380 ms
PREPARE takes 268.6270 ms
EXECUTE takes 30332.8390 ms
- Epoch 1 takes 6156.8730 ms
- Epoch 2 takes 6360.4160 ms
- Epoch 3 takes 5796.8470 ms
- Epoch 4 takes 5696.2360 ms
- Epoch 5 takes 5769.2410 ms
===================================
We need to fix this problem.
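For reference, CategoricalAccuracy is just the fraction of samples whose argmax prediction matches the argmax label, so once outputs are obtainable during training, the reported metric can be cross-checked against a small NumPy sketch like this (assumed semantics, not the runtime's implementation):

```python
import numpy as np

def categorical_accuracy(y_pred, y_true):
    # Fraction of samples whose predicted class (argmax over the last
    # axis) matches the class of the one-hot label.
    pred_classes = np.argmax(y_pred, axis=-1)
    true_classes = np.argmax(y_true, axis=-1)
    return float(np.mean(pred_classes == true_classes))

# Toy check: 2 of the 3 predictions pick class 0, matching the labels.
preds = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([[1, 0], [1, 0], [1, 0]])
acc = categorical_accuracy(preds, labels)
```

A constant 0.0000 across all epochs, as in the logs above, usually means the metric is being fed empty or stale outputs rather than that the model truly never predicts the right class.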
Here is the result of running a sample example for training.
$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100 --optimizer adam --loss cce --learning_rate 0.01 --batch_size 10 --validation_split=0.2
Load data
== training parameter ==
- learning_rate = 0.01
- batch_size = 10
- loss_info = {loss = CategoricalCrossentropy, reduction = sum over batch size}
- optimizer = Adam
- num_of_trainable_ops = -1
========================
Epoch 1/5 - Train time: 602.569ms/step - IO time: 0.057ms/step - Train Loss: 10.7749 - Validation Loss: 10.1255 - CategoricalAccuracy: 0.0000
Epoch 2/5 - Train time: 634.849ms/step - IO time: 0.076ms/step - Train Loss: 6.1418 - Validation Loss: 12.0664 - CategoricalAccuracy: 0.0000
Epoch 3/5 - Train time: 634.984ms/step - IO time: 0.064ms/step - Train Loss: 5.7052 - Validation Loss: 14.5072 - CategoricalAccuracy: 0.0000
Epoch 4/5 - Train time: 642.914ms/step - IO time: 0.068ms/step - Train Loss: 5.4454 - Validation Loss: 15.3301 - CategoricalAccuracy: 0.0000
Epoch 5/5 - Train time: 672.250ms/step - IO time: 0.059ms/step - Train Loss: 6.6274 - Validation Loss: 17.4566 - CategoricalAccuracy: 0.0000
===================================
MODEL_LOAD takes 8.3949 ms
COMPILE takes 243.6544 ms
EXECUTE takes 26446.5046 ms
- Epoch 1 takes 5004.7149 ms
- Epoch 2 takes 5264.5105 ms
- Epoch 3 takes 5267.8122 ms
- Epoch 4 takes 5335.6967 ms
- Epoch 5 takes 5573.7703 ms
===================================
nnpackage mobilenetv2 trains successfully.
"IO time" is almost meaningless on its own; it would be better to merge it into "Train time".
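Merging the two counters could be as simple as timing each step once, around both the I/O setup and the train call; a minimal sketch (hypothetical helper, not the actual sample code):

```python
import time

def run_epoch(steps):
    # Time each step once, covering both I/O (setting inputs/outputs)
    # and the training call, and return the average ms per step.
    per_step_ms = []
    for step in steps:
        t0 = time.perf_counter()
        step()  # set inputs/outputs and run one training batch
        per_step_ms.append((time.perf_counter() - t0) * 1000.0)
    return sum(per_step_ms) / len(per_step_ms)
```

With a single per-step number, the epoch summary would report one "time: N ms/step" figure, matching what onert_train already prints.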
What
Let's introduce Python APIs for training.
Why
Currently, Python APIs for inference have been implemented using pybind11 (#11368). However, there are no Python APIs for training yet. To provide a better user experience, it seems necessary to introduce Python bindings for training APIs as well.
Draft : #14492