[onert] Python Bindings for Training #14505

Open
ragmani opened this issue Dec 30, 2024 · 4 comments

Comments

@ragmani
Contributor

ragmani commented Dec 30, 2024

What

Let's introduce Python APIs for training.

Why

Python APIs for inference have already been implemented using pybind11 (#11368), but there are no Python APIs for training yet. To provide a better user experience, we need to introduce Python bindings for the training APIs as well.

Draft: #14492

@ragmani
Contributor Author

ragmani commented Dec 30, 2024

Python Package Structure

Since the training API is experimental, it needs to be provided as a module separate from the existing Python module.

Package Separation

  • Current: A single onert package
  • Proposed Change: Maintain the current single package (retain the .so components as they are)
  • Reason: Separating packages would make it difficult to manage dependencies between them.

Binding Modules

  • Current: Infer module
  • Proposed Change: Infer and Experimental submodules
  • Reason: To provide experimental APIs in a separate module

Provided Python API

  • Current: infer["session", "tensorinfo"]
  • Proposed Change: infer["session"], tensorinfo, train["session", "traininfo", "DataLoader", "optimizer", "losses", "metrics"]
  • Reason: To provide training APIs

Expected Package Structure

['onert-0.2.0.data/purelib/onert/__init__.py',
 'onert-0.2.0.data/purelib/onert/common/__init__.py',
 'onert-0.2.0.data/purelib/onert/common/basesession.py',
 'onert-0.2.0.data/purelib/onert/infer/__init__.py',
 'onert-0.2.0.data/purelib/onert/infer/session.py',
 'onert-0.2.0.data/purelib/onert/native/libnnfw_api_pybind.so',
 'onert-0.2.0.data/purelib/onert/native/libonert.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/libonert_core.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_cpu.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_ruy.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_train.so',
 'onert-0.2.0.data/purelib/onert/train/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/dataloader.py',
 'onert-0.2.0.data/purelib/onert/train/session.py',
 'onert-0.2.0.data/purelib/onert/train/losses/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/losses/cce.py',
 'onert-0.2.0.data/purelib/onert/train/losses/loss.py',
 'onert-0.2.0.data/purelib/onert/train/losses/mse.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/categorical_accuracy.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/metric.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/registry.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/adam.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/optimizer.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/sgd.py',
... ]
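
For illustration, the layout above would be consumed roughly as follows. Every name here (infer.session, train.session, train.DataLoader, and the constructor arguments) is an assumption derived from the file structure, not a finalized API:

    # Sketch only; names and signatures are tentative, based on the proposed layout above.
    from onert import infer, train

    # Existing inference API keeps its current import path.
    inference_session = infer.session("mobilenetv2.circle", "cpu")

    # Experimental training APIs live in the separate `train` submodule.
    training_session = train.session("mobilenetv2.circle")
    loader = train.DataLoader(input_path, expected_path, batch_size=10)  # hypothetical signature

Keeping everything inside the single onert package means the train submodule can reuse the shared native libraries under onert/native without cross-package dependency management.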

@ragmani
Contributor Author

ragmani commented Jan 3, 2025

I found that an error occurs when setting I/O with a batch size different from the batch size configured during the train_prepare stage.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Batch 6: Loss=0.0012
Error during nnfw_session::train_set_input : not supporeted to change tensorinfo
[ERROR]	NNFW_STATUS_ERROR

if (input_tensorinfo && getBufSize(input_tensorinfo) != size)
{
  std::cerr
    << "Error during nnfw_session::train_set_input : not supporeted to change tensorinfo"
    << std::endl;
  return NNFW_STATUS_ERROR;
}

To support training with a dataset, the remaining (smaller) final batch of the dataset sometimes needs to be processed, so we need to handle cases where the I/O batch size differs from the prepared one. I think there are three approaches, including alternative solutions:

  1. Resizing the remaining batches of the dataset, as below:

        # Inside the DataLoader: assumes `import numpy as np` at module level and
        # that self.inputs, self.num_samples, and self.batch_size are set in __init__.
        batched_inputs = []

        for batch_start in range(0, self.num_samples, self.batch_size):
            batch_end = min(batch_start + self.batch_size, self.num_samples)

            # Collect batched inputs
            inputs_batch = [
                input_array[batch_start:batch_end] for input_array in self.inputs
            ]
            if batch_end - batch_start < self.batch_size:
                # Resize the last (partial) batch up to batch_size
                inputs_batch = [
                    np.resize(batch, (self.batch_size, *batch.shape[1:]))
                    for batch in inputs_batch
                ]

            batched_inputs.append(inputs_batch)
  2. Allowing and supporting I/O to be performed with a smaller batch size.
  3. Supporting dynamic shapes.

I prefer approach 1 (see the note on np.resize below).
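
One caveat of approach 1: np.resize does not zero-pad; it fills the requested shape by repeating the source data cyclically, so the padded rows duplicate earlier samples. A minimal standalone sketch (array contents are arbitrary, for illustration only):

    import numpy as np

    # Simulate a final partial batch: 4 samples of 2 features, batch_size = 10.
    last_batch = np.arange(8, dtype=np.float32).reshape(4, 2)
    padded = np.resize(last_batch, (10, 2))

    print(padded.shape)  # (10, 2)
    print(padded[4])     # [0. 1.] -- row 4 repeats row 0, cycling through the data

Depending on the loss reduction, these duplicated samples slightly bias the gradient of the last step, which is the trade-off against approaches 2 and 3.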

@ragmani
Contributor Author

ragmani commented Jan 3, 2025

I found that the accuracy measurement was not working, and I confirmed that obtaining output results during training does not work.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Train Loss: 0.0012
Batch 1: Loss=0.0012
Batch 2: Loss=0.0013
Validation Loss: 0.0012
CategoricalAccuracy: 0.0000
$ ./Product/x86_64-linux.release/out/bin/onert_train mobilenetv2.circle --load_expected:raw test-models/imagenet_a/test.output.100.bin --load_input:raw test-models/imagenet_a/test.input.100.bin --loss 1 --loss_reduction_type 1 --optimizer 1 --learning_rate 0.01 --epoch 5 --batch_size 10 --num_of_trainable_ops -1 --validation_split 0.2 --metric 0
Model Filename mobilenetv2.circle
== training parameter ==
- learning_rate        = 0.01
- batch_size           = 10
- loss_info            = {loss = mean squared error, reduction = sum over batch size}
- optimizer            = sgd
- num_of_trainable_ops = -1
========================
Epoch 1/5 - time: 769.609ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 2/5 - time: 795.052ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 3/5 - time: 724.606ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 4/5 - time: 712.029ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 5/5 - time: 721.155ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
===================================
MODEL_LOAD   takes 7.5380 ms
PREPARE      takes 268.6270 ms
EXECUTE      takes 30332.8390 ms
- Epoch 1      takes 6156.8730 ms
- Epoch 2      takes 6360.4160 ms
- Epoch 3      takes 5796.8470 ms
- Epoch 4      takes 5696.2360 ms
- Epoch 5      takes 5769.2410 ms
===================================

We need to fix the problem.

@ragmani
Contributor Author

ragmani commented Jan 6, 2025

Here is the result of running a sample example for training.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100 --optimizer adam --loss cce --learning_rate 0.01 --batch_size 10 --validation_split=0.2
Load data
== training parameter ==
- learning_rate        = 0.01
- batch_size           = 10
- loss_info            = {loss = CategoricalCrossentropy, reduction = sum over batch size}
- optimizer            = Adam
- num_of_trainable_ops = -1
========================
Epoch 1/5 - Train time: 602.569ms/step - IO time: 0.057ms/step - Train Loss: 10.7749 - Validation Loss: 10.1255 - CategoricalAccuracy: 0.0000
Epoch 2/5 - Train time: 634.849ms/step - IO time: 0.076ms/step - Train Loss: 6.1418 - Validation Loss: 12.0664 - CategoricalAccuracy: 0.0000
Epoch 3/5 - Train time: 634.984ms/step - IO time: 0.064ms/step - Train Loss: 5.7052 - Validation Loss: 14.5072 - CategoricalAccuracy: 0.0000
Epoch 4/5 - Train time: 642.914ms/step - IO time: 0.068ms/step - Train Loss: 5.4454 - Validation Loss: 15.3301 - CategoricalAccuracy: 0.0000
Epoch 5/5 - Train time: 672.250ms/step - IO time: 0.059ms/step - Train Loss: 6.6274 - Validation Loss: 17.4566 - CategoricalAccuracy: 0.0000
===================================
MODEL_LOAD   takes 8.3949 ms
COMPILE      takes 243.6544 ms
EXECUTE      takes 26446.5046 ms
- Epoch 1      takes 5004.7149 ms
- Epoch 2      takes 5264.5105 ms
- Epoch 3      takes 5267.8122 ms
- Epoch 4      takes 5335.6967 ms
- Epoch 5      takes 5573.7703 ms
===================================
nnpackage mobilenetv2 trains successfully.

"IO time" is almost meaningless. It would be better to merge it with "Train time".
