[onert] Python Bindings for Training #14505

Open
ragmani opened this issue Dec 30, 2024 · 4 comments

Comments

@ragmani
Contributor

ragmani commented Dec 30, 2024

What

Let's introduce Python APIs for training.

Why

Python APIs for inference have already been implemented using pybind11 (#11368), but there are no Python APIs for training yet. To provide a better user experience, we need to introduce Python bindings for the training APIs as well.

Draft: #14492

@ragmani
Contributor Author

ragmani commented Dec 30, 2024

Python Package Structure

Since the training API is experimental, it needs to be provided as a module separate from the existing Python module.

Package Separation

  • Current: A single onert package
  • Proposed Change: Maintain the current single package (retain the .so components as they are)
  • Reason: Separating packages would make it difficult to manage dependencies between them.

Binding Modules

  • Current: Infer module
  • Proposed Change: Infer and Experimental submodules
  • Reason: To provide experimental APIs in a separate module

Provided Python API

  • Current: infer["session", "tensorinfo"]
  • Proposed Change: infer["session"], tensorinfo, train["session", "traininfo", "DataLoader", "optimizer", "losses", "metrics"]
  • Reason: To provide training APIs

Expected Package Structure

['onert-0.2.0.data/purelib/onert/__init__.py',
 'onert-0.2.0.data/purelib/onert/common/__init__.py',
 'onert-0.2.0.data/purelib/onert/common/basesession.py',
 'onert-0.2.0.data/purelib/onert/infer/__init__.py',
 'onert-0.2.0.data/purelib/onert/infer/session.py',
 'onert-0.2.0.data/purelib/onert/native/libnnfw_api_pybind.so',
 'onert-0.2.0.data/purelib/onert/native/libonert.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/libonert_core.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_cpu.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_ruy.so',
 'onert-0.2.0.data/purelib/onert/native/nnfw/backend/libbackend_train.so',
 'onert-0.2.0.data/purelib/onert/train/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/dataloader.py',
 'onert-0.2.0.data/purelib/onert/train/session.py',
 'onert-0.2.0.data/purelib/onert/train/losses/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/losses/cce.py',
 'onert-0.2.0.data/purelib/onert/train/losses/loss.py',
 'onert-0.2.0.data/purelib/onert/train/losses/mse.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/categorical_accuracy.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/metric.py',
 'onert-0.2.0.data/purelib/onert/train/metrics/registry.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/__init__.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/adam.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/optimizer.py',
 'onert-0.2.0.data/purelib/onert/train/optimizer/sgd.py',
... ]
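
For illustration, the layout above would be consumed roughly as follows. Every name here (infer.session, train.session, train.DataLoader, and the constructor arguments) is an assumption derived from the file structure, not a finalized API:

    # Sketch only; names and signatures are tentative, based on the proposed layout above.
    from onert import infer, train

    # Existing inference API keeps its current import path.
    inference_session = infer.session("mobilenetv2.circle", "cpu")

    # Experimental training APIs live in the separate `train` submodule.
    training_session = train.session("mobilenetv2.circle")
    loader = train.DataLoader(input_path, expected_path, batch_size=10)  # hypothetical signature

Keeping everything inside the single onert package means the train submodule can reuse the shared native libraries under onert/native without cross-package dependency management.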

@ragmani
Contributor Author

ragmani commented Jan 3, 2025

I found that an error occurs when setting I/O with a batch size different from the batch size configured during the train_prepare stage.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Batch 6: Loss=0.0012
Error during nnfw_session::train_set_input : not supporeted to change tensorinfo
[ERROR]	NNFW_STATUS_ERROR

if (input_tensorinfo && getBufSize(input_tensorinfo) != size)
{
  std::cerr
    << "Error during nnfw_session::train_set_input : not supporeted to change tensorinfo"
    << std::endl;
  return NNFW_STATUS_ERROR;
}

To support training with a dataset, the remaining (smaller) final batch of the dataset sometimes needs to be processed, so we need to handle cases where the I/O batch size differs from the prepared one. I think there are three approaches, including alternative solutions:

  1. Resizing the remaining batches of the dataset, as below:

        # Inside the DataLoader: assumes `import numpy as np` at module level and
        # that self.inputs, self.num_samples, and self.batch_size are set in __init__.
        batched_inputs = []

        for batch_start in range(0, self.num_samples, self.batch_size):
            batch_end = min(batch_start + self.batch_size, self.num_samples)

            # Collect batched inputs
            inputs_batch = [
                input_array[batch_start:batch_end] for input_array in self.inputs
            ]
            if batch_end - batch_start < self.batch_size:
                # Resize the last (partial) batch up to batch_size
                inputs_batch = [
                    np.resize(batch, (self.batch_size, *batch.shape[1:]))
                    for batch in inputs_batch
                ]

            batched_inputs.append(inputs_batch)
  2. Allowing and supporting I/O to be performed with a smaller batch size.
  3. Supporting dynamic shapes.

I prefer approach 1 (see the note on np.resize below).
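
One caveat of approach 1: np.resize does not zero-pad; it fills the requested shape by repeating the source data cyclically, so the padded rows duplicate earlier samples. A minimal standalone sketch (array contents are arbitrary, for illustration only):

    import numpy as np

    # Simulate a final partial batch: 4 samples of 2 features, batch_size = 10.
    last_batch = np.arange(8, dtype=np.float32).reshape(4, 2)
    padded = np.resize(last_batch, (10, 2))

    print(padded.shape)  # (10, 2)
    print(padded[4])     # [0. 1.] -- row 4 repeats row 0, cycling through the data

Depending on the loss reduction, these duplicated samples slightly bias the gradient of the last step, which is the trade-off against approaches 2 and 3.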

@ragmani
Contributor Author

ragmani commented Jan 3, 2025

I found that the accuracy measurement was not working, and I confirmed that obtaining output results during training does not work.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100
Load data
Epoch 1/5
Batch 1: Loss=0.0012
Batch 2: Loss=0.0012
Batch 3: Loss=0.0011
Batch 4: Loss=0.0012
Batch 5: Loss=0.0011
Train Loss: 0.0012
Batch 1: Loss=0.0012
Batch 2: Loss=0.0013
Validation Loss: 0.0012
CategoricalAccuracy: 0.0000
$ ./Product/x86_64-linux.release/out/bin/onert_train mobilenetv2.circle --load_expected:raw test-models/imagenet_a/test.output.100.bin --load_input:raw test-models/imagenet_a/test.input.100.bin --loss 1 --loss_reduction_type 1 --optimizer 1 --learning_rate 0.01 --epoch 5 --batch_size 10 --num_of_trainable_ops -1 --validation_split 0.2 --metric 0
Model Filename mobilenetv2.circle
== training parameter ==
- learning_rate        = 0.01
- batch_size           = 10
- loss_info            = {loss = mean squared error, reduction = sum over batch size}
- optimizer            = sgd
- num_of_trainable_ops = -1
========================
Epoch 1/5 - time: 769.609ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 2/5 - time: 795.052ms/step - loss: [0] 0.0012 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 3/5 - time: 724.606ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 4/5 - time: 712.029ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
Epoch 5/5 - time: 721.155ms/step - loss: [0] 0.0011 - categorical_accuracy: [0] 0.0000 - val_loss: [0] 0.0012 - val_categorical_accuracy: [0] 0.0000
===================================
MODEL_LOAD   takes 7.5380 ms
PREPARE      takes 268.6270 ms
EXECUTE      takes 30332.8390 ms
- Epoch 1      takes 6156.8730 ms
- Epoch 2      takes 6360.4160 ms
- Epoch 3      takes 5796.8470 ms
- Epoch 4      takes 5696.2360 ms
- Epoch 5      takes 5769.2410 ms
===================================

We need to fix the problem.

@ragmani
Contributor Author

ragmani commented Jan 6, 2025

Here is the result of running a sample example for training.

$ python3 runtime/onert/sample/minimal-python/experimental/src/train_with_dataset.py -m mobilenetv2 -i out/imagenet_a.test.input.100.bin -l out/imagenet_a.test.output.100.bin --data_length 100 --optimizer adam --loss cce --learning_rate 0.01 --batch_size 10 --validation_split=0.2
Load data
== training parameter ==
- learning_rate        = 0.01
- batch_size           = 10
- loss_info            = {loss = CategoricalCrossentropy, reduction = sum over batch size}
- optimizer            = Adam
- num_of_trainable_ops = -1
========================
Epoch 1/5 - Train time: 602.569ms/step - IO time: 0.057ms/step - Train Loss: 10.7749 - Validation Loss: 10.1255 - CategoricalAccuracy: 0.0000
Epoch 2/5 - Train time: 634.849ms/step - IO time: 0.076ms/step - Train Loss: 6.1418 - Validation Loss: 12.0664 - CategoricalAccuracy: 0.0000
Epoch 3/5 - Train time: 634.984ms/step - IO time: 0.064ms/step - Train Loss: 5.7052 - Validation Loss: 14.5072 - CategoricalAccuracy: 0.0000
Epoch 4/5 - Train time: 642.914ms/step - IO time: 0.068ms/step - Train Loss: 5.4454 - Validation Loss: 15.3301 - CategoricalAccuracy: 0.0000
Epoch 5/5 - Train time: 672.250ms/step - IO time: 0.059ms/step - Train Loss: 6.6274 - Validation Loss: 17.4566 - CategoricalAccuracy: 0.0000
===================================
MODEL_LOAD   takes 8.3949 ms
COMPILE      takes 243.6544 ms
EXECUTE      takes 26446.5046 ms
- Epoch 1      takes 5004.7149 ms
- Epoch 2      takes 5264.5105 ms
- Epoch 3      takes 5267.8122 ms
- Epoch 4      takes 5335.6967 ms
- Epoch 5      takes 5573.7703 ms
===================================
nnpackage mobilenetv2 trains successfully.

"IO time" is almost meaningless. It would be better to merge it with "Train time".
