The vitstr-small-patch16-224 model is the small version of the ViTSTR models. ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (STR). The small version has an embedding size of 384 and 6 attention heads. The model recognizes case-sensitive alphanumeric text and special characters.
More details are provided in the paper and repository.
Metric | Value |
---|---|
Type | Scene Text Recognition |
GFLOPs | 9.1544 |
MParams | 21.5061 |
Source framework | PyTorch* |
Alphanumeric subsets of common scene text recognition benchmarks are used. Dataset sizes are listed for convenience. Note that the ICDAR15 alphanumeric subset used here excludes irregular (arbitrarily oriented, perspective, or curved) texts. See details here, section 4.1. All reported results are achieved without using any lexicon.
Dataset | Accuracy | Dataset size |
---|---|---|
ICDAR-03 | 93.43% | 867 |
ICDAR-13 | 90.34% | 1015 |
ICDAR-15 | 75.04% | 1811 |
SVT | 85.47% | 647 |
IIIT5K | 87.07% | 3000 |
Use accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_model> to specify the path to additional model attributes. path_to_folder_with_downloaded_model is the path to the folder where the current model is downloaded by the Model Downloader tool.
Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:
- B - batch size
- C - number of channels
- H - image height
- W - image width

Note that the source image should be a tightly aligned crop of the detected text, converted to grayscale.
Scale values - [255].
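The input requirements above can be sketched in a few lines of NumPy. This is a minimal illustration, not the preprocessing used by the Open Model Zoo tools: the function name `preprocess`, the RGB channel order, the BT.601 grayscale weights, and the nearest-neighbor resize are all assumptions made for the example, and "scale values" is read here as dividing pixel values by 255.

```python
import numpy as np


def preprocess(crop, size=224, apply_scale=True):
    """Turn a text crop (H x W or H x W x 3, values 0..255) into a 1 x 1 x size x size tensor.

    Hypothetical helper: the resize method and grayscale weights are assumptions.
    """
    img = np.asarray(crop, dtype=np.float32)

    # Convert color crops to grayscale (assumes RGB order and BT.601 weights).
    if img.ndim == 3:
        weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
        img = img[..., :3] @ weights

    # Nearest-neighbor resize to size x size using index lookup only.
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    img = img[rows][:, cols]

    # Assumed meaning of "Scale values - [255]": divide by 255 to get values in [0, 1].
    if apply_scale:
        img = img / 255.0

    # Add batch and channel axes: B, C, H, W.
    return img[np.newaxis, np.newaxis, :, :].astype(np.float32)
```

Pass `apply_scale=False` if the scaling is already handled elsewhere in the pipeline. A production pipeline would more likely use an image library with proper interpolation for the resize step.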
Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:
- B - batch size
- W - output sequence length
- L - confidence distribution across [GO] (special start token for the decoder), [s] (special end-of-sequence character for the decoder), and the characters listed in the enclosed file vocab.txt

Decoding the network output is straightforward: take the argmax along the L dimension, map the indices to characters, and cut the resulting string at the first end-of-sequence symbol.
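The decoding steps described above can be sketched as follows. This is an illustrative example under stated assumptions: the function name `decode` is hypothetical, the `charset` list is assumed to hold `[GO]` at index 0 and `[s]` at index 1 followed by the characters from vocab.txt in file order, and dropping any `[GO]` tokens from the result is an added assumption rather than something the description states.

```python
import numpy as np


def decode(logits, charset, go="[GO]", eos="[s]"):
    """Greedy decoding of a 1 x W x L logits tensor into a string.

    Hypothetical helper: the charset layout is an assumption, check it
    against the vocab.txt shipped with the model.
    """
    # Pick the most likely class at every sequence position (argmax over L).
    indices = np.argmax(np.asarray(logits)[0], axis=-1)

    chars = []
    for index in indices:
        token = charset[int(index)]
        if token == eos:
            # Cut the phrase at the first end-of-sequence symbol.
            break
        if token == go:
            # Assumption: the start token carries no text, so skip it.
            continue
        chars.append(token)
    return "".join(chars)
```

Because the argmax is taken independently per position, this is plain greedy decoding, and no lexicon or beam search is involved.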
You can download models and if necessary convert them into OpenVINO™ IR format using the Model Downloader and other automation tools as shown in the examples below.
An example of using the Model Downloader:
omz_downloader --name <model_name>
An example of using the Model Converter:
omz_converter --name <model_name>
The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:
The original model is distributed under the Apache License, Version 2.0. A copy of the license is provided in <omz_dir>/models/public/licenses/APACHE-2.0.txt.