vitstr-small-patch16-224

Use-case and high-level description

The vitstr-small-patch16-224 model is the small version of the ViTSTR models. ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (STR). The small version has an embedding size of 384 and 6 attention heads. The model recognizes case-sensitive alphanumeric text and special characters.

More details are provided in the paper and repository.

Specification

Metric	Value
Type	Scene Text Recognition
GFLOPs	9.1544
MParams	21.5061
Source framework	PyTorch*

Accuracy

Alphanumeric subsets of common scene text recognition benchmarks are used. Dataset sizes are listed for convenience. Note that the ICDAR15 alphanumeric subset is used here without irregular (arbitrarily oriented, perspective, or curved) texts. See details here, section 4.1. All reported results are achieved without using any lexicon.

Dataset	Accuracy	Dataset size
ICDAR-03	93.43%	867
ICDAR-13	90.34%	1015
ICDAR-15	75.04%	1811
SVT	85.47%	647
IIIT5K	87.07%	3000
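The per-dataset figures can be combined into a single number by weighting each accuracy by its dataset size. The sketch below only performs that arithmetic on the values from the table above; the combined figure is derived here and is not a result reported by the model authors.

```python
# (accuracy in percent, dataset size), copied from the table above.
results = {
    "ICDAR-03": (93.43, 867),
    "ICDAR-13": (90.34, 1015),
    "ICDAR-15": (75.04, 1811),
    "SVT": (85.47, 647),
    "IIIT5K": (87.07, 3000),
}

total_images = sum(size for _, size in results.values())

# Size-weighted mean: every image counts equally.
weighted_accuracy = sum(acc * size for acc, size in results.values()) / total_images

# Unweighted mean: every dataset counts equally.
mean_accuracy = sum(acc for acc, _ in results.values()) / len(results)

print(f"images: {total_images}")
print(f"size-weighted accuracy: {weighted_accuracy:.2f}%")
print(f"unweighted mean accuracy: {mean_accuracy:.2f}%")
```

The two means differ because ICDAR-15, which has the lowest accuracy, is also one of the larger datasets.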

Use accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_model> to specify the path to additional model attributes. path_to_folder_with_downloaded_model is the path to the folder where the current model is downloaded by the Model Downloader tool.

Input

Original model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

B - batch size
C - number of channels
H - image height
W - image width

Note that the source image should be a tightly aligned crop of the detected text, converted to grayscale.

Scale values - [255].
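A minimal sketch of the input preparation described above, assuming the crop is already grayscale and given as rows of 0..255 values. The function name `preprocess` and the nearest-neighbor resize are illustrative assumptions, not the method used by the original repository; a real pipeline would normally use an image library for resizing. The scale value is applied as a division by 255.

```python
from typing import List, Sequence

TARGET_H, TARGET_W = 224, 224  # H and W from the input shape
SCALE = 255.0                  # "Scale values - [255]": input is divided by 255


def preprocess(gray: Sequence[Sequence[int]]) -> List[List[List[List[float]]]]:
    """Resize a grayscale crop to 224x224 and return it in B, C, H, W layout.

    Hypothetical helper: uses nearest-neighbor sampling (an assumption).
    """
    src_h, src_w = len(gray), len(gray[0])
    plane = []
    for y in range(TARGET_H):
        src_row = gray[y * src_h // TARGET_H]  # nearest source row
        plane.append(
            [src_row[x * src_w // TARGET_W] / SCALE for x in range(TARGET_W)]
        )
    return [[plane]]  # batch size 1, 1 channel


# Tiny 2x2 "crop" used only to show the resulting shape and value range.
crop = [[0, 255], [128, 64]]
blob = preprocess(crop)
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))
```

The nested lists can be converted to an array type of your choice before being passed to the model as the `image` input.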

Converted model

Image, name: image, shape: 1, 1, 224, 224 in the format B, C, H, W, where:

B - batch size
C - number of channels
H - image height
W - image width

Note that the source image should be a tightly aligned crop of the detected text, converted to grayscale.

Output

Original model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

B - batch size
W - output sequence length
L - confidence distribution across the supported symbols: [GO], the special start token for the decoder; [s], the special end-of-sequence token for the decoder; and the characters listed in the enclosed file vocab.txt.

To decode the network output, take the argmax along the L dimension, map the indices to characters, and cut the resulting string at the first occurrence of the end-of-sequence symbol.
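The decoding steps above can be sketched in plain Python. The function name `decode_logits` and the toy symbol table are hypothetical; the ordering of the real symbol table ([GO], then [s], then the characters from vocab.txt) is an assumption, so the full table is passed in explicitly rather than hard-coded.

```python
from typing import Sequence

GO = "[GO]"  # special start token for the decoder
EOS = "[s]"  # special end-of-sequence token for the decoder


def decode_logits(logits: Sequence[Sequence[float]], symbols: Sequence[str]) -> str:
    """Greedy-decode one sequence of shape (W, L) into text.

    `symbols` must hold one entry per class, in the model's class order
    (assumed here, not confirmed by the source).
    """
    chars = []
    for step in logits:
        if len(step) != len(symbols):
            raise ValueError("each step needs exactly one score per symbol")
        best = max(range(len(step)), key=step.__getitem__)  # argmax over L
        if symbols[best] == EOS:
            break  # cut at the first end-of-sequence symbol
        chars.append(symbols[best])
    return "".join(chars)


# Toy example with a 5-symbol table instead of the real 96 classes.
symbols = [GO, EOS, "a", "b", "c"]
toy_logits = [
    [0.0, 0.1, 0.2, 2.0, 0.3],  # argmax -> "b"
    [0.0, 0.1, 3.0, 0.2, 0.3],  # argmax -> "a"
    [0.0, 4.0, 0.2, 0.1, 0.3],  # argmax -> "[s]", decoding stops here
    [0.0, 0.1, 0.2, 0.1, 5.0],  # ignored: comes after end of sequence
]
decoded = decode_logits(toy_logits, symbols)
print(decoded)  # ba
```

For a real output tensor of shape 1, 25, 96, pass the first batch element (25 rows of 96 scores) together with a 96-entry symbol table.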

Converted model

Output tensor, name: logits, shape: 1, 25, 96 in the format B, W, L, where:

B - batch size
W - output sequence length
L - confidence distribution across the supported symbols: [GO], the special start token for the decoder; [s], the special end-of-sequence token for the decoder; and the characters listed in the enclosed file vocab.txt.

To decode the network output, take the argmax along the L dimension, map the indices to characters, and cut the resulting string at the first occurrence of the end-of-sequence symbol.

Download a Model and Convert it into OpenVINO™ IR Format

You can download models and, if necessary, convert them into OpenVINO™ IR format using the Model Downloader and other automation tools, as shown in the examples below.

An example of using the Model Downloader:

omz_downloader --name <model_name>

An example of using the Model Converter:

omz_converter --name <model_name>

Demo usage

The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:

Text Detection C++ Demo

Legal Information

The original model is distributed under the Apache License, Version 2.0. A copy of the license is provided in <omz_dir>/models/public/licenses/APACHE-2.0.txt.
