[81] Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting #110

dhkim0225 · 2022-03-15T02:34:04Z

Is everyone with "Bai" in their name good at OCR? (~~following Xiang Bai-sensei..~~)

Proposes a pretraining strategy for OCR tasks.

INTRO

The paper illustrates three pipelines in one figure:

OCR pipeline
Vision-Language Pretraining (VLP) pipeline
proposed pipeline

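The proposed pipeline is the VL pipeline taken over essentially as-is.
The difference is that text is encoded at the character level.
Image-text pair information is required.
Only the backbone is transferred.
The decoder and the text encoder are discarded entirely.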

Methodology

Character-Aware Text Encoder

The character embedding (ce) is computed with the following formula:

n is the number of text instances (indexed by i).
t_i is a particular word,
and c^i_j is each character inside t_i.
W_c is the character embedding matrix.
PE uses a learnable setting (not sinusoidal).
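
The formula image did not survive in these notes, so the following is only a minimal sketch of what the definitions above suggest: each character is looked up in W_c and a learnable positional vector is added, i.e. ce^i_j = W_c[c^i_j] + PE_j. That additive form, the character set `CHARSET`, the padding id 0, and the width `DIM` are all assumptions made for illustration, not values from the paper.

```python
import random

MAX_LEN = 25                                        # max characters per text instance (from the notes)
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"    # hypothetical character set
DIM = 4                                             # hypothetical embedding width

random.seed(0)
# W_c: one row per character; row 0 is reserved for padding (assumption).
W_c = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(len(CHARSET) + 1)]
# PE: one vector per position; randomly initialised here to stand in for
# a learnable (not sinusoidal) positional encoding.
PE = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(MAX_LEN)]


def char_ids(word):
    """Map a word to MAX_LEN character ids, clipping long words and padding with 0."""
    ids = [CHARSET.index(ch) + 1 for ch in word.lower() if ch in CHARSET][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))


def char_embed(word):
    """Assumed form: ce_j = W_c[c_j] + PE_j for each of the MAX_LEN positions."""
    return [[w + p for w, p in zip(W_c[c], PE[j])]
            for j, c in enumerate(char_ids(word))]


def embed_instances(words):
    """Embed the n text instances t_1..t_n of one image."""
    return [char_embed(t_i) for t_i in words]


ce = embed_instances(["Hello", "ocr"])  # shape: 2 instances x 25 positions x DIM
```

In a real model W_c and PE would be trained parameters; plain lists are used here only to keep the sketch dependency-free.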

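Besides the character encoder, the authors also trained the pipeline above with a commonly used general-purpose text encoder.
They then visualize the attention of the visual encoder, which is the part that finally gets transferred (figure not reproduced here).
Conclusion: the character encoder is what makes it work well.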

During pretraining, only text of up to 25 characters is used as input.
The recognizer batch is capped at 3 per image.
In other words, only 3 words per image are used for training.
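
A minimal sketch of the two limits above. The notes do not say whether longer text is clipped or dropped, nor how the 3 words are chosen, so clipping and uniform random sampling are assumptions here, and `sample_text_instances` is a hypothetical helper name.

```python
import random

MAX_LEN = 25          # max characters per word (from the notes)
WORDS_PER_IMAGE = 3   # words used per image during pretraining (from the notes)


def sample_text_instances(words, rng):
    """Pick up to WORDS_PER_IMAGE words at random and clip each to MAX_LEN characters."""
    k = min(WORDS_PER_IMAGE, len(words))
    picked = rng.sample(words, k)          # assumption: uniform sampling without replacement
    return [w[:MAX_LEN] for w in picked]   # assumption: clip rather than drop long words


rng = random.Random(0)
batch = sample_text_instances(["EXIT", "coffee", "open", "24h", "a" * 40], rng)
```

Images with fewer than 3 words simply yield fewer instances in this sketch; how the paper handles that case is not stated in the notes.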

Visual-Textual Decoder

6 stacked decoder layers.
Since only 3 words per image are used for training, there are 3 outputs as well.
Each of the 25 character-encoder positions goes in as a query, so the output shape is, strictly speaking, (B, 25, 3).

Masking: just one mask is placed per word.
There is no analysis of the masking ratio or anything like that.
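
The one-mask-per-word rule could be sketched as follows. The `MASK` token string, the uniform choice of position, and the helper name `mask_one_char` are assumptions for illustration; the notes only say that each word gets a single mask.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_one_char(word, rng):
    """Replace one randomly chosen character of a non-empty word.

    Returns (tokens, position, target) where target is the hidden character
    that the decoder would be asked to predict.
    """
    chars = list(word)
    pos = rng.randrange(len(chars))  # assumption: position chosen uniformly
    target = chars[pos]
    chars[pos] = MASK
    return chars, pos, target


rng = random.Random(0)
tokens, pos, target = mask_one_char("hello", rng)
```

Only position `pos` would contribute to the classification loss; the other characters act as context.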

Network Optimization

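The CLS loss is a CE loss computed only at the masked positions.

The CL loss is set up roughly like CLIP.
An image-side CL and a text-side CL are computed and the two are summed.

Final loss: the terms are added with no scaling.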
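
Since the loss formulas were images, here is a rough sketch of how the three statements above fit together: cross-entropy at masked positions, a contrastive term in both directions over an image-text similarity matrix, and an unweighted sum. The function names are hypothetical, and the temperature that CLIP-style losses normally apply to the similarities is omitted, so treat this as an illustration rather than the paper's exact objective.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def masked_ce(logits, targets, masked):
    """Mean cross-entropy over the positions flagged in `masked` only."""
    losses = [-log_softmax(l)[t] for l, t, m in zip(logits, targets, masked) if m]
    return sum(losses) / len(losses)


def contrastive(sim):
    """Mean of -log softmax(sim[i])[i]: row i should match column i."""
    return sum(-log_softmax(row)[i] for i, row in enumerate(sim)) / len(sim)


def total_loss(logits, targets, masked, sim):
    """CLS loss plus image-side and text-side contrastive terms, summed unweighted."""
    sim_t = [list(col) for col in zip(*sim)]       # transpose: text-to-image direction
    l_cls = masked_ce(logits, targets, masked)
    l_cl = contrastive(sim) + contrastive(sim_t)   # image-side + text-side
    return l_cls + l_cl                            # no scaling between the terms


# Toy batch of 2 image-text pairs; only the first character position is masked.
sim = [[5.0, 0.0], [0.0, 5.0]]
loss = total_loss([[2.0, 0.0], [0.0, 2.0]], [0, 0], [True, False], sim)
```

The similarity matrix `sim[i][j]` stands for the score between image i and text j within the batch, so the diagonal holds the matching pairs.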

Note

Would a TCL-style setup ([75] Vision-Language Pre-Training with Triple Contrastive Learning #104 ) bring further gains? (~~Though the TCL approach does add a lot of parameters.~~)
Attaching ELECTRA-style pretraining at the back end would probably work well.
max-length 25 may be a disadvantage for Japanese.
- It can of course be increased. But Japanese text often runs past 25 characters.

Impl. Detail

Pretraining

Encoder Backbone: ResNet-50
input image: 512x512 resized
optimizer: AdamW
scheduler
- init-LR: 1e-4
- cosine 1cycle
V100 * 8
640 batchsize
max-length 25

Finetuning

Follows each model's own settings:

PSENet
DB
FCENet
TextBPN
MTSv3

Results

ICDAR19-LSVT Detection

‘+Ours’ == model pretrained on the 400,000 IC19-LSVT images

ICDAR19-LSVT E2E

‘+Ours’ == model pretrained on the 400,000 IC19-LSVT images
NED == Normalized Edit Distance
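
For reference, a minimal sketch of a normalized edit distance between one predicted string and one ground-truth string, using Levenshtein distance divided by the longer length. How the benchmark matches detections to ground truth and aggregates the score is not covered in these notes, so this only illustrates the per-pair quantity; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, gt):
    """Edit distance divided by the longer string length (0.0 for two empty strings)."""
    longest = max(len(pred), len(gt))
    return edit_distance(pred, gt) / longest if longest else 0.0


ned = normalized_edit_distance("abcd", "abcf")  # one substitution out of four characters
```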

Performance by pretraining data portion

PSENet (Synth pretrain + TotalText finetune)
‘+Ours’ == synthtext pretrained model

Comparison with other pretraining techniques

The pretraining set is SynthText for all methods.
‘+Ours’ == synthtext pretrained model

CTW 1500 Detection

‘+Ours’ == synthtext pretrained model

TotalText Detection

‘+Ours’ == synthtext pretrained model

IC15 Detection

‘+Ours’ == synthtext pretrained model

IC15 & TotalText E2E

‘+Ours’ == synthtext pretrained model

Ablation

PSENet (Synth pretrain + TotalText finetune)

CAE == Character Aware Encoder
VTD == Visual Textual Decoder
BCL == Batch-level Contrastive Loss

dhkim0225 added OCR SenseTime Pretraining labels Mar 15, 2022
