Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposes OFA-OCR, a method for transferring multimodal pretrained models to text recognition
Recasts text recognition as image captioning and directly transfers a unified vision-language pretrained model
OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark
Builds an OCR pipeline with OFA-OCR that performs competitively with a product-level API
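
To make the captioning framing concrete, the sketch below wraps an ordinary OCR label (image plus transcription) as an image-captioning training example: the model sees the image and an instruction, and is trained to generate the transcription as the caption. The prompt wording, the CaptioningExample type, and the file name are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass

# Assumed instruction text; the prompt actually used by OFA-OCR may differ.
PROMPT = "What is the text in the image?"


@dataclass
class CaptioningExample:
    image_path: str  # input image containing the text to read
    prompt: str      # instruction paired with the image
    target: str      # text the model should generate as the "caption"


def to_captioning_example(image_path: str, transcription: str) -> CaptioningExample:
    """Wrap an OCR label as an image-captioning training example."""
    return CaptioningExample(
        image_path=image_path,
        prompt=PROMPT,
        target=transcription.strip(),  # drop stray whitespace from the label
    )


# Hypothetical sample: the file name and label are placeholders.
example = to_captioning_example("scene_0001.jpg", " OPEN 24 HOURS ")
```

Because the target is plain text, a captioning model can be finetuned on such examples without adding a task-specific recognition head.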

Optical Character Recognition (OCR) is used to extract text from images
Text recognition is a key challenge in OCR
Deep learning methods are used to improve accuracy
The Transformer encoder-decoder framework has been applied to text recognition
Complex model and objective designs can hinder reproduction
Multimodal pretraining can boost performance in text recognition
Finetuning a unified multimodal pretrained model on text recognition datasets can achieve high accuracy
Ablation studies demonstrate the effectiveness of the proposed method

Experiments were conducted on a Chinese text recognition benchmark with 4 subtasks.
Multitask finetuning and single-task finetuning were implemented for comparison.
Multitask finetuning achieved outstanding performance on all datasets.
Single-task finetuning was applied after multitask finetuning to further boost performance.

Conducted an ablation study to evaluate the effects of multitask learning
4 setups: training from scratch, single-task finetuning, multitask finetuning, and multitask + single-task finetuning
Initializing from the pretrained OFA model boosts performance on the datasets
Multitask finetuning alone outperforms single-task finetuning on all 4 tasks
Combination of multitask finetuning and single-task finetuning is the best solution
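
The best-performing setup above (multitask finetuning followed by single-task finetuning) can be sketched as a two-stage data schedule. This is a minimal illustration, not the authors' training code: the task names, the sample identifiers, and both helper functions are hypothetical, and the actual optimizer steps are left out.

```python
import random


def multitask_mixture(datasets, seed=0):
    """Merge every task's samples into one shuffled list of (task, sample) pairs."""
    mixed = [(task, sample) for task, samples in datasets.items() for sample in samples]
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    return mixed


def two_stage_schedule(datasets, target_task, seed=0):
    """Stage 1 trains on all tasks mixed together; stage 2 on the target task only."""
    stage1 = multitask_mixture(datasets, seed)
    stage2 = [(target_task, sample) for sample in datasets[target_task]]
    return [("multitask", stage1), ("single-task", stage2)]


# Placeholder data: four tasks standing in for the benchmark's 4 subtasks.
datasets = {
    "task_1": ["a1", "a2"],
    "task_2": ["b1", "b2", "b3"],
    "task_3": ["c1"],
    "task_4": ["d1", "d2"],
}
schedule = two_stage_schedule(datasets, target_task="task_2")
```

A training loop would run over the stage 1 samples first and then continue from the resulting weights on the stage 2 samples, once per target task.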

Preprocessing of images can be used as data augmentation
Performance growth with pretrained weight initialization and multitask finetuning
Experiments on 4 datasets with single-task finetuning on base-size models
The preprocessing technique can effectively boost performance
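
The summary does not say what the preprocessing consists of. As one plausible illustration only, the sketch below computes an aspect-preserving resize with padding, a common way to fit text images of very different shapes into the fixed input resolution a pretrained model expects. The function name and the 480x480 canvas are assumptions, not details from the paper.

```python
def fit_to_canvas(width, height, canvas_w=480, canvas_h=480):
    """Compute the resized size and padding that fit an image into a fixed canvas
    without changing its aspect ratio. The canvas size is an assumed default."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the smaller scale factor so both sides fit inside the canvas.
    scale = min(canvas_w / width, canvas_h / height)
    new_w = min(canvas_w, max(1, round(width * scale)))
    new_h = min(canvas_h, max(1, round(height * scale)))
    pad_x = canvas_w - new_w
    pad_y = canvas_h - new_h
    # Split the padding evenly; any odd pixel goes to the right or bottom.
    left, top = pad_x // 2, pad_y // 2
    return {"size": (new_w, new_h), "pad": (left, top, pad_x - left, pad_y - top)}


# A wide text line: 960x240 shrinks to 480x120 and is padded vertically.
layout = fit_to_canvas(960, 240)
```

The returned size and padding can be passed to any image library's resize and pad routines.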

A complete OCR system needs a pipeline with both text detection and recognition modules
Use a light-weight model for detection
Crop bounding boxes to create a batch of new images
Process images with OFA-OCR for text recognition
Recent methods use Transformer for improved performance
Vision-language pretraining has leveled up model performance on downstream tasks
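
The detection-then-recognition pipeline described above (detect boxes, crop them into a batch, recognize each crop) can be sketched as follows. The detector and recognizer are passed in as plain callables and replaced by stubs here; in a real system they would be the light-weight detection model and OFA-OCR. All names, the box convention, and the nested-list image format are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right and bottom are exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by one bounding box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_ocr(
    image: Image,
    detect: Callable[[Image], Sequence[Box]],
    recognize: Callable[[List[Image]], List[str]],
) -> List[Tuple[Box, str]]:
    """Detect text boxes, crop them into a batch, and recognize the whole batch."""
    boxes = list(detect(image))
    crops = [crop(image, box) for box in boxes]
    texts = recognize(crops) if crops else []
    return list(zip(boxes, texts))


# Stubs standing in for the real detection and recognition models.
def fake_detect(image: Image) -> List[Box]:
    return [(0, 0, 3, 2), (2, 1, 6, 4)]


def fake_recognize(crops: List[Image]) -> List[str]:
    # Report each crop's width x height instead of real text.
    return [f"{len(c[0])}x{len(c)}" for c in crops]


page = [[0] * 6 for _ in range(4)]  # blank 6x4 image
results = run_ocr(page, fake_detect, fake_recognize)
```

Passing the crops to the recognizer as one batch mirrors the step of creating a batch of new images from the cropped boxes.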

OFA-OCR achieves high accuracy but has high costs
Future work will focus on distilling or compressing OFA-OCR to a light-weight model
Model should faithfully reflect its input
Future research will focus on improving downstream performance and increasing controllability
OFA-OCR outperforms previous state-of-the-art
Data augmentation improves performance