Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition
- Recast text recognition as image captioning and directly transfer a unified vision-language pretrained model
- OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark
- Constructed an OCR pipeline with OFA-OCR that achieves performance competitive with a product-level API
Paper Content
Introduction
- Optical Character Recognition (OCR) is used to extract text from images
- Text recognition is a key challenge in OCR
- Deep learning methods are used to improve accuracy
- Transformer encoder-decoder framework has been applied to text recognition
- Complex model and objective designs can hinder reproduction
- Multimodal pretraining can boost performance in text recognition
- Finetuning a unified multimodal pretrained model on text recognition datasets can achieve high accuracy
- Ablation studies demonstrate the effectiveness of the proposed method
Finetuning with image captioning
- Text recognition can be recast as image captioning
- Finetune model with maximum likelihood estimation
- Input images are made square by resizing and padding
- Position embeddings are interpolated to adapt to higher-resolution images
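The resize-and-pad step above can be sketched as a small geometry helper. This is a minimal illustration, not the authors' code: it assumes the longer edge is scaled to the target resolution with the aspect ratio preserved, and that the remaining space is padded evenly on both sides. The function name `square_layout` and the default `target=480` are hypothetical.

```python
def square_layout(width, height, target=480):
    """Plan a resize-then-pad step that turns a width x height image into a
    target x target square (assumed scheme, see the lead-in).

    Returns (new_w, new_h, left, top, right, bottom): the resized size and
    the padding, in pixels, to add on each side.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")

    # Scale so that the longer edge matches the target resolution.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))

    # Split the leftover space as evenly as possible between the two sides.
    pad_w = target - new_w
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2
    return new_w, new_h, left, top, pad_w - left, pad_h - top


if __name__ == "__main__":
    # A wide text line: 960 x 240 pixels.
    print(square_layout(960, 240))
```

An image library would then apply the returned size and padding; a wide text-line crop ends up with padding above and below, a tall one with padding on the left and right.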
Multitask finetuning
- Experiments were conducted on a Chinese text recognition benchmark with 4 subtasks.
- Multitask finetuning and single-task finetuning were implemented for comparison.
- Multitask finetuning achieved outstanding performance on all datasets.
- Single-task finetuning was applied after multitask finetuning to further boost performance.
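The two-stage schedule described above (multitask finetuning, then single-task finetuning) can be sketched as follows. The summary does not say how the subtask datasets are mixed, so this sketch assumes plain concatenation and shuffling; the function names and the toy data are hypothetical.

```python
import random


def multitask_mixture(datasets, seed=0):
    """Merge per-subtask example lists into one shuffled list of
    (task_name, example) pairs. Assumed mixing scheme: concatenate, shuffle."""
    rng = random.Random(seed)
    mixed = [(task, ex) for task, examples in datasets.items() for ex in examples]
    rng.shuffle(mixed)
    return mixed


def two_stage_schedule(datasets, target_task, seed=0):
    """Return (stage1, stage2): stage1 covers every subtask, stage2 only the
    target subtask that the final model is specialized for."""
    if target_task not in datasets:
        raise KeyError(target_task)
    stage1 = multitask_mixture(datasets, seed)
    stage2 = [(target_task, ex) for ex in datasets[target_task]]
    return stage1, stage2


if __name__ == "__main__":
    # Toy stand-ins for (image, text) pairs from the four subtasks.
    toy = {
        "scene": ["s1", "s2"],
        "web": ["w1"],
        "document": ["d1", "d2", "d3"],
        "handwriting": ["h1"],
    }
    stage1, stage2 = two_stage_schedule(toy, "scene")
    print(len(stage1), len(stage2))
```

A training loop would iterate over `stage1` first and then continue from the resulting weights on `stage2`.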
Datasets and metrics
- Implemented OFA-OCR on Chinese text recognition benchmark
- Benchmark consists of multiple subtasks of text recognition
- Subtasks include scene, web, document, and handwriting
- Evaluation metric is accuracy
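As an illustration of the metric, the sketch below computes accuracy as the fraction of images whose predicted string matches the reference exactly. Treating accuracy as whole-string exact match is an assumption here, and any text normalization the benchmark may apply before comparison is not modeled.

```python
def recognition_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    preds = ["你好", "世界", "abc"]
    refs = ["你好", "世间", "abc"]
    print(recognition_accuracy(preds, refs))  # 2 of 3 strings match
```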
Experimental results
- OFA-OCR outperforms MaskOCR in all scenarios
- OFA-OCR has its largest advantage on the scene dataset
- Scaling up model size brings steady improvement in performance
Ablation study of training strategies
- Conducted an ablation study to evaluate the effects of multitask learning
- 4 setups: training from scratch, single-task finetuning, multitask finetuning, and multitask + single-task finetuning
- Initializing from the pretrained OFA model boosts performance on the datasets
- Multitask finetuning alone outperforms single-task finetuning on all 4 tasks
- Combination of multitask finetuning and single-task finetuning is the best solution
Ablation study of data augmentation
- Preprocessing of images can be used as data augmentation
- Performance improves with pretrained weight initialization and multitask finetuning
- Experiments on 4 datasets with single-task finetuning on base-size models
- Preprocessing technique can effectively boost performance
Deployment
- Need to build a pipeline with both text detection and recognition modules
- Use a light-weight model for detection
- Crop bounding boxes to create a batch of new images
- Process images with OFA-OCR for text recognition
- Recent methods use Transformer for improved performance
- Vision-language pretraining has leveled up model performance on downstream tasks
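The detect, crop, and recognize steps listed in this section can be sketched as a small pipeline. This is only an outline under stated assumptions: `detect` and `recognize_batch` are hypothetical callables standing in for the light-weight detector and for OFA-OCR, not real APIs, and the image is a plain row-major list of pixel rows so the example stays self-contained.

```python
def crop(image, box):
    """Crop a row-major image (list of rows) to box = (left, top, right, bottom),
    with right and bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def run_ocr(image, detect, recognize_batch):
    """Detect text boxes, crop them into a batch, and recognize the batch.

    detect(image) -> list of boxes (hypothetical detector interface)
    recognize_batch(crops) -> list of strings, one per crop (hypothetical)
    Returns a list of (box, text) pairs in detection order.
    """
    boxes = detect(image)
    crops = [crop(image, box) for box in boxes]
    texts = recognize_batch(crops) if crops else []
    if len(texts) != len(boxes):
        raise ValueError("recognizer must return one string per crop")
    return list(zip(boxes, texts))


if __name__ == "__main__":
    # 4 rows x 6 columns of dummy pixels, with stub detector and recognizer.
    image = [[r * 6 + c for c in range(6)] for r in range(4)]
    stub_detect = lambda img: [(0, 0, 3, 2), (2, 1, 6, 4)]
    stub_recognize = lambda crops: [f"{len(c)}x{len(c[0])}" for c in crops]
    print(run_ocr(image, stub_detect, stub_recognize))
```

Batching all crops into a single recognizer call reflects the "batch of new images" step; a real deployment would also resize and pad each crop before recognition.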
Conclusion
- OFA-OCR achieves high accuracy but incurs high costs
- Future work will focus on distilling or compressing OFA-OCR to a light-weight model
- Model should faithfully reflect its input
- Future research will focus on improving downstream performance and increasing controllability
- OFA-OCR outperforms previous state-of-the-art
- Data augmentation improves performance