Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition
  • Recast text recognition as image captioning and directly transfer a unified vision-language pretrained model
  • OFA-OCR outperforms baselines and achieves state-of-the-art performance in Chinese text recognition benchmark
  • Constructed an OCR pipeline with OFA-OCR and achieved competitive performance with product-level API

Paper Content

Introduction

  • Optical Character Recognition (OCR) is used to extract text from images
  • Text recognition is a key challenge in OCR
  • Deep learning methods are used to improve accuracy
  • Transformer encoder-decoder framework has been applied to text recognition
  • Complex model and objective designs can hinder reproduction
  • Multimodal pretraining can boost performance in text recognition
  • Finetuning a unified multimodal pretrained model on text recognition datasets can achieve high accuracy
  • Ablation studies demonstrate the effectiveness of the proposed method

Finetuning with image captioning

  • Text recognition can be recast as image captioning
  • Finetune model with maximum likelihood estimation
  • Input images are made square by resizing and padding
  • Interpolation used to adapt to larger resolution images

Multitask finetuning

  • Experiments were conducted on a Chinese text recognition benchmark with 4 subtasks.
  • Multitask finetuning and single-task finetuning were implemented for comparison.
  • Multitask finetuning achieved outstanding performance on all datasets.
  • Single-task finetuning was applied after multitask finetuning to further boost performance.

Datasets and metrics

  • Implemented OFA-OCR on Chinese text recognition benchmark
  • Benchmark consists of multiple subtasks of text recognition
  • Subtasks include scene, web, document, and handwriting
  • Evaluation metric is accuracy

Experimental results

  • OFA-OCR outperforms MaskOCR in all scenarios
  • OFA-OCR has largest advantage in scene dataset
  • Scaling up model size brings steady improvement in performance

Ablation study of training strategies

  • Conducted an ablation study to evaluate the effects of multitask learning
  • 4 setups: training from scratch, single-task finetuning, multitask-finetuning, and multitask + singletask finetuning
  • Addition of pretrained OFA model boosts performance on datasets
  • Multitask finetuning alone outperforms single-task finetuning on all 4 tasks
  • Combination of multitask finetuning and single-task finetuning is the best solution

Ablation study of data augmentation

  • Preprocessing of images can be used as data augmentation
  • Performance growth with pretrained weight initialization and multitask finetuning
  • Experiments on 4 datasets with single-task finetuning on base-size models
  • Preprocessing technique can effectively boost performance

Deployment

  • Need to build a pipeline with both text detection and recognition modules
  • Use a light-weight model for detection
  • Crop bounding boxes to create a batch of new images
  • Process images with OFA-OCR for text recognition
  • Recent methods use Transformer for improved performance
  • Vision-language pretraining has leveled up model performance on downstream tasks

Conclusion

  • OFA-OCR achieves high accuracy but has high costs
  • Future work will focus on distilling or compressing OFA-OCR to a light-weight model
  • Model should faithfully reflect its input
  • Future research will focus on improving downstream performance and increasing controllability
  • OFA-OCR outperforms previous state-of-the-art
  • Data augmentation improves performance