Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • We present a neural encoder-decoder model to convert images into presentational markup.
  • We introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup.
  • Our approach outperforms classical mathematical OCR systems.
  • We introduce a new coarse-to-fine attention layer to reduce inference complexity.

Paper Content

Introduction

  • Optical Character Recognition (OCR) is used to recognize natural language from an image.
  • Research has been done to convert images into structured language or markup.
  • The primary target of this research is OCR for mathematical expressions.
  • Systems combine specialized character segmentation with grammars of the underlying mathematical layout language.
  • INFTY system is used to convert printed mathematical expressions to LaTeX and other markup formats.
  • Research interest in OCR has increased due to refinement of deep neural models.
  • These neural models learn an abstract encoded representation of the input image, which is then decoded to generate a textual output.
  • Traditional OCR assumes a left-to-right ordering, which the proposed model does not require.
  • Model incorporates a multi-layer convolutional network over the image with an attention-based recurrent neural network decoder.
  • Model incorporates a multi-row recurrent model as part of the encoder.
  • Model uses a two-layer hard-soft approach to attention.

Problem: image-to-markup generation

  • Problem is converting an image to markup language
  • Source is an image, target is a sequence of tokens
  • Rendering is defined by a compile function
  • Supervised task is to learn to invert the compile function
  • Evaluation compares the rendered image of the predicted markup with the source image, rather than comparing markup strings
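
The setup above can be written compactly. The symbols below ($x$, $y$, $\Sigma$, $\hat{y}$, $\mathcal{X}$) are notation chosen here for illustration and may differ from the paper's own.

```latex
\begin{align*}
  x &\in \mathcal{X} && \text{source image} \\
  y &= (y_1, \dots, y_T), \quad y_t \in \Sigma && \text{target markup tokens} \\
  \operatorname{compile} &: \Sigma^{*} \to \mathcal{X} && \text{rendering function} \\
  \hat{y} &\ \text{such that}\ \operatorname{compile}(\hat{y}) \approx x && \text{learned inverse}
\end{align*}
```

Because different token sequences can render to the same image, comparing rendered images is a fairer test of the learned inverse than comparing strings.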

Model

  • Model uses a full grid encoder over the input image
  • Model adapted from Xu et al. (2015)
  • Includes a row encoder
  • Extracts image features using a CNN
  • Arranges features in a grid
  • Each row is encoded using an RNN
  • Decoder implements a conditional language model
  • Model trained to maximize likelihood of observed markup
  • Visual features extracted with a multi-layer CNN
  • Row encoder localizes relative positions within source image
  • Decoder RNN generates probability of next token given history and annotations
  • Context vector captures context information from annotation grid
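
The row encoder step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a plain tanh RNN run left to right, random weights, and a random nested list standing in for the CNN feature grid. The names `row_encoder`, `random_matrix`, `features`, and `annotations` are hypothetical.

```python
import math
import random


def row_encoder(grid, w_in, w_rec):
    """Run a plain tanh RNN left to right over every row of a feature grid.

    grid:  H x W x D nested lists (stand-in for the CNN feature map)
    w_in:  K x D input weights
    w_rec: K x K recurrent weights
    Returns an H x W x K grid of annotation vectors.
    """
    hidden_size = len(w_in)
    encoded = []
    for row in grid:
        h = [0.0] * hidden_size  # fresh state at the start of each row
        out_row = []
        for cell in row:
            # New state depends on the current cell and the previous state,
            # so each annotation carries information about cells to its left.
            h = [
                math.tanh(
                    sum(w_in[k][d] * cell[d] for d in range(len(cell)))
                    + sum(w_rec[k][j] * h[j] for j in range(hidden_size))
                )
                for k in range(hidden_size)
            ]
            out_row.append(h)
        encoded.append(out_row)
    return encoded


def random_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, W, D, K = 2, 3, 4, 5
features = [[[rng.random() for _ in range(D)] for _ in range(W)] for _ in range(H)]
annotations = row_encoder(features, random_matrix(K, D, rng), random_matrix(K, K, rng))
print(len(annotations), len(annotations[0]), len(annotations[0][0]))
```

The output grid keeps the H x W layout of the input, so the decoder can still attend to individual cells while each cell now reflects its horizontal context.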

Attention in markup generation

  • Accuracy of the model depends on tracking the current position in the image
  • A latent categorical variable is defined to indicate which cell the model is attending to
  • Three forms of attention: standard, hierarchical, and coarse-to-fine
  • Standard attention uses neural network to approximate attention distribution
  • For attention to be efficient, the grid should be small and the support of the attention distribution should be small
  • Decoding complexity of this attention mechanism is O(THW), for T output tokens over an H × W feature grid
  • Hierarchical attention first attends to coarse grid, then fine cells
  • Coarse-to-fine attention uses sparse support to reduce number of fine attention cells
  • Two approaches for training sparse coarse distribution: sparsemax and hard attention
  • A discounted future reward $r_t = \sum_{s=t}^{T} \gamma^{s} r_s$ is used to mitigate the noise and slow convergence of hard-attention training
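
The sparsemax variant of coarse-to-fine attention can be sketched as below. This is an illustration under stated assumptions, not the paper's code: attention scores are passed in precomputed (a real system would compute fine scores lazily, only for surviving regions), each coarse cell owns a fixed list of fine cells, and the function names are hypothetical. The `sparsemax` routine follows the standard sort-and-threshold projection onto the probability simplex.

```python
import math


def softmax(scores):
    """Dense attention distribution: every cell gets non-zero weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of scores onto the simplex; can yield exact zeros."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(ordered, start=1):
        cumulative += z
        if 1 + k * z > cumulative:
            tau = (cumulative - 1) / k  # threshold for the largest valid k
    return [max(s - tau, 0.0) for s in scores]


def coarse_to_fine_attention(coarse_scores, fine_scores):
    """Weight fine cells only inside coarse cells kept by sparsemax.

    coarse_scores: one score per coarse cell
    fine_scores:   fine_scores[c] holds the scores of the fine cells in coarse cell c
    Returns {(coarse_index, fine_index): weight} for the surviving cells.
    """
    coarse_probs = sparsemax(coarse_scores)
    weights = {}
    for c, p_c in enumerate(coarse_probs):
        if p_c == 0.0:
            continue  # pruned region: its fine cells are skipped entirely
        for i, p_fine in enumerate(softmax(fine_scores[c])):
            weights[(c, i)] = p_c * p_fine
    return weights


coarse = [2.0, 1.8, -1.0, -3.0]
fine = [[0.1, 0.5, 0.2, 0.0] for _ in coarse]
attention = coarse_to_fine_attention(coarse, fine)
print(len(attention), "of", sum(len(f) for f in fine), "fine cells visited")
```

In this toy input, sparsemax keeps two of the four coarse cells, so only half of the fine cells are visited. Hard attention would instead sample a single coarse cell, which is where the REINFORCE reward comes in.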

Dataset construction

  • IM2LATEX-100K is a public dataset of real-world mathematical expressions written in LaTeX
  • The dataset contains 103,556 different LaTeX math equations along with their rendered images
  • The formulas are extracted from over 60,000 papers
  • The formulas are split into minimal meaningful LaTeX tokens
  • An optional normalization step is used to eliminate common spurious ambiguity
  • A synthetic handwritten corpus of the IM2LATEX-100K dataset is created using Detexify’s training data
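
To make "minimal meaningful LaTeX tokens" concrete, here is a rough regex-based tokenizer. It is a simplified stand-in and makes no claim to match the dataset's actual preprocessing; the pattern and the `tokenize` name are assumptions for illustration, and the normalization step is not covered.

```python
import re

# Alternatives are tried in order, so control sequences win over single characters.
TOKEN_PATTERN = re.compile(
    r"\\[A-Za-z]+"   # control words such as \frac or \alpha
    r"|\\."          # control symbols such as \{ or \,
    r"|[A-Za-z0-9]"  # single letters and digits
    r"|\S"           # any other non-space character
)


def tokenize(formula):
    """Split a LaTeX formula into a flat list of tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(formula)


print(tokenize(r"\frac{a_1}{b^2} + \alpha"))
# ['\\frac', '{', 'a', '_', '1', '}', '{', 'b', '^', '2', '}', '+', '\\alpha']
```

Splitting at this granularity keeps the output vocabulary small, since commands like `\frac` become single symbols rather than character sequences.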

Experiments

  • Experiments compare proposed model (IM2TEX) to classical OCR baselines, neural models, and model ablations
  • Compared to commercial OCR-based mathematical expression recognition system InftyReader
  • Simulate image captioning setup with model CAPTION
  • Use CTC implementation of Shi et al. (2016)
  • Run experiments with different attention styles
  • Experiment with standard attention system with coarse feature maps only
  • Experiment with two-layer hierarchical model
  • Experiment with different coarse-to-fine (C2F) mechanisms
  • Compare model to other models for handwritten mathematical expressions on CROHME 2013 and 2014 shared tasks
  • Implementation uses Torch (Collobert et al., 2011) and is based on the OpenNMT system (Klein et al., 2017)

Results

  • INFTY system performs well in terms of text accuracy, but poorly on exact match image metrics
  • Strict left-to-right order assumption is unsuitable for this task
  • Use of coarse-only system leads to large drop in accuracy, indicating fine attention is crucial
  • Hard REINFORCE system and sparsemax reduce the number of fine attention lookups at a small cost in accuracy
  • Models achieve comparable performance to best systems except MyScript, which has access to additional in-domain data

Conclusion

  • Presented a visual attention-based model for OCR of presentational markup
  • Introduced a new dataset IM2LATEX-100K
  • Proposed a coarse-to-fine attention layer to reduce attention complexity
  • Data-driven models can be effective without any knowledge of the language
  • Coarse-to-fine attention mechanism is general and applicable to other domains
  • Model generates one LaTeX symbol at a time based on the input image
  • Dataset contains 103,556 images and corresponding LaTeX formulas
  • Network structure includes a CNN, RNN encoder, and RNN decoder with visual attention mechanism
  • Results reported on IM2LATEX-100K dataset and CROHME handwriting datasets