Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- We present a neural encoder-decoder model to convert images into presentational markup.
- We introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup.
- Our approach outperforms classical mathematical OCR systems.
- We introduce a new coarse-to-fine attention layer to reduce inference complexity.
Paper Content
Introduction
- Optical Character Recognition (OCR) is used to recognize natural language from an image.
- Research has been done to convert images into structured language or markup.
- Target of research is OCR for mathematical expressions.
- Systems combine specialized character segmentation with grammars of the underlying mathematical layout language.
- INFTY system is used to convert printed mathematical expressions to LaTeX and other markup formats.
- Research interest in OCR has increased due to refinement of deep neural models.
- Systems learn an abstract encoded representation of the input image which is then decoded to generate a textual output.
- Traditional OCR task assumes left-to-right ordering.
- Model incorporates a multi-layer convolutional network over the image with an attention-based recurrent neural network decoder.
- Model incorporates a multi-row recurrent model as part of the encoder.
- Model uses a two-layer hard-soft approach to attention.
Problem: image-to-markup generation
- Problem is converting an image to markup language
- Source is an image, target is a sequence of tokens
- Rendering is defined by a compile function
- Supervised task is to learn to invert the compile function
- Evaluation is done between rendered images
Model
- Model uses a full grid encoder over the input image
- Model adapted from Xu et al. (2015)
- Includes a row encoder
- Extracts image features using a CNN
- Arranges features in a grid
- Each row is encoded using an RNN
- Decoder implements a conditional language model
- Model trained to maximize likelihood of observed markup
- Visual features extracted with a multi-layer CNN
- Row encoder localizes relative positions within source image
- Decoder RNN generates probability of next token given history and annotations
- Context vector captures context information from annotation grid
Attention in markup generation
- Accuracy of model is dependent on tracking current position of image
- Define latent categorical variable to indicate which cell model is attending to
- Three forms of attention: standard, hierarchical, and coarse-to-fine
- Standard attention uses neural network to approximate attention distribution
- Attention distribution is important for grid to be small and support of distribution is small
- Decoding complexity of attention mechanism is O(T HW)
- Hierarchical attention first attends to coarse grid, then fine cells
- Coarse-to-fine attention uses sparse support to reduce number of fine attention cells
- Two approaches for training sparse coarse distribution: sparsemax and hard attention
- Reward rt = T s=t γ s r s used to reduce noise and slow convergence
Dataset construction
- IM2LATEX-100K is a public dataset of real-world mathematical expressions written in LaTeX
- The dataset contains 103,556 different LaTeX math equations and rendered pictures
- The formulas are extracted from over 60,000 papers
- The formulas are split into minimal meaningful LaTeX tokens
- An optional normalization step is used to eliminate common spurious ambiguity
- A synthetic handwritten corpus of the IM2LATEX-100K dataset is created using Detexify’s training data
Experiments
- Experiments compare proposed model (IM2TEX) to classical OCR baselines, neural models, and model ablations
- Compared to commercial OCR-based mathematical expression recognition system InftyReader
- Simulate image captioning setup with model CAPTION
- Use CTC implementation of Shi et al. (2016)
- Run experiments with different attention styles
- Experiment with standard attention system with coarse feature maps only
- Experiment with two-layer hierarchical model
- Experiment with different coarse-to-fine (C2F) mechanisms
- Compare model to other models for handwritten mathematical expressions on CROHME 2013 and 2014 shared tasks
- Use Torch (Collobert et al., 2011) based on OpenNMT system (Klein et al., 2017)
Results
- INFTY system performs well in terms of text accuracy, but poorly on exact match image metrics
- Strict left-to-right order assumption is unsuitable for this task
- Use of coarse-only system leads to large drop in accuracy, indicating fine attention is crucial
- Hard REINFORCE system and sparsemax reduce lookups at small cost in accuracy
- Models achieve comparable performance to best systems except MyScript, which has access to additional in-domain data
Conclusion
- Presented a visual attention-based model for OCR of presentational markup
- Introduced a new dataset IM2LATEX-100K
- Proposed a coarse-to-fine attention layer to reduce attention complexity
- Data-driven models can be effective without any knowledge of the language
- Coarse-to-fine attention mechanism is general and applicable to other domains
- Model generates one LaTeX symbol at a time based on the input image
- Dataset contains 103,556 images and corresponding LaTeX formulas
- Network structure includes a CNN, RNN encoder, and RNN decoder with visual attention mechanism
- Results reported on IM2LATEX-100K dataset and CROHME handwriting datasets