Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

We present a neural encoder-decoder model to convert images into presentational markup.
We introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup.
Our approach outperforms classical mathematical OCR systems.
We introduce a new coarse-to-fine attention layer to reduce inference complexity.

Optical Character Recognition (OCR) is used to recognize natural language from an image.
Research has been done to convert images into structured language or markup.
Target of research is OCR for mathematical expressions.
Systems combine specialized character segmentation with grammars of the underlying mathematical layout language.
INFTY system is used to convert printed mathematical expressions to LaTeX and other markup formats.
Research interest in OCR has increased due to refinement of deep neural models.
Systems learn an abstract encoded representation of the input image which is then decoded to generate a textual output.
Traditional OCR task assumes left-to-right ordering.
Model incorporates a multi-layer convolutional network over the image with an attention-based recurrent neural network decoder.
Model incorporates a multi-row recurrent model as part of the encoder.
Model uses a two-layer hard-soft approach to attention.

Accuracy of model is dependent on tracking current position of image
Define latent categorical variable to indicate which cell model is attending to
Three forms of attention: standard, hierarchical, and coarse-to-fine
Standard attention uses neural network to approximate attention distribution
Attention distribution is important for grid to be small and support of distribution is small
Decoding complexity of attention mechanism is O(T HW)
Hierarchical attention first attends to coarse grid, then fine cells
Coarse-to-fine attention uses sparse support to reduce number of fine attention cells
Two approaches for training sparse coarse distribution: sparsemax and hard attention
Reward rt = T s=t γ s r s used to reduce noise and slow convergence

IM2LATEX-100K is a public dataset of real-world mathematical expressions written in LaTeX
The dataset contains 103,556 different LaTeX math equations and rendered pictures
The formulas are extracted from over 60,000 papers
The formulas are split into minimal meaningful LaTeX tokens
An optional normalization step is used to eliminate common spurious ambiguity
A synthetic handwritten corpus of the IM2LATEX-100K dataset is created using Detexify’s training data

Experiments compare proposed model (IM2TEX) to classical OCR baselines, neural models, and model ablations
Compared to commercial OCR-based mathematical expression recognition system InftyReader
Simulate image captioning setup with model CAPTION
Use CTC implementation of Shi et al. (2016)
Run experiments with different attention styles
Experiment with standard attention system with coarse feature maps only
Experiment with two-layer hierarchical model
Experiment with different coarse-to-fine (C2F) mechanisms
Compare model to other models for handwritten mathematical expressions on CROHME 2013 and 2014 shared tasks
Use Torch (Collobert et al., 2011) based on OpenNMT system (Klein et al., 2017)

INFTY system performs well in terms of text accuracy, but poorly on exact match image metrics
Strict left-to-right order assumption is unsuitable for this task
Use of coarse-only system leads to large drop in accuracy, indicating fine attention is crucial
Hard REINFORCE system and sparsemax reduce lookups at small cost in accuracy
Models achieve comparable performance to best systems except MyScript, which has access to additional in-domain data

Presented a visual attention-based model for OCR of presentational markup
Introduced a new dataset IM2LATEX-100K
Proposed a coarse-to-fine attention layer to reduce attention complexity
Data-driven models can be effective without any knowledge of the language
Coarse-to-fine attention mechanism is general and applicable to other domains
Model generates one LaTeX symbol at a time based on the input image
Dataset contains 103,556 images and corresponding LaTeX formulas
Network structure includes a CNN, RNN encoder, and RNN decoder with visual attention mechanism
Results reported on IM2LATEX-100K dataset and CROHME handwriting datasets