Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Visual language data is common in the human world.
  • Current vision-language models do not perform well on these data.
  • MatCha is proposed to enhance visual language models’ capabilities.
  • MatCha pretraining outperforms state-of-the-art methods by nearly 20%.
  • MatCha pretraining is useful for broader visual language tasks.

Paper Content

Introduction

  • Visual language is a system that uses text and visuals to convey meaning
  • Visual language is found in textbooks, scientific papers, web pages, etc.
  • Structural units of visual language include line, shape, color, orientation, scale, angle, space, etc.
  • Little research on visual language understanding from the machine learning community
  • Pix2Struct is a pretraining strategy for visually-situated language that outperforms standard vision-language models
  • Two key ingredients for visual language understanding: layout understanding and mathematical reasoning
  • Proposed two pretraining tasks for enhancing visual language understanding: chart derendering and math reasoning
  • Tested on ChartQA, PlotQA, Chart-to-Text summarization, and other datasets
  • Achieved SOTA results on ChartQA, PlotQA, and Chart-to-Text summarization
  • Vision-language research has focused on natural images
  • Visual reasoning datasets are mostly in the natural image domain
  • Synthesized datasets are in the visual language domain, but their visual language systems are simpler than real-world examples
  • Current vision-language models can handle synthesized datasets, but not real-world datasets
  • OCR-based and end-to-end methods for visually-situated language have been developed
  • Learning to reason by designing novel pretraining tasks is related to the literature
  • Examples of novel pretraining tasks include numerical reasoning, synthesizing data and programs, and injecting knowledge

Method

  • Layout understanding and basic math operation capabilities are key for visual language understanding/reasoning
  • Chart derendering and math reasoning are proposed as pretraining tasks

Chart derendering

  • Plots and charts are generated by data tables and code.
  • To understand a chart, one needs to discover visual patterns and extract key information.
  • To collect data for pretraining, (chart, code) and (chart, table) pairs are collected from GitHub IPython notebooks and manually written code.

Math reasoning

  • Reasoning over visual language requires effective recognition and grouping of visual elements and applying mathematical operations
  • Plot derendering addresses recognition but not mathematical operations
  • Propose to inject numerical reasoning knowledge to image-to-text model by learning from two existing textual math datasets
  • MATH contains two million training examples per module, DROP has 96k question and answer pairs
  • Keep applying screenshot parsing pretraining from Pix2Struct

Experiment

  • Experimental setup detailed in §4.1
  • Main results introduced in §4.2
  • Results on Pix2Struct tasks in §4.3
  • Pretraining ablations in §4.4

Experimental setups

  • Pretraining tasks are 40% math reasoning, 40% chart derendering, and 20% screenshot parsing
  • Chart derendering has four sources of data
  • Math reasoning has two sets: human and augmented
  • PlotQA has two sets: v1 and v2
  • Chart-to-Text has two sets: Pew and Statista
  • Metrics for ChartQA and PlotQA is relaxed correct-6
  • Chart-to-Text uses BLEU4
  • Pretraining uses batch size of 512 and max sequence length of 192
  • Finetuning uses batch size of 256 and max sequence length of 128

Main results

  • MATCHA outperforms all baselines and SOTAs on all three chart/plot-domain benchmarks
  • PaLI, a SOTA for VQA and captioning on natural images, fails significantly on ChartQA and PlotQA
  • Increasing input resolution helps performance, but also increases sequence length
  • PaLI performs reasonably well on Chart-to-Text due to its language modeling objective

Results on pix2struct tasks

  • MATCHA was tested on chart/plot domain datasets, documents, user interfaces, and natural images.
  • Results showed that MATCHA outperformed Pix2Struct by 2.3% on average.
  • Even when ChartQA was excluded, MATCHA still outperformed Pix2Struct by 1.6%.

Ablations and analyses

  • Two types of ablations conducted: component-level and individual dataset
  • Removing any major component causes a performance drop
  • Chart derendering is the most important component
  • Math reasoning is more important for human set, chart derendering for augmented set
  • DROP dataset more important than MATH dataset for numerical reasoning

Conclusion

  • Proposed pretraining method MATCHA for visual language tasks
  • Learns to predict underlying data tables and code given chart images
  • Decodes answers of math questions rendered in form of images
  • Establishes new SOTA on 5 out of 6 setups across three chart domain benchmarks
  • Transfers knowledge outside of pretraining domain
  • Chart derendering essential for extractive questions
  • Math pretraining important for queries requiring complex reasoning
  • Room for improvement on queries requiring complex reasoning
  • Visual language is an umbrella term
  • Data collected by Masry et al. (2022) and Saxton et al. (2019)
  • Baselines used in Table 2: T5, VL-T5, VisionTaPas, T5-OCR, VL-T5-OCR, VisionTaPas-OCR
  • CRCT (Levy et al., 2022) best performing model on PlotQA
  • Continue pretraining Pix2Struct with original objective
  • Chart-to-code pretraining component important