Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Visual Question Answering (VQA) and Image Captioning (CAP) are popular vision-language tasks.
Scene-text versions of these tasks require reasoning from the text in the image.
Task-specific methods can either see or read, but not both.
UniTNT is a Unified Text-Non-Text approach that grants existing multimodal architectures scene-text understanding capabilities.
UniTNT leads to the first single model that successfully handles both task types.
Scene-text understanding capabilities can boost vision-language models’ performance on VQA and CAP.

Paper Content

Background and motivation

Analyzed the performance of general and scene-text-oriented models on VL benchmarks
Used VQAv2, TextVQA, and ST-VQA for VQA, and COCO Captions and TextCaps for CAP

Visual question answering

Numerous methods have been suggested to advance the state-of-the-art in VQA tasks.
Pretraining and task-specific fine-tuning are used to benefit from image-caption pairs.
Three leading VQA models have low performance on scene-text VQA datasets.
Scene-text VQA methods leverage OCR output, pretraining, and layout information.
Scene-text VQA datasets have biases that discourage models from relying on visual modality.

Image captioning

Models trained on TextCaps do not perform well on general image captioning benchmarks
Models trained on general image captioning benchmarks do not perform well on TextCaps
BLIP obtains a CIDEr score of 61.9 on TextCaps, compared to 90.1 for M4C-Captioner on COCO Captions
TAP outperforms M4C on TextCaps, but not on COCO Captions

Method

UniTNT grants pretrained VL models the ability to reason over scene-text information.
An off-the-shelf OCR system is used to extract scene-text information.
OCR information is encoded using a dedicated OCR encoder.
Positional information is used to represent OCR instances.
Auxiliary losses are used to encourage the pretrained decoder to utilize the OCR information.
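
To make the idea of representing OCR instances with positional information concrete, here is a minimal sketch. It is not the paper's encoder: the OcrInstance container, the six-value feature layout, and the top-to-bottom reading order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrInstance:
    """One word detected by an OCR system (hypothetical container)."""
    text: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def positional_features(inst: OcrInstance, width: int, height: int) -> List[float]:
    """Return the box scaled to [0, 1], plus its relative width and height."""
    x0, y0, x1, y1 = inst.box
    nx0, ny0, nx1, ny1 = x0 / width, y0 / height, x1 / width, y1 / height
    return [nx0, ny0, nx1, ny1, nx1 - nx0, ny1 - ny0]


def reading_order(instances: List[OcrInstance]) -> List[OcrInstance]:
    """Sort top-to-bottom, then left-to-right (a simple assumed ordering)."""
    return sorted(instances, key=lambda i: (i.box[1], i.box[0]))


# Two made-up detections on a 400x400 image.
words = [
    OcrInstance("STOP", (100.0, 50.0, 300.0, 150.0)),
    OcrInstance("AHEAD", (120.0, 200.0, 280.0, 260.0)),
]
features = [positional_features(w, width=400, height=400) for w in words]
```

Normalizing by image size keeps the features comparable across images of different resolutions. In a real model these values would typically be projected and combined with the token embeddings.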

Architecture

UniTNT is compatible with visual question answering and image captioning tasks.
UniTNT is model agnostic and can be applied to any encoder-decoder-based VL model.
UniTNT is integrated into two top-performing open-source methods.
UniTNT adds two components to integrate OCR information into the decoder.

Scene-text auxiliary losses

Propose two auxiliary losses to better fuse scene-text information
OCR Causal Language Modeling to predict next OCR token
OCR Binary Classification to predict if OCR token is part of ground-truth answer
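
The two auxiliary objectives can be illustrated by showing how their training targets might be built. This is a sketch under assumptions, not the paper's implementation: whole-word matching against the answer, lowercasing, and the helper names clm_pairs, answer_labels, and binary_cross_entropy are all hypothetical.

```python
import math
from typing import List, Tuple


def clm_pairs(ocr_tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Causal LM pairs: each prefix of the OCR sequence predicts the next token."""
    return [(ocr_tokens[:i], ocr_tokens[i]) for i in range(1, len(ocr_tokens))]


def answer_labels(ocr_tokens: List[str], answer: str) -> List[int]:
    """Binary labels: 1 if the OCR token appears in the ground-truth answer."""
    answer_words = set(answer.lower().split())
    return [1 if tok.lower() in answer_words else 0 for tok in ocr_tokens]


def binary_cross_entropy(probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy between predicted probabilities and labels."""
    eps = 1e-12  # avoids log(0)
    losses = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# Made-up OCR tokens and answer, for illustration only.
ocr_tokens = ["coca", "cola", "classic", "330ml"]
labels = answer_labels(ocr_tokens, answer="Coca Cola")
pairs = clm_pairs(ocr_tokens)
loss = binary_cross_entropy([0.9, 0.8, 0.2, 0.1], labels)
```

Here labels marks the first two tokens as part of the answer, and pairs holds three prefix-to-next-token examples. Both objectives push the decoder to attend to the OCR stream instead of ignoring it.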

Training procedure

The method starts from a trained general encoder-decoder VL model, which is modified as described in the paper.
The VL model’s pre-existing image encoder is frozen and UniTNT is trained on a unified dataset.
The training objective adds the two scene-text auxiliary losses to the task-dependent loss of the base architecture; α₁ and α₂ are tunable hyperparameters that weight the auxiliary terms.
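
Written out, the combined objective plausibly takes the following form. The subscript names and the pairing of each weight with a specific auxiliary loss are assumptions made for illustration; the summary only states that there is a base loss and two tunable weights.

```latex
\mathcal{L} = \mathcal{L}_{\text{base}}
  + \alpha_1 \, \mathcal{L}_{\text{OCR-CLM}}
  + \alpha_2 \, \mathcal{L}_{\text{OCR-BC}}
```

Setting both weights to zero recovers the base model's original training loss.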

Experiments

Experimentally examined method performance on VQA and CAP tasks
Used 3 standard benchmarks for VQA and 2 for CAP
Used Amazon Text-in-Image for OCR information
Reported performance on each benchmark and non-weighted average
Adopted open-vocabulary unconstrained generation scheme
Evaluated performance on scene-text and general benchmarks
Compared to models with similar capacity
Presented new evaluation setting for scene-text VQA
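
The non-weighted average mentioned above treats every benchmark equally, regardless of how many examples it contains. The sketch below contrasts it with a size-weighted mean. The benchmark names, scores, and sizes are placeholders and are not results from the paper.

```python
from typing import Dict


def non_weighted_average(scores: Dict[str, float]) -> float:
    """Plain mean over benchmarks: each counts equally, regardless of size."""
    return sum(scores.values()) / len(scores)


def size_weighted_average(scores: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Mean weighted by number of examples, shown only for contrast."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Placeholder numbers for illustration only; not results from the paper.
scores = {"general_benchmark": 80.0, "scene_text_benchmark": 40.0}
sizes = {"general_benchmark": 9000, "scene_text_benchmark": 1000}

plain = non_weighted_average(scores)             # 60.0
weighted = size_weighted_average(scores, sizes)  # 76.0
```

With a size-weighted mean, the larger general benchmark would dominate and mask weak scene-text performance. The non-weighted mean avoids this.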

Visual question answering experiments

Unifying the training datasets improves results on both the original task and the analogous scene-text task.
There is still a substantial performance gap between scene-text models and VQA methods.
UniTNT leads to the best average score and improves the base models on both benchmarks.

Image captioning experiments

UniTNT is compared to top-performing methods on CAP and VQA tasks.
Combined training leads to performance improvements for both BLIP and M4C-Captioner.
UniTNT leads to additional improvements over each benchmark.

A subset for reasoning over all modalities

VQA data is composed of three categories: ‘see’, ‘read’, and ‘see-∩-read’
Most questions in current benchmarks fall into ‘see’ or ‘read’ categories
480 image-question pairs from TextVQA validation set manually curated to measure models’ capabilities on ‘see-∩-read’ questions
UniTNT substantially better at reasoning over scene-text and visual information at once
UniTNT leads to best performance on ‘see-∩-read’ questions
UniTNT treats scene-text information as additional modality and gradually fuses it with existing VL features
UniTNT trained on unified Text-Non-Text VQA and CAP datasets
OCR conveys meaningful information, leading to significant improvement of UniTNT
Freezing Visual Encoder significantly improves results on VQA and TextVQA
Current state-of-the-art methods are incapable of properly reasoning over both scene-text and vision information
VL research community should aim to develop models which can reason over vision, language, and scene-text altogether
UniTNT leads to comparable results with other, much larger models
UniTNT capable of reasoning over both visual and scene-text information
Granting scene-text understanding also benefits VQAv2
BLIP incapable of incorporating scene-text information
M4C overfitted for TextCaps, causing it to fail completely on COCO Captions
