Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Visual Question Answering (VQA) and Image Captioning (CAP) are popular vision-language tasks.
- Scene-text versions of these tasks require reasoning from the text in the image.
- Task-specific methods can either see or read, but not both.
- UniTNT is a Unified Text-Non-Text approach that grants existing multimodal architectures scene-text understanding capabilities.
- UniTNT leads to the first single model that successfully handles both task types.
- Scene-text understanding capabilities can boost vision-language models’ performance on VQA and CAP.
Paper Content
Background and motivation
- Analyzed performance of general and scenetext-oriented models on VL benchmarks
- Used VQAv2, TextVQA, ST-VQA for VQA and COCO Captions and TextCaps for CAP
Visual question answering
- Numerous methods have been suggested to advance the state-of-the-art in VQA tasks.
- Pretraining and task-specific fine-tuning are used to benefit from image-caption pairs.
- Three leading VQA models have low performance on scene-text VQA datasets.
- Scene-text VQA methods leverage OCR output, pretraining, and layout information.
- Scene-text VQA datasets have biases that discourage models from relying on visual modality.
Image captioning
- Models trained on TextCaps do not perform well on general image captioning benchmarks
- Models trained on general image captioning benchmarks do not perform well on TextCaps
- BLIP obtains a CIDEr score of 61.9 on TextCaps, compared to 90.1 of M4C-Captioner on COCO captions
- TAP outperforms M4C on TextCaps, but not on COCO Captions
Method
- UniTNT grants pretrained VL models the ability to reason over scene-text information.
- An off-the-shelf-OCR system is used to extract scene-text information.
- OCR information is encoded using a dedicated OCR encoder.
- Positional information is used to represent OCR instances.
- Auxiliary losses are used to encourage the pretrained decoder to utilize the OCR information.
Architecture
- UniTNT is compatible with visual question answering and image captioning tasks.
- UniTNT is model agnostic and can be applied to any encoder-decoder-based VL model.
- UniTNT is integrated into two top-performing open-source methods.
- UniTNT adds two components to integrate OCR information into the decoder.
Scene-text auxiliary losses
- Propose two auxiliary losses to better fuse scene-text information
- OCR Causal Language Modeling to predict next OCR token
- OCR Binary Classification to predict if OCR token is part of ground-truth answer
Training procedure
- Our method consists of a trained general encoder-decoder VL model which is modified as described in the paper.
- The VL model’s pre-existing image encoder is frozen and UniTNT is trained on a unified dataset.
- The base task-dependent loss term is used in the base architecture, and α 1 and α 2 are tunable hyperparameters.
Experiments
- Experimentally examined method performance on VQA and CAP tasks
- Used 3 standard benchmarks for VQA and 2 for CAP
- Used Amazon Text-in-Image1 for OCR information
- Reported performance on each benchmark and non-weighted average
- Adopted open-vocabulary unconstrained generation scheme
- Evaluated performance on scene-text and general benchmarks
- Compared to models with similar capacity
- Presented new evaluation setting for scene-text VQA
Visual question answering experiments
- Unifying the training datasets improves results on both the original task and the analogous scene-text task.
- There is still a substantial performance gap between scene-text models and VQA methods.
- UniTNT leads to the best average score and improves the base models on both benchmarks.
Image captioning experiments
- UniTNT is compared to top-performing methods on CAP and VQA tasks.
- Combined training leads to performance improvements for both BLIP and M4C-Captioner.
- UniTNT leads to additional improvements over each benchmark.
A subset for reasoning over all modalities
- VQA data is composed of three categories: ‘see’, ‘read’, and ‘see-∩-read’
- Most questions in current benchmarks fall into ‘see’ or ‘read’ categories
- 480 image-question pairs from TextVQA validation set manually curated to measure models’ capabilities on ‘see-∩-read’ questions
- UniTNT substantially better at reasoning over scene-text and visual information at once
- UniTNT leads to best performance on ‘see-∩-read’ questions
- UniTNT treats scene-text information as additional modality and gradually fuses it with existing VL features
- UniTNT trained on unified Text-Non-Text VQA and CAP datasets
- OCR conveys meaningful information, leading to significant improvement of UniTNT
- Freezing Visual Encoder significantly improves results on VQA and TextVQA
- Current state-of-the-art methods are incapable of properly reasoning over both scene-text and vision information
- VL research community should aim to develop models which can reason over vision, language, and scene-text altogether
- UniTNT leads to comparable results with other, much larger models
- UniTNT capable of reasoning over both visual and scene-text information
- Granting scene-text understanding also benefit VQAv2
- BLIP incapable of incorporating scene-text information
- M4C overfitted for TextCaps, causing it to fail completely on COCO Captions