Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

A new dataset, SlideVQA, has been proposed for developing document VQA systems.
SlideVQA contains 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions.
SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning.
Annotated arithmetic expressions of numerical answers are provided to enhance numerical reasoning.
A new end-to-end document VQA model has been developed.
Experiments show that the model outperforms existing state-of-the-art QA models but still falls well short of human performance.

Paper Content

Introduction

Building intelligent agents that can read and comprehend real-world documents is a long-standing goal of AI
Machine reading comprehension (MRC) is a central task in natural language understanding
MRC typically involves extracting a span from a text in response to a question
Visual question answering on document images (document VQA) has been studied to address real-world applications
Current models are not capable of performing reasoning across multiple images
SlideVQA is a new dataset for tasks involving a slide deck composed of multiple slide images and a corresponding question
SlideVQA requires complex reasoning over slide images, including single-hop, multi-hop, and numerical reasoning

Related work

Datasets for VQA on document images have been published, such as DocVQA, VisualMRC, WebSRC, and InfographicVQA
SlideVQA consists of 14.5k questions, while DocCVQA provides only 20 questions
SlideVQA requires multi-hop reasoning over multiple slides to find the answer, while DocCVQA requires only single-hop reasoning on individual images
SlideVQA provides questions that require numerical reasoning and arithmetic expression annotations
Transformers have come to be used for understanding unstructured text in document images
LayoutLM, LayoutLMv2, LayoutT5, and TILT have achieved impressive results in single-image document VQA tasks
Multi-modal question answering takes textual and visual information as input contexts
VideoQA focuses on answering questions about video frames of TV shows and movies
VQA on image sets involves handling photos taken indoors from different viewpoints
Slide image understanding involves object segmentation on slide pages and generating slides from research papers
Numerical reasoning plays an important role in NLP tasks
SlideVQA requires a system to answer a question about a slide deck and select evidence slide images

Dataset collection

Recruited crowd workers from English-speaking countries
Quality control with other workers
Collected 25,327 slide decks from SlideShare
Filtered decks to meet criteria
Annotated images with bounding boxes and categories
Created 12,466 single-hop QA pairs
Created 2,018 multi-hop QA pairs
Annotated arithmetic expressions

Statistics and analysis

SlideVQA contains 14,484 QA pairs from 2,619 slide decks
Contains 52,480 slide images annotated with 890,945 bounding boxes
Split into 10,617 questions for training, 1,652 for development, and 2,215 for test
Largest number of images and bounding box annotations
Bounding boxes broken down into nine categories
Extracted text from images using Google Cloud Vision API
Questions require complex reasoning including single/multi-hop, and numerical reasoning
25.5% of numerical questions require arithmetic operations
32.4% of answers are multi-span or non-span
Frequent first words of questions are “In” and “Regarding”
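
The counts quoted in this summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the summary above.
total_qa = 14_484
splits = {"train": 10_617, "dev": 1_652, "test": 2_215}
by_hops = {"single_hop": 12_466, "multi_hop": 2_018}
slide_images = 52_480
slide_decks = 2_619


def consistent(parts: dict, total: int) -> bool:
    """Return True when the parts add up to the stated total."""
    return sum(parts.values()) == total


# Average number of slide images per deck.
images_per_deck = slide_images / slide_decks

print(consistent(splits, total_qa), consistent(by_hops, total_qa))
print(round(images_per_deck, 2))
```

Both breakdowns add up to the 14,484 QA pairs, and the image and deck counts work out to roughly 20 slide images per deck.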

Our model

The proposed model is called M3D
Fusion-in-Decoder (FiD) is used as base model
FiD is initialized with pre-trained T5
SlideVQA task is performed using multi-task learning
Arithmetic expressions are predicted as intermediate reasoning steps
Input sequence is modified to learn visual layout and content of image
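
To illustrate what predicting an arithmetic expression as an intermediate step could look like downstream, here is a minimal sketch that turns a generated expression string into a numerical answer. It assumes the expression is plain infix text with `+`, `-`, `*`, `/` and parentheses; the actual annotation format and the function name `evaluate_expression` are assumptions, not details taken from the paper.

```python
import ast
import operator

# Binary operators allowed in an expression (assumed set, not from the paper).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_expression(expr: str) -> float:
    """Evaluate a simple arithmetic expression without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Anything else (names, calls, attributes) is rejected.
        raise ValueError(f"unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


print(evaluate_expression("63 - 37"))
print(evaluate_expression("(40 + 20) / 2"))
```

Walking the syntax tree instead of calling `eval` keeps the sketch safe on model-generated text, since only numbers and the four basic operators are accepted.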

Multi-modal task-specific input

Input token sequence is used to train the evidence selection and question answering tasks
An OCR engine is used to parse the slide image and obtain OCR tokens
Input embeddings are defined by utilizing multi-modal information, including token, segment, layout, and visual embeddings
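
One common way to combine several embedding types is to add them element-wise for each token. The sketch below assumes that approach; the summary only says that token, segment, layout, and visual embeddings are all used, so the summation, the helper name `sum_embeddings`, and the toy vectors are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def sum_embeddings(*parts: Vector) -> Vector:
    """Element-wise sum of equally sized embedding vectors."""
    if not parts:
        raise ValueError("at least one embedding is required")
    if len({len(p) for p in parts}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    return [sum(values) for values in zip(*parts)]


# Toy 2-dimensional embeddings for a single OCR token (made-up values).
token = [1.0, 2.0]
segment = [0.5, 0.5]
layout = [0.25, 0.0]
visual = [0.0, 0.25]

combined = sum_embeddings(token, segment, layout, visual)
print(combined)
```

In a real model each vector would come from a learned lookup table or a visual feature extractor, and a normalization layer would typically follow the sum.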

Multi-modal encoder-decoder

Multi-modal encoder consists of m Transformer blocks
Input sequences are encoded independently and then concatenated
Answer/Arithmetic-expression decoder is a stack of m Transformer blocks with cross-attention
Decoder models answer generation as a conditional generation
Model can perform numerical reasoning by predicting annotated arithmetic expressions
Evidence selector shares weights and architecture of answer/arithmetic-expression decoder
Model is trained by minimizing two losses
During inference, model decides whether numerical reasoning is required
Main task baselines include hierarchical LayoutLMv2, T5, PreasM, and LayoutT5
Evidence selection baselines include BM25, CLIP, BERT, LayoutLMv2, H-LayoutLMv2, BinaryClass, and ChainGen
Question answering baselines include Q-only, UniVL, and FiD
Human performance evaluated by 6 crowdworkers
Evaluation metrics include EM, F1, JEM, and JF1
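
For readers unfamiliar with EM and F1, the following sketch shows how the two answer metrics are commonly computed in extractive QA. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the summary does not state the paper's exact normalization, and the joint metrics JEM and JF1 are not reproduced.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """EM: the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """F1: harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Red Car", "red car"))
print(token_f1("red car", "blue car"))
```

EM rewards only a perfect match after normalization, while F1 gives partial credit when the prediction shares some tokens with the reference answer.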

Implementation details

Implemented models in PyTorch
Used 8 Tesla V100 32GB GPUs
Used AdamW with learning rate of 5e-5 and dropout rate of 10%
Batch size of 32
Evaluated models every 500 steps
Maximum length of 200 tokens for M3D
Maximum target sequence length of 50
Trained Faster-RCNN with ResNet-101
Used SGD with learning rate of 1e-3 and batch size of one
Standard anchor scales and ratios used
Created a new video at a rate of 5 frames per second
Used Google Cloud Vision API to extract text and bounding boxes

Experimental results and analysis

M3D outperformed baselines on joint EM/F1
H-LayoutLMv2 and M3D performed better than baselines on evidence selection task
M3D outperformed pipeline methods on QA task
Adding modality information improved performance in all tasks
LayoutT5 outperformed LayoutLMv2
MultiGen decoder obtained highest performance
Detecting randomly placed and small boxes is more difficult than detecting fixed and large boxes
M3D successfully extracted information and generated same answer as ground-truth

Discussion and limitations

SlideVQA is the largest document VQA benchmark that uses multiple images as input and requires multi-hop reasoning.
An editing method is used to guarantee multi-hop questions and extend the dataset size.
Cross-attention is applied over all evidence candidates, which can become computationally expensive when there are many input images.
Models that use a two-stage selector to narrow down candidates and an answer generator in an end-to-end manner are promising.
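
The first stage of such a two-stage design can be sketched as a cheap scorer that keeps only the top-k slides before any expensive cross-attention is run. The example below uses simple word overlap as a stand-in scorer; this is a toy illustration, not the selector proposed or evaluated in the paper, and the names `overlap_score` and `select_top_k` are hypothetical.

```python
from typing import List


def overlap_score(question: str, slide_text: str) -> int:
    """Count distinct words shared by the question and the slide text."""
    return len(set(question.lower().split()) & set(slide_text.lower().split()))


def select_top_k(question: str, slides: List[str], k: int) -> List[int]:
    """Return indices of the k best-scoring slides, in deck order."""
    ranked = sorted(
        range(len(slides)),
        key=lambda i: (-overlap_score(question, slides[i]), i),
    )
    return sorted(ranked[:k])


# Made-up OCR text for a four-slide deck.
slides = [
    "agenda and welcome",
    "revenue grew 20 percent in 2019",
    "revenue forecast for 2020",
    "thank you",
]
candidates = select_top_k("what was revenue in 2019", slides, k=2)
print(candidates)
```

Only the selected slides would then be passed to the answer generator, so the cost of cross-attention grows with k rather than with the full deck length.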

Conclusion

Introduced a new document VQA dataset, SlideVQA
Focused on understanding slide decks composed of multiple images
Introduced a unified end-to-end model, M3D
Model can perform evidence selection and question answering tasks
Model can enhance numerical reasoning by generating arithmetic expressions
Evaluation highlighted the promise of this approach, but also revealed a gap compared to human performance
Dataset will contribute to development of intelligent assistant agents
