Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- A new dataset, SlideVQA, has been proposed for developing document VQA systems.
- SlideVQA contains 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions.
- SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning.
- Annotated arithmetic expressions of numerical answers are provided to enhance numerical reasoning.
- A new end-to-end document VQA model has been developed.
- Experiments show that the model outperforms existing state-of-the-art QA models, but has a large gap behind human performance.
Paper Content
Introduction
- Building intelligent agents that can read and comprehend real-world documents is a long-standing goal of AI
- Machine reading comprehension (MRC) is a central task in natural language understanding
- MRC typically involves extracting a span from a text in response to a question
- Visual question answering on document images (document VQA) has been studied to address real-world applications
- Current models are not capable of performing reasoning across multiple images
- SlideVQA is a new dataset for tasks involving a slide deck composed of multiple slide images and a corresponding question
- SlideVQA requires complex reasoning over slide images, including single-hop, multi-hop, and numerical reasoning
- 63% of journalists rely on internal meetings, tip-offs, events and primary research for story ideas
- Most journalists from the South look for story triggers in competitive media
- Most journalists from the East draw on events to evolve fresh story ideas
Related work
- Datasets for VQA on document images have been published, such as DocVQA, VisualMRC, Web-SRC, and InfographicVQA
- Slide-VQA consists of 14.5k questions, while DocCVQA provides only 20 questions
- SlideVQA requires multihop reasoning over multiple slides to find the answer, while DocCVQA requires only single-hop reasoning on individual images
- Slide-VQA provides questions that require numerical reasoning and arithmetic expression annotations
- Transformer has come to be used for understanding unstructured text in document images
- LayoutLM, LayoutLMv2, LayoutT5, and TILT have achieved impressive results in single-image document VQA tasks
- Multi-modal question answering takes textual and visual information as input contexts
- VideoQA focuses on answering questions about video frames of TV shows and movies
- VQA on image sets involves handling photos taken from different viewpoint indoors
- Slide images understanding involves object segmentation on slide-pages and generating slides from research papers
- Numerical reasoning plays an important role in NLP tasks
- SlideVQA requires a system to answer a question about a slide deck and select evidence slide images
Dataset collection
- Recruited crowd workers from English-speaking countries
- Quality control with other workers
- Collected 25,327 slide decks from slideshare
- Filtered decks to meet criteria
- Annotated images with bounding boxes and categories
- Created 12,466 single-hop QA pairs
- Created 2,018 multi-hop QA pairs
- Annotated arithmetic expressions
Statistics and analysis
- SlideVQA contains 14,484 QA pairs from 2,619 slide decks
- Contains 52,480 slide images annotated with 890,945 bounding boxes
- Split into 10,617 questions for training, 1,652 (2,215) questions for development (test)
- Largest number of images and bounding box annotations
- Bounding boxes broken down into nine categories
- Extracted text from images using Google Cloud Vision API
- Questions require complex reasoning including single/multi-hop, and numerical reasoning
- 25.5% of numerical questions require arithmetic operations
- 32.4% of answers require multi-span and non-span
- Frequent first words of questions are “In” and “Regarding”
Our model
- Model called M3D is used
- Fusion-in-Decoder (FiD) is used as base model
- FiD is initialized with pre-trained T5
- SlideVQA task is performed using multi-task learning
- Arithmetic expressions are predicted as intermediate reasoning steps
- Input sequence is modified to learn visual layout and content of image
Multi-modal task-specific input
- Input token sequence is used to train the evidence selection and question answering tasks
- An OCR engine is used to parse the slide image and obtain OCR tokens
- Input embeddings are defined by utilizing multi-modal information, including token, segment, layout, and visual embeddings
Multi-modal encoder-decoder
- Multi-modal encoder consists of m Transformer blocks
- Input sequences are encoded independently and then concatenated
- Answer/Arithmetic-expression decoder is a stack of m Transformer blocks with cross-attention
- Decoder models answer generation as a conditional generation
- Model can perform numerical reasoning by predicting annotated arithmetic expressions
- Evidence selector shares weights and architecture of answer/arithmetic-expression decoder
- Model is trained by minimizing two losses
- During inference, model decides whether numerical reasoning is required
- Main task baselines include hierarchical LayoutLMv2, T5, PreasM, and LayoutT5
- Evidence selection baselines include BM25, CLIP, BERT, LayoutLMv2, H-LayoutLMv2, BinaryClass, and ChainGen
- Question answering baselines include Q-only, UniVL, and FiD
- Human performance evaluated by 6 crowdworkers
- Evaluation metrics include EM, F1, JEM, and JF1
Implementation details
- Implemented models in PyTorch
- Used 8 Tesla V100 32GB GPUs
- Used AdamW with learning rate of 5e-5 and dropout rate of 10%
- Batch size of 32
- Evaluated models every 500 steps
- Maximum length of 200 tokens for M3D
- Maximum target sequence length of 50
- Trained Faster-RCNN with ResNet-101
- Used SGD with learning rate of 1e-3 and batch size of one
- Standard anchor scales and ratios used
- Created new video at rate of 5 frames per second
- Used Google Cloud Vision API to extract text and bounding boxes
Experimental results and analysis
- M3D outperformed baselines on joint EM/F1
- H-LayoutLMv2 and M3D performed better than baselines on evidence selection task
- M3D outperformed pipeline methods on QA task
- Adding modality information improved performance in all tasks
- LayoutT5 outperformed LayoutLMv2
- MultiGen decoder obtained highest performance
- Detecting randomly placed and small boxes is more difficult than fixed and large boxes
- M3D successfully extracted information and generated same answer as ground-truth
Discussion and limitations
- SlideVQA is the largest document VQA benchmark that uses multiple images as input and requires multi-hop reasoning.
- An editing method is used to guarantee multi-hop questions and extend the dataset size.
- Cross-attention is used on all evidence candidates, which can cause a computational problem with a lot of input images.
- Models that use a two-stage selector to narrow down candidates and an answer generator in an end-to-end manner are promising.
Conclusion
- Introduced a new document VQA dataset, SlideVQA
- Focused on understanding slide decks composed of multiple images
- Introduced a unified end-to-end model, M3D
- Model can perform evidence selection and question answering tasks
- Model can enhance numerical reasoning by generating arithmetic expressions
- Evaluation highlighted the promise of this approach, but also revealed a gap compared to human performance
- Dataset will contribute to development of intelligent assistant agents