  • A new dataset, SlideVQA, has been proposed for developing document VQA systems.
  • SlideVQA contains 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions.
  • SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning.
  • Annotated arithmetic expressions of numerical answers are provided to enhance numerical reasoning.
  • A new end-to-end document VQA model has been developed.
  • Experiments show that the model outperforms existing state-of-the-art QA models, but still falls well short of human performance.

Paper Content


  • Building intelligent agents that can read and comprehend real-world documents is a long-standing goal of AI
  • Machine reading comprehension (MRC) is a central task in natural language understanding
  • MRC typically involves extracting a span from a text in response to a question
  • Visual question answering on document images (document VQA) has been studied to address real-world applications
  • Current models are not capable of performing reasoning across multiple images
  • SlideVQA is a new dataset for tasks involving a slide deck composed of multiple slide images and a corresponding question
  • SlideVQA requires complex reasoning over slide images, including single-hop, multi-hop, and numerical reasoning
  • Datasets for VQA on document images have been published, such as DocVQA, VisualMRC, WebSRC, and InfographicVQA
  • SlideVQA consists of 14.5k questions, while DocCVQA provides only 20 questions
  • SlideVQA requires multi-hop reasoning over multiple slides to find the answer, while DocCVQA requires only single-hop reasoning on individual images
  • SlideVQA provides questions that require numerical reasoning, along with arithmetic expression annotations
  • Transformers are now widely used to understand unstructured text in document images
  • LayoutLM, LayoutLMv2, LayoutT5, and TILT have achieved impressive results in single-image document VQA tasks
  • Multi-modal question answering takes textual and visual information as input contexts
  • VideoQA focuses on answering questions about video frames of TV shows and movies
  • VQA on image sets involves handling photos taken indoors from different viewpoints
  • Slide image understanding involves object segmentation on slide pages and generating slides from research papers
  • Numerical reasoning plays an important role in NLP tasks
  • SlideVQA requires a system to answer a question about a slide deck and select evidence slide images
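
The task described in the last bullet (one question, one deck of slide images, an answer plus evidence slides) can be pictured as a small data structure. This is only a sketch: the class names, field names, file paths, and the example question are hypothetical and do not reflect the released dataset format, and treating "more than one evidence slide" as multi-hop is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SlideVQAExample:
    """One task instance: a question about a whole slide deck (hypothetical schema)."""
    question: str
    slide_image_paths: List[str]      # one path per slide image in the deck
    answer: str = ""                  # target answer text
    evidence_slides: List[int] = field(default_factory=list)  # 0-based slide indices


@dataclass
class SlideVQAPrediction:
    """What a system is expected to return for one example (hypothetical schema)."""
    answer: str
    evidence_slides: List[int]


def is_multi_hop(example: SlideVQAExample) -> bool:
    """Assumption: an example is multi-hop when it needs more than one evidence slide."""
    return len(example.evidence_slides) > 1


# Made-up example, not taken from the dataset.
example = SlideVQAExample(
    question="(hypothetical) How many more users were reported in 2014 than in 2013?",
    slide_image_paths=[f"deck_0001/slide_{i:02d}.jpg" for i in range(1, 4)],
    answer="15",
    evidence_slides=[0, 2],
)
prediction = SlideVQAPrediction(answer="15", evidence_slides=[0, 2])
```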

Dataset collection

  • Recruited crowd workers from English-speaking countries
  • Quality control with other workers
  • Collected 25,327 slide decks from slideshare
  • Filtered decks to meet criteria
  • Annotated images with bounding boxes and categories
  • Created 12,466 single-hop QA pairs
  • Created 2,018 multi-hop QA pairs
  • Annotated arithmetic expressions

Statistics and analysis

  • SlideVQA contains 14,484 QA pairs from 2,619 slide decks
  • Contains 52,480 slide images annotated with 890,945 bounding boxes
  • Split into 10,617 questions for training, 1,652 for development, and 2,215 for test
  • Largest number of images and bounding box annotations
  • Bounding boxes broken down into nine categories
  • Extracted text from images using Google Cloud Vision API
  • Questions require complex reasoning including single/multi-hop, and numerical reasoning
  • 25.5% of numerical questions require arithmetic operations
  • 32.4% of answers are multi-span or non-span rather than a single extracted span
  • Frequent first words of questions are “In” and “Regarding”
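
As a quick consistency check, the counts reported above can be cross-checked with a few lines of arithmetic. The numbers are copied from this summary; the variable names are illustrative, and the derived averages are simple ratios rather than figures reported by the authors.

```python
# Figures copied from the dataset collection and statistics bullets.
total_qa_pairs = 14_484
splits = {"train": 10_617, "dev": 1_652, "test": 2_215}
qa_by_type = {"single_hop": 12_466, "multi_hop": 2_018}
num_decks = 2_619
num_slides = 52_480
num_boxes = 890_945

# Both breakdowns should add up to the total number of QA pairs.
splits_total = sum(splits.values())
types_total = sum(qa_by_type.values())

# Derived ratios (not reported in the summary, just computed from it).
slides_per_deck = num_slides / num_decks
boxes_per_slide = num_boxes / num_slides
multi_hop_share = qa_by_type["multi_hop"] / total_qa_pairs

print(f"splits sum to total: {splits_total == total_qa_pairs}")
print(f"types sum to total:  {types_total == total_qa_pairs}")
print(f"slides per deck:     {slides_per_deck:.2f}")
print(f"boxes per slide:     {boxes_per_slide:.2f}")
print(f"multi-hop share:     {multi_hop_share:.1%}")
```

Both breakdowns sum to 14,484, and the deck size works out to roughly 20 slides on average.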

Our model

  • The proposed model is called M3D
  • Fusion-in-Decoder (FiD) is used as base model
  • FiD is initialized with pre-trained T5
  • SlideVQA task is performed using multi-task learning
  • Arithmetic expressions are predicted as intermediate reasoning steps
  • Input sequence is modified to learn visual layout and content of image
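
To illustrate why predicting an arithmetic expression helps, the sketch below turns a generated expression string into a numeric answer. It is a minimal example, not the authors' implementation: the expression format (plain infix text such as `30 - 28`) and the function name `evaluate_expression` are assumptions, and the example expressions are made up.

```python
import ast
import operator

# Only the four basic binary operators are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_expression(expression: str) -> float:
    """Evaluate a simple infix arithmetic expression without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Names, calls, attribute access, etc. are rejected.
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


# Hypothetical model outputs: the expression is the intermediate reasoning step,
# and the final answer is obtained by evaluating it.
difference = evaluate_expression("30 - 28")
ratio = evaluate_expression("(40 - 25) / 25")
```

Generating the expression instead of the final number leaves the actual calculation to a deterministic evaluator, so the model only has to read the right operands from the slides and choose the operation.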

Multi-modal task-specific input

  • Input token sequence is used to train the evidence selection and question answering tasks
  • An OCR engine is used to parse the slide image and obtain OCR tokens
  • Input embeddings are defined by utilizing multi-modal information, including token, segment, layout, and visual embeddings
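
One common way to combine several embedding types, used by LayoutLM-style models, is to sum them per token and apply layer normalization. The sketch below assumes that scheme. The summary does not state how M3D combines the four embeddings, so the summation, the table sizes, the coordinate bucketing, and all names here are assumptions, and the randomly initialized tables stand in for learned weights.

```python
import random

HIDDEN = 8          # toy hidden size (assumption)
VOCAB_SIZE = 100    # toy vocabulary size (assumption)
NUM_SEGMENTS = 2    # e.g. 0 = question tokens, 1 = OCR tokens (assumption)
COORD_BUCKETS = 11  # box coordinates quantized to integers 0..10 (assumption)
VISUAL_DIM = 4      # size of a toy region feature vector (assumption)

rng = random.Random(0)


def make_table(rows, cols):
    """Randomly initialized lookup table standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


token_table = make_table(VOCAB_SIZE, HIDDEN)
segment_table = make_table(NUM_SEGMENTS, HIDDEN)
coord_table = make_table(COORD_BUCKETS, HIDDEN)
visual_proj = make_table(VISUAL_DIM, HIDDEN)


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def layout_embedding(box):
    """Sum one coordinate embedding per box value (x0, y0, x1, y1)."""
    return add(*(coord_table[c] for c in box))


def visual_embedding(features):
    """Linear projection of a region feature vector into the hidden size."""
    return [sum(f * visual_proj[i][j] for i, f in enumerate(features))
            for j in range(HIDDEN)]


def layer_norm(vector, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / (var + eps) ** 0.5 for v in vector]


def input_embedding(token_id, segment_id, box, features):
    """Token + segment + layout + visual embeddings, then layer normalization."""
    combined = add(
        token_table[token_id],
        segment_table[segment_id],
        layout_embedding(box),
        visual_embedding(features),
    )
    return layer_norm(combined)


# One OCR token with a quantized bounding box and a toy visual feature vector.
embedding = input_embedding(token_id=42, segment_id=1,
                            box=(1, 2, 5, 3), features=[0.3, 0.1, 0.0, 0.7])
```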

Multi-modal encoder-decoder

  • Multi-modal encoder consists of m Transformer blocks
  • Input sequences are encoded independently and then concatenated
  • Answer/Arithmetic-expression decoder is a stack of m Transformer blocks with cross-attention
  • Decoder models answer generation as a conditional generation
  • Model can perform numerical reasoning by predicting annotated arithmetic expressions
  • Evidence selector shares weights and architecture of answer/arithmetic-expression decoder
  • Model is trained by minimizing two losses
  • During inference, model decides whether numerical reasoning is required
  • Main task baselines include hierarchical LayoutLMv2 (H-LayoutLMv2), T5, PreasM, and LayoutT5
  • Evidence selection baselines include BM25, CLIP, BERT, LayoutLMv2, H-LayoutLMv2, BinaryClass, and ChainGen
  • Question answering baselines include Q-only, UniVL, and FiD
  • Human performance evaluated by 6 crowdworkers
  • Evaluation metrics include EM, F1, JEM, and JF1
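
For reference, exact match (EM) and token-level F1 are commonly computed as below for answer strings. This is a sketch of the usual SQuAD-style definitions, not the official SlideVQA evaluation script: the normalization steps (lowercasing, stripping punctuation and English articles) are assumptions. The joint variants (JEM, JF1) are not sketched because this summary does not define them.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_score = exact_match("The Cat", "cat")
f1_score = token_f1("red car", "red bus")
```

Here `em_score` is 1.0 because the article and casing are normalized away, and `f1_score` is 0.5 because one of two tokens overlaps on each side.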

Implementation details

  • Implemented models in PyTorch
  • Used 8 Tesla V100 32GB GPUs
  • Used AdamW with learning rate of 5e-5 and dropout rate of 10%
  • Batch size of 32
  • Evaluated models every 500 steps
  • Maximum length of 200 tokens for M3D
  • Maximum target sequence length of 50
  • Trained Faster-RCNN with ResNet-101
  • Used SGD with learning rate of 1e-3 and batch size of one
  • Standard anchor scales and ratios used
  • Created new video at rate of 5 frames per second
  • Used Google Cloud Vision API to extract text and bounding boxes
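
The hyperparameters listed above can be collected into a small configuration object. The values are copied from the bullets; the class and field names are hypothetical. Assigning the SGD settings to the Faster-RCNN detector is an inference from the bullet order rather than something the summary states, and it is likewise not stated whether the batch size of 32 is global or per GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QATrainingConfig:
    """Settings listed for training the QA models (field names are hypothetical)."""
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    dropout: float = 0.1            # "dropout rate of 10%"
    batch_size: int = 32            # global vs. per-GPU is not stated
    eval_every_steps: int = 500
    max_input_tokens: int = 200     # maximum length reported for M3D
    max_target_tokens: int = 50
    num_gpus: int = 8               # Tesla V100 32GB


@dataclass(frozen=True)
class DetectorTrainingConfig:
    """Settings assumed to belong to the Faster-RCNN object detector."""
    detector: str = "Faster-RCNN"
    backbone: str = "ResNet-101"
    optimizer: str = "SGD"
    learning_rate: float = 1e-3
    batch_size: int = 1


qa_config = QATrainingConfig()
detector_config = DetectorTrainingConfig()
```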

Experimental results and analysis

  • M3D outperformed baselines on joint EM/F1
  • H-LayoutLMv2 and M3D performed better than baselines on evidence selection task
  • M3D outperformed pipeline methods on QA task
  • Adding modality information improved performance in all tasks
  • LayoutT5 outperformed LayoutLMv2
  • MultiGen decoder obtained highest performance
  • Detecting randomly placed and small boxes is more difficult than fixed and large boxes
  • M3D successfully extracted information and generated the same answer as the ground truth

Discussion and limitations

  • SlideVQA is the largest document VQA benchmark that uses multiple images as input and requires multi-hop reasoning.
  • An editing method is used to guarantee multi-hop questions and extend the dataset size.
  • Cross-attention is applied over all evidence candidates, which becomes computationally expensive when there are many input images.
  • Models that use a two-stage selector to narrow down candidates and an answer generator in an end-to-end manner are promising.


  • Introduced a new document VQA dataset, SlideVQA
  • Focused on understanding slide decks composed of multiple images
  • Introduced a unified end-to-end model, M3D
  • Model can perform evidence selection and question answering tasks
  • Model can enhance numerical reasoning by generating arithmetic expressions
  • Evaluation highlighted the promise of this approach, but also revealed a gap compared to human performance
  • Dataset will contribute to development of intelligent assistant agents