Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Self-supervised learning in vision-language processing uses semantic alignment between imaging and text.
  • Prior work in biomedical VLP mostly used single image and report pairs.
  • This work uses a CNN-Transformer hybrid multi-image encoder.
  • The model excels on downstream tasks and achieves state-of-the-art performance.
  • A novel multi-modal temporal benchmark dataset was released to quantify the quality of vision-language representations.

Paper Content

Introduction

  • Self-supervision from image-text pairs has enabled the development of flexible general-purpose vision-language models.
  • Vision-language processing (VLP) has shown that cross-modal supervision can provide a richer signal for training both image and text models.
  • Our approach exploits domain knowledge by learning to incorporate a series of images and correlate them to reports.
  • Self-supervised VLP can reduce need for manual labels
  • Large-scale paired image-text datasets have led to development of general-purpose VLP models
  • Objectives include contrastive and discriminative image-text matching, auto-regressive captioning and multi-modal masked modelling
  • Paired medical image-report datasets used for supervised learning
  • Advances in general-domain self-supervised VLP benefit biomedical imaging applications
  • Temporal information not directly employed for self-supervision

Biovil-t training framework

  • Multi-image encoder extracts spatio-temporal features from image sequences
  • Text encoder incorporates optional cross-attention on image features
  • Models trained with image-guided MLM and cross-modal contrastive objectives
  • Image and text models adapted for uni- or multimodal downstream tasks
  • Datasets include prior image when it exists

Extracting spatio-temporal image features

  • Clinical findings often occur across different image regions and at the same time.
  • BioViL-T uses local correspondences between image regions across time.
  • Hybrid CNN-Transformer encoder model is used due to its data efficiency and flexibility.
  • Visual tokens are augmented with spatial and temporal positional encodings.
  • Features from current and prior images are concatenated to form the final image representation.

Text-supervision for spatio-temporal learning

  • Vector of M tokens of a report is passed through a BERT encoder
  • Image representations are extracted from single and multiple input scans
  • Text and image features are projected into a joint latent space
  • Local and global contrastive objectives are used with InfoNCE loss
  • MLM objective is used to predict masked tokens
  • Cross-attention is used to image features during MLM objective

Adaptations to downstream tasks

  • BioViL-T can be adapted to various downstream tasks
  • Projected text embeddings are marginalised prior to 2-normalisation
  • Cross-attention is used between text tokens
  • Prior report is used as a prompt to contextualise the report generation task
  • Curation of imaging datasets is conducted to choose higher quality images

Datasets & experiments

  • BioViL-T is a computer science model that can be used for a wide range of applications
  • BioViL-T achieves SOTA performance on various downstream tasks
  • MS-CXR-T is a new multi-modal benchmark dataset used to evaluate the model
  • The dataset contains 1326 multi-image and ground-truth label pairs across 5 findings
  • The dataset also contains 361 sentence pairs to quantify temporal-semantic similarity
  • MIMIC-CXR v2 is used for pre-training
  • Downstream evaluations are performed on a disjoint held-out test set
  • The model is compared to other domain-specific SOTA pre-training frameworks
  • Metrics used to evaluate the model include macroaccuracy, mIoU, CNR, BLEU, ROUGE-L, CheXbert, and TEM

Temporal pre-training yields data efficiency

  • Downstream tasks are enabled with minimal labels
  • BioViL-T enables efficient use of raw data
  • Zero-shot classification yields performance superior to prior fully-supervised work
  • With only 10% of labels, classification fine-tuning provides a further boost
  • BioViL-T leveraging prior images improves across all metrics

Static tasks benefit from temporal learning

  • BioViL-T improves performance on static tasks.
  • Establishes new SOTA on pneumonia classification from single images.
  • Text encoders learn temporal semantics through supervision from longitudinal image series.

Which tokens require a prior image in mlm?

  • Leveraged MLM objective to analyze influence of prior images in predicting masked tokens
  • Defined βˆ† prior img as change in loss by conditioning estimation with prior image
  • Figure 4 shows distribution of βˆ† prior img as a function of token category
  • Model heavily relies on prior image for image-guided MLM for Progression-type terms
  • Effect is specific to temporal tokens, not other semantic categories

Conclusion

  • Introduced BioViL-T, a visionlanguage pre-training framework
  • Enables alignment between text and multiple images
  • Novel multi-image encoder decomposes static-temporal features
  • Grounding of temporal references in text
  • First method capable of leveraging temporal content in biomedical text
  • Incorporating multi-modal temporal content provides strong learning signals
  • BioViL-T excels on static and temporal tasks
  • New SOTA on report generation, temporal image classification, few/zero-shot pneumonia detection, and phrase grounding
  • New multi-modal benchmark (MS-CXR-T) to measure quality of image and text representations
  • Model weights and code will be made publicly available
  • Can scale linearly with number of time points
  • Analysis of correlation between inter-rater agreement and image model’s prediction errors
  • Self-attention rollout maps in agreement with radiologist-annotated bounding boxes
  • Robust to pose variations
  • Used BioViL-T to perform pairwise ranking of instances in MIMIC-CXR
  • MS-CXR-T temporal image classification contains progression labels
  • RadGraph and Swaps used to create MS-CXR-T temporal sentence similarity benchmark
  • Temporal entity matching (TEM) score to quantify how well generated report describes progression-related information