Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Self-supervised learning in vision-language processing uses semantic alignment between imaging and text.
Prior work in biomedical VLP mostly used single image and report pairs.
This work uses a CNN-Transformer hybrid multi-image encoder.
The model excels on downstream tasks and achieves state-of-the-art performance.
A novel multi-modal temporal benchmark dataset was released to quantify the quality of vision-language representations.

Paper Content

Introduction

Self-supervision from image-text pairs has enabled the development of flexible general-purpose vision-language models.
Vision-language processing (VLP) has shown that cross-modal supervision can provide a richer signal for training both image and text models.
Our approach exploits domain knowledge by learning to incorporate a series of images and correlate them to reports.

Self-supervised VLP can reduce need for manual labels
Large-scale paired image-text datasets have led to development of general-purpose VLP models
Objectives include contrastive and discriminative image-text matching, auto-regressive captioning and multi-modal masked modelling
Paired medical image-report datasets used for supervised learning
Advances in general-domain self-supervised VLP benefit biomedical imaging applications
Temporal information not directly employed for self-supervision

Biovil-t training framework

Multi-image encoder extracts spatio-temporal features from image sequences
Text encoder incorporates optional cross-attention on image features
Models trained with image-guided MLM and cross-modal contrastive objectives
Image and text models adapted for uni- or multimodal downstream tasks
Datasets include prior image when it exists

Extracting spatio-temporal image features

Clinical findings often occur across different image regions and at the same time.
BioViL-T uses local correspondences between image regions across time.
Hybrid CNN-Transformer encoder model is used due to its data efficiency and flexibility.
Visual tokens are augmented with spatial and temporal positional encodings.
Features from current and prior images are concatenated to form the final image representation.

Text-supervision for spatio-temporal learning

Vector of M tokens of a report is passed through a BERT encoder
Image representations are extracted from single and multiple input scans
Text and image features are projected into a joint latent space
Local and global contrastive objectives are used with InfoNCE loss
MLM objective is used to predict masked tokens
Cross-attention is used to image features during MLM objective

Adaptations to downstream tasks

BioViL-T can be adapted to various downstream tasks
Projected text embeddings are marginalised prior to 2-normalisation
Cross-attention is used between text tokens
Prior report is used as a prompt to contextualise the report generation task
Curation of imaging datasets is conducted to choose higher quality images

Datasets & experiments

BioViL-T is a computer science model that can be used for a wide range of applications
BioViL-T achieves SOTA performance on various downstream tasks
MS-CXR-T is a new multi-modal benchmark dataset used to evaluate the model
The dataset contains 1326 multi-image and ground-truth label pairs across 5 findings
The dataset also contains 361 sentence pairs to quantify temporal-semantic similarity
MIMIC-CXR v2 is used for pre-training
Downstream evaluations are performed on a disjoint held-out test set
The model is compared to other domain-specific SOTA pre-training frameworks
Metrics used to evaluate the model include macroaccuracy, mIoU, CNR, BLEU, ROUGE-L, CheXbert, and TEM

Temporal pre-training yields data efficiency

Downstream tasks are enabled with minimal labels
BioViL-T enables efficient use of raw data
Zero-shot classification yields performance superior to prior fully-supervised work
With only 10% of labels, classification fine-tuning provides a further boost
BioViL-T leveraging prior images improves across all metrics

Static tasks benefit from temporal learning

BioViL-T improves performance on static tasks.
Establishes new SOTA on pneumonia classification from single images.
Text encoders learn temporal semantics through supervision from longitudinal image series.

Which tokens require a prior image in mlm?

Leveraged MLM objective to analyze influence of prior images in predicting masked tokens
Defined ∆ prior img as change in loss by conditioning estimation with prior image
Figure 4 shows distribution of ∆ prior img as a function of token category
Model heavily relies on prior image for image-guided MLM for Progression-type terms
Effect is specific to temporal tokens, not other semantic categories

Conclusion

Introduced BioViL-T, a visionlanguage pre-training framework
Enables alignment between text and multiple images
Novel multi-image encoder decomposes static-temporal features
Grounding of temporal references in text
First method capable of leveraging temporal content in biomedical text
Incorporating multi-modal temporal content provides strong learning signals
BioViL-T excels on static and temporal tasks
New SOTA on report generation, temporal image classification, few/zero-shot pneumonia detection, and phrase grounding
New multi-modal benchmark (MS-CXR-T) to measure quality of image and text representations
Model weights and code will be made publicly available
Can scale linearly with number of time points
Analysis of correlation between inter-rater agreement and image model’s prediction errors
Self-attention rollout maps in agreement with radiologist-annotated bounding boxes
Robust to pose variations
Used BioViL-T to perform pairwise ranking of instances in MIMIC-CXR
MS-CXR-T temporal image classification contains progression labels
RadGraph and Swaps used to create MS-CXR-T temporal sentence similarity benchmark
Temporal entity matching (TEM) score to quantify how well generated report describes progression-related information

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Biovil-t training framework#

Extracting spatio-temporal image features#

Text-supervision for spatio-temporal learning#

Adaptations to downstream tasks#

Datasets & experiments#

Temporal pre-training yields data efficiency#

Static tasks benefit from temporal learning#

Which tokens require a prior image in mlm?#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Biovil-t training framework

Extracting spatio-temporal image features

Text-supervision for spatio-temporal learning

Adaptations to downstream tasks

Datasets & experiments

Temporal pre-training yields data efficiency

Static tasks benefit from temporal learning

Which tokens require a prior image in mlm?

Conclusion