Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Self-supervised learning in vision-language processing uses semantic alignment between imaging and text.
- Prior work in biomedical VLP mostly used single image and report pairs.
- This work uses a CNN-Transformer hybrid multi-image encoder.
- The model excels on downstream tasks and achieves state-of-the-art performance.
- A novel multi-modal temporal benchmark dataset was released to quantify the quality of vision-language representations.
Paper Content
Introduction
- Self-supervision from image-text pairs has enabled the development of flexible general-purpose vision-language models.
- Vision-language processing (VLP) has shown that cross-modal supervision can provide a richer signal for training both image and text models.
- Our approach exploits domain knowledge by learning to incorporate a series of images and correlate them to reports.
Related work
- Self-supervised VLP can reduce need for manual labels
- Large-scale paired image-text datasets have led to development of general-purpose VLP models
- Objectives include contrastive and discriminative image-text matching, auto-regressive captioning and multi-modal masked modelling
- Paired medical image-report datasets used for supervised learning
- Advances in general-domain self-supervised VLP benefit biomedical imaging applications
- Temporal information not directly employed for self-supervision
Biovil-t training framework
- Multi-image encoder extracts spatio-temporal features from image sequences
- Text encoder incorporates optional cross-attention on image features
- Models trained with image-guided MLM and cross-modal contrastive objectives
- Image and text models adapted for uni- or multimodal downstream tasks
- Datasets include prior image when it exists
Extracting spatio-temporal image features
- Clinical findings often occur across different image regions and at the same time.
- BioViL-T uses local correspondences between image regions across time.
- Hybrid CNN-Transformer encoder model is used due to its data efficiency and flexibility.
- Visual tokens are augmented with spatial and temporal positional encodings.
- Features from current and prior images are concatenated to form the final image representation.
Text-supervision for spatio-temporal learning
- Vector of M tokens of a report is passed through a BERT encoder
- Image representations are extracted from single and multiple input scans
- Text and image features are projected into a joint latent space
- Local and global contrastive objectives are used with InfoNCE loss
- MLM objective is used to predict masked tokens
- Cross-attention is used to image features during MLM objective
Adaptations to downstream tasks
- BioViL-T can be adapted to various downstream tasks
- Projected text embeddings are marginalised prior to 2-normalisation
- Cross-attention is used between text tokens
- Prior report is used as a prompt to contextualise the report generation task
- Curation of imaging datasets is conducted to choose higher quality images
Datasets & experiments
- BioViL-T is a computer science model that can be used for a wide range of applications
- BioViL-T achieves SOTA performance on various downstream tasks
- MS-CXR-T is a new multi-modal benchmark dataset used to evaluate the model
- The dataset contains 1326 multi-image and ground-truth label pairs across 5 findings
- The dataset also contains 361 sentence pairs to quantify temporal-semantic similarity
- MIMIC-CXR v2 is used for pre-training
- Downstream evaluations are performed on a disjoint held-out test set
- The model is compared to other domain-specific SOTA pre-training frameworks
- Metrics used to evaluate the model include macroaccuracy, mIoU, CNR, BLEU, ROUGE-L, CheXbert, and TEM
Temporal pre-training yields data efficiency
- Downstream tasks are enabled with minimal labels
- BioViL-T enables efficient use of raw data
- Zero-shot classification yields performance superior to prior fully-supervised work
- With only 10% of labels, classification fine-tuning provides a further boost
- BioViL-T leveraging prior images improves across all metrics
Static tasks benefit from temporal learning
- BioViL-T improves performance on static tasks.
- Establishes new SOTA on pneumonia classification from single images.
- Text encoders learn temporal semantics through supervision from longitudinal image series.
Which tokens require a prior image in mlm?
- Leveraged MLM objective to analyze influence of prior images in predicting masked tokens
- Defined β prior img as change in loss by conditioning estimation with prior image
- Figure 4 shows distribution of β prior img as a function of token category
- Model heavily relies on prior image for image-guided MLM for Progression-type terms
- Effect is specific to temporal tokens, not other semantic categories
Conclusion
- Introduced BioViL-T, a visionlanguage pre-training framework
- Enables alignment between text and multiple images
- Novel multi-image encoder decomposes static-temporal features
- Grounding of temporal references in text
- First method capable of leveraging temporal content in biomedical text
- Incorporating multi-modal temporal content provides strong learning signals
- BioViL-T excels on static and temporal tasks
- New SOTA on report generation, temporal image classification, few/zero-shot pneumonia detection, and phrase grounding
- New multi-modal benchmark (MS-CXR-T) to measure quality of image and text representations
- Model weights and code will be made publicly available
- Can scale linearly with number of time points
- Analysis of correlation between inter-rater agreement and image model’s prediction errors
- Self-attention rollout maps in agreement with radiologist-annotated bounding boxes
- Robust to pose variations
- Used BioViL-T to perform pairwise ranking of instances in MIMIC-CXR
- MS-CXR-T temporal image classification contains progression labels
- RadGraph and Swaps used to create MS-CXR-T temporal sentence similarity benchmark
- Temporal entity matching (TEM) score to quantify how well generated report describes progression-related information