Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Designed and trained a Generative Image-to-text Transformer (GIT) to unify vision-language tasks
- Simplified architecture with one image encoder and one text decoder
- Scaled up pre-training data and model size to boost performance
- Established new state of the arts on 12 challenging benchmarks
- Surpassed human performance on TextCaps
- Presented a new scheme of generation-based image classification and scene text recognition
Paper Content
Introduction
- Pre-training on large-scale image-text pairs
- Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks used
- Task-specific adaptation needed
- Unified generative models for pre-training
- Multi-modal encoder and text decoder with careful design
- Generative Image-to-text Transformer (GIT) proposed
- GIT achieves new state of the arts across numerous challenging benchmarks
- Image encoder is a Swin-like vision transformer
- Text decoder is a transformer network
- Language modeling task used for pre-training
- New generation-based scheme for ImageNet classification proposed
Network architecture
- Image encoder based on contrastive pre-trained model
- Input is raw image, output is 2D feature map
- Extra linear layer and layernorm layer to project image features into D dimensions
- Text decoder is transformer module with self-attention layer and feed-forward layer
- Text tokenized and embedded into D dimensions
- Image features concatenated with text embeddings as input to transformer module
- Text decoder randomly initialized
- Alternative architecture is cross-attention-based decoder
- Self-attention-based decoder better with large-scale pre-training
Pre-training
- Model is trained using language modeling (LM) loss
- Alternative choice is MLM, which predicts 15% of input tokens
- LM can predict all tokens, which is more efficient for large-scale pre-training data
- Number of epochs is limited to 2 due to computational resource limitation
- Model is similar to GPT3 in architecture wise
Fine-tuning
- Applied same LM task to fine-tune GIT for image captioning
- For VQA, question and answer concatenated as new caption during fine-tuning
- Generative approach chosen over discriminative existing work
- No OCR engine used, model learns to read scene text with pre-training
- Simple architecture change for video domain
- Generation model applied to image classification task
Experiments
Setting
- 0.8B image-text pairs used for pre-training
- Image encoder initialized from pre-trained contrastive model
- Text decoder consists of 6 randomly-initialized transformer blocks
- 0.7 billion model parameters
- Learning rates of image encoder and decoder are 1e-5 and 5e-5 respectively
- Total number of epochs is 2
- Beam size is 4 and length penalty is 0.6 during inference
Results on image captioning and question answering
- Evaluated captioning performance on Karpathy split of COCO and Flickr30K
- Evaluated on nocaps, TextCaps, and VizWiz-Captions
- Achieved new SOTA performance on all metrics except COCO Karpathy test
- Model size is 0.7B, higher performance than CoCa (2.1B)
- Outperformed previous SOTA on TextCaps by 28.5 points in CIDEr
- Significantly benefited from more shots
- Achieved new SOTA on VizWiz-VQA and OCR-VQA
- Same performance as prior SOTA on ST-VQA
- Higher accuracy on TextVQA, lower on VQAv2 than Flamingo
- Model performs worse than discriminative model of Florence on VQAv2
Results on video captioning and question answering
- Performance is evaluated on MSVD, MSRVTT, YouCook2, VATEX, and TVC
- Results are shown in Table 5 and Table 6 for captioning and QA
- Model is not as complex as Tang et al. (2021)
- Model is better than Flamingo (Alayrac et al., 2022)
Results on image classification
- GIT is fine-tuned on ImageNet-1k
- GIT can achieve descent accuracy without pre-defining the vocabulary
- GIT is worse than Florence by 1.2 points
- GIT accuracy is 1.93% when exact match is required
- GIT accuracy is 40.88% when prediction contains ground-truth
- GIT accuracy is 33.48% when output tokens are limited to vocabulary
- GIT accuracy is improved with 1 or 5 shots per category
- GIT accuracy is higher than Flamingo
- GIT requires lightweight fine-tuning once and no training shots during inference
Results on scene text recognition
- Task aims to read scene text directly from image
- Evaluated in two settings
- Prediction is correct if caption is exact match to ground-truth
- Model evaluated on 6 standard benchmarks
- TextCaps-fine-tuned captioning model achieves 89.9 accuracy
- GIT achieves 92.9, surpassing prior arts of 91.9
Analysis
- Constructed two smaller pre-training datasets
- Used 30 epochs for small-scale datasets
- Named model as Huge
- Image encoder replaced with ViT-B/16 and ViT-L/14
- Performance drops with 0.8B data
- All model variants benefit from more pre-training data
- Difficult to effectively scale up text decoder
Scene text in pre-training data.
- 15% of CC12M images contain scene text descriptions
- 31% of downloaded images contain scene text descriptions
- OCR result must be longer than 5 characters to be considered matched
Conclusion
- Design and train a generative model to map input image to associated text description
- Model achieves new state-of-the-art performance on image/video captioning and question answering tasks
- Model surpasses human performance on TextCaps
- Model used to predict label name directly
- Limitations include difficulty controlling generated caption and in-context learning without parameter update
- Model improves performance and is more appropriate to help visually-impaired people
- Data preprocessing is faster than training and is overlapped with GPU training
- Model is fine-tuned with 10 epochs, batch size of 512 and learning rate of 2.5e-6
- Model achieves competitive performance with existing approaches
- Model can identify novel objects without object tags as network input
- Model is knowledgeable and can produce diverse and informative captions
- Model is used to caption scene text and recognize novel concepts