Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introducing Kosmos-1, a Multimodal Large Language Model (MLLM)
Trained from scratch on web-scale multimodal corpora
Evaluated on a wide range of tasks without any gradient updates or finetuning
Impressive performance on language understanding, generation, and OCR-free NLP
Also performs well on perception-language tasks and vision tasks
Cross-modal transfer from language to multimodal and from multimodal to language
Introducing a dataset of Raven IQ test to diagnose nonverbal reasoning capability of MLLMs

MLLMs can handle both language tasks and perception-intensive tasks
Evaluate KOSMOS-1 on various types of tasks
Evaluate perception-language capability of KOSMOS-1 under vision-language settings

Raven’s Progressive Matrices is a test to evaluate nonverbal reasoning.
It is used to measure an individual’s intelligence quotient (IQ).
The task is to identify the following element from six similar candidates.
Models need to conduct zero-shot nonverbal reasoning without explicitly fine-tuning.
The IQ task is a good testbed to benchmark the nonverbal in-context learning capability.

Web page question answering requires understanding of both the semantics and structure of texts.
Structure of web page (tables, lists, HTML layout) is important for how information is arranged and displayed.
Task can help evaluate model’s ability to understand web page semantics and structure.

Chain-of-thought prompting allows large language models to generate a series of reasoning steps.
It can decompose a multi-step problem into intermediate steps.
Table 12 shows results of zero-shot image classification without and with verbal descriptions.

KOSMOS-1 and LLM are evaluated on visual commonsense reasoning tasks.
KOSMOS-1 outperforms LLM on three object commonsense reasoning datasets.
KOSMOS-1 has modality transferability which enables it to transfer visual knowledge to language tasks.

Introduced KOSMOS-1, a multimodal large language model
Trained on web-scale multimodal corpora
Model size and speech capability to be scaled up
Unified interface for multimodal learning
Language tasks: understanding, generation, OCR-free text classification, cross-modal transfer, commonsense reasoning, nonverbal reasoning, IQ Test
Perception-language tasks: image captioning, visual question answering, web page question answering
Vision tasks: zero-shot image classification, zero-shot image classification with descriptions
Embedding module to encode text tokens and other input modalities
Training objective to maximize log-likelihood of tokens in examples
Text corpora: The Pile, Common Crawl, CC-Stories, RealNews
Zero-shot and few-shot image captioning results on COCO caption Karpathy test and Flickr30k test
Zero-shot and few-shot visual question answering results on VQAv2 and VizWiz
ROC AUC of 63.9% for HatefulMemes validation set
Test accuracy of 67.1% for Rendered SST-2 test set
Ablation study on language-only instruction tuning
Zero-shot visual commonsense reasoning on RELATIVESIZE, MEMORYCOLOR, and COLORTERMS datasets