Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability.
Previous selection methods for ICL are based on simple heuristics.
CEIL (Compositional Exemplars for In-context Learning) is proposed to model the interaction between the given input and in-context examples.
CEIL is validated on 12 classification and generation datasets from 7 distinct NLP tasks.
Experiments demonstrate the state-of-the-art performance, transferability and compositionality of CEIL.

Paper Content

Introduction

Artificial intelligence aims to develop models that can generalize to unseen tasks.
In-context learning (ICL) is a major step towards this goal.
ICL is sensitive to the selection of in-context examples.
CEIL (Compositional Exemplars for In-context Learning) models the joint probability of the entire in-context example set.

Preliminary

In-context learning

In-context learning (ICL) can be used to perform complex tasks
Growing concerns about ICL instability
Researchers have proposed learning-free and learning-based methods to mitigate instability
Learning-free methods require human experts and lead to sub-optimal performance
Learning-based methods have been proposed to push the envelope further

Determinantal point processes

DPPs measure both diversity and quality of items in a subset
DPPs have been applied to document and video summarization, recommendation systems, object detection, and multi-label classification
DPPs have been employed in in-context learning for compositional tasks
DPPs are framed into an end-to-end framework to capture interaction between in-context examples and reflect preference of LMs

Model

Introduces CEIL, a framework to learn composition of exemplars for in-context learning
Models joint probability of in-context example sets with a learnable conditional DPP and trained with contrastive learning

Modeling

In-context learning requires relevance and diversity.
A new kernel is defined to incorporate relevance and diversity.
A trade-off parameter is used to balance the magnitude of diversity and relevance.
Two embedders are used to encode input text and in-context examples.

Training

Introduce a contrastive learning framework to rectify the embedding of each in-context example and training instance
Training data consists of N instances, each containing one input instance and M in-context example subsets
Employ a two-stage framework to obtain a set of relevant examples
Use inference LM as scoring function to measure quality of each subset
Employ a fine-grained pair-wise margin loss to determine which subset is preferable

Inference

MAP inference of a DPP is NP-hard
Candidate space is narrowed down with KNN retriever
Greedy algorithm with O(nK2) complexity is used
Cholesky decomposition reduces complexity to O(nK)

Experiments

Conducted extensive experiments over 12 datasets
Spanning 7 distinct tasks
Showed better approach to in-context learning

Datasets and evaluation

Datasets and tasks are listed in Table 1
Different task formulations allow for extensive evaluations
Prompts and examples are in Appendix A.1
Accuracy reported for classification tasks
Exact Match and LF-EM reported for generation tasks
Results reported on validation set

Baselines

RANDOM: Retriever randomly selects in-context examples from training set without repetition
TOPK-BM25: Classical sparse retrieval method based on TF-IDF
TOPK-BERT: Dense retriever based on BERT embeddings
DPP-BERT: DPP retriever uses original BERT embedding without fine-tuning
EPR: Learning-based dense retriever trained to

Implementation details

GPT-Neo and GPT2-XL are used as language models
Exemplars are sorted based on similarity to input text
Classification tasks are reframed into multiple choice
Greedy decoding is used for answer generation
Training data is limited to 44,000 instances
Adam optimizer is used with batch size 128 and learning rate 1e-5
Trade-off factor λ is searched in {0.01, 0.05, 0.1}
BERT-based encoder is used to encode examples into embeddings

Main results

Experiments conducted on 12 datasets across 7 tasks
Generation tasks benefit more from better in-context examples than classification tasks
CEIL outperforms learning-free baselines and is especially effective on NLI tasks
CEIL outperforms EPR on all tasks
CEIL introduces no additional parameters compared to EPR and TOPK-BERT

Compositionality

CEIL has superior performance in predicting answers
CEIL learns to compose exemplars to help in predicting answers
CEIL is evaluated on two datasets and the results are shown in Table 3
CEIL improves performance on difficult splits on the two datasets
CEIL is an alternative approach to generating compositional programs without tuning inference LM or test-time question decomposition

Transferability

Natural language has general compositional characteristics that can be exploited in different tasks and inference LMs.
Transferring a retriever trained on one dataset and LM inferencer to others can be beneficial without further tuning.
Transferring a retriever trained on GPT-Neo to GPT2-XL and Codex can be beneficial.
Retrievers trained on NLI tasks can transfer to other NLI tasks, but not to other single-input tasks.

Analysis

Investigated effect of training data
Sampled candidates randomly and based on probability
Measured quality of training data with MRR
One-stage random retrieval degraded performance
Two-stage random sampling slightly outperformed k-DPP sampling
Number of candidates mostly affects generation tasks
Initialization and contrastive losses compared
Knowledge from single in-context example selection contributes to set-level selection
Pair-wise margin loss better for more difficult tasks
Compared two inference algorithms
Trade-off factor affects different tasks differently

Conclusion

In-context example selection is recast into an end-to-end optimization problem
CEIL leverages DPP to model the probability of the entire subset of in-context examples
CEIL is learned through a contrastive learning framework
Experiments conducted on 7 classification and generation tasks with 12 different benchmarks
CEIL outperforms previous competitive methods
CEIL exhibits transferability across LMs and datasets
CEIL shows compositionality for compositional tasks
Datasets include SST-5, MRPC, MNLI, QNLI, CMSQA, HellaSwag, WebQs, NL2Bash, Break, MTOP, SMCalFlow
Experiments conducted on Template, TMCD, Length splits
Results show absolute performance gain over EPR
Sampling strategies and number of candidates per instance in construction training data reported
Comparisons of different initializations and contrastive losses for CEIL
Inference algorithms compared on BERT, EPR and CEIL

Link to paper#

Abstract#

Paper Content#

Introduction#

Preliminary#

In-context learning#

Determinantal point processes#

Model#

Modeling#

Training#

Inference#

Experiments#

Datasets and evaluation#

Baselines#

Implementation details#

Main results#

Compositionality#

Transferability#

Analysis#

Related work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Preliminary

In-context learning

Determinantal point processes

Model

Modeling

Training

Inference

Experiments

Datasets and evaluation

Baselines

Implementation details

Main results

Compositionality

Transferability

Analysis

Related work

Conclusion