Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability.
  • Previous selection methods for ICL are based on simple heuristics.
  • CEIL (Compositional Exemplars for In-context Learning) is proposed to model the interaction between the given input and in-context examples.
  • CEIL is validated on 12 classification and generation datasets from 7 distinct NLP tasks.
  • Experiments demonstrate the state-of-the-art performance, transferability and compositionality of CEIL.

Paper Content

Introduction

  • Artificial intelligence aims to develop models that can generalize to unseen tasks.
  • In-context learning (ICL) is a major step towards this goal.
  • ICL is sensitive to the selection of in-context examples.
  • CEIL (Compositional Exemplars for In-context Learning) models the joint probability of the entire in-context example set.

Preliminary

In-context learning

  • In-context learning (ICL) can be used to perform complex tasks
  • Growing concerns about ICL instability
  • Researchers have proposed learning-free and learning-based methods to mitigate instability
  • Learning-free methods require human experts and lead to sub-optimal performance
  • Learning-based methods have been proposed to push the envelope further

Determinantal point processes

  • DPPs measure both diversity and quality of items in a subset
  • DPPs have been applied to document and video summarization, recommendation systems, object detection, and multi-label classification
  • DPPs have been employed in in-context learning for compositional tasks
  • DPPs are framed into an end-to-end framework to capture interaction between in-context examples and reflect preference of LMs

Model

  • Introduces CEIL, a framework to learn composition of exemplars for in-context learning
  • Models joint probability of in-context example sets with a learnable conditional DPP and trained with contrastive learning

Modeling

  • In-context learning requires relevance and diversity.
  • A new kernel is defined to incorporate relevance and diversity.
  • A trade-off parameter is used to balance the magnitude of diversity and relevance.
  • Two embedders are used to encode input text and in-context examples.

Training

  • Introduce a contrastive learning framework to rectify the embedding of each in-context example and training instance
  • Training data consists of N instances, each containing one input instance and M in-context example subsets
  • Employ a two-stage framework to obtain a set of relevant examples
  • Use inference LM as scoring function to measure quality of each subset
  • Employ a fine-grained pair-wise margin loss to determine which subset is preferable

Inference

  • MAP inference of a DPP is NP-hard
  • Candidate space is narrowed down with KNN retriever
  • Greedy algorithm with O(nK2) complexity is used
  • Cholesky decomposition reduces complexity to O(nK)

Experiments

  • Conducted extensive experiments over 12 datasets
  • Spanning 7 distinct tasks
  • Showed better approach to in-context learning

Datasets and evaluation

  • Datasets and tasks are listed in Table 1
  • Different task formulations allow for extensive evaluations
  • Prompts and examples are in Appendix A.1
  • Accuracy reported for classification tasks
  • Exact Match and LF-EM reported for generation tasks
  • Results reported on validation set

Baselines

  • RANDOM: Retriever randomly selects in-context examples from training set without repetition
  • TOPK-BM25: Classical sparse retrieval method based on TF-IDF
  • TOPK-BERT: Dense retriever based on BERT embeddings
  • DPP-BERT: DPP retriever uses original BERT embedding without fine-tuning
  • EPR: Learning-based dense retriever trained to

Implementation details

  • GPT-Neo and GPT2-XL are used as language models
  • Exemplars are sorted based on similarity to input text
  • Classification tasks are reframed into multiple choice
  • Greedy decoding is used for answer generation
  • Training data is limited to 44,000 instances
  • Adam optimizer is used with batch size 128 and learning rate 1e-5
  • Trade-off factor ฮป is searched in {0.01, 0.05, 0.1}
  • BERT-based encoder is used to encode examples into embeddings

Main results

  • Experiments conducted on 12 datasets across 7 tasks
  • Generation tasks benefit more from better in-context examples than classification tasks
  • CEIL outperforms learning-free baselines and is especially effective on NLI tasks
  • CEIL outperforms EPR on all tasks
  • CEIL introduces no additional parameters compared to EPR and TOPK-BERT

Compositionality

  • CEIL has superior performance in predicting answers
  • CEIL learns to compose exemplars to help in predicting answers
  • CEIL is evaluated on two datasets and the results are shown in Table 3
  • CEIL improves performance on difficult splits on the two datasets
  • CEIL is an alternative approach to generating compositional programs without tuning inference LM or test-time question decomposition

Transferability

  • Natural language has general compositional characteristics that can be exploited in different tasks and inference LMs.
  • Transferring a retriever trained on one dataset and LM inferencer to others can be beneficial without further tuning.
  • Transferring a retriever trained on GPT-Neo to GPT2-XL and Codex can be beneficial.
  • Retrievers trained on NLI tasks can transfer to other NLI tasks, but not to other single-input tasks.

Analysis

  • Investigated effect of training data
  • Sampled candidates randomly and based on probability
  • Measured quality of training data with MRR
  • One-stage random retrieval degraded performance
  • Two-stage random sampling slightly outperformed k-DPP sampling
  • Number of candidates mostly affects generation tasks
  • Initialization and contrastive losses compared
  • Knowledge from single in-context example selection contributes to set-level selection
  • Pair-wise margin loss better for more difficult tasks
  • Compared two inference algorithms
  • Trade-off factor affects different tasks differently

Conclusion

  • In-context example selection is recast into an end-to-end optimization problem
  • CEIL leverages DPP to model the probability of the entire subset of in-context examples
  • CEIL is learned through a contrastive learning framework
  • Experiments conducted on 7 classification and generation tasks with 12 different benchmarks
  • CEIL outperforms previous competitive methods
  • CEIL exhibits transferability across LMs and datasets
  • CEIL shows compositionality for compositional tasks
  • Datasets include SST-5, MRPC, MNLI, QNLI, CMSQA, HellaSwag, WebQs, NL2Bash, Break, MTOP, SMCalFlow
  • Experiments conducted on Template, TMCD, Length splits
  • Results show absolute performance gain over EPR
  • Sampling strategies and number of candidates per instance in construction training data reported
  • Comparisons of different initializations and contrastive losses for CEIL
  • Inference algorithms compared on BERT, EPR and CEIL