Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability.
- Previous selection methods for ICL are based on simple heuristics.
- CEIL (Compositional Exemplars for In-context Learning) is proposed to model the interaction between the given input and in-context examples.
- CEIL is validated on 12 classification and generation datasets from 7 distinct NLP tasks.
- Experiments demonstrate the state-of-the-art performance, transferability and compositionality of CEIL.
Paper Content
Introduction
- Artificial intelligence aims to develop models that can generalize to unseen tasks.
- In-context learning (ICL) is a major step towards this goal.
- ICL is sensitive to the selection of in-context examples.
- CEIL (Compositional Exemplars for In-context Learning) models the joint probability of the entire in-context example set.
Preliminary
In-context learning
- In-context learning (ICL) can be used to perform complex tasks
- Growing concerns about ICL instability
- Researchers have proposed learning-free and learning-based methods to mitigate instability
- Learning-free methods require human experts and lead to sub-optimal performance
- Learning-based methods have been proposed to push the envelope further
Determinantal point processes
- DPPs measure both diversity and quality of items in a subset
- DPPs have been applied to document and video summarization, recommendation systems, object detection, and multi-label classification
- DPPs have been employed in in-context learning for compositional tasks
- DPPs are framed into an end-to-end framework to capture interaction between in-context examples and reflect preference of LMs
Model
- Introduces CEIL, a framework to learn composition of exemplars for in-context learning
- Models joint probability of in-context example sets with a learnable conditional DPP and trained with contrastive learning
Modeling
- In-context learning requires relevance and diversity.
- A new kernel is defined to incorporate relevance and diversity.
- A trade-off parameter is used to balance the magnitude of diversity and relevance.
- Two embedders are used to encode input text and in-context examples.
Training
- Introduce a contrastive learning framework to rectify the embedding of each in-context example and training instance
- Training data consists of N instances, each containing one input instance and M in-context example subsets
- Employ a two-stage framework to obtain a set of relevant examples
- Use inference LM as scoring function to measure quality of each subset
- Employ a fine-grained pair-wise margin loss to determine which subset is preferable
Inference
- MAP inference of a DPP is NP-hard
- Candidate space is narrowed down with KNN retriever
- Greedy algorithm with O(nK2) complexity is used
- Cholesky decomposition reduces complexity to O(nK)
Experiments
- Conducted extensive experiments over 12 datasets
- Spanning 7 distinct tasks
- Showed better approach to in-context learning
Datasets and evaluation
- Datasets and tasks are listed in Table 1
- Different task formulations allow for extensive evaluations
- Prompts and examples are in Appendix A.1
- Accuracy reported for classification tasks
- Exact Match and LF-EM reported for generation tasks
- Results reported on validation set
Baselines
- RANDOM: Retriever randomly selects in-context examples from training set without repetition
- TOPK-BM25: Classical sparse retrieval method based on TF-IDF
- TOPK-BERT: Dense retriever based on BERT embeddings
- DPP-BERT: DPP retriever uses original BERT embedding without fine-tuning
- EPR: Learning-based dense retriever trained to
Implementation details
- GPT-Neo and GPT2-XL are used as language models
- Exemplars are sorted based on similarity to input text
- Classification tasks are reframed into multiple choice
- Greedy decoding is used for answer generation
- Training data is limited to 44,000 instances
- Adam optimizer is used with batch size 128 and learning rate 1e-5
- Trade-off factor ฮป is searched in {0.01, 0.05, 0.1}
- BERT-based encoder is used to encode examples into embeddings
Main results
- Experiments conducted on 12 datasets across 7 tasks
- Generation tasks benefit more from better in-context examples than classification tasks
- CEIL outperforms learning-free baselines and is especially effective on NLI tasks
- CEIL outperforms EPR on all tasks
- CEIL introduces no additional parameters compared to EPR and TOPK-BERT
Compositionality
- CEIL has superior performance in predicting answers
- CEIL learns to compose exemplars to help in predicting answers
- CEIL is evaluated on two datasets and the results are shown in Table 3
- CEIL improves performance on difficult splits on the two datasets
- CEIL is an alternative approach to generating compositional programs without tuning inference LM or test-time question decomposition
Transferability
- Natural language has general compositional characteristics that can be exploited in different tasks and inference LMs.
- Transferring a retriever trained on one dataset and LM inferencer to others can be beneficial without further tuning.
- Transferring a retriever trained on GPT-Neo to GPT2-XL and Codex can be beneficial.
- Retrievers trained on NLI tasks can transfer to other NLI tasks, but not to other single-input tasks.
Analysis
- Investigated effect of training data
- Sampled candidates randomly and based on probability
- Measured quality of training data with MRR
- One-stage random retrieval degraded performance
- Two-stage random sampling slightly outperformed k-DPP sampling
- Number of candidates mostly affects generation tasks
- Initialization and contrastive losses compared
- Knowledge from single in-context example selection contributes to set-level selection
- Pair-wise margin loss better for more difficult tasks
- Compared two inference algorithms
- Trade-off factor affects different tasks differently
Related work
Conclusion
- In-context example selection is recast into an end-to-end optimization problem
- CEIL leverages DPP to model the probability of the entire subset of in-context examples
- CEIL is learned through a contrastive learning framework
- Experiments conducted on 7 classification and generation tasks with 12 different benchmarks
- CEIL outperforms previous competitive methods
- CEIL exhibits transferability across LMs and datasets
- CEIL shows compositionality for compositional tasks
- Datasets include SST-5, MRPC, MNLI, QNLI, CMSQA, HellaSwag, WebQs, NL2Bash, Break, MTOP, SMCalFlow
- Experiments conducted on Template, TMCD, Length splits
- Results show absolute performance gain over EPR
- Sampling strategies and number of candidates per instance in construction training data reported
- Comparisons of different initializations and contrastive losses for CEIL
- Inference algorithms compared on BERT, EPR and CEIL