Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Dense retrieval is effective and efficient across tasks and languages.
- It is difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available.
- HyDE is proposed to pivot through Hypothetical Document Embeddings.
- HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
- HyDE shows strong performance comparable to fine-tuned retrievers across various tasks and languages.
Paper Content
Introduction
- Dense retrieval is a method of retrieving documents using semantic embedding similarities
- A variety of methods have been proposed to improve the effectiveness of supervised dense retrieval models
- Zero-shot dense retrieval still remains difficult
- MS-MARCO collection is a massive judged dataset with a large number of judged query-document pairs
- Aim to build effective fully zero-shot dense retrieval systems that require no relevance supervision
- Self-supervised representation learning methods are used
- Generative large language models (LLM) and text (chunk) encoders are used
- LLMs further trained to follow instructions can zero-shot generalize to diverse unseen instructions
- Retrieval task is cast into two NLU and NLG tasks
- HyDE using Instruct-GPT and Contriever outperforms previous state-of-the-art
Related works
- Researchers studied metric learning problems and introduced distillation
- Later works studied second stage pre-training of language model specifically for retrieval
- Popularity of dense retrieval partially attributed to successful research in very efficient minimum inner product search
- LLMs trained on data consisting of instructions and their execution can zero-shot generalize to perform new tasks
- Zero-shot dense retrieval tasks empirically defined by Thakur et al. (2021)
- Generative search uses neural generative models as search indices
- Generative retrieval uses standard MIPS index and requires no training or training data
Methodology
- Problem of zero-shot dense retrieval is formally defined
- HyDE is designed to solve the problem
Preliminaries
- Dense retrieval models similarity between query and document using inner product similarity.
- Zero-shot retrieval requires mapping functions for query and document without access to any relevance judgments.
- Difficulty of zero-shot dense retrieval lies in learning two embedding functions into same embedding space.
Hyde
- HyDE performs search in document-only embedding space to capture document-document similarity.
- Unsupervised contrastive learning is used to learn the document encoder.
- An instruction following language model is used to build the query vector.
- Generating documents replaces explicit modeling of relevance scores.
Experiments
Setup
- Implemented HyDE using InstructGPT and Contriever models
- Used OpenAI playground default temperature of 0.7 for open-ended generations
- Used English-only Contriever model for English retrieval tasks and multilingual mContriever for non-English tasks
- Conducted retrieval experiments with Pyserini toolkit
- Considered web search query sets TREC DL19 and DL20, BEIR dataset, and Mr.Tydi dataset
- Used different instructions for each dataset
- Compared HyDE to Contriever models, BM25, and several fine-tuned models
Web search
- HyDE brings improvements to Contriever on TREC DL19 and TREC DL20
- HyDE outperforms BM25 by large margins
- HyDE is competitive with fine-tuned models
- HyDE shows comparable map and ndcg@10 to Contriever FT on TREC DL19
- HyDE gets around 10% lower map and ndcg@10 than Contriever FT on DL20
- ANCE model shows better ndcg@10 than HyDE but lower recall
Low resource retrieval
- HyDE brings improvements to Contriever in terms of ndcg and recall
- HyDE outperformed by BM25 on one dataset, TREC-Covid
- HyDE shows better performance than ANCE and DPR
- Contriever FT shows performance advantages on FiQA and DBPedia
Multilingual retrieval
- Small-sized contrastive encoder gets saturated with more languages
- Generative LLM can get under-trained with lower resource languages
- HyDE improves mContriever model and outperforms non-Contriever models
- Margin between HyDE and mContriever FT due to under-training of non-English languages
Analysis
- HyDE uses a generative LLM and contrastive encoder as its backbone
- We studied the effect of changing the realizations of the LLM and encoder, such as using smaller language models and fine-tuning the encoder
- We conducted our studies on TREC DL19/20
Effect of different generative models
- HyDE uses other instruction-following language models.
- Larger models bring larger improvements to the unsupervised Contriever.
- Cohere model is still experimental and details are not disclosed.
Hyde with fine-tuned encoder
- HyDE is more powerful when few relevance labels are present
- Table 4 shows that less powerful instruction LMs can negatively impact the performance of the fine-tuned retriever
- InstructGPT model can bring up the performance, especially on DL19
- Generative model may capture certain factors not captured by the fine-tuned encoder
Conclusion
- HyDE model should be compared to other retrievers and re-rankers
- Relevance in HyDE is captured by an NLG model and language generation process
- HyDE can be as effective as dense retrievers that learn to model numerical relevance scores
- HyDE can be used at the beginning of a search system’s life and gradually replaced by a supervised dense retriever
- HyDE can be used for web search, low resource tasks, and other sophisticated tasks