Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Dense retrieval is effective and efficient across tasks and languages.
It is difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available.
HyDE is proposed to pivot through Hypothetical Document Embeddings.
HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
HyDE shows strong performance comparable to fine-tuned retrievers across various tasks and languages.

Paper Content

Introduction

Dense retrieval is a method of retrieving documents using semantic embedding similarities
A variety of methods have been proposed to improve the effectiveness of supervised dense retrieval models
Zero-shot dense retrieval still remains difficult
MS-MARCO collection is a massive judged dataset with a large number of judged query-document pairs
Aim to build effective fully zero-shot dense retrieval systems that require no relevance supervision
Self-supervised representation learning methods are used
Generative large language models (LLM) and text (chunk) encoders are used
LLMs further trained to follow instructions can zero-shot generalize to diverse unseen instructions
Retrieval task is cast into two NLU and NLG tasks
HyDE using Instruct-GPT and Contriever outperforms previous state-of-the-art

Researchers studied metric learning problems and introduced distillation
Later works studied second stage pre-training of language model specifically for retrieval
Popularity of dense retrieval partially attributed to successful research in very efficient minimum inner product search
LLMs trained on data consisting of instructions and their execution can zero-shot generalize to perform new tasks
Zero-shot dense retrieval tasks empirically defined by Thakur et al. (2021)
Generative search uses neural generative models as search indices
Generative retrieval uses standard MIPS index and requires no training or training data

Methodology

Problem of zero-shot dense retrieval is formally defined
HyDE is designed to solve the problem

Preliminaries

Dense retrieval models similarity between query and document using inner product similarity.
Zero-shot retrieval requires mapping functions for query and document without access to any relevance judgments.
Difficulty of zero-shot dense retrieval lies in learning two embedding functions into same embedding space.

Hyde

HyDE performs search in document-only embedding space to capture document-document similarity.
Unsupervised contrastive learning is used to learn the document encoder.
An instruction following language model is used to build the query vector.
Generating documents replaces explicit modeling of relevance scores.

Experiments

Setup

Implemented HyDE using InstructGPT and Contriever models
Used OpenAI playground default temperature of 0.7 for open-ended generations
Used English-only Contriever model for English retrieval tasks and multilingual mContriever for non-English tasks
Conducted retrieval experiments with Pyserini toolkit
Considered web search query sets TREC DL19 and DL20, BEIR dataset, and Mr.Tydi dataset
Used different instructions for each dataset
Compared HyDE to Contriever models, BM25, and several fine-tuned models

Web search

HyDE brings improvements to Contriever on TREC DL19 and TREC DL20
HyDE outperforms BM25 by large margins
HyDE is competitive with fine-tuned models
HyDE shows comparable map and ndcg@10 to Contriever FT on TREC DL19
HyDE gets around 10% lower map and ndcg@10 than Contriever FT on DL20
ANCE model shows better ndcg@10 than HyDE but lower recall

Low resource retrieval

HyDE brings improvements to Contriever in terms of ndcg and recall
HyDE outperformed by BM25 on one dataset, TREC-Covid
HyDE shows better performance than ANCE and DPR
Contriever FT shows performance advantages on FiQA and DBPedia

Multilingual retrieval

Small-sized contrastive encoder gets saturated with more languages
Generative LLM can get under-trained with lower resource languages
HyDE improves mContriever model and outperforms non-Contriever models
Margin between HyDE and mContriever FT due to under-training of non-English languages

Analysis

HyDE uses a generative LLM and contrastive encoder as its backbone
We studied the effect of changing the realizations of the LLM and encoder, such as using smaller language models and fine-tuning the encoder
We conducted our studies on TREC DL19/20

Effect of different generative models

HyDE uses other instruction-following language models.
Larger models bring larger improvements to the unsupervised Contriever.
Cohere model is still experimental and details are not disclosed.

Hyde with fine-tuned encoder

HyDE is more powerful when few relevance labels are present
Table 4 shows that less powerful instruction LMs can negatively impact the performance of the fine-tuned retriever
InstructGPT model can bring up the performance, especially on DL19
Generative model may capture certain factors not captured by the fine-tuned encoder

Conclusion

HyDE model should be compared to other retrievers and re-rankers
Relevance in HyDE is captured by an NLG model and language generation process
HyDE can be as effective as dense retrievers that learn to model numerical relevance scores
HyDE can be used at the beginning of a search system’s life and gradually replaced by a supervised dense retriever
HyDE can be used for web search, low resource tasks, and other sophisticated tasks

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Methodology#

Preliminaries#

Hyde#

Experiments#

Setup#

Web search#

Low resource retrieval#

Multilingual retrieval#

Analysis#

Effect of different generative models#

Hyde with fine-tuned encoder#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Methodology

Preliminaries

Hyde

Experiments

Setup

Web search

Low resource retrieval

Multilingual retrieval

Analysis

Effect of different generative models

Hyde with fine-tuned encoder

Conclusion