Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Doc2Query is a technique used to improve the first-stage retrieval effectiveness of search engines.
Sequence-to-sequence models are known to “hallucinate” content that is not present in the source text.
This work explores techniques for filtering out these harmful queries prior to indexing.
Using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query.
Code, data, and a live demonstration are available at https://github.com/terrierteam/pyterrier_doc2query.

Neural network models can improve search effectiveness
Some approaches focus on re-ranking document sets, others on improving the first stage
Doc2Query is a first-stage approach that uses a sequence-to-sequence model to generate queries
This approach has been shown to be effective, but can generate content that does not reflect the input text
This paper proposes a filtering phase to remove poor queries prior to indexing

The classical lexical mismatch problem is a key issue in information retrieval.
Various approaches have been used to address this problem, including query reformulation, query expansion models, and document expansion.
Recently, transformer-based language models have been used to represent text in embedding spaces.
Doc2Query is a sequence-to-sequence model that maps a document to queries that it might be able to answer.
Doc2Query has a generation phase and a filtering phase.
The relevance threshold is determined by the distribution of relevance scores across all expansion queries.

Doc2Query is a new approach for improving the effectiveness and efficiency of document expansion
16% improvement in retrieval effectiveness can be achieved with Doc2Query
Doc2Query reduces index size by 48% and mean query execution time by 30%
Relevance filtering could potentially apply to other approaches such as generating alternative forms of queries, training data, or natural language responses