arxiv:2501.11628

Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets

Published on Jan 20

Abstract

Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and their inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents, such as MS MARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk at larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic with graph-based solutions adapted from dense retrieval. We extensively evaluate SPLADE embeddings of 138M passages from MS MARCO v2 and report indexing time along with other efficiency and effectiveness metrics.
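The abstract does not describe any specific algorithm, but the exact baseline that approximate methods like Seismic aim to beat can be sketched as follows: score documents by a sparse dot product over an inverted index, then take the top-k. The documents, weights, and query below are invented for illustration; real weights would come from a learned sparse encoder such as SPLADE.

```python
import heapq
from collections import defaultdict

# Toy sparse document vectors (token -> weight). In practice these would
# be produced by a learned sparse encoder such as SPLADE.
docs = {
    "d1": {"neural": 1.2, "retrieval": 0.8},
    "d2": {"sparse": 1.5, "retrieval": 1.1, "index": 0.4},
    "d3": {"dense": 0.9, "embedding": 0.7},
}

# Inverted index: token -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for token, weight in vec.items():
        index[token].append((doc_id, weight))

def topk(query, k=2):
    """Exact top-k retrieval by sparse dot product over the inverted index."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

query = {"sparse": 1.0, "retrieval": 0.5}
print(topk(query))  # [('d2', 2.05), ('d1', 0.4)]
```

Approximate algorithms speed this up by pruning the posting-list traversal (e.g., skipping low-weight postings or clustering postings into blocks), trading a small amount of recall for large latency gains; evaluating that trade-off at the 138M-passage scale is the subject of the paper.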
