BM25S Index

This is a BM25S index created with the bm25s library (version 0.2.0), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

Installation

You can install the bm25s library with pip:

pip install "bm25s==0.2.0"

# For huggingface hub usage
pip install huggingface_hub

Loading a `bm25s` index

You can use this index for information retrieval tasks. Here is an example:

import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency")

# Retrieve the top 3 documents for a query
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)
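
The value returned by `retrieve` is a pair of 2D arrays, `documents` and `scores`, each shaped (number of queries, k). When the corpus is not loaded, `documents` holds integer document indices. The sketch below shows one way to turn such a result into readable hits. `format_hits` is a hypothetical helper, not part of bm25s, and the index and score values are made-up stand-ins for `results.documents` and `results.scores`.

```python
def format_hits(doc_ids, scores, corpus):
    """Pair each retrieved document index with its text and score."""
    hits = []
    for query_ids, query_scores in zip(doc_ids, scores):
        hits.append([
            {"rank": rank, "text": corpus[int(i)], "score": float(s)}
            for rank, (i, s) in enumerate(zip(query_ids, query_scores), start=1)
        ])
    return hits


# Hypothetical values standing in for `results.documents` and `results.scores`
corpus = ["northwest bank", "futures market", "market place"]
doc_ids = [[1, 2, 0]]          # one query, k=3
scores = [[0.58, 0.19, 0.0]]   # illustrative scores only

hits = format_hits(doc_ids, scores, corpus)
for hit in hits[0]:
    print(hit["rank"], hit["text"], round(hit["score"], 2))
```

If the index is loaded with `load_corpus=True`, the documents are returned directly and this lookup step is not needed.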

Saving a `bm25s` index

You can save a bm25s index to the Hugging Face Hub. Here is an example:

import bm25s
from bm25s.hf import BM25HF

corpus = [
    "northwest bank",
    "misfits market",
    "merrick bank login",
    "marketing",
    "market place",
    "jetblue customer service",
    "internal revenue service",
    "how to make money online",
    "gordon food service",
    "futures market",
    "frontier airlines customer service",
    "food banks near me",
    "first convenience bank",
    "eastern bank",
    "dollar bank",
]

retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)

Advanced usage

You can pass additional options to load_from_hub to use more advanced features of the BM25S library:

from bm25s.hf import BM25HF

# Load the corpus and memory-map the index (mmap=True) to reduce memory usage
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", load_corpus=True, mmap=True)

# Load a different branch/revision
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", revision="main")

# Change directory where the local files should be downloaded
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", local_dir="/path/to/dir")

# Load a private repository with a token
token = None  # Replace with your Hugging Face access token
retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)

Stats

This index was created using the following data: 497 Cryptocurrency keywords (Semrush)

Statistic	Value
Number of documents	602959
Number of tokens	2414020
Average tokens per document	4.0
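
The average in the table is the token count divided by the document count, rounded to one decimal place. A quick check using the two figures above:

```python
num_documents = 602959
num_tokens = 2414020

# Average tokens per document, before rounding
avg_tokens = num_tokens / num_documents
print(round(avg_tokens, 1))  # prints 4.0
```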

Parameters

The index was created with the following parameters:

Parameter	Value
k1	`1.5`
b	`0.75`
delta	`0.5`
method	`lucene`
idf method	`lucene`
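
To show what `k1` and `b` control, here is a minimal pure-Python sketch of the Lucene-style BM25 formula as it is commonly stated: an IDF of ln(1 + (N - df + 0.5) / (df + 0.5)) and a term-frequency component of tf / (tf + k1 * (1 - b + b * dl / avgdl)). The function names and the three-document corpus are illustrative. It is an assumption that bm25s produces identical values, since its tokenizer and sparse scoring are not reproduced here. `delta` is left out because, as far as this sketch assumes, it only matters for the BM25L and BM25+ variants.

```python
import math


def lucene_idf(n_docs, doc_freq):
    """Lucene-style IDF; never negative, even for very common terms."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs
    # Length normalization: b=0 disables it, b=1 applies it fully
    norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for term in set(query_tokens):
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus_tokens if term in d)
        # k1 controls how quickly repeated terms stop adding to the score
        score += lucene_idf(n_docs, df) * tf / (tf + norm)
    return score


# Illustrative corpus, whitespace-tokenized for simplicity
corpus = [doc.split() for doc in ["futures market", "market place", "dollar bank"]]
query = ["futures", "market"]

scores = [lucene_bm25_score(query, doc, corpus) for doc in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document matching both query terms ranks first, the one matching only "market" ranks second, and the document with no matching term scores zero.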

Citation

To cite bm25s, please use the following BibTeX entry:

@misc{lu_2024_bm25s,
      title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring}, 
      author={Xing Han Lù},
      year={2024},
      eprint={2407.03618},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.03618}, 
}