pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction
- class pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, token_max_length: int = 40, stopwords: Iterable[str] | None = None)
Bases: SparseEmbeddingFunction
BM25 sparse embedding function.
Tokenizes text (lowercase, remove punctuation, filter stopwords, stem), hashes each stemmed token to a dimension index via MurmurHash3, and computes a BM25-style term frequency weight:
score = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))
This is the query-independent part of BM25 (no IDF), suitable for building a sparse vector index. The inverse document frequency component can be handled at search time by the database engine.
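The weight formula above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the function name `bm25_tf_weight` is hypothetical, and only the parameter names and defaults are taken from the class signature.

```python
def bm25_tf_weight(
    tf: float,
    doc_len: int,
    k: float = 1.2,
    b: float = 0.75,
    avg_doc_length: float = 256.0,
) -> float:
    """BM25 term-frequency weight without the IDF factor (hypothetical helper)."""
    # Length normalization: 1.0 when doc_len equals avg_doc_length,
    # below 1.0 for shorter documents, above 1.0 for longer ones.
    norm = 1.0 - b + b * doc_len / avg_doc_length
    # Saturating term-frequency weight; it approaches k + 1 as tf grows.
    return tf * (k + 1.0) / (tf + k * norm)


if __name__ == "__main__":
    # A term seen once in an average-length document
    print(bm25_tf_weight(tf=1, doc_len=256))
    # The same term in a document half as long gets a higher weight
    print(bm25_tf_weight(tf=1, doc_len=128))
```

With the default parameters the weight never exceeds k + 1 = 2.2, which is the saturation behavior that k controls.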
- Parameters:
k – BM25 k1 parameter controlling term-frequency saturation. Default 1.2.
b – BM25 b parameter controlling document-length normalization. Default 0.75.
avg_doc_length – Assumed average document length in tokens. Default 256.0.
token_max_length – Maximum token length; longer tokens are dropped. Default 40.
stopwords – Custom stopword list. None uses the built-in English stopwords.
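To show how these parameters interact, here is a rough sketch of the encoding pipeline described above. It makes several assumptions that are not confirmed by this page: `zlib.crc32` stands in for MurmurHash3 so the example needs only the standard library, stemming is skipped, `DEMO_STOPWORDS` is a tiny placeholder rather than the built-in English list, document length is counted after filtering, and colliding indices are summed. The names `sketch_sparse_encode` and `num_dims` are hypothetical.

```python
import re
import zlib
from collections import Counter

# Placeholder stopword set (assumption); the library's built-in list is not reproduced here.
DEMO_STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def sketch_sparse_encode(
    text,
    k=1.2,
    b=0.75,
    avg_doc_length=256.0,
    token_max_length=40,
    stopwords=None,
    num_dims=2**31 - 1,  # hypothetical index space size
):
    """Return {dimension index: weight} for one text (illustrative only)."""
    stop = DEMO_STOPWORDS if stopwords is None else frozenset(stopwords)
    # Lowercase and split on anything that is not a letter or digit,
    # which also discards punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stopwords and tokens longer than token_max_length.
    tokens = [t for t in tokens if t not in stop and len(t) <= token_max_length]
    doc_len = len(tokens)  # assumption: length is measured after filtering
    norm = 1.0 - b + b * doc_len / avg_doc_length
    weights = {}
    for token, tf in Counter(tokens).items():
        # Stand-in hash (CRC32, not MurmurHash3) mapped into the index space.
        index = zlib.crc32(token.encode("utf-8")) % num_dims
        weight = tf * (k + 1.0) / (tf + k * norm)
        # Assumption: weights of tokens that collide on an index are summed.
        weights[index] = weights.get(index, 0.0) + weight
    return weights


if __name__ == "__main__":
    print(sketch_sparse_encode("The machine learning algorithms"))
```

Because a stand-in hash is used, the indices produced here will not match those of the real class; only the shape of the computation is meant to carry over.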
Example
>>> ef = BM25SparseEmbeddingFunction(k=1.5, b=0.8)
>>> vectors = ef(["machine learning algorithms"])
>>> print(vectors[0])
SparseVector(3 non-zero entries)
- __init__(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, token_max_length: int = 40, stopwords: Iterable[str] | None = None) -> None
Methods
__init__([k, b, avg_doc_length, ...])
build_from_config(config) – Restore instance from configuration dictionary.
embed_query(documents) – Alias: BM25 uses the same encoding for documents and queries.
Get configuration dictionary (for persistence).
name() – Return unique name identifier (for registration and routing).
support_persistence(sparse_embedding_function) – Check if the sparse embedding function supports persistence.
- static build_from_config(config: dict[str, Any]) -> BM25SparseEmbeddingFunction
Restore instance from configuration dictionary.
- embed_query(documents: str | list[str]) -> list[SparseVector]
Alias: BM25 uses the same encoding for documents and queries.