pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction
- class pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, dim: int = 250000, language: str = 'english', stopwords: Iterable[str] | None = None)
Bases: SparseEmbeddingFunction
BM25 sparse embedding function powered by bm25s.
Tokenizes text via bm25s.tokenize() (lowercase, regex split, stopword removal, optional stemming with PyStemmer), hashes each token to a dimension index via MurmurHash3, and computes a BM25-style term frequency weight:
score = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))
This is the query-independent part of BM25 (no IDF), suitable for building a sparse vector index. The IDF component can be handled at search time by the database engine.
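The pipeline described above (tokenize, hash each token to an index, weight by term frequency) can be sketched in a few lines. This is an illustration only, not the library's implementation: the function name sketch_bm25_sparse is hypothetical, the tokenizer is a plain lowercase whitespace split rather than bm25s.tokenize(), CRC32 from the standard library stands in for MurmurHash3, and summing the counts of tokens that collide on one index is an assumption.

```python
import zlib
from collections import Counter


def sketch_bm25_sparse(text, k=1.2, b=0.75, avg_doc_length=256.0, dim=250_000):
    """Return {dimension index: BM25 term-frequency weight} for one text."""
    # Stand-in tokenizer: lowercase + whitespace split (no stopwords, no stemming).
    tokens = text.lower().split()
    doc_len = len(tokens)

    # Stand-in hash: CRC32 instead of MurmurHash3, reduced into [0, dim).
    # Assumption: tokens that collide on one index share a single count.
    tf_by_index = Counter(zlib.crc32(tok.encode("utf-8")) % dim for tok in tokens)

    # The length-normalization term is the same for every token of a given text.
    norm = k * (1 - b + b * doc_len / avg_doc_length)
    return {idx: tf * (k + 1) / (tf + norm) for idx, tf in tf_by_index.items()}


weights = sketch_bm25_sparse("sparse vectors index sparse text")
for idx, weight in sorted(weights.items()):
    print(idx, round(weight, 4))
```

Because no IDF factor appears, the weights depend only on the text itself and the fixed parameters, which is what allows them to be computed once at indexing time.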
- Parameters:
k – BM25 k1 parameter controlling term-frequency saturation. Default 1.2.
b – BM25 b parameter controlling document-length normalization. Default 0.75.
avg_doc_length – Assumed average document length in tokens. Default 256.0.
dim – Maximum number of sparse-vector dimensions. Hash values are reduced via hash % dim so every index falls in [0, dim). Must not exceed the database engine’s limit (seekdb supports up to 500 000). Default 250 000.
language – Language for stopwords and stemming. Default "english". Supported values depend on bm25s (e.g. "english", "german", "french", etc.) and PyStemmer for stemming.
stopwords – Custom stopword list. None uses the built-in stopword list selected by language.
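To get a feel for what k and b do, the score formula from the class description can be evaluated directly. The helper below, bm25_tf_weight, is a hypothetical name used only for this sketch; it reproduces the documented formula and nothing else.

```python
def bm25_tf_weight(tf, doc_len, k=1.2, b=0.75, avg_doc_length=256.0):
    """BM25 term-frequency weight from the documented formula (no IDF)."""
    return tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))


# Saturation: at average length the weight grows with tf but stays below k + 1.
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_tf_weight(tf, doc_len=256), 3))

# Length normalization: the same tf counts for less in a longer text...
short = bm25_tf_weight(2, doc_len=64)
long = bm25_tf_weight(2, doc_len=1024)
print(short > long)

# ...unless b = 0, which switches length normalization off.
flat_short = bm25_tf_weight(2, doc_len=64, b=0.0)
flat_long = bm25_tf_weight(2, doc_len=1024, b=0.0)
print(flat_short == flat_long)
```

A larger k lets repeated terms keep adding weight for longer before the curve flattens, while b closer to 1 penalizes texts that are longer than avg_doc_length more strongly.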
Example
>>> from pyseekdb.utils.embedding_functions import BM25SparseEmbeddingFunction
>>> ef = BM25SparseEmbeddingFunction(k=1.5, b=0.8)
>>> vectors = ef(["machine learning algorithms"])
>>> print(vectors[0])
SparseVector(3 non-zero entries)
- __init__(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, dim: int = 250000, language: str = 'english', stopwords: Iterable[str] | None = None) → None
Methods
__init__([k, b, avg_doc_length, dim, ...])
build_from_config(config) – Restore instance from configuration dictionary.
Get configuration dictionary (for persistence).
name() – Return unique name identifier (for registration and routing).
support_persistence(sparse_embedding_function) – Check if the sparse embedding function supports persistence.
- static build_from_config(config: dict[str, Any]) → BM25SparseEmbeddingFunction
Restore instance from configuration dictionary.
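The general idea of restoring an instance from a configuration dictionary can be sketched with a toy class. Everything here is hypothetical: ToySparseEF is not the pyseekdb class, to_config is an invented method name, and the config keys are assumptions, since this page does not document the dictionary's layout.

```python
from typing import Any


class ToySparseEF:
    """Hypothetical stand-in used only to show the round trip."""

    def __init__(self, k: float = 1.2, b: float = 0.75) -> None:
        self.k = k
        self.b = b

    def to_config(self) -> dict[str, Any]:
        # Hypothetical method name and keys: export the constructor arguments.
        return {"k": self.k, "b": self.b}

    @staticmethod
    def build_from_config(config: dict[str, Any]) -> "ToySparseEF":
        # Keys missing from the config fall back to the constructor defaults.
        return ToySparseEF(**config)


original = ToySparseEF(k=1.5, b=0.8)
restored = ToySparseEF.build_from_config(original.to_config())
print(restored.k, restored.b)
```

Making build_from_config static means a stored configuration can be turned back into an instance without constructing one first.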