pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction

class pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, token_max_length: int = 40, stopwords: Iterable[str] | None = None)

Bases: SparseEmbeddingFunction

BM25 sparse embedding function.

Tokenizes text (lowercase, remove punctuation, filter stopwords, stem), hashes each stemmed token to a dimension index via MurmurHash3, and computes a BM25-style term frequency weight:

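score = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))

Here tf is the number of times the token occurs in the document and doc_len is the document's length in tokens.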

This is the query-independent part of BM25 (no IDF), suitable for building a sparse vector index. The inverse document frequency component can be handled at search time by the database engine.
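The formula above can be checked by hand with a few lines of plain Python. This is a minimal sketch that only restates the documented formula; the helper name bm25_tf_weight is made up for illustration and is not part of pyseekdb.

```python
def bm25_tf_weight(tf, doc_len, k=1.2, b=0.75, avg_doc_length=256.0):
    """BM25 term-frequency weight without the IDF factor (illustrative helper)."""
    # Length normalization: equals 1 when doc_len == avg_doc_length.
    length_norm = 1.0 - b + b * doc_len / avg_doc_length
    return tf * (k + 1.0) / (tf + k * length_norm)


# For a document of exactly average length, a single occurrence gets weight 1.
print(bm25_tf_weight(tf=1, doc_len=256))
# Repeated terms gain weight, but the score saturates below k + 1.
print(bm25_tf_weight(tf=2, doc_len=256))
print(bm25_tf_weight(tf=1000, doc_len=256))
# Shorter documents score the same term higher than longer ones.
print(bm25_tf_weight(tf=1, doc_len=3), bm25_tf_weight(tf=1, doc_len=1000))
```

With the default parameters the weight never reaches k + 1 = 2.2, which is the saturation behavior that k controls; b = 0 would switch off the length normalization entirely.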

Parameters:
  • k – BM25 k1 parameter controlling term-frequency saturation. Default 1.2.

  • b – BM25 b parameter controlling document-length normalization. Default 0.75.

  • avg_doc_length – Assumed average document length in tokens. Default 256.0.

  • token_max_length – Maximum token length; longer tokens are dropped. Default 40.

  • stopwords – Custom stopword list. None uses built-in English stopwords.
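To show how these parameters interact, here is a rough sketch of the pipeline in plain Python. It is not the library's implementation, and several details are assumptions: the stopword set is a tiny made-up sample, stemming is omitted, zlib.crc32 stands in for MurmurHash3 (so the indices will not match the library's), the document length is counted after filtering, and hash collisions are resolved by keeping the larger weight.

```python
import re
import zlib
from collections import Counter

# Made-up sample; the library's built-in English stopword list is not shown here.
DEMO_STOPWORDS = frozenset({"a", "an", "and", "the", "of", "is", "to", "in"})


def sketch_sparse_embed(text, k=1.2, b=0.75, avg_doc_length=256.0,
                        token_max_length=40, stopwords=DEMO_STOPWORDS):
    """Return {dimension_index: weight} for one text (illustrative only)."""
    # Lowercase, then keep runs of ASCII letters and digits (drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stopwords and tokens longer than token_max_length. No stemming here.
    tokens = [t for t in tokens
              if t not in stopwords and len(t) <= token_max_length]
    doc_len = len(tokens)  # assumption: length is counted after filtering
    length_norm = 1.0 - b + b * doc_len / avg_doc_length
    weights = {}
    for token, tf in Counter(tokens).items():
        # Stand-in hash: crc32 instead of MurmurHash3.
        index = zlib.crc32(token.encode("utf-8"))
        weight = tf * (k + 1.0) / (tf + k * length_norm)
        # Assumption: on a collision, keep the larger weight.
        weights[index] = max(weight, weights.get(index, 0.0))
    return weights


vector = sketch_sparse_embed("The machine learning algorithms of the machine age.")
print(len(vector), "non-zero entries")
```

Because the sketch skips stemming and uses a different hash, it is only useful for reasoning about the effect of k, b, avg_doc_length, token_max_length and stopwords, not for reproducing the library's vectors.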

Example

>>> from pyseekdb.utils.embedding_functions import BM25SparseEmbeddingFunction
>>> ef = BM25SparseEmbeddingFunction(k=1.5, b=0.8)
>>> vectors = ef(["machine learning algorithms"])
>>> print(vectors[0])
SparseVector(3 non-zero entries)

__init__(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, token_max_length: int = 40, stopwords: Iterable[str] | None = None) → None

Methods

__init__([k, b, avg_doc_length, ...])

build_from_config(config)

Restore instance from configuration dictionary.

embed_query(documents)

Alias for document embedding: BM25 uses the same encoding for documents and queries.

get_config()

Get configuration dictionary (for persistence).

name()

Return unique name identifier (for registration and routing).

support_persistence(sparse_embedding_function)

Check if the sparse embedding function supports persistence.

static build_from_config(config: dict[str, Any]) → BM25SparseEmbeddingFunction

Restore instance from configuration dictionary.

embed_query(documents: str | list[str]) → list[SparseVector]

Alias for document embedding: BM25 uses the same encoding for documents and queries.

get_config() → dict[str, Any]

Get configuration dictionary (for persistence).

Returns:

Configuration dictionary. It should not include a ‘name’ field, which is handled by the upper layer.

static name() → str

Return unique name identifier (for registration and routing).
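
The way name(), get_config() and build_from_config() fit together can be illustrated with a toy round trip. Everything in this sketch is hypothetical: ToySparseFunction, the "toy_bm25" identifier, REGISTRY, save and load are stand-ins written for this example, and the actual registration and persistence layer in pyseekdb may work differently. The only points taken from the documentation above are that get_config() leaves out the ‘name’ field and that build_from_config() is a static method that restores an instance from a configuration dictionary.

```python
from typing import Any


class ToySparseFunction:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self, k: float = 1.2, b: float = 0.75):
        self.k = k
        self.b = b

    @staticmethod
    def name() -> str:
        return "toy_bm25"  # made-up identifier

    def get_config(self) -> dict[str, Any]:
        # No 'name' key here; the caller (the "upper layer") adds it.
        return {"k": self.k, "b": self.b}

    @staticmethod
    def build_from_config(config: dict[str, Any]) -> "ToySparseFunction":
        return ToySparseFunction(**config)


# Hypothetical upper layer: a registry keyed by name().
REGISTRY = {ToySparseFunction.name(): ToySparseFunction}


def save(fn) -> dict[str, Any]:
    """Combine the routing name with the function's own configuration."""
    return {"name": fn.name(), "config": fn.get_config()}


def load(record: dict[str, Any]):
    """Look up the class by name and rebuild the instance from its config."""
    return REGISTRY[record["name"]].build_from_config(record["config"])


restored = load(save(ToySparseFunction(k=1.5, b=0.8)))
print(restored.k, restored.b)
```

Keeping the name out of the configuration means the class only describes its own parameters, while the layer that stores the record decides how to route it back to the right class.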