pyseekdb.utils.embedding_functions

Embedding function implementations for pyseekdb.

This module provides various embedding function implementations that can be used with pyseekdb collections.

Classes

AmazonBedrockEmbeddingFunction(session[, ...])

A convenient embedding function for Amazon Bedrock embedding models using boto3.

BM25SparseEmbeddingFunction([k, b, ...])

BM25 sparse embedding function.

CohereEmbeddingFunction([model_name, ...])

A convenient embedding function for Cohere embedding models using LiteLLM.

GoogleVertexEmbeddingFunction([model_name, ...])

A convenient embedding function for Google Vertex AI embedding models.

HuggingFaceSparseEmbeddingFunction([...])

Sparse embedding function powered by HuggingFace SparseEncoder models.

JinaEmbeddingFunction([model_name, api_key_env])

A convenient embedding function for Jina AI embedding models.

LiteLLMBaseEmbeddingFunction(model_name[, ...])

A custom embedding function using LiteLLM to access various embedding models.

MistralEmbeddingFunction([model_name, ...])

A convenient embedding function for Mistral text embedding models.

MorphEmbeddingFunction(model_name[, ...])

A convenient embedding function for Morph embedding models.

OllamaEmbeddingFunction([model_name, ...])

A convenient embedding function for Ollama embedding models.

OnnxEmbeddingFunction(model_name, ...[, ...])

Generic ONNX runtime embedding function.

OpenAIBaseEmbeddingFunction(model_name[, ...])

Base embedding function for OpenAI-compatible embedding APIs.

OpenAIEmbeddingFunction([model_name, ...])

A convenient embedding function for OpenAI embedding models.

QwenEmbeddingFunction(model_name[, ...])

A convenient embedding function for Qwen (Alibaba Cloud) embedding models.

SentenceTransformerEmbeddingFunction([...])

An embedding function using sentence-transformers with a specific model.

SiliconflowEmbeddingFunction([model_name, ...])

A convenient embedding function for SiliconFlow embedding models.

TencentHunyuanEmbeddingFunction([...])

A convenient embedding function for Tencent Hunyuan embedding models.

Text2VecEmbeddingFunction([model_name, ...])

An embedding function using text2vec with a specific model.

VoyageaiEmbeddingFunction([model_name, ...])

A convenient embedding function for Voyage AI embedding models.

class pyseekdb.utils.embedding_functions.AmazonBedrockEmbeddingFunction(session: Any, model_name: str = 'amazon.titan-embed-text-v2', **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

A convenient embedding function for Amazon Bedrock embedding models using boto3.

For more information about Amazon Bedrock models, see https://docs.aws.amazon.com/bedrock/

This embedding function runs remotely on Amazon Bedrock’s servers, and requires AWS credentials configured via boto3.

Example

pip install pyseekdb boto3

static build_from_config(config: dict[str, Any]) AmazonBedrockEmbeddingFunction[source]

Build an AmazonBedrockEmbeddingFunction from its configuration dictionary.

Parameters:

config – Dictionary containing the embedding function’s configuration. Note: AWS credentials are NOT stored in config for security reasons. Credentials should be provided via environment variables, IAM roles, or passed as additional parameters.

Returns:

Restored AmazonBedrockEmbeddingFunction instance

Raises:

ValueError – If the configuration is invalid or missing required fields

property dimension: int

Get the dimension of embeddings produced by this function.

Returns the dimension without making an API call when the model is in the known dimensions list.

If the model is not in the known dimensions list, falls back to making an API call and inferring the dimension from the returned embedding.

Returns:

The dimension of embeddings for this model.

Return type:

int
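The lookup-then-fallback logic described above can be sketched as follows. This is a minimal illustration, not pyseekdb's implementation: KNOWN_DIMENSIONS, its 1024 entry, and embed_one_text (a stand-in for a real Bedrock API call) are all assumptions made for the sketch.

```python
# Illustrative stand-ins, not pyseekdb internals.
KNOWN_DIMENSIONS = {
    "amazon.titan-embed-text-v2": 1024,  # assumed value for illustration
}

def embed_one_text(model_name: str) -> list[float]:
    # Stand-in for a real embedding API call; returns a dummy vector.
    return [0.0] * 1536

def resolve_dimension(model_name: str) -> int:
    # Fast path: known model, no API call needed.
    if model_name in KNOWN_DIMENSIONS:
        return KNOWN_DIMENSIONS[model_name]
    # Fallback: embed a probe text and infer the dimension from its length.
    return len(embed_one_text(model_name))
```

The same shape applies to the other classes below whose dimension property documents this known-dimensions-then-API-call behavior.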

get_config() dict[str, Any][source]

Get the configuration dictionary for the AmazonBedrockEmbeddingFunction.

Returns:

Dictionary containing configuration needed to restore this embedding function. Note: AWS credentials are NOT stored in the config for security reasons. Credentials should be provided via environment variables, IAM roles, or passed as parameters when restoring.
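The get_config / build_from_config pair above forms a persistence round trip: serialize the restorable settings, then rebuild an equivalent instance later. A minimal sketch of the pattern with a hypothetical class (note how the session, which carries credentials, is deliberately excluded from the config, matching the security note above):

```python
from typing import Any

class SketchEmbeddingFunction:
    """Hypothetical class illustrating the config round-trip pattern."""

    def __init__(self, model_name: str, session: Any = None):
        self.model_name = model_name
        self.session = session  # credentials live here, never in the config

    def get_config(self) -> dict[str, Any]:
        # Only restorable, non-secret settings are serialized.
        return {"model_name": self.model_name}

    @staticmethod
    def build_from_config(config: dict[str, Any]) -> "SketchEmbeddingFunction":
        if "model_name" not in config:
            raise ValueError("config is missing required field 'model_name'")
        return SketchEmbeddingFunction(model_name=config["model_name"])

ef = SketchEmbeddingFunction("amazon.titan-embed-text-v2", session=object())
restored = SketchEmbeddingFunction.build_from_config(ef.get_config())
```

When restoring, credentials must be supplied again out of band (environment variables, IAM roles, or constructor parameters).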

class pyseekdb.utils.embedding_functions.BM25SparseEmbeddingFunction(k: float = 1.2, b: float = 0.75, avg_doc_length: float = 256.0, token_max_length: int = 40, stopwords: Iterable[str] | None = None)[source]

Bases: SparseEmbeddingFunction

BM25 sparse embedding function.

Tokenizes text (lowercase, remove punctuation, filter stopwords, stem), hashes each stemmed token to a dimension index via MurmurHash3, and computes a BM25-style term frequency weight:

score = tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))

This is the query-independent part of BM25 (no IDF), suitable for building a sparse vector index. The inverse document frequency component can be handled at search time by the database engine.
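The weight formula can be checked directly. With the default parameters, a term occurring twice in a document of exactly average length scores 2 * (1.2 + 1) / (2 + 1.2) = 1.375, and weights saturate toward k + 1 = 2.2 as term frequency grows:

```python
def bm25_weight(tf: int, doc_len: int, k: float = 1.2, b: float = 0.75,
                avg_doc_length: float = 256.0) -> float:
    # Query-independent BM25 term weight (no IDF), matching the formula above.
    return tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_doc_length))

# At doc_len == avg_doc_length the length normalization factor is 1,
# so the weight reduces to tf * (k + 1) / (tf + k).
w = bm25_weight(tf=2, doc_len=256)  # 2 * 2.2 / 3.2 = 1.375
```

Longer-than-average documents shrink the weight (the b term grows the denominator), and shorter ones boost it, which is the document-length normalization the b parameter controls.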

Parameters:
  • k – BM25 k1 parameter controlling term-frequency saturation. Default 1.2.

  • b – BM25 b parameter controlling document-length normalization. Default 0.75.

  • avg_doc_length – Assumed average document length in tokens. Default 256.0.

  • token_max_length – Maximum token length; longer tokens are dropped. Default 40.

  • stopwords – Custom stopword list. None uses built-in English stopwords.

Example

>>> ef = BM25SparseEmbeddingFunction(k=1.5, b=0.8)
>>> vectors = ef(["machine learning algorithms"])
>>> print(vectors[0])
SparseVector(3 non-zero entries)
static build_from_config(config: dict[str, Any]) BM25SparseEmbeddingFunction[source]

Restore instance from configuration dictionary.

embed_query(documents: str | list[str]) list[SparseVector][source]

Alias — BM25 uses the same encoding for documents and queries.

get_config() dict[str, Any][source]

Get configuration dictionary (for persistence).

Returns:

Configuration dictionary. Should NOT include ‘name’ field (handled by upper layer).

static name() str[source]

Return unique name identifier (for registration and routing).

class pyseekdb.utils.embedding_functions.CohereEmbeddingFunction(model_name: str = 'embed-english-v3.0', api_key_env: str | None = None, input_type: str | None = None, **kwargs: Any)[source]

Bases: LiteLLMBaseEmbeddingFunction

A convenient embedding function for Cohere embedding models using LiteLLM.

For more information about Cohere models, see https://docs.cohere.com/docs/cohere-embed

For LiteLLM documentation, see https://docs.litellm.ai/docs/embedding/supported_embedding

Example

pip install pyseekdb litellm

static build_from_config(config: dict[str, Any]) CohereEmbeddingFunction[source]

Build a CohereEmbeddingFunction from its configuration dictionary.

Parameters:

config – Dictionary containing the embedding function’s configuration

Returns:

Restored CohereEmbeddingFunction instance

Raises:

ValueError – If the configuration is invalid or missing required fields

property dimension: int

Get the dimension of embeddings produced by this function.

Returns the dimension without making an API call when the model is in the known dimensions list.

If the model is not in the known dimensions list, falls back to making an API call and inferring the dimension from the returned embedding.

Returns:

The dimension of embeddings for this model.

Return type:

int

get_config() dict[str, Any][source]

Get the configuration dictionary for the CohereEmbeddingFunction.

Returns:

Dictionary containing configuration needed to restore this embedding function

static name() str[source]

Get the unique name identifier for CohereEmbeddingFunction.

Returns:

The name identifier for this embedding function type

class pyseekdb.utils.embedding_functions.GoogleVertexEmbeddingFunction(model_name: str = 'textembedding-gecko', project_id: str = 'cloud-large-language-models', region: str = 'us-central1', api_key_env: str | None = 'GOOGLE_VERTEX_API_KEY')[source]

Bases: EmbeddingFunction[str | list[str]]

A convenient embedding function for Google Vertex AI embedding models.

For more information about Google Vertex AI models, see https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api

Example

pip install pyseekdb google-cloud-aiplatform

get_config() dict[str, Any][source]

Get the configuration dictionary for the embedding function.

This method should return a dictionary that contains all the information needed to restore the embedding function after restart.

Returns:

Dictionary containing the embedding function’s configuration. Note: The ‘name’ field is not included as it’s handled by the upper layer for routing.

class pyseekdb.utils.embedding_functions.HuggingFaceSparseEmbeddingFunction(model_name: str = 'prithivida/Splade_PP_en_v1', device: str = 'cpu', task: Literal['document', 'query'] = 'document', **kwargs: Any)[source]

Bases: SparseEmbeddingFunction

Sparse embedding function powered by HuggingFace SparseEncoder models.

Uses sentence_transformers.SparseEncoder to produce sparse vectors (e.g., SPLADE activations) for keyword-based retrieval.

The model is loaded lazily and cached at the class level, so multiple instances sharing the same model_name reuse one loaded model.
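The lazy, class-level caching described above can be sketched like this. It is a toy illustration: expensive_load stands in for constructing a SparseEncoder, and the counter exists only to make the single-load behavior observable.

```python
from typing import Any, ClassVar

def expensive_load(model_name: str) -> Any:
    # Stand-in for loading a SparseEncoder from HuggingFace.
    return {"model": model_name}

class LazyCachedEncoder:
    _cache: ClassVar[dict[str, Any]] = {}  # shared by all instances
    load_count = 0  # instrumentation for the sketch only

    def __init__(self, model_name: str):
        self.model_name = model_name  # nothing is loaded yet (lazy)

    @property
    def model(self) -> Any:
        # First access for a given model_name loads and caches;
        # later accesses (from any instance) reuse the cached model.
        if self.model_name not in self._cache:
            LazyCachedEncoder.load_count += 1
            self._cache[self.model_name] = expensive_load(self.model_name)
        return self._cache[self.model_name]

a = LazyCachedEncoder("prithivida/Splade_PP_en_v1")
b = LazyCachedEncoder("prithivida/Splade_PP_en_v1")
```

Because the cache is a class attribute keyed by model name, two instances configured with the same model share one loaded model rather than loading it twice.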

Parameters:
  • model_name – HuggingFace model identifier (e.g. "prithivida/Splade_PP_en_v1").

  • device – Compute device ("cpu", "cuda", "cuda:0", etc.).

  • task – Encoding mode — "document" for indexing, "query" for searching. Defaults to "document".

  • **kwargs – Extra keyword arguments forwarded to SparseEncoder().

static build_from_config(config: dict[str, Any]) HuggingFaceSparseEmbeddingFunction[source]

Restore instance from configuration dictionary.

embed_query(documents: str | list[str]) list[SparseVector][source]

Encode queries into sparse vectors using encode_query.

Regardless of the task setting, this method always uses the query encoding path, which is typically preferred at search time for asymmetric models (e.g., SPLADE).

Parameters:

documents – A single string or list of strings.

Returns:

List of SparseVector instances, one per input query.

get_config() dict[str, Any][source]

Get configuration dictionary (for persistence).

Returns:

Configuration dictionary. Should NOT include ‘name’ field (handled by upper layer).

static name() str[source]

Return unique name identifier (for registration and routing).

class pyseekdb.utils.embedding_functions.JinaEmbeddingFunction(model_name: str = 'jina-embeddings-v3', api_key_env: str | None = None, **kwargs: Any)[source]

Bases: LiteLLMBaseEmbeddingFunction

A convenient embedding function for Jina AI embedding models.

This class provides a simplified interface to Jina AI embedding models using LiteLLM.

For more information about Jina AI models, see https://jina.ai/embeddings

For LiteLLM documentation, see https://docs.litellm.ai/docs/embedding/supported_embedding

Example

pip install pyseekdb litellm

static build_from_config(config: dict[str, Any]) JinaEmbeddingFunction[source]

Build a JinaEmbeddingFunction from its configuration dictionary.

Parameters:

config – Dictionary containing the embedding function’s configuration

Returns:

Restored JinaEmbeddingFunction instance

Raises:

ValueError – If the configuration is invalid or missing required fields

property dimension: int

Get the dimension of embeddings produced by this function.

Returns the dimension without making an API call when the model is in the known dimensions list.

If the model is not in the known dimensions list, falls back to making an API call and inferring the dimension from the returned embedding.

Returns:

The dimension of embeddings for this model.

Return type:

int

get_config() dict[str, Any][source]

Get the configuration dictionary for the JinaEmbeddingFunction.

Returns:

Dictionary containing configuration needed to restore this embedding function

static name() str[source]

Get the unique name identifier for JinaEmbeddingFunction.

Returns:

The name identifier for this embedding function type

class pyseekdb.utils.embedding_functions.LiteLLMBaseEmbeddingFunction(model_name: str, api_key_env: str | None = None, **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

A custom embedding function using LiteLLM to access various embedding models.

LiteLLM provides a unified interface to access embedding models from multiple providers including OpenAI, Hugging Face, Cohere, and many others.

You can extend this class to create your own embedding function by overriding the __call__ method. See https://docs.litellm.ai/docs/embedding/supported_embedding for more information.

Example

pip install pyseekdb litellm
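Extending this class by overriding __call__ follows the shape below. Everything in this sketch is a stand-in: BaseEmbeddingFunction substitutes for pyseekdb's EmbeddingFunction base, and fake_embed substitutes for an actual litellm.embedding(...) call, so the example runs without network access or API keys.

```python
def fake_embed(model: str, texts: list[str]) -> list[list[float]]:
    # Stand-in for litellm.embedding(...); returns dummy 1-d "embeddings".
    return [[float(len(t))] for t in texts]

class BaseEmbeddingFunction:
    # Stand-in for pyseekdb's EmbeddingFunction base class.
    def __call__(self, documents):
        raise NotImplementedError

class MyLiteLLMFunction(BaseEmbeddingFunction):
    def __init__(self, model_name: str):
        self.model_name = model_name

    def __call__(self, documents):
        # Accept a single string or a list of strings, as the real
        # EmbeddingFunction[str | list[str]] classes in this module do.
        if isinstance(documents, str):
            documents = [documents]
        return fake_embed(self.model_name, documents)

ef = MyLiteLLMFunction("openai/text-embedding-3-small")
vectors = ef(["hello", "world!"])
```

In a real subclass, fake_embed would be replaced by a litellm.embedding call using the configured model name and API key environment variable.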

class pyseekdb.utils.embedding_functions.MistralEmbeddingFunction(model_name: str = 'mistral-embed', api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for Mistral text embedding models.

This class provides a simplified interface to Mistral text embeddings using the OpenAI-compatible API.

Note: The embeddings API only accepts the model name and input texts.

For more information about Mistral embeddings, see: https://docs.mistral.ai/capabilities/embeddings/text_embeddings

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

class pyseekdb.utils.embedding_functions.MorphEmbeddingFunction(model_name: str, api_key_env: str | None = None, api_base: str | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for Morph embedding models.

This class provides a simplified interface to Morph embedding models using the OpenAI-compatible API.

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

static name() str[source]

Get the unique name identifier for MorphEmbeddingFunction.

Returns:

The name identifier for this embedding function type

class pyseekdb.utils.embedding_functions.OllamaEmbeddingFunction(model_name: str = 'nomic-embed-text', api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for Ollama embedding models.

This class provides a simplified interface to Ollama embedding models using the OpenAI-compatible API. Ollama provides OpenAI-compatible API endpoints for embedding generation.

For more information about Ollama, see https://docs.ollama.com/

Note: Before using a model, you need to pull it locally using ollama pull <model_name>.

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

static name() str[source]

Get the unique name identifier for OllamaEmbeddingFunction.

Returns:

The name identifier for this embedding function type

class pyseekdb.utils.embedding_functions.OnnxEmbeddingFunction(model_name: str, hf_model_id: str, dimension: int, download_path: Path | None = None, preferred_providers: list[str] | None = None)[source]

Bases: object

Generic ONNX runtime embedding function.

This class handles model download, tokenizer/model loading, and embedding generation using onnxruntime.

property dimension: int

Get the dimension of embeddings produced by this function.

max_tokens() int[source]

Get the maximum number of tokens supported by the model.

property model: Any

Get the model.

Returns:

The model.

property tokenizer: Any

Get the tokenizer for the model.

Returns:

The tokenizer for the model.

class pyseekdb.utils.embedding_functions.OpenAIBaseEmbeddingFunction(model_name: str, api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

Base embedding function for OpenAI-compatible embedding APIs.

This class provides a common implementation for embedding functions that use OpenAI-compatible APIs. It uses the openai package to make API calls.

Subclasses should override:
  • _get_default_api_base() – Return the default API base URL

  • _get_default_api_key_env() – Return the default API key environment variable name

  • _get_model_dimensions() – Return a dict mapping model names to their default dimensions

  • Optionally, __init__ – to set model-specific defaults

Example

import pyseekdb
from pyseekdb.utils.embedding_functions import OpenAIBaseEmbeddingFunction

class MyEmbeddingFunction(OpenAIBaseEmbeddingFunction):
    def _get_default_api_base(self):
        return "https://api.example.com/v1"

    def _get_default_api_key_env(self):
        return "MY_API_KEY"

    def _get_model_dimensions(self):
        return {"model-v1": 1536, "model-v2": 1024}

property dimension: int

Get the dimension of embeddings produced by this function.

Returns the known dimension for models without making an API call. If the dimensions parameter is specified, that value is returned. Otherwise, the default dimension for the model is returned.

If the model is not in the known dimensions list, falls back to calling the parent’s dimension detection (which may make an API call).

Returns:

The dimension of embeddings for this model.

Return type:

int

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

class pyseekdb.utils.embedding_functions.OpenAIEmbeddingFunction(model_name: str = 'text-embedding-3-small', api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for OpenAI embedding models.

This class provides a simplified interface to OpenAI embedding models using the OpenAI API.

For more information about OpenAI models, see https://platform.openai.com/docs/guides/embeddings

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

class pyseekdb.utils.embedding_functions.QwenEmbeddingFunction(model_name: str, api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for Qwen (Alibaba Cloud) embedding models.

This class provides a simplified interface to Qwen embedding models using the OpenAI-compatible API. Qwen provides OpenAI-compatible API endpoints for embedding generation.

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the QwenEmbeddingFunction.

Returns:

Dictionary containing configuration needed to restore this embedding function

static name() str[source]

Get the unique name identifier for QwenEmbeddingFunction.

Returns:

The name identifier for this embedding function type

class pyseekdb.utils.embedding_functions.SentenceTransformerEmbeddingFunction(model_name: str = 'all-MiniLM-L6-v2', device: str = 'cpu', normalize_embeddings: bool = False, **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

An embedding function using sentence-transformers with a specific model.

Example

pip install pyseekdb sentence-transformers

get_config() dict[str, Any][source]

Get the configuration dictionary for the embedding function.

This method should return a dictionary that contains all the information needed to restore the embedding function after restart.

Returns:

Dictionary containing the embedding function’s configuration. Note: The ‘name’ field is not included as it’s handled by the upper layer for routing.

class pyseekdb.utils.embedding_functions.SiliconflowEmbeddingFunction(model_name: str = 'BAAI/bge-large-zh-v1.5', api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for SiliconFlow embedding models.

This class provides a simplified interface to SiliconFlow embedding models using the OpenAI-compatible API. SiliconFlow provides OpenAI-compatible API endpoints for embedding generation.

For more information about SiliconFlow models, see https://docs.siliconflow.cn/en/api-reference/embeddings/create-embeddings

Example

pip install pyseekdb openai

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

class pyseekdb.utils.embedding_functions.TencentHunyuanEmbeddingFunction(model_name: str = 'hunyuan-embedding', api_key_env: str | None = None, api_base: str | None = None, dimensions: int | None = None, **kwargs: Any)[source]

Bases: OpenAIBaseEmbeddingFunction

A convenient embedding function for Tencent Hunyuan embedding models.

This class provides a simplified interface to Tencent Hunyuan embedding models using the OpenAI-compatible API. Tencent Hunyuan provides OpenAI-compatible API endpoints for embedding generation.

For more information about Tencent Hunyuan models, see https://cloud.tencent.com/document/product/1729/111007

Note: The embedding interface currently only supports input and model parameters. The model is fixed as hunyuan-embedding and dimensions are fixed at 1024.

Example

pip install pyseekdb openai

property dimension: int

Get the dimension of embeddings produced by this function.

Returns the known dimension for models without making an API call. If the dimensions parameter is specified, that value is returned. Otherwise, the default dimension for the model is returned.

If the model is not in the known dimensions list, falls back to calling the parent’s dimension detection (which may make an API call).

Returns:

The dimension of embeddings for this model.

Return type:

int

get_config() dict[str, Any][source]

Get the configuration dictionary for the OpenAIBaseEmbeddingFunction.

Subclasses should override the name() method to provide the correct name for routing.

Returns:

Dictionary containing configuration needed to restore this embedding function

class pyseekdb.utils.embedding_functions.Text2VecEmbeddingFunction(model_name: str = 'shibing624/text2vec-base-chinese', device: str = 'cpu', normalize_embeddings: bool = False, **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

An embedding function using text2vec with a specific model.

Text2Vec provides multilingual embeddings (supports 100+ languages) with various pretrained models.

static build_from_config(config: dict[str, Any]) Text2VecEmbeddingFunction[source]

Build Text2VecEmbeddingFunction from configuration dictionary.

property dimension: int

Get the dimension of embeddings produced by this function.

get_config() dict[str, Any][source]

Get configuration dictionary for serialization.

static name() str[source]

Return the embedding function name identifier.

class pyseekdb.utils.embedding_functions.VoyageaiEmbeddingFunction(model_name: str = 'voyage-4-large', api_key_env: str | None = None, input_type: str | None = None, truncation: bool | None = None, output_dimension: int | None = None, **kwargs: Any)[source]

Bases: EmbeddingFunction[str | list[str]]

A convenient embedding function for Voyage AI embedding models.

This class provides a simplified interface to Voyage AI embedding models using the voyageai package.

For more information about Voyage AI models, see https://docs.voyageai.com/docs/embeddings

Example

pip install pyseekdb voyageai

get_config() dict[str, Any][source]

Get the configuration dictionary for the embedding function.

This method should return a dictionary that contains all the information needed to restore the embedding function after restart.

Returns:

Dictionary containing the embedding function’s configuration. Note: The ‘name’ field is not included as it’s handled by the upper layer for routing.