API Reference

This page contains the auto-generated API reference for pyseekdb.

Main Package

pyseekdb - Unified vector database client wrapper

Based on seekdb and pymysql, providing a simple and unified API.

Supports two modes:

Embedded mode - using local seekdb
Remote server mode - connecting to remote server via pymysql (supports both seekdb Server and OceanBase Server)

Examples:

Embedded mode - Collection management:

import pyseekdb
client = pyseekdb.Client(path="./seekdb.db", database="test")
collection = client.get_or_create_collection("my_collection")

Remote server mode (seekdb Server) - Collection management:

import pyseekdb
client = pyseekdb.Client(
    host='localhost',
    port=2881,
    tenant="sys",
    database="test",
    user="root",
    password="pass"
)
collection = client.get_or_create_collection("my_collection")

Remote server mode (OceanBase Server) - Collection management:

import pyseekdb
client = pyseekdb.Client(
    host='localhost',
    port=2881,
    tenant="test",
    database="test",
    user="root",
    password="pass"
)
collection = client.get_or_create_collection("my_collection")

Admin client - Database management:

import pyseekdb
admin = pyseekdb.AdminClient(path="./seekdb.db")
admin.create_database("new_db")
databases = admin.list_databases()

class pyseekdb.AdminAPI[source]

Bases: ABC

Abstract admin API interface for database management. Defines the contract for database operations.

abstractmethod create_database(name: str, tenant: str = 'test') → None[source]

Create database

Parameters:

name – database name
tenant – tenant name (for OceanBase)

abstractmethod delete_database(name: str, tenant: str = 'test') → None[source]

Delete database

Parameters:

name – database name
tenant – tenant name (for OceanBase)

abstractmethod get_database(name: str, tenant: str = 'test') → Database[source]

Get database object

Parameters:

name – database name
tenant – tenant name (for OceanBase)

Returns:

Database object

abstractmethod list_databases(limit: int | None = None, offset: int | None = None, tenant: str = 'test') → Sequence[Database][source]

List all databases

Parameters:

limit – maximum number of results to return
offset – number of results to skip
tenant – tenant name (for OceanBase)

Returns:

Sequence of Database objects
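The limit and offset parameters follow the usual pagination convention: skip offset results, then return at most limit. The following is a minimal sketch of that convention in plain Python. The paginate helper and the database names are hypothetical and are not part of pyseekdb.

```python
from typing import Optional, Sequence


def paginate(items: Sequence[str], limit: Optional[int] = None,
             offset: Optional[int] = None) -> list[str]:
    """Skip `offset` items, then return at most `limit` items.

    Hypothetical helper: it only illustrates the limit/offset convention.
    """
    start = offset or 0
    end = None if limit is None else start + limit
    return list(items[start:end])


# Hypothetical database names, standing in for Database objects.
names = ["db_a", "db_b", "db_c", "db_d", "db_e"]
page_one = paginate(names, limit=2)            # first two entries
page_two = paginate(names, limit=2, offset=2)  # next two entries
everything = paginate(names)                   # no limit, no offset
```

With a real admin client, the same two values would be passed as list_databases(limit=..., offset=...).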

pyseekdb.AdminClient(path: str | None = None, host: str | None = None, port: int | None = None, tenant: str = 'sys', user: str | None = None, password: str = '', **kwargs) → _AdminClientProxy[source]

Smart admin client factory function (proxy pattern)

Automatically selects embedded or remote server mode based on parameters:

If path is provided, uses embedded mode
If host/port is provided, uses remote server mode (supports both seekdb Server and OceanBase Server)

Returns a lightweight AdminClient proxy that only exposes database operations. For collection management, use Client().

Parameters:

path – seekdb data directory path (embedded mode)
host – server address (remote server mode)
port – server port (remote server mode, default 2881)
tenant – tenant name (remote server mode, default “sys” for seekdb Server, “test” for OceanBase)
user – username (remote server mode, without tenant suffix)
password – password (remote server mode). If not provided, will be retrieved from SEEKDB_PASSWORD environment variable
**kwargs – other parameters

Returns:

A proxy that only exposes database operations

Return type:

_AdminClientProxy

Examples

>>> # Embedded mode
>>> admin = AdminClient(path="/path/to/seekdb")
>>> admin.create_database("new_db")  # ✅ Available
>>> # admin.create_collection("coll")  # ❌ Not available

>>> # Remote server mode (seekdb Server)
>>> admin = AdminClient(
...     host='localhost',
...     port=2881,
...     tenant="sys",
...     user="root",
...     password="pass"
... )

>>> # Remote server mode (OceanBase Server)
>>> admin = AdminClient(
...     host='localhost',
...     port=2881,
...     tenant="test",
...     user="root",
...     password="pass"
... )
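The proxy pattern described above can be sketched in plain Python: a thin wrapper forwards only an allow-list of database methods and hides everything else. The class names _FullClient and AdminOnlyProxy are hypothetical and chosen for illustration; this is not the implementation of _AdminClientProxy.

```python
class _FullClient:
    """Hypothetical stand-in for a client with both kinds of operations."""

    def __init__(self):
        self._databases = []

    def create_database(self, name):
        self._databases.append(name)

    def list_databases(self):
        return list(self._databases)

    def create_collection(self, name):
        return f"collection:{name}"


class AdminOnlyProxy:
    """Expose only database operations of the wrapped client."""

    _ALLOWED = {"create_database", "list_databases"}

    def __init__(self, target):
        self._target = target

    def __getattr__(self, attr):
        # Called only for names not found on the proxy itself.
        if attr not in self._ALLOWED:
            raise AttributeError(f"{attr!r} is not available on the admin proxy")
        return getattr(self._target, attr)


admin = AdminOnlyProxy(_FullClient())
admin.create_database("new_db")
```

Calling admin.create_collection("coll") on this sketch raises AttributeError, which mirrors the "Not available" comment in the examples above.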

class pyseekdb.BaseClient[source]

Bases: BaseConnection, AdminAPI

Abstract base class for all clients.

Design pattern:

1. Provides public collection management methods (create_collection, get_collection, etc.)
2. Defines internal operation interfaces (_collection_* methods) called by Collection objects
3. Subclasses implement all abstract methods to provide specific business logic

Benefits of this design:

Collection object interface is unified regardless of which client created it
Different clients can have completely different underlying implementations (SQL/gRPC/REST)
Easy to extend with new client types

Inherits connection management from BaseConnection and database operations from AdminAPI.
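The delegation described in the design pattern can be sketched as follows. InMemoryClient and its _collection_add / _collection_count methods are hypothetical names used for illustration; only the general _collection_* naming comes from the text above, and a real client would run SQL rather than use a dict.

```python
class Collection:
    """Lightweight wrapper: holds a name and delegates work to the client."""

    def __init__(self, client, name):
        self._client = client
        self._name = name

    def add(self, ids, documents):
        self._client._collection_add(self._name, ids, documents)

    def count(self):
        return self._client._collection_count(self._name)


class InMemoryClient:
    """Hypothetical client that stores rows in a dict."""

    def __init__(self):
        self._rows = {}

    def create_collection(self, name):
        self._rows[name] = {}
        return Collection(self, name)

    # Internal operations called by Collection objects.
    def _collection_add(self, name, ids, documents):
        for item_id, document in zip(ids, documents):
            self._rows[name][item_id] = document

    def _collection_count(self, name):
        return len(self._rows[name])


client = InMemoryClient()
docs = client.create_collection("docs")
docs.add(ids=["1", "2"], documents=["Hello world", "How are you?"])
```

Because Collection never touches storage directly, swapping InMemoryClient for a different client would not change the calling code.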

count_collection() → int[source]

Count the total number of collections.

Returns: The number of collections.

Examples

>>> count = client.count_collection()
>>> print(f"Database has {count} collections")

create_collection(name: str, schema: ~pyseekdb.client.schema.Schema | None = None, configuration: ~pyseekdb.client.configuration.Configuration | ~pyseekdb.client.configuration.HNSWConfiguration | None = <pyseekdb.client.client_base._NotProvided object>, embedding_function: ~pyseekdb.client.embedding_function.EmbeddingFunction[str | list[str]] | None | ~typing.Any = <pyseekdb.client.client_base._NotProvided object>, **kwargs) → Collection[source]

Create a new collection.

Parameters:

name – The name of the collection to create. Must contain only alphanumeric characters or underscores.
schema – Schema configuration. Defaults to None (uses default schema). Can be a Schema object.
configuration – Index configuration. Can be a Configuration or HNSWConfiguration object. If omitted, uses HNSW with cosine distance and dimension 384. If explicitly set to None, the dimension will be inferred from the embedding function.
embedding_function – The embedding function to use for this collection. Defaults to DefaultEmbeddingFunction (all-MiniLM-L6-v2). If set to None, no embedding function will be used (embeddings must be provided manually).
**kwargs – Additional parameters for collection creation.

Returns:

The created Collection object.

Raises:

ValueError – If the collection name is invalid, already exists, or if the configuration/embedding function combination is invalid (e.g., dimension mismatch).
TypeError – If the configuration object is of an invalid type.

Examples

Create a collection with default settings:

>>> client.create_collection("my_collection")

Create a collection with a custom embedding function:

>>> from pyseekdb import DefaultEmbeddingFunction
>>> ef = DefaultEmbeddingFunction(model_name="all-MiniLM-L6-v2")
>>> collection = client.create_collection("my_docs", embedding_function=ef)

Create a collection with specific configuration:

>>> from pyseekdb import HNSWConfiguration
>>> config = HNSWConfiguration(dimension=128, distance="l2")
>>> collection = client.create_collection(
...     "custom_config",
...     configuration=config,
...     embedding_function=None
... )
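The signature uses a sentinel default (_NotProvided) so that an omitted argument can be told apart from an explicit None. Below is a minimal sketch of that idea, combined with the documented name rule and the dimension-mismatch error. The helper names, the _NOT_PROVIDED object, and the exact error messages are assumptions for illustration, not pyseekdb internals.

```python
import re

_NOT_PROVIDED = object()   # hypothetical stand-in for the library's sentinel
DEFAULT_DIMENSION = 384    # documented default dimension
_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name):
    """Alphanumeric characters or underscores only, and not empty."""
    return bool(_NAME_RE.fullmatch(name))


def plan_dimension(config_dim=_NOT_PROVIDED, embedding_dim=None):
    """Pick an index dimension from the three possible argument states."""
    if config_dim is _NOT_PROVIDED:
        # Argument omitted: fall back to the documented default.
        return DEFAULT_DIMENSION
    if config_dim is None:
        # Explicit None: infer from the embedding function.
        if embedding_dim is None:
            raise ValueError("cannot infer a dimension without an embedding function")
        return embedding_dim
    if embedding_dim is not None and embedding_dim != config_dim:
        raise ValueError(f"dimension mismatch: {config_dim} != {embedding_dim}")
    return config_dim


omitted = plan_dimension()
inferred = plan_dimension(None, embedding_dim=128)
explicit = plan_dimension(128)
```

How the library reconciles an omitted configuration with a custom embedding function of a different dimension is not covered by this sketch.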

create_database(name: str, tenant: str = 'test') → None[source]

Create database

Parameters:

name – database name
tenant – tenant name (for OceanBase)

delete_collection(name: str) → None[source]

Delete a collection.

Parameters: name – The name of the collection to delete.
Raises: ValueError – If the collection does not exist.

Examples

>>> client.delete_collection("my_collection")

delete_database(name: str, tenant: str = 'test') → None[source]

Delete database

Parameters:

name – database name
tenant – tenant name (for OceanBase)

detect_db_type_and_version() → tuple[str, Version][source]

Detect database type and version.

Works for all three modes: seekdb-embedded, seekdb-server, and oceanbase. Version detection is case-insensitive for seekdb.

Returns: ("seekdb", Version("x.x.x.x")) or ("oceanbase", Version("x.x.x.x"))
Return type: (db_type, version)
Raises: ValueError – If unable to detect database type or version

Examples

>>> db_type, version = client.detect_db_type_and_version()
>>> version > Version("1.0.0.0")
True
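Four-part versions such as "1.1.0.0" compare correctly once they are split into integer tuples, which is one way to gate a feature on the detected version. The sketch below applies that to the fork requirement stated elsewhere in this reference (seekdb 1.1.0.0 or higher). The parse_version and supports_fork helpers are hypothetical and do not use the library's Version class.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'x.x.x.x' into a tuple of ints so comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_fork(db_type: str, version_text: str,
                  minimum: str = "1.1.0.0") -> bool:
    """Hypothetical check: seekdb at or above the minimum version."""
    return (db_type.lower() == "seekdb"
            and parse_version(version_text) >= parse_version(minimum))


# Numeric comparison avoids the string pitfall where "1.10" < "1.9".
newer = parse_version("1.10.0.0") > parse_version("1.9.0.0")
```

In real code the tuple returned by detect_db_type_and_version() would supply the db_type and version values.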

get_collection(name: str, embedding_function: ~pyseekdb.client.embedding_function.EmbeddingFunction[str | list[str]] | None | ~typing.Any = <pyseekdb.client.client_base._NotProvided object>) → Collection[source]

get_database(name: str, tenant: str = 'test') → Database[source]

Get database object

Parameters:

name – database name
tenant – tenant name (for OceanBase)

get_or_create_collection(name: str, schema: ~pyseekdb.client.schema.Schema | None = None, configuration: ~pyseekdb.client.configuration.Configuration | ~pyseekdb.client.configuration.HNSWConfiguration | None = <pyseekdb.client.client_base._NotProvided object>, embedding_function: ~pyseekdb.client.embedding_function.EmbeddingFunction[str | list[str]] | None | ~typing.Any = <pyseekdb.client.client_base._NotProvided object>, **kwargs) → Collection[source]

Get a collection if it exists, otherwise create it.

Parameters:

name – The name of the collection.
schema – Schema configuration for fine-grained index control, including sparse vector index support. When provided, configuration and embedding_function parameters are ignored.
configuration – Index configuration. Can be a Configuration or HNSWConfiguration object. If omitted, uses HNSW with cosine distance and dimension 384. If explicitly set to None, the dimension will be inferred from the embedding function. Ignored if schema is provided.
embedding_function – The embedding function to use for this collection. Defaults to DefaultEmbeddingFunction (all-MiniLM-L6-v2). If set to None, no embedding function will be used (embeddings must be provided manually). Ignored if schema is provided.
**kwargs – Additional parameters passed to create_collection if the collection is created.

Returns:

The existing or newly created Collection object.

Raises:

ValueError – If the configuration/embedding function combination is invalid (e.g., dimension mismatch).

Examples

>>> collection = client.get_or_create_collection("my_collection")

has_collection(name: str) → bool[source]

Check if a collection exists.

Parameters: name – The name of the collection to check.
Returns: True if the collection exists, False otherwise.

Examples

>>> if client.has_collection("my_collection"):
...     print("Collection exists!")

list_collections() → list[Collection][source]

List all collections in the database.

Returns: A list of Collection objects.

Examples

>>> collections = client.list_collections()
>>> for col in collections:
...     print(col.name)

list_databases(limit: int | None = None, offset: int | None = None, tenant: str = 'test') → Sequence[Database][source]

List all databases

Parameters:

limit – maximum number of results to return
offset – number of results to skip
tenant – tenant name (for OceanBase)

class pyseekdb.BaseConnection[source]

Bases: ABC

Abstract base class for connection management. Defines unified connection interface for all clients.

abstractmethod get_raw_connection() → Any[source]: Get raw connection object

abstractmethod is_connected() → bool[source]: Check connection status

abstract property mode: str: Return client mode (e.g., ‘SeekdbEmbeddedClient’, ‘RemoteServerClient’)

class pyseekdb.BengFulltextIndexConfig(min_token_size: int | None = None, max_token_size: int | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: FulltextIndexConfig

class pyseekdb.BqHNSWConfiguration(dimension: int = 384, distance: str | pyseekdb.client.configuration.DistanceMetric = 'l2', *, lib: str | pyseekdb.client.configuration.HNSWIndexLib = 'vsag', m: int | None = None, ef_construction: int | None = None, ef_search: int | None = None, extra_info_max_size: int | None = None, refine_k: float | None = None, refine_type: str | pyseekdb.client.configuration.BQRefineType | None = None, bq_bits_query: int | None = None, bq_use_fht: bool | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: HNSWConfiguration

pyseekdb.Client(path: str | None = None, host: str | None = None, port: int | None = None, tenant: str = 'sys', database: str = 'test', user: str | None = None, password: str = '', **kwargs) → _ClientProxy[source]

Smart client factory function (returns ClientProxy for collection operations only)

Automatically selects embedded or remote server mode based on parameters:

If path is provided, uses embedded mode
If host/port is provided, uses remote server mode (supports both seekdb Server and OceanBase Server)
If neither path nor host is provided, defaults to embedded mode with current working directory as path

Returns a ClientProxy that only exposes collection operations. For database management, use AdminClient().

Parameters:

path – seekdb data directory path (embedded mode). If not provided and host is also not provided, defaults to current working directory
host – server address (remote server mode)
port – server port (remote server mode, default 2881)
tenant – tenant name (remote server mode, default “sys” for seekdb Server, “test” for OceanBase)
database – database name
user – username (remote server mode, without tenant suffix)
password – password (remote server mode). If not provided, will be retrieved from SEEKDB_PASSWORD environment variable
**kwargs – other parameters

Returns:

A proxy that only exposes collection operations

Return type:

_ClientProxy

Examples

>>> # Embedded mode with explicit path
>>> client = Client(path="/path/to/seekdb", database="db1")
>>> client.create_collection("my_collection")  # ✅ Available

>>> # Embedded mode (default, uses current working directory)
>>> client = Client(database="db1")
>>> client.create_collection("my_collection")  # ✅ Available

>>> # Remote server mode (seekdb Server)
>>> client = Client(
...     host='localhost',
...     port=2881,
...     tenant="sys",
...     database="db1",
...     user="root",
...     password="pass"
... )

>>> # Remote server mode (OceanBase Server)
>>> client = Client(
...     host='localhost',
...     port=2881,
...     tenant="test",
...     database="db1",
...     user="root",
...     password="pass"
... )
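The mode selection and password fallback described for Client() can be sketched in plain Python. The select_mode and resolve_password helpers are hypothetical. Giving path priority when both path and host are passed is an assumption, since the reference does not say which wins.

```python
import os
from typing import Optional


def select_mode(path: Optional[str] = None,
                host: Optional[str] = None) -> tuple[str, Optional[str]]:
    """Return (mode, data_path) following the documented rules."""
    if path is not None:
        return ("embedded", path)
    if host is not None:
        return ("remote", None)
    # Neither given: embedded mode in the current working directory.
    return ("embedded", os.getcwd())


def resolve_password(password: str = "") -> str:
    """Use the explicit password, else the SEEKDB_PASSWORD variable."""
    return password or os.environ.get("SEEKDB_PASSWORD", "")


explicit_path = select_mode(path="./seekdb.db")
remote = select_mode(host="localhost")
default = select_mode()
```

Reading the password from the environment keeps credentials out of source files, which is the likely motivation for the SEEKDB_PASSWORD fallback.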

class pyseekdb.ClientAPI[source]

Bases: ABC

Client API interface for collection operations only. This is what end users interact with through the Client proxy.

abstractmethod create_collection(name: str, schema: ~pyseekdb.client.schema.Schema | None = None, configuration: ~pyseekdb.client.configuration.Configuration | ~pyseekdb.client.configuration.HNSWConfiguration | None = <pyseekdb.client.client_base._NotProvided object>, embedding_function: ~pyseekdb.client.embedding_function.EmbeddingFunction[str | list[str]] | None | ~typing.Any = <pyseekdb.client.client_base._NotProvided object>, **kwargs) → Collection[source]

Create collection

Parameters:

name – Collection name
schema – Schema configuration for fine-grained index control, including sparse vector index support. When provided, configuration and embedding_function parameters are ignored.
configuration – Index configuration (Configuration or HNSWConfiguration). For backward compatibility, HNSWConfiguration is still accepted. Configuration can include fulltext analyzer configuration (FulltextIndexConfig). Ignored if schema is provided.
embedding_function – Embedding function to convert documents to embeddings. Defaults to DefaultEmbeddingFunction. If explicitly set to None, collection will not have an embedding function. Ignored if schema is provided.
**kwargs – Additional parameters

abstractmethod delete_collection(name: str) → None[source]: Delete collection

abstractmethod get_collection(name: str, embedding_function: ~pyseekdb.client.embedding_function.EmbeddingFunction[str | list[str]] | None | ~typing.Any = <pyseekdb.client.client_base._NotProvided object>) → Collection[source]

Get an existing collection.

Parameters:

name – The name of the collection to retrieve.
embedding_function – The embedding function to use. If not provided, it will try to load the function used when creating the collection. If explicitly set to None, no embedding function will be used.

Returns:

The Collection object.

Raises:

ValueError – If the collection does not exist.

Examples

>>> collection = client.get_collection("my_collection")

abstractmethod has_collection(name: str) → bool[source]: Check if collection exists

abstractmethod list_collections() → list[Collection][source]: List all collections

class pyseekdb.Collection(client: Any, name: str, collection_id: str | None = None, dimension: int | None = None, embedding_function: EmbeddingFunction[EmbeddingDocuments] | None = None, distance: str | None = None, sparse_vector_index_config: SparseVectorIndexConfig | None = None, **metadata)[source]

Bases: object

Collection unified interface class

Design principles:

Collection is a lightweight wrapper that only holds metadata
All operations delegate to the client via self._client._collection_*() methods
Different clients (OceanBase, Seekdb, Milvus, etc.) provide different implementations
Users see identical interface regardless of which client created the collection

Collection.add – Add data to the collection

Parameters:

ids – Single ID or list of IDs
embeddings – Single embedding or list of embeddings (optional if documents provided and embedding_function is set)
metadatas – Single metadata dict or list of metadata dicts (optional)
documents – Single document or list of documents (optional). If provided without embeddings, embedding_function is used to generate the embeddings
**kwargs – Additional parameters

Examples

# Add single item with embeddings
collection.add(ids="1", embeddings=[0.1, 0.2, 0.3], metadatas={"tag": "A"})

# Add multiple items with embeddings
collection.add(
    ids=["1", "2", "3"],
    embeddings=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    metadatas=[{"tag": "A"}, {"tag": "B"}, {"tag": "C"}]
)

# Add items with documents (embeddings will be auto-generated if embedding_function is set)
collection.add(
    ids=["1", "2"],
    documents=["Hello world", "How are you?"],
    metadatas=[{"tag": "A"}, {"tag": "B"}]
)
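Because each argument accepts either a single value or a list, a caller-side helper can normalize everything to lists and check that the lengths agree before calling add. This is a minimal sketch under that assumption. The normalize_add_args helper is hypothetical and says nothing about how pyseekdb normalizes its inputs internally.

```python
def _as_list(value):
    """Wrap a single value in a list; pass lists through; keep None."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def _as_embedding_list(value):
    """Treat a flat list of numbers as one embedding."""
    if value is None:
        return None
    if len(value) > 0 and isinstance(value[0], (int, float)):
        return [list(value)]
    return [list(vector) for vector in value]


def normalize_add_args(ids, embeddings=None, metadatas=None, documents=None):
    normalized = {
        "ids": _as_list(ids),
        "embeddings": _as_embedding_list(embeddings),
        "metadatas": _as_list(metadatas),
        "documents": _as_list(documents),
    }
    expected = len(normalized["ids"])
    for field, values in normalized.items():
        if values is not None and len(values) != expected:
            raise ValueError(f"{field} has {len(values)} entries, expected {expected}")
    return normalized


single = normalize_add_args(ids="1", embeddings=[0.1, 0.2, 0.3],
                            metadatas={"tag": "A"})
```

The normalized dict could then be unpacked into a call such as collection.add(**single).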

property client: Any: Associated client

count() → int[source]

Get the number of items in collection

Returns: Item count

Examples

count = collection.count()
print(f"Collection has {count} items")

delete(ids: str | list[str] | None = None, where: dict[str, Any] | None = None, where_document: dict[str, Any] | None = None, **kwargs) → None[source]

Delete data from collection

Parameters:

ids – Single ID or list of IDs to delete (optional)
where – Filter condition on metadata (optional)
where_document – Filter condition on documents (optional)
**kwargs – Additional parameters

Note

At least one of ids, where, or where_document must be provided

Examples

# Delete by IDs
collection.delete(ids=["1", "2", "3"])

# Delete by metadata filter
collection.delete(where={"tag": "A"})

# Delete by document filter
collection.delete(where_document={"$contains": "keyword"})

property dimension: int | None: Vector dimension

property distance: str | None: Distance metric used by the index (e.g., ‘l2’, ‘cosine’, ‘inner_product’)

property embedding_function: EmbeddingFunction[EmbeddingDocuments] | None: Embedding function for this collection

fork(forked_name: str) → Collection[source]

Fork (duplicate) this collection to create a new collection with the same data.

The forked collection is independent - modifications to one collection do not affect the other. The original collection remains unchanged.

Parameters: forked_name – Name for the new forked collection. Must be a valid collection name (letters, digits, and underscores only, not empty).
Returns: The newly created forked collection.
Return type: Collection
Raises: ValueError – If fork is not enabled for this database, if the collection name is invalid, or if a collection with the given name already exists.

Note

Fork is only available for seekdb database version 1.1.0.0 or higher.

Examples

# Fork a collection
original = client.get_collection("my_collection")
forked = original.fork("my_collection_backup")

# Verify both collections have the same data
assert original.count() == forked.count()

# Add data to forked collection (original is unaffected)
forked.add(ids="new_id", embeddings=[1.0, 2.0, 3.0], documents="New document")
assert original.count() == 3  # Original unchanged
assert forked.count() == 4  # Forked has new data

Collection.get – Get data from the collection by IDs or filters

Parameters:

ids – Single ID or list of IDs to retrieve (optional)
where – Filter condition on metadata (optional)
where_document – Filter condition on documents (optional)
limit – Maximum number of results to return (optional)
offset – Number of results to skip (optional)
include – Fields to include in results, e.g., ["metadatas", "documents", "embeddings"] (optional)
**kwargs – Additional parameters

Returns:

ids: List[str] - List of IDs
documents: Optional[List[str]] - List of documents (if included)
metadatas: Optional[List[Dict]] - List of metadata dictionaries (if included)
embeddings: Optional[List[List[float]]] - List of embeddings (if included)

Return type:

Dict with keys (chromadb-compatible format)

Note

If no parameters provided, returns all data (up to limit)

Examples

# Get by single ID
results = collection.get(ids="1")
# results["ids"] contains ["1"]
# results["documents"] contains document for ID "1"

# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# results["ids"] contains ["1", "2", "3"]
# results["documents"] contains documents for all IDs

# Get by filter
results = collection.get(
    where={"tag": "A"},
    limit=10
)
# results["ids"] contains all matching IDs
# results["documents"] contains all matching documents

# Get all data
results = collection.get(limit=100)
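The returned dict holds parallel lists, so rows are reassembled by position. The sketch below walks a hand-written result of the documented shape. The sample values and the to_rows helper are hypothetical, and fields that were not included are assumed to be None or absent.

```python
def to_rows(results):
    """Zip the parallel lists of a get() result into one dict per item."""
    ids = results["ids"]
    # Fields that were not included may be None or missing.
    documents = results.get("documents") or [None] * len(ids)
    metadatas = results.get("metadatas") or [None] * len(ids)
    return [
        {"id": item_id, "document": document, "metadata": metadata}
        for item_id, document, metadata in zip(ids, documents, metadatas)
    ]


# Hand-written sample in the documented shape (not real query output).
sample = {
    "ids": ["1", "2"],
    "documents": ["Hello world", "How are you?"],
    "metadatas": [{"tag": "A"}, {"tag": "B"}],
    "embeddings": None,
}
rows = to_rows(sample)
```

Note that query and hybrid_search return one more level of nesting (one list per query), so this helper applies to get and peek results only.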

property has_sparse_vector_index: bool: Check if this collection has a sparse vector index.

hybrid_search(query: dict[str, Any] | None = None, knn: dict[str, Any] | None = None, rank: dict[str, Any] | None = None, n_results: int = 10, include: list[str] | None = None, **kwargs) → dict[str, Any][source]

Hybrid search combining full-text search and vector similarity search

Parameters:

query – Full-text search configuration dict with:
    where_document: Document filter conditions (e.g., {"$contains": "text"})
    where: Metadata filter conditions (e.g., {"page": {"$gte": 5}})
    n_results: Number of results for full-text search (optional)
knn – Vector search configuration dict with:
    query_texts: Query text(s) to be embedded (optional if query_embeddings provided)
    query_embeddings: Query vector(s) (optional if query_texts provided)
    where: Metadata filter conditions (optional)
    n_results: Number of results for vector search (optional)
rank – Ranking configuration dict (e.g., {"rrf": {"rank_window_size": 60, "rank_constant": 60}})
n_results – Final number of results to return after ranking (default: 10)
include – Fields to include in results (e.g., ["documents", "metadatas", "embeddings"])
**kwargs – Additional parameters

Returns:

ids: List[List[str]] - List of ID lists (one list for hybrid search result)
documents: Optional[List[List[str]]] - List of document lists (if included)
metadatas: Optional[List[List[Dict]]] - List of metadata lists (if included)
embeddings: Optional[List[List[List[float]]]] - List of embedding lists (if included)
distances: Optional[List[List[float]]] - List of distance lists

Return type:

Dict with keys (query-compatible format)

Examples

# Hybrid search with both full-text and vector search
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "where": {"category": {"$eq": "science"}},
        "n_results": 10
    },
    knn={
        "query_texts": ["AI research"],
        "where": {"year": {"$gte": 2020}},
        "n_results": 10
    },
    rank={"rrf": {}},
    n_results=5,
    include=["documents", "metadatas", "embeddings"]
)
# results["ids"][0] contains IDs for the hybrid search
# results["documents"][0] contains documents for the hybrid search
# results["distances"][0] contains distances for the hybrid search
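The "rrf" ranking key refers to Reciprocal Rank Fusion, which in its standard form scores each item as the sum of 1 / (rank_constant + rank) over the result lists it appears in. Below is a minimal sketch of that textbook formula. The rrf_fuse helper and the sample IDs are hypothetical, rank_window_size is not modeled, and the server-side ranking used by hybrid_search may differ in detail.

```python
def rrf_fuse(rankings, rank_constant=60, n_results=10):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest score first; ties broken by ID for a stable order.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:n_results]


# Hypothetical result lists from the two search legs.
fulltext_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = rrf_fuse([fulltext_hits, vector_hits], n_results=3)
```

Items that appear in both lists ("a" and "c" here) rise above items found by only one leg, which is the point of fusing the two rankings.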

property id: str | None: Collection ID

property metadata: dict[str, Any]: Collection metadata

property name: str: Collection name

peek(limit: int = 10) → dict[str, Any][source]

Quickly preview the first few items in the collection

Parameters:

limit – Number of items to preview (default: 10)

Returns:

ids: List[str] - List of IDs
documents: List[str] - List of documents (always included)
metadatas: List[Dict] - List of metadata dictionaries (always included)
embeddings: List[List[float]] - List of embeddings (always included)

Return type:

Dict with keys (chromadb-compatible format)

Examples

# Preview first 5 items (returns all columns by default)
preview = collection.peek(limit=5)
for i in range(len(preview["ids"])):
    print(f"ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
    print(f"Metadata: {preview['metadatas'][i]}, Embedding: {preview['embeddings'][i]}")

Collection.query – Query the collection by vector similarity

Parameters:

query_embeddings – Query vector(s) (optional if query_texts provided). For dense vector queries: list[float] or list[list[float]]. For sparse vector queries, provide query_texts instead and let the configured sparse embedding function generate sparse vectors.
query_texts – Query text(s) to be embedded (optional if query_embeddings provided)
n_results – Number of results to return (default: 10)
where – Filter condition on metadata supporting:
    Comparison operators: $eq, $lt, $gt, $lte, $gte, $ne, $in, $nin
    Logical operators: $or, $and, $not
where_document – Filter condition on documents supporting:
    $contains: full-text search
    $regex: regular expression matching
    Logical operators: $or, $and
include – Fields to include in results, e.g., ["documents", "metadatas", "embeddings"] (optional). By default, returns "documents" and "metadatas". Always includes "_id".
query_key – Specifies which index to query. Default is None (dense vector). Use K.SPARSE_EMBEDDING (or "#sparse_embedding") to query using the sparse vector index.
**kwargs – Additional parameters

Returns:

ids: List[List[str]] - List of ID lists, one list per query
documents: Optional[List[List[str]]] - List of document lists, one list per query (if included)
metadatas: Optional[List[List[Dict]]] - List of metadata lists, one list per query (if included)
embeddings: Optional[List[List[List[float]]]] - List of embedding lists, one list per query (if included)
distances: Optional[List[List[float]]] - List of distance lists, one list per query

Return type:

Dict with keys (chromadb-compatible format)

Examples

# Query by single embedding (dense vector)
results = collection.query(
    query_embeddings=[0.1, 0.2, 0.3],
    n_results=5
)

# Query by texts (will be embedded automatically)
results = collection.query(
    query_texts=["my query text"],
    n_results=10
)

# Sparse vector query using query_key
results = collection.query(
    query_texts=["fox animal"],
    query_key=K.SPARSE_EMBEDDING,
    n_results=5
)
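To make the metadata filter operators concrete, here is a small pure-Python evaluator that applies a where dict to one metadata dict. It is an illustration of the intended semantics only: the database evaluates filters server-side, and the shapes assumed here ($and / $or taking a list of sub-filters, $not taking a single sub-filter, a missing key never matching) are assumptions rather than documented behavior.

```python
from typing import Any

_COMPARISONS = {
    "$eq": lambda value, operand: value == operand,
    "$ne": lambda value, operand: value != operand,
    "$lt": lambda value, operand: value < operand,
    "$lte": lambda value, operand: value <= operand,
    "$gt": lambda value, operand: value > operand,
    "$gte": lambda value, operand: value >= operand,
    "$in": lambda value, operand: value in operand,
    "$nin": lambda value, operand: value not in operand,
}


def matches(metadata: dict[str, Any], where: dict[str, Any]) -> bool:
    """Return True if every condition in `where` holds for `metadata`."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif key == "$not":
            ok = not matches(metadata, condition)
        elif key not in metadata:
            ok = False  # assumption: a missing key never matches
        elif isinstance(condition, dict):
            ok = all(_COMPARISONS[op](metadata[key], operand)
                     for op, operand in condition.items())
        else:
            ok = metadata[key] == condition  # shorthand for $eq
        if not ok:
            return False
    return True


# Hypothetical metadata records.
items = [
    {"category": "science", "year": 2021},
    {"category": "science", "year": 2018},
    {"category": "art", "year": 2022},
]
recent_science = [m for m in items
                  if matches(m, {"category": {"$eq": "science"},
                                 "year": {"$gte": 2020}})]
```

The same where dict can be passed unchanged to query, get, or delete.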

property sparse_embedding_function: SparseEmbeddingFunction | None: Sparse embedding function for this collection, if configured.

property sparse_vector_index_config: SparseVectorIndexConfig | None: Sparse vector index configuration, if any.

Collection.update – Update existing data in the collection

Parameters:

ids – Single ID or list of IDs to update
embeddings – New embeddings (optional)
metadatas – New metadata (optional)
documents – New documents (optional)
**kwargs – Additional parameters

Note

IDs must exist, otherwise an error will be raised

Examples

# Update single item
collection.update(ids="1", metadatas={"tag": "B"})

# Update multiple items
collection.update(
    ids=["1", "2"],
    embeddings=[[0.9, 0.8], [0.7, 0.6]]
)

Collection.upsert – Insert or update data in the collection

Parameters:

ids – Single ID or list of IDs
embeddings – embeddings (optional if documents provided)
metadatas – Metadata (optional)
documents – Documents (optional)
**kwargs – Additional parameters

Note

If ID exists, update it; otherwise, insert new data

Examples

# Upsert single item
collection.upsert(ids="1", embeddings=[0.1, 0.2], metadatas={"tag": "A"})

# Upsert multiple items
collection.upsert(
    ids=["1", "2", "3"],
    embeddings=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
)
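The difference between update and upsert comes down to what happens when an ID is missing. The sketch below models both on a plain dict. The helper functions are hypothetical, KeyError stands in for whatever error the library actually raises, and merging fields on an existing ID (rather than replacing the whole record) is an assumption.

```python
def update(store, item_id, **fields):
    """Change an existing item; a missing ID is an error."""
    if item_id not in store:
        raise KeyError(f"ID {item_id!r} does not exist")
    store[item_id].update(fields)


def upsert(store, item_id, **fields):
    """Change the item if the ID exists, otherwise insert it."""
    if item_id in store:
        store[item_id].update(fields)
    else:
        store[item_id] = dict(fields)


# Hypothetical in-memory store keyed by ID.
store = {"1": {"metadata": {"tag": "A"}}}
update(store, "1", metadata={"tag": "B"})                   # existing ID: changed
upsert(store, "2", embedding=[0.1, 0.2])                    # new ID: inserted
upsert(store, "1", embedding=[0.9, 0.8])                    # existing ID: changed
```

Use update when a missing ID signals a bug in the caller, and upsert when the caller does not need to know whether the item already exists.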

class pyseekdb.Configuration(hnsw: HNSWConfiguration | None = None, fulltext_config: FulltextIndexConfig | None = None)[source]

Bases: object

Configuration for collection creation

Parameters:

hnsw – HNSWConfiguration or None
fulltext_config – FulltextIndexConfig or None. If None, defaults to FulltextIndexConfig(analyzer='ik')

class pyseekdb.Database(name: str, tenant: str | None = None, charset: str | None = None, collation: str | None = None, **kwargs)[source]

Bases: object

Database object representing a database instance.

Note

tenant is None for embedded/server mode (no tenant concept)
tenant is set for OceanBase mode (multi-tenant architecture)

class pyseekdb.DefaultEmbeddingFunction(model_name: str = 'all-MiniLM-L6-v2', preferred_providers: list[str] | None = None)[source]

Bases: EmbeddingFunction[str | list[str]]

Default embedding function using ONNX runtime.

Uses the ‘all-MiniLM-L6-v2’ model via ONNX, which produces 384-dimensional embeddings. This is a lightweight, fast model suitable for general-purpose text embeddings.

Example

>>> ef = DefaultEmbeddingFunction()
>>> embeddings = ef(["Hello world", "How are you?"])
>>> print(len(embeddings[0]))  # 384

static build_from_config(_config: dict[str, Any]) → Self[source]

property dimension: int: Get the dimension of embeddings produced by this function.

get_config() → dict[str, Any][source]

Get the configuration dictionary for the embedding function.

This method should return a dictionary that contains all the information needed to restore the embedding function after restart.

Returns: Dictionary containing the embedding function’s configuration. Note: The ‘name’ field is not included as it’s handled by the upper layer for routing.

static name() → str[source]

class pyseekdb.EmbeddingFunction(*args, **kwargs)[source]

Bases: Protocol[D]

Protocol for embedding functions that convert documents to vectors.

This is similar to Chroma’s EmbeddingFunction interface. Implementations should convert text documents to vector embeddings.

Implementations should also provide: - name(): Static method that returns a unique name identifier for routing (not persisted in config) - get_config(): Instance method that returns a configuration dictionary - build_from_config(config): Static method that restores an instance from config

Example

>>> class MyEmbeddingFunction(EmbeddingFunction[Documents]):
...     @staticmethod
...     def name() -> str:
...         return "my_embedding_function"
...     def __call__(self, documents: Documents) -> Embeddings:
...         # Convert documents to embeddings
...         return [[0.1, 0.2, ...], [0.3, 0.4, ...]]
...     def get_config(self) -> Dict[str, Any]:
...         return {...}  # Note: 'name' is not included
...     @staticmethod
...     def build_from_config(config: Dict[str, Any]) -> "MyEmbeddingFunction":
...         return MyEmbeddingFunction(...)
>>>
>>> ef = MyEmbeddingFunction()
>>> embeddings = ef(["Hello", "World"])
>>> config = ef.get_config()
>>> restored_ef = MyEmbeddingFunction.build_from_config(config)
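The example above elides its method bodies, so here is a complete toy implementation of the same four members (name, __call__, get_config, build_from_config) that runs without any model. HashEmbeddingFunction is hypothetical and produces hash-derived vectors with no semantic meaning. It is written as a plain class, without importing pyseekdb, on the assumption that the protocol is matched structurally; whether the library needs additional registration for persistence is not covered here.

```python
import hashlib
from typing import Any


class HashEmbeddingFunction:
    """Toy embedding function: deterministic vectors from a SHA-256 digest."""

    def __init__(self, dimension: int = 8):
        self._dimension = dimension

    @staticmethod
    def name() -> str:
        return "hash_embedding_function"

    def __call__(self, documents):
        # Accept a single string or a list of strings.
        if isinstance(documents, str):
            documents = [documents]
        return [self._embed(document) for document in documents]

    def _embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self._dimension)]

    def get_config(self) -> dict[str, Any]:
        # 'name' is deliberately left out, as the reference describes.
        return {"dimension": self._dimension}

    @staticmethod
    def build_from_config(config: dict[str, Any]) -> "HashEmbeddingFunction":
        return HashEmbeddingFunction(dimension=config["dimension"])


ef = HashEmbeddingFunction()
vectors = ef(["Hello", "World"])
restored = HashEmbeddingFunction.build_from_config(ef.get_config())
```

The round trip through get_config and build_from_config is what lets a collection restore its embedding function after a restart.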

abstractmethod get_config() → dict[str, Any][source]

Get the configuration dictionary for the embedding function.

This method should return a dictionary that contains all the information needed to restore the embedding function after restart.

Returns: Dictionary containing the embedding function’s configuration. Note: The ‘name’ field is not included as it’s handled by the upper layer for routing.

static support_persistence(embedding_function: Any) → bool[source]: Check if the embedding function supports persistence.

class pyseekdb.FulltextIndexConfig

Bases: object

Fulltext analyzer configuration for fulltext indexing.

Parameters:

analyzer – Analyzer name, can be ‘space’, ‘ngram’, ‘ngram2’, ‘beng’, ‘ik’ and so on (default: ‘ik’)
properties – Optional dictionary of parser-specific parameters (key: string, value: primitive type)

analyzer: str | FulltextAnalyzer = 'ik'

properties: dict[str, str | int | float | bool] | None = None

class pyseekdb.HNSWConfiguration[source]

Bases: object

HNSW (Hierarchical Navigable Small World) index configuration

Parameters:

dimension – Vector dimension (number of elements in each vector)
distance – Distance metric for similarity calculation (e.g., ‘l2’, ‘cosine’, ‘inner_product’)
properties – Optional dictionary of properties for the HNSW index (key: string, value: primitive type)

Please refer to the HNSW configuration documentation (https://en.oceanbase.com/docs/common-oceanbase-database-10000000003351043) for detailed information.

dimension: int = 384

distance: str | DistanceMetric = 'l2'

lib: str | HNSWIndexLib = 'vsag'

properties: dict[str, str | int | float | bool] | None = None

type: str | HNSWIndexType = 'hnsw'
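To make the three distance options concrete, here is a small sketch of the textbook definitions behind 'l2', 'inner_product', and 'cosine'. The function names are illustrative and are not pyseekdb APIs. How the index turns these quantities into a reported distance or score (for example, whether cosine is reported as a similarity or as 1 minus the similarity) is not stated in this reference, so treat that part as an open assumption.

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return inner_product(a, b) / (norm_a * norm_b)


u = [1.0, 0.0]
v = [0.0, 2.0]
w = [3.0, 0.0]
```

Here u and v are orthogonal, so their inner product and cosine similarity are both zero, while u and w point in the same direction and differ only in length, which L2 distance is sensitive to and cosine similarity is not.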

class pyseekdb.IKFulltextIndexConfig(ik_mode: str | pyseekdb.client.configuration.IKMode | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: FulltextIndexConfig

class pyseekdb.IKMode(value)[source]

Bases: str, Enum

MAX_WORD = 'max_word'

SMART = 'smart'

pyseekdb.K: alias of FieldKey

class pyseekdb.Ngram2FulltextIndexConfig(min_ngram_size: int | None = None, max_ngram_size: int | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: FulltextIndexConfig

class pyseekdb.NgramFulltextIndexConfig(ngram_token_size: int | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: FulltextIndexConfig

class pyseekdb.RemoteServerClient(host: str = 'localhost', port: int = 2881, tenant: str = 'sys', database: str = 'test', user: str = 'root', password: str = '', charset: str = 'utf8mb4', **kwargs)[source]

Bases: BaseClient

Remote server mode client (connecting via pymysql, lazy loading)

Supports both seekdb Server and OceanBase Server. Uses user@tenant format for authentication.

create_database(name: str, tenant: str = 'test') → None[source]

Create a database. The remote server is multi-tenant, so the client’s tenant is always used.

Parameters:

name – database name
tenant – tenant name (if it differs from the client’s tenant, the client’s tenant is used instead)

Note

The remote server has a multi-tenant architecture. The database is scoped to the client’s tenant.

delete_database(name: str, tenant: str = 'test') → None[source]

Delete a database. The remote server is multi-tenant, so the client’s tenant is always used.

Parameters:

name – database name
tenant – tenant name (if it differs from the client’s tenant, the client’s tenant is used instead)

Note

The remote server has a multi-tenant architecture. The database is scoped to the client’s tenant.

get_database(name: str, tenant: str = 'test') → Database[source]

Get a database object. The remote server is multi-tenant, so the client’s tenant is always used.

Parameters:

name – database name
tenant – tenant name (if it differs from the client’s tenant, the client’s tenant is used instead)

Returns:

Database object with tenant information

Note

The remote server has a multi-tenant architecture. The database is scoped to the client’s tenant.

get_raw_connection() → pymysql.Connection[source]: Get raw connection object

is_connected() → bool[source]: Check connection status

list_databases(limit: int | None = None, offset: int | None = None, tenant: str = 'test') → Sequence[Database][source]

List all databases. The remote server is multi-tenant, so the client’s tenant is always used.

Parameters:

limit – maximum number of results to return
offset – number of results to skip
tenant – tenant name (if it differs from the client’s tenant, the client’s tenant is used instead)

Returns:

Sequence of Database objects with tenant information

Note

The remote server has a multi-tenant architecture. Only databases in the client’s tenant are listed.
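The limit and offset parameters suggest a paging pattern. The sketch below is one possible way to walk all databases page by page, assuming limit and offset behave like SQL LIMIT/OFFSET. The helper `iter_databases` and the `FakeClient` stand-in are hypothetical and exist only so the example runs without a server; with a real client you would pass the connected client object instead.

```python
from typing import Any, Iterator, List, Optional, Sequence


def iter_databases(client: Any, page_size: int = 100) -> Iterator[Any]:
    """Yield every database by paging through list_databases(limit, offset)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = client.list_databases(limit=page_size, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        offset += len(page)


class FakeClient:
    """Stand-in for a connected client; it only slices an in-memory list."""

    def __init__(self, names: List[str]) -> None:
        self._names = names
        self.calls = 0

    def list_databases(
        self, limit: Optional[int] = None, offset: Optional[int] = None
    ) -> Sequence[str]:
        self.calls += 1
        start = offset or 0
        end = None if limit is None else start + limit
        return self._names[start:end]


fake = FakeClient(["db1", "db2", "db3", "db4", "db5"])
all_names = list(iter_databases(fake, page_size=2))
```

Stopping on a short page saves one round trip when the total is not a multiple of the page size; the empty-page check covers the case where it is.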

property mode: str: Return client mode (e.g., ‘SeekdbEmbeddedClient’, ‘RemoteServerClient’)

class pyseekdb.Schema(vector_index: VectorIndexConfig | HNSWConfiguration | None = None, sparse_vector_index: SparseVectorIndexConfig | None = None, fulltext_index: FulltextIndexConfig | None = None)[source]

Bases: object

Schema configuration for collection creation.

Schema provides fine-grained control over indexes and their parameters. When provided to create_collection, the older configuration and embedding_function parameters are ignored.

Default behavior:

If vector_index is not specified, a default HNSW index with L2 distance is used.
If fulltext_index is not specified, a default fulltext index with IK analyzer is used.
sparse_vector_index is optional and defaults to None (no sparse index).

Parameters:

vector_index – HNSW configuration for dense vector index (optional).
sparse_vector_index – Sparse vector index configuration (optional).
fulltext_index – Fulltext index configuration (optional).
embedding_function – Dense embedding function (optional). If provided with vector_index, this is associated with the dense vector index.

Example

>>> # Simple schema with sparse vector index
>>> schema = Schema(
...     sparse_vector_index=SparseVectorIndexConfig(
...         embedding_function=BM25EmbeddingFunction(),
...         source_key=K.DOCUMENT
...     )
... )
>>>
>>> # Full schema with all indexes
>>> schema = Schema(
...     vector_index=VectorIndexConfig(
...         hnsw=HNSWConfiguration(dimension=768, distance="cosine"),
...         embedding_function=OpenAIEmbeddingFunction(api_key_env="OPENAI_API_KEY")
...     ),
...     sparse_vector_index=SparseVectorIndexConfig(
...         embedding_function=BM25EmbeddingFunction()
...     ),
...     fulltext_index=FulltextIndexConfig(analyzer="ik"),
...
... )
>>>
>>> # Schema using create_index chaining
>>> schema = Schema().create_index(
...     VectorIndexConfig(
...         hnsw=HNSWConfiguration(dimension=768, distance="cosine"),
...         embedding_function=OpenAIEmbeddingFunction(api_key_env="OPENAI_API_KEY")
...     )
... ).create_index(
...     SparseVectorIndexConfig(embedding_function=BM25EmbeddingFunction())
... )

create_index(config: Any) → Schema[source]

Add an index configuration to this schema.

Supports method chaining for fluent API usage.

Parameters:

config – Index configuration object. Can be one of:

VectorIndexConfig: configures the dense vector index
HNSWConfiguration: configures the dense vector index with DefaultEmbeddingFunction
SparseVectorIndexConfig: configures the sparse vector index
FulltextIndexConfig: configures the fulltext index

Returns:

This Schema instance (for chaining).

Raises:

TypeError – If config is not a recognized index configuration type.

Example

>>> schema = Schema().create_index(
...     HNSWConfiguration(dimension=384, distance="cosine")
... ).create_index(
...     SparseVectorIndexConfig(embedding_function=BM25EmbeddingFunction())
... )

class pyseekdb.SpaceFulltextIndexConfig(min_token_size: int | None = None, max_token_size: int | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: FulltextIndexConfig

class pyseekdb.SparseEmbeddingFunction(*args, **kwargs)[source]

Bases: Protocol

Protocol for sparse embedding functions that convert documents to sparse vectors.

Sparse vectors are suitable for keyword-based retrieval (e.g., BM25, SPLADE). Similar to EmbeddingFunction, but produces sparse vectors (dict[int, float]) instead of dense vectors (list[float]).

Implementations should provide:

__call__(): Convert documents to sparse vectors
name(): Static method returning a unique name identifier (for registration and routing)
get_config(): Return configuration dictionary (for persistence)
build_from_config(): Static method to restore instance from config

Example

>>> class BM25EmbeddingFunction(SparseEmbeddingFunction):
...     def __init__(self, k1: float, b: float):
...         # BM25 parameters, stored so that get_config() can persist them
...         self.k1 = k1
...         self.b = b
...
...     def __call__(self, documents: Documents) -> SparseVectors:
...         # Generate BM25 sparse vectors
...         ...
...
...     @staticmethod
...     def name() -> str:
...         return "bm25"
...
...     def get_config(self) -> dict:
...         return {"k1": self.k1, "b": self.b}
...
...     @staticmethod
...     def build_from_config(config) -> "BM25EmbeddingFunction":
...         return BM25EmbeddingFunction(**config)

static build_from_config(config: dict[str, Any]) → SparseEmbeddingFunction[source]: Restore instance from configuration dictionary.

abstractmethod get_config() → dict[str, Any][source]

Get configuration dictionary (for persistence).

Returns:: Configuration dictionary. Should NOT include ‘name’ field (handled by upper layer).

static name() → str[source]: Return unique name identifier (for registration and routing).

static support_persistence(sparse_embedding_function: Any) → bool[source]

Check if the sparse embedding function supports persistence.

Parameters:: sparse_embedding_function – The sparse embedding function to check.
Returns:: True if persistence is supported, False otherwise.
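For a concrete feel of the protocol, the following self-contained sketch implements the four methods with a deliberately simple scheme: hashed term frequencies. It does not import pyseekdb and is not BM25 or SPLADE; the class name `TermFrequencySparseFn`, the `num_buckets` parameter, and the use of CRC32 for token indices are all assumptions made for illustration.

```python
import zlib
from collections import Counter
from typing import Any, Dict, List


class TermFrequencySparseFn:
    """Toy sparse embedding: hashed term frequencies (illustrative only)."""

    def __init__(self, num_buckets: int = 1 << 16) -> None:
        self.num_buckets = num_buckets

    def __call__(self, documents: List[str]) -> List[Dict[int, float]]:
        vectors: List[Dict[int, float]] = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            vector: Dict[int, float] = {}
            for token, count in counts.items():
                # crc32 is stable across runs, unlike the built-in hash().
                index = zlib.crc32(token.encode("utf-8")) % self.num_buckets
                # Tokens that collide in a bucket have their counts summed.
                vector[index] = vector.get(index, 0.0) + float(count)
            vectors.append(vector)
        return vectors

    @staticmethod
    def name() -> str:
        return "toy_term_frequency"

    def get_config(self) -> Dict[str, Any]:
        # 'name' is deliberately absent from the config.
        return {"num_buckets": self.num_buckets}

    @staticmethod
    def build_from_config(config: Dict[str, Any]) -> "TermFrequencySparseFn":
        return TermFrequencySparseFn(**config)


fn = TermFrequencySparseFn(num_buckets=1024)
vectors = fn(["hello world hello", ""])
restored = TermFrequencySparseFn.build_from_config(fn.get_config())
```

Only non-zero entries are stored, so the empty document maps to an empty dictionary, and a function restored from its config produces the same vectors as the original.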

class pyseekdb.SparseEmbeddingFunctionRegistry[source]

Bases: object

Registry for sparse embedding function classes.

Maps sparse embedding function names (returned by their name() method) to their corresponding classes, allowing dynamic instantiation from persisted configurations.

To register a custom sparse embedding function:

Option 1 (Recommended): Use the @register_sparse_embedding_function decorator:

>>> @register_sparse_embedding_function
... class MySparseFn(SparseEmbeddingFunction):
...     ...  # implementation goes here

Option 2: Manually register:

>>> SparseEmbeddingFunctionRegistry.register(MySparseFn)

classmethod build_from_config(name: str, config: dict[str, Any]) → SparseEmbeddingFunction[source]

Build a sparse embedding function from a name and config.

Parameters:

name – The name identifier of the sparse embedding function.
config – Configuration dictionary.

Returns:

A SparseEmbeddingFunction instance.

Raises:

ValueError – If the name is not registered.

classmethod get_class(name: str) → type | None[source]

Get a sparse embedding function class by name.

Parameters:: name – The name identifier of the sparse embedding function.
Returns:: The sparse embedding function class if found, None otherwise.

classmethod list_registered() → list[str][source]

List all registered sparse embedding function names.

Returns:: List of registered sparse embedding function names.

classmethod register(sparse_embedding_function_class: type) → None[source]

Register a sparse embedding function class.

Parameters:: sparse_embedding_function_class – The sparse embedding function class to register. Must implement name(), get_config(), and build_from_config().
Raises:: ValueError – If the class doesn’t have required methods or name is already registered.
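The name-to-class mapping described above can be pictured with a minimal stand-alone sketch. `ToyRegistry`, `register_toy`, and `ConstantSparseFn` are hypothetical names and this is not pyseekdb's implementation; it only mirrors the documented behavior (validation of required methods, ValueError on duplicate or unknown names, None from get_class for a missing name).

```python
from typing import Any, Dict, List, Optional


class ToyRegistry:
    """Illustrative registry that maps name() values to classes."""

    _classes: Dict[str, type] = {}

    @classmethod
    def register(cls, fn_class: type) -> None:
        for method in ("name", "get_config", "build_from_config"):
            if not callable(getattr(fn_class, method, None)):
                raise ValueError(f"{fn_class.__name__} is missing {method}()")
        key = fn_class.name()
        if key in cls._classes:
            raise ValueError(f"{key!r} is already registered")
        cls._classes[key] = fn_class

    @classmethod
    def get_class(cls, name: str) -> Optional[type]:
        return cls._classes.get(name)

    @classmethod
    def list_registered(cls) -> List[str]:
        return list(cls._classes)

    @classmethod
    def build_from_config(cls, name: str, config: Dict[str, Any]) -> Any:
        fn_class = cls.get_class(name)
        if fn_class is None:
            raise ValueError(f"unknown sparse embedding function: {name!r}")
        return fn_class.build_from_config(config)


def register_toy(fn_class: type) -> type:
    """Class decorator: register the class, then return it unchanged."""
    ToyRegistry.register(fn_class)
    return fn_class


@register_toy
class ConstantSparseFn:
    def __init__(self, weight: float = 1.0) -> None:
        self.weight = weight

    def __call__(self, documents):
        return [{0: self.weight} for _ in documents]

    @staticmethod
    def name() -> str:
        return "constant"

    def get_config(self) -> Dict[str, Any]:
        return {"weight": self.weight}

    @staticmethod
    def build_from_config(config: Dict[str, Any]) -> "ConstantSparseFn":
        return ConstantSparseFn(**config)


# Rebuild an instance from a persisted (name, config) pair.
rebuilt = ToyRegistry.build_from_config("constant", {"weight": 2.0})
```

Keeping the name outside the config is what lets the registry pick the class first and hand only the remaining settings to build_from_config().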

class pyseekdb.SparseVector(embeddings: dict[int, float] | None = None)[source]

Bases: object

Sparse vector representation.

A sparse vector is a dictionary mapping integer indices (feature/token positions) to float values (weights). Only non-zero entries are stored, making this efficient for high-dimensional but sparse data (e.g., BM25 scores, SPLADE activations).

Format compatible with OceanBase/seekdb: {index: weight, ...}

Example

>>> sv = SparseVector.from_dict({100: 0.5, 200: 0.3, 500: 0.8})
>>> print(sv.embeddings)
{100: 0.5, 200: 0.3, 500: 0.8}

>>> sv = SparseVector.from_indices([100, 200, 500], [0.5, 0.3, 0.8])
>>> print(sv.embeddings)
{100: 0.5, 200: 0.3, 500: 0.8}

embeddings: dict[int, float] | None = None

static from_dict(embeddings: dict[int, float]) → SparseVector[source]

Create a SparseVector from a dictionary.

Parameters:: embeddings – Dictionary mapping integer indices to float weights.
Returns:: A new SparseVector instance.

Example

>>> sv = SparseVector.from_dict({100: 0.5, 200: 0.3, 500: 0.8})

static from_indices(indices: list[int], values: list[float]) → SparseVector[source]

Create a SparseVector from parallel lists of indices and values.

Parameters:

indices – List of integer indices (feature/token positions).
values – List of float weights corresponding to each index.

Returns:

A new SparseVector instance.

Raises:

ValueError – If indices and values have different lengths.

Example

>>> sv = SparseVector.from_indices([100, 200, 500], [0.5, 0.3, 0.8])

to_sql_string() → str[source]

Convert the sparse vector to OceanBase SQL format.

Returns:: SQL string representation, e.g., '{100:0.5, 200:0.3, 500:0.8}'
Raises:: ValueError – If the sparse vector is empty or None.
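The two behaviors documented for this class, building from parallel lists and rendering the `{index:weight, ...}` text form, can be sketched with plain dictionaries. The functions `sparse_from_indices` and `sparse_to_sql` below are illustrative stand-ins rather than the library's code; in particular, whether the real to_sql_string() sorts indices or formats floats differently is not stated here and is an assumption of this sketch.

```python
from typing import Dict, List


def sparse_from_indices(indices: List[int], values: List[float]) -> Dict[int, float]:
    """Pair each index with its weight, rejecting mismatched lengths."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return dict(zip(indices, values))


def sparse_to_sql(embeddings: Dict[int, float]) -> str:
    """Render a sparse vector as '{index:weight, ...}' in insertion order."""
    if not embeddings:
        raise ValueError("sparse vector is empty")
    body = ", ".join(f"{index}:{weight}" for index, weight in embeddings.items())
    return "{" + body + "}"


vec = sparse_from_indices([100, 200, 500], [0.5, 0.3, 0.8])
sql = sparse_to_sql(vec)
```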

class pyseekdb.SparseVectorIndexConfig(embedding_function: ~pyseekdb.client.sparse_embedding_function.SparseEmbeddingFunction, source_key: str | ~pyseekdb.client.types.FieldKey | None = <pyseekdb.client.types.FieldKey object>, lib: str = 'vsag', distance: str = 'inner_product', type: str = 'sindi', prune: bool = False, refine: bool = False, drop_ratio_build: float = 0.0, drop_ratio_search: float = 0.0, refine_k: float = 4.0, properties: dict[str, str | int | float | bool] | None = None)[source]

Bases: object

Sparse vector index configuration.

Sparse vectors are suitable for keyword-based retrieval (e.g., BM25, SPLADE). They complement dense vectors and can be used for hybrid search.

Parameters:

embedding_function – Sparse embedding function (e.g., BM25EmbeddingFunction, SpladeEmbeddingFunction).
source_key – Source field key specifying which field to generate sparse vectors from: K.DOCUMENT or "#document" uses the document field (default); a plain string like "title" uses metadata["title"].
lib – Vector index library (default: “vsag”)
distance – Distance metric (default: “inner_product”). Only inner_product is supported for sparse vectors.
type – Index type (default: “sindi”)
prune – Whether to enable pruning (default: False)
refine – Whether to enable refining (default: False)
drop_ratio_build – Drop ratio for index building (default: 0.0)
drop_ratio_search – Drop ratio for search (default: 0.0)
refine_k – Refine K factor (default: 4.0)

Note

Each collection can have at most one sparse vector index.
Sparse vectors are stored in the sparse_embedding column.
embedding_function is required and must support persistence.
Sparse vectors are always generated from source_key by embedding_function.

Example

>>> # Auto-generate from document field
>>> config = SparseVectorIndexConfig(
...     embedding_function=BM25EmbeddingFunction(),
...     source_key=K.DOCUMENT
... )
>>>
>>> # Auto-generate from metadata field
>>> config = SparseVectorIndexConfig(
...     embedding_function=BM25EmbeddingFunction(),
...     source_key="title"
... )
>>>

distance: str = 'inner_product'

drop_ratio_build: float = 0.0

drop_ratio_search: float = 0.0

embedding_function: SparseEmbeddingFunction

lib: str = 'vsag'

properties: dict[str, str | int | float | bool] | None = None

prune: bool = False

refine: bool = False

refine_k: float = 4.0

resolve_source_key() → tuple[str, str | None][source]

Resolve the source_key to determine data source.

Returns:

Tuple of (source_type, metadata_key), where:

source_type is “document” or “metadata”
metadata_key is the metadata field name (only for the “metadata” source_type)
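A minimal sketch of the resolution rule, assuming only string keys: "#document" selects the document field and any other string is treated as a metadata field name. The function `resolve_source_key_sketch` is hypothetical; how the real method handles FieldKey objects or None is not covered by this reference and is left out.

```python
from typing import Optional, Tuple

# String form of K.DOCUMENT, as given in the source_key parameter description.
DOCUMENT_KEY = "#document"


def resolve_source_key_sketch(source_key: str) -> Tuple[str, Optional[str]]:
    """Map a string source key to (source_type, metadata_key)."""
    if source_key == DOCUMENT_KEY:
        return ("document", None)
    # Any other string is assumed to name a metadata field.
    return ("metadata", source_key)


from_document = resolve_source_key_sketch("#document")
from_metadata = resolve_source_key_sketch("title")
```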

source_key: str | FieldKey | None = <pyseekdb.client.types.FieldKey object>

type: str = 'sindi'

class pyseekdb.SqHNSWConfiguration(dimension: int = 384, distance: str | pyseekdb.client.configuration.DistanceMetric = 'l2', *, lib: str | pyseekdb.client.configuration.HNSWIndexLib = 'vsag', m: int | None = None, ef_construction: int | None = None, ef_search: int | None = None, extra_info_max_size: int | None = None, properties: dict[str, str | int | float | bool] | None = None)[source]: Bases: HNSWConfiguration

class pyseekdb.VectorIndexConfig(hnsw: pyseekdb.client.configuration.HNSWConfiguration | None = None, embedding_function: pyseekdb.client.embedding_function.EmbeddingFunction | None = None)[source]

Bases: object

embedding_function: EmbeddingFunction | None = None

hnsw: HNSWConfiguration | None = None

class pyseekdb.Version(version_str: str)[source]

Bases: object

Represents a version number with support for comparison operations.

Supports versions in format: x.x.x or x.x.x.x (3 or 4 numeric parts)

Examples

>>> v1 = Version("1.0.1.0")
>>> v2 = Version("1.0.0.1")
>>> v1 > v2
True

>>> v1 = Version("1.2.3")
>>> v2 = Version("1.2.4")
>>> v1 < v2
True

property build: int: Get build version number (0 if not specified)

property major: int: Get major version number

property minor: int: Get minor version number

property parts: tuple[int, int, int, int]: Get version parts as tuple

property patch: int: Get patch version number
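The comparison behavior shown in the examples follows naturally if a version string is normalized to a 4-tuple of integers with the build part defaulting to 0. The sketch below illustrates that idea with a hypothetical `parse_version` function; it is not the Version class itself, and the error raised for malformed input is an assumption.

```python
from typing import Tuple


def parse_version(version_str: str) -> Tuple[int, int, int, int]:
    """Parse 'x.x.x' or 'x.x.x.x' into (major, minor, patch, build)."""
    parts = version_str.strip().split(".")
    valid = len(parts) in (3, 4) and all(p.isascii() and p.isdigit() for p in parts)
    if not valid:
        raise ValueError(f"expected x.x.x or x.x.x.x, got {version_str!r}")
    numbers = [int(p) for p in parts]
    if len(numbers) == 3:
        numbers.append(0)  # build defaults to 0 when not specified
    return (numbers[0], numbers[1], numbers[2], numbers[3])


three_part = parse_version("1.2.3")
newer = parse_version("1.0.1.0") > parse_version("1.0.0.1")
```

Python compares tuples element by element from left to right, which matches the major, minor, patch, build precedence.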

pyseekdb.get_default_embedding_function() → DefaultEmbeddingFunction[source]

Get or create the default embedding function instance.

Returns:: DefaultEmbeddingFunction instance

pyseekdb.register_embedding_function(embedding_function_class: type[T]) → type[T][source]

Decorator to automatically register an embedding function class.

This decorator can be used as a class decorator to automatically register an embedding function when the class is defined, eliminating the need to manually call EmbeddingFunctionRegistry.register().

Parameters:

embedding_function_class – The embedding function class to register. It must implement:

A static name() method that returns a unique identifier
A get_config() instance method that returns a configuration dict
A static build_from_config(config) method to restore instances

Returns:

The same class (for use as a decorator).

Raises:

ValueError – If the class doesn’t have the required methods or if the name is already registered to a different class.

Example

>>> from pyseekdb.client.embedding_function import (
...     EmbeddingFunction, Documents, Embeddings, register_embedding_function
... )
>>> from typing import Dict, Any
>>>
>>> @register_embedding_function
... class MyCustomEmbeddingFunction(EmbeddingFunction[Documents]):
...     def __init__(self, model_name: str = "my-model"):
...         self.model_name = model_name
...
...     def __call__(self, input: list[str]|str) -> list[list[float]]:
...         # Your embedding logic
...         return [[0.1, 0.2, 0.3] for _ in (input if isinstance(input, list) else [input])]
...
...     @staticmethod
...     def name() -> str:
...         return "my_custom_embedding"
...
...     def get_config(self) -> Dict[str, Any]:
...         return {"model_name": self.model_name}
...
...     @staticmethod
...     def build_from_config(config: Dict[str, Any]) -> "MyCustomEmbeddingFunction":
...         return MyCustomEmbeddingFunction(model_name=config.get("model_name", "my-model"))
>>>
>>> # The class is now automatically registered!
>>> # You can use it immediately when creating collections
>>> import pyseekdb
>>> client = pyseekdb.Client(path="./seekdb.db")
>>> ef = MyCustomEmbeddingFunction()
>>> collection = client.create_collection("my_collection", embedding_function=ef)

pyseekdb.register_sparse_embedding_function(sparse_embedding_function_class: type[T]) → type[T][source]

Decorator to automatically register a sparse embedding function class.

Example

>>> @register_sparse_embedding_function
... class MyBM25Function:
...     def __call__(self, documents):
...         ...
...     @staticmethod
...     def name() -> str:
...         return "my_bm25"
...     def get_config(self) -> dict:
...         return {}
...     @staticmethod
...     def build_from_config(config) -> "MyBM25Function":
...         return MyBM25Function()

Utility Modules

Embedding Functions

The following embedding function classes are available in pyseekdb.utils.embedding_functions:

pyseekdb.utils.embedding_functions

Embedding function implementations for pyseekdb.