5. DQL Operations

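DQL (Data Query Language) operations retrieve data from collections. Three methods are covered here: query() for vector similarity search, get() for retrieval by ID or filter, and hybrid_search() for combined full-text and vector search.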

5.1 Query (Vector Similarity Search)

The query() method performs vector similarity search to find the most similar documents to the query vector(s).

Behavior with Embedding Function:

If query_embeddings are provided: the embeddings are used directly and embedding_function is NOT called
If query_embeddings are NOT provided but query_texts are provided:
- If the collection has an embedding_function, it automatically generates the query embeddings from the texts
- If the collection does NOT have an embedding_function, a ValueError is raised
If neither query_embeddings nor query_texts are provided: a ValueError is raised
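
The precedence rules above can be sketched as a small standalone function. This is only an illustration of the documented behavior, not the library's implementation: resolve_query_embeddings and toy_embed are hypothetical names introduced here, and the toy embedder simply maps each text to its length.

```python
from typing import Callable, List, Optional, Union

Embedding = List[float]
EmbeddingFn = Callable[[List[str]], List[Embedding]]


def resolve_query_embeddings(
    query_embeddings: Optional[List[Embedding]] = None,
    query_texts: Optional[Union[str, List[str]]] = None,
    embedding_function: Optional[EmbeddingFn] = None,
) -> List[Embedding]:
    """Hypothetical helper mirroring the documented precedence rules."""
    # Rule 1: explicit embeddings win; the embedding function is never called.
    if query_embeddings is not None:
        return query_embeddings
    # Rule 2: texts need an embedding function to be turned into vectors.
    if query_texts is not None:
        if embedding_function is None:
            raise ValueError("query_texts given, but no embedding_function is set")
        texts = [query_texts] if isinstance(query_texts, str) else list(query_texts)
        return embedding_function(texts)
    # Rule 3: nothing to search with.
    raise ValueError("provide query_embeddings or query_texts")


def toy_embed(texts: List[str]) -> List[Embedding]:
    """Toy embedder (assumption for this sketch): one 2-d vector per text."""
    return [[float(len(text)), 0.0] for text in texts]


print(resolve_query_embeddings(query_texts="ab", embedding_function=toy_embed))
```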

# Basic vector similarity query (embedding_function not used)
results = collection.query(
    query_embeddings=[1.0, 2.0, 3.0],
    n_results=3
)

# Iterate over results
for i in range(len(results["ids"][0])):
    print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
    if results.get("documents"):
        print(f"Document: {results['documents'][0][i]}")
    if results.get("metadatas"):
        print(f"Metadata: {results['metadatas'][0][i]}")

# Query by texts - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
results = collection.query(
    query_texts=["my query text"],
    n_results=10
)
# The collection's embedding_function will automatically convert query_texts to query_embeddings

# Query by multiple texts (batch query)
results = collection.query(
    query_texts=["query text 1", "query text 2"],
    n_results=5
)
# Returns dict with lists of lists, one list per query text
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")

# Query with metadata filter (using query_texts)
results = collection.query(
    query_texts=["AI research"],
    where={"category": {"$eq": "AI"}},
    n_results=5
)

# Query with comparison operator (using query_texts)
results = collection.query(
    query_texts=["machine learning"],
    where={"score": {"$gte": 90}},
    n_results=5
)

# Query with document filter (using query_texts)
results = collection.query(
    query_texts=["neural networks"],
    where_document={"$contains": "machine learning"},
    n_results=5
)

# Query with combined filters (using query_texts)
results = collection.query(
    query_texts=["AI research"],
    where={"category": {"$eq": "AI"}, "score": {"$gte": 90}},
    where_document={"$contains": "machine"},
    n_results=5
)

# Query with multiple embeddings (batch query)
results = collection.query(
    query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    n_results=2
)
# Returns dict with lists of lists, one list per query embedding
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")

# Query with specific fields
results = collection.query(
    query_embeddings=[1.0, 2.0, 3.0],
    include=["documents", "metadatas", "embeddings"],
    n_results=3
)

Parameters:

query_embeddings (List[float] or List[List[float]], optional): Single embedding or list of embeddings for batch queries
- If provided, used directly (embedding_function is ignored)
- If not provided, must provide query_texts and collection must have embedding_function
query_texts (str or List[str], optional): Query text(s) to be embedded
- If query_embeddings is not provided, query_texts is converted to embeddings using the collection's embedding_function
n_results (int, optional): Number of similar results to return (default: 10)
where (dict, optional): Metadata filter conditions (see Filter Operators section)
where_document (dict, optional): Document content filter
include (List[str], optional): List of fields to include: ["documents", "metadatas", "embeddings"]

Returns: Dict with keys (chromadb-compatible format):

ids: List[List[str]] - List of ID lists, one list per query
documents: Optional[List[List[str]]] - List of document lists, one list per query (if included)
metadatas: Optional[List[List[Dict]]] - List of metadata lists, one list per query (if included)
embeddings: Optional[List[List[List[float]]]] - List of embedding lists, one list per query (if included)
distances: Optional[List[List[float]]] - List of distance lists, one list per query

Usage:

# Single query
results = collection.query(query_embeddings=[0.1, 0.2, 0.3], n_results=5)
# results["ids"][0] contains IDs for the query
# results["documents"][0] contains documents for the query
# results["distances"][0] contains distances for the query

# Multiple queries
results = collection.query(query_embeddings=[[0.1, 0.2], [0.3, 0.4]], n_results=5)
# results["ids"][0] contains IDs for first query
# results["ids"][1] contains IDs for second query
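
Because every field is a list of lists, it can be convenient to flatten a result into one record per hit. The sketch below assumes only the return shape described above; iter_hits is a hypothetical helper, and the sample dict is hand-written rather than real query output.

```python
from typing import Any, Dict, Iterator

# Plural result keys mapped to the singular names used in each flattened hit.
_FIELDS = {
    "distances": "distance",
    "documents": "document",
    "metadatas": "metadata",
    "embeddings": "embedding",
}


def iter_hits(results: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per hit, tagged with the index of its query."""
    for query_index, ids in enumerate(results["ids"]):
        for position, doc_id in enumerate(ids):
            hit = {"query": query_index, "id": doc_id}
            for plural, singular in _FIELDS.items():
                column = results.get(plural)
                if column is not None:  # field was not included in the result
                    hit[singular] = column[query_index][position]
            yield hit


# Hand-written sample in the documented shape (two queries, metadatas omitted).
sample = {
    "ids": [["a", "b"], ["c"]],
    "distances": [[0.1, 0.4], [0.2]],
    "documents": [["doc a", "doc b"], ["doc c"]],
    "metadatas": None,
}

for hit in iter_hits(sample):
    print(hit)
```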

Note: The embedding_function used is the one associated with the collection. You cannot override it per-query.

5.2 Get (Retrieve by IDs or Filters)

The get() method retrieves documents from a collection without vector similarity search. It supports filtering by IDs, metadata, and document content.

# Get by single ID
results = collection.get(ids="123")

# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])

# Get by metadata filter (simplified equality - both forms are supported)
results = collection.get(
    where={"category": "AI"},
    limit=10
)
# Or use explicit $eq operator:
# where={"category": {"$eq": "AI"}}

# Get by comparison operator
results = collection.get(
    where={"score": {"$gte": 90}},
    limit=10
)

# Get by $in operator
results = collection.get(
    where={"tag": {"$in": ["ml", "python"]}},
    limit=10
)

# Get by logical operators ($or) - simplified equality
results = collection.get(
    where={
        "$or": [
            {"category": "AI"},
            {"tag": "python"}
        ]
    },
    limit=10
)

# Get by document content filter
results = collection.get(
    where_document={"$contains": "machine learning"},
    limit=10
)

# Get with combined filters
results = collection.get(
    where={"category": {"$eq": "AI"}},
    where_document={"$contains": "machine"},
    limit=10
)

# Get with pagination
results = collection.get(limit=2, offset=1)

# Get with specific fields
results = collection.get(
    ids=["1", "2"],
    include=["documents", "metadatas", "embeddings"]
)

# Get all data (up to limit)
results = collection.get(limit=100)

Parameters:

ids (str or List[str], optional): Single ID or list of IDs to retrieve
where (dict, optional): Metadata filter conditions (see Filter Operators section)
where_document (dict, optional): Document content filter using $contains for full-text search
limit (int, optional): Maximum number of results to return
offset (int, optional): Number of results to skip for pagination
include (List[str], optional): List of fields to include: ["documents", "metadatas", "embeddings"]

Returns: Dict with keys (chromadb-compatible format):

ids: List[str] - List of IDs
documents: Optional[List[str]] - List of documents (if included)
metadatas: Optional[List[Dict]] - List of metadata dictionaries (if included)
embeddings: Optional[List[List[float]]] - List of embeddings (if included)

Usage:

# Get by single ID
results = collection.get(ids="123")
# results["ids"] contains ["123"]
# results["documents"] contains document for ID "123"

# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# results["ids"] contains ["1", "2", "3"]
# results["documents"] contains documents for all IDs

# Get by filter
results = collection.get(where={"category": {"$eq": "AI"}}, limit=10)
# results["ids"] contains all matching IDs
# results["documents"] contains all matching documents

Note: If no parameters are provided, get() returns all data (up to limit).
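
The limit and offset parameters can be combined into a simple paging loop. The sketch below is a minimal illustration under two assumptions: that get() returns items in a stable order across calls, and that an empty or short page marks the end. FakeCollection is a hypothetical in-memory stand-in so the example runs without a database; iter_pages is likewise a name introduced here.

```python
from typing import Any, Dict, Iterator, List, Optional


class FakeCollection:
    """Hypothetical stand-in that mimics get(limit=..., offset=...)."""

    def __init__(self, ids: List[str], documents: List[str]) -> None:
        self._ids = list(ids)
        self._documents = list(documents)

    def get(self, limit: Optional[int] = None, offset: int = 0) -> Dict[str, Any]:
        end = None if limit is None else offset + limit
        return {
            "ids": self._ids[offset:end],
            "documents": self._documents[offset:end],
        }


def iter_pages(collection: Any, page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield successive pages until an empty or short page is returned."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:
            return
        yield page
        if len(page["ids"]) < page_size:
            return
        offset += page_size


collection = FakeCollection(
    ids=["1", "2", "3", "4", "5"],
    documents=[f"doc {i}" for i in range(1, 6)],
)
page_sizes = [len(page["ids"]) for page in iter_pages(collection, page_size=2)]
print(page_sizes)  # [2, 2, 1]
```

With a real collection, the same loop should work if where or where_document filters are passed through to get(), as long as the filter stays the same for every page.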

5.3 Hybrid Search

collection.hybrid_search() runs full-text/scalar queries and vector KNN search in parallel, then fuses the results. Reciprocal Rank Fusion (RRF) is supported for the fusion step.

Parameters (dict mode)

query (dict or List[dict], optional): full-text/scalar routes
- where_document: $contains / $not_contains plus $and / $or combinations of those clauses
- where: metadata filters (see 5.4) including logical operators and #id
knn (dict or List[dict], optional): vector routes
- query_embeddings: List[float] or List[List[float]]; validated against collection.dimension when present
- query_texts: str or List[str]; auto-embedded with the collection's embedding_function (a ValueError is raised if the collection has none)
- where: metadata filters for this vector route
- n_results: candidates per vector route (k, default 10)
rank (dict, optional): ranking config; RRF has been tested with {"rrf": {...}} and with {}. Omit to use single-route ordering.
n_results (int): final fused result count (default 10).
include (List[str], optional): fields to return. ids/distances are always returned; documents/metadatas are returned by default when include is None; add "embeddings" to fetch vectors.
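
Since query_embeddings are validated against collection.dimension, a client-side check before building the knn dict can surface mismatches earlier. The sketch below is an assumption-laden illustration: check_dimension is a hypothetical helper, the dimension is passed in as a plain integer, and the choice of ValueError is ours rather than a documented behavior of the library.

```python
from numbers import Real
from typing import List, Sequence, Union

Vector = Sequence[float]


def check_dimension(
    query_embeddings: Union[Vector, Sequence[Vector]], dimension: int
) -> List[Vector]:
    """Return the embeddings as a batch; raise ValueError on a length mismatch."""
    # A flat list of numbers is treated as one embedding and wrapped in a batch.
    is_single = bool(query_embeddings) and isinstance(query_embeddings[0], Real)
    batch = [query_embeddings] if is_single else list(query_embeddings)
    for index, vector in enumerate(batch):
        if len(vector) != dimension:
            raise ValueError(
                f"query_embeddings[{index}] has {len(vector)} values, "
                f"expected {dimension}"
            )
    return batch


# Build a knn route only after the vectors pass the check.
knn = {"query_embeddings": check_dimension([0.1, 0.2, 0.3], dimension=3), "n_results": 8}
print(knn)
```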

Return format

Query-compatible dict: ids, distances, optionally documents / metadatas / embeddings. Hybrid search returns a single outer list (one fused result set).

Examples

# Full-text + vector with rank fusion (dict style)
results = collection.hybrid_search(
    query={
        "where_document": {"$contains": "machine learning"},
        "where": {"category": {"$eq": "science"}}
    },
    knn={
        "query_texts": ["AI research"],  # auto-embedded via collection.embedding_function
        "where": {"year": {"$gte": 2020}},
        "n_results": 10,  # k per vector route
    },
    rank={"rrf": {"rank_window_size": 60, "rank_constant": 60}},
    n_results=5,
    include=["documents", "metadatas", "embeddings"],
)

# Vector-only search using explicit embeddings (dimension is validated)
results = collection.hybrid_search(
    knn={"query_embeddings": [[0.1, 0.2, 0.3]], "n_results": 8},
    n_results=5,
    include=["documents", "metadatas"],
)
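
To make the fusion step more concrete, here is a minimal sketch of Reciprocal Rank Fusion in its common textbook form, where each route contributes 1 / (rank_constant + rank) for every document it returns. The parameter names are borrowed from the rank example above, but their interpretation here (rank_constant as the additive constant, rank_window_size as the number of candidates considered per route) is an assumption, and rrf_fuse is a hypothetical function, not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    ranked_lists: Sequence[Sequence[str]],
    rank_constant: int = 60,
    rank_window_size: int = 60,
    n_results: int = 10,
) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists into one list of (id, score) pairs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Only the top rank_window_size candidates of each route contribute.
        for rank, doc_id in enumerate(ranking[:rank_window_size], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest score first; ties are broken by ID to keep the output deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:n_results]


full_text_hits = ["a", "b", "c"]  # IDs from a full-text route, best first
knn_hits = ["b", "d", "a"]        # IDs from a vector route, best first

fused = rrf_fuse([full_text_hits, knn_hits], n_results=4)
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that appear in both routes ("a" and "b") accumulate two contributions and rise above documents found by only one route, which is the main appeal of rank-based fusion: it needs no score normalization across routes.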

5.4 Filter Operators

Metadata Filters (`where` parameter)

$eq (or direct equality) / $ne / $gt / $gte / $lt / $lte
$in / $nin for membership checks
$or / $and for logical composition
$not for negation
#id to filter by primary key (e.g., {"#id": {"$in": ["id1", "id2"]}})
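
The way these operators compose can be illustrated with a small in-memory evaluator. This is a sketch of the intended semantics only, not how the library evaluates filters: matches is a hypothetical function, $not and #id are left out, several keys in one dict are treated as an implicit AND, and the handling of missing fields is an assumption.

```python
import operator
from typing import Any, Callable, Dict

# Comparison and membership operators applied to a single metadata value.
_OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": lambda value, operand: value is not None and value > operand,
    "$gte": lambda value, operand: value is not None and value >= operand,
    "$lt": lambda value, operand: value is not None and value < operand,
    "$lte": lambda value, operand: value is not None and value <= operand,
    "$in": lambda value, operand: value in operand,
    "$nin": lambda value, operand: value not in operand,
}


def _match_field(value: Any, condition: Any) -> bool:
    """Check one field against a literal (direct equality) or an operator dict."""
    if not isinstance(condition, dict):
        return value == condition
    return all(_OPS[op](value, operand) for op, operand in condition.items())


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if the metadata dict satisfies the where filter."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(metadata, clause) for clause in condition)
        else:
            ok = _match_field(metadata.get(key), condition)
        if not ok:
            return False
    return True


item = {"category": "AI", "score": 95, "tag": "ml"}
print(matches(item, {"category": "AI", "score": {"$gte": 90}}))
print(matches(item, {"$or": [{"category": "DB"}, {"tag": {"$in": ["ml", "python"]}}]}))
```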

Document Filters (`where_document` parameter)

$contains: full-text match
$not_contains: exclude matches
$or / $and combining multiple $contains clauses

5.5 Collection Information Methods

# Get item count
count = collection.count()
print(f"Collection has {count} items")

# Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
for i in range(len(preview["ids"])):
    print(f"ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
    print(f"Metadata: {preview['metadatas'][i]}, Embedding: {preview['embeddings'][i]}")

# Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")

Methods:

collection.count() - Get the number of items in the collection
collection.peek(limit=10) - Quickly preview the first few items in the collection
client.count_collection() - Count the number of collections in the current database