5. DQL Operations
DQL (Data Query Language) operations allow you to retrieve data from collections using various query methods.
5.1 Query (Vector Similarity Search)
The query() method performs vector similarity search to find the most similar documents to the query vector(s).
Behavior with Embedding Function:
- If query_embeddings are provided: the embeddings are used directly and embedding_function is NOT called.
- If query_embeddings are NOT provided but query_texts are provided:
  - If the collection has an embedding_function, it automatically generates query embeddings from the texts.
  - If the collection does NOT have an embedding_function, a ValueError is raised.
- If neither query_embeddings nor query_texts are provided: a ValueError is raised.
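The decision rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: resolve_query_embeddings and toy_embed are hypothetical names introduced here, and the error messages are made up.

```python
from typing import Callable, List, Optional, Union

Vector = List[float]
EmbeddingFn = Callable[[List[str]], List[Vector]]


def resolve_query_embeddings(
    query_embeddings: Optional[Union[Vector, List[Vector]]] = None,
    query_texts: Optional[Union[str, List[str]]] = None,
    embedding_function: Optional[EmbeddingFn] = None,
) -> List[Vector]:
    """Hypothetical helper that mirrors the three documented rules."""
    if query_embeddings is not None:
        # Rule 1: explicit embeddings win; embedding_function is never called.
        if query_embeddings and isinstance(query_embeddings[0], (int, float)):
            return [list(query_embeddings)]  # single vector -> batch of one
        return [list(vector) for vector in query_embeddings]
    if query_texts is not None:
        texts = [query_texts] if isinstance(query_texts, str) else list(query_texts)
        if embedding_function is None:
            # Rule 2b: texts were given but there is nothing to embed them with.
            raise ValueError("query_texts requires an embedding_function")
        # Rule 2a: embed the texts with the collection's function.
        return embedding_function(texts)
    # Rule 3: neither input was supplied.
    raise ValueError("provide query_embeddings or query_texts")


def toy_embed(texts: List[str]) -> List[Vector]:
    """Toy stand-in for a real embedding function (assumption for the demo)."""
    return [[float(len(text)), 0.0, 1.0] for text in texts]


from_vector = resolve_query_embeddings(query_embeddings=[1.0, 2.0, 3.0])
from_text = resolve_query_embeddings(query_texts="abc", embedding_function=toy_embed)
```

Normalizing a single vector or a single string into a batch of one keeps the rest of the pipeline uniform, which matches the list-of-lists result format described later in this section.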
# Basic vector similarity query (embedding_function not used)
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
n_results=3
)
# Iterate over results
for i in range(len(results["ids"][0])):
    print(f"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}")
    if results.get("documents"):
        print(f"Document: {results['documents'][0][i]}")
    if results.get("metadatas"):
        print(f"Metadata: {results['metadatas'][0][i]}")
# Query by texts - embeddings auto-generated by embedding_function
# Requires: collection must have embedding_function set
results = collection.query(
query_texts=["my query text"],
n_results=10
)
# The collection's embedding_function will automatically convert query_texts to query_embeddings
# Query by multiple texts (batch query)
results = collection.query(
query_texts=["query text 1", "query text 2"],
n_results=5
)
# Returns dict with lists of lists, one list per query text
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")
# Query with metadata filter (using query_texts)
results = collection.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}},
n_results=5
)
# Query with comparison operator (using query_texts)
results = collection.query(
query_texts=["machine learning"],
where={"score": {"$gte": 90}},
n_results=5
)
# Query with document filter (using query_texts)
results = collection.query(
query_texts=["neural networks"],
where_document={"$contains": "machine learning"},
n_results=5
)
# Query with combined filters (using query_texts)
results = collection.query(
query_texts=["AI research"],
where={"category": {"$eq": "AI"}, "score": {"$gte": 90}},
where_document={"$contains": "machine"},
n_results=5
)
# Query with multiple embeddings (batch query)
results = collection.query(
query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
n_results=2
)
# Returns dict with lists of lists, one list per query embedding
for i in range(len(results["ids"])):
    print(f"Query {i}: {len(results['ids'][i])} results")
# Query with specific fields
results = collection.query(
query_embeddings=[1.0, 2.0, 3.0],
include=["documents", "metadatas", "embeddings"],
n_results=3
)
Parameters:
- query_embeddings (List[float] or List[List[float]], optional): Single embedding, or a list of embeddings for batch queries.
  - If provided, used directly (embedding_function is ignored).
  - If not provided, query_texts must be provided and the collection must have an embedding_function.
- query_texts (str or List[str], optional): Query text(s) to be embedded.
  - If query_embeddings is not provided, query_texts are converted to embeddings using the collection’s embedding_function.
- n_results (int, optional): Number of similar results to return (default: 10).
- where (dict, optional): Metadata filter conditions (see Filter Operators section).
- where_document (dict, optional): Document content filter.
- include (List[str], optional): List of fields to include: ["documents", "metadatas", "embeddings"].
Returns: Dict with keys (chromadb-compatible format):
- ids: List[List[str]] - List of ID lists, one list per query
- documents: Optional[List[List[str]]] - List of document lists, one list per query (if included)
- metadatas: Optional[List[List[Dict]]] - List of metadata lists, one list per query (if included)
- embeddings: Optional[List[List[List[float]]]] - List of embedding lists, one list per query (if included)
- distances: Optional[List[List[float]]] - List of distance lists, one list per query
Usage:
# Single query
results = collection.query(query_embeddings=[0.1, 0.2, 0.3], n_results=5)
# results["ids"][0] contains IDs for the query
# results["documents"][0] contains documents for the query
# results["distances"][0] contains distances for the query
# Multiple queries
results = collection.query(query_embeddings=[[0.1, 0.2], [0.3, 0.4]], n_results=5)
# results["ids"][0] contains IDs for first query
# results["ids"][1] contains IDs for second query
Note: The embedding_function used is the one associated with the collection. You cannot override it per-query.
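Because query() returns column-oriented lists of lists, it is often convenient to regroup them into one record per hit. The helper below is a minimal sketch that works on any dict shaped like the documented return value; rows_per_query and the sample data are hypothetical and not part of the library.

```python
from typing import Any, Dict, List


def rows_per_query(results: Dict[str, Any]) -> List[List[Dict[str, Any]]]:
    """Regroup a query() result into one list of records per query."""
    optional_keys = ("documents", "metadatas", "embeddings", "distances")
    all_rows = []
    for q, ids in enumerate(results["ids"]):
        rows = []
        for i, item_id in enumerate(ids):
            row = {"id": item_id}
            for key in optional_keys:
                column = results.get(key)
                if column:  # the key may be absent or None when not included
                    row[key[:-1]] = column[q][i]  # "documents" -> "document"
            rows.append(row)
        all_rows.append(rows)
    return all_rows


# Hand-written sample in the documented shape (two queries, no metadatas).
sample = {
    "ids": [["a", "b"], ["c"]],
    "documents": [["doc a", "doc b"], ["doc c"]],
    "metadatas": None,
    "distances": [[0.1, 0.4], [0.2]],
}
rows = rows_per_query(sample)
```

Here rows[0] holds the hits for the first query and rows[1] the hits for the second, so downstream code no longer needs to index every column with the same pair of positions.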
5.2 Get (Retrieve by IDs or Filters)
The get() method retrieves documents from a collection without vector similarity search. It supports filtering by IDs, metadata, and document content.
# Get by single ID
results = collection.get(ids="123")
# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# Get by metadata filter (simplified equality - both forms are supported)
results = collection.get(
where={"category": "AI"},
limit=10
)
# Or use explicit $eq operator:
# where={"category": {"$eq": "AI"}}
# Get by comparison operator
results = collection.get(
where={"score": {"$gte": 90}},
limit=10
)
# Get by $in operator
results = collection.get(
where={"tag": {"$in": ["ml", "python"]}},
limit=10
)
# Get by logical operators ($or) - simplified equality
results = collection.get(
where={
"$or": [
{"category": "AI"},
{"tag": "python"}
]
},
limit=10
)
# Get by document content filter
results = collection.get(
where_document={"$contains": "machine learning"},
limit=10
)
# Get with combined filters
results = collection.get(
where={"category": {"$eq": "AI"}},
where_document={"$contains": "machine"},
limit=10
)
# Get with pagination
results = collection.get(limit=2, offset=1)
# Get with specific fields
results = collection.get(
ids=["1", "2"],
include=["documents", "metadatas", "embeddings"]
)
# Get all data (up to limit)
results = collection.get(limit=100)
Parameters:
- ids (str or List[str], optional): Single ID or list of IDs to retrieve
- where (dict, optional): Metadata filter conditions (see Filter Operators section)
- where_document (dict, optional): Document content filter using $contains for full-text search
- limit (int, optional): Maximum number of results to return
- offset (int, optional): Number of results to skip for pagination
- include (List[str], optional): List of fields to include: ["documents", "metadatas", "embeddings"]
Returns: Dict with keys (chromadb-compatible format):
- ids: List[str] - List of IDs
- documents: Optional[List[str]] - List of documents (if included)
- metadatas: Optional[List[Dict]] - List of metadata dictionaries (if included)
- embeddings: Optional[List[List[float]]] - List of embeddings (if included)
Usage:
# Get by single ID
results = collection.get(ids="123")
# results["ids"] contains ["123"]
# results["documents"] contains document for ID "123"
# Get by multiple IDs
results = collection.get(ids=["1", "2", "3"])
# results["ids"] contains ["1", "2", "3"]
# results["documents"] contains documents for all IDs
# Get by filter
results = collection.get(where={"category": {"$eq": "AI"}}, limit=10)
# results["ids"] contains all matching IDs
# results["documents"] contains all matching documents
Note: If no parameters are provided, get() returns all data (up to limit).
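The limit and offset parameters can be combined to walk a collection page by page. The sketch below assumes that get() returns rows in a stable order across calls, which this section does not state. iter_pages and FakeCollection are hypothetical; the fake class exists only so the example runs without a database.

```python
from typing import Any, Callable, Dict, Iterator


def iter_pages(
    get: Callable[..., Dict[str, Any]], page_size: int, **filters: Any
) -> Iterator[Dict[str, Any]]:
    """Yield successive get() results until a short or empty page is seen."""
    offset = 0
    while True:
        page = get(limit=page_size, offset=offset, **filters)
        if not page["ids"]:
            break
        yield page
        if len(page["ids"]) < page_size:
            break  # last page reached
        offset += page_size


class FakeCollection:
    """In-memory stand-in for demonstration only; not the real collection class."""

    def __init__(self, ids):
        self._ids = list(ids)

    def get(self, limit=None, offset=0, **_ignored):
        end = None if limit is None else offset + limit
        return {"ids": self._ids[offset:end]}


fake = FakeCollection(["1", "2", "3", "4", "5"])
pages = [page["ids"] for page in iter_pages(fake.get, page_size=2)]
```

With a real collection the call would look like iter_pages(collection.get, page_size=100, where={"category": "AI"}), with the filter keywords passed through unchanged.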
5.3 Hybrid Search
collection.hybrid_search() runs full-text/scalar queries and vector KNN search in parallel, then fuses the results (RRF is supported).
Parameters (dict mode):
- query (dict or List[dict], optional): full-text/scalar routes
  - where_document: $contains / $not_contains, plus $and / $or combinations of those clauses
  - where: metadata filters (see 5.4), including logical operators and #id
- knn (dict or List[dict], optional): vector routes
  - query_embeddings: List[float] or List[List[float]]; validated against collection.dimension when present
  - query_texts: str or List[str]; auto-embedded with the collection’s embedding_function (a missing function raises ValueError)
  - where: metadata filters for this vector route
  - n_results: candidates per vector route (k, default 10)
- rank (dict, optional): ranking config; RRF tested via {"rrf": {...}} or {}. Omit to use single-route ordering.
- n_results (int): final fused result count (default 10).
- include (List[str], optional): fields to return. ids/distances are always returned; documents/metadatas are returned by default when include is None; add "embeddings" to fetch vectors.
Return format
Query-compatible dict: ids, distances, and optionally documents/metadatas/embeddings. Hybrid search returns a single outer list (one fused result set).
Examples
# Full-text + vector with rank fusion (dict style)
results = collection.hybrid_search(
query={
"where_document": {"$contains": "machine learning"},
"where": {"category": {"$eq": "science"}}
},
knn={
"query_texts": ["AI research"], # auto-embedded via collection.embedding_function
"where": {"year": {"$gte": 2020}},
"n_results": 10, # k per vector route
},
rank={"rrf": {"rank_window_size": 60, "rank_constant": 60}},
n_results=5,
include=["documents", "metadatas", "embeddings"],
)
# Vector-only search using explicit embeddings (dimension is validated)
results = collection.hybrid_search(
knn={"query_embeddings": [[0.1, 0.2, 0.3]], "n_results": 8},
n_results=5,
include=["documents", "metadatas"],
)
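For readers unfamiliar with RRF (Reciprocal Rank Fusion), the sketch below shows the standard formula, in which each document scores the sum of 1 / (rank_constant + rank) over the routes that returned it. It is a generic illustration only. How hybrid_search() computes fusion internally, and whether its rank_constant and rank_window_size options map exactly onto this formula, is an assumption here. rrf_fuse and the sample rankings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]],
    rank_constant: int = 60,
    n_results: int = 10,
) -> List[str]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the nearer it is to the top of a route.
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:n_results]


full_text_route = ["a", "b", "c"]  # IDs ranked by the full-text/scalar route
vector_route = ["c", "a", "d"]     # IDs ranked by the vector KNN route
fused = rrf_fuse([full_text_route, vector_route], rank_constant=60, n_results=3)
```

Documents that appear in both routes ("a" and "c") rise above documents found by only one route, which is the main reason to fuse rankings rather than concatenate them.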
5.4 Filter Operators
Metadata Filters (where parameter)
- $eq (or direct equality) / $ne / $gt / $gte / $lt / $lte
- $in / $nin for membership checks
- $or / $and for logical composition
- $not for negation
- #id to filter by primary key (e.g., {"#id": {"$in": ["id1", "id2"]}})
Document Filters (where_document parameter)
- $contains: full-text match
- $not_contains: exclude matches
- $or / $and combining multiple $contains clauses
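To make the semantics of the metadata operators concrete, here is a small pure-Python evaluator that checks one metadata dict against a where filter. It is a sketch under several assumptions that this section does not confirm: multiple top-level keys are combined with AND, $not wraps a single sub-filter, and a missing field is treated as None. It does not handle #id or where_document, and it says nothing about how the database itself evaluates filters. The names matches and _COMPARATORS are hypothetical.

```python
from typing import Any, Callable, Dict

_COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value is not None and value > target,
    "$gte": lambda value, target: value is not None and value >= target,
    "$lt": lambda value, target: value is not None and value < target,
    "$lte": lambda value, target: value is not None and value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if metadata satisfies every clause in where."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, clause) for clause in condition)
        elif key == "$or":
            ok = any(matches(metadata, clause) for clause in condition)
        elif key == "$not":
            ok = not matches(metadata, condition)  # assumed shape: {"$not": {...}}
        elif isinstance(condition, dict):
            value = metadata.get(key)
            ok = all(
                _COMPARATORS[op](value, target) for op, target in condition.items()
            )
        else:
            ok = metadata.get(key) == condition  # direct equality shorthand
        if not ok:
            return False
    return True


doc = {"category": "AI", "score": 95, "tag": "ml"}
is_high_scoring_ai = matches(doc, {"category": "AI", "score": {"$gte": 90}})
```

The same function can pre-check a filter against sample metadata before sending it to get() or query(), for example matches(doc, {"$or": [{"category": "DB"}, {"tag": {"$in": ["ml", "python"]}}]}).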
5.5 Collection Information Methods
# Get item count
count = collection.count()
print(f"Collection has {count} items")
# Preview first few items in collection (returns all columns by default)
preview = collection.peek(limit=5)
for i in range(len(preview["ids"])):
    print(f"ID: {preview['ids'][i]}, Document: {preview['documents'][i]}")
    print(f"Metadata: {preview['metadatas'][i]}, Embedding: {preview['embeddings'][i]}")
# Count collections in database
collection_count = client.count_collection()
print(f"Database has {collection_count} collections")
Methods:
- collection.count() - Get the number of items in the collection
- collection.peek(limit=10) - Quickly preview the first few items in the collection
- client.count_collection() - Count the number of collections in the current database