1. Paradigm Shift: The History of Multimodal Retrieval
In enterprise information retrieval, the shift from fragmented, modality-specific pipelines to a single multimodal setup is a major operational change. Before this release, searching across text, images, and audio meant building and maintaining three separate pipelines with three different models that never quite stayed in sync. This pipeline complexity has been the primary bottleneck for production-grade retrieval and a major operational challenge, with data sometimes lost during intermediate transformations.
The gemini-embedding-2-preview model addresses these architectural challenges as the first fully multimodal model in the Gemini family. By projecting text, images, video, audio, and PDFs into a single, unified vector space, it eliminates the need for multiple embedding services. It also enables native cross-modal comparison: the mathematical representation of a video segment can be compared directly to a text query or a technical document, which makes the pipeline more efficient and reduces total cost of ownership.
Supported Modalities and Input Limits
Architects must design within the following per-request constraints to ensure system reliability:
| Modality | Specifications and Limits | Supported Formats/Codecs |
|---|---|---|
| Text | Up to 8,192 tokens | N/A |
| Image | Max 6 images per request | PNG, JPEG |
| Audio | Max 80 seconds duration | MP3, WAV |
| Video | Max 128 seconds duration | MP4, MOV (Codecs: H264, H265, AV1, VP9) |
| PDF | Max 6 pages (processes visual and text content) | Standard PDF |
Note: The global maximum input limit across all modalities is 8,192 tokens per request.
By consolidating these inputs into a single embedding call, organizations can replace complex multi-model orchestration with a unified semantic infrastructure.
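These limits can also be enforced client-side before calling the API. The following is a minimal illustrative sketch; the LIMITS dictionary and the within_limits helper are our own names, not part of any SDK.

# Per-request limits from the table above (illustrative helper, not part of the SDK)
LIMITS = {
    "text_tokens": 8192,
    "images": 6,
    "audio_seconds": 80,
    "video_seconds": 128,
    "pdf_pages": 6,
}

def within_limits(n_images=0, audio_s=0.0, video_s=0.0, pdf_pages=0):
    # True only if every modality stays inside its documented limit
    return (n_images <= LIMITS["images"]
            and audio_s <= LIMITS["audio_seconds"]
            and video_s <= LIMITS["video_seconds"]
            and pdf_pages <= LIMITS["pdf_pages"])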
2. Strategic Advantage: The Unified Vector Space
A unified vector space is, in effect, a universal translator for your data. Instead of spending time and compute turning images into text descriptions (which often loses detail), Gemini 2 understands everything in one pass. You can finally search for a specific video clip with a text prompt without a middleman to describe the scene first (middleman meaning intermediate translation layers such as captioning or OCR). The following call embeds a caption and an image together into a single vector:
# Text + image → single embedding, no OCR needed
result = client.models.embed_content(
model=MODEL_ID,
contents=[types.Content(parts=[
types.Part(text=caption),
types.Part.from_bytes(data=img_bytes, mime_type="image/png"),
])],
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)
Data Integrity Through Native Multimodality
Native multimodality means you work with the pure version of your data instead of relying on intermediate steps. Rather than turning a voice recording into text and losing the caller's tone, the model processes the entire signal directly across modalities. It picks up on things like the layout of a diagram or the emotion in a voice, which classic transcription misses entirely.
For an architect, this means higher data integrity and reduced latency in the ingestion pipeline.
Interaction Patterns in Production
This architecture enables three primary cross-modal interaction patterns:
- Text-to-Video: direct semantic retrieval of specific timestamps within a video library without requiring manual tagging.
- Image-to-Document: uploading a photo of a technical failure to retrieve relevant sections of a PDF maintenance manual.
- Multimodal Search: combining a text description with an image (e.g., “Find a part like this, but made of carbon fiber or metal”) to execute highly specific queries across a unified index, as sketched below.
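A minimal sketch of the third pattern, assuming photo_bytes holds a JPEG of the reference part (the variable name is illustrative):

# Multimodal query: text + image embedded together as one query vector.
# photo_bytes is a hypothetical image of the part being matched.
query_vec = client.models.embed_content(
    model=MODEL_ID,
    contents=[types.Content(parts=[
        types.Part(text="a part like this, but made of carbon fiber or metal"),
        types.Part.from_bytes(data=photo_bytes, mime_type="image/jpeg"),
    ])],
    config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
).embeddings[0].values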
While a unified space simplifies the pipeline, the high dimensionality (3072) requires a vector database capable of managing storage and compute at scale.
3. Matryoshka Representation Learning (MRL)
Staying accurate without overspending on server RAM is a constant battle in large-scale deployments. Gemini 2 uses MRL, which works like a Russian nesting doll: the model is trained to concentrate the most significant semantic information in the initial segments of the vector. You can therefore truncate the vector to save space and it remains functional, yielding roughly a 4x reduction in storage costs while losing only a tiny fraction of accuracy.
Performance vs. Storage Trade-offs
You always have room to tune how much space your data takes up: Gemini 2 lets you choose vector sizes from 128 up to 3072. If you are running a massive operation and want to save memory, dropping to 768 dimensions is usually the sweet spot (as Google notes): you cut storage costs by 4x while losing only about 0.25% in retrieval quality, which provides a massive ROI for enterprise systems. Just remember that if you go for a smaller size, you will need to normalize the vectors manually to keep your search results sharp.
- 768 or 1536 Dimensions: Ideal for high-volume storage. Source benchmarks show that a 768-dimensional vector achieves an MTEB score of 67.99, remarkably close to the 68.16 score of the full 3072-dimensional vector.
- 3072 Dimensions (Default): Recommended when maximum precision is the priority and storage overhead is not a concern.
Normalization Requirements
Accurate semantic similarity depends on vector direction rather than magnitude:
- Default (3072): Output is automatically normalized by the Gemini API.
- Truncated Dimensions: When manually truncating to 768 or 1536, vectors must be manually L2-normalized to maintain similarity accuracy, as in the sketch below.
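A minimal sketch of the truncate-then-normalize step; the helper name truncate_and_normalize is illustrative:

import numpy as np

def truncate_and_normalize(vector, dim=768):
    # Keep the MRL prefix of the vector, then rescale to unit length (L2)
    # so cosine similarity depends only on direction
    v = np.asarray(vector[:dim], dtype=np.float32)
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()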
This flexible dimensionality allows for advanced retrieval strategies when paired with a high-performance vector database.
4. Infrastructure: Integration with Qdrant Vector Database
Qdrant is a natural fit here because it lets you store all your data types in one place, thanks to its native support for unified collections and multi-vector strategies. You no longer have to build separate stores for text and video: all modalities can coexist in a single collection that supports direct cross-modal queries.
Using a two-pass strategy, you can run a lightning-fast search with short vectors first and then use the full-sized vectors to polish the final results (as discussed in our previous blog post on MRL).
Named Vectors and Hybrid Search
Qdrant has a feature called Named Vectors that lets you store several vector representations of the same item. This is perfect for hybrid search: one vector captures the deep meaning of a query while another matches exact keywords like part numbers or technical jargon. It is like giving your search engine both a brain for context and an eye for detail.
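A minimal sketch of such a collection, using the qdrant_client instance from the setup in section 5; the vector names "dense", "dense_small", and "bm25" are our own labels:

from qdrant_client.models import SparseVectorParams

# One collection, several named vectors per point:
# full-precision dense, truncated dense, and a sparse keyword vector
qdrant_client.create_collection(
    collection_name="hybrid_docs",
    vectors_config={
        "dense": VectorParams(size=3072, distance=Distance.COSINE),
        "dense_small": VectorParams(size=768, distance=Distance.COSINE),
    },
    sparse_vectors_config={"bm25": SparseVectorParams()},
)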
Two-Pass Retrieval Strategy
Leveraging MRL, architects can implement a highly optimized Two-Pass Search Pipeline:
- Candidate Retrieval: Perform a rapid search using lower-dimensional vectors (e.g., 768 dimensions) to filter the top 100-200 candidates. This minimizes RAM usage and latency during the initial broad search.
- Full-Precision Rescoring: Use the full 3072-dimensional embeddings to rescore the small candidate set, ensuring maximum precision for the final top-K results.
This approach significantly reduces the compute pressure on the vector database while maintaining the accuracy of a full-dimensional search.
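A sketch of this pipeline using Qdrant's Query API with prefetch, assuming the named vectors from the collection above and a query embedded at both sizes (query_small and query_full are illustrative variables):

from qdrant_client.models import Prefetch

# Pass 1: broad candidate retrieval with the truncated 768-d vector.
# Pass 2: rescore those candidates with the full 3072-d vector.
results = qdrant_client.query_points(
    collection_name="hybrid_docs",
    prefetch=Prefetch(query=query_small, using="dense_small", limit=200),
    query=query_full,
    using="dense",
    limit=10,
).points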
5. Gemini 2 and Qdrant Integration
The following implementation utilizes the google-genai SDK and qdrant-client to build a production-ready multimodal pipeline.
System Setup and Normalization
We include a helper function for L2 normalization, which is critical for those choosing to use MRL to reduce dimensionality.
import numpy as np
from google import genai
from google.genai import types
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
# Initialize Clients
client = genai.Client(api_key="YOUR_API_KEY")
MODEL_ID = "gemini-embedding-2-preview"
qdrant_client = QdrantClient(url="http://localhost:6333")
def normalize_vector(vector):
    # Convert to a NumPy array first so element-wise division works,
    # then rescale to unit length (L2)
    v = np.asarray(vector, dtype=np.float32)
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()
Multimodal Ingestion and Collection Setup
We create the Qdrant collection using cosine similarity, render each PDF page to an image, and embed a page label together with the page image into one aggregated vector before upserting it (the collection name "docs" is illustrative).
import io
from pdf2image import convert_from_path

# Create the collection once (cosine distance, full 3072 dimensions)
qdrant_client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)

pages = convert_from_path("paper.pdf", dpi=150)  # render each page as an image
for i, page in enumerate(pages):
    buf = io.BytesIO()
    page.save(buf, format="JPEG")
    img_bytes = buf.getvalue()
    v = client.models.embed_content(  # page label + page image in one call
        model=MODEL_ID,
        contents=[types.Content(parts=[
            types.Part(text=f"page {i+1}"),
            types.Part.from_bytes(data=img_bytes, mime_type="image/jpeg"),
        ])],
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    ).embeddings[0].values
    qdrant_client.upsert(  # store the page embedding with its page number
        collection_name="docs",
        points=[PointStruct(id=i, vector=v, payload={"page": i + 1})],
    )
Multimodal Search Query
This example demonstrates cross-modal retrieval: images are indexed from raw bytes alone, with no captions, and then retrieved with a plain-text query. The embed_parts helper below wraps the embedding call used throughout this section.
# Helper: embed a list of Parts and return the raw vector
def embed_parts(parts, task_type="RETRIEVAL_DOCUMENT"):
    return client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Content(parts=parts)],
        config=types.EmbedContentConfig(task_type=task_type),
    ).embeddings[0].values

# Index with no caption: pure image bytes only
v = embed_parts([
    types.Part.from_bytes(data=img_bytes, mime_type="image/png")
])
# (upsert v into an "image_only" collection, as shown for "docs" above)

# Query with text only: no image in the query
hits = qdrant_client.query_points(
    collection_name="image_only",
    query=embed_parts([types.Part(text=q)], task_type="RETRIEVAL_QUERY"),
    limit=1,
).points
6. Optimization: Task-Specific Tuning and Performance
For production-grade accuracy, embeddings must be optimized for their specific intent using the task_type parameter. This informs the model of the relationship between vectors (e.g., query vs. document).
Supported Task Types (see references below)
| Task Type | Description | Optimized Use Case |
|---|---|---|
| RETRIEVAL_QUERY | Optimized for short search queries. | User-facing search inputs. |
| RETRIEVAL_DOCUMENT | Optimized for items to be retrieved. | Indexing PDFs, images, and video chunks. |
| CODE_RETRIEVAL_QUERY | Optimized for code block retrieval. | Technical RAG and code assistants. |
| SEMANTIC_SIMILARITY | Measures conceptual closeness. | Recommendation engines, duplicate detection. |
| CLASSIFICATION | Optimized for label assignment. | Sentiment analysis or auto-tagging. |
| QUESTION_ANSWERING | Optimized for Q&A system queries. | FAQ bots and automated support. |
| FACT_VERIFICATION | Optimized for evidence retrieval. | Automated fact-checking systems. |
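The same embed_content pattern extends to audio chunks: below, a time-range label is embedded together with the raw bytes of the segment (start, end, and audio_bytes come from your own chunking step).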
# same API call works for audio chunks
v = client.models.embed_content(
model=MODEL_ID,
contents=[types.Content(parts=[
types.Part(text=f"segment {start:.0f}s–{end:.0f}s"),
types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
])],
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
).embeddings[0].values
Critical Migration Note
A note on compatibility: you cannot mix old embeddings with the new ones. Gemini 2 occupies a completely different vector space than gemini-embedding-001 because it is built for multimodality. If you are making the switch, you will need to re-index your entire database from scratch to keep everything working as expected.
Summary
Integrating Gemini 2's multimodal MRL embeddings with Qdrant's scalable Rust-based infrastructure is arguably the smartest way to build a modern search engine. By using Matryoshka learning to save on storage costs and ditching the brittle OCR/transcription pipelines of the past, you end up with a system that is easier to manage, faster, cheaper to run, and far better at finding what your users are looking for across text, video, and audio.
References:
- Google embeddings documentation
- MRL (previous blog post)
- Qdrant documentation
- Medium article
- GitHub repo with code samples