Mohamed Arbi Nsibi
One Model to Find Them All: Multimodal Search with Gemini 2 and Qdrant

March 15, 2026 · 9 min
Build a multimodal search stack with Gemini 2 embeddings and Qdrant, using Matryoshka Representation Learning to reduce storage costs.

1. Paradigm Shift: The History of Multimodal Retrieval

In enterprise information retrieval, the shift from fragmented, per-modality pipelines to a single multimodal setup is a major operational win. Before this release, if you wanted to search text, images, and audio, you had to build and maintain three different pipelines with three different models that never quite stayed in sync. This "pipeline complexity" has been the primary bottleneck for production-grade retrieval and a major operational challenge, with data sometimes lost during intermediate transformations.

The gemini-embedding-2-preview model addresses these architectural challenges as the first fully multimodal model in the Gemini family. By projecting text, images, video, audio, and PDFs into a single, unified vector space, it eliminates the need for multiple embedding services. This enables native cross-modal comparison: the mathematical representation of a video segment can be compared directly against a text query or a technical document, which makes the process more efficient and reduces total cost of ownership.

Supported Modalities and Input Limits

Architects must design within the following per-request technical constraints to ensure system reliability:

| Modality | Specifications and Limits | Supported Formats/Codecs |
| --- | --- | --- |
| Text | Up to 8,192 tokens | N/A |
| Image | Max 6 images per request | PNG, JPEG |
| Audio | Max 80 seconds duration | MP3, WAV |
| Video | Max 128 seconds duration | MP4, MOV (codecs: H264, H265, AV1, VP9) |
| PDF | Max 6 pages (processes visual & text content) | Standard PDF |

Note: The global maximum input limit across all modalities is 8,192 tokens per request.

By consolidating these inputs into a single embedding call, organizations can replace complex multi-model orchestration with a unified semantic infrastructure.
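As a pre-flight guard, the limits from the table above can be encoded in a small check that runs before each embed request. A minimal sketch; the `LIMITS` dict and `within_limits` helper are illustrative names of ours, not part of any SDK:

```python
# Per-request input limits from the table above (the 8,192-token limit is global).
LIMITS = {
    "text_tokens": 8192,
    "images": 6,
    "audio_seconds": 80,
    "video_seconds": 128,
    "pdf_pages": 6,
}

def within_limits(*, images=0, audio_seconds=0.0, video_seconds=0.0, pdf_pages=0):
    """Cheap client-side check before sending a multimodal embed request."""
    return (
        images <= LIMITS["images"]
        and audio_seconds <= LIMITS["audio_seconds"]
        and video_seconds <= LIMITS["video_seconds"]
        and pdf_pages <= LIMITS["pdf_pages"]
    )
```

Rejecting an oversized request locally is much cheaper than discovering the violation as an API error in production.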

2. Strategic Advantage: The Unified Vector Space

A unified vector space is essentially a universal translator for your data. Instead of spending time and compute turning images into text descriptions (which often loses detail), Gemini 2 understands everything in one pass: you can finally search for a specific video clip with a text prompt, without a middleman describing the scene first (middleman meaning an intermediate translation layer such as captioning or OCR).

# Text + image → single embedding, no OCR needed
result = client.models.embed_content(
    model=MODEL_ID,
    contents=[types.Content(parts=[
        types.Part(text=caption),
        types.Part.from_bytes(data=img_bytes, mime_type="image/png"),
    ])],
    config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
)

Data Integrity through Native Multimodality

Native multimodality means you get the pure version of your data instead of relying on intermediate steps, like turning a voice recording into a transcript and losing the caller's tone. The model processes the entire signal directly across modalities, so it picks up on things that classic transcription misses entirely, such as the layout of a diagram or the emotion in a voice.

For an architect, this means higher data integrity and reduced latency in the ingestion pipeline.

Interaction Patterns in Production

This architecture enables the cross-modal interaction patterns used throughout this post: text queries against image or video indexes, combined text-plus-image inputs, and document-style retrieval over PDFs and audio segments.

While a unified space simplifies the pipeline, the high dimensionality (3072) requires a vector database capable of managing storage and compute at scale.

3. Matryoshka Representation Learning (MRL)

Staying accurate without spending a fortune on server RAM is a constant battle in large-scale deployments. Gemini 2 uses MRL, which works like a Russian nesting doll: the model is trained to concentrate the most significant semantic information in the initial segments of the vector. You can therefore truncate the vector to save space and it remains functional, giving roughly a 4x reduction in storage costs while losing only a tiny fraction of accuracy.

Performance vs. Storage Trade-offs

You always have some room to trade space for quality: Gemini 2 lets you choose vector sizes from 128 up to 3072 dimensions. If you are running a massive operation and want to save memory, dropping to 768 dims is usually the sweet spot (as Google notes): you cut storage costs by 4x while losing only about a 0.25% drop in retrieval quality, which is a massive ROI for enterprise systems. Just remember that if you go for a smaller size, you will need to manually re-normalize the truncated vectors to keep your search results sharp.

Normalization Requirements

Accurate semantic similarity depends on vector direction rather than magnitude, so truncated MRL vectors must be re-normalized to unit length before indexing or querying.
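A minimal sketch of this MRL workflow, assuming you keep the first 768 of the 3072 dimensions and re-apply L2 normalization; the helper name is ours, not the SDK's:

```python
import numpy as np

def truncate_and_normalize(embedding, dim=768):
    """Keep the first `dim` MRL components and re-apply L2 normalization,
    so cosine similarity still reflects direction only."""
    v = np.asarray(embedding, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()

# Stand-in for a real 3072-dim Gemini embedding
full_embedding = np.random.default_rng(1).normal(size=3072).tolist()
vec = truncate_and_normalize(full_embedding)
```

Because MRL front-loads the semantic signal, the 768-dim prefix behaves like a smaller embedding of the same content once it is re-normalized.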

This flexible dimensionality allows for advanced retrieval strategies when paired with a high-performance vector database.

4. Infrastructure: Integration with Qdrant Vector Database

Qdrant is a natural partner for this because it lets you store all your different data types in one place, thanks to its native support for unified collections and multi-vector strategies. You no longer have to build separate "folders" for text and video: all modalities coexist in a single collection, which enables direct cross-modal queries.

With a two-pass strategy, you can run a lightning-fast first search with short vectors and then use the full-sized vectors to polish the final results (exactly as discussed in a previous blog post).

Qdrant has a useful feature called "Named Vectors", which lets you store different representations of the same item. This is perfect for hybrid search: one vector captures the deeper meaning of a query while another matches exact keywords like part numbers or technical jargon. It is like giving your search engine both a brain for context and an eye for detail.

Two-Pass Retrieval Strategy

Leveraging MRL, architects can implement a highly optimized Two-Pass Search Pipeline:

  1. Candidate Retrieval: Perform a rapid search using lower-dimensional vectors (e.g., 768 dimensions) to filter the top 100-200 candidates. This minimizes RAM usage and latency during the initial broad search.
  2. Full-Precision Rescoring: Use the full 3072-dimensional embeddings to rescore the small candidate set, ensuring maximum precision for the final top-K results.

This approach significantly reduces the compute pressure on the vector database while maintaining the accuracy of a full-dimensional search.

5. Gemini 2 and Qdrant Integration

The following implementation utilizes the google-genai SDK and qdrant-client to build a production-ready multimodal pipeline.

System Setup and Normalization

We include a helper function for L2 normalization, which is critical for those choosing to use MRL to reduce dimensionality.

import numpy as np
from google import genai
from google.genai import types
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

# Initialize Clients
client = genai.Client(api_key="YOUR_API_KEY")
MODEL_ID = "gemini-embedding-2-preview"
qdrant_client = QdrantClient(url="http://localhost:6333")

def normalize_vector(vector):
    """L2-normalize so cosine similarity depends on direction only."""
    v = np.asarray(vector, dtype=np.float32)
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()

Multimodal Ingestion and Collection Setup

We define the Qdrant collection using Cosine Similarity and demonstrate an aggregated embedding of text and image.

import io
from pdf2image import convert_from_path  # requires poppler installed
from qdrant_client.models import PointStruct

# Collection sized for the full 3072-dim Gemini embeddings
qdrant_client.create_collection(
    collection_name="papers",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)

pages = convert_from_path("paper.pdf", dpi=150)

for i, page in enumerate(pages):
    buf = io.BytesIO()
    page.save(buf, format="JPEG")
    img_bytes = buf.getvalue()

    # One aggregated embedding per page: short text label + rendered page image
    v = client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Content(parts=[
            types.Part(text=f"page {i+1}"),
            types.Part.from_bytes(data=img_bytes, mime_type="image/jpeg"),
        ])],
        config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
    ).embeddings[0].values

    qdrant_client.upsert(
        collection_name="papers",
        points=[PointStruct(id=i, vector=v, payload={"page": i + 1})],
    )

Multimodal Search Query

This example demonstrates pure cross-modal search: images are indexed with no caption at all, and the query is plain text, so the match relies entirely on the shared vector space.

def embed_parts(parts, task_type="RETRIEVAL_DOCUMENT"):
    """Helper: embed a list of Parts with the given task type."""
    return client.models.embed_content(
        model=MODEL_ID,
        contents=[types.Content(parts=parts)],
        config=types.EmbedContentConfig(task_type=task_type),
    ).embeddings[0].values

# index with no caption — pure image bytes only
v = embed_parts([
    types.Part.from_bytes(data=img_bytes, mime_type="image/png")
])

# query with text only — no image in the query
hits = qdrant_client.query_points(
    collection_name="image_only",
    query=embed_parts([types.Part(text=q)], task_type="RETRIEVAL_QUERY"),
    limit=1,
).points

6. Optimization: Task-Specific Tuning and Performance

For production-grade accuracy, embeddings must be optimized for their specific intent using the task_type parameter. This informs the model of the relationship between vectors (e.g., query vs. document).

Supported Task Types

| Task Type | Description | Optimized Use Case |
| --- | --- | --- |
| RETRIEVAL_QUERY | Optimized for short search queries. | User-facing search inputs. |
| RETRIEVAL_DOCUMENT | Optimized for items to be retrieved. | Indexing PDFs, images, and video chunks. |
| CODE_RETRIEVAL_QUERY | Optimized for code block retrieval. | Technical RAG and code assistants. |
| SEMANTIC_SIMILARITY | Measures conceptual closeness. | Recommendation engines, duplicate detection. |
| CLASSIFICATION | Optimized for label assignment. | Sentiment analysis or auto-tagging. |
| QUESTION_ANSWERING | Optimized for Q&A system queries. | FAQ bots and automated support. |
| FACT_VERIFICATION | Optimized for evidence retrieval. | Automated fact-checking systems. |

# same API call works for audio chunks
v = client.models.embed_content(
    model=MODEL_ID,
    contents=[types.Content(parts=[
        types.Part(text=f"segment {start:.0f}s–{end:.0f}s"),
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
    ])],
    config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT"),
).embeddings[0].values
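
For SEMANTIC_SIMILARITY workloads, you typically embed both items with that task type and compare the vectors directly with cosine similarity. A minimal sketch: the `cosine_similarity` helper is ours, and the commented-out calls assume the `client`, `MODEL_ID`, and `types` from the setup in section 5:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sketch (not run here): embed two texts with the SEMANTIC_SIMILARITY task
# type, then compare their vectors directly.
# e1 = client.models.embed_content(
#     model=MODEL_ID, contents=["please refund my order"],
#     config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
# ).embeddings[0].values
# e2 = ...  # second text, embedded the same way
# score = cosine_similarity(e1, e2)

score = cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical directions
```

Because embeddings returned at full dimensionality are already normalized, the cosine score reduces to a dot product in practice; the explicit norm keeps the helper safe for truncated vectors too.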

Critical Migration Note

One caveat: you cannot mix your old embeddings with these new ones. Gemini 2 speaks a completely different language than its predecessor, gemini-embedding-001, because it is built for multimodality from the ground up. If you are making the switch, you will need to reindex your entire database from scratch to keep everything working as expected.

Summary

The integration of Gemini 2's multimodal MRL embeddings with Qdrant's scalable Rust-based infrastructure may be the smartest way to build a modern search engine today. By using Matryoshka Representation Learning to save on storage costs and ditching the lossy OCR/transcription pipelines of the past, you end up with a system that is easier to manage, faster, cheaper to run, and far better at actually finding what your users are looking for across text, images, video, and audio.
