Mohamed Arbi Nsibi
Matryoshka: One Vector to Serve Them All


December 10, 2025
Slicing is all you need.
Matryoshka Concept

Traditional embedding models generate a single, fixed-size vector for each piece of data. This creates a fundamental dilemma:

It forces you into one of two choices: train and maintain multiple models for different needs, or accept a costly compromise between performance and efficiency.

The classical trade-off:

| Vector size | Pros | Cons |
| --- | --- | --- |
| Small (e.g. 300 dims) | Fast and efficient, good for mobile devices. | Less detailed, might miss nuances, lower accuracy. |
| Large (e.g. 1536 dims) | Very detailed and highly accurate. | Slower and requires powerful hardware. |

What if I told you that you could have the best of both worlds without needing two separate models?

The Matryoshka concept: one model, many resolutions

Matryoshka embeddings, introduced in the Matryoshka Representation Learning paper and popularized in early 2024, take inspiration from Russian nesting dolls: a single output contains several nested vectors, where each layer adds a new level of detail.


Layer 1 is simply the first group of numbers in that list, and each subsequent layer is a progressively longer slice taken from the start of the same single vector; going from layer 1 to layer 2 just adds more dimensions, i.e. more detail.


It is just logical partitioning: physically, a Matryoshka embedding is one large vector, and the nesting is a logical concept achieved by slicing that vector at pre-defined points. The model is trained so that even the first slice is independently meaningful and useful.
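A minimal sketch of what that slicing looks like in practice (the 1024-dim size and the 64/128/256 cut points are illustrative assumptions; truncated slices are typically re-normalized before computing cosine similarity):

import numpy as np

# Stand-in for one full embedding produced by a Matryoshka model.
full_vec = np.random.rand(1024).astype(np.float32)

def matryoshka_slice(vec: np.ndarray, dim: int) -> np.ndarray:
    """Take the first `dim` values and re-normalize so cosine similarity still works."""
    sliced = vec[:dim]
    return sliced / np.linalg.norm(sliced)

# The "dolls": the same vector, sliced at pre-defined points.
for dim in (64, 128, 256, 1024):
    print(dim, matryoshka_slice(full_vec, dim).shape)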


Matryoshka representation learning (MRL)

As defined in the official paper, MRL is a training strategy that forces subsections of an embedding to be independently useful. Here is how it works:
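A rough sketch of the idea (not the paper's exact code): the training loss is simply the sum of the usual loss computed on each nested prefix of the embedding, so every prefix has to be useful on its own. The cut points, embedding size, and classification setup below are illustrative assumptions.

import torch
import torch.nn as nn

NESTED_DIMS = [64, 128, 256, 1024]  # illustrative cut points

class MRLHead(nn.Module):
    """One classifier per nested dimensionality, all fed from the same embedding."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        assert NESTED_DIMS[-1] == embed_dim
        self.heads = nn.ModuleList(
            [nn.Linear(d, num_classes) for d in NESTED_DIMS]
        )

    def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # z: (batch, embed_dim) full embedding from the backbone
        loss = 0.0
        for d, head in zip(NESTED_DIMS, self.heads):
            logits = head(z[:, :d])  # slice the prefix
            loss = loss + nn.functional.cross_entropy(logits, labels)
        return loss  # each prefix must be useful on its own

# toy usage
z = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))
print(MRLHead(1024, 10)(z, labels))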


If you try this [:, :256] slice on an older model (like the original BERT or text-embedding-ada-002), the resulting vector will perform very poorly, because those models spread information across the entire length of the embedding rather than concentrating the most important parts in the first dimensions.


What MRL is NOT: a post-hoc trick; you cannot simply truncate the vectors of an arbitrary existing model and expect them to stay meaningful.

What MRL is: a training-time objective that deliberately packs the most important information into the earliest dimensions, so that every prefix slice is useful on its own.

Key performance metrics (based on the official paper)

Classification: achieves the same ImageNet accuracy as standard baselines with embeddings up to 14x smaller.

Retrieval: in adaptive retrieval settings, MRL allows up to 14x real-world speedups by using small dimensions for a first shortlisting pass and larger dimensions for reranking, with no loss in accuracy.

Accuracy: lower-dimensional slices of an MRL model often match or outperform models trained independently at that specific size (an MRL slice of 64 dims is about as good as a model trained specifically to output 64 dims).

Adaptive retrieval

If you are trying to find a book in a 10-million-book library, you can choose between:

Standard retrieval, which is too slow because you have to pick up every single book, read the summary (use the full 2048 dims), and decide whether it matches what you need.

Adaptive retrieval (fast and smart), which is a two-step funnel:

Step 1: shortlisting (pass 1, the 128x speedup). When searching for that specific book, you glance only at the titles (in our case, the 16-dim slice). Why the title? Because MRL forces the "title" to contain the most important information, which lets you reject more than 90% of the books at a glance. You then grab, say, the 200 books that seem relevant.

The result: this pass is theoretically up to 128x faster, because we processed only a tiny amount of data per book across the massive pile.

Step 2: reranking (pass 2).

We take those 200 shortlisted books and read the full summary (the full 2048 dims) for only these books to find the perfect match.

The result: doing the heavy reading only on those 200 books takes almost no time.
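A toy sketch of that two-pass funnel with plain NumPy (the corpus size, cut points, and shortlist size of 200 are illustrative assumptions; a real system would use an ANN index for pass 1):

import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 2048)).astype(np.float32)  # "the library"
query = rng.standard_normal(2048).astype(np.float32)

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Pass 1 - shortlisting: compare only the first 16 dims ("the titles").
scores_16 = normalize(corpus[:, :16]) @ normalize(query[:16])
shortlist = np.argsort(-scores_16)[:200]  # keep ~200 candidates

# Pass 2 - reranking: full 2048 dims, but only for the 200 shortlisted books.
scores_full = normalize(corpus[shortlist]) @ normalize(query)
top10 = shortlist[np.argsort(-scores_full)[:10]]
print(top10)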

OpenAI ‘text-embedding-3’ models

In early 2024, OpenAI adopted MRL for its new text-embedding-3-small and text-embedding-3-large models; notably, a 256-dim slice of text-embedding-3-large still outperforms the full 1536-dim vector of the previous-generation text-embedding-ada-002 on the MTEB benchmark.
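With the OpenAI Python SDK this shows up as the dimensions parameter (a minimal sketch; it requires an API key in the environment, and the input text is a placeholder):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the model to return only the first 256 Matryoshka dimensions.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="Matryoshka embeddings let one model serve many sizes.",
    dimensions=256,
)
print(len(resp.data[0].embedding))  # 256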


Matryoshka with Qdrant

Step 1: Multi-step Matryoshka retrieval

We start the search with the smallest, fastest embedding slice (for example 64 dims), then progressively refine the results by reranking with larger slices.
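The query below assumes a collection that stores each slice as a separate named vector. A minimal setup sketch (the collection name, sizes, and the matryoshka-64dim / matryoshka-128dim / matryoshka-256dim names are assumptions chosen to match the query code):

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="my-collection",
    vectors_config={
        # One named vector per Matryoshka slice, all using cosine distance.
        "matryoshka-64dim": models.VectorParams(size=64, distance=models.Distance.COSINE),
        "matryoshka-128dim": models.VectorParams(size=128, distance=models.Distance.COSINE),
        "matryoshka-256dim": models.VectorParams(size=256, distance=models.Distance.COSINE),
    },
)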

from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

# The first branch of our search pipeline retrieves 25 documents
# using the Matryoshka embeddings with multi-step retrieval.
matryoshka_prefetch = models.Prefetch(
    prefetch=[
        models.Prefetch(
            prefetch=[
                # First, retrieve 100 candidates using the fastest
                # (lowest-dimensional) 64d slice.
                models.Prefetch(
                    query=[...],
                    using="matryoshka-64dim",
                    limit=100,
                ),
            ],
            # Then rerank those 100 candidates with the more accurate
            # 128d slice, keeping the top 50.
            query=[...],
            using="matryoshka-128dim",
            limit=50,
        ),
    ],
    # Finally, rerank with the 256d slice and return the top 25.
    query=[...],
    using="matryoshka-256dim",
    limit=25,
)

client.query_points("my-collection", prefetch=[matryoshka_prefetch], ...)

Stage 1 is fast and broad: the initial search retrieves 100 potential candidates using the most efficient 64-dim slice.

Stages 2 and 3 are accurate and focused: the system then reranks those candidates with the more detailed 128-dim and 256-dim slices, returning the final top 25.

Step 2: Fusing with dense and sparse vectors

# The second branch of our search pipeline also retrieves 25 documents,
# but uses the dense and sparse vectors, with their results combined
# using the Reciprocal Rank Fusion.
sparse_dense_rrf_prefetch = models.Prefetch(
    prefetch=[
        models.Prefetch(
            prefetch=[
                # The first prefetch operation retrieves 100 documents
                # using dense vectors using integer data type. Retrieval
                # is faster, but quality is lower.
                models.Prefetch(
                    query=[...],
                    using="dense-uint8",
                    limit=100,
                )
            ],
            # Integer-based embeddings are then re-ranked using the
            # float-based embeddings. Here we just want to retrieve
            # 25 documents.
            query=[-1.234, 0.762, ..., 1.532],
            using="dense",
            limit=25,
        ),
        # Here we just add another 25 documents using the sparse
        # vectors only.
        models.Prefetch(
            query=models.SparseVector(
                indices=[...], values=[...],
            ),
            using="sparse",
            limit=25,
        ),
    ],
    # RRF is activated below, so there is no need to specify the
    # query vector here, as fusion is done on the scores of the
    # retrieved documents.
    query=models.FusionQuery(
        fusion=models.Fusion.RRF,
    ),
)

The dense and sparse searches each return 25 documents, and RRF (Reciprocal Rank Fusion) then merges them into a single relevance-ranked list.
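Putting the two branches together, the final call combines them with RRF (a sketch reusing the client, collection name, and prefetch objects from the snippets above):

# Combine the Matryoshka branch and the dense+sparse branch,
# fusing their results with Reciprocal Rank Fusion.
client.query_points(
    collection_name="my-collection",
    prefetch=[
        matryoshka_prefetch,
        sparse_dense_rrf_prefetch,
    ],
    query=models.FusionQuery(
        fusion=models.Fusion.RRF,
    ),
)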

Tip: for multi-vector models used only for reranking (like ColBERT), disable HNSW graph creation by setting m=0; this prevents unnecessary indexing and resource consumption.
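In Qdrant that looks roughly like this at collection creation time (a sketch; the collection name, "colbert" vector name, and size are assumptions for illustration):

# A separate, illustrative collection holding only reranking vectors.
client.create_collection(
    collection_name="my-reranking-collection",
    vectors_config={
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            # Multi-vector scoring (late interaction / MaxSim).
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM,
            ),
            # m=0 disables HNSW graph construction: this vector is only
            # used for reranking, so indexing it would waste resources.
            hnsw_config=models.HnswConfigDiff(m=0),
        ),
    },
)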

The big idea: Matryoshka representation learning delivers a single, flexible vector that scales to your needs. One model can now serve both high-powered servers and resource-constrained devices.

The workflow: the heavy lifting happens once, on powerful hardware, during training; at inference time you simply slice off the dimensions you need.

Where the metaphor of dolls bends (but doesn't break)

  1. It is slicing, not nesting: in code we are slicing a single large vector (myvec[:256]), not physically opening separate dolls.
  2. The model learns all layers simultaneously during training, unlike a doll built layer by layer.
  3. Fixed boundaries: the slice points are fixed at training time, and you can't ask for a custom 400-dim slice without retraining.
  4. The model may learn features that overlap across slices, unlike the clean separation of a physical doll.

MRL offers a clever, unified way to nest different levels of detail inside a single embedding space, fundamentally changing how we approach efficiency and flexibility in AI. You get a single, flexible (let's call it elastic) vector that can serve both high-powered servers and resource-limited edge devices.

References
