🚀 Describe the new functionality needed

We want to look into multimodal retrieval support for llama stack, and I want to first discuss what the inference provider API side should look like. In this proposal:

Two endpoints: /v1/embeddings/text and /v1/embeddings/image.
Both accept a model parameter. Using a shared multimodal model (like "siglip-1") ensures the text and image embeddings are aligned in the same vector space.
The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
The response format is consistent between text and image embeddings, simplifying integration.
This approach lays a foundation for multimodal retrieval and other advanced use cases involving both text and image data.
Endpoints

1. Text Embeddings Endpoint

URL: POST /v1/embeddings/text

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:
model (string, required): The name of the model to use. For multimodal capability, this should be set to something like "siglip-1".
input (string, required): The text string to embed.
options (object, optional):
  normalize (boolean, default: true): Whether to normalize the embedding vector.
  return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.
Example Request:
{
"model": "siglip-1",
"input": "A photo of a white cat sitting on a chair.",
"options": {
"normalize": true,
"return_dims": false
}
}
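For reference, a response to this request might look roughly like the sketch below; the proposal does not pin down the exact response schema, so the embedding field name and values here are placeholders for illustration only:

{
  "model": "siglip-1",
  "embedding": [0.0132, -0.0425, 0.0981]
}

If return_dims is set to true, the response would additionally carry the embedding dimensionality, e.g. a "dims": 768 field (field name assumed).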
2. Image Embeddings Endpoint

URL: POST /v1/embeddings/image

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:
model (string, required): The name of the model to use; for aligned multimodal embeddings this should match the text endpoint, e.g. "siglip-1".
image (string, required): The base64-encoded image to embed.
options (object, optional): The same normalize and return_dims flags as the text endpoint.
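To make the request shape concrete, here is a minimal client sketch in Python. Only the endpoint path, headers, and request fields come from the proposal; the base URL, file name, and the embedding field in the response are assumptions for illustration.

import base64
import requests

API_URL = "http://localhost:8000/v1/embeddings/image"  # assumed base URL
API_KEY = "YOUR_API_KEY"

# Base64-encode the candidate image client-side, as the proposal requires.
with open("cat_on_chair.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "siglip-1",
    "image": image_b64,
    "options": {"normalize": True, "return_dims": False},
}

# requests sets Content-Type: application/json automatically when json= is used.
resp = requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
image_embedding = resp.json()["embedding"]  # response field name is an assumption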
Example Workflow for Multimodal Retrieval

Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").

Get the Text Embedding
Call the text embedding endpoint with your textual query and store the returned vector in your application as text_embedding.

Get the Image Embedding for a Candidate Image
Convert your candidate image to base64 (done client-side), call the image embedding endpoint, and store the returned vector as image_embedding.

Compare the Embeddings
Because both embeddings come from the same aligned model ("siglip-1"), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content, as sketched below.
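A minimal comparison sketch, assuming text_embedding and image_embedding are the plain float vectors stored in the two steps above:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors. If the embeddings were
    # requested with normalize=true, this reduces to a plain dot product.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# text_embedding and image_embedding come from the two endpoint calls above.
similarity = cosine_similarity(text_embedding, image_embedding)
print(f"similarity: {similarity:.4f}")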
If similarity is high, this image is likely a good match for the textual query.

Error Handling

The endpoints should return clear errors for requests with a missing model or input/image field or with invalid base64 encoding, and for requests whose Content-Type is not application/json.

💡 Why is this needed? What if we don't build it?

Open to feedback here

Other thoughts

No response
Thank you for putting up such a complete spec / proposal with all details! Much to learn from for us :)
Is there a reason to separate these endpoints, though? Internally CLIP does have separate encoders, but from an API perspective, a client is really embedding a mixture of various contents together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like this:
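(Sketch only: the shapes below are a rough paraphrase of the interleaved text/image item idea, not the exact llama-stack type definitions.)

from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class TextContentItem:
    text: str
    type: str = "text"

@dataclass
class ImageContentItem:
    # Either a URL or inline base64 data, depending on the provider.
    url: Optional[str] = None
    data: Optional[str] = None
    type: str = "image"

# Interleaved content: a plain string, a single item, or a mixed list of
# text and image items that the client wants embedded as one piece.
InterleavedContent = Union[
    str,
    TextContentItem,
    ImageContentItem,
    List[Union[TextContentItem, ImageContentItem]],
]

example: InterleavedContent = [
    TextContentItem(text="A photo of a white cat sitting on a chair."),
    ImageContentItem(data="<base64-encoded image>"),
]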
With your proposal, trying to embed this content means I would have to (a) make two API calls, but (b) more importantly, the client is now expected to figure out what to do with these embedding values -- should they be indexed separately? We can certainly make the call to do an "addition" within llama-stack (since the semantic is that the client wants to embed this stuff as one piece). Curious to know what you think.
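To illustrate the concern concretely (an illustration only, not a recommendation): embedding the interleaved content above under the split-endpoint design would look roughly like two calls plus a client-side combination step, with the base URL and response field names assumed.

import requests

BASE = "http://localhost:8000"  # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# (a) Two API calls, one per modality...
text_emb = requests.post(
    f"{BASE}/v1/embeddings/text",
    json={"model": "siglip-1", "input": "A photo of a white cat sitting on a chair."},
    headers=HEADERS,
).json()["embedding"]

image_emb = requests.post(
    f"{BASE}/v1/embeddings/image",
    json={"model": "siglip-1", "image": "<base64-encoded image>"},
    headers=HEADERS,
).json()["embedding"]

# (b) ...and the client must now decide how to combine or index them,
# e.g. an element-wise "addition" if it wants a single vector for the piece.
combined = [t + i for t, i in zip(text_emb, image_emb)]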