🚀 Describe the new functionality needed

We want to look into multimodal retrieval support for llama stack, and I want to first discuss what the inference provider API side should look like. In this proposal:

Two endpoints: /v1/embeddings/text and /v1/embeddings/image.
Both accept a model parameter. Using a shared multimodal model (like "siglip-1") ensures the text and image embeddings are aligned in the same vector space.
The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
The response format is consistent between text and image embeddings, simplifying integration.
This approach lays a foundation for multimodal retrieval and other advanced use cases involving both text and image data.
Endpoints

1. Text Embeddings Endpoint

URL: POST /v1/embeddings/text

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:
model (string, required): The name of the model to use. For multimodal capability, this should be set to something like "siglip-1".
input (string, required): The text string to embed.
options (object, optional):
  normalize (boolean, default: true): Whether to normalize the embedding vector.
  return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.
Example Request:
{
"model": "siglip-1",
"input": "A photo of a white cat sitting on a chair.",
"options": {
"normalize": true,
"return_dims": false
}
}
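For reference, a response to this request might look roughly like the sketch below; the proposal does not pin down the exact response schema, so the embedding field name and values here are placeholders for illustration only:

{
  "model": "siglip-1",
  "embedding": [0.0132, -0.0425, 0.0981]
}

If return_dims is set to true, the response would additionally carry the embedding dimensionality, e.g. a "dims": 768 field (field name assumed).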
2. Image Embeddings Endpoint

URL: POST /v1/embeddings/image

Headers:
Content-Type: application/json
Authorization: Bearer <API_KEY>

Request Body:
model (string, required): The name of the model to use; for aligned multimodal embeddings this should match the text endpoint, e.g. "siglip-1".
image (string, required): The base64-encoded image to embed.
options (object, optional): The same normalize and return_dims flags as the text endpoint.
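To make the request shape concrete, here is a minimal client sketch in Python. Only the endpoint path, headers, and request fields come from the proposal; the base URL, file name, and the embedding field in the response are assumptions for illustration.

import base64
import requests

API_URL = "http://localhost:8000/v1/embeddings/image"  # assumed base URL
API_KEY = "YOUR_API_KEY"

# Base64-encode the candidate image client-side, as the proposal requires.
with open("cat_on_chair.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "siglip-1",
    "image": image_b64,
    "options": {"normalize": True, "return_dims": False},
}

# requests sets Content-Type: application/json automatically when json= is used.
resp = requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
image_embedding = resp.json()["embedding"]  # response field name is an assumption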
Example Workflow for Multimodal Retrieval

Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").

Get the Text Embedding
Call the text embedding endpoint with your textual query and store the returned vector in your application as text_embedding.

Get the Image Embedding for a Candidate Image
Convert your candidate image to base64 (done client-side), call the image embedding endpoint, and store the returned vector as image_embedding.

Compare the Embeddings
Because both embeddings come from the same aligned model ("siglip-1"), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content, as sketched below.
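A minimal comparison sketch, assuming text_embedding and image_embedding are the plain float vectors stored in the two steps above:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors. If the embeddings were
    # requested with normalize=true, this reduces to a plain dot product.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# text_embedding and image_embedding come from the two endpoint calls above.
similarity = cosine_similarity(text_embedding, image_embedding)
print(f"similarity: {similarity:.4f}")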
If similarity is high, this image is likely a good match for the textual query.

Error Handling

The endpoints should return clear errors for requests with a missing model or input/image field or with invalid base64 encoding, and for requests whose Content-Type is not application/json.

💡 Why is this needed? What if we don't build it?

Open to feedback here

Other thoughts

No response
Thank you for putting up such a complete spec / proposal with all details! Much to learn from for us :)
Is there a reason to separate these endpoints, though? Internally CLIP does have separate encoders, but from an API perspective, a client is really embedding a mixture of various contents together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like this:
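(Sketch only: the shapes below are a rough paraphrase of the interleaved text/image item idea, not the exact llama-stack type definitions.)

from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class TextContentItem:
    text: str
    type: str = "text"

@dataclass
class ImageContentItem:
    # Either a URL or inline base64 data, depending on the provider.
    url: Optional[str] = None
    data: Optional[str] = None
    type: str = "image"

# Interleaved content: a plain string, a single item, or a mixed list of
# text and image items that the client wants embedded as one piece.
InterleavedContent = Union[
    str,
    TextContentItem,
    ImageContentItem,
    List[Union[TextContentItem, ImageContentItem]],
]

example: InterleavedContent = [
    TextContentItem(text="A photo of a white cat sitting on a chair."),
    ImageContentItem(data="<base64-encoded image>"),
]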
With your proposal, trying to embed this content means I would have to (a) make two API calls, but (b) more importantly, the client is now expected to figure out what to do with these embedding values -- should they be indexed separately? We can certainly make the call to do an "addition" within llama-stack (since the semantic is that the client wants to embed this stuff as one piece). Curious to know what you think.
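To illustrate the concern concretely (an illustration only, not a recommendation): embedding the interleaved content above under the split-endpoint design would look roughly like two calls plus a client-side combination step, with the base URL and response field names assumed.

import requests

BASE = "http://localhost:8000"  # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# (a) Two API calls, one per modality...
text_emb = requests.post(
    f"{BASE}/v1/embeddings/text",
    json={"model": "siglip-1", "input": "A photo of a white cat sitting on a chair."},
    headers=HEADERS,
).json()["embedding"]

image_emb = requests.post(
    f"{BASE}/v1/embeddings/image",
    json={"model": "siglip-1", "image": "<base64-encoded image>"},
    headers=HEADERS,
).json()["embedding"]

# (b) ...and the client must now decide how to combine or index them,
# e.g. an element-wise "addition" if it wants a single vector for the piece.
combined = [t + i for t, i in zip(text_emb, image_emb)]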