
[RFC] Support multi modal retrieval on top of llama stack, inference provider side #667

Open
benjibc opened this issue Dec 20, 2024 · 1 comment

Comments

@benjibc
Contributor

benjibc commented Dec 20, 2024

🚀 Describe the new functionality needed

We want to look into multimodal retrieval support for llama stack. I want to first discuss what the inference provider API side should look like. In this proposal:

  • Two endpoints: /v1/embeddings/text and /v1/embeddings/image.
  • Both accept a model parameter. Using a shared multimodal model (like "siglip-1") ensures embeddings are aligned.
  • The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
  • The response format is consistent between text and image embeddings, simplifying integration.
  • This approach sets a foundation for multimodal retrieval and other advanced use cases involving both text and image data.

Endpoints

1. Text Embeddings Endpoint

URL: POST /v1/embeddings/text

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For multimodal capability, this should be set to something like "siglip-1".
  • input (string, required): The text string to embed.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "input": "A photo of a white cat sitting on a chair.",
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "usage": {
    "embedding_compute_time_ms": 20
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 20
  }
}
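
For concreteness, here is a minimal client sketch for the text endpoint above. It assumes the Python requests library; the base URL, the API key handling, and the embed_text helper name are illustrative placeholders, not part of the proposal.

# Minimal client sketch for POST /v1/embeddings/text.
# API_BASE, API_KEY, and embed_text are placeholders.
import requests

API_BASE = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

def embed_text(text, model="siglip-1", return_dims=False):
    resp = requests.post(
        f"{API_BASE}/v1/embeddings/text",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        json={
            "model": model,
            "input": text,
            "options": {"normalize": True, "return_dims": return_dims},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

text_embedding = embed_text("A photo of a white cat sitting on a chair.")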

2. Image Embeddings Endpoint

URL: POST /v1/embeddings/image

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For image embeddings aligned with the text embeddings above, use "siglip-1".
  • image (object, required):
    • base64 (string, required): A base64-encoded representation of the image (e.g., PNG or JPEG). The client must pre-encode the image before sending.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "image": {
    "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
  },
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "usage": {
    "embedding_compute_time_ms": 45
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
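
A matching sketch for the image endpoint, including the client-side base64 encoding described above. The file path, base URL, and helper name are again placeholders.

# Sketch for POST /v1/embeddings/image; the client base64-encodes the image first.
import base64
import requests

API_BASE = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

def embed_image(path, model="siglip-1"):
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{API_BASE}/v1/embeddings/image",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        json={
            "model": model,
            "image": {"base64": image_b64},
            "options": {"normalize": True, "return_dims": False},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

image_embedding = embed_image("cat_on_chair.png")  # placeholder path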

Example Workflow for Multimodal Retrieval

Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").

  1. Get the Text Embedding
    Call the text embedding endpoint with your textual query:

    POST /v1/embeddings/text
    {
      "model": "siglip-1",
      "input": "A photo of a white cat sitting on a chair."
    }

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    }

    Store this embedding in your application as text_embedding.

  2. Get the Image Embedding for a Candidate Image
    Convert your candidate image to base64 (done client-side), then:

    POST /v1/embeddings/image
    {
      "model": "siglip-1",
      "image": {
        "base64": "iVBORw0K..."
      }
    }

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.023, -0.081, 0.572, ...]
    }

    Store this embedding in your application as image_embedding.

  3. Compare the Embeddings
    Because both embeddings come from the same aligned model ("siglip-1"), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content:

    similarity = cosine_similarity(text_embedding, image_embedding)

    If similarity is high, this image is likely a good match for the textual query.
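
As a rough illustration of step 3, the similarity can be computed in plain Python; when the embeddings were requested with normalize = true, cosine similarity reduces to a dot product.

# Cosine similarity between two embedding vectors of equal length.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(text_embedding, image_embedding)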


Error Handling

  • 400 Bad Request: Missing model or input/image field, invalid base64 encoding.
  • 401 Unauthorized: Invalid or missing API key.
  • 415 Unsupported Media Type: If the Content-Type is not application/json.
  • 500 Internal Server Error: Unexpected server issues.

Example Error Response:

{
  "error": {
    "message": "Invalid base64 image encoding",
    "type": "invalid_request_error"
  }
}
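
On the client side, these cases could be handled roughly as follows; embed_text is the placeholder helper from the earlier sketch, and the mapping of status codes to messages is only illustrative.

# Sketch of handling the error shape above; raise_for_status() in the helper
# turns 4xx/5xx responses into requests.HTTPError.
import requests

try:
    embedding = embed_text("A photo of a white cat sitting on a chair.")
except requests.HTTPError as err:
    status = err.response.status_code
    error = err.response.json().get("error", {})
    if status == 400:
        print(f"Bad request ({error.get('type')}): {error.get('message')}")
    elif status == 401:
        print("Invalid or missing API key.")
    else:
        print(f"Request failed with status {status}: {error.get('message')}")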

💡 Why is this needed? What if we don't build it?

Open to feedback here

Other thoughts

No response

@ashwinb
Contributor

ashwinb commented Dec 21, 2024

Thank you for putting up such a complete spec / proposal with all details! Much to learn from for us :)

Is there a reason why you should separate these endpoints though? Internally, CLIP does have separate encoders, but from an API perspective, a client is really embedding a mixture of various contents together. Our embeddings API, for example, works with an InterleavedContent type, which is roughly something like

[
   { type: "text", text: "hello" },
   { type: "image", url: "http://foo.bar.baz/lol.png" },
]

With your proposal, trying to embed this content means I will have to (a) make two API calls, and (b) more importantly, the client is now expected to figure out what to do with these embedding values -- should they be indexed separately? We can certainly make the call to do an "addition" within llama-stack (since the semantic is that the client wants to embed this stuff as one piece). Curious to know what you think.
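
To make the aggregation idea concrete, one purely illustrative sketch: embed each interleaved item with the per-modality helpers from the sketches above, then sum the vectors into a single embedding. embed_image_from_url is a hypothetical variant of embed_image that downloads the image first, and summation is just one possible semantic, not a decided design.

# Illustrative only: embed InterleavedContent as one piece by summing the
# per-item embeddings.
def embed_interleaved(items):
    vectors = []
    for item in items:
        if item["type"] == "text":
            vectors.append(embed_text(item["text"]))
        elif item["type"] == "image":
            vectors.append(embed_image_from_url(item["url"]))  # hypothetical helper
    return [sum(components) for components in zip(*vectors)]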
