Add Image to Text docs #681

Open
wants to merge 1 commit into
base: main
21 changes: 21 additions & 0 deletions ai/api-reference/image-to-text.mdx
@@ -0,0 +1,21 @@
---
openapi: post /image-to-text
---

<Info>
The default Gateway used in this guide is the public
[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
not intended for production-ready applications. For production-ready
applications, consider using the [Livepeer Studio](https://livepeer.studio/)
Gateway, which requires an API token. Alternatively, you can set up your own
Gateway node or partner with one via the `ai-video` channel on
[Discord](https://discord.gg/livepeer).
</Info>

<Note>
Please note that the exact parameters, default values, and responses may vary
between models. For more information on model-specific parameters, please
refer to the respective model documentation available in the [image-to-text
pipeline](/ai/pipelines/image-to-text). Not all parameters might be available
for a given model.
</Note>
5 changes: 5 additions & 0 deletions ai/orchestrators/models-config.mdx
@@ -56,6 +56,11 @@ currently **recommended** models and their respective prices.
"price_per_unit": 11,
"pixels_per_unit": 1e2,
"currency": "USD",
},
{
"pipeline": "image-to-text",
"model_id": "Salesforce/blip-image-captioning-large",
"price_per_unit": 4768371
}
]
```

Review comment on the `price_per_unit` line (Contributor Author): I wasn't sure about the price, so I copied it from the other image-to pipelines.
95 changes: 95 additions & 0 deletions ai/pipelines/image-to-text.mdx
@@ -0,0 +1,95 @@
---
title: Image-to-Text
---

## Overview

The `image-to-text` pipeline converts images into text captions. This pipeline is powered by the latest models in the Hugging Face [image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text) pipeline.

<div align="center">

</div>

## Models

### Warm Models

The current warm model requested for the `image-to-text` pipeline is:

- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)

<Tip>
For faster responses with different
[image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text)
models, ask Orchestrators to load them on their GPUs via the `ai-video`
channel on the Livepeer [Discord server](https://discord.gg/livepeer).
</Tip>

### On-Demand Models

The following models have been tested and verified for the `image-to-text`
pipeline:

<Note>
If a specific model you wish to use is not listed, please submit a [feature
request](https://github.com/livepeer/ai-worker/issues/new?assignees=&labels=enhancement%2Cmodel&projects=&template=model_request.yml)
on GitHub to get the model verified and added to the list.
</Note>

{/* prettier-ignore */}
<Accordion title="Tested and Verified Models">
- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
</Accordion>

## Basic Usage Instructions

<Tip>
For a detailed understanding of the `image-to-text` endpoint and to experiment
with the API, see the [Livepeer AI API
Reference](/ai/api-reference/image-to-text).
</Tip>

To create an image caption using the `image-to-text` pipeline, submit a
`POST` request to the Gateway's `image-to-text` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/image-to-text" \
-F model_id=Salesforce/blip-image-captioning-large \
-F image=@<PATH_TO_FILE>
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `model_id` is the image-to-text model to use.
- `image` is the path to the image file to be captioned.
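
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_multipart` and `caption_image` are hypothetical, and the code assumes the Gateway answers with a JSON body (check the API reference for the actual response schema).

```python
import json
import mimetypes
import uuid
from pathlib import Path
from urllib import request


def build_multipart(fields, file_field, file_path):
    """Encode text fields and one file as multipart/form-data (like `curl -F`)."""
    boundary = uuid.uuid4().hex
    path = Path(file_path)
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{path.name}"\r\nContent-Type: {mime}\r\n\r\n'.encode()
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def caption_image(gateway_url, image_path,
                  model_id="Salesforce/blip-image-captioning-large"):
    """POST an image to <gateway_url>/image-to-text and return the parsed JSON.

    Assumption: the response body is JSON; its fields are not defined here.
    """
    body, content_type = build_multipart({"model_id": model_id}, "image", image_path)
    req = request.Request(
        gateway_url.rstrip("/") + "/image-to-text",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


# Example call (placeholders as in the curl command above):
# result = caption_image("https://<GATEWAY_IP>", "<PATH_TO_FILE>")
```

Any optional parameters listed in the API reference could be passed as extra entries in the `fields` dictionary.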

<Note>
Maximum request size: 50 MB
</Note>
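
A client can check the file size before uploading to avoid a rejected request. The sketch below is only illustrative: it assumes "50 MB" means 50 × 1024 × 1024 bytes and reserves a rough, arbitrary allowance for the multipart headers; the helper name `fits_request_limit` is hypothetical.

```python
import os

# Assumption: the documented "50 MB" limit is read as 50 MiB.
MAX_REQUEST_BYTES = 50 * 1024 * 1024


def fits_request_limit(image_path, overhead_bytes=4096):
    """Return True if the file plus a rough multipart overhead fits the limit.

    overhead_bytes is an arbitrary allowance for form fields and headers.
    """
    return os.path.getsize(image_path) + overhead_bytes <= MAX_REQUEST_BYTES
```

Because the limit applies to the whole request rather than the image alone, the check deliberately errs on the conservative side.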

For additional optional parameters, refer to the
[Livepeer AI API Reference](/ai/api-reference/image-to-text).

## Orchestrator Configuration

To configure your Orchestrator to serve the `image-to-text` pipeline, refer to
the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
  VRAM.

Review comment on the VRAM requirement (Contributor Author): @rickstaa any idea where to find a realistic suggestion on the VRAM? I don't think the Salesforce model needs much at all, but I'm not sure how to find out for certain.
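
To see whether a machine meets the 12GB recommendation, an operator could query total GPU memory with `nvidia-smi`. The sketch below is a rough check under stated assumptions: it treats 12GB as 12 × 1024 MiB, the function names are hypothetical, and it reports installed memory only, not how much VRAM the model actually uses.

```python
import subprocess

# Assumption: the 12GB recommendation is treated as 12 GiB (nvidia-smi reports MiB).
MIN_VRAM_MIB = 12 * 1024


def parse_vram_mib(nvidia_smi_output):
    """Parse one total-memory value (in MiB) per non-empty output line."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def indices_meeting(vram_values_mib, min_mib=MIN_VRAM_MIB):
    """Return the positions of GPUs whose total memory is at least min_mib."""
    return [i for i, mib in enumerate(vram_values_mib) if mib >= min_mib]


def local_gpus_meeting_requirement(min_mib=MIN_VRAM_MIB):
    """Query nvidia-smi on this machine and return the qualifying GPU indices."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return indices_meeting(parse_vram_mib(out), min_mib)
```

Calling `local_gpus_meeting_requirement()` requires the NVIDIA driver tools to be installed; the two pure helpers can be used on captured output from another host.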

## API Reference

<Card
title="API Reference"
icon="rectangle-terminal"
href="/ai/api-reference/image-to-text"
>
Explore the `image-to-text` endpoint and experiment with the API in the
Livepeer AI API Reference.
</Card>
7 changes: 7 additions & 0 deletions ai/pipelines/overview.mdx
@@ -89,4 +89,11 @@ pipelines:
>
The text-to-speech pipeline generates high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc).
</Card>
<Card
title="Image-to-Text"
icon="message-dots"
href="/ai/pipelines/image-to-text"
>
The image-to-text pipeline generates captions for input images, with an optional prompt to guide the process.
</Card>
</CardGroup>
4 changes: 4 additions & 0 deletions api-reference/generate/image-to-text.mdx
@@ -0,0 +1,4 @@
---
title: "Image To Text"
openapi: "POST /api/beta/generate/image-to-text"
---
6 changes: 4 additions & 2 deletions mint.json
@@ -539,7 +539,8 @@
"ai/pipelines/segment-anything-2",
"ai/pipelines/text-to-image",
"ai/pipelines/text-to-speech",
- "ai/pipelines/upscale"
+ "ai/pipelines/upscale",
+ "ai/pipelines/image-to-text"
]
},
{
@@ -604,7 +605,8 @@
"ai/api-reference/image-to-video",
"ai/api-reference/segment-anything-2",
"ai/api-reference/upscale",
- "ai/api-reference/text-to-speech"
+ "ai/api-reference/text-to-speech",
+ "ai/api-reference/image-to-text"
]
}
]