
[Frontend] Rerank API (Jina- and Cohere-compatible API) #12376

Open · wants to merge 23 commits into base: main

Changes from 4 commits (23 commits in total):
b6610fb
feat: serving_rerank implementation
K-Mistele Jan 23, 2025
a82b4bb
fix: imports
K-Mistele Jan 23, 2025
99acff6
doc: add example requests and scripts
K-Mistele Jan 23, 2025
31b5137
test: rerank
K-Mistele Jan 24, 2025
485e328
feat: serving_rerank implementation
K-Mistele Jan 23, 2025
8922f81
fix: imports
K-Mistele Jan 23, 2025
dc0d158
doc: add example requests and scripts
K-Mistele Jan 23, 2025
4ed459b
test: rerank
K-Mistele Jan 24, 2025
676eea0
added /v2/rerank route
K-Mistele Jan 24, 2025
b66bcc2
fix(docs): extra spaces
K-Mistele Jan 24, 2025
c44dee4
fix(docs): cross-reference target for rerank API
K-Mistele Jan 24, 2025
cce2873
fix(tests): needed to break up model quotes
K-Mistele Jan 24, 2025
a38060f
doc(example): update jina example to reflect lack of SDK, add cohere …
K-Mistele Jan 24, 2025
901021f
fix: remove logger warnings and make the linter happy
K-Mistele Jan 24, 2025
4849575
fix: file name
K-Mistele Jan 24, 2025
36e85a5
fix(nit): ordering on assertions
K-Mistele Jan 24, 2025
4adb94b
fix(tests): was using score instead of rerank
K-Mistele Jan 24, 2025
dc92240
fix(api): use rerank as the default API for scoring
K-Mistele Jan 24, 2025
330aa22
fix(merge)
K-Mistele Jan 24, 2025
ce85821
Merge branch 'vllm-project:main' into main
K-Mistele Jan 25, 2025
29a0366
doc: v2 rerank endpoint
K-Mistele Jan 25, 2025
844d39a
fix: remove duplicate file and fix vllm start command in examples
K-Mistele Jan 25, 2025
af83c25
fix: only load serving rerank if model supports score
K-Mistele Jan 25, 2025
92 changes: 92 additions & 0 deletions docs/source/serving/openai_compatible_server.md
@@ -50,6 +50,11 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

(chat-template)=

@@ -473,3 +478,90 @@ The following extra parameters are supported:
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
```

(rerank-api)=

### Re-rank API

Our Re-rank API applies a cross-encoder model to predict a relevance score between a single query and
each of a list of documents. Usually, the score for a sentence pair refers to the similarity between the two
sentences, on a scale of 0 to 1.

You can find the documentation for these kinds of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
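
For intuition, here is a minimal offline sketch of the score a cross-encoder produces, using the `sentence-transformers` package (an illustration only; installing it is assumed, and it is not required by this API):

```python
# Minimal sketch, assuming `pip install sentence-transformers`.
# It only illustrates what a cross-encoder relevance score is; the
# /rerank endpoints compute equivalent scores server-side.
from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-base")
query = "What is the capital of France?"
documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]

# One score per (query, document) pair; higher means more relevant.
scores = model.predict([(query, doc) for doc in documents])
print(scores)
```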

The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, both the `/rerank` and `/v1/rerank` endpoints are compatible with
[Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.

Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>

#### Example Request

Note that the `top_n` request parameter is optional and defaults to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine the original
order (see the sketch after the example response below).

Request:

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```

Response:

```json
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```
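
Because results are returned sorted by relevance, a client that needs the documents back in their submitted order can re-sort on the `index` field. A small sketch (the URL and model are the values from the example above):

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
response.raise_for_status()

# Results arrive sorted by relevance_score; `index` points back into the
# original `documents` list, so sorting on it restores submission order.
results = sorted(response.json()["results"], key=lambda r: r["index"])
for r in results:
    print(r["index"], round(r["relevance_score"], 4), r["document"]["text"])
```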

#### Extra parameters

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
```
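
As an illustration, the extra parameters above are passed as ordinary request fields. A hedged sketch exercising the `truncate_prompt_tokens` and `priority` fields defined on `RerankRequest` (values are arbitrary; a `priority` other than 0 requires the server to use priority scheduling):

```python
import requests

# Sketch only: exercises the extra parameters defined by RerankRequest.
# `truncate_prompt_tokens` caps the tokenized length of each input;
# `priority` is left at its default of 0.
response = requests.post(
    "http://127.0.0.1:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": ["The capital of France is Paris."],
        "top_n": 1,
        "truncate_prompt_tokens": 64,
        "priority": 0,
    },
)
print(response.json())
```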
33 changes: 33 additions & 0 deletions examples/online_serving/jinaai_rerank_client.py
@@ -0,0 +1,33 @@
"""
Example of using the OpenAI entrypoint's rerank API which is compatible with
Jina and Cohere
run: vllm serve --model BAAI/bge-reranker-base
"""
import json

import requests

url = "http://127.0.0.1:8000/rerank"

headers = {"accept": "application/json", "Content-Type": "application/json"}

data = {
    "model": "BAAI/bge-reranker-base",
    "query": "What is the capital of France?",
    "documents": [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals",
    ],
}

response = requests.post(url, headers=headers, json=data)

# Check the response
if response.status_code == 200:
    print("Request successful!")
    print(json.dumps(response.json(), indent=2))
else:
    print(f"Request failed with status code: {response.status_code}")
    print(response.text)
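
Because the endpoint is also Cohere-compatible, the same server can be reached through Cohere's Python SDK by pointing it at vLLM. A hedged sketch (assumes `pip install cohere`; the API key is a placeholder, since vLLM only checks it when started with `--api-key`):

```python
import cohere

# Point the Cohere SDK at the local vLLM server instead of Cohere's API.
co = cohere.Client(base_url="http://127.0.0.1:8000", api_key="sk-placeholder")

rerank_result = co.rerank(
    model="BAAI/bge-reranker-base",
    query="What is the capital of France?",
    documents=[
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Horses and cows are both animals",
    ],
)
print(rerank_result)
```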
98 changes: 98 additions & 0 deletions tests/entrypoints/openai/test_rerank.py
@@ -0,0 +1,98 @@
import pytest
import requests

from vllm.entrypoints.openai.protocol import RerankResponse

from ...utils import RemoteOpenAIServer

MODEL_NAME = "BAAI/bge-reranker-base"


@pytest.fixture(scope="module")
def server():
    # Each CLI flag and its value must be separate argv entries.
    args = ["--enforce-eager", "--max-model-len", "100"]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
        yield remote_server


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
def test_rerank_texts(server: RemoteOpenAIServer, model_name: str):
    query = "What is the capital of France?"
    documents = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    rerank_response = requests.post(server.url_for("rerank"),
                                    json={
                                        "model": model_name,
                                        "query": query,
                                        "documents": documents,
                                    })
    rerank_response.raise_for_status()
    rerank = RerankResponse.model_validate(rerank_response.json())

    assert rerank.id is not None
    assert rerank.results is not None
    assert len(rerank.results) == 2
    assert rerank.results[0].relevance_score >= 0.9
    assert rerank.results[1].relevance_score <= 0.01


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
def test_top_n(server: RemoteOpenAIServer, model_name: str):
    query = "What is the capital of France?"
    documents = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
        "Cross-encoder models are neat",
    ]

    # POST to the rerank endpoint (not /score).
    rerank_response = requests.post(server.url_for("rerank"),
                                    json={
                                        "model": model_name,
                                        "query": query,
                                        "documents": documents,
                                        "top_n": 2
                                    })
    rerank_response.raise_for_status()
    rerank = RerankResponse.model_validate(rerank_response.json())

    assert rerank.id is not None
    assert rerank.results is not None
    assert len(rerank.results) == 2
    assert rerank.results[0].relevance_score >= 0.9
    assert rerank.results[1].relevance_score <= 0.01


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
def test_score_max_model_len(server: RemoteOpenAIServer, model_name: str):

    query = "What is the capital of France?" * 100
    documents = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    rerank_response = requests.post(server.url_for("rerank"),
                                    json={
                                        "model": model_name,
                                        "query": query,
                                        "documents": documents
                                    })
    assert rerank_response.status_code == 400
    # Assert just a small fragment of the response
    assert "Please reduce the length of the input." in \
        rerank_response.text

    # Test truncation: a truncation size larger than max_model_len (100)
    # should be rejected with a different error.
    rerank_response = requests.post(server.url_for("rerank"),
                                    json={
                                        "model": model_name,
                                        "query": query,
                                        "documents": documents,
                                        "truncate_prompt_tokens": 101
                                    })
    assert rerank_response.status_code == 400
    assert "Please, select a smaller truncation size." in \
        rerank_response.text
41 changes: 41 additions & 0 deletions vllm/entrypoints/openai/api_server.py
@@ -56,6 +56,7 @@
PoolingChatRequest,
PoolingCompletionRequest,
PoolingRequest, PoolingResponse,
RerankRequest, RerankResponse,
ScoreRequest, ScoreResponse,
TokenizeRequest,
TokenizeResponse,
@@ -68,6 +69,7 @@
from vllm.entrypoints.openai.serving_models import (BaseModelPath,
OpenAIServingModels)
from vllm.entrypoints.openai.serving_pooling import OpenAIServingPooling
from vllm.entrypoints.openai.serving_rerank import JinaAIServingRerank
from vllm.entrypoints.openai.serving_score import OpenAIServingScores
from vllm.entrypoints.openai.serving_tokenization import (
OpenAIServingTokenization)
@@ -306,6 +308,10 @@ def score(request: Request) -> Optional[OpenAIServingScores]:
    return request.app.state.openai_serving_scores


def rerank(request: Request) -> Optional[JinaAIServingRerank]:
    return request.app.state.jinaai_serving_reranking


def tokenization(request: Request) -> OpenAIServingTokenization:
    return request.app.state.openai_serving_tokenization

@@ -502,6 +508,33 @@ async def create_score_v1(request: ScoreRequest, raw_request: Request):
    return await create_score(request, raw_request)


@router.post("/rerank")
@with_cancellation
async def do_rerank(request: RerankRequest, raw_request: Request):
handler = rerank(raw_request)
if handler is None:
return base(raw_request).create_error_response(
message="The model does not support Rerank (Score) API")
generator = await handler.do_rerank(request, raw_request)
if isinstance(generator, ErrorResponse):
return JSONResponse(content=generator.model_dump(),
status_code=generator.code)
elif isinstance(generator, RerankResponse):
return JSONResponse(content=generator.model_dump())

assert_never(generator)


@router.post("/v1/rerank")
@with_cancellation
async def do_rerank_v1(request: RerankRequest, raw_request: Request):
logger.warning(
"To indicate that the rerank API is not part of the standard OpenAI"
" API, we have located it at `/rerank`. Please update your client"
"accordingly. (Note: Conforms to JinaAI rerank API)")
return await do_rerank(request, raw_request)
K-Mistele marked this conversation as resolved.
Show resolved Hide resolved


TASK_HANDLERS: Dict[str, Dict[str, tuple]] = {
    "generate": {
        "messages": (ChatCompletionRequest, create_chat_completion),
@@ -514,6 +547,9 @@ async def create_score_v1(request: ScoreRequest, raw_request: Request):
    "score": {
        "default": (ScoreRequest, create_score),
    },
    "rerank": {
        "default": (RerankRequest, do_rerank)
    },
    "reward": {
        "messages": (PoolingChatRequest, create_pooling),
        "default": (PoolingCompletionRequest, create_pooling),
@@ -759,6 +795,11 @@ async def init_app_state(
        state.openai_serving_models,
        request_logger=request_logger
    ) if model_config.task == "score" else None
    # Only load the rerank handler when the model actually supports scoring.
    state.jinaai_serving_reranking = JinaAIServingRerank(
        engine_client,
        model_config,
        state.openai_serving_models,
        request_logger=request_logger
    ) if model_config.task == "score" else None
    state.openai_serving_tokenization = OpenAIServingTokenization(
        engine_client,
        model_config,
46 changes: 46 additions & 0 deletions vllm/entrypoints/openai/protocol.py
@@ -1000,6 +1000,52 @@ def to_pooling_params(self):
        return PoolingParams(additional_data=self.additional_data)


class RerankRequest(OpenAIBaseModel):
    model: str
    query: str
    documents: List[str]
    top_n: int = Field(default_factory=lambda: 0)
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None

    # doc: begin-rerank-pooling-params
    additional_data: Optional[Any] = None
    # doc: end-rerank-pooling-params

    # doc: begin-rerank-extra-params
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."))

    # doc: end-rerank-extra-params

    def to_pooling_params(self):
        return PoolingParams(additional_data=self.additional_data)


class RerankDocument(BaseModel):
    text: str


class RerankResult(BaseModel):
    index: int
    document: RerankDocument
    relevance_score: float


class RerankUsage(BaseModel):
    total_tokens: int


class RerankResponse(OpenAIBaseModel):
    id: str
    model: str
    usage: RerankUsage
    results: List[RerankResult]


class CompletionLogProbs(OpenAIBaseModel):
    text_offset: List[int] = Field(default_factory=list)
    token_logprobs: List[Optional[float]] = Field(default_factory=list)
9 changes: 5 additions & 4 deletions vllm/entrypoints/openai/serving_engine.py
@@ -26,7 +26,8 @@
DetokenizeRequest,
EmbeddingChatRequest,
EmbeddingCompletionRequest,
ErrorResponse, ScoreRequest,
ErrorResponse, RerankRequest,
ScoreRequest,
TokenizeChatRequest,
TokenizeCompletionRequest)
from vllm.entrypoints.openai.serving_models import OpenAIServingModels
@@ -204,9 +205,9 @@ def _validate_input(
        token_num = len(input_ids)

        # Note: EmbeddingRequest, ScoreRequest and RerankRequest don't have max_tokens
-        if isinstance(
-                request,
-                (EmbeddingChatRequest, EmbeddingCompletionRequest, ScoreRequest)):
+        if isinstance(request,
+                      (EmbeddingChatRequest, EmbeddingCompletionRequest,
+                       ScoreRequest, RerankRequest)):

            operation = "score" if isinstance(request, ScoreRequest) \
                else "embedding generation"