feat: OpenAI Compatible Frontend #7561
Conversation
…ing for utilities and models routes
… speed up testing, next step add tests for /v1/models routes
…, need to refactor for including server startup
…ing server only once per test class, skip openai client tests
…the default value
… some placeholder tests for future feature support
…oad logic to track/list all models for now, marked xfail test for known issue with TRT-LLM temperature, added logic to support testing both TRT-LLM and vLLM based on environment, added openai dep to Dockerfile but skipping openai tests for now
…add OpenAIServer utility for testing without FastAPI TestClient, rename folder from example to openai for clarity that the source code isn't an example, add some usage examples with curl, genai-perf, and openai client, add --tokenizer to main.py
…ting it at model load time. Cleanup completions tests
…a decoupled model with an empty final response, add response validation for this scenario
…fixtures and logic
…ut README accordingly
…to not use dockerfiles
…ude more openai fields explicitly from sampling parameters input to vllm backend
@@ -36,14 +36,34 @@ def _create_vllm_inference_request(
    model, prompt, request: CreateChatCompletionRequest | CreateCompletionRequest
):
    inputs = {}
    excludes = {"model", "stream", "messages", "prompt", "echo"}
    # Exclude non-sampling parameters so they aren't passed to vLLM
NOTE: May make more sense to explicitly "include" supported sampling parameters, but that would be of a similar length or longer. Both approaches likely require periodic updates to either include a new sampling field or exclude a non-sampling field.
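For comparison, a rough sketch of the allow-list alternative described in the note above; the field set and helper name here are illustrative assumptions, not the actual implementation:

```python
# Hypothetical allow-list variant of the exclude-based filtering above.
# The set below is illustrative; a real list would need to track vLLM's
# supported sampling fields and be updated as the OpenAI schema evolves.
SUPPORTED_SAMPLING_FIELDS = {
    "temperature",
    "top_p",
    "frequency_penalty",
    "presence_penalty",
    "max_tokens",
    "seed",
    "stop",
}


def extract_sampling_parameters(request) -> dict:
    # Keep only explicitly supported sampling fields that were actually set.
    params = request.model_dump(exclude_none=True)
    return {k: v for k, v in params.items() if k in SUPPORTED_SAMPLING_FIELDS}
```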
# Note: a decoupled model may send an empty final response after its data,
# so two responses can be valid for a single request.
if num_responses == 2 and responses[-1].final != True:
    raise Exception("Unexpected internal error with incorrect response flags")
if num_responses > 2:
    raise Exception(f"Unexpected number of responses: {num_responses}, expected 1.")
Maybe for a later relocation: why break these functions out into separate files? There are also a bunch of helper functions in triton_engine.py. And if there is a need to have multiple files for a particular engine implementation, I would group them together like:
engine
|----- triton
|------- __init__.py
|------- utils.py
Yeah, agreed - the separation of the utils/helpers definitely got a little blurry/messy; a restructure like this makes sense.
I would like to further restructure a few other things later as well.
I think eventually having the imports in a user-facing app look something like this would be more standardized:
from tritonserver.core import Server
from tritonserver.frontends.kserve import KServeGrpc
from tritonserver.frontends.openai import FastApiFrontend
from tritonserver.frontends.openai.engine import TritonLLMEngine
But will consider restructuring in general as a follow-up PR focused just on structure after some discussion.
Strong upvote for this feature to be enabled - it would be great and would indeed complete the whole functionality.
… rmccormick-openai
stages: [pre-commit]
verbose: true
require_serial: true
# FIXME: Only run on changed files when triggered by GitHub Actions
FYI @GuanLuo - I'll look into fixing this on GitHub Actions separately
Hi @chorus-over-flanger @faileon, thanks for expressing interest in this feature. Since the route itself doesn't have too many parameters defined in the spec, it may be relatively straightforward to add. If you have any particular models you'd like to see working as an example, let us know. If you have any feedback or interest in contributing to the project, let us know as well. Thanks!
Description
Adds an OpenAI Compatible Frontend for Triton Inference Server as a FastAPI application using the `tritonserver` in-process Python bindings for the following endpoints:
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

Additionally, there are some other observability routes exposed for convenience:
- `/metrics` - Prometheus-compatible metrics from Triton Core
- `/health/ready` - General health check for inference readiness, similar to the NIM schema.

This is a refactor and extension of the original example by @nnshah1 here: triton-inference-server/tutorials#104, extended to include more thorough testing.
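As a quick usage illustration (not taken from this PR), hitting the chat completions route with the `openai` Python client could look like the sketch below; the port, API key, and model name are placeholders that depend on how the frontend was started:

```python
from openai import OpenAI

# Point the client at the locally running frontend. The port and model name
# below are placeholders for whatever main.py was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is Triton Inference Server?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```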
Refactors/Changes
- The file structure was refactored to make each logical component smaller, more discoverable, and more digestible. It was loosely based on a few external references on structuring FastAPI projects.
- The global variables were converted to use the FastAPI `app` object for storing state instead, such as `app.server` to access the in-process `tritonserver` across different routes handling requests (see the sketch after this list).
- Encapsulated info related to model objects in a `ModelMetadata` dataclass.
- Made the `--tokenizer` an explicit setting provided by the user at startup, rather than something we try to infer based on the Triton model name used in the model repository. This is more resilient to BYO/custom models and models not produced by the Triton CLI.
- Added an optional explicit `--backend` (request conversion format) override to better deal with edge cases like LLMs defined in custom backends, or ambiguous backends like "python" for ensemble/BLS models where the request format isn't immediately clear from the backend choice. If the explicit override is not used, the frontend will currently pick the `vllm` request format if the targeted model's `backend=vllm`, otherwise it will pick `backend=tensorrtllm` for any other backend value.
- Extracted most of the Triton in-process Python API logic out of the FastAPI route definitions. The pre-initialized in-process tritonserver is attached to the FastAPI app during initialization in `main.py`. This will make it simpler to mock and unit test specific parts of the frontend, as well as to tear down and start up the frontend during testing without stopping/starting the tritonserver each time.
- Updated various parts of the generated `schemas/openai.py` that seemed to be generated for pydantic v1 (currently using v2) to account for the deprecation warnings for features that are currently deprecated but will be fully removed in coming versions of pydantic: https://docs.pydantic.dev/latest/migration/
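To make the app-state refactor above concrete, here is a minimal sketch of the pattern (not the exact code from this PR; `create_app`, the attribute names, and the response shape are illustrative assumptions):

```python
from fastapi import FastAPI, Request


def create_app(server, model_metadatas) -> FastAPI:
    # "server" is the pre-initialized in-process tritonserver created in main.py.
    # Attaching it to the app avoids module-level globals and lets tests build
    # the app around a mocked or shared server instance.
    app = FastAPI()
    app.server = server
    app.models = {m.name: m for m in model_metadatas}

    @app.get("/v1/models")
    def list_models(request: Request):
        # Routes reach shared state through the app object instead of globals.
        return {
            "object": "list",
            "data": [{"id": name, "object": "model"} for name in request.app.models],
        }

    return app
```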
Testing Methodology
- The FastAPI `TestClient` object, despite interacting with the application and logic as normal, does not actually expose the network host/ports for interaction from other clients. It would probably be good to move away from `TestClient` and use the `OpenAIServer` utility, described below, for all testing later on.
- Testing with the `openai` Python client library required an `OpenAIServer` utility class to run the application differently than the `TestClient` flow expects, similar to vLLM's utility here.
- Tests covering `streaming` with `TestClient` were sparse.
- Thorough testing of individual sampling parameters (`top_p`, `top_k`) was not done, and I would expect more to be tested by each backend implementation.

Open Questions
- Should `/v1/models` list all models in the model repository, even if only some of those models are actually compatible with the completions / chat endpoints? It may be difficult to automatically infer which models are "compatible" or not and account for all scenarios, so it may need to be explicit by the user if we want to limit which models are returned.
- A `SERVED_MODEL_NAME` for a cosmetic mapping from something like `tensorrt_llm_bls` -> `meta-llama/Meta-Llama-3.1-8b-Instruct`, to let the server/app starter specify how clients should interact with it. They may want to hide the details of ensemble/BLS/etc. from clients behind a more convenient `--served-model-name`.
- The `docker/Dockerfile*` files and the general expected user-facing workflow. I added these as examples for myself to use during testing/development and included how they'd be used in the README, but ultimately we will probably be publishing this code within the respective containers, and these DIY containers are likely unnecessary.

Open Items for this PR
- Handle `stop` values that are passed as a `List` when using the vLLM backend.

Open Items for follow-up PRs
- Use `genai-perf` to maintain compatibility and catch regressions.
- Support for `logit_bias`, `logprobs`, `n > 1`, and `best_of`. These are mostly just unexplored, and haven't yet been scoped out to see the effort involved in supporting them. (DLIS-7183, DLIS-7184)
- Support for `usage` (DLIS-7185).
- Convert the `TestClient` tests to use the `OpenAIServer` utility for all testing instead.

Notes
- Issues encountered along the way: `text/event-stream` headers missing for streaming, `temperature` being ignored by the TRT-LLM BLS, `content=None` for streaming messages causing genai-perf errors compared to `content=""`, certain sampling parameters being converted to incorrect types internally, etc.