feat: OpenAI Compatible Frontend #7561
Conversation
…ing for utilities and models routes
… speed up testing, next step add tests for /v1/models routes
…, need to refactor for including server startup
…ing server only once per test class, skip openai client tests
…the default value
… some placeholder tests for future feature support
…oad logic to track/list all models for now, marked xfail test for known issue with TRT-LLM temperature, added logic to support testing both TRT-LLM and vLLM based on environment, added openai dep to Dockerfile but skipping openai tests for now
…add OpenAIServer utility for testing without FastAPI TestClient, rename folder from example to openai for clarity that the source code isn't an example, add some usage examples with curl, genai-perf, and openai client, add --tokenizer to main.py
…ting it at model load time. Cleanup completions tests
…a decoupled model with an empty final response, add response validation for this scenario
…fixtures and logic
…ut README accordingly
…to not use dockerfiles
…ude more openai fields explicitly from sampling parameters input to vllm backend
@@ -36,14 +36,34 @@ def _create_vllm_inference_request(
    model, prompt, request: CreateChatCompletionRequest | CreateCompletionRequest
):
    inputs = {}
    excludes = {"model", "stream", "messages", "prompt", "echo"}
    # Exclude non-sampling parameters so they aren't passed to vLLM
NOTE: May make more sense to explicitly "include" supported sampling parameters, but that would be of a similar length or longer. Both approaches likely require periodic updates to either include a new sampling field or exclude a non-sampling field.
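For comparison, a rough sketch of the allow-list alternative described in the note above; the field set and helper name here are illustrative assumptions, not the actual implementation:

```python
# Hypothetical allow-list variant of the exclude-based filtering above.
# The set below is illustrative; a real list would need to track vLLM's
# supported sampling fields and be updated as the OpenAI schema evolves.
SUPPORTED_SAMPLING_FIELDS = {
    "temperature",
    "top_p",
    "frequency_penalty",
    "presence_penalty",
    "max_tokens",
    "seed",
    "stop",
}


def extract_sampling_parameters(request) -> dict:
    # Keep only explicitly supported sampling fields that were actually set.
    params = request.model_dump(exclude_none=True)
    return {k: v for k, v in params.items() if k in SUPPORTED_SAMPLING_FIELDS}
```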
# Note: a decoupled model may send an empty final response after its data,
# so two responses can be valid for a single request.
if num_responses == 2 and responses[-1].final != True:
    raise Exception("Unexpected internal error with incorrect response flags")
if num_responses > 2:
    raise Exception(f"Unexpected number of responses: {num_responses}, expected 1.")
Maybe for a later relocation: why break these functions out into separate files? There are also a bunch of helper functions in triton_engine.py. And if there is a need to have multiple files for a particular engine implementation, I would group them together like:
engine
|----- triton
|------- __init__.py
|------- utils.py
Yeah, agreed - the separation of the utils/helpers definitely got a little blurry/messy; a restructure like this makes sense.
I would like to further restructure a few other things later as well.
I think eventually having the imports in a user-facing app look something like this would be more standardized:
from tritonserver.core import Server
from tritonserver.frontends.kserve import KServeGrpc
from tritonserver.frontends.openai import FastApiFrontend
from tritonserver.frontends.openai.engine import TritonLLMEngine
But will consider restructuring in general as a follow-up PR focused just on structure after some discussion.
Strong upvote for this feature to be enabled - it would be great and would indeed complete the whole functionality.
… rmccormick-openai
stages: [pre-commit]
verbose: true
require_serial: true
# FIXME: Only run on changed files when triggered by GitHub Actions
FYI @GuanLuo - I'll look into fixing this on GitHub Actions separately
Hi @chorus-over-flanger @faileon, thanks for expressing interest in this feature. Since the route itself doesn't have too many parameters defined in the spec, it may be relatively straightforward to add. If you have any particular models you'd like to see working as an example, let us know. If you have any feedback or interest in contributing to the project, let us know as well. Thanks!
Description
Adds an OpenAI Compatible Frontend for Triton Inference Server as a FastAPI application using the `tritonserver` in-process Python bindings for the following endpoints:
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

Additionally, there are some other observability routes exposed for convenience:
- `/metrics` - Prometheus-compatible metrics from Triton Core
- `/health/ready` - General health check for inference readiness, similar to the NIM schema.

This is a refactor and extension of the original example by @nnshah1 here: triton-inference-server/tutorials#104, extended to include more thorough testing.
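As a quick usage illustration (not taken from this PR), hitting the chat completions route with the `openai` Python client could look like the sketch below; the port, API key, and model name are placeholders that depend on how the frontend was started:

```python
from openai import OpenAI

# Point the client at the locally running frontend. The port and model name
# below are placeholders for whatever main.py was launched with.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is Triton Inference Server?"}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```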
Refactors/Changes
- The file structure was refactored to make each logical component smaller, more discoverable, and more digestible. It was loosely based on a few external references on structuring FastAPI projects.
- The global variables were converted to use the FastAPI `app` object for storing state instead, such as `app.server` to access the in-process `tritonserver` across different routes handling requests (see the sketch after this list).
- Encapsulated info related to model objects in a `ModelMetadata` dataclass.
- Made the `--tokenizer` an explicit setting provided by the user at startup, rather than something we try to infer based on the Triton model name used in the model repository. This is more resilient to BYO/custom models and models not produced by the Triton CLI.
- Added an optional explicit `--backend` (request conversion format) override to better deal with edge cases like LLMs defined in custom backends, or ambiguous backends like "python" for ensemble/BLS models where the request format isn't immediately clear from the backend choice. If the explicit override is not used, the frontend will currently pick the `vllm` request format if the targeted model's `backend=vllm`, otherwise it will pick `backend=tensorrtllm` for any other backend value.
- Extracted most of the Triton in-process Python API logic out of the FastAPI route definitions. The pre-initialized in-process tritonserver is attached to the FastAPI app during initialization in `main.py`. This will make it simpler to mock and unit test specific parts of the frontend, as well as to tear down and start up the frontend during testing without stopping/starting the tritonserver each time.
- Updated various parts of the generated `schemas/openai.py` that seemed to be generated for pydantic v1 (currently using v2) to account for the deprecation warnings for features that are currently deprecated but will be fully removed in coming versions of pydantic: https://docs.pydantic.dev/latest/migration/
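To make the app-state refactor above concrete, here is a minimal sketch of the pattern (not the exact code from this PR; `create_app`, the attribute names, and the response shape are illustrative assumptions):

```python
from fastapi import FastAPI, Request


def create_app(server, model_metadatas) -> FastAPI:
    # "server" is the pre-initialized in-process tritonserver created in main.py.
    # Attaching it to the app avoids module-level globals and lets tests build
    # the app around a mocked or shared server instance.
    app = FastAPI()
    app.server = server
    app.models = {m.name: m for m in model_metadatas}

    @app.get("/v1/models")
    def list_models(request: Request):
        # Routes reach shared state through the app object instead of globals.
        return {
            "object": "list",
            "data": [{"id": name, "object": "model"} for name in request.app.models],
        }

    return app
```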
Testing Methodology
- The FastAPI `TestClient` object, despite interacting with the application and logic as normal, does not actually expose the network host/ports for interaction from other clients. It would probably be good to move away from `TestClient` and use the `OpenAIServer` utility, described below, for all testing later on.
- Testing with the `openai` Python client library required an `OpenAIServer` utility class to run the application differently than the `TestClient` flow expects, similar to vLLM's utility here.
- Tests covering `streaming` with `TestClient` were sparse.
- Thorough testing of individual sampling parameters (`top_p`, `top_k`) was not done, and I would expect more to be tested by each backend implementation.

Open Questions
- Should `/v1/models` list all models in the model repository, even if only some of those models are actually compatible with the completions / chat endpoints? It may be difficult to automatically infer which models are "compatible" or not and account for all scenarios, so it may need to be explicit by the user if we want to limit which models are returned.
- A `SERVED_MODEL_NAME` for a cosmetic mapping from something like `tensorrt_llm_bls` -> `meta-llama/Meta-Llama-3.1-8b-Instruct`, to let the server/app starter specify how clients should interact with it. They may want to hide the details of ensemble/BLS/etc. from clients behind a more convenient `--served-model-name`.
- The `docker/Dockerfile*` files and the general expected user-facing workflow. I added these as examples for myself to use during testing/development and included how they'd be used in the README, but ultimately we will probably be publishing this code within the respective containers, and these DIY containers are likely unnecessary.

Open Items for this PR
- Handle `stop` values that are passed as a `List` when using the vLLM backend.

Open Items for follow-up PRs
- Use `genai-perf` to maintain compatibility and catch regressions.
- Support for `logit_bias`, `logprobs`, `n > 1`, and `best_of`. These are mostly just unexplored, and haven't yet been scoped out to see the effort involved in supporting them. (DLIS-7183, DLIS-7184)
- Support for `usage` (DLIS-7185).
- Convert the `TestClient` tests to use the `OpenAIServer` utility for all testing instead.

Notes
- Issues encountered along the way: `text/event-stream` headers missing for streaming, `temperature` being ignored by the TRT-LLM BLS, `content=None` for streaming messages causing genai-perf errors compared to `content=""`, certain sampling parameters being converted to incorrect types internally, etc.