From ce184ed8d0a02cb7e9b4c94d2965c59ad950966b Mon Sep 17 00:00:00 2001
From: Michal Moskal
Date: Mon, 23 Dec 2024 12:55:34 +0000
Subject: [PATCH] update llg timing info

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2f144b4..8a26349 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ This server is similar in spirit to the [TensorRT-LLM OpenAI server example](htt
 
 The sampling can be constrained by the [Low-Level Guidance library](https://github.com/microsoft/llguidance), part of the [Guidance project](https://github.com/guidance-ai/guidance). While TensorRT computes logits (token probabilities) for the next token, the llguidance library computes a set of tokens allowed by the grammar (whether JSON schema, regular expression, or a full context-free grammar (CFG)) in the form of a bitmask. When both the logits and bitmask are ready, a custom CUDA kernel applies the mask to the logits, and the result is used for sampling inside of TensorRT-LLM.
 
-There is no significant startup cost for all realistic sizes of grammars (no measurable impact on time to first token (TTFT)). The overhead on generation speed (median time between tokens (TBT)) is typically 1-3%. The mask computation takes on the order of 1 ms of single-core CPU time per token per sequence in the batch. Thus, with 16 cores and a TBT of around 10 ms, batch sizes of up to 160 are not CPU-bound. Typically, the unconstrained TBT is higher at such batch sizes, and more cores are available, so batch size is not a problem in production.
+There is no significant startup cost for all realistic sizes of grammars (no measurable impact on time to first token (TTFT)). The overhead on generation speed (median time between tokens (TBT)) is typically 1-3% (and comes mostly from applying the masking kernel on the GPU). The mask computation takes on the order of 100 µs of single-core CPU time per token per sequence in the batch. Thus, with 16 cores and a TBT of around 10 ms, batch sizes of up to 1600 are not CPU-bound. Typically, the unconstrained TBT is higher at such batch sizes, and more cores are available, so batch size is not a problem in production.
 
 This approach differs from [Outlines](https://github.com/dottxt-ai/outlines) (which pre-computes masks, resulting in a startup cost and limits on schema complexity) and is more similar in spirit to [llama.cpp grammars](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), though it is much faster due to the use of a custom lexer with [derivative-based regexes](https://github.com/microsoft/derivre), an Earley parser, and a [highly optimized](https://github.com/microsoft/llguidance/blob/main/docs/toktrie.md) token prefix tree.
 
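For context on the masking step the patched paragraph refers to: llguidance hands the sampler a packed bitmask over the vocabulary for each sequence in the batch, and a small GPU kernel sets the logits of disallowed tokens to -inf before sampling. The kernel below is a minimal hypothetical sketch, not the actual TensorRT-LLM or llguidance code; the function name and the 32-tokens-per-word mask layout are assumptions made for illustration.

```cuda
// Hypothetical sketch of applying a token bitmask to logits (not the actual
// TensorRT-LLM / llguidance kernel). Assumes the mask is packed 32 tokens per
// uint32_t word, with one mask row per sequence in the batch.
#include <cstdint>
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void apply_token_mask(float* logits,        // [batch, vocab]
                                 const uint32_t* mask, // [batch, ceil(vocab/32)]
                                 int vocab_size) {
    int seq = blockIdx.y;                                // sequence index in batch
    int tok = blockIdx.x * blockDim.x + threadIdx.x;     // token id
    if (tok >= vocab_size) return;

    int words_per_seq = (vocab_size + 31) / 32;
    uint32_t word = mask[seq * words_per_seq + tok / 32];
    bool allowed = (word >> (tok % 32)) & 1u;

    // Disallowed tokens get -inf so they can never be sampled.
    if (!allowed) logits[seq * vocab_size + tok] = -CUDART_INF_F;
}

// Example host-side launch for a batch of `batch_size` sequences:
//   dim3 block(256);
//   dim3 grid((vocab_size + block.x - 1) / block.x, batch_size);
//   apply_token_mask<<<grid, block>>>(d_logits, d_mask, vocab_size);
```

The batch-size figure in the updated paragraph follows from the stated numbers: at roughly 100 µs of single-core CPU time per sequence per token, 16 cores provide about 16 × 10 ms / 0.1 ms ≈ 1600 sequences' worth of mask computation within a 10 ms token step.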