For the streaming case, we cannot clamp the generated tokens and recompute them. Moreover, since the clamping logic is done in the worker but not in the main process, a discrepancy arises between the main process and the worker. See #158 and #164.

We need to either

* Require that generation never grows beyond `max_num_batched_tokens` (see the sketch below), or
* Recover a long evicted request with the `evaluate_multi_query` function from "Add new Relax function to the batched model for evaluating query tokens over multiple time steps in parallel" (#156).

@elvin-n @sunggg
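As a rough sketch of the first option, the engine could enforce at admission time that a request can never need more than `max_num_batched_tokens` tokens in total, so an evicted request always fits into a single recovery batch and the main process and the worker never disagree about clamping. The names below (`EngineConfig`, `Request`, `validate_request`) are hypothetical and not part of the actual mlc_serve API:

```python
# Hypothetical sketch: reject requests at admission time whose worst-case
# token count (prompt + maximum generated tokens) could exceed
# max_num_batched_tokens. These class/function names are illustrative only.
from dataclasses import dataclass
from typing import List


@dataclass
class EngineConfig:
    max_num_batched_tokens: int


@dataclass
class Request:
    prompt_token_ids: List[int]
    max_new_tokens: int


def validate_request(req: Request, cfg: EngineConfig) -> None:
    # If this invariant holds, an evicted request can always be restored by
    # recomputing its tokens in a single batch, so no clamping is needed in
    # the worker and the main process stays consistent with it.
    worst_case = len(req.prompt_token_ids) + req.max_new_tokens
    if worst_case > cfg.max_num_batched_tokens:
        raise ValueError(
            f"Request may need up to {worst_case} tokens, which exceeds "
            f"max_num_batched_tokens={cfg.max_num_batched_tokens}."
        )
```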
…/ Update config name (octoml#163)
This PR updates three places for a better experience.
* Unify the `--model-path` and `--model` args in build.py. Now we only take `--model`.
* Hardcode the rotary embedding size for LLaMA to 2048. This enables us to build a model with a different max sequence length without changing the built weights.
* Update the generated config file name to `mlc-chat-config.json`.
masahi changed the title from "[Bug] Recovering logic of a long evicted request is broken for streaming case" to "[Bug] Recovering logic of a long evicted request is broken" on Feb 1, 2024.
@elvin-n After #157 lands, you can follow a similar strategy and use multiple `EvalMultiQueryRequest`s to split the restoration of a long request into several batches, each of which fits into `max_num_batched_tokens`.
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L385-L399
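A minimal sketch of that strategy, assuming the long request's tokens can be recomputed chunk by chunk: one `EvalMultiQueryRequest`-like request per chunk, each no larger than `max_num_batched_tokens`. The `make_eval_multi_query_request` callable and the chunking helper below are hypothetical, not the actual engine API:

```python
# Illustrative sketch: restore a long evicted request in several chunks,
# each of which fits into max_num_batched_tokens. Helper names are
# hypothetical; only the overall splitting strategy mirrors the comment above.
from typing import Callable, List


def chunk_token_ids(token_ids: List[int], max_num_batched_tokens: int) -> List[List[int]]:
    """Split the tokens to restore into consecutive chunks that each fit one batch."""
    return [
        token_ids[i : i + max_num_batched_tokens]
        for i in range(0, len(token_ids), max_num_batched_tokens)
    ]


def restore_in_batches(
    token_ids: List[int],
    max_num_batched_tokens: int,
    make_eval_multi_query_request: Callable[[List[int]], object],
) -> List[object]:
    # Build one EvalMultiQueryRequest-like object per chunk. Earlier chunks
    # only repopulate the KV cache; the last chunk's output is what decoding
    # resumes from.
    return [
        make_eval_multi_query_request(chunk)
        for chunk in chunk_token_ids(token_ids, max_num_batched_tokens)
    ]
```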