Suggestion / Question for Refactoring to Enable Multiple Generation #501
lapp0 started this conversation in Feature requests
Replies: 2 comments 6 replies
-
This is a good idea and would simplify the code substantially. Do you want to open a PR?
1 reply
-
We have features on the roadmap that require us to do that, like sampling algorithms not available in any library or better cache management.
5 replies
-
Continuation from #416 (comment)
Efficient Caching
In the existing outlines codebase, the RegexLogitsProcessor is called with seq_id, which corresponds to a distinct FSM state: https://github.com/outlines-dev/outlines/blob/0355ab4272a5d7e4d94c4a53a52593f885b81a61/outlines/serve/vllm.py#L68-L70

_apply_logits_processors is patched in outlines to pass seq_id.

This works, but if we used tuple(input_ids) as a distinct key, we would be able to prevent repeated work for a given (fsm, input_ids).

Additionally, we would no longer need to patch _apply_logits_processors, which would prevent problems if vLLM's implementation of the function changed.

Further, caching FSM state by input_ids may allow us to efficiently manage the FSM internally within the logits processor, as described in the next section.

Multiple Sampling By Eliminating Outlines Generation Logic
I'm not fluent in the outlines codebase, so if you could point out what I'm missing, I'd appreciate it.

My understanding is that outlines lets vLLM's AsyncEngine handle all KV cache management and simply provides a logits processor: https://github.com/outlines-dev/outlines/blob/0355ab4272a5d7e4d94c4a53a52593f885b81a61/outlines/serve/serve.py#L80

This is a clean paradigm, and it makes sense to me that outlines would leave KV cache management to transformers as well, and simply provide an efficient logits processor.

Regarding outlines/outlines/generate/api.py, why wouldn't we be able to simplify SequenceGenerator so that it lets transformers do the bulk of the work in managing generation? It appears that we would be able to provide transformers with a logits processor which retrieves the cached FSM state as fn(generation_token_ids), rather than tracking and manipulating FSMState within SequenceGenerator.
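
To make the proposal concrete, here is a rough sketch of the idea, not actual outlines or vLLM code. PrefixCachedFSM, constrained_logits, and the toy transition table are hypothetical names invented for illustration, and a plain Python list stands in for the logits tensor. It assumes the processor is called with only the generated token ids (no prompt tokens) and that the FSM can be expressed as a state-to-token-to-state table.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy FSM representation: transitions[state][token_id] -> next_state.
Transitions = Dict[int, Dict[int, int]]


class PrefixCachedFSM:
    """Resolves a generated-token prefix to an FSM state, cached by tuple(token_ids)."""

    def __init__(self, transitions: Transitions, initial_state: int = 0) -> None:
        self.transitions = transitions
        # The empty prefix always maps to the initial state.
        self.cache: Dict[Tuple[int, ...], int] = {(): initial_state}
        self.steps = 0  # counts FSM transitions actually computed (for illustration)

    def state_for(self, token_ids: Sequence[int]) -> int:
        key = tuple(token_ids)
        # Find the longest prefix that is already cached; () guarantees termination.
        n = len(key)
        while key[:n] not in self.cache:
            n -= 1
        state = self.cache[key[:n]]
        # Advance only over the uncached suffix and remember each intermediate state.
        # A KeyError here means a token was sampled that the FSM does not allow.
        for i in range(n, len(key)):
            state = self.transitions[state][key[i]]
            self.steps += 1
            self.cache[key[: i + 1]] = state
        return state

    def allowed_tokens(self, token_ids: Sequence[int]) -> List[int]:
        # States without outgoing transitions allow no further tokens.
        return sorted(self.transitions.get(self.state_for(token_ids), {}))


def constrained_logits(
    fsm: PrefixCachedFSM, generated_token_ids: Sequence[int], logits: List[float]
) -> List[float]:
    """Stateless processor: the FSM state is a pure function of the generated tokens."""
    allowed = set(fsm.allowed_tokens(generated_token_ids))
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]


if __name__ == "__main__":
    # Token ids 0, 1, 2 stand for "a", "b", "c"; the table accepts a b* c.
    fsm = PrefixCachedFSM({0: {0: 1}, 1: {1: 1, 2: 2}})
    # Two samples sharing the prefix [0] reuse the cached state for that prefix.
    print(constrained_logits(fsm, [0], [0.1, 0.2, 0.3]))
    print(constrained_logits(fsm, [0, 1], [0.1, 0.2, 0.3]))
    print(constrained_logits(fsm, [0, 2], [0.1, 0.2, 0.3]))
    print("transitions computed:", fsm.steps)
```

Because the processor holds no per-sequence state, nothing needs to be keyed by seq_id: several samples drawn from the same prompt simply hit the same cache entries for their shared prefix, and each new token costs one transition. A real implementation would need to bound the cache size, which this sketch does not attempt.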