Suggestion / Question for Refactoring to Enable Multiple Generation #501

lapp0 · 2024-01-03T20:50:04Z

lapp0
Jan 3, 2024

Efficient Caching

In the existing outlines codebase, the RegexLogitsProcessor is called with seq_id, which identifies the sequence and therefore maps to a distinct FSM state: https://github.com/outlines-dev/outlines/blob/0355ab4272a5d7e4d94c4a53a52593f885b81a61/outlines/serve/vllm.py#L68-L70

_apply_logits_processors is patched in outlines to pass seq_id.

This works, but if we used tuple(input_ids) as a distinct key, we would be able to prevent repeated work for a given (fsm, input_ids).

Additionally, we would no longer need to patch _apply_logits_processors, which would prevent problems if vLLM's implementation of the function changes.

Further, caching FSM state by input_ids may allow us to efficiently manage the FSM internally within the logits processor as described in the next section.
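
A minimal sketch of the caching idea, under stated assumptions: `ToyFSM` and `CachedFSMLogitsProcessor` are hypothetical names and not part of outlines or vLLM, `input_ids` is assumed to hold only the generated tokens, and plain Python lists stand in for logits tensors. The state for a sequence is looked up by `tuple(input_ids)`; on a miss, it is advanced from the longest prefix already in the cache.

```python
import math
from typing import Dict, List, Sequence, Tuple


class ToyFSM:
    """Hypothetical stand-in for a regex FSM: (state, token_id) -> next state."""

    def __init__(self, transitions: Dict[Tuple[int, int], int], initial_state: int = 0):
        self.transitions = transitions
        self.initial_state = initial_state

    def next_state(self, state: int, token_id: int) -> int:
        return self.transitions[(state, token_id)]

    def allowed_token_ids(self, state: int) -> List[int]:
        return sorted(tok for (s, tok) in self.transitions if s == state)


class CachedFSMLogitsProcessor:
    """Hypothetical processor that keys FSM state on tuple(input_ids), not seq_id."""

    def __init__(self, fsm: ToyFSM):
        self.fsm = fsm
        # The empty sequence always maps to the initial state.
        self.state_cache: Dict[Tuple[int, ...], int] = {(): fsm.initial_state}

    def _state_for(self, input_ids: Sequence[int]) -> int:
        key = tuple(input_ids)
        if key in self.state_cache:
            return self.state_cache[key]
        # Walk back to the longest cached prefix; () guarantees termination.
        start = len(key)
        while key[:start] not in self.state_cache:
            start -= 1
        state = self.state_cache[key[:start]]
        # Advance token by token, caching every intermediate prefix.
        for end in range(start + 1, len(key) + 1):
            state = self.fsm.next_state(state, key[end - 1])
            self.state_cache[key[:end]] = state
        return state

    def __call__(self, input_ids: Sequence[int], scores: List[float]) -> List[float]:
        allowed = set(self.fsm.allowed_token_ids(self._state_for(input_ids)))
        # Disallowed tokens get -inf so they can never be sampled.
        return [s if i in allowed else -math.inf for i, s in enumerate(scores)]
```

In this sketch the cache grows without bound, so a real implementation would likely need an eviction policy. Because the key is the token sequence itself, two sequences that share a prefix also share the cached states for that prefix.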

Multiple Sampling By Eliminating Outlines Generation Logic

I'm not fluent in the outlines codebase, so I'd appreciate it if you could point out what I'm missing.

My understanding is that outlines allows vLLM's AsyncEngine to handle all KV management and simply provides a logits processor: https://github.com/outlines-dev/outlines/blob/0355ab4272a5d7e4d94c4a53a52593f885b81a61/outlines/serve/serve.py#L80

This is a clean paradigm and it makes sense to me that outlines would leave KV cache management to transformers as well, and simply provide an efficient logits processor.

Regarding outlines/outlines/generate/api.py, why wouldn't we be able to simplify SequenceGenerator so it allows transformers to do the bulk of the work in managing generation? It appears that we would be able to provide transformers with a logits processor which retrieves the cached FSM state as fn(generation_token_ids) rather than tracking and manipulating FSMState within SequenceGenerator.
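
A rough sketch of the "FSM state as fn(generation_token_ids)" idea, under stated assumptions: the transition table, `fsm_state`, and `mask_batch` are hypothetical and not part of outlines. It mirrors the shape of a transformers logits processor call, where `input_ids` is a batch that includes the prompt and `scores` has one row per sequence, but uses plain lists in place of tensors.

```python
import math
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical toy transition table: (state, token_id) -> next state.
TRANSITIONS: Dict[Tuple[int, int], int] = {(0, 0): 1, (0, 1): 1, (1, 2): 0}
INITIAL_STATE = 0


@lru_cache(maxsize=None)
def fsm_state(generated_ids: Tuple[int, ...]) -> int:
    """FSM state as a pure, memoized function of the generated token ids."""
    if not generated_ids:
        return INITIAL_STATE
    # The prefix state is normally already memoized from the previous step.
    previous = fsm_state(generated_ids[:-1])
    return TRANSITIONS[(previous, generated_ids[-1])]


def mask_batch(
    input_ids: List[List[int]], scores: List[List[float]], prompt_len: int
) -> List[List[float]]:
    """Mask each row of a batch; rows may be independent samples of one prompt."""
    masked = []
    for row_ids, row_scores in zip(input_ids, scores):
        # Drop the prompt so the key depends only on generated tokens.
        state = fsm_state(tuple(row_ids[prompt_len:]))
        allowed = {tok for (s, tok) in TRANSITIONS if s == state}
        masked.append(
            [x if i in allowed else -math.inf for i, x in enumerate(row_scores)]
        )
    return masked
```

Since no per-sequence state lives outside the memoized function, several samples drawn from the same prompt can be masked in one call without any bookkeeping in a generator class. The recursion assumes tokens arrive one step at a time; a cold cache on a very long sequence would recurse once per token.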

rlouf · 2024-01-04T12:08:39Z

rlouf
Jan 4, 2024
Maintainer

> This works, but if we used tuple(input_ids) as a distinct key, we would be able to prevent repeated work for a given (fsm, input_ids)

This is a good idea and would simplify the code substantially. Do you want to open a PR?

1 reply

lapp0 Jan 4, 2024
Author

I should be able to do this sometime soon.

rlouf · 2024-01-04T12:10:13Z

rlouf
Jan 4, 2024
Maintainer

> This is a clean paradigm and it makes sense to me that outlines would leave KV cache management to transformers as well, and simply provide an efficient logits processor.

We have features on the roadmap that require us to do that, like sampling algorithms not available in any library or better cache management.

5 replies

lapp0 Jan 4, 2024
Author

Could you link me to the roadmap? I can't find anything related to a roadmap via search. (Or, if you have yet to write one, ping me when you do.)

lapp0 Jan 4, 2024
Author

Also, I'm curious about the design philosophy. Considering that KV cache management is an integral function of inference libraries, where do the responsibilities of the inference library (transformers, vLLM) end and those of outlines begin as outlines is developed further?

rlouf Jan 9, 2024
Maintainer

I haven't had the time to write a roadmap, but the design philosophy is simple:

- Guided generation
- Better sampling method (coming soon)

Anything that is not implemented at the level of inference libraries we will implement in Outlines for now. For instance, for some sampling methods we will need to improve the way the cache is currently managed. Now, if some of that can be implemented upstream later, all the better!

The original goal was to only support transformers and openai (for comparison reasons), but the method has become popular enough for people to request other integrations which we happily support.

rlouf Jan 18, 2024
Maintainer

We now have a roadmap.

lapp0 Jan 24, 2024
Author

Extremely exciting stuff! With correct / efficient grammar-constrained generation, token healing, function calling, and advanced sampling methods Outlines will be a standard in the LLM space.

Thanks for sharing.
