Replies: 4 comments 3 replies
-
A proposed return type from mid_process() is a list of:

```rust
pub struct Branch {
    /// How many KV cache entries to remove before any new ones are added.
    pub backtrack: u32,
    /// If None, no sampling is performed.
    /// If Some(vob), only tokens from vob (a set of tokens represented as a bitvector) are allowed.
    pub sample_mask: Option<SimpleVob>,
    /// If no sampling is performed, there should be exactly one sequence of tokens to be appended.
    /// Otherwise, for every allowed token there can be a sequence that starts with that token -
    /// when that starting token is sampled, all the other tokens in that sequence are appended as well.
    pub ff_tokens: Vec<Vec<TokenId>>,
}
```

Sampling is performed as follows:

```rust
seq.pop_kv_cache(b.backtrack);
let to_append = match b.sample_mask {
    Some(mask) => {
        let tok = seq.sample_with_mask(mask);
        match b.ff_tokens.iter().find(|t| t[0] == tok) {
            Some(toks) => toks.clone(),
            None => vec![tok],
        }
    }
    None => {
        assert!(b.ff_tokens.len() == 1);
        b.ff_tokens[0].clone()
    }
};
seq.append_tokens(to_append);
```

Or the same thing in Python:

```python
class Branch:
    backtrack: int
    sample_mask: TokenSet | None
    ff_tokens: list[list[Token]]

def sample(branches: list[Branch], seq):
    if len(branches) == 0:
        seq.stop()
    else:
        seqs = [seq] + [seq.fork() for _ in range(len(branches) - 1)]
        for idx, (seq, b) in enumerate(zip(seqs, branches)):
            seq.branch_idx = idx
            seq.last_backtrack = b.backtrack
            seq.remove_tokens(b.backtrack)
            if b.sample_mask is None:
                assert len(b.ff_tokens) == 1
                seq.last_append = b.ff_tokens[0]
            else:
                tok = seq.sample_with_bias(b.sample_mask)
                ff_tokens = (t for t in b.ff_tokens if t[0] == tok)
                # if no fast-forward sequence starts with the sampled token,
                # append just the sampled token (matches the Rust version above)
                seq.last_append = next(ff_tokens, [tok])
            seq.append_tokens(seq.last_append)
            # we remember branch_idx, last_backtrack and last_append for
            # the next mid_process() invocation

# callback provided by the controller:
def mid_process(branch_idx: int, last_backtrack: int, last_append: list[Token]) -> list[Branch]:
    ...
```
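To make the ff_tokens semantics concrete, here is a hypothetical mid_process() return value; the token ids and the TokenSet constructor are made up for illustration. It allows sampling either a YES or a NO token, and if NO is sampled, the rest of ", thanks" is fast-forwarded deterministically:

```python
# hypothetical token ids, for illustration only
YES, NO, COMMA, THANKS = 345, 456, 11, 9921

def mid_process(branch_idx: int, last_backtrack: int, last_append: list[Token]) -> list[Branch]:
    b = Branch()
    b.backtrack = 0                      # keep the whole KV cache
    b.sample_mask = TokenSet([YES, NO])  # only these two tokens may be sampled
    # if NO is sampled, COMMA and THANKS are appended as well; no sequence
    # starts with YES, so sampling YES appends only that token
    b.ff_tokens = [[NO, COMMA, THANKS]]
    return [b]                           # a single branch -> no forking
```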
-
Another comment regarding performance: speculative decoding uses a draft model (10-100x smaller, with the same tokenizer) to generate a number of tokens (say 5 or 10), and then uses the main model to validate the draft model's guesses in parallel. When placing constraints on the output, we want to apply them to the small model as well as the main one. However, the time bounds on the draft model are going to be much tighter.
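For concreteness, a rough sketch of where the controller's token mask would have to be evaluated under speculative decoding; the model and controller APIs below are made up, and the acceptance check is simplified to greedy matching:

```python
import numpy as np

def speculative_step(draft_model, main_model, controller, tokens, k=5):
    # 1. draft phase: the controller mask is needed for every drafted token,
    #    so it runs on the draft model's much tighter per-token time budget
    drafted = []
    ctx = list(tokens)
    for _ in range(k):
        logits = draft_model.logits(ctx)                  # hypothetical API
        logits[~controller.token_mask(ctx)] = -np.inf     # constrain the draft model too
        tok = int(np.argmax(logits))                      # greedy, for simplicity
        drafted.append(tok)
        ctx.append(tok)

    # 2. verification phase: the main model scores all k positions in one
    #    batched forward pass; the same masks apply before accepting tokens
    main_logits = main_model.logits_batch(tokens, drafted)  # hypothetical API
    accepted = []
    ctx = list(tokens)
    for pos, tok in enumerate(drafted):
        logits = np.array(main_logits[pos], dtype=float)
        logits[~controller.token_mask(ctx)] = -np.inf
        if int(np.argmax(logits)) != tok:
            break                                          # first mismatch stops acceptance
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```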
-
To further avoid pipeline stalls, should
-
This has been implemented in #92. Note that pyctrl/jsctrl still use post_process() as a client-side abstraction. However, that post_process() method is only called from the Wasm-level mid_process() callback (either at the beginning, to handle the previous round, or at the end, when the appended tokens are deterministic).
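Conceptually, that wiring might look something like the sketch below; this is illustrative pseudocode, not the actual pyctrl/jsctrl implementation, and all names are assumptions:

```python
# Illustrative only - not the actual pyctrl/jsctrl code.
class ClientController:
    """The client-side abstraction still exposes the old callbacks."""
    def mid_process(self) -> "Branch": ...
    def post_process(self, backtrack: int, tokens: list[int]) -> None: ...

def wasm_mid_process(ctrl: ClientController, state: dict,
                     branch_idx, last_backtrack, last_append):
    # at the beginning: report the previous round, unless it was already
    # reported at the end of the previous callback (deterministic tokens)
    if not state.get("reported", False):
        ctrl.post_process(last_backtrack, last_append)
    branch = ctrl.mid_process()
    if branch.sample_mask is None:
        # at the end: tokens are fully deterministic, report them right away
        ctrl.post_process(0, branch.ff_tokens[0])
        state["reported"] = True
    else:
        state["reported"] = False
    return [branch]
```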
-
The `pre_process` and `post_process` callbacks currently run in the critical path of inference - we have measured the overhead at about 0.3 ms per token in rLLM, but it may be worse with Python-based LLM infrastructure. The overhead is primarily the inter-process communication delay (especially the fact that the OS can decide to de-schedule one of the involved processes).

A solution would be to keep only the `mid_process` callback, and add possible return values for it indicating that further generation needs to be forked, or that the current token needs to be discarded (the latter is already more or less supported via backtrack=1).

The downside is that certain operations may incur a one-token overhead in some cases:
Many of these can be mitigated to some extent (e.g., when requesting a fork we could return splice commands for each branch; when returning a small set of allowed tokens, we could say "if token X is selected, then fast-forward by YZW").
It would also be impossible to directly implement lock-step generation between different forks, making certain beam-search approaches harder.
The advantage is a much simpler interface and no overhead, which might be an easier sell for LLM infrastructure folks.
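A minimal sketch of what such `mid_process`-only return values could look like; the names and shapes below are purely illustrative, not the interface that was eventually adopted:

```python
from dataclasses import dataclass, field

@dataclass
class BranchSpec:
    # None means no constraint; otherwise only these token ids may be sampled
    allowed_tokens: set[int] | None = None
    # conditional fast-forward: "if token X is selected, splice in Y, Z, W"
    splices: dict[int, list[int]] = field(default_factory=dict)

@dataclass
class MidProcessResult:
    stop: bool = False          # stop the sequence entirely
    backtrack: int = 0          # discard this many recent tokens (1 = discard current)
    # one entry = continue normally; several entries = fork generation
    branches: list[BranchSpec] = field(default_factory=list)
```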