Graceful exit if model_max_length is exceeded #359

Open
shlomenu opened this issue Aug 20, 2024 · 0 comments
Comments

@shlomenu

It's somewhat challenging to present a minimal working example of the problem, but I hope the following snippet offers some useful context:

import asyncstdlib
import lmql

@lmql.query
def annotate(prefix, words, annotator, initial, terminal):
    """lmql
    # feed the fixed prefix into the annotator's token bookkeeping
    annotator.digest(prefix)
    "{prefix}"
    async for i, (word, sep) in asyncstdlib.enumerate(words, start=1):
        annotator.digest(word, sep)
        tags, max_tokens = annotator.get_tags()
        # bail out before the prompt can outgrow the model's context
        if annotator.digested + max_tokens >= annotator.max_digestable:
            break
        elif i == 2 and initial:
            annotator.progress_region()
        elif i == len(words) and terminal:
            annotator.progress_region()
        "{word}[@annotator.postprocess TAG]{sep}" where TAG in tags
    """

I am using LMQL for a span annotation task in which the generative model does not need to produce any text beyond a small set of meta-tokens marking the opening and closing of spans. LMQL gives me large savings here, since the vast majority of the text passes through unchanged and the tokens that are added always come from a very narrow set.

However, I find that even when the passage I submit to the LMQL server is shorter than the context length, unless it occupies at most 50% of the model's context there are edge cases in which I get a CUDA indexing error because LMQL supplies the model with too large a prompt. This restriction is wasteful and unnecessarily limiting in my setting: during training I know I am leaving enough spare context for the correct span annotations to fit. I want the prompt learning technique I'm using (which adjusts prefix in the code above) to adapt to using only as much of the context as it actually needs for the model to reproduce the correct spans. Accordingly, I would like LMQL to assess dynamically whether the whole context has been occupied and exit gracefully when it has, rather than simply passing on the raw CUDA error.

I think this is a problem that LMQL users working with long documents would frequently encounter, as would anyone in any setting where they actually run up against context size limits. They can work around it the way I did, but that produces nasty and seemingly pointless code duplication: the user-side LMQL code has to maintain a duplicate tokenizer just to keep checking when to abort. This behavior seems like it would be trivial to implement at the lowest level (simply don't call generate with too large a context). It would then hold by default that an LMQL query cannot overload the context of the model doing the generation, which would be a nice guarantee for LMQL itself to offer rather than something each user has to re-engineer.
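To make the request concrete, here is a rough sketch of the kind of guard I have in mind. It is not based on LMQL's internals; the names (ensure_prompt_fits, ContextLengthExceeded) are hypothetical, and it is essentially the same check I currently duplicate client-side with a second tokenizer: count the tokenized prompt against the tokenizer's model_max_length before anything is handed to generate.

    from transformers import AutoTokenizer, PreTrainedTokenizerBase

    class ContextLengthExceeded(Exception):
        """Raised instead of letting an out-of-range index reach the CUDA kernel."""

    def ensure_prompt_fits(tokenizer: PreTrainedTokenizerBase,
                           prompt: str,
                           max_new_tokens: int) -> None:
        # tokens the prompt already occupies
        n_prompt = len(tokenizer(prompt)["input_ids"])
        # refuse to generate if prompt + requested continuation cannot fit
        if n_prompt + max_new_tokens > tokenizer.model_max_length:
            raise ContextLengthExceeded(
                f"prompt occupies {n_prompt} tokens; with up to {max_new_tokens} "
                f"new tokens this exceeds model_max_length "
                f"({tokenizer.model_max_length})"
            )

    # client-side usage with a placeholder model id
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ensure_prompt_fits(tokenizer, "some long passage ...", max_new_tokens=16)

If LMQL performed a check like this itself and raised a catchable exception (or returned the partial result), queries like the one above could drop the duplicate tokenizer bookkeeping in annotator entirely.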
