-
I have experienced this as well. In the end, what has worked best for me is fine-tuning on a few thousand examples of the expected behavior, with clear examples of the output grammar, using an instruction model as the base. The grammar-forcing option will then make sure the model writes the appropriate symbols every time. Having a parser send an invalid generation back to the model for fixing will mostly fail if the model hasn't been properly trained for the task.
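A rough sketch of the kind of setup I mean, using the llama-cpp-python bindings; the model path and the tiny GBNF grammar below are placeholders for illustration, not the exact setup I use:

```python
# Minimal sketch: grammar-forced generation with llama-cpp-python.
# The model path and grammar are illustrative placeholders.
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="fine-tuned-model.gguf")  # instruction model fine-tuned on the target format

# A tiny GBNF grammar that only admits a flat JSON object with string values.
grammar = LlamaGrammar.from_string(r'''
root   ::= "{" ws pair (ws "," ws pair)* ws "}"
pair   ::= string ws ":" ws string
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
''')

out = llm(
    "Extract the user's name and city as JSON:\nAlice lives in Oslo.\n",
    grammar=grammar,   # the sampler masks any token that would violate the grammar
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```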
-
I'm digging through everything llama.cpp (day 4 for me) and I've come across the grammar-based sampling work. I'm just curious how well it works for non-instruction-tuned models?
I can dig a little deeper into where I'm going with this line of questioning, but for a bit of background... I've done a fair amount of work in this space on the validation side of things in my AlphaWave client library and in the various writings I've done around improving function calling reliability. Prior to leaving Microsoft, I also helped the TypeChat team refine their repair loop strategy, so this is a topic I have particular interest in.
I spent a lot of time working on techniques for repairing malformed responses from OpenAI models, but the first thing I noticed when I started using OSS models (5 months ago) was that I couldn't get my standard repair techniques to work with any of them. Not a single one could repair a malformed structured response. I actually gave up on asking OSS models for structured responses because I couldn't trust them.
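For context, the repair loops I'm describing look roughly like this; `generate()` is a hypothetical stand-in for whatever completion call is being used, and the schema is just an example:

```python
# Sketch of a validate-and-repair loop: parse the reply, check it against a
# JSON schema, and on failure feed the error back to the model for a retry.
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "city": {"type": "string"}},
    "required": ["name", "city"],
}

def get_structured_reply(prompt, generate, max_repairs=2):
    """Ask the model for JSON, validate it, and feed errors back for repair."""
    text = generate(prompt)
    for attempt in range(max_repairs + 1):
        try:
            obj = json.loads(text)
            validate(obj, SCHEMA)      # must be valid JSON *and* match the schema
            return obj
        except (json.JSONDecodeError, ValidationError) as err:
            if attempt == max_repairs:
                raise ValueError("model never produced a schema-conforming reply") from err
            # Feed the failure back and ask for a corrected reply; weaker
            # models often can't do this step, which is the problem above.
            text = generate(
                f"{prompt}\n\nYour previous reply was invalid: {err}\n"
                "Return only the corrected JSON object."
            )
```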
This work makes me think that maybe it's worth revisiting reliable function calling for OSS models, but I'm curious first how it works for non-instruction-tuned models. And then how well does it work for the instruction-tuned ones? I won't ramble too much here, but getting the model to reliably return a JSON object is one thing, getting it to reliably conform to a schema is another thing entirely, and then getting it to understand why it's returning a given structure (agents) is a completely different ball game.
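To make that distinction concrete with a toy example (both replies below parse as JSON, but only one matches the schema):

```python
# Valid JSON is not the same as schema-conforming JSON.
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

good = '{"name": "Alice", "age": 30}'
bad  = '{"name": "Alice", "age": "thirty"}'   # valid JSON, wrong type for "age"

for reply in (good, bad):
    obj = json.loads(reply)        # the JSON-level check passes for both
    try:
        validate(obj, schema)
        print("conforms to schema:", obj)
    except ValidationError as err:
        print("schema violation:", err.message)
```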
My guess is that you're seeing it's good at enforcing JSON for most models, and if the schema enforcement is working broadly, it's because you're forcing the model down that path and not because it's getting there willingly (think of a hostage with a gun to their head being forced to read a note).