Greedy matching of Terminals inside a Rule #242

trulede · 2018-09-29T20:19:26Z

Hello,

I'm getting started with Lark and trying to implement a very basic DSL where for a given line of DSL code the first world is the "instruction" and everything following that are parameters, up until an optional Newline character.

This is the grammar:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" (EXP | WORD)* _NL?               -> foo_msg
                | "bar" (EXP | WORD)* _NL?               -> bar_msg
                | "repeat" NUMBER code_block -> repeat

    code_block: "{" instruction+ "}"
    EXP: "$" WORD

    %import common.WORD
    %import common.INT -> NUMBER
    %import common.WS
    %import common.NEWLINE -> _NL
    %ignore WS
"""

and the input:

foo one bar two
bar one

the problem is that the grammer is matching two instructions from the first line, i.e. "foo one" and then "bar two", but what I want is for one instruction "foo" with parameters "one bar two".

Any hints on how to achieve that?

Thanks.

The text was updated successfully, but these errors were encountered:

erezsh · 2018-09-29T20:40:55Z

You speak of lines, but you made the newline optional: _NL?

So, the interpretation of "foo one -- bar two" is correct.

In other words, you have ambiguity in your grammar.

I assume you're using Earley, so the solution in simply on of the following two:

Remove the ambiguity
Use priority to choose the correct derivation. Or, specify ambiguity="explicit" and choose the right one in your code.

Let me know if you're having trouble or need any more assistance.

trulede · 2018-09-29T21:49:04Z

Its a bit strange, I can run the same code several times, without any changes, and get different parsing results:

trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two'), Token(WORD, 'bar'), Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two bar two $foo one
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$

That is with the following grammar:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" (VAR | WORD)* _NL      -> foo_msg
                | "bar" (VAR | WORD)* _NL      -> bar_msg
                | "end"

    VAR: "$" WORD
    WORD.2: WORDX
    _NL: NEWLINE

    %import common.WORD -> WORDX
    %import common.WS
    %import common.NEWLINE
    %ignore WS
"""

and this input:

foo one bar two
bar two $foo one
end

SO ... I think the priority might have helped, but really, I'm not so sure, because when I take it away again ... I still get both the desired and not-so-desired results. Making _NL required as a part of the match was more helpful, so I have to append a newline to the code to be sure it will parse the last line correctly - at least I think I do ;-)

trulede · 2018-09-30T07:06:55Z

I see that there is issue #81 which may be the same problem regarding changing results. When I use the LALR I get an error which suggests the grammar is still ambiguous. This makes me wonder, is it possible to make a grammar which can handle this?

It must be possible, perhaps the approach I have is wrong. What I want to do is first split the text into lines, then match the first word in each line to an instruction, and have any remaining words in that line as a collection of parameters to the instruction.

If I had a grammar with some hierarchy then perhaps I could achieve that.

erezsh · 2018-09-30T08:00:14Z

so I have to append a newline to the code to be sure it will parse the last line correctly

Yes, that's the recommended method for now, until I implement #237

I can run the same code several times, without any changes, and get different parsing results:

This is a bug in the Earley parser, which is fixed in the earley_sppf branch. I plan on merging it in to master in the near future.

When I use the LALR I get an error which suggests the grammar is still ambiguous.

LALR demands that your grammar is unambiguous and deterministic. It's probably possible to make your grammar fit (as long as _NL isn't optional). But that really depends on the DSL you're planning.

trulede · 2018-09-30T08:28:46Z

I've implemented a few very simple languages by just splitting strings and looking up dictionaries ... simple as in:

$volt = 12
set v=$volt

... and that was getting difficult to implement any further.

With the LARK want I want to try a do is create a DSL where I can get some loop constructs, better variable handling, and then add some "auto-complete" capability so that the DSL can work in an interactive command prompt. Thats why I'm experimenting with the somewhat strange/simple construct of a command followed by words, the idea being that the autocomplete will build a valid DSL instruction interactively with repeated calls to a Lark Parser.

However, it might end up making more sense to just put the parameters in brackets, the autocomplete would be able to handle that, and provide hints for all possible parameters - either derived from the Lark parser, or some other data source. If change my grammar to this:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" ("\"" (VAR | WORD)* "\"")? _NL      -> foo_msg
                | "bar" ("\"" (VAR | WORD)* "\"")? _NL      -> bar_msg

    VAR: "$" WORD
    _NL: NEWLINE

    %import common.WORD
    %import common.WS
    %import common.NEWLINE
    %ignore WS
"""

it works as expected, the ambiguity is resolved. I should try the original grammar against the mentioned fixes/changes before deciding one way of the other.

Thanks.

trulede closed this as completed Sep 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Greedy matching of Terminals inside a Rule #242

Greedy matching of Terminals inside a Rule #242

trulede commented Sep 29, 2018

erezsh commented Sep 29, 2018

trulede commented Sep 29, 2018

trulede commented Sep 30, 2018

erezsh commented Sep 30, 2018

trulede commented Sep 30, 2018

Greedy matching of Terminals inside a Rule #242

Greedy matching of Terminals inside a Rule #242

Comments

trulede commented Sep 29, 2018

erezsh commented Sep 29, 2018

trulede commented Sep 29, 2018

trulede commented Sep 30, 2018

erezsh commented Sep 30, 2018

trulede commented Sep 30, 2018