Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Greedy matching of Terminals inside a Rule #242

Closed
trulede opened this issue Sep 29, 2018 · 5 comments
Closed

Greedy matching of Terminals inside a Rule #242

trulede opened this issue Sep 29, 2018 · 5 comments

Comments

@trulede
Copy link

trulede commented Sep 29, 2018

Hello,

I'm getting started with Lark and trying to implement a very basic DSL where for a given line of DSL code the first world is the "instruction" and everything following that are parameters, up until an optional Newline character.

This is the grammar:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" (EXP | WORD)* _NL?               -> foo_msg
                | "bar" (EXP | WORD)* _NL?               -> bar_msg
                | "repeat" NUMBER code_block -> repeat

    code_block: "{" instruction+ "}"
    EXP: "$" WORD

    %import common.WORD
    %import common.INT -> NUMBER
    %import common.WS
    %import common.NEWLINE -> _NL
    %ignore WS
"""

and the input:

foo one bar two
bar one

the problem is that the grammer is matching two instructions from the first line, i.e. "foo one" and then "bar two", but what I want is for one instruction "foo" with parameters "one bar two".

Any hints on how to achieve that?

Thanks.

@erezsh
Copy link
Member

erezsh commented Sep 29, 2018

You speak of lines, but you made the newline optional: _NL?

So, the interpretation of "foo one -- bar two" is correct.

In other words, you have ambiguity in your grammar.

I assume you're using Earley, so the solution in simply on of the following two:

  1. Remove the ambiguity

  2. Use priority to choose the correct derivation. Or, specify ambiguity="explicit" and choose the right one in your code.

Let me know if you're having trouble or need any more assistance.

@trulede
Copy link
Author

trulede commented Sep 29, 2018

Its a bit strange, I can run the same code several times, without any changes, and get different parsing results:

trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two'), Token(WORD, 'bar'), Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two bar two $foo one
trule@timimac [lark]$ python3 lark_dsl.py 
Tree(start, [Tree(foo_msg, [Token(WORD, 'one'), Token(WORD, 'bar'), Token(WORD, 'two')]), Tree(bar_msg, [Token(WORD, 'two'), Token(VAR, '$foo'), Token(WORD, 'one')]), Tree(instruction, [])])
FOO: one bar two
BAR: one $foo two
trule@timimac [lark]$ 

That is with the following grammar:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" (VAR | WORD)* _NL      -> foo_msg
                | "bar" (VAR | WORD)* _NL      -> bar_msg
                | "end"

    VAR: "$" WORD
    WORD.2: WORDX
    _NL: NEWLINE

    %import common.WORD -> WORDX
    %import common.WS
    %import common.NEWLINE
    %ignore WS
"""

and this input:

foo one bar two
bar two $foo one
end

SO ... I think the priority might have helped, but really, I'm not so sure, because when I take it away again ... I still get both the desired and not-so-desired results. Making _NL required as a part of the match was more helpful, so I have to append a newline to the code to be sure it will parse the last line correctly - at least I think I do ;-)

@trulede
Copy link
Author

trulede commented Sep 30, 2018

I see that there is issue #81 which may be the same problem regarding changing results. When I use the LALR I get an error which suggests the grammar is still ambiguous. This makes me wonder, is it possible to make a grammar which can handle this?

It must be possible, perhaps the approach I have is wrong. What I want to do is first split the text into lines, then match the first word in each line to an instruction, and have any remaining words in that line as a collection of parameters to the instruction.

If I had a grammar with some hierarchy then perhaps I could achieve that.

@erezsh
Copy link
Member

erezsh commented Sep 30, 2018

so I have to append a newline to the code to be sure it will parse the last line correctly

Yes, that's the recommended method for now, until I implement #237

I can run the same code several times, without any changes, and get different parsing results:

This is a bug in the Earley parser, which is fixed in the earley_sppf branch. I plan on merging it in to master in the near future.

When I use the LALR I get an error which suggests the grammar is still ambiguous.

LALR demands that your grammar is unambiguous and deterministic. It's probably possible to make your grammar fit (as long as _NL isn't optional). But that really depends on the DSL you're planning.

@trulede
Copy link
Author

trulede commented Sep 30, 2018

I've implemented a few very simple languages by just splitting strings and looking up dictionaries ... simple as in:

$volt = 12
set v=$volt

... and that was getting difficult to implement any further.

With the LARK want I want to try a do is create a DSL where I can get some loop constructs, better variable handling, and then add some "auto-complete" capability so that the DSL can work in an interactive command prompt. Thats why I'm experimenting with the somewhat strange/simple construct of a command followed by words, the idea being that the autocomplete will build a valid DSL instruction interactively with repeated calls to a Lark Parser.

However, it might end up making more sense to just put the parameters in brackets, the autocomplete would be able to handle that, and provide hints for all possible parameters - either derived from the Lark parser, or some other data source. If change my grammar to this:

dsl_grammer = r"""
    start: instruction+

    instruction:  "foo" ("\"" (VAR | WORD)* "\"")? _NL      -> foo_msg
                | "bar" ("\"" (VAR | WORD)* "\"")? _NL      -> bar_msg

    VAR: "$" WORD
    _NL: NEWLINE

    %import common.WORD
    %import common.WS
    %import common.NEWLINE
    %ignore WS
"""

it works as expected, the ambiguity is resolved. I should try the original grammar against the mentioned fixes/changes before deciding one way of the other.

Thanks.

@trulede trulede closed this as completed Sep 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants