Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignored terminal used in the grammar is expected with Earley but not with LALR #1473

Open
Molrn opened this issue Sep 26, 2024 · 3 comments
Open
Labels
add to docs TODO: Include this information in the docs bug

Comments

@Molrn
Copy link

Molrn commented Sep 26, 2024

Describe the bug

When a terminal is ignored but still used in the grammar, the behavior is different depending on the parser used. With LALR, the terminal is ignored, with Earley it is expected.

According to the doc, the behavior of LALR seems to be the expected one, but the Earley behavior still makes sense.

To Reproduce

grammar = """
start : "CREATE" WS "START"
%import common.WS
%ignore WS
"""

lalr_parser = Lark(grammar, parser="lalr")
lalr_parser.parse("CREATE START")            # Fails
lalr_parser.parse("CREATESTART")             # Fails

earley_parser = Lark(grammar, parser="earley")
earley_parser.parse("CREATE START")       # Succeeds
earley_parser.parse("CREATESTART")        # Fails

Environment

Version 1.2.2

@erezsh
Copy link
Member

erezsh commented Jan 1, 2025

Are you using the latest Lark version?

lalr_parser.parse("CREATESTART") fails when I run it.

But you are right that lalr_parser.parse("CREATE START") fails, when it should succeed.

However, I'm not sure if there's a simple solution. Currently the LALR parser throws away ignored tokens at the lexer level. Sending them to the parser will have an impact on the LALR performance.

@erezsh erezsh added the bug label Jan 1, 2025
@Molrn
Copy link
Author

Molrn commented Jan 2, 2025

Are you using the latest Lark version?

I'm using 1.2.2, the latest version.

lalr_parser.parse("CREATESTART") fails when I run it.

You're right, this is an error in my bug report, sorry about that.

I understood it as if LALR's behavior was the expected one, as putting in the grammar an ignored terminal is counter-intuitive. Applying this behavior (removing ignored tokens at lexer level) to earley would probably be an easier fix. However, using this trick in earley is very powerful as it allows to have a much more flexible language with a clean grammar. It is personally what decided me to use earley over LALR for my project.

If it is not possible to implement this behavior in LALR without sinking its performances, I would recommand not to modify anything. It is though worth mentionning in the documentation in my opinion.

@erezsh erezsh added the add to docs TODO: Include this information in the docs label Jan 2, 2025
@erezsh
Copy link
Member

erezsh commented Jan 2, 2025

You're right, it's worth mentioning.

I also think it's better to keep this behavior in Earley, because sometimes it's useful to say "this ignored token must be here in the text", whereas otherwise it would be optional.

I think we can consider it a "missing feature" in LALR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to docs TODO: Include this information in the docs bug
Projects
None yet
Development

No branches or pull requests

2 participants