Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open
Changes from 6 commits
21 commits
db1a5a5
Make lark.lark parse the same grammar as load_grammar.py, and make gr…
RossPatterson Feb 1, 2024
9493f81
1. Fix "Python type check / Format (pull request)" failure in test_la…
RossPatterson Feb 1, 2024
7a2880f
DOH!
RossPatterson Feb 1, 2024
83a374f
Remove unnessary anchor; coalesce ENBF item sets; fix %override grammar
RossPatterson Feb 2, 2024
fdffb5f
Revert lark.lark to its original form.
RossPatterson Feb 9, 2024
95c5742
Make lark.lark accept the same input as load_grammar.py, and provide …
RossPatterson Feb 9, 2024
200d6b5
Address some review comments.
RossPatterson Feb 9, 2024
0fb28f9
Fix review comment re: templates in terminals.
RossPatterson Feb 10, 2024
2ec5ef3
Fix review comment: Remove inlining from expansions, expansion, and v…
RossPatterson Feb 10, 2024
e9c026e
Address review comment: Make alias and expr optionals, not maybes, so…
RossPatterson Feb 10, 2024
9bf7ddf
Address review comment: Make '%declare rule' fail in post-processing …
RossPatterson Feb 10, 2024
7f02bd1
lark.lark doesn't allow backslash-nl as a line-continuation, but load…
RossPatterson Feb 13, 2024
4f7a5eb
Push optionality of rule_modifiers and priority down into rule_modifi…
RossPatterson Mar 15, 2024
40576d2
Fix bug introduced in #1018
RossPatterson Mar 15, 2024
daac65d
Issue #1388 is ready for review.
RossPatterson Mar 15, 2024
5f37365
Resolve @megalng comment re:@skipIf
RossPatterson Jun 21, 2024
697841b
Resolve @megalng comment re:tests/test_lark_validator.py
RossPatterson Jun 21, 2024
654e102
Resolve @megalng comment re:docstrings
RossPatterson Jun 21, 2024
33d7088
Resolve @erezsh comment re:typo
RossPatterson Jun 21, 2024
0d01fe2
Resolve part of @erezsh comment re: options.
RossPatterson Jun 21, 2024
20302ca
Remove obsolete 'options' parameter
RossPatterson Sep 24, 2024
123 changes: 74 additions & 49 deletions docs/grammar.md
Expand Up @@ -51,6 +51,29 @@ Lark begins the parse with the rule 'start', unless specified otherwise in the o

Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner).

## EBNF Expressions

The EBNF expression in a Lark terminal definition is a sequence of items to be matched.
Each item is one of:

* `TERMINAL` - Another terminal, which cannot be defined in terms of this terminal.
* `"string literal"` - Literal, to be matched as-is.
* `"string literal"i` - Literal, to be matched case-insensitively.
* `/regexp literal/` - Regular expression literal. Can include flags.
* `"character".."character"` - Literal range. The range represents all values between the two literals, inclusively.
* `(item item ..)` - Group items
* `(item | item | ..)` - Alternate items.
* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `item?` - Zero or one instances of item (a "maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* and *m* instances of item (not recommended for wide ranges, due to performance issues)

The EBNF expression in a Lark rule definition is also a sequence of the same set of items to be matched, with one addition:

* `rule` - A rule, which can include recursive use of this rule.
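
The inclusive behaviour of a literal range such as `"a".."z"` can be sketched with Python's `re` module. This is an illustration only: the helper name `literal_range_pattern` is hypothetical, and the sketch assumes a range behaves like a one-character regex class; it is not taken from Lark's source.

```python
import re

def literal_range_pattern(start, end):
    """Build a one-character pattern covering start..end, both ends included.

    Hypothetical helper for illustration; not part of Lark's API.
    """
    if len(start) != 1 or len(end) != 1:
        raise ValueError("a literal range needs two single characters")
    return re.compile("[%s-%s]" % (re.escape(start), re.escape(end)))

lower = literal_range_pattern("a", "z")
digit = literal_range_pattern("1", "9")

print(bool(lower.fullmatch("a")), bool(lower.fullmatch("z")))  # True True
print(bool(lower.fullmatch("A")))                              # False
print(bool(digit.fullmatch("0")))                              # False
```

Both endpoints match, which is what "inclusively" means in the list above.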

## Terminals

Expand All @@ -59,45 +82,16 @@ Terminals are used to match text into symbols. They can be defined as a combinat
**Syntax:**

```html
<NAME> [. <priority>] : <literals-and-or-terminals>
<NAME> [. <priority>] : <items-to-match>
```

Terminal names must be uppercase.

Literals can be one of:

* `"string"`
* `/regular expression+/`
* `"case-insensitive string"i`
* `/re with flags/imulx`
* Literal range: `"a".."z"`, `"1".."9"`, etc.
Terminal names must be uppercase. They must start with an underscore (`_`) or a letter (`A` through `Z`), and may be composed of letters, underscores, and digits (`0` through `9`). Terminal names that start with "_" will not be included in the parse tree, unless the `keep_all_tokens` option is specified, or unless they are part of a containing terminal. Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).

Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.

Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
See [EBNF Expressions](#ebnf-expressions) above for the list of items that a terminal can match.

### Templates

Templates are expanded when preprocessing the grammar.

Definition syntax:

```ebnf
my_template{param1, param2, ...}: <EBNF EXPRESSION>
```

Use syntax:

```ebnf
some_rule: my_template{arg1, arg2, ...}
```

Example:
```ebnf
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'

num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
```
Templates are not allowed with terminals.

### Priority

Expand All @@ -122,7 +116,7 @@ SIGNED_INTEGER: /
/x
```

Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
Supported flags are one of: `imslux`. See Python's [regex documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) for more details on each one.
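
As a rough guide to what each flag letter means, the sketch below maps the letters onto the constants in Python's `re` module. The names `FLAG_LETTERS`, `flags_from_letters` and `signed_int` are hypothetical, and the pattern is a made-up example rather than the terminal defined above.

```python
import re

# Flag letters as they appear after a regexp literal, mapped to Python's
# constants. Hypothetical table for illustration only.
FLAG_LETTERS = {
    "i": re.IGNORECASE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    "l": re.LOCALE,
    "u": re.UNICODE,
    "x": re.VERBOSE,
}

def flags_from_letters(letters):
    """Combine single-letter flags into one value for re.compile()."""
    flags = 0
    for letter in letters:
        flags |= FLAG_LETTERS[letter]
    return flags

# With "x" (verbose), whitespace and comments inside the pattern are ignored.
signed_int = re.compile(r"""
    [+-]?     # optional sign
    [0-9]+    # one or more digits
    """, flags_from_letters("ix"))

print(bool(signed_int.fullmatch("-42")))  # True
print(bool(signed_int.fullmatch("4 2")))  # False
```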

Regexps/strings of different flags can only be concatenated in Python 3.6+

Expand Down Expand Up @@ -196,29 +190,19 @@ _ambig

**Syntax:**
```html
<name> : <items-to-match> [-> <alias> ]
<modifiers><name> : <items-to-match> [-> <alias> ]
| ...
```

Names of rules and aliases are always in lowercase.
Names of rules and aliases are always in lowercase. They must start with an underscore (`_`) or a letter (`a` through `z`), and may be composed of letters, underscores, and digits (`0` through `9`). Rule names that start with "_" will be inlined into their containing rule.
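
The naming rules for terminals and rules can be checked with two regular expressions. The sketch below reuses the `TOKEN` and `RULE` patterns that appear in `lark/grammars/lark.lark` in this pull request; the function name `classify_name` is hypothetical. Note that these patterns allow at most one leading underscore, which is slightly stricter than the prose.

```python
import re

# Patterns copied from the TOKEN and RULE terminals in lark.lark.
TERMINAL_NAME = re.compile(r"_?[A-Z][_A-Z0-9]*")
RULE_NAME = re.compile(r"_?[a-z][_a-z0-9]*")

def classify_name(name):
    """Return 'terminal', 'rule', or None for an invalid name.

    Hypothetical helper for illustration; not part of Lark's API.
    """
    if TERMINAL_NAME.fullmatch(name):
        return "terminal"
    if RULE_NAME.fullmatch(name):
        return "rule"
    return None

for candidate in ["NUMBER", "_NL", "num_list", "_separated", "Mixed", "1abc"]:
    print(candidate, "->", classify_name(candidate))
```

Mixed-case names and names starting with a digit fall through both patterns and are reported as `None`.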

Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).

An alias is a name for the specific rule alternative. It affects tree construction.
An alias is a name for the specific rule alternative. It affects tree construction (see [Shaping the tree](tree_construction#shaping_the_tree)).

The effect of a rule on the parse tree can be specified by modifiers. The `!` modifier causes the rule to keep all its tokens, regardless of whether they are named or not. The `?` modifier causes the rule to be inlined if it only has a single child. The `?` modifier cannot be used on rules that are named starting with an underscore.

Each item is one of:

* `rule`
* `TERMINAL`
* `"string literal"` or `/regexp literal/`
* `(item item ..)` - Group items
* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `item?` - Zero or one instances of item ("maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues)
See [EBNF Expressions](#ebnf-expressions) above for the list of items that a rule can match.

**Examples:**
```perl
Expand All @@ -230,6 +214,29 @@ expr: expr operator expr
four_words: word ~ 4
```

### Templates

Templates are expanded when preprocessing rules in the grammar.

Definition syntax:

```ebnf
my_template{param1, param2, ...}: <EBNF EXPRESSION>
```

Use syntax:

```ebnf
some_rule: my_template{arg1, arg2, ...}
```

Example:
```ebnf
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'

num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
```
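
To make "expanded when preprocessing" concrete, here is a toy sketch that expands the `_separated` template from the example by plain text substitution. It is an assumption-laden illustration: `expand_template` is a hypothetical helper, and Lark itself works on the parsed grammar rather than on strings.

```python
import re

def expand_template(params, body, args):
    """Replace each parameter name in a template body with its argument.

    Toy string-based model for illustration; not how Lark implements it.
    """
    if len(params) != len(args):
        raise ValueError("expected %d arguments, got %d" % (len(params), len(args)))
    mapping = dict(zip(params, args))
    # \b stops a parameter such as 'x' from matching inside a longer name.
    pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, params)))
    return pattern.sub(lambda m: mapping[m.group(1)], body)

expanded = expand_template(["x", "sep"], "x (sep x)*", ["NUMBER", '","'])
print(expanded)  # NUMBER ("," NUMBER)*
```

After expansion, `num_list` reads as if it had been written `"[" NUMBER ("," NUMBER)* "]"` by hand.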

### Priority

Like terminals, rules can be assigned a priority. Rule priorities are signed
Expand Down Expand Up @@ -297,12 +304,24 @@ Note that `%ignore` directives cannot be imported. Imported rules will abide by

Declare a terminal without defining it. Useful for plugins.

**Syntax:**
```html
%declare <TERMINAL>
%declare <rule>
```

### %override

Override a rule or terminal, affecting all references to it, even in imported grammars.

Useful for implementing an inheritance pattern when importing grammars.

**Syntax:**
```html
%override <terminal definition>
%override <rule definition>
```

**Example:**
```perl
%import my_grammar (start, number, NUMBER)
Expand All @@ -319,6 +338,12 @@ Useful for splitting up a definition of a complex rule with many different optio

Can also be used to implement a plugin system where a core grammar is extended by others.

**Syntax:**
```html
%extend <TERMINAL> ... additional terminal alternate ...
%extend <rule> ... additional rule alternate ...
```


**Example:**
```perl
Expand Down
27 changes: 21 additions & 6 deletions lark/grammars/lark.lark
@@ -1,25 +1,39 @@
# Lark grammar of Lark's syntax
# Note: Lark is not bootstrapped, its parser is implemented in load_grammar.py
# This grammar matches that one, but does not enforce some rules that it does.
# If you want to enforce those, you can pass the "LarkValidatorVisitor" over
# the parse tree, like this:

# import os
# import lark
# from lark.lark_validator_visitor import LarkValidatorVisitor
#
# lark_path = os.path.join(os.path.dirname(lark.__file__), 'grammars/lark.lark')
# lark_parser = lark.Lark.open(lark_path, parser="lalr")
# parse_tree = lark_parser.parse(my_grammar)
# LarkValidatorVisitor.validate(parse_tree)

start: (_item? _NL)* _item?

_item: rule
| token
| statement

rule: RULE rule_params priority? ":" expansions
token: TOKEN token_params priority? ":" expansions
rule: rule_modifiers? RULE rule_params priority? ":" expansions
token: TOKEN priority? ":" expansions
Member

priority is already optional

Author

Yes, but load_grammar.py says priority is a required element of rule, and that priority is _DOT NUMBER or null. I wanted lark.lark to produce the same parse tree as load_grammar.py.

It's different for token (term in load_grammar.py) - there, load_grammar.py [says priority is optional](https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L162-L163).

Author

@erezsh If my comment of 2024-06-20 is acceptable, let's resolve this point.

Member

I think what I meant was that priority can already be an empty rule, so no point in making it optional.


rule_modifiers: RULE_MODIFIERS

rule_params: ["{" RULE ("," RULE)* "}"]
token_params: ["{" TOKEN ("," TOKEN)* "}"]

priority: "." NUMBER

statement: "%ignore" expansions -> ignore
| "%import" import_path ["->" name] -> import
| "%import" import_path name_list -> multi_import
| "%override" rule -> override_rule
| "%override" (rule | token) -> override
| "%declare" name+ -> declare
| "%extend" (rule | token) -> extend

!import_path: "."? name ("." name)*
name_list: "(" name ("," name)* ")"
Expand All @@ -39,14 +53,15 @@ name_list: "(" name ("," name)* ")"
?value: STRING ".." STRING -> literal_range
| name
| (REGEXP | STRING) -> literal
| name "{" value ("," value)* "}" -> template_usage
| RULE "{" value ("," value)* "}" -> template_usage

name: RULE
| TOKEN

_VBAR: _NL? "|"
OP: /[+*]|[?](?![a-z])/
RULE: /!?[_?]?[a-z][_a-z0-9]*/
RULE_MODIFIERS: /(!|![?]?|[?]!?)(?=[_a-z])/
RULE: /_?[a-z][_a-z0-9]*/
TOKEN: /_?[A-Z][_A-Z0-9]*/
STRING: _STRING "i"?
REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/
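
The split between `RULE_MODIFIERS` and `OP` relies on lookahead: a `?` directly before a rule name is a modifier, while a `?` that is not followed by a lowercase letter is the "zero or one" operator. The sketch below exercises the two patterns from this diff with Python's `re` module; `leading_modifiers` is a hypothetical helper, and Lark's own lexer may resolve overlaps differently.

```python
import re

# Patterns copied from the RULE_MODIFIERS and OP terminals above.
RULE_MODIFIERS = re.compile(r"(!|![?]?|[?]!?)(?=[_a-z])")
OP = re.compile(r"[+*]|[?](?![a-z])")

def leading_modifiers(line):
    """Return the modifier prefix of a rule definition, or '' if none."""
    match = RULE_MODIFIERS.match(line)
    return match.group(0) if match else ""

print(repr(leading_modifiers("!?expr: a b")))    # '!?'
print(repr(leading_modifiers("?atom: NUMBER")))  # '?'
print(repr(leading_modifiers("expr: atom")))     # ''

# '?' followed by a space is an operator; '?' glued to a name is not.
print(OP.match("? b") is not None)  # True
print(OP.match("?atom") is None)    # True
```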
Expand Down
93 changes: 93 additions & 0 deletions lark/lark_validator_visitor.py
@@ -0,0 +1,93 @@
from .lexer import Token
from .load_grammar import GrammarError
from .visitors import Visitor
from .tree import Tree

class LarkValidatorVisitor(Visitor):

    @classmethod
    def validate(cls, tree: Tree):
        visitor = cls()
        visitor.visit(tree)
        return tree

    def alias(self, tree: Tree):
        # Reject alias names in inner 'expansions'.
        self._reject_aliases(tree.children[0], "Deep aliasing not allowed")

    def ignore(self, tree: Tree):
        # Reject everything except 'literal' and 'name' > 'TOKEN'.
        assert len(tree.children) > 0  # The grammar should pass us some things to ignore.
        if len(tree.children) > 1:
            self._reject_bad_ignore()
        node = tree.children[0]
        if node.data == "expansions":
            if len(node.children) > 1:
                self._reject_bad_ignore()
            node = node.children[0]
        if node.data == "alias":
            if len(node.children) > 1:
                self._reject_bad_ignore()
            node = node.children[0]
        if node.data == "expansion":
            if len(node.children) > 1:
                self._reject_bad_ignore()
            node = node.children[0]
        if node.data == "expr":
            if len(node.children) > 1:
                self._reject_bad_ignore()
            node = node.children[0]
        if node.data == "atom":
            if len(node.children) > 1:
                self._reject_bad_ignore()
            node = node.children[0]
        if node.data == "literal":
            return
        elif node.data == "name":
            # The child of 'name' is a Token, which exposes .type rather than .data.
            if node.children[0].type == "TOKEN":
                return
        elif node.data == "value":
            if node.children[0].data == "literal":
                return
            elif node.children[0].data == "name":
                # Trees are not subscriptable; go through .children to reach the Token.
                if node.children[0].children[0].type == "TOKEN":
                    return
        self._reject_bad_ignore()

    def token(self, tree: Tree):
        assert len(tree.children) > 1  # The grammar should pass us at least a token name and an item.
        first_item = 2 if tree.children[1].data == "priority" else 1
        # Reject alias names in token definitions.
        for child in tree.children[first_item:]:
            self._reject_aliases(child, "Aliasing not allowed in terminals (You used -> in the wrong place)")
        # Reject template usage in token definitions. We do this before checking rules
        # because rule usage looks like template usage, just without parameters.
        for child in tree.children[first_item:]:
            self._reject_templates(child, "Templates not allowed in terminals")
        # Reject rule references in token definitions.
        for child in tree.children[first_item:]:
            self._reject_rules(child, "Rules aren't allowed inside terminals")

    def _reject_aliases(self, item: Tree|Token, message: str):
        if isinstance(item, Tree):
            if item.data == "alias" and len(item.children) > 1 and item.children[1] is not None:
                raise GrammarError(message)
            for child in item.children:
                self._reject_aliases(child, message)

    def _reject_bad_ignore(self):
        raise GrammarError("Bad %ignore - must have a Terminal or other value.")

    def _reject_rules(self, item: Tree|Token, message: str):
        if isinstance(item, Token) and item.type == "RULE":
            raise GrammarError(message)
        elif isinstance(item, Tree):
            for child in item.children:
                self._reject_rules(child, message)

    def _reject_templates(self, item: Tree|Token, message: str):
        if isinstance(item, Tree):
            if item.data == "template_usage":
                raise GrammarError(message)
            for child in item.children:
                self._reject_templates(child, message)
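
The `_reject_*` helpers all follow the same pattern: walk the subtree depth-first and raise as soon as a forbidden node is found. The standalone sketch below mirrors `_reject_rules` using stand-in classes so that it runs without Lark installed; `Leaf`, `Node`, `ValidationError` and `reject_rules` are hypothetical names, not Lark's `Token`, `Tree` or `GrammarError`.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Leaf:
    """Stand-in for a lexer token: a type tag plus its text."""
    type: str
    value: str

@dataclass
class Node:
    """Stand-in for a parse-tree node: a rule name plus children."""
    data: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)

class ValidationError(Exception):
    """Stand-in for the grammar error raised by the validator."""

def reject_rules(item, message):
    """Raise if any leaf below item is a rule reference."""
    if isinstance(item, Leaf):
        if item.type == "RULE":
            raise ValidationError(message)
    else:
        for child in item.children:
            reject_rules(child, message)

# A terminal body that refers only to another terminal passes.
ok_body = Node("expansions", [Node("name", [Leaf("TOKEN", "DIGIT")])])
reject_rules(ok_body, "Rules aren't allowed inside terminals")

# A terminal body that refers to a rule is rejected.
bad_body = Node("expansions", [Node("name", [Leaf("RULE", "expr")])])
try:
    reject_rules(bad_body, "Rules aren't allowed inside terminals")
except ValidationError as error:
    print("rejected:", error)
```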
2 changes: 2 additions & 0 deletions tests/__main__.py
Expand Up @@ -9,6 +9,8 @@
from .test_tools import TestStandalone
from .test_cache import TestCache
from .test_grammar import TestGrammar
from .test_lark_lark import TestLarkLark
from .test_ignore import TestIgnore
from .test_reconstructor import TestReconstructor
from .test_tree_forest_transformer import TestTreeForestTransformer
from .test_lexer import TestLexer