Lelwel (Language for Extended LL(1) parsing With Error resilience and Lossless syntax trees) generates recursive descent parsers for Rust using LL(1) grammars with extensions for direct left recursion, operator precedence, semantic predicates (which also enable arbitrary lookahead), and semantic actions (which make it possible to handle semantic context sensitivity, e.g. the type / variable name ambiguity in C).
The parser creates a homogeneous, lossless, concrete syntax tree (CST) that can be used to construct an abstract syntax tree (AST). Special node rename, elision, marker, and creation operators allow fine-grained control over how the CST is built for certain parses.
The error recovery and tree construction are inspired by Alex Kladov's (matklad) Resilient LL Parsing Tutorial. Lelwel uses a (to my knowledge) novel heuristic to automatically calculate the recovery sets: it uses the follow sets of the dominators in the directed graph induced by the grammar.
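The dominator idea can be illustrated with a small sketch (this is a generic dominator computation, not Lelwel's actual implementation; the graph encoding and function name are illustrative). A node `d` dominates `v` if every path from the entry to `v` passes through `d`; the recovery-set heuristic described above can then draw on the follow sets of `v`'s dominators.

```rust
use std::collections::BTreeSet;

/// Naive iterative dominator computation for a directed graph given as
/// adjacency lists, with node 0 as the entry. `dom[v]` is the set of all
/// nodes that appear on every path from the entry to `v`.
fn dominators(succ: &[Vec<usize>]) -> Vec<BTreeSet<usize>> {
    let n = succ.len();
    let mut pred = vec![Vec::new(); n];
    for (u, vs) in succ.iter().enumerate() {
        for &v in vs {
            pred[v].push(u);
        }
    }
    let all: BTreeSet<usize> = (0..n).collect();
    let mut dom = vec![all; n];
    dom[0] = BTreeSet::from([0]);
    let mut changed = true;
    while changed {
        changed = false;
        for v in 1..n {
            // Intersect the dominator sets of all predecessors ...
            let mut new = pred[v]
                .iter()
                .map(|&p| dom[p].clone())
                .reduce(|a, b| a.intersection(&b).cloned().collect())
                .unwrap_or_default();
            // ... and every node dominates itself.
            new.insert(v);
            if new != dom[v] {
                dom[v] = new;
                changed = true;
            }
        }
    }
    dom
}
```

For a diamond-shaped graph `0 → {1, 2} → 3`, node 3 is dominated only by the entry and itself, so a recovery set for 3 would be built from the follow sets of those dominators.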
Lelwel is written as a library. It is used by the CLI tool `llw` and the language server `lelwel-ls`, and it can be included as a build dependency in order to be called from a `build.rs` file. There is a plugin for Neovim that uses the language server.
By default the generated parser uses Logos for lexing and Codespan for diagnostics; however, neither is mandatory.
- Error Resilience: The generated parser may provide similar error resilience as handwritten parsers.
- Lossless Syntax Tree: Language tooling such as language servers or formatters requires all the information about the source code, including whitespace and comments.
- Language Server: Get instant feedback when your grammar contains conflicts or errors.
- Easy to Debug: The generated parser is easy to understand and can be debugged with standard tools, as the code is not generated by a procedural macro.

- Error Resilience: It seems to be the case that LL parsers are better suited than LR parsers for generating meaningful syntax trees from incomplete source code.
- Runtime Complexity: More general parsers such as GLR/GLL or ALL(*) can have a runtime complexity of $O(n^3)$ or $O(n^4)$ respectively for certain grammars. With LL(1) parsers you are guaranteed linear runtime complexity, as long as your semantic actions and predicates have constant runtime complexity.
- Ambiguity: The decision problem of whether an arbitrary context free grammar is ambiguous is undecidable. Warnings of a general parser generator may therefore contain false positives. In the worst case ambiguities may only be found at runtime. The PEG formalism just defines ambiguity away, which may cause the parser to parse a different language than you think.
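The linear-time guarantee is easy to see in a hand-written LL(1) recognizer (a generic illustration, not Lelwel output): every step either consumes a token or returns, and each decision inspects only a single lookahead token, so there is never any backtracking. A sketch for the toy grammar `s: '(' s* ')';`:

```rust
/// Recursive descent recognizer for the LL(1) grammar `s: '(' s* ')';`.
/// Each call either consumes a token or returns based on one lookahead
/// byte, so total work is linear in the input length.
fn parse_s(input: &[u8], pos: &mut usize) -> bool {
    if input.get(*pos) != Some(&b'(') {
        return false; // lookahead not in FIRST(s)
    }
    *pos += 1; // consume '('
    while input.get(*pos) == Some(&b'(') {
        // lookahead '(' predicts another nested `s`
        if !parse_s(input, pos) {
            return false;
        }
    }
    if input.get(*pos) != Some(&b')') {
        return false;
    }
    *pos += 1; // consume ')'
    true
}

/// Returns true if the whole input is a valid `s`.
fn recognizes(src: &str) -> bool {
    let mut pos = 0;
    parse_s(src.as_bytes(), &mut pos) && pos == src.len()
}
```

The same single-token decision structure is what the generated parsers rely on, which is why constant-time predicates and actions preserve the linear bound.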
The following example shows the difference between Lelwel and Tree-sitter, a GLR parser generator with sophisticated error recovery, when parsing certain incomplete C source code.
```c
void f() {
  g(1,
  int x = 2 +
}
```
Lelwel syntax tree
```
TranslationUnit [0..33]
  FunctionDefinition [0..33]
    DeclarationSpecifiers [0..4]
      TypeSpecifier [0..4]
        Void "void" [0..4]
    Whitespace " " [4..5]
    Declarator [5..8]
      FunctionDeclarator [5..8]
        IdentDeclarator [5..6]
          Identifier "f" [5..6]
        LPar "(" [6..7]
        RPar ")" [7..8]
    Whitespace " " [8..9]
    CompoundStatement [9..33]
      LBrace "{" [9..10]
      Whitespace "\n  " [10..13]
      ExpressionStatement [13..17]
        CallExpr [13..17]
          IdentExpr [13..14]
            Identifier "g" [13..14]
          LPar "(" [14..15]
          ArgumentExpressionList [15..17]
            IntExpr [15..16]
              IntConst "1" [15..16]
            Comma "," [16..17]
      Whitespace "\n  " [17..20]
      Declaration [20..31]
        DeclarationSpecifiers [20..23]
          TypeSpecifier [20..23]
            Int "int" [20..23]
        Whitespace " " [23..24]
        InitDeclaratorList [24..31]
          InitDeclarator [24..31]
            Declarator [24..25]
              IdentDeclarator [24..25]
                Identifier "x" [24..25]
            Whitespace " " [25..26]
            Assign "=" [26..27]
            Whitespace " " [27..28]
            Initializer [28..31]
              BinExpr [28..31]
                IntExpr [28..29]
                  IntConst "2" [28..29]
                Whitespace " " [29..30]
                Plus "+" [30..31]
      Whitespace "\n" [31..32]
      RBrace "}" [32..33]
```
Tree-sitter syntax tree
```
translation_unit [0, 0] - [4, 0]
  function_definition [0, 0] - [3, 1]
    type: primitive_type [0, 0] - [0, 4]
    declarator: function_declarator [0, 5] - [0, 8]
      declarator: identifier [0, 5] - [0, 6]
      parameters: parameter_list [0, 6] - [0, 8]
        ( [0, 6] - [0, 7]
        ) [0, 7] - [0, 8]
    body: compound_statement [0, 9] - [3, 1]
      { [0, 9] - [0, 10]
      ERROR [1, 2] - [2, 11]
        identifier [1, 2] - [1, 3]
        ( [1, 3] - [1, 4]
        number_literal [1, 4] - [1, 5]
        , [1, 5] - [1, 6]
        assignment_expression [2, 2] - [2, 11]
          left: identifier [2, 2] - [2, 5]
          ERROR [2, 6] - [2, 7]
            identifier [2, 6] - [2, 7]
          operator: = [2, 8] - [2, 9]
          right: number_literal [2, 10] - [2, 11]
      ERROR [2, 12] - [2, 13]
        + [2, 12] - [2, 13]
      } [3, 0] - [3, 1]
```
Note
Tree-sitter is an excellent tool for its intended purpose, which is mainly syntax highlighting. This comparison only demonstrates the possible syntax tree quality of an error resilient parser.
The parser for lelwel grammar files (`*.llw`) is itself generated by Lelwel. There are also examples for C without a preprocessor (which actually resolves the ambiguity with semantic context information, unlike the corresponding examples for ANTLR4 and Tree-sitter), Lua, arithmetic expressions, JSON, and Oberon-0.
You can try out examples in the Lelwel Playground.
The following example shows a grammar for the toy language "L" introduced by the Resilient LL Parsing Tutorial.
```
token Fn='fn' Let='let' Return='return' True='true' False='false';
token Arrow='->' LPar='(' RPar=')' Comma=',' Colon=':' LBrace='{' RBrace='}'
      Semi=';' Asn='=' Plus='+' Minus='-' Star='*' Slash='/';
token Name='<name>' Int='<int>';
token Whitespace;

skip Whitespace;
start file;

file: fn*;
fn: 'fn' Name param_list ['->' type_expr] block;
param_list: '(' [param (?1 ',' param)* [',']] ')';
param: Name ':' type_expr;
type_expr: Name;
block: '{' stmt* '}';
stmt^:
  stmt_expr
| stmt_let
| stmt_return
;
stmt_expr: expr ';';
stmt_let: 'let' Name '=' expr ';';
stmt_return: 'return' [expr] ';';
expr:
  expr ('*' | '/') expr @expr_binary
| expr ('+' | '-') expr @expr_binary
| expr arg_list @expr_call
| Name @expr_name
| '(' expr ')' @expr_paren
| expr_literal ^
;
expr_literal: Int | 'true' | 'false';
arg_list: '(' [expr (?1 ',' expr)* [',']] ')';
```
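The `stmt` rule in this grammar is LL(1) because the first sets of its three branches are disjoint, so a single token of lookahead selects the branch. This decision can be sketched as follows (the `Token` enum and branch names are illustrative, not taken from generated code):

```rust
/// Simplified token type covering FIRST(stmt) for the "L" grammar.
#[derive(Debug, PartialEq)]
enum Token {
    Let,
    Return,
    Name,
    Int,
    True,
    False,
    LPar,
    Minus, // not in FIRST(expr) for this grammar
}

/// Selects the `stmt` branch from one lookahead token:
/// `stmt_let` starts with 'let', `stmt_return` with 'return', and
/// everything in FIRST(expr) belongs to `stmt_expr`.
fn stmt_branch(lookahead: &Token) -> Option<&'static str> {
    match lookahead {
        Token::Let => Some("stmt_let"),
        Token::Return => Some("stmt_return"),
        Token::Name | Token::Int | Token::True | Token::False | Token::LPar => {
            Some("stmt_expr")
        }
        _ => None, // lookahead not in FIRST(stmt): a syntax error
    }
}
```

The generated parser makes the same kind of decision at every alternation, which is why conflicts (overlapping first sets) are reported by the grammar checker.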
- Write a grammar file and place it in the `src` directory of your crate. Optionally you can install the CLI or language server to validate your grammar file: `cargo install --features=cli,lsp lelwel`.
- Add the following to your `Cargo.toml` and `build.rs` files.
  ```toml
  [dependencies]
  logos = "0.15"
  codespan-reporting = "0.11"

  [build-dependencies]
  lelwel = "0.7"
  ```
  ```rust
  fn main() {
      lelwel::build("src/your_grammar.llw");
  }
  ```
- Start a build. This will create a `parser.rs` file next to your grammar file. The `parser.rs` file is supposed to be manually edited to implement the lexer; it includes the actual parser `generated.rs`, which is written to the Cargo `OUT_DIR`. If you change the grammar after the `parser.rs` file has been generated, it may be required to manually update the `Token` enum or the `Parser` impl for semantic predicates and actions.
- Use the parser module with the following minimal `main.rs` file for printing the CST and diagnostics.
  ```rust
  mod parser;

  use codespan_reporting::files::SimpleFile;
  use codespan_reporting::term::termcolor::{ColorChoice, StandardStream};
  use codespan_reporting::term::{self, Config};
  use logos::Logos;
  use parser::*;

  fn main() -> std::io::Result<()> {
      let args: Vec<String> = std::env::args().collect();
      if args.len() != 2 {
          std::process::exit(1);
      }
      let source = std::fs::read_to_string(&args[1])?;

      let mut diags = vec![];
      let (tokens, ranges) = tokenize(Token::lexer(&source), &mut diags);
      let cst = Parser::parse(&source, tokens, ranges, &mut diags);
      println!("{cst}");

      let writer = StandardStream::stderr(ColorChoice::Auto);
      let config = Config::default();
      let file = SimpleFile::new(&args[1], &source);
      for diag in diags.iter() {
          term::emit(&mut writer.lock(), &config, &file, diag).unwrap();
      }
      Ok(())
  }
  ```
Lelwel grammars are based on the formalism of context free grammars (CFG) and more specifically LL(1) grammars. There are certain extensions to the classical grammar syntax such as constructs similar to those from EBNF.
A grammar file consists of top level definitions which are independent of their order.
C-style and C++-style comments can be used.
Example
```
// this is a single-line comment
/* this is a multi-line comment */
```
Documentation comments can be used before top level definitions.
Example
```
/// this is a doc-comment
token A B C;
```
Tip
Documentation comments are shown by the language server on hover.
A token list definition introduces a list of tokens (terminals) to the grammar.
It starts with the `token` keyword, ends with a `;`, and contains a list of token names and corresponding token symbols.
A token name must start with a capital letter. The token symbol is optional and delimited by single quotation marks. In a regex a token can be referenced by its name or symbol.
Tip
The token symbol is used in error messages and by the generator of the `parser.rs` file.
If the token symbol string starts with `<` and ends with `>`, the token is interpreted as a class of tokens for which the symbol is only a description.
This influences how error messages and lexer rules are generated by default in `parser.rs`.
Example
```
token MyKeyword='my_keyword' Int='<integer literal>' True='true' False='false';
```
A `skip` definition specifies a list of tokens that are ignored by the parser.
These tokens will however still be part of the syntax tree.
Example
```
skip Whitespace Comment;
```
A `right` definition specifies a list of tokens that are handled as right associative operators in left recursive rules.
Example
```
right '^' '=';
```
A `start` definition specifies the start rule of the grammar.
There must be exactly one start definition in a grammar.
The start rule must not be referenced in a regex.
Example
```
start translation_unit;
```
A grammar rule must start with a lower case letter. A regular expression is used to specify the right hand side of the rule.
Example
```
translation_unit: declaration*;
```
Regular expressions are built from the following syntactic constructs.
- Grouping: `(...)`
- Identifier: `rule_name` or `TokenName`
- Symbol: `'token symbol'`
- Concatenation: `A B`, which is `A` followed by `B`
- Alternation: `A | B`, which is either `A` or `B`
- Optional: `[A]`, which is either `A` or nothing
- Star Repetition: `A*`, which is a repetition of 0 or more `A`
- Plus Repetition: `A+`, which is a repetition of 1 or more `A`
- Semantic Predicate: `?1`, which is the semantic predicate number 1
- Semantic Action: `#1`, which is the semantic action number 1
- Node Rename: `@new_node_name`, which renames the rule syntax tree node
- Node Elision: `^`, which prevents creation of the rule syntax tree node
- Node Marker: `<1`, a marker with index 1 for a new node
- Node Creation: `1>new_node_name`, which inserts a node at the position of the marker with index 1
Rules with direct left recursion are parsed using a Pratt parser. The order of recursive rule branches in the top level alternation defines the binding power of the operator tokens. The binding power decreases from the first to the last branch. In a left recursive branch, the follow set of the first concatenation element defines the operator tokens. Otherwise, in a right recursive branch, the first set of the first concatenation element defines the operator tokens.
Tip
Use a Pratt parser for expressions, as it improves readability and efficiency compared to an encoding of operator precedence in grammar rules.
Example
```
expr:
  expr '^' expr
| ('-' | '+') expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| Num
| '(' expr ')'
;
```
Mixing left and right associative operators in the same branch is not allowed.
Example
```
right '=';
expr:
  expr ('+' | '=') expr // ❌ error
| Num
;
```
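The branch ordering of the `expr` example above can be pictured with a tiny hand-written Pratt evaluator (an illustration of the technique, not Lelwel's generated parser; single-digit operands and `char` tokens are simplifications):

```rust
/// Pratt evaluator with binding powers matching the example grammar:
/// '^' binds tightest and is right associative, then unary '-'/'+',
/// then '*'/'/', then binary '+'/'-'.
fn expr_bp(tokens: &[char], pos: &mut usize, min_bp: u8) -> f64 {
    let mut lhs = match tokens[*pos] {
        '-' => {
            *pos += 1;
            -expr_bp(tokens, pos, 5) // unary binds tighter than '*', looser than '^'
        }
        '+' => {
            *pos += 1;
            expr_bp(tokens, pos, 5)
        }
        '(' => {
            *pos += 1;
            let v = expr_bp(tokens, pos, 0);
            *pos += 1; // consume ')'
            v
        }
        c => {
            *pos += 1;
            f64::from(c.to_digit(10).expect("digit"))
        }
    };
    while *pos < tokens.len() {
        // (left, right) binding power; right < left makes '^' right associative
        let (l_bp, r_bp) = match tokens[*pos] {
            '^' => (7, 6),
            '*' | '/' => (3, 4),
            '+' | '-' => (1, 2),
            _ => break, // e.g. ')'
        };
        if l_bp < min_bp {
            break;
        }
        let op = tokens[*pos];
        *pos += 1;
        let rhs = expr_bp(tokens, pos, r_bp);
        lhs = match op {
            '^' => lhs.powf(rhs),
            '*' => lhs * rhs,
            '/' => lhs / rhs,
            '+' => lhs + rhs,
            _ => lhs - rhs,
        };
    }
    lhs
}

/// Evaluates a whitespace-insensitive single-digit expression.
fn eval(src: &str) -> f64 {
    let tokens: Vec<char> = src.chars().filter(|c| !c.is_whitespace()).collect();
    expr_bp(&tokens, &mut 0, 0)
}
```

Lower binding power corresponds to a later branch in the alternation; a `right` definition amounts to making the right binding power smaller than the left one, as done for `'^'` here.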
Semantic actions can be defined at any point in a regex. Semantic predicates can be defined at the beginning of an alternation branch, an optional or a repetition. The index of actions and predicates can be used multiple times in a rule.
When a semantic action or predicate is visited during a parse, it executes the Rust code of the corresponding associated function in the `PredicatesAndActions` trait implementation of the `Parser` type.
Tip
If the `PredicatesAndActions` trait is implemented in the `parser.rs` file next to the grammar file (where it is generated by default), you can use the language server to jump from the grammar to the Rust code.
Example
```
token A B C;
foo:
  ?1 A #1 (?2 B C)* B #2
| A C #2
;
```
```rust
impl PredicatesAndActions for Parser<'_> {
    fn predicate_foo_1(&self) -> bool {
        self.peek(1) == Token::B
    }
    fn predicate_foo_2(&self) -> bool {
        self.context.some_condition
    }
    fn action_foo_1(&mut self, diags: &mut Vec<Diagnostic>) {
        println!("executed action 1");
    }
    fn action_foo_2(&mut self, diags: &mut Vec<Diagnostic>) {
        println!("executed action 2");
    }
}
```
The syntax tree node for a rule can be renamed when a node rename operator is visited during a parse.
Example
```
expr:
  Int @int_expr
| '(' expr ')' @paren_expr
;
```
May result in the following syntax tree.
```
paren_expr
├─ '('
├─ int_expr
│  └─ Int
└─ ')'
```
Creation of a syntax tree node can be prevented if a node elision operator is visited during a parse.
Example
```
foo: A bar bar;
bar:
  B ^
| C
;
```
May result in the following syntax tree.
```
foo
├─ A
├─ B
└─ bar
   └─ C
```
Unconditional node elision can also be achieved with a special syntax by writing the operator directly after the rule name.
Tip
This is useful for rules with a top level alternation, as it avoids an extra nesting with parentheses.
Example
```
foo: A bar bar;
bar^: B | C;
```
May result in the following syntax tree.
```
foo
├─ A
├─ B
└─ C
```
Node elision is not allowed in left recursive branches of a rule.
Example
```
foo:
  foo A ^ // ❌ error
| B ^ // ✅ ok
;
```
A node marker can be defined in a regex, which marks the position where a syntax tree node can be inserted during a parse. The index of such a node marker must be unique for each rule.
A node creation can be used with a corresponding node marker, if it is placed in a position where the node marker will be visited before the node creation during a parse.
Warning
The current implementation uses insertion into a vector for node creation, which may impact performance if many rules/tokens are parsed between node marker and creation. In the worst case this may result in quadratic parse time complexity. However, in practice it does not seem to matter much, as the insertions mostly happen close to the end of the vector. Nevertheless, in the future this will probably be changed to a more efficient implementation.
Example
```
foo: A <1 B C 1>bar;
```
May result in the following syntax tree.
```
foo
├─ A
└─ bar
   ├─ B
   └─ C
```
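The marker/creation mechanism can be pictured on a flat event vector, a common representation for homogeneous syntax trees (this is a sketch of the idea only; the `Event` type and `demo` function are illustrative, not Lelwel's actual data structures):

```rust
/// Flat tree representation: a sequence of events that a tree builder
/// replays to produce nested nodes.
#[derive(Debug, PartialEq)]
enum Event {
    Open(&'static str), // start of a syntax tree node
    Close,              // end of the most recently opened node
    Token(&'static str),
}

/// Replays the parse of `foo: A <1 B C 1>bar;` on an event vector.
fn demo() -> Vec<Event> {
    let mut events = vec![Event::Token("A")];
    let marker = events.len(); // `<1` remembers this position
    events.push(Event::Token("B"));
    events.push(Event::Token("C"));
    // `1>bar`: wrap everything since the marker into a `bar` node.
    // `Vec::insert` shifts the tail, which is the O(n) cost the
    // warning above refers to.
    events.insert(marker, Event::Open("bar"));
    events.push(Event::Close);
    events
}
```

Because markers are usually created shortly before the corresponding node creation, the shifted tail is short in practice, matching the observation above.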
The index and node name are optional for the node creation. If the index is missing, the behavior is as if a node marker at the start of the rule was used. If the node name is missing, the node name of the current rule is used.
Tip
This is useful in combination with unconditional node elision, so a node can still be created for certain parses.
Example
```
assign_expr^: bin_expr ['=' assign_expr >];
```
This is usually preferable to
```
assign_expr: bin_expr ('=' assign_expr | ^);
```
as the latter will not elide a node if `bin_expr` is followed by an invalid token that is not in the follow set.
A node creation with missing index is not allowed in left recursive rules.
Example
```
foo:
  foo A > // ❌ error
| B > // ❌ error
;
```
Lelwel, its examples, and its generated code are licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.