You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Root cause: LL1Analyzer returns incorrect follow tokens when exploring a rule with epsilon transitions from start to rule stop state, by exploring the rule stop target transition.
Background: During parsing, when an error is encountered, before exiting the rule the parser attempts error recovery. DefaultErrorStrategy's recover implementation works by resynchronizing the parser by consuming tokens until it finds one in the resynchronization set. If it successful, the input stream is then in a state where the next token is known to be able to be consumed by some ancestor rule being recursively parsed, and parsing can continue. The key is to correctly compute the resynchronization set--the set of tokens that may follow the immediate rule or any ancestor rule we are parsing. That is done using the ATN: each rule transition object stores a follow state (the state the rule resumes to after the sub-rule invocation returns); in broad strokes, starting from the follow state of the invoking state of each ancestor rule context, LL1 Analyzer walks the epsilon, predicate, and rule transitions of the ATN, adding tokens (defined by atom and wildcard transitions) it discovers. Importantly, it doesn't transition past these token transitions to ensure it operates with LL(1), and it doesn't continue past the stop state of each invoking rule.
Why does it matter that it not look past the stop state of some rule X? Because the stop state has a transition to the follow state for all possible transitions to rule X. These may not be possible states that can produce the next token as defined by the current rule context (including its ancestor rule contexts). For example, in the following grammar, rule2's stop state has an epsilon transition to the follow state of the rule transition to rule2 of both the start rule and rule1, but only ABC may follow HELLO WORLD when parsed from the start rule. If LL1 Analyzer continued past the stop state of rule2, it would transition to a state in rule1 and ZZZ would be erroneously added to the resynchronization set:
The bug in this issue comes into play when writing rules that may consume no tokens (i.e. a rule in which the stop state may be reached from the start state purely through epsilon/predicate transitions). For example, rule4 in the following grammar:
Here, ZZZ should not be in the resynchronization set of the parser state after HELLO in rule2, but it is because given the rule contexts (start) -> (rule1) -> (rule2 HELLO) it can be reached by a rule transition from rule1 -> rule4 and epsilon transitions from rule4 to rule3 and an atom transition to ZZZ, before a depth-first arrival at rule1's stop state. This means parsing "HELLO ZZZ ZZZ ABC" would recover from a MismatchedTokenException in rule2 by consuming no tokens, rule1 completes normally, then the start rule also hits a MismatchedTokenException since the input stream is currently at the first ZZZ token and it is expecting ABC.
The fix is to ensure that LL1 Analyzer correctly computes the result set by never transitioning from a rule stop state, outside of special scenarios where the stop state is not set such as SLL(1) lookaheads. Then, in our example, given the correct resynchronization set {DEF, ABC} recover will correctly resynchronize by consuming the two ZZZ tokens and leaving the input stream at token ABC for the start rule to successfully parse. Note: The original ANTLR4 repository does not have this defect, as it already implements correct stopping.
As a side note, what if rule1 had a 'JKL'? optional token after the call to rule4? LL1 Analyzer already correctly handles this by storing the follow state of each rule transition that it walks, and when reaching a rule stop state it continues from that follow state if there is one. This doesn't conflict with what was described above, as we are not transitioning to the rule stop state target states but rather to the follow state of the invoking rule transition (in this example, that means when we encounter the rule4 stop state we transition to the state in rule1 after the rule4 call, in order to add token 'JKL' to the result set).
The text was updated successfully, but these errors were encountered:
br0nstein
added a commit
to br0nstein/antlr4-optimized
that referenced
this issue
Oct 26, 2023
Root cause: LL1Analyzer returns incorrect follow tokens when exploring a rule with epsilon transitions from start to rule stop state, by exploring the rule stop target transition.
Background: During parsing, when an error is encountered, before exiting the rule the parser attempts error recovery. DefaultErrorStrategy's recover implementation works by resynchronizing the parser by consuming tokens until it finds one in the resynchronization set. If it successful, the input stream is then in a state where the next token is known to be able to be consumed by some ancestor rule being recursively parsed, and parsing can continue. The key is to correctly compute the resynchronization set--the set of tokens that may follow the immediate rule or any ancestor rule we are parsing. That is done using the ATN: each rule transition object stores a follow state (the state the rule resumes to after the sub-rule invocation returns); in broad strokes, starting from the follow state of the invoking state of each ancestor rule context, LL1 Analyzer walks the epsilon, predicate, and rule transitions of the ATN, adding tokens (defined by atom and wildcard transitions) it discovers. Importantly, it doesn't transition past these token transitions to ensure it operates with LL(1), and it doesn't continue past the stop state of each invoking rule.
Why does it matter that it not look past the stop state of some rule X? Because the stop state has a transition to the follow state for all possible transitions to rule X. These may not be possible states that can produce the next token as defined by the current rule context (including its ancestor rule contexts). For example, in the following grammar, rule2's stop state has an epsilon transition to the follow state of the rule transition to rule2 of both the start rule and rule1, but only ABC may follow HELLO WORLD when parsed from the start rule. If LL1 Analyzer continued past the stop state of rule2, it would transition to a state in rule1 and ZZZ would be erroneously added to the resynchronization set:
The bug in this issue comes into play when writing rules that may consume no tokens (i.e. a rule in which the stop state may be reached from the start state purely through epsilon/predicate transitions). For example, rule4 in the following grammar:
Here, ZZZ should not be in the resynchronization set of the parser state after HELLO in rule2, but it is because given the rule contexts (start) -> (rule1) -> (rule2 HELLO) it can be reached by a rule transition from rule1 -> rule4 and epsilon transitions from rule4 to rule3 and an atom transition to ZZZ, before a depth-first arrival at rule1's stop state. This means parsing "HELLO ZZZ ZZZ ABC" would recover from a MismatchedTokenException in rule2 by consuming no tokens, rule1 completes normally, then the start rule also hits a MismatchedTokenException since the input stream is currently at the first ZZZ token and it is expecting ABC.
The fix is to ensure that LL1 Analyzer correctly computes the result set by never transitioning from a rule stop state, outside of special scenarios where the stop state is not set such as SLL(1) lookaheads. Then, in our example, given the correct resynchronization set {DEF, ABC} recover will correctly resynchronize by consuming the two ZZZ tokens and leaving the input stream at token ABC for the start rule to successfully parse. Note: The original ANTLR4 repository does not have this defect, as it already implements correct stopping.
As a side note, what if rule1 had a 'JKL'? optional token after the call to rule4? LL1 Analyzer already correctly handles this by storing the follow state of each rule transition that it walks, and when reaching a rule stop state it continues from that follow state if there is one. This doesn't conflict with what was described above, as we are not transitioning to the rule stop state target states but rather to the follow state of the invoking rule transition (in this example, that means when we encounter the rule4 stop state we transition to the state in rule1 after the rule4 call, in order to add token 'JKL' to the result set).
The text was updated successfully, but these errors were encountered: