Understanding ATN state changes #4370

renatahodovan · 2023-07-31T15:27:55Z

Hi all!

I use ANTLR4 to parse many input files in a wide variety of languages and build custom tree representations from them. These trees need specific quantifier nodes as the parents of subtrees quantified with *, + or ?. To get them, I analyze the input grammar and automatically inject custom actions that mark the beginning and the end of each quantified expression. I then rebuild the modified grammar and use custom parser subclasses and listeners to build the tree. This works well, but it's a bit nasty.

After investigating the generated parser, I noticed that the ATN state changes encode information very similar to what I need. Fortunately, this information is available in the Python runtime by overriding the setter of the state member. However, it seems that either some state updates are missing or my knowledge of the expected behaviour is limited (which is a fact). I thought that every state has a stateNumber, a stateType, and a transitions list whose entries point to the possible follow-up states (through the target property of each transition). However, when I run a dummy parser and print all the state changes with their possible transitions, the follow-up states are not necessarily among the targets in the current state's transition list.

I have a minimal repro, printing the state changes and the transitions:

grammar Loop;

start : x* ('a' | 'b')+;
x : 'X' ;

from antlr4 import InputStream, CommonTokenStream
from antlr4.atn.ATNState import ATNState

from LoopLexer import LoopLexer
from LoopParser import LoopParser


class MyParser(LoopParser):

    @property
    def state(self):
        return self._stateNumber

    @state.setter
    def state(self, atnState: int):
        self._stateNumber = atnState
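        # Caveat: for atnState == -1 (see the last output line), Python's negative
        # indexing returns the last state in the list instead of raising an error.
        state = self.atn.states[atnState]
        # Render each outgoing transition as TYPE(stateNumber) of its target state.
        t = ", ".join(
            f'{ATNState.serializationNames[tr.target.stateType]}({tr.target.stateNumber})'
            for tr in state.transitions)
        print(f'{atnState:5} {self.ruleNames[state.ruleIndex]:5}: '
              f'{ATNState.serializationNames[state.stateType]:20} [{t}]')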
     

def main():
    lexer = LoopLexer(InputStream("XXa"))
    parser = MyParser(CommonTokenStream(lexer))
    parser.start()

main()

Output

 SNum  Rule  SType                Transitions
 -------------------------------------------------------------------
    0 start: RULE_START           [STAR_LOOP_ENTRY(7)]
    7 start: STAR_LOOP_ENTRY      [STAR_BLOCK_START(5), LOOP_END(8)]
    4 start: BASIC                [RULE_START(2)]
    2 x    : RULE_START           [BASIC(15)]
   15 x    : BASIC                [BASIC(16)]
    4 start: BASIC                [RULE_START(2)]
    9 start: STAR_LOOP_BACK       [STAR_LOOP_ENTRY(7)]
    4 start: BASIC                [RULE_START(2)]
    2 x    : RULE_START           [BASIC(15)]
   15 x    : BASIC                [BASIC(16)]
    4 start: BASIC                [RULE_START(2)]
    9 start: STAR_LOOP_BACK       [STAR_LOOP_ENTRY(7)]
   11 start: PLUS_BLOCK_START     [BASIC(10)]
   10 start: BASIC                [BLOCK_END(12)]
   13 start: PLUS_LOOP_BACK       [PLUS_BLOCK_START(11), LOOP_END(14)]
   -1 x    : BASIC                []

In state 0, the only possible next state is 7, which is indeed the second state. However, state 7 predicts that the follow-up state is either 5 (STAR_BLOCK_START) or 8 (LOOP_END), but instead we continue with state 4. Then in state 4 the only transition points to state 2, which is "correct", as 2 is indeed the next state. However, after this, state 15 predicts state 16, but we end up in state 4. And so on...
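To make the mismatch concrete, here is a minimal sketch that replays the output table and flags every step whose next state is not a direct transition target of the current one. The `observed` list is copied by hand from the table above; the helper name `find_gaps` is only illustrative and is not part of the ANTLR runtime.

```python
# (state number, target state numbers listed in its transitions),
# copied by hand from the output table above.
observed = [
    (0, [7]), (7, [5, 8]), (4, [2]), (2, [15]), (15, [16]),
    (4, [2]), (9, [7]), (4, [2]), (2, [15]), (15, [16]),
    (4, [2]), (9, [7]), (11, [10]), (10, [12]), (13, [11, 14]),
    (-1, []),
]


def find_gaps(rows):
    """Return (current, next) pairs where next is not a listed target of current."""
    gaps = []
    for (cur, targets), (nxt, _) in zip(rows, rows[1:]):
        if nxt not in targets:
            gaps.append((cur, nxt))
    return gaps


gaps = find_gaps(observed)
for cur, nxt in gaps:
    print(f"{cur:3} -> {nxt:3} is not a direct transition")
```

Of the 15 consecutive steps in the table, 9 are flagged this way, starting with 7 -> 4.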

I know that this is an internal API and that it seems to work well. But I don't understand why some state changes are "missing" (especially since being notified of LOOP_END would be quite useful for me). Furthermore, I'd like to understand the generated code a bit better :)

Thanks in advance!

Answered by kaby76

kaby76 · 2023-08-01T11:34:43Z

It appears from your output that there are steps missing between NFA state 15 and NFA state 4. From NFA state 15, on input "X", the transitions are NFA 15->16->3->(pop stack)->6->9->7->5->4. The interpreter likely works through this sequence, using closure() on the next NFA state, before calling the code that sets the state number. The transitions between 16 and 4 are all empty transitions, apart from the stack pop, and the interpreter likely just ends up in NFA state 7. But because the next input is again "X" and we are in the middle of "start", a DFA state computed for rule "start" on "X" from NFA 7 is already cached, so the interpreter goes to NFA 4. The only way to really understand this code is to debug the calls to AdaptivePredict(). Another method that sometimes helps in understanding the interpreter is to turn on the "newish" parser trace option and try to follow along with the output. However, in this example, the trace is unrevealing.
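The sequence quoted above can be sketched as a walk over a small hand-written graph. This is a toy model only: the edges are copied from the sequence in this answer and from the output table in the question, state 3 is assumed to be the stop state of rule x, and `silent_walk` and `decide_on_x` are hypothetical names rather than ANTLR runtime functions. It does not reproduce what the real interpreter does internally.

```python
from typing import Callable, Dict, List

# Hand-copied toy edges that consume no input. NOT read from a real ATN object.
EPSILON_EDGES: Dict[int, List[int]] = {
    16: [3],     # end of rule x's body -> stop state of x (assumed)
    6: [9],      # state reached after returning from x -> STAR_LOOP_BACK
    9: [7],      # STAR_LOOP_BACK -> STAR_LOOP_ENTRY
    7: [5, 8],   # STAR_LOOP_ENTRY -> STAR_BLOCK_START or LOOP_END (a decision)
    5: [4],      # STAR_BLOCK_START -> BASIC state that invokes rule x
}
RULE_STOP_STATES = {3}


def silent_walk(start: int, return_stack: List[int],
                decide: Callable[[int, List[int]], int]) -> List[int]:
    """Follow edges that consume no input, popping the return stack at rule stops."""
    path = [start]
    state = start
    stack = list(return_stack)
    while True:
        if state in RULE_STOP_STATES and stack:
            state = stack.pop()  # return to the caller's follow state
        elif state in EPSILON_EDGES:
            targets = EPSILON_EDGES[state]
            state = targets[0] if len(targets) == 1 else decide(state, targets)
        else:
            break  # the next edge would consume input or invoke a rule
        path.append(state)
    return path


def decide_on_x(state: int, targets: List[int]) -> int:
    # Stand-in for the prediction step: with another "X" ahead, stay in the loop.
    return 5


path = silent_walk(16, return_stack=[6], decide=decide_on_x)
print(" -> ".join(str(s) for s in path))
```

With 6 on the return stack and the loop body chosen at state 7, the walk visits 16, 3, 6, 9, 7, 5 and 4, which matches the sequence in the answer after the "X" has been consumed between 15 and 16.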

2 replies

renatahodovan Aug 4, 2023
Author

@kaby76 Thanks a lot for your prompt and detailed answer. It clarified a lot, but I'll need to do some further debugging ofc. Btw, can I generate figures like the ones above myself? Is it some hidden ANTLRv4 feature that I'm unaware of?

kaby76 Aug 4, 2023

For the .dot files, use the -atn option of the command-line tool; see https://github.com/antlr/antlr4/blob/dev/doc/tool-options.md. I then go to https://dreampuf.github.io/GraphvizOnline/, paste the Dot code into the LHS pane, and select the "Download" button in the RHS pane of the webpage to get the .svg file.
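As an alternative to the web page, the conversion could also be scripted locally. The following is a minimal sketch, assuming Graphviz is installed and its `dot` executable is on the PATH; the helper names and the file name `Loop.start.dot` in the usage note are illustrative assumptions, not something the ANTLR tool documentation guarantees.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def dot_to_svg_command(dot_file: Path) -> List[str]:
    """Build the Graphviz command that renders one .dot file as .svg."""
    svg_file = dot_file.with_suffix(".svg")
    return ["dot", "-Tsvg", str(dot_file), "-o", str(svg_file)]


def render_all(directory: Path) -> List[Path]:
    """Render every .dot file in `directory`; return the .svg paths written."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' executable not found on PATH")
    written = []
    for dot_file in sorted(directory.glob("*.dot")):
        subprocess.run(dot_to_svg_command(dot_file), check=True)
        written.append(dot_file.with_suffix(".svg"))
    return written
```

After running the ANTLR tool with -atn, calling `render_all(Path("."))` in the output directory would convert whatever .dot files were produced there, for example a hypothetical `Loop.start.dot` into `Loop.start.svg`.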
