-
Notifications
You must be signed in to change notification settings - Fork 2
Token cache
The token cache is a data structure that keeps track of the tokens encountered when parsing a document. Such tokens typically come from the lexer, but through error recovery further tokens may be added.
Initially, the cache contains two nodes START and END with START pointing to END and END pointing to itself:
[ START ] -> [ END ] -o
^ ^ |
FIRST | |
INSERT o--------o
SHIFT
There are three pointers into the linked list:
- FIRST: Always points to the START node.
- INSERT: Points to the node after which new tokens are inserted.
- SHIFT: Points to the node whose successor's token is to be shifted.
When a token is read from the lexer, it is inserted one node after INSERT and INSERT is advanced:
[ START ] -> [ whitespace ] -> [ END ]
^ ^
FIRST INSERT
SHIFT
[ START ] -> [ whitespace ] -> [ TOKEN ] -> [ END ]
^ ^
FIRST INSERT
SHIFT
When the parser performs a shift operation on a given token TOKEN, SHIFT is advanced until it points to the node that is the predecessor of the node holding TOKEN. This skips whitespaces and other out-of-grammar tokens:
[ START ] -> [ whitespace ] -> [ TOKEN ] -> [ END ]
^ ^ ^
FIRST SHIFT INSERT
Keeping SHIFT one node behind ensures that it never points to the END node, which would require updating it manually when a new insertion is made.
There are two cases in which tokens are skipped:
- Whitespaces, comments etc. are skipped while originally retrieving a physical token. In this case, we associate the number of skipped tokens with the physical token. This extends to preprocessor directives and arguments, such as
#if
and#line
. - Physical tokens can be skipped as part of the error recovery. In this case they are rejected by the parser and thus never shifted. We record the number of skipped tokens associated with the skipped token (from 1.) in the token provider context, and increase it by additional 1 to account for the skipped token itself.
Whenever a shift operation is performed, we record the current number of skipped tokens, which is the sum of the skipped tokens associated with the shifted token (from 1.) and the number of skipped physical tokens from the token provider context (from 2.). We reset the latter value to 0 at this point.
Due to the way we implement the parser, it is necessary to patch the token cache in some situations:
- If a token is inserted (either legally in case of the optional
;
after}
, or in general during recovery), it is inserted after SHIFT and SHIFT is immediately advanced. - If a token is split up (various cases related to
>
, such as>>
), three things happen:- The node's token after SHIFT is replaced by the first part of the split (currently always
>
). - A pseudo-shift operation is performed in order to advance the SHIFT pointer.
- The second part of the split is inserted after SHIFT.
- The node's token after SHIFT is replaced by the first part of the split (currently always
There is a special case if SHIFT equals INSERT when an insertion is made: INSERT has to be set to the inserted node in this case to avoid the following situation:
[ START ] -> [ TOKEN ] -> [ INSERTED TOKEN ] -> [ END ]
^ ^
FIRST INSERT
SHIFT
This would cause subsequent insertions to be made in front of the INSERTED TOKEN node.
The list can be recursively traversed from the START node until the END node is encountered.