This repository has been archived by the owner on Aug 21, 2018. It is now read-only.

Token cache

Simon Krajewski edited this page Mar 8, 2017 · 1 revision


The token cache is a data structure that keeps track of the tokens encountered when parsing a document. Such tokens typically come from the lexer, but through error recovery further tokens may be added.

Initially, the cache contains two nodes START and END with START pointing to END and END pointing to itself:

[ START ] -> [ END ] -o
^            ^        |
FIRST        |        |
INSERT       o--------o
SHIFT

There are three pointers into the linked list:

  • FIRST: Always points to the START node.
  • INSERT: Points to the node after which new tokens are inserted.
  • SHIFT: Points to the node whose successor's token is to be shifted.
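The initial layout can be sketched as follows. This is an illustrative model only; the class and variable names (`Node`, `first`, `insert`, `shift`) are not taken from the actual implementation.

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

# END points to itself; START points to END.
end = Node("END")
end.next = end
start = Node("START", end)

first = start    # FIRST always points to START
insert = start   # new tokens are inserted after this node
shift = start    # the successor's token is the next to be shifted
```

All three pointers begin at the START node, matching the diagram above.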

Inserting

When a token is read from the lexer, a new node holding it is linked in directly after the INSERT node, and INSERT is advanced to the new node:

[ START ] -> [ whitespace ] -> [ END ]
^            ^
FIRST        INSERT
SHIFT

[ START ] -> [ whitespace ] -> [ TOKEN ] -> [ END ]
^                              ^
FIRST                          INSERT
SHIFT
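A minimal sketch of the insert operation, under the same illustrative `Node` model as above (names are assumptions, not the real implementation):

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

def insert_token(insert, token):
    """Link a new node directly after INSERT; the new node becomes INSERT."""
    node = Node(token, insert.next)
    insert.next = node
    return node

end = Node("END")
end.next = end
start = Node("START", end)

insert = start
insert = insert_token(insert, "whitespace")
insert = insert_token(insert, "TOKEN")
# List is now: START -> whitespace -> TOKEN -> END
```

Because the new node splices in between INSERT and its successor, END never has to be touched when appending.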

Shifting

When the parser performs a shift operation on a given token TOKEN, SHIFT is advanced until it points to the predecessor of the node holding TOKEN. This skips whitespace and other out-of-grammar tokens:

[ START ] -> [ whitespace ] -> [ TOKEN ] -> [ END ]
^            ^                 ^
FIRST        SHIFT             INSERT

Keeping SHIFT one node behind ensures that it never points to the END node, which would require updating it manually when a new insertion is made.
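The shift advance might look like this sketch (again using the illustrative `Node` model; the function name is an assumption):

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

def advance_shift(shift, token):
    """Advance SHIFT until its successor holds `token`, skipping
    whitespace and other out-of-grammar tokens in between."""
    while shift.next.token != token:
        shift = shift.next
    return shift

# START -> whitespace -> TOKEN -> END
end = Node("END")
end.next = end
tok = Node("TOKEN", end)
ws = Node("whitespace", tok)
start = Node("START", ws)

shift = advance_shift(start, "TOKEN")
# SHIFT now points at the whitespace node, the predecessor of TOKEN.
```

Stopping one node short means SHIFT always has a live successor, which is exactly the invariant the paragraph above describes.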

Skipping tokens

There are two cases in which tokens are skipped:

  1. Whitespace, comments, etc. are skipped while originally retrieving a physical token. In this case, we associate the number of skipped tokens with the physical token. This extends to preprocessor directives and their arguments, such as #if and #line.
  2. Physical tokens can be skipped as part of the error recovery. In this case they are rejected by the parser and thus never shifted. We record the number of skipped tokens associated with the skipped token (from 1.) in the token provider context, and increase it by an additional 1 to account for the skipped token itself.

Whenever a shift operation is performed, we record the current number of skipped tokens, which is the sum of the skipped tokens associated with the shifted token (from 1.) and the number of skipped physical tokens from the token provider context (from 2.). We reset the latter value to 0 at this point.
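The bookkeeping can be sketched as simple arithmetic; the `ctx` dictionary below stands in for the token provider context and is purely illustrative:

```python
def record_shift(token_skipped, ctx):
    """Total skips recorded at a shift: the lexer-level skips attached to
    the shifted token (case 1) plus the recovery-level skips accumulated
    in the token provider context (case 2), which is then reset to 0."""
    total = token_skipped + ctx["skipped"]
    ctx["skipped"] = 0
    return total

ctx = {"skipped": 0}

# Error recovery rejects a token that itself had 2 skips attached:
# record its 2 attached skips plus 1 for the rejected token itself.
ctx["skipped"] += 2 + 1

# The next shifted token had 1 whitespace token attached to it.
total = record_shift(1, ctx)
# total == 4, and the context counter is back to 0.
```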

Patching

Due to the way we implement the parser, it is necessary to patch the token cache in some situations:

  • If a token is inserted (either legally in case of the optional ; after }, or in general during recovery), it is inserted after SHIFT and SHIFT is immediately advanced.
  • If a token is split up (various cases related to >, such as >>), three things happen:
    1. The token of the node after SHIFT is replaced by the first part of the split (currently always >).
    2. A pseudo-shift operation is performed in order to advance the SHIFT pointer.
    3. The second part of the split is inserted after SHIFT.
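The three splitting steps might be sketched like this (illustrative `Node` model and function name; the real implementation may differ):

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

def split_after_shift(shift, first_part, second_part):
    node = shift.next
    node.token = first_part                      # 1. replace with the first part
    shift = node                                 # 2. pseudo-shift advances SHIFT
    shift.next = Node(second_part, shift.next)   # 3. insert the second part
    return shift

# START -> ">>" -> END
end = Node("END")
end.next = end
gg = Node(">>", end)
start = Node("START", gg)

shift = split_after_shift(start, ">", ">")
# List is now: START -> ">" -> ">" -> END, with SHIFT at the first ">".
```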

There is a special case if SHIFT equals INSERT when an insertion is made: INSERT has to be set to the inserted node in this case to avoid the following situation:

[ START ] -> [ TOKEN ] -> [ INSERTED TOKEN ] -> [ END ]
^            ^
FIRST        INSERT
             SHIFT

This would cause subsequent insertions to be made in front of the INSERTED TOKEN node.
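A sketch of the patched insertion, including the SHIFT == INSERT special case (illustrative names, not the actual implementation):

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

def patch_insert(shift, insert, token):
    """Insert `token` after SHIFT and advance SHIFT immediately.
    If SHIFT and INSERT coincide, INSERT must move to the new node too,
    or later insertions would land in front of it."""
    node = Node(token, shift.next)
    shift.next = node
    if shift is insert:
        insert = node
    shift = node
    return shift, insert

# START -> TOKEN -> END, with SHIFT == INSERT at TOKEN
end = Node("END")
end.next = end
tok = Node("TOKEN", end)
start = Node("START", tok)

shift, insert = patch_insert(tok, tok, "INSERTED")
# INSERT now points at the INSERTED node, so the next insertion
# goes after it, not in front of it.
```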

Traversal

The list can be recursively traversed from the START node until the END node is encountered.
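A traversal sketch; since END is the only node that points to itself, the self-loop can serve as the termination test (an assumption of this illustrative model):

```python
class Node:
    def __init__(self, token, next=None):
        self.token = token
        self.next = next

def tokens(start):
    """Collect tokens from START until the END node is encountered."""
    node = start.next
    out = []
    while node.next is not node:  # END is the only self-looping node
        out.append(node.token)
        node = node.next
    return out

end = Node("END")
end.next = end
b = Node("b", end)
a = Node("a", b)
start = Node("START", a)

result = tokens(start)
# result == ["a", "b"]
```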
