Design

Parsing

Requirements

The parser has to have the following properties:

It never fails, no matter the input. At worst, it produces a token stream without any structural information.
It keeps all the information required to reconstruct the original document. This involves all out-of-grammar tokens like whitespaces, comments and conditional compilation.
It has to have several entry points in order to support incremental compilation.

Output

If we want to use the parser for Haxe compilation, it has to support outputting the Haxe AST. For display support, more information is required. In particular, all tokens, including whitespaces and other out-of-grammar tokens, must be represented. To this end, we suggest the following output:

A list of tokens that represent the source document.
A parse tree that represents the syntactic structure and references the token stream.

Haxe program:

class C {
	static var x:Int;
}

Token list:

 0 CLASS
 1 WHITESPACE
 2 IDENT
 3 WHITESPACE
 4 BROPEN
 5 NEWLINE
 6 WHITESPACE
 7 STATIC
 8 WHITESPACE
 9 VAR
10 WHITESPACE
11 IDENT
12 COLON
13 IDENT
14 SEMICOLON
15 NEWLINE
16 BRCLOSE
17 EOF

Because tokens are most of the time processed sequentially, representing them as a double-linked list is natural:

class Token {
	var text:String;
	var nextToken:Token;
	var previousToken:Token;
}

Parse tree

file:
- package:
- decls: [
  - class_decl:
    - T0
    - T2
    - type_decl_parameters: None
    - class_relations: []
    - T4
    - fields: [
      - var_field:
        
        modifiers: [
        
        T7
        
        ]
        
        T9
        
        T11
        
        type_hint:
        
        Some
        
        T12
        
        T13
        
        assignment: None
        
        T14
    - ]
    - T16
- T17

Token relations

There is no explicit reference of out-of-grammar tokens in the parse tree. Instead, there's an implicit relation between out-of-grammar tokens and grammar tokens through the token list. We describe this relation through the following algorithm:

Add out-of-grammar tokens to list [Lead] until a grammar token is found.
Consume the grammar token, reference to it as [Token].
Add out-of-grammar tokens to list [Trail] until a grammar token or a newline token is found.
- If it is a grammar token, emit ([Lead], [Token], [Trail]) and goto 2.
- If it is a newline token [Newline], emit ([Lead], [Token], [Trail] + [Newline]) and goto 1.

This algorithm can be executed by the converter which translates the parse tree to a higher-level representation. In that representation, grammar tokens might have an explicit relation to out-of-grammar tokens in the form of "trivia" or otherwise.

Higher-level representation

The higher-level representation is built on top of the token list/parse tree and provides an interface to obtain information and modify the underlying structure. As such, its structure should be immutable and all changes should go through an API that interacts with the underlying representation.

Tokens

We distinguish the tokens provided by the parser from their higher-level representation by calling the latter TokenInfo. Its structure could look like this:

class TokenInfo {
	var token:Token; // from the parser
	var leadingTrivia:Array<Token>;
	var trailingTrivia:Array<Token>;
}

Nodes

The representation of nodes is largely similar to the current ParseTree structure. However, in order to make it immutable, all occurrences of Array and others has to be replaced with a handle-type that is supported by the API. For example:

EBlock(braceOpen:Token, elems:ArrayHandle<BlockElement>, braceClose:Token);

With an API like this:

class NodeApi {
	public function insert<T>(handle:ArrayHandle<T>, offset:Int, value:T);
	public function replace<T>(handle:ArrayHandle<T>, offset:Int, value:T);
	public function delete<T>(handle:ArrayHandle<T>, offset:Int);
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly