Design
The parser has to have the following properties:
- It never fails, no matter the input. At worst, it produces a token stream without any structural information.
- It keeps all the information required to reconstruct the original document. This includes all out-of-grammar tokens such as whitespace, comments and conditional compilation.
- It has to have several entry points in order to support incremental compilation.
If we want to use the parser for Haxe compilation, it has to support outputting the Haxe AST. For display support, more information is required. In particular, all tokens, including whitespace and other out-of-grammar tokens, must be represented. To this end, we suggest the following output:
- A list of tokens that represent the source document.
- A parse tree that represents the syntactic structure and references the token stream.
For example, take this source:

    class C {
        static var x:Int;
    }

It is represented by the following token list:

     0 CLASS
     1 WHITESPACE
     2 IDENT
     3 WHITESPACE
     4 BROPEN
     5 NEWLINE
     6 WHITESPACE
     7 STATIC
     8 WHITESPACE
     9 VAR
    10 WHITESPACE
    11 IDENT
    12 COLON
    13 IDENT
    14 SEMICOLON
    15 NEWLINE
    16 BRCLOSE
    17 EOF
Because tokens are mostly processed sequentially, representing them as a doubly linked list is natural:
    class Token {
        var text:String;
        var nextToken:Token;
        var previousToken:Token;
    }
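As a rough illustration of the reconstruction property, the sketch below links a few tokens and rebuilds the source text by walking `nextToken`. The constructor on `Token`, the `TokenListDemo` class and its `link` and `reconstruct` helpers are assumptions made for this example and are not part of the proposed parser.

```haxe
// Sketch only: the constructor and the helper names are assumptions.
class Token {
    public var text:String;
    public var nextToken:Null<Token> = null;
    public var previousToken:Null<Token> = null;

    public function new(text:String) {
        this.text = text;
    }
}

class TokenListDemo {
    // Links the given texts into a doubly linked list and returns the first token.
    public static function link(texts:Array<String>):Null<Token> {
        var first:Null<Token> = null;
        var last:Null<Token> = null;
        for (text in texts) {
            var token = new Token(text);
            if (last == null) {
                first = token;
            } else {
                last.nextToken = token;
                token.previousToken = last;
            }
            last = token;
        }
        return first;
    }

    // Concatenates the text of every token. This yields the original document
    // as long as out-of-grammar tokens are kept in the list.
    public static function reconstruct(first:Null<Token>):String {
        var buf = new StringBuf();
        var current = first;
        while (current != null) {
            buf.add(current.text);
            current = current.nextToken;
        }
        return buf.toString();
    }
}
```

Calling `TokenListDemo.reconstruct(TokenListDemo.link(texts))` with the texts of all tokens, whitespace and newlines included, should return the source unchanged.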
The parse tree for the example references tokens by their index in the token list (T0 is token 0):

- file:
  - package:
  - decls: [
    - class_decl:
      - T0
      - T2
      - type_decl_parameters: None
      - class_relations: []
      - T4
      - fields: [
        - var_field:
          - modifiers: [
            - T7
          - ]
          - T9
          - T11
          - type_hint:
            - Some
              - T12
              - T13
          - assignment: None
          - T14
      - ]
      - T16
  - ]
  - T17
There is no explicit reference to out-of-grammar tokens in the parse tree. Instead, there's an implicit relation between out-of-grammar tokens and grammar tokens through the token list. We describe this relation through the following algorithm:
1. Add out-of-grammar tokens to list [Lead] until a grammar token is found.
2. Consume the grammar token, refer to it as [Token].
3. Add out-of-grammar tokens to list [Trail] until a grammar token or a newline token is found.
4. If it is a grammar token, emit ([Lead], [Token], [Trail]) and goto 2.
5. If it is a newline token [Newline], emit ([Lead], [Token], [Trail] + [Newline]) and goto 1.
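The steps above could be sketched roughly as follows. The `TokenKind` enum, the `SketchToken` and `TokenWithTrivia` typedefs and the `TriviaAttacher` class are hypothetical names introduced for this example. The sketch also assumes that [Lead] and [Trail] start out empty for every emitted triple, and that EOF counts as a grammar token, so the stream always ends with one.

```haxe
// Sketch only: all type and function names here are assumptions.
enum TokenKind {
    Grammar; // keywords, identifiers, punctuation, EOF
    Newline;
    Trivia; // whitespace, comments, conditional compilation
}

typedef SketchToken = {kind:TokenKind, text:String};
typedef TokenWithTrivia = {lead:Array<SketchToken>, token:SketchToken, trail:Array<SketchToken>};

class TriviaAttacher {
    public static function attach(tokens:Array<SketchToken>):Array<TokenWithTrivia> {
        var result:Array<TokenWithTrivia> = [];
        var lead:Array<SketchToken> = [];
        var i = 0;
        while (i < tokens.length) {
            // Step 1: collect out-of-grammar tokens (newlines included) as Lead.
            // After "goto 2" this loop does not run, because a grammar token is next.
            while (i < tokens.length && tokens[i].kind != Grammar) {
                lead.push(tokens[i]);
                i++;
            }
            // Assumption: the stream ends with a grammar token such as EOF.
            // Out-of-grammar tokens left over at the very end are dropped here.
            if (i >= tokens.length) break;
            // Step 2: consume the grammar token.
            var token = tokens[i++];
            // Step 3: collect Trail until a grammar token or a newline is found.
            var trail:Array<SketchToken> = [];
            while (i < tokens.length && tokens[i].kind == Trivia) {
                trail.push(tokens[i]);
                i++;
            }
            // Step 5: a newline ends the trail and belongs to it.
            if (i < tokens.length && tokens[i].kind == Newline) {
                trail.push(tokens[i]);
                i++;
            }
            // Steps 4 and 5: emit the triple and start over with an empty Lead.
            result.push({lead: lead, token: token, trail: trail});
            lead = [];
        }
        return result;
    }
}
```

For the example source, this attaches the indentation on the second line to `static` as leading trivia, while the newline after `{` ends up in the trailing list of `{`.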
This algorithm can be executed by the converter which translates the parse tree to a higher-level representation. In that representation, grammar tokens might have an explicit relation to out-of-grammar tokens in the form of "trivia" or otherwise.
The higher-level representation is built on top of the token list/parse tree and provides an interface to obtain information and modify the underlying structure. As such, its structure should be immutable and all changes should go through an API that interacts with the underlying representation.
We distinguish the tokens provided by the parser from their higher-level representation by calling the latter TokenInfo. Its structure could look like this:
    class TokenInfo {
        var token:Token; // from the parser
        var leadingTrivia:Array<Token>;
        var trailingTrivia:Array<Token>;
    }
The representation of nodes is largely similar to the current ParseTree structure. However, in order to make it immutable, all occurrences of Array and similar mutable types have to be replaced with a handle type that is supported by the API. For example:

    EBlock(braceOpen:Token, elems:ArrayHandle<BlockElement>, braceClose:Token);
With an API like this:
    // Declared as an interface so that the body-less signatures are valid Haxe.
    interface NodeApi {
        function insert<T>(handle:ArrayHandle<T>, offset:Int, value:T):Void;
        function replace<T>(handle:ArrayHandle<T>, offset:Int, value:T):Void;
        function delete<T>(handle:ArrayHandle<T>, offset:Int):Void;
    }
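One possible way to back such handles is sketched below, assuming that a handle is an opaque index into storage owned by the API object. The `ArrayHandle` abstract as written here, the `SketchNodeApi` class and its `create` and `read` methods are assumptions for illustration; only `insert`, `replace` and `delete` come from the signatures above.

```haxe
// Sketch only: a handle is assumed to be an opaque index into API-owned storage.
abstract ArrayHandle<T>(Int) {
    public inline function new(id:Int) {
        this = id;
    }

    public inline function id():Int {
        return this;
    }
}

class SketchNodeApi {
    // Backing arrays; nodes only ever see handles, never these arrays.
    var storage:Array<Array<Any>> = [];

    public function new() {}

    // Hypothetical helper: registers a copy of the elements and returns a handle.
    public function create<T>(initial:Array<T>):ArrayHandle<T> {
        var copy:Array<Any> = [for (v in initial) v];
        storage.push(copy);
        return new ArrayHandle<T>(storage.length - 1);
    }

    // Hypothetical helper: returns a snapshot, so callers cannot mutate the storage.
    public function read<T>(handle:ArrayHandle<T>):Array<T> {
        return [for (v in storage[handle.id()]) (cast v : T)];
    }

    public function insert<T>(handle:ArrayHandle<T>, offset:Int, value:T):Void {
        storage[handle.id()].insert(offset, value);
    }

    public function replace<T>(handle:ArrayHandle<T>, offset:Int, value:T):Void {
        storage[handle.id()][offset] = value;
    }

    public function delete<T>(handle:ArrayHandle<T>, offset:Int):Void {
        storage[handle.id()].splice(offset, 1);
    }
}
```

With this layout, a node such as `EBlock` keeps the same handle value across edits, so the node itself stays immutable while all changes go through the API object.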