Skip to content

Latest commit

 

History

History
112 lines (86 loc) · 6.71 KB

writing-a-parser.md

File metadata and controls

112 lines (86 loc) · 6.71 KB

Writing a parser

This guide shows how to implement a simple parser using the alfa-parser package. The guide is based on the work done in #1339.

Background

The alfa-parser package contains functionality to make it easy to implement a parser (parsers combinators). This package is used internally by Alfa to parse HTML, CSS etc., but it can be used for any parsing scenario.

For this guide we want to create a parser that parses the content of the HTML attribute autocomplete. The content of the attribute must follow a set of rules as defined in the HTML specification.

Defining the grammar

We refer to the rules defined in the specification as the grammar or the syntax of the autocomplete attribute. It must contain a space separated list of tokens in a specific order, some of which may be optional and some may be required depending on the presence of some preceding token. For example the following is an invalid use of the attribute

<input autocomplete="section-blue billing home" />

This is invalid because the home token must be followed by a required token.

We can express the grammar as follows

[section] [addressType] (unmodifiable | [modifier] modifiable) [webauthn] EOF

where a token name surrounded by square brackets [] means it is an optional token, tokens separated by a vertical line | means exactly one of the tokens must be present and EOF is a special token signalling that no more tokens are allowed.

Here the possible values for the tokens are not listed, but as an example addressType is defined as follows

addressType ::= "shipping" | "billing"

meaning that the token can either have the literal value of shipping or billing.

Implementing the parser

Our aim is to write a parser that consumes an array of tokens (in our case strings) and returns a result of either Ok, with the result and the rest of the tokens, or an Err result with a message containing relevant information on why the parsing failed.

The strategy is:

  • write individual atomic parsers to parse each indivial token using string matching
  • combine the atomic parsers using the combinators from alfa-parser into the final parser in such a way that it encodes the grammar

Start by adding the following imports

import { Parser } from "@siteimprove/alfa-parser";
import { Err, Ok } from "@siteimprove/alfa-result";
import { Slice } from "@siteimprove/alfa-slice";

const { either, end, option, right } = Parser;

The types from alfa-result will be used by the parser to signal success or failure and the Slice type is used by the parser to consume an array of tokens in a memory efficient way. The functions, or combinators, from the Parser namespace that will be used have been stored as consts for convenience.

We will implement two of the "atomic" parsers and leave out the rest as they are very similar. In the actual implementation in #1339 this was done in a more general way where a function generates the parser from an array of literal values.

const section: Parser<Slice<string>, string, string> = (input: Slice<string>) => {
  const token = input.first(); // Get the next token
  
  if (token.isSome() && token.get().startsWith("section-")) { // Check if we got a token and it has a valid literal value
    return Result.of([input.rest(), token.get()]); // Return result containing the rest of the tokens and the parsed token
  }
  
  return Err.of(
    `Expected token beginning with \`section-\`, but got ${token}` // Return an error if we got an invalid token or `None`
  );
}

This code declares and initializes a const of type Parser<Slice<string>, string, string>. The first type parameter is the type of the input, the second is the type of the parsed token and the third type parameter is the error type contained in the Err, typically a string message. In our case the input is a collection of strings represented by Slice<string>, the token type is also string for this example, but if the grammar contained symbols in addition to literals, then it would typically be some union type to represent the possible symbols.

The implemtation of the addressType parser is very similar

const addressType: Parser<Slice<string>, string, string> = (input: Slice<string>) => {
  const token: Option<string> = input.first(); 
  
  if (token.isSome() && (token.get() === "shipping" || token.get() === "billing")) { 
    return Result.of([input.rest(), token.get()]); 
  }
  
  return Err.of(`Expected \`shipping\` or \`billing\`, but got ${token}`); 
}

Now let's imagine we also implemented the other parsers in a similar fashion. We are now ready to combine all the parsers using the parser combinators option, either, right and end to get the final parser.

const parse = right(
  option(section), // The attribute should optionally begin with a section token
  right(
    option(addressType), // Next, optionally an addressType token
    right(
      either( // Then, required, either an unmodifiable token or an optional modifier token followed by a modifiable token
        unmodifiable,
        right(option(modifier), modifiable)
      ),
      right(
        option(webauthn), // Finally, an optional webauthn token
        end((token) => `Expected EOF, but got ${token}`) // At this point no more tokens are allowed
      )
    )
  )
);

This code uses the right combinator to combine two parsers in such a way that only the output of the right one is passed on to next parser meaning that in the end, if the parsing was successful, the parser will only output the last parsed token which in our case will be EOF. This is because we are only interested in knowing if the parsing was successful and we don't really care about the output. Typically however the parser would output some kind of abstract representation of the input that could then be processed further.

We have now implemented a parser, so let's see how to use it

const result1 = parse(Slice.of(["section-blue", "shipping", "home", "tel", "webauthn"]));
console.dir(result1.isOk()); // prints true

const result2 = parse(Slice.of(["section-blue", "shipping", "home", "webauthn"]));
console.dir(result2.isOk()); // prints false since "webauthn" is not allowed to follow "home", a required token is missing

Note: We have left out the pre-processing of the raw input into a format that the parser can work with because it is so simple in this example, but typically this can be more complicated, in which case implementing a lexer might be necessary.