V2 Parser: Add scanner #340

kubukoz · 2023-10-02T00:22:21Z

Adds a scanner for all language tokens.

The design is inspired by rowan, although most of that will only become visible in the parser stage, which will come later.

This scanner is:

- total: there is no input that should fail to produce tokens.
- lossless: you can render the tokens back to the original code they were produced from. Note that full UTF-8 isn't explicitly supported, so code points that don't fit in a single char may break here. However, preliminary property-based testing hasn't shown any cases that would fail to parse and re-render.
- full fidelity: the tokens include comments and whitespace.
  - worth noting, newlines are treated specially: normally a whitespace token consists of any number of whitespace characters, but newlines form their own tokens. See the tests for examples; the main reason for this is to make sync points easier in the parser (by relying on newline tokens).
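
To make the three properties above concrete, here is a minimal sketch of a scanner with the same shape. It is not the code from this PR: `TokenKind`, `Token`, `scan` and `render` are hypothetical names, and only identifiers, whitespace, newlines and `//` comments are recognized. Anything else becomes a one-character error token, which is what keeps the function total.

```scala
// Hypothetical token model for illustration; the PR's actual types may differ.
sealed trait TokenKind
object TokenKind {
  case object Ident extends TokenKind
  case object Space extends TokenKind   // run of non-newline whitespace
  case object Newline extends TokenKind // always a token of its own
  case object Comment extends TokenKind // "//" up to (not including) the newline
  case object Error extends TokenKind   // unrecognized input, keeps the scanner total
}

final case class Token(kind: TokenKind, text: String)

def scan(input: String): List[Token] = {
  val out = List.newBuilder[Token]
  var i = 0

  // Consume the current char plus every following char matching `p`, then emit a token.
  def takeWhile(kind: TokenKind)(p: Char => Boolean): Unit = {
    val start = i
    i += 1
    while (i < input.length && p(input.charAt(i))) i += 1
    out += Token(kind, input.substring(start, i))
  }

  def isSpace(c: Char) = c.isWhitespace && c != '\n'
  def isIdent(c: Char) = c.isLetterOrDigit || c == '_'

  while (i < input.length) {
    val c = input.charAt(i)
    if (c == '\n') { out += Token(TokenKind.Newline, "\n"); i += 1 }
    else if (isSpace(c)) takeWhile(TokenKind.Space)(isSpace)
    else if (input.startsWith("//", i)) takeWhile(TokenKind.Comment)(_ != '\n')
    else if (isIdent(c)) takeWhile(TokenKind.Ident)(isIdent)
    else takeWhile(TokenKind.Error)(_ => false) // one unknown char per error token
  }
  out.result()
}

// Lossless: concatenating the token texts reproduces the input exactly.
def render(tokens: List[Token]): String = tokens.map(_.text).mkString
```

With this sketch, `scan("a \n b").map(_.text)` yields `List("a", " ", "\n", " ", "b")`: the whitespace around the newline is split so that the newline stands alone, and `render(scan(s)) == s` holds for any `s` because every consumed character ends up in exactly one token.
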
Scanner
- Single-char tokens (punctuation)
  - missed: colons
- Identifiers
- Single-line comments
- Multi-character tokens (keywords)
  - Keywords after comments, after whitespace etc.
- String literals (unescaped, multi-line: consistent with current parser)
- Numeric literals (full JSON number syntax)
  - Using Cats Parse for this, we can switch to a custom implementation later on. I'm not gonna waste hours of my life just to avoid using a third-party library ;P
- Boolean/null literals
- Parity testing
  - include scanner test in all generative tests
  - test for a non-empty list of non-error tokens
  - any error tokens in valid inputs should be reported as test failures
- support arbitrary utf-8 codepoints?
  - Maybe later. For now, not a priority.
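
For the numeric-literal item, "full JSON number syntax" means an optional minus sign, an integer part without leading zeros, an optional fraction and an optional exponent. The PR delegates this to Cats Parse; the sketch below swaps in a plain regular expression instead, purely to show which prefixes a scanner would accept. `JsonNumber` and `numberPrefix` are hypothetical names, not part of the PR.

```scala
import scala.util.matching.Regex

// JSON number grammar: -? int frac? exp?
// Illustration only; the PR itself uses Cats Parse rather than a regex.
val JsonNumber: Regex =
  """-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?""".r

// The JSON-number prefix of `input` starting at `offset`, if there is one.
def numberPrefix(input: String, offset: Int = 0): Option[String] =
  JsonNumber.findPrefixOf(input.substring(offset))
```

Under these assumptions, `numberPrefix("1.5e-3,")` returns `Some("1.5e-3")`, `numberPrefix("01")` returns `Some("0")` (a leading zero ends the literal), and `numberPrefix(".5")` returns `None`. A scanner built this way would emit the matched prefix as a number token and continue scanning from the character after it.
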

kubukoz added 14 commits October 2, 2023 02:22

Begin work on a new scanner

42f4926

Support colons

f491bba

Add a more complex test case

84e42a3

Add scanTestReverse

c00a59a

Comment out currently unsupported token

a6782e0

Add keyword tokens

c69a352

Rework error matching

d117fc2

Add support for multi-char keywords

1ff0060

Add more complex cases

6dd09d7

cleanup

3afcf44

Support string literals

fb0ee2e

Add test for multiline string

b7ac39e

Add test against real input

de8e3cc

Import all syntax for less verbosity

ac89ad3

kubukoz mentioned this pull request Oct 2, 2023

Scala 3 prep: drop cats-tagless-macros #341

Merged

kubukoz added 7 commits October 3, 2023 03:28

No need for kinds

fab0983

Support number literals

9526d1a

Add parity test for scanner

b35bbf7

Also check negative cases

1b205d3

Merge branch 'main' into parser-v2

2afbdc9

Tiny simplification + comment

54f7a00

Split scanner suites

dcf9e76

kubukoz mentioned this pull request Oct 3, 2023

V2 parser #343

Open

8 tasks

kubukoz changed the title ~~V2 Parser~~ V2 Parser: Add scanner Oct 3, 2023

kubukoz marked this pull request as ready for review October 3, 2023 02:15

Merge branch 'main' into parser-v2

13f23c1

kubukoz merged commit 7d9e6b1 into main Oct 4, 2023
4 checks passed

kubukoz deleted the parser-v2 branch October 4, 2023 00:56
