Experiment: the new parser based on `winnow` #283

39555 · 2024-11-30T22:50:11Z

winnow is a rethought, user-friendly fork of nom. I played with it for a while, implementing the bash grammar, and I think you might find it interesting to see how it looks!

I managed to build a solid zero-copy parser (except for escaping, where I'm using Cow). It already shows more than a 2x speedup over the current uncached PEG parser and performs on par with the cached version. Basically, it is a hand-rolled parser assembled from winnow building blocks rather than one produced by a framework or a parser generator.

compare_parsers/old_parser_uncached
                        time:   [9.0484 µs 9.0652 µs 9.0831 µs]

compare_parsers/new_parser_uncached
                        time:   [3.4901 µs 3.5102 µs 3.5264 µs]
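
To illustrate the "zero-copy except for escaping" idea, here is a minimal std-only sketch. It is not the code from this PR: the function name `unescape` is hypothetical, and the escape rule is deliberately simplified (a backslash keeps the next character literally), which is much less than what bash actually requires. The point is only that `Cow` lets the common case borrow from the input and allocate solely when an escape is present.

```rust
use std::borrow::Cow;

/// Resolves backslash escapes in a word (simplified rule, hypothetical helper).
/// Borrows the input when there is nothing to unescape.
fn unescape(word: &str) -> Cow<'_, str> {
    // Fast path: no backslash, so the original slice is returned without copying.
    let first = match word.find('\\') {
        Some(idx) => idx,
        None => return Cow::Borrowed(word),
    };

    // Slow path: copy the untouched prefix, then resolve escapes one by one.
    let mut out = String::with_capacity(word.len());
    out.push_str(&word[..first]);
    let mut chars = word[first..].chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // Keep the escaped character literally.
                Some(escaped) => out.push(escaped),
                // A trailing backslash has nothing to escape; keep it as-is.
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

With this shape, `unescape("echo")` hands back a borrowed slice of the input, while `unescape(r"a\ b")` allocates a new `String` holding `a b`.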

Its advantages:

It has a really nice debug trace output!

An example of the trace

> repeat_till                                                                                 | for f in A B C; do\n                echo 163 \"
   > eof                                                                                        | for f in A B C; do\n                echo 163 \"
   < eof                                                                                        | backtrack
   > terminated                                                                                 | for f in A B C; do\n                echo 163 \"
    > cut_err                                                                                   | for f in A B C; do\n                echo 163 \"
     > complete_command                                                                         | for f in A B C; do\n                echo 163 \"
      > repeat                                                                                  | for f in A B C; do\n                echo 163 \"
       > preceded                                                                               | for f in A B C; do\n                echo 163 \"
        > not                                                                                   | for f in A B C; do\n                echo 163 \"
         > line_ending                                                                          | for f in A B C; do\n                echo 163 \"
          > alt                                                                                 | for f in A B C; do\n                echo 163 \"
           > "\n"                                                                               | for f in A B C; do\n                echo 163 \"
           < "\n"                                                                               | backtrack
           > "\r\n"                                                                             | for f in A B C; do\n                echo 163 \"
           < "\r\n"                                                                             | backtrack
          < alt                                                                                 | backtrack
         < line_ending                                                                          | backtrack
        < not                                                                                   | +0
        > and_or                                                                                | for f in A B C; do\n                echo 163 \"
         > pipeline                                                                             | for f in A B C; do\n                echo 163 \"
          > separated_pair                                                                      | for f in A B C; do\n                echo 163 \"
           > opt                                                                                | for f in A B C; do\n                echo 163 \"
            > "!"                                                                               | for f in A B C; do\n                echo 163 \"
            < "!"                                                                               | backtrack
           < opt                                                                                | +0
           > space                                                                              | for f in A B C; do\n                echo 163 \"
            > take_while                                                                        | for f in A B C; do\n                echo 163 \"
            < take_while                                                                        | +0
           < space                                                                              | +0
           > pipe_sequence                                                                      | for f in A B C; do\n                echo 163 \"
            > first_command                                                                     | for f in A B C; do\n                echo 163 \"
             > command                                                                          | for f in A B C; do\n                echo 163 \"
              > alt                                                                             | for f in A B C; do\n                echo 163 \"
               > simple_command                                                                 | for f in A B C; do\n                echo 163 \"
                > opt                                                                           | for f in A B C; do\n                echo 163 \"
                 > cmd_prefix                                                                   | for f in A B C; do\n                echo 163 \"
                  > separated                                                                   | for f in A B C; do\n                echo 163 \"
                   > alt                                                                        | for f in A B C; do\n                echo 163 \"
                    > io_redirect

It allows more imperative code for handling special cases such as here docs.
You can use whatever error type you want. By default it uses the zero-copy ContextError, which carries span and message information. You can attach custom error messages with parser.context("foo"). The author of kdl and miette uses miette in the kdl parser: https://github.com/kdl-org/kdl-rs/blob/05a4c4fce1a25727e15f4e2f873d0e0ca076c328/src/v2_parser.rs#L67
- So we can support ariadne or miette.
It makes few allocations, thanks to the use of &mut everywhere.
I decided to merge tokenizing and parsing into one step. This significantly reduced the amount of code compared to the old tokenizer.

Things that are not fully implemented:

extended_tests and basically any kind of expression. I'm currently working on a PR to support expression parsing in winnow (Pratt parsing support, winnow-rs/winnow#131).
Interpreter integration, although the parser is a drop-in replacement.
Documentation.
Tests: I would like to use snapshot-based testing.
Context errors.

github-actions · 2024-11-30T22:55:13Z

Performance Benchmark Report

Benchmark name	Baseline (μs)	Test/PR (μs)	Delta (μs)	Delta %
`expand_one_string`	`3.43 μs`	`3.44 μs`	`0.01 μs`	`⚪ Unchanged`
`instantiate_shell`	`61.46 μs`	`61.61 μs`	`0.15 μs`	`⚪ Unchanged`
`instantiate_shell_with_init_scripts`	`30100.58 μs`	`30474.19 μs`	`373.61 μs`	`🟠 +1.24%`
`run_echo_builtin_command`	`90.68 μs`	`90.55 μs`	`-0.13 μs`	`⚪ Unchanged`
`run_one_builtin_command`	`108.72 μs`	`109.05 μs`	`0.33 μs`	`⚪ Unchanged`
`run_one_external_command`	`1942.76 μs`	`1944.58 μs`	`1.82 μs`	`⚪ Unchanged`
`run_one_external_command_directly`	`998.83 μs`	`1014.51 μs`	`15.69 μs`	`⚪ Unchanged`

Benchmarks removed:

parse_sample_script
parse_bash_completion

Benchmarks added:

compare_parsers/old_parser_uncached
compare_parsers/new_parser_uncached

reubeno · 2024-12-01T01:59:21Z

This sounds really promising! I haven't looked at your changes yet, but the performance speed-up would be very welcome if we can (over time) get to parity with what we've got working so far and if we feel good about its maintainability moving forward. I'm keen to read up on winnow and learn more.

Have you found you've needed to make significant changes to the AST structs so far? I've seen other projects successfully experiment with alternate parsers (behind a feature flag) and then be able to A/B compare them, and if they find it's the right call, flip over the default when ready.

39555 · 2024-12-01T07:44:54Z

There are no AST changes. The current AST maps perfectly to the yacc POSIX grammar description, so the second parser is a drop-in replacement except for the tokenizer call.

39555 · 2024-12-01T13:37:11Z

I found it easy to resolve issues with this new parser, such as the one we had with heredocs before. Each part of the grammar is completely separated from the rest in its own function, similar to peg's rule macro, except that it operates directly on the stream, so there is no more global state machine like the one the tokenizer has.
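
As a rough illustration of "one function per grammar rule, operating directly on the stream", here is a std-only sketch. It deliberately does not use winnow's API and is not the code from this PR; the rule names and the tiny grammar (words separated by blanks, commands separated by `|`) are assumptions for the example. Each rule takes the remaining input as `&mut &str` and advances it on success, so there is no shared tokenizer state.

```rust
/// Skips spaces and tabs (newlines are left for other rules to handle).
fn spaces(input: &mut &str) {
    let s = *input;
    *input = s.trim_start_matches(|c| c == ' ' || c == '\t');
}

/// word: a run of characters up to whitespace, '|' or ';'.
/// Returns a slice of the original input, so nothing is copied.
fn word<'a>(input: &mut &'a str) -> Option<&'a str> {
    let s: &'a str = *input;
    let end = s
        .find(|c: char| c.is_whitespace() || c == '|' || c == ';')
        .unwrap_or(s.len());
    if end == 0 {
        return None; // nothing matched; the input is left untouched
    }
    let (head, tail) = s.split_at(end);
    *input = tail;
    Some(head)
}

/// simple_command: word (spaces word)*
fn simple_command<'a>(input: &mut &'a str) -> Option<Vec<&'a str>> {
    let mut words = vec![word(input)?];
    loop {
        spaces(input);
        match word(input) {
            Some(w) => words.push(w),
            None => return Some(words),
        }
    }
}

/// pipeline: simple_command ('|' simple_command)*
/// Note: this sketch does not backtrack, so a failure after '|' leaves
/// the input partially consumed.
fn pipeline<'a>(input: &mut &'a str) -> Option<Vec<Vec<&'a str>>> {
    spaces(input);
    let mut commands = vec![simple_command(input)?];
    loop {
        spaces(input);
        let s: &'a str = *input;
        match s.strip_prefix('|') {
            Some(rest) => *input = rest,
            None => return Some(commands),
        }
        spaces(input);
        commands.push(simple_command(input)?);
    }
}
```

Calling `pipeline` on `"ls -l | grep foo; echo done"` yields the two commands and leaves `"; echo done"` in the input for whichever rule handles separators next.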

reubeno · 2024-12-01T18:52:21Z

Moving to a true streaming parser will help other issues we've had (e.g., shopt actually changing tokenizing/parsing semantics for subsequent commands by enabling/disabling extglob or similar).

reubeno · 2024-12-10T18:25:06Z

I haven't looked through the full details of this draft but am supportive of us moving forward with adding this as an experimental secondary parser. That will allow us to incrementally mature it and bring it to full parity, get testing set up in a way that we're happy with, etc. -- and then we could figure out the criteria for switching the default parser.

@39555 -- what do you think about preparing a separate small (first step) PR to enable a command-line option to select an alternate parser, and then just add a stub version of the new parser, and a basic test for it -- without adding all the real implementation? That will allow us to make sure we find all the places that code is directly calling into the parser and tokenizer today, and clean that up so it will be easier to switch it.
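
One possible shape for that first step, sketched with std only: a single entry point that dispatches on a parser selection parsed from a command-line value, with the new parser stubbed out. Every name here (`ParserImpl`, `Program`, `ParseError`, `parse`, and the option values `peg` and `winnow`) is hypothetical and not taken from the existing codebase.

```rust
use std::str::FromStr;

/// Which parser implementation to use (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ParserImpl {
    Peg,
    Winnow,
}

impl FromStr for ParserImpl {
    type Err = String;

    /// Parses the value of a command-line option such as `peg` or `winnow`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "peg" => Ok(Self::Peg),
            "winnow" => Ok(Self::Winnow),
            other => Err(format!("unknown parser: {other}")),
        }
    }
}

/// Placeholder standing in for the real AST type.
#[derive(Debug, PartialEq)]
struct Program(String);

#[derive(Debug, PartialEq)]
enum ParseError {
    NotImplemented,
}

/// Stand-in for the existing parser.
fn parse_with_peg(src: &str) -> Result<Program, ParseError> {
    Ok(Program(src.to_owned()))
}

/// Stub for the new parser: present and selectable, but not implemented yet.
fn parse_with_winnow(_src: &str) -> Result<Program, ParseError> {
    Err(ParseError::NotImplemented)
}

/// Single entry point; callers never reach a concrete parser directly.
fn parse(src: &str, which: ParserImpl) -> Result<Program, ParseError> {
    match which {
        ParserImpl::Peg => parse_with_peg(src),
        ParserImpl::Winnow => parse_with_winnow(src),
    }
}
```

Routing every caller through one function like `parse` is what would surface the places that currently call the parser or tokenizer directly.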

feat: the new parser based on winnow

06311e6

39555 force-pushed the winnow-parser branch from cf02d3b to 06311e6 on November 30, 2024 22:52
