Experiment: the new parser based on winnow #283

Draft
wants to merge 1 commit into
base: main

39555
Contributor

@39555 39555 commented Nov 30, 2024

winnow is the rethought, user-friendly fork of nom. I played with it for some time while implementing the bash grammar, and I think you might find it interesting to see how it looks!

I managed to build a solid zero-copy parser (except for escaping, where I'm using Cow). It is already more than 2x faster than the current uncached PEG parser (about 2.6x in the benchmark below) and performs on par with the cached version. Basically, it is a hand-rolled parser built from winnow building blocks rather than on top of a framework or a parser generator.

compare_parsers/old_parser_uncached
                        time:   [9.0484 µs 9.0652 µs 9.0831 µs]

compare_parsers/new_parser_uncached
                        time:   [3.4901 µs 3.5102 µs 3.5264 µs]
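
For readers unfamiliar with the "zero-copy except for escaping" idea mentioned above, here is a minimal std-only sketch. It is not the code from this PR and does not use winnow; `unescape` is a hypothetical helper, and the escape handling is deliberately simplified (it does not follow bash's full quoting rules, e.g. backslash-newline continuation).

```rust
use std::borrow::Cow;

/// Hypothetical helper (not from the PR): strips backslash escapes from a word.
/// Borrows from `input` when there is nothing to unescape and allocates only
/// when at least one backslash is present.
fn unescape(input: &str) -> Cow<'_, str> {
    if !input.contains('\\') {
        // Fast path: no escapes, so return the original slice (zero-copy).
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Keep the escaped character literally; a trailing lone
            // backslash is kept as-is.
            match chars.next() {
                Some(next) => out.push(next),
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

With this shape, `unescape("hello")` hands back a borrowed slice, while `unescape(r"a\ b")` allocates a new `String` holding `a b`.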

Its advantages:

  • It has a really nice debug trace output!
An example of the trace

> repeat_till                                                                                 | for f in A B C; do\n                echo 163 \"
   > eof                                                                                        | for f in A B C; do\n                echo 163 \"
   < eof                                                                                        | backtrack
   > terminated                                                                                 | for f in A B C; do\n                echo 163 \"
    > cut_err                                                                                   | for f in A B C; do\n                echo 163 \"
     > complete_command                                                                         | for f in A B C; do\n                echo 163 \"
      > repeat                                                                                  | for f in A B C; do\n                echo 163 \"
       > preceded                                                                               | for f in A B C; do\n                echo 163 \"
        > not                                                                                   | for f in A B C; do\n                echo 163 \"
         > line_ending                                                                          | for f in A B C; do\n                echo 163 \"
          > alt                                                                                 | for f in A B C; do\n                echo 163 \"
           > "\n"                                                                               | for f in A B C; do\n                echo 163 \"
           < "\n"                                                                               | backtrack
           > "\r\n"                                                                             | for f in A B C; do\n                echo 163 \"
           < "\r\n"                                                                             | backtrack
          < alt                                                                                 | backtrack
         < line_ending                                                                          | backtrack
        < not                                                                                   | +0
        > and_or                                                                                | for f in A B C; do\n                echo 163 \"
         > pipeline                                                                             | for f in A B C; do\n                echo 163 \"
          > separated_pair                                                                      | for f in A B C; do\n                echo 163 \"
           > opt                                                                                | for f in A B C; do\n                echo 163 \"
            > "!"                                                                               | for f in A B C; do\n                echo 163 \"
            < "!"                                                                               | backtrack
           < opt                                                                                | +0
           > space                                                                              | for f in A B C; do\n                echo 163 \"
            > take_while                                                                        | for f in A B C; do\n                echo 163 \"
            < take_while                                                                        | +0
           < space                                                                              | +0
           > pipe_sequence                                                                      | for f in A B C; do\n                echo 163 \"
            > first_command                                                                     | for f in A B C; do\n                echo 163 \"
             > command                                                                          | for f in A B C; do\n                echo 163 \"
              > alt                                                                             | for f in A B C; do\n                echo 163 \"
               > simple_command                                                                 | for f in A B C; do\n                echo 163 \"
                > opt                                                                           | for f in A B C; do\n                echo 163 \"
                 > cmd_prefix                                                                   | for f in A B C; do\n                echo 163 \"
                  > separated                                                                   | for f in A B C; do\n                echo 163 \"
                   > alt                                                                        | for f in A B C; do\n                echo 163 \"
                    > io_redirect

  • It allows for more imperative code to control special cases such as here docs.

  • You can use whatever error type you want. By default it uses the zero-copy ContextError, which carries span and message information. You can attach custom error messages with parser.context("foo"). The author of kdl and miette uses miette in the kdl parser: https://github.com/kdl-org/kdl-rs/blob/05a4c4fce1a25727e15f4e2f873d0e0ca076c328/src/v2_parser.rs#L67

    • So we can support ariadne or miette
  • A small number of allocations, thanks to the use of &mut everywhere.

  • I decided to merge tokenizing and parsing into one step. This has significantly reduced the amount of code compared to the old tokenizer.

Things that are not fully implemented:

  • extended_tests and basically any kind of expression. I'm currently working on a PR to support expression parsing in winnow: "Pratt parsing support" (winnow-rs/winnow#131).
  • Interpreter integration, although the parser is a drop-in replacement.
  • Documentation.
  • Snapshot-based testing, which I would like to use.
  • Context errors.


github-actions bot commented Nov 30, 2024

Performance Benchmark Report

| Benchmark name | Baseline (μs) | Test/PR (μs) | Delta (μs) | Delta % |
|---|---|---|---|---|
| expand_one_string | 3.43 μs | 3.44 μs | 0.01 μs | ⚪ Unchanged |
| instantiate_shell | 61.46 μs | 61.61 μs | 0.15 μs | ⚪ Unchanged |
| instantiate_shell_with_init_scripts | 30100.58 μs | 30474.19 μs | 373.61 μs | 🟠 +1.24% |
| run_echo_builtin_command | 90.68 μs | 90.55 μs | -0.13 μs | ⚪ Unchanged |
| run_one_builtin_command | 108.72 μs | 109.05 μs | 0.33 μs | ⚪ Unchanged |
| run_one_external_command | 1942.76 μs | 1944.58 μs | 1.82 μs | ⚪ Unchanged |
| run_one_external_command_directly | 998.83 μs | 1014.51 μs | 15.69 μs | ⚪ Unchanged |

Benchmarks removed:

  • parse_sample_script
  • parse_bash_completion

Benchmarks added:

  • compare_parsers/old_parser_uncached
  • compare_parsers/new_parser_uncached

@reubeno
Owner

reubeno commented Dec 1, 2024

This sounds really promising! I haven't looked at your changes yet, but the performance speed-up would be very welcome if we can (over time) get to parity with what we've got working so far and if we feel good about its maintainability moving forward. I'm keen to read up on winnow and learn more.

Have you found you've needed to make significant changes to the AST structs so far? I've seen other projects successfully experiment with alternate parsers (behind a feature flag) and then be able to A/B compare them, and if they find it's the right call, flip over the default when ready.

@39555
Contributor Author

39555 commented Dec 1, 2024

There are no AST changes. The current AST maps perfectly to the POSIX yacc grammar description, so the second parser is a drop-in replacement except for the tokenizer call.

@39555
Contributor Author

39555 commented Dec 1, 2024

I found it easy to resolve issues with this new parser, such as the one we had with heredocs before. Each part of the grammar is completely separated from the rest in its own function, similar to peg's rule macro, except that it operates directly on the stream, so there is no more global state machine like the one the tokenizer has.
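
To illustrate the "one function per grammar rule, operating directly on the stream" structure described above, here is a rough std-only sketch. It does not use winnow and is not the PR's code; the rule names `space`, `word`, `simple_command`, and `pipeline` echo names visible in the trace, but their bodies are simplified assumptions (no quoting, redirections, or operators other than `|`).

```rust
/// Skips spaces and tabs; always succeeds.
fn space(input: &mut &str) {
    let s = *input;
    *input = s.trim_start_matches(|c: char| c == ' ' || c == '\t');
}

/// One or more characters up to whitespace, `;` or `|`, returned as a
/// slice of the original input (no allocation).
fn word<'a>(input: &mut &'a str) -> Option<&'a str> {
    let s: &'a str = *input;
    let end = s
        .find(|c: char| c.is_whitespace() || c == ';' || c == '|')
        .unwrap_or(s.len());
    if end == 0 {
        return None; // nothing consumed, so the caller can try another rule
    }
    let (head, rest) = s.split_at(end);
    *input = rest;
    Some(head)
}

/// One or more words separated by blanks.
fn simple_command<'a>(input: &mut &'a str) -> Option<Vec<&'a str>> {
    let start = *input; // checkpoint for backtracking
    let mut words = Vec::new();
    loop {
        space(input);
        match word(input) {
            Some(w) => words.push(w),
            None => break,
        }
    }
    if words.is_empty() {
        *input = start; // restore the stream on failure
        None
    } else {
        Some(words)
    }
}

/// One or more simple commands separated by `|`.
fn pipeline<'a>(input: &mut &'a str) -> Option<Vec<Vec<&'a str>>> {
    let start = *input;
    let mut commands = vec![simple_command(input)?];
    loop {
        space(input);
        let s: &'a str = *input;
        match s.strip_prefix('|') {
            Some(rest) => *input = rest,
            None => break,
        }
        match simple_command(input) {
            Some(cmd) => commands.push(cmd),
            None => {
                *input = start; // a dangling `|` fails the whole rule
                return None;
            }
        }
    }
    Some(commands)
}
```

The only state is the `&mut &str` cursor passed from rule to rule, so tokenizing and parsing happen in the same pass and each rule can be tested on its own.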

@reubeno
Owner

reubeno commented Dec 1, 2024

Moving to a true streaming parser will help other issues we've had (e.g., shopt actually changing tokenizing/parsing semantics for subsequent commands by enabling/disabling extglob or similar).
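
As a toy illustration of why this matters, the sketch below parses one command at a time and lets execution update the parse options before the next command is read. Everything here is hypothetical (`ParseOptions`, `Word`, `parse_command`, `execute`, `run_script`), and the "parser" only splits on whitespace; it is meant to show the ordering, not real extglob handling.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ParseOptions {
    extglob: bool,
}

#[derive(Debug, PartialEq)]
enum Word<'a> {
    Literal(&'a str),
    ExtGlob(&'a str),
}

/// Toy per-command parse: a word such as `!(a)` is classified as an
/// extended glob only while `extglob` is enabled.
fn parse_command<'a>(line: &'a str, opts: ParseOptions) -> Vec<Word<'a>> {
    const PREFIXES: [&str; 5] = ["?(", "*(", "+(", "@(", "!("];
    line.split_whitespace()
        .map(|w| {
            let is_extglob = opts.extglob
                && w.ends_with(')')
                && PREFIXES.iter().any(|p| w.starts_with(*p));
            if is_extglob {
                Word::ExtGlob(w)
            } else {
                Word::Literal(w)
            }
        })
        .collect()
}

/// Stand-in for the interpreter: only understands `shopt -s/-u extglob`.
fn execute(cmd: &[Word<'_>], opts: &mut ParseOptions) {
    if let [Word::Literal("shopt"), Word::Literal(flag), Word::Literal("extglob")] = cmd {
        match *flag {
            "-s" => opts.extglob = true,
            "-u" => opts.extglob = false,
            _ => {}
        }
    }
}

/// Parse and execute line by line, so option changes affect later lines only.
fn run_script(script: &str) -> Vec<Vec<Word<'_>>> {
    let mut opts = ParseOptions::default();
    let mut parsed = Vec::new();
    for line in script.lines() {
        let cmd = parse_command(line, opts);
        execute(&cmd, &mut opts);
        parsed.push(cmd);
    }
    parsed
}
```

A parser that consumed the whole script before executing anything would have to pick a single `extglob` setting for every line.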

@reubeno
Owner

reubeno commented Dec 10, 2024

I haven't looked through the full details of this draft but am supportive of us moving forward with adding this as an experimental secondary parser. That will allow us to incrementally mature it and bring it to full parity, get testing set up in a way that we're happy with, etc. -- and then we could figure out the criteria for switching the default parser.

@39555 -- what do you think about preparing a separate small (first step) PR to enable a command-line option to select an alternate parser, and then just add a stub version of the new parser, and a basic test for it -- without adding all the real implementation? That will allow us to make sure we find all the places that code is directly calling into the parser and tokenizer today, and clean that up so it will be easier to switch it.
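
One possible shape for that first step, sketched with std only. All names here are assumptions made for illustration (`ParserImpl`, `Program`, `ParseError`, `parse_with_peg`, `parse_with_winnow`, and the "peg"/"winnow" option values); none of them are taken from the existing codebase.

```rust
use std::str::FromStr;

/// Hypothetical selector that a command-line option could map onto.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParserImpl {
    Peg,
    Winnow,
}

impl FromStr for ParserImpl {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "peg" => Ok(Self::Peg),
            "winnow" => Ok(Self::Winnow),
            other => Err(format!("unknown parser: {other}")),
        }
    }
}

/// Placeholder for the shared AST root; both parsers would return the same type.
#[derive(Debug, PartialEq)]
struct Program {
    source: String,
}

#[derive(Debug, PartialEq)]
enum ParseError {
    NotImplemented,
}

/// Stand-in for the existing parser entry point.
fn parse_with_peg(src: &str) -> Result<Program, ParseError> {
    Ok(Program { source: src.to_owned() })
}

/// Stub for the new parser: wired in, but not implemented yet.
fn parse_with_winnow(_src: &str) -> Result<Program, ParseError> {
    Err(ParseError::NotImplemented)
}

/// Single entry point, so callers never reach a specific parser directly.
fn parse(src: &str, which: ParserImpl) -> Result<Program, ParseError> {
    match which {
        ParserImpl::Peg => parse_with_peg(src),
        ParserImpl::Winnow => parse_with_winnow(src),
    }
}
```

Routing every caller through one `parse` function is what would surface the places that currently call the parser and tokenizer directly.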
