Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The plan for Farkle 7 #24

Open
teo-tsirpanis opened this issue Mar 30, 2022 · 0 comments
Open

The plan for Farkle 7 #24

teo-tsirpanis opened this issue Mar 30, 2022 · 0 comments

Comments

@teo-tsirpanis
Copy link
Owner

teo-tsirpanis commented Mar 30, 2022

The plan for Farkle 7

Farkle's goal is to provide the best LALR parsing experience for .NET. To achieve this, version 7, the next major version of Farkle is going to be a significant reimagining of the library's internals, aiming for improved performance, extensibility and compatibility going forward. I've been thinking of it for more than a year and it's finally time to put my thoughts into paper. Here's an outline of what will come (links to dedicated issues will be provided in the future):

  • Farkle will be rewritten in C#: F# is a good language, however for a project like Farkle I have often found myself working not with it but against it. There are some important features C# has and F# doesn't, forcing me to devise suboptimal workarounds. For some examples, I had to write a weaver because F# does not support covariant and contravariant generics, and I had to use reflection in one place because in F# cyclical dependencies among files are not supported. With the drawback of Farkle's codebase size getting increased, rewriting it to C# will bring benefits such as more control over the produced code, decoupling from the FSharp.Core library, better tooling support and improved compilation times. The main Farkle library will be the first to be rewritten and other components such as the precompiler and the CLI tool will eventually follow. The existing tests will stay in F#.

    But, providing an idiomatic F# API is still a goal. I'm not sure how it will be done; the options are either by pubishing a separate package with a name like Farkle.FSharp, by providing an F# source file inside the main package, or by providing the F# API in the library binary itself. In the latter case Farkle would still depend on FSharp.Core but a large portion of it would be able to be trimmed away.

  • Improve the grammar format: Owing to Farkle's origins as an F# engine for the GOLD Parser project, it can since its first versions read and parse grammars from GOLD's Enhanced Grammar Table (EGT) files. After gaining the ability to build grammar tables on its own and to support the precompiler feature, Farkle needed to also write grammars to files. For this reason the EGTneo format was born which shares some semantics with EGT files but is more compact and easy to read from code than EGT because it better fits with Farkle's grammar domain model.

    This however proved to be EGTneo's weak point. Because it so tightly matches Farkle's grammar domain model and because it was designed very hastily without any form of extensibility, it makes adding new grammar features almost impossible. Add to that some unfortunate design choices that limit performance (such as the use of ImmutableDictionary<TKey,TValue> which at that time I did not know it was implemented with a tree). That's why Farkle 7 will come with a new grammar format that is carefully designed with extensibility in mind and will enable significant innovations in the grammars' features. These are some of the new features in my mind that will come with Farkle 7:

    • Hidden Terminals: A terminal can be marked as hidden, which means that it won't be suggested in syntax errors. A use case for it is in Farkle's string regex grammar, which defines a syntax for Unicode categories which are not actually supported and displays an error if they are encountered.

    • Noisy Terminals: A terminal can be marked as noisy, which means that it will be ignored if encountered in an unexpected place, instead of raising a syntax error. A use case for it is in Farkle's newLine, which if used in a grammar it will prohibit new lines from appearing in any place not explicitly marked. By allowing newLine to be noisy, new lines in unexpected places will be allowed and ignored.

    • Support for partial grammars: The new grammar format will support encoding partial grammars, which are grammars that had a build error and cannot be used for parsing. This would allow increased integration with Farkle's templating subsystem and improved and more complete HTML error reports.

    The new grammar format will be properly documented and versioned, making it easy to write grammar readers in other languages. Farkle 7 will continue supporting reading GOLD Parser's EGT files by converting them to the new grammar format and reading that under the hood. The Farkle 6 EGTneo format will not be supported.

  • Improve the parser APIs: There are many changes planned in this area, but let's talk about the most fundamental. Currently the architecture of Farkle's parser is comprised of these three components from bottom to top: the CharStream API, the tokenizer and the LALR parser. When you tell Farkle to parse a piece of text, it invokes the parser, which invokes the tokenizer many times to get tokens, which uses the CharStream to get the input characters, potentially performing I/O to populate its internal buffer. This is called a pull-based model because characters are pulled by the tokenizer when needed.

    Farkle 7 will switch to a push-based model, where I/O happens outside the parser call chain. Instead of the parser reading characters by itself, we will give the characters to it, and if it runs out of characters, the parser will stop and wait for us to give it more. If input ends, we will tell the parser so that next time it will not ask for more, but appropriately pass EOF to the LALR state machine, and return a result. Parsing contiguous characters in memory will be like a single big push. The possibilities this new design will bring are enormous. In Farkle 6 if you wanted to control the input to the parser you had to inherit TextReader which is cumbersome, and you couldn't pause a parser once it started. One of these possibilities is asynchronous parsing, which becomes trivial with this new architecture. These changes will be mostly transparent to the average user of Farkle, but those who use advanced features like custom tokenizers will have to adjust to the push-based API.

    Besides that, the new parser architecture will be more flexible both in its implementation (support for custom tokenizers will be more streamlined) and its support for inputs (Farkle 6 supports parsing from ReadOnlyMemory<char> while Farkle 7 will support ReadOnlySpan<char> as well) and as for performance, parsing contiguous characters in memory will be amortized allocation-free.1

  • Improve the builder APIs: The builder will receive less radical changes than the parser. Its object model of small and (almost) immutable grammars that get composed into bigger ones has worked well, and the main essence of it will carry on. These are the areas that will be improved:

    • Distinguishing between global and local options: There are some extension methods to configure grammars, and most of them have effect only if applied to the grammar's starting symbol that will get built. We will call these global options and examples are adding comments and setting case sensitivity or whether to ignore whitespace. Other options can be set to any symbol of the grammar, like the symbol's name and operator precedence and associativity info and are called local options. Setting a global option to anywhere other than the grammar's starting symbol will be silently ignored, and issues have been opened around that. On Farkle 7 global configuration methods will return a special grammar type that can not be composed but only built and set other global configuration. This way such mistakes will result in compiler errors.

    • Improving string regexes: The language of string regexes will be redesigned, with the goal of making it as close to the standard regex syntax as possible, and removing GOLD Parser influences.

    • Compatibility levels: To lessen the impact of future breaking behavior changes in the builder, in Farkle 7 you can set a compatibility level to a builder object. If for example you write a grammar in Farkle 7.0, you can set its compatibility level to Farkle 7.0, to that if a future Farkle version gets released with an observable behavior change, you won't be immediately affected.

  • Improve the precompiler: The precompiler is Farkle's killer feature and sets it apart from other parser generators and combinator libraries, allowing it to compete with both. Its current design's main problems are that it requires reflection, does not allow the precompiled grammar to be trimmed, did not work with AOT (until Farkle 6.5.0), and relies on static constructors which during precompiling might execute more code than necessary. Farkle 7 will come with a new and slightly different way to mark grammars to precompile, that will solve all of the above problems.

  • Improve diagnostics: Farkle's existing diagnostic facilities leaves a lot to be desired. Farkle 7 will improve on this area by adding support for logging events other than errors in the builder and parser, giving every error and warning a unique code, and localizing its error messages (apart from English I will localize them to my native Greek; the community is free to contribute additional languages).

Plans beyond 7.0

The Farkle 7.0 release will be mostly a refactoring, with few exciting features. Once that gets done and we have a solid foundation under our feet, here are some things that could get done in the 7.x timeframe:

  • Better Unicode support: Farkle's tokenizer works by matching individual 16-bit characters. Farkle's tokenizer matches characters in text with the help of a DFA, which gets created at the builder from the terminals' regexes. These regexes match individual 16-bit characters, constrained to the Basic Multilingual Plane (BMP). You can match individual characters outside of it -the string will contain the surrogate characters and the DFA will just match them without problem- but you can't match a range of characters outside the BMP. We can make Farkle's regexes match Unicode code points, and at the time of constructing the DFA lower them to match UTF-16 characters. Thus a pattern like [🧄-🧅] which corresponds to [\U0001F9C4-\U0001F9C5] will get lowered to \uD83E[\uDDC4-\uDDC5].

    • Once we do that we can finally start talking about UTF-8 parsers that would directly parse the UTF-8 bytes without converting them to UTF-16.

    • And we can also support matching Unicode Categories and get rid of GOLD Parser's problematic predefined sets like All Letters that were brought over to Farkle.

  • Support for the IELR algorithm: To construct the parsing tables, Farkle uses the LALR algorithm like many other bottom-up parsing tools such as FsLexYacc. The problem with LALR is that it's not powerful enough and the grammar author often has to adjust it to avoid conflicts. IELR is a more powerful algorithm that can resolve more conflicts than LALR and would increase productivity. Besides that it would be a significant innovation in Farkle; other than GNU Bison there is no other known IELR implementation.

  • Code generation: The way Farkle's parser works today is by reading the grammar file, and interpreting its parsing tables. We could instead generate specialized code that will parse a specific grammar. Farkle 6 had this capability but only in the post-processor and was eventually removed in Farkle 6.5.0 due to stability concerns. It will be interesting to investigate after Farkle 7.0 to create an improved code generator for the entirety of the parsing process and without using the grammar file. This would bring Farkle's performance to a whole new level, while also taking advantage of technologies like ReadyToRun, Native AOT and Dynamic PGO. It will have to support both dynamic code generation with System.Reflection.Emit and static code generation at the precompiler.

Frequently Asked Questions

Sounds great. Is there an ETA?

Unfortunately not. There are a lot of features to be implemented, and I don't want to release even a preview version until it reaches almost (some features will be dropped) feature parity with the F# Farkle library.

What about compatibility?

Binary compatibility with older Farkle versions will not be guaranteed; you will need to at least recompile your applications to use the new version of Farkle. Regarding source compatibility, the expectation is that if you are only building a grammar and parsing it you would just have to do some renames and slightly adjust code for things like the precompiler or the string regexes. Other scenarios will need more extensive changes. It is not yet decided how smooth the update process will be but there will be a detailed guide, and open-source projects that use Farkle will be assisted with the upgrade.

Footnotes

  1. Only the parser itself will not allocate. The objects produced by transformers and fusers and any custom tokenizers if exist may still cause allocations, but that's something user code can control.

Repository owner locked and limited conversation to collaborators Jun 6, 2022
Repository owner unlocked this conversation Jun 6, 2022
@teo-tsirpanis teo-tsirpanis changed the title Farkle 7 The plan for Farkle 7 Oct 6, 2022
@teo-tsirpanis teo-tsirpanis pinned this issue Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant