The future of parsing TLA+ #16
Author of multiple "non-major" TLA+ parsers here... most notably PGo, but by now it's almost my hobby to write more of them as parser tooling stress tests. Thanks for the write-up, especially given that I did not make it to the aforementioned discussion. I actually didn't know the SANY XML exporter existed. That's interesting. Regardless of parser implementation standardization, I think that a common representation of how a spec is "understood" is valuable for experimenting and for qualifying what is or is not a correct parse. I should go learn more about this and incorporate it into my testing process. That is to say, I think any effort toward making parsing more consistent and better defined will also benefit TLA+ tooling that is neither SANY nor TLAPM. 👀
Commenting since there was a request, but I am not an expert on parsing, so this is mostly a beginner/user perspective. Personally, I think the second option is good. My main interest in this would be for the XML format to gain support for further transformations, as some third-party tooling might support SMXML (state machine XML) and this could be a great way to expand interest in the project. I don't think there's anything currently written in Rust, and I would be careful about adding new languages to the codebase, as that would add overhead for anyone who wants to read and understand it in the future.
As much as I do love Rust, I completely agree with this point. I would also vote for option 2, or for option 2.1 ("use SANY but find a better interchange format than XML"), in large part because I also agree that
Indeed, I think there's a bit more to this puzzle than parsing. SANY does at least four other important jobs:
Anyone building tooling for TLA+ will benefit from some or all of those, and anyone using TLA+ tooling will benefit from tools that agree on all four points. The one time I wrote some TLA+ tooling, I benefited immensely from points 2 and 3 in particular.
To respond to a few other specific points in @ahelwer's fantastic analysis:
I agree. We need some numbers to make a decision, though. JVM startup time is notoriously long, but it has also improved drastically in recent years. Newer tools like jlink and Graal may give us ways to avoid any remaining overhead. SANY's runtime should be very quick, even starting with a cold JIT, but I say "should be" because I have never seen it measured. Some ballpark numbers on a laptop would help us decide if there's any hope of getting a SANY run down to a few milliseconds. I'll try to make an afternoon project out of this over the winter holidays (if nobody beats me to it).
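For concreteness, measuring the cold-start cost wouldn't need anything fancy. A minimal sketch along these lines would give ballpark numbers; the jar and spec paths and the five-run count are placeholders, and tla2sany.SANY is the usual command-line entry point for the parser, though options may vary by release:

```java
// Rough cold-start timing harness, not a rigorous benchmark: launch a fresh JVM
// per run and report wall-clock time for SANY to parse and check one spec.
public class TimeSanyColdStart {
    public static void main(String[] args) throws Exception {
        String jar = "tla2tools.jar"; // assumption: path to the TLA+ tools jar
        String spec = "MySpec.tla";   // assumption: spec to parse and check
        for (int run = 1; run <= 5; run++) {
            long start = System.nanoTime();
            Process p = new ProcessBuilder("java", "-cp", jar, "tla2sany.SANY", spec)
                    .inheritIO()
                    .start();
            int exit = p.waitFor();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + elapsedMs + " ms (exit code " + exit + ")");
        }
    }
}
```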
It does. (Sorry to be a language pedant. 🤣 I don't love Java, but we should be maligning it for other reasons, not this one.) The feature has been added slowly, but all the major pieces (including pattern matching!) are available and stable as of Java 21. My real point is that we could write a very nice wrapper library in Java 21 (or Rust or OCaml or...) that invokes SANY, parses its output, and returns a pleasant in-memory representation that is fairly easy for tool writers to consume. The limitations of SANY's implementation language don't matter once you have a stable interchange format.
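To illustrate what such a wrapper could hand to tool writers (every type name here is invented for the example; this is not SANY's API or any proposed one), Java 21 lets us model the parse tree as a sealed hierarchy of records and consume it with an exhaustive switch:

```java
import java.math.BigInteger;
import java.util.List;

// Invented example types: a possible Java 21 in-memory shape for TLA+ expressions.
sealed interface TlaExpr permits OpApplication, NumberLiteral, StringLiteral, Identifier {}
record OpApplication(String operator, List<TlaExpr> args) implements TlaExpr {}
record NumberLiteral(BigInteger value) implements TlaExpr {}
record StringLiteral(String value) implements TlaExpr {}
record Identifier(String name) implements TlaExpr {}

class PrettyPrinter {
    // The compiler rejects this switch if a new TlaExpr variant is added but not
    // handled here, which is the exhaustiveness guarantee discussed above.
    static String print(TlaExpr e) {
        return switch (e) {
            case OpApplication(String op, List<TlaExpr> args) ->
                op + "(" + String.join(", ", args.stream().map(PrettyPrinter::print).toList()) + ")";
            case NumberLiteral(BigInteger v) -> v.toString();
            case StringLiteral(String s) -> "\"" + s + "\"";
            case Identifier(String n) -> n;
        };
    }
}
```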
Agreed. Further complicating the issue: there is no documentation of SANY's XML format (which is a fairly direct translation of SANY's internal data structures, which might not be exactly the right thing to spit out in the first place). If we do go down the switch-to-SANY route, a large and important part of the project will simply be documenting the interchange format and keeping it stable between releases. I don't know how to estimate that work, but I expect it is substantial.
I would love it if we could upgrade to a newer version of Java and take advantage of some new language features, especially variants & exhaustive pattern-matching. I recall in tlaplus/tlaplus#951 @lemmy mentioned we should only really upgrade to LTS releases; Java 21, which has variants, is LTS, and the next LTS after that (which includes the Java 22 FFM API necessary for the linked PR) is Java 25, releasing in September 2025.
According to our TLA+ execution statistics, the majority of our users are using Java 11 or Java 17.
tlaplus=# SELECT substring(jvm_version from '^[0-9]+') AS main_version, COUNT(*) AS entry_count
          FROM execution_description
          WHERE execution_timestamp >= NOW() - INTERVAL '12 months'
          GROUP BY main_version ORDER BY main_version;
main_version | entry_count
--------------+-------------
1 | 627
11 | 226.397
13 | 4.303
14 | 51.214
15 | 4
16 | 597
17 | 211.988
18 | 49.619
19 | 3472
20 | 631
21 | 24.226
22 | 37.580
23 | 57.839
FWIW: The lists of known TLAPS parser and semantic checker bugs can be found at https://github.com/tlaplus/tlapm/issues?q=is%3Aissue+is%3Aopen+label%3A%22syntax+parser%22 and https://github.com/tlaplus/tlapm/issues?q=is%3Aissue+is%3Aopen+label%3A%22semantic+checker%22. Currently, there are 25 known bugs in SANY compared to 23 known bugs in TLAPS' parser and semantic checker (16 in the parser and 7 in the semantic checker). Note that the TLAPS parser and semantic checker implement only a subset of the features available in SANY. Moreover, tlaplus/tlapm#186 discusses a severe performance bottleneck in TLAPS' level checking.
From my experience implementing the LSP server for TLAPM, the important differentiator (apart from correctness) for deciding on the parser is its speed, including incremental parsing and integration with it. The LSP updates the document with almost every key press (modulo a short debounce period); thus, the document is parsed several times per second. When starting to work on the LSP, I considered using the SANY parser, but the integration looked a bit heavy, e.g., Java startup times and serialization/deserialization overhead. In my case, it looked like I would need to use the TLAPM functions anyway, like extracting proof obligations.
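For readers less familiar with that pattern, the debounce logic itself is tiny. Here is an illustrative sketch in Java (TLAPM's actual LSP is written in OCaml, and the 200 ms delay is a made-up figure): each document change cancels the previously scheduled reparse and schedules a new one, so the parser runs only once the user pauses typing.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative debounce for an LSP-style "reparse on change" loop.
class ReparseDebouncer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable reparse;
    private ScheduledFuture<?> pending;

    ReparseDebouncer(Runnable reparse) { this.reparse = reparse; }

    synchronized void onDocumentChanged() {
        if (pending != null) {
            pending.cancel(false);  // drop the previously queued parse
        }
        pending = scheduler.schedule(reparse, 200, TimeUnit.MILLISECONDS);
    }
}
```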
Regarding SANY's performance, here are a few ballpark numbers taken on my laptop. They show time to parse and check the TLA+ benchmark models. This does not include exporting the parsed information anywhere, just enough work to spit out "yes" or "no". The
So:
These numbers could probably be drastically improved by implementing a SANY daemon that starts once and keeps the JVM running to parse many specs. I have no easy way to measure what that might look like. Another option is to use Graal native images. I had to make a few small patches to SANY to get this to work, but here are the same numbers for a lightly-modified ahead-of-time-compiled version of SANY:
These numbers look very good to me: <10ms startup overhead, <50ms per parse. We could absolutely build tooling around something that fast.
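Going back to the daemon idea above, here is a minimal sketch of the shape such a thing could take: one long-lived JVM reads spec paths from stdin and parses each in-process, so JVM startup and JIT warm-up are paid only once. The parser hook is a stand-in; wiring it to SANY's in-process frontend is the real work, and that API is not shown here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

// Sketch of a "SANY daemon": read spec paths from stdin, parse each in-process,
// and report per-parse wall-clock time without paying JVM startup each time.
public class ParseDaemon {
    public static void main(String[] args) throws Exception {
        Predicate<Path> parser = ParseDaemon::fakeParse; // assumption: stand-in for a real SANY call
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            long start = System.nanoTime();
            boolean ok = parser.test(Path.of(line.trim()));
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println((ok ? "ok  " : "fail") + " " + line + " (" + ms + " ms)");
        }
    }

    // Placeholder so the sketch compiles; a real daemon would invoke SANY here.
    private static boolean fakeParse(Path spec) {
        return Files.exists(spec);
    }
}
```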
@Calvin-L Can you please also collect numbers that include the XML serialization/export? |
For the regular JVM build of SANY:
...and for a Graal native image:
(And for posterity, these numbers and my earlier ones were collected on commit 40b92d0e).
The spec, including proofs I'm working on now, takes
It parses 3278 lines from 16 files. Those include 6 files with 1779 lines written by me; the others are from the standard modules and the tlapm standard library.
Opening this discussion after the December community meeting. This regards future efforts to unify the various TLA+ parsers, or not. I try to give a fair appraisal of the options, but I do have my preferences, since I will probably be the one doing this work!
The Problem
Currently there are two main parsers used by TLA+ language tooling: SANY, written in Java and used by TLC and most other tools, and TLAPM's parser, written in OCaml.
We are faced with three possible choices:
1. Keep maintaining both parsers independently.
2. Switch TLAPM to consume SANY's parse output.
3. Write an entirely new parser.
Comparison of SANY and TLAPM parsers
Of the two, SANY is more fully-featured. While it does contain bugs, it accepts all important TLA+ syntax and implements proper semantic analysis and level-checking. In contrast, TLAPM's parser does not accept some TLA+ syntax and does not perform level-checking; also, it is the author's understanding that instead of implementing semantic analysis as a graph of references overlaid on the parse tree, TLAPM simply rewrites/inlines all references to form larger expressions. While this rewriting approach fits TLAPM's purpose of rewriting TLA+ syntax into proof obligations in other languages, it has drawbacks; for example, it does not support RECURSIVE definitions, since such an expansion would have no end.

It isn't all downside for the TLAPM parser though, and this is really due to OCaml's language features. OCaml has the ability to express variants, where a value might be one of several possible types with their own named & typed fields; it is then possible to exhaustively pattern-match over the type in a match expression (many programmers will have been introduced to this concept through Rust). This comes in especially useful in parsers, where a given syntax tree node is one of several possible kinds and we want to handle them all differently. Java, in contrast, has not traditionally had this ability, and so consuming the parse tree usually requires non-exhaustively switching on an enum and then casting a value to a subclass. This point should not be discounted as mere programming-language preference: in practical terms, consuming and manipulating the parse tree emitted by SANY is very difficult and requires a lot of implicit & undocumented knowledge about the order in which elements are supposed to occur (or not) in variable-length arrays. Usually the only way to tell is to run a debugger and see what the parser spits out.
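As a purely illustrative sketch of that pain point (every class and constant below is invented; this is not SANY's real API), here is the general shape of kind-switch-and-cast code, where nothing forces a consumer to handle every node kind:

```java
// Invented stand-ins for a parse tree exposed through an integer "kind" field.
class KindConstants {
    static final int OP_APPLICATION = 1;
    static final int NUMERAL = 2;
}

class SyntaxNode {
    final int kind;
    SyntaxNode(int kind) { this.kind = kind; }
    int getKind() { return kind; }
}

class OpApplicationNode extends SyntaxNode {
    final String operatorName;
    OpApplicationNode(String operatorName) { super(KindConstants.OP_APPLICATION); this.operatorName = operatorName; }
}

class NumeralNode extends SyntaxNode {
    final long value;
    NumeralNode(long value) { super(KindConstants.NUMERAL); this.value = value; }
}

class Consumer {
    static String describe(SyntaxNode node) {
        switch (node.getKind()) {
            case KindConstants.OP_APPLICATION:
                return "application of " + ((OpApplicationNode) node).operatorName; // cast unchecked by the compiler
            case KindConstants.NUMERAL:
                return "numeral " + ((NumeralNode) node).value;
            default:
                // A newly added kind falls through here silently; nothing forces
                // this switch to be updated, unlike an exhaustive match over variants.
                return "unhandled kind " + node.getKind();
        }
    }
}
```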
Benefits & drawbacks of switching TLAPM to use SANY
The primary benefit of making this switch would be reduced development burden. While parsers are a rare breed of software project where the requirements are so well understood that they are actually possible to complete, there is a very large amount of work remaining to get both parsers to that point. This work is made more difficult by the necessity for multiple developers to understand the parsers in great depth, so as to review each other's work. While this can be worked around by developing very large test suites - to reduce reviewer trepidation when approving changes in unfamiliar parts of the codebase - it remains the case that developers don't like to sign off on changes they don't understand, and so PRs tend to languish for a month or more. It is the author's experience that iteration time is the single greatest factor affecting overall development time, and we can reasonably expect a rate of 1-3 changes/month to result in development time of five years or more for a substantial project (for instance, a project needing on the order of a hundred nontrivial changes would take roughly three to eight years at that rate).
It is the author's opinion that the primary difficulty in transitioning TLAPM to use SANY's parse output is that no developer currently active on the project has full, living knowledge of how both parsers represent the semantic parse tree, and this project requires us to write a translator from one semantic parse tree to another while preserving a large number of unknown invariants. Thus this project cannot be attacked directly; instead, various plausibly valuable sub-projects must be worked on to cultivate a theory of how both these codebases work in some developer's head. Thankfully, one such sub-project presents itself: development of a large semantic test corpus, similar to the existing syntactic test corpus. This corpus has obvious value on its own and will also be useful in testing a SANY-to-TLAPM parse tree translator.
One illustrative example of this theory shortfall, and a clear unknown, is whether TLAPM's semantic parse tree is conceptually equivalent to SANY's. As mentioned above, TLAPM takes a rewrite-based approach to semantic resolution. Does this present an obstacle to translating SANY's graph-of-references overlay approach? Does this mean there will need to be additional development on the TLAPM parser before the translation can even work? I don't know.
One final drawback to switching TLAPM to use SANY is the overall complexity of the proposed pipeline. While TLAPM is no stranger to calling out to other programs and parsing their output - this is the approach it takes with all its backend theorem provers - replacing an in-codebase, in-language parser with a "run SANY, parse XML, translate data structures" pipeline does seem to add fragility, as well as an additional dependency on the Java runtime (although users of TLAPM are almost certainly already users of the Java-based TLC). It also adds time to the parse loop. While it is not possible to appeal to objective criteria here, it is the author's opinion that programmers prize snappy language tooling, and waiting on the order of an entire second to parse a few hundred lines of TLA+ should not be viewed as acceptable. This is especially true in the new(ish) domain of language servers, where live updates are expected as users type in their editor. As a somewhat niche, research-adjacent language we can get away with some long processing times, but I do think we should aim higher here.
Writing a new parser?
I don't put this forward as a very serious proposal at this point. Perhaps in half a decade we will have reached the real limit of what we can do with the existing parser(s) and will want to look into this. I took a brief stab last year at writing a new parser in Rust, but I was ultimately unhappy with the idea of just writing another recursive descent parser and it fizzled out. I'm currently working through a university course on formal languages to hopefully characterize exactly what power/variant of automaton is needed to parse TLA+ and what efficient algorithms exist for doing so. Who knows, maybe a TLA+ parser written in TLA+ will come into existence, if only as part of the language standard!
Any new parser would have to be motivated by a substantial new capability that the existing parsers cannot be modified to possess. Something like structured editing and incremental parsing, with error recovery enhanced by running an algorithm to diff edits against prior successfully-parsed trees. Complicated stuff like that. Although really, I think the killer feature would have to be a finite model-checker rewrite to use compiled bytecode for a huge boost in throughput, which might be better to do in a language other than Java. All speculative at this point.
Conclusion
This was long, but also our discussion in the community meeting was long. Please leave any comments or feedback below. Hopefully we can all read this thread and so shorten our discussion at the next community meeting, or establish a parser working group.