Tree-sitter rolling fixes: 1.121 edition #1085

savetheclocktower · 2024-08-23T22:22:13Z

This one starts off big — a sizable diff that probably overstates the scope of the underlying refactor.

Indentation logic is complex. It's not as complex as it seems, but it's still pretty complex, and I haven't exactly helped the issue because I haven't found an intuitive way to explain it. The first commit in this PR represents the beginning of an effort to demystify indentation queries and make them easier for others to reason about.

The main initial benefit is that indentation logic is now encapsulated in its own IndentResolver class. Unlike FoldResolver — a different instance of which exists for each LanguageLayer — a WASMTreeSitterLanguageMode instance has a single instance of IndentResolver. That's because the logic of indentation hinting crosses LanguageLayers; we don't gain anything by subdividing the labor further.

But the encapsulation itself is useful. It lets us write new methods to reduce repetition without increasing the API surface of WASMTreeSitterLanguageMode.

There are two other ideas that I've added in the process:

- The @match capture is good, but a recent bug report suggested to me that we could benefit from a similar kind of capture that works in the first phase of indentation hinting. So I created @match.next; it's a different approach to phase-one hinting from what you get with @indent/@dedent/@dedent.next captures.

  “Phase one” of indentation hinting is where we determine what the “baseline” indentation is for line X before its own content is considered; we look at the content of line X-1 and decide whether line X should start with the same indentation level, one level more, or one level less. I'll illustrate its usefulness below (much easier when I'm not in an unordered list).

- For a long time I've wanted to make the indentation decision process more visible. It's not something anyone needs to care about… until they're writing a modern Tree-sitter grammar or trying to figure out why a grammar someone else wrote isn't making the right indentation decision. When I've found myself in this position, I've had no choice but to open dev tools and set some debugger breakpoints.

  Now I'm experimenting with a different approach: an obscure API that, once an indentation decision is made, fires an event with metadata that could help a user understand the logic behind an indentation decision. The IndentResolver::onDidSuggestIndent method accepts a callback and provides an object with a bunch of properties. It's still a bit intimidating to consume directly, but my eventual goal is to write a community package that can interpret the data (or else add such functionality to tree-sitter-tools).
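To make the two ideas above a little more concrete, here is a minimal sketch of a phase-one decision that also reports itself through an event. It is a toy stand-in, not the real IndentResolver: only `onDidSuggestIndent` and the `did-suggest-indent` event name come from this PR, while the class name, the `suggestBaseline` method, the hint names, and every field of the payload are assumptions made up for illustration.

```javascript
const { EventEmitter } = require('node:events');

// Toy stand-in for IndentResolver. Only onDidSuggestIndent and the
// 'did-suggest-indent' event name come from this PR; everything else
// (method names, hint names, payload fields) is hypothetical.
class ToyIndentResolver {
  constructor() {
    this.emitter = new EventEmitter();
  }

  // Registers a callback and returns a function that unregisters it.
  onDidSuggestIndent(callback) {
    this.emitter.on('did-suggest-indent', callback);
    return () => this.emitter.off('did-suggest-indent', callback);
  }

  // Phase one only: start from the comparison row's level and adjust it by
  // at most one level, based on a hint derived from that row's content.
  suggestBaseline({ currentRow, comparisonRow, comparisonRowLevel, hint }) {
    const deltas = { indent: 1, 'dedent.next': -1 };
    const delta = deltas[hint] ?? 0;
    const suggestedLevel = Math.max(0, comparisonRowLevel + delta);
    this.emitter.emit('did-suggest-indent', {
      currentRow, comparisonRow, comparisonRowLevel, hint, delta, suggestedLevel
    });
    return suggestedLevel;
  }
}

const resolver = new ToyIndentResolver();
const stopListening = resolver.onDidSuggestIndent((meta) => {
  console.log(`row ${meta.currentRow}: level ${meta.suggestedLevel} (hint: ${meta.hint})`);
});
// Pretend the comparison row opened a block, so the hint is 'indent'.
resolver.suggestBaseline({ currentRow: 5, comparisonRow: 4, comparisonRowLevel: 1, hint: 'indent' });
stopListening();
```

The real metadata object is richer than this and, per the changelog, still subject to change; the point is only that a listener sees the inputs and the outcome of each decision without anyone setting a breakpoint.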

About `@match.next`

The impetus for @match.next was this example:

function foo() {
  let event = initializeCustomEvent(0, 2, 3, 4, 5, 'event', null, null,
                                                         undefined);
  return event;
}

This is a contrived example, but the idea is that a user should be able to insert whatever sort of hanging indent they want when they're spreading a statement over multiple lines. Since we have Tree-sitter, we should be able to understand that line 4 shouldn't keep the deep indentation of line 3; it should match the indentation level of line 2!

I didn't think about this until a user reported an issue with hanging indents in C. I came up with a fix that's most of the way to what I want, but it uses a @match capture, so it operates in the second phase of indentation hinting: the phase that considers the content on the current line. For this reason, it doesn't let us correct the indentation until after the user starts typing.

With @match.next, that fix looks a bit different:

(
  [
    (expression_statement)
    (return_statement)
    (continue_statement)
    (break_statement)
    (throw_statement)
    (debugger_statement)
    (lexical_declaration)
  ] @match.next
  (#is? indent.matchesComparisonRow endPosition)
  (#set! indent.matchIndentOf startPosition)
)

I'm still playing around with this, and I'm only 96% sure it's a good idea. But this could be expressed as follows: “the presence of expression_statement on row X-1 suggests that row X should start with the same level of indentation as that of the row where expression_statement started.” (Also true for return_statement and all other node types within the square brackets.)

Hence in the example above:

1. The user is typing the end of line 3 and presses return.
2. Line 3 represents the end of an expression_statement, so it produces a @match.next capture.
3. Under certain circumstances, we could reject that capture; for instance, if the expression_statement didn't end on row 3. But the indent.matchesComparisonRow test passes, so we proceed.
4. suggestedIndentForBufferRow interprets the @match.next capture by using the indent.matchIndentOf predicate to figure out which line's indentation should be copied. In this case, it uses startPosition, i.e., the starting position of the expression_statement node.
5. The cursor starts out at one level of indentation because that's what we had on line 2.
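The steps above can be sketched in plain JavaScript. This is a simplified model under stated assumptions, not the code in this PR: `resolveMatchNext`, the shape of the `capture` object, and the `indentLevelForRow` helper are all hypothetical, rows are zero-based as in Tree-sitter (so "line 3" is row 2), and one indent level is assumed to be two spaces.

```javascript
// Hypothetical model of resolving a @match.next capture. `capture.node`
// mimics the startPosition/endPosition of a Tree-sitter node.
function resolveMatchNext(capture, comparisonRow, indentLevelForRow) {
  const { node, matchIndentOf = 'startPosition' } = capture;
  // Mirrors (#is? indent.matchesComparisonRow endPosition): reject the
  // capture unless the node ends on the comparison row.
  if (node.endPosition.row !== comparisonRow) return null;
  // Mirrors (#set! indent.matchIndentOf startPosition): copy the indent
  // level of the row on which the node starts.
  return indentLevelForRow(node[matchIndentOf].row);
}

const lines = [
  'function foo() {',
  "  let event = initializeCustomEvent(0, 2, 3, 4, 5, 'event', null, null,",
  '                                                         undefined);',
  '  return event;',
  '}'
];

// Assumption for this example: one level of indentation is two spaces.
const indentLevelForRow = (row) =>
  Math.floor(lines[row].match(/^ */)[0].length / 2);

// The statement starts on row 1 and ends on row 2.
const statement = {
  startPosition: { row: 1, column: 2 },
  endPosition: { row: 2, column: lines[2].length }
};

// The user presses return at the end of row 2, so row 2 is the comparison row.
const level = resolveMatchNext({ node: statement }, 2, indentLevelForRow);
console.log(level);
```

If the comparison row were row 1 instead (the statement has not ended yet), the function returns `null`, which stands in for "this capture does not apply; fall back to the other phase-one hints".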

I still sense that I'm not explaining this well enough. I'm not certain that these captures have the right names; one further advantage of the IndentResolver encapsulation is that it would make it a bit easier to introduce new aliases for these captures and predicates while preserving backward compatibility. (For instance, indent.matchIndentOf is now aliased to the terser indent.match, and indent.offsetIndent is aliased to indent.offset.)

Another reason to introduce this new encapsulation is to give us a place to hang new indentation-specific query tests. In order to implement what I described above, I needed a way to introduce the concepts of “current row” and “comparison row” to query tests, and those are indentation-specific ideas. (The current row is the row whose suggested indentation we're determining; the comparison row is the one we're using as reference; typically it's the row just above the current row, but we tend to skip over whitespace-only rows.)
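As a rough illustration of the "comparison row" idea, the sketch below walks upward from the current row and skips whitespace-only rows. The function name and the array-of-strings input are assumptions for the example; the real implementation works against a text buffer and may apply further rules.

```javascript
// Returns the nearest row above `currentRow` that contains something other
// than whitespace, or null if there is no such row.
function findComparisonRow(lines, currentRow) {
  for (let row = currentRow - 1; row >= 0; row--) {
    if (/\S/.test(lines[row])) return row;
  }
  return null;
}

const buffer = ['if (ready) {', '', '   ', 'start();'];
// Rows 1 and 2 are blank or whitespace-only, so row 3 is compared to row 0.
console.log(findComparisonRow(buffer, 3));
```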

Anyway, if any of this doesn't make sense, and you want to understand it better, please ask! In the process of answering your questions I hope I find ways to make this subject less intimidating.

In other news: a fix I made for tree-sitter-css months ago has been accepted, so I've also bumped our tree-sitter-css to the latest release.

Changelog

Updated web-tree-sitter to version 0.23.0.
[language-css] Updated tree-sitter-css to the latest version.
[language-gfm] Updated tree-sitter-markdown to the latest version.
[language-html] Updated tree-sitter-html and tree-sitter-embedded-template to their latest versions.
[language-javascript] Updated tree-sitter-javascript to the latest version.
[language-typescript] Updated tree-sitter-typescript to the latest version.
Added a new @match.next capture for advanced control of how indentation should change from one line to the next.
Added a new event that is emitted when Pulsar makes an indentation decision, with the ultimate goal of making indentation hinting easier to understand for grammar authors. The metadata on the event is preliminary and subject to change. (Maybe I'll mention this somewhere, but probably not in the changelog, since it's not part of a public API yet.)
Added new indentation-specific query predicates indent.matchesComparisonRow and indent.matchesCurrentRow for comparing arbitrary positions in a Tree-sitter node tree to the operative rows in an indentation suggestion query. Makes it possible to say things like “decrease the indent on line 10 if a statement ends on line 9.”
Renamed indentation directives indent.matchIndentOf and indent.offsetIndent to indent.match and indent.offset, respectively. The old names still work as aliases.

…and out of `WASMTreeSitterLanguageMode` for reasons of encapsulation.

Fixes issue with parsing of selectors in `:has`, `:is`, and other pseudoclasses.

savetheclocktower · 2024-08-23T22:23:59Z

I should add that the new indentation functionality doesn't have any tests yet, but neither does it change how existing stuff behaves, so I expect all tests to pass. If I don't manage to write tests for this stuff in the next three weeks, I might take some small steps to hide the new functionality, or at least mark it as obviously experimental and not something that should be relied upon.

(Update: we now have a spec for @match.next, so I consider this stuff to be stable enough to ship. At this point, any further changes that I make to how @match.next behaves can be done in a backwards-compatible way; so all that's left to do is document it, which I plan to do soon.)

…including more consistent firing of the `did-suggest-indent` event.

…and fix a bug with the indentation reparse budget.

…as booleans. Handle `NULL` in a similar way.

…and update our API usages.

Tree-sitter harmonized the API differences between `web-tree-sitter` and `node-tree-sitter` in version 0.22.0. This is the first time we’ve had to deal with that. Luckily, the changes were mostly palatable.

The biggest API difference is in `Query#captures`; two positional arguments for defining the extent of the query have been moved to keyword arguments (properties of an options object). We’ve updated our internal usages, but any community packages that relied on the old function signature would break if we didn’t do anything about it. So we’ve wrapped the `Query#captures` method in one of our own; it detects usages that expect the old signature and rearranges their arguments, issuing a deprecation warning in the process. Hopefully this generates enough noise that any such packages understand what’s going on and can update.

Other API changes are more obscure, which is good, because we can’t wrap them the way we wrapped `Query#captures`. They involve conversion of functions to getters (`node.hasError` instead of `node.hasError()`), and there’s no good way to make both usages work… short of wrapping nodes in `Proxy` objects, and that’s not on the table.

Since lots has changed in `tree-sitter` since we last upgraded `web-tree-sitter`, I updated our documentation about building a custom version of `web-tree-sitter`.
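While the editor cannot make both the method form and the getter form work on the node itself, a consuming package can read such a flag defensively. The helper below is a minimal sketch of that idea; `readNodeFlag` is a hypothetical name, and the two plain objects only imitate the old and new node shapes, using `hasError` as the example flag.

```javascript
// Reads a node flag that may be a method (older API) or a getter/property
// (newer API). Hypothetical helper, not part of web-tree-sitter or Pulsar.
function readNodeFlag(node, name) {
  const value = node[name];
  return typeof value === 'function' ? value.call(node) : value;
}

// Imitations of the two shapes, for demonstration only.
const oldStyleNode = { hasError() { return true; } };
const newStyleNode = { get hasError() { return true; } };

console.log(readNodeFlag(oldStyleNode, 'hasError'));
console.log(readNodeFlag(newStyleNode, 'hasError'));
```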

savetheclocktower · 2024-09-07T23:57:37Z

Just bumped our web-tree-sitter to 0.23.0. This is a big bump and I'd been putting it off because of disruptive changes to the Tree-sitter ecosystem; but the short version is that everything's cool.

Immediately after I bumped web-tree-sitter locally, I reloaded my Pulsar window and found that everything was heartbreakingly slow. Profiling had me chase down a couple of red herrings before I discovered that the problem had to do with API changes in web-tree-sitter. Those changes were made a while back for a good reason: the desire to eliminate differences between the web-tree-sitter API and the node-tree-sitter API.

The slowness had to do with the fact that every query against the tree was accidentally being executed against the entire tree rather than against the specified range. (Even this was fast on the web-tree-sitter side; the slowness happened in our post-processing phase, the one in which we iterate through all the captures.) Once I updated our usages of Query#captures to use its new signature, the problem went away. (One of the red herrings was still a good idea: a hot code path was allocating lots of unnecessary arrays, so I rewrote it to use a different approach.)

We can monkeypatch Query#captures to work with both the old function signature and the new one, so that's exactly what I did. The purpose is not so much to save us from rewriting our usages (since I've done that; it's in this PR) but to save the editor from throwing an error if a community package expects the old function signature. (On that note, I've also updated tree-sitter-tools to be more defensive and work equally well with the old API and the new API.) We'll issue a deprecation warning if the old function signature is detected.
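For anyone curious what such a wrapper might look like, here is a minimal sketch under stated assumptions. It is not the code in this PR: `FakeQuery` stands in for web-tree-sitter's `Query` so the example can run on its own, `installCapturesShim` and `isPoint` are hypothetical names, and the old signature is assumed to be `captures(node, startPosition, endPosition)` with the new one taking an options object carrying `startPosition` and `endPosition`.

```javascript
// Stand-in for a Query class with the new-style signature:
// captures(node, { startPosition, endPosition }). It just echoes its input.
class FakeQuery {
  captures(node, options = {}) {
    return { node, options };
  }
}

// A Tree-sitter point is an object with numeric `row` and `column`.
const isPoint = (value) =>
  value != null && typeof value.row === 'number' && typeof value.column === 'number';

// Replaces QueryClass.prototype.captures with a wrapper that accepts the old
// positional arguments, warns, and forwards them in the new shape.
function installCapturesShim(QueryClass, warn = console.warn) {
  const modernCaptures = QueryClass.prototype.captures;
  QueryClass.prototype.captures = function captures(node, ...args) {
    const [first, second] = args;
    if (isPoint(first) || isPoint(second)) {
      warn('Query#captures(node, start, end) is deprecated; pass an options object instead.');
      const options = {};
      if (isPoint(first)) options.startPosition = first;
      if (isPoint(second)) options.endPosition = second;
      return modernCaptures.call(this, node, options);
    }
    // New-style call (or no range at all): pass through untouched.
    return modernCaptures.apply(this, [node, ...args]);
  };
}

installCapturesShim(FakeQuery);
const query = new FakeQuery();
const start = { row: 0, column: 0 };
const end = { row: 10, column: 0 };
// Old-style call: triggers the warning and is rewritten to the new shape.
console.log(query.captures('rootNode', start, end));
```

Detecting the old signature by the shape of the second argument works because a point has `row` and `column` while an options object does not, so the range is never silently dropped (which is what made every query run against the whole tree).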

Everything went well locally, including specs, so I'll watch the CI carefully just in case there's something really obscure that regressed.

…only injections with a shallower depth.

…and modify query files in response to all non–backwards-compatible changes.

…after `tree-sitter-javascript` bump.

…to conform to new `tree-sitter-(java|type)script` node structure for template strings.

…at new repo location.

savetheclocktower · 2024-09-10T18:18:10Z

Enough last-minute changes; time to take this out of draft so that I stop clipping at the edges like it's a bonsai tree.

…for `tree-sitter-html` and `tree-sitter-embedded-template`. Fixed an inscrutable out-of-memory error I was getting in an EJS file.

…related to updating open documents’ syntax highlighting as grammars are processed.

savetheclocktower · 2024-09-14T22:32:00Z

Just kidding! I ran into a strange WASM memory error when working on an EJS document, and I couldn't figure out why it was happening, so eventually I decided “might as well update the relevant Tree-sitter parsers” and that solved it.

Along the way I discovered another subtle bug, so that's also been dealt with.

confused-Techie

Without diving too deep into the code, with all tests passing and the history of these bumps, I'm confident in merging this one.

But love to see lots of little changes in seemingly every grammar, it's quite impressive tbh

DeeDeeG · 2024-09-16T23:25:25Z