Parsing can be expensive in general. Microgrammars add certain further considerations.
A microgrammar notionally attempts a match at every character of the input, sliding forward one character each time the match fails. A match attempt may fail at any point, triggering backtracking to try alternatives. Thus the worst case, a matcher that can match unpredictably combined with heavy backtracking, can be expensive.
On the positive side, microgrammars do not need to build a full AST, so they cope with large files better than traditional grammars. There is also more leeway in authoring them than with traditional grammars, because they are not building a definitive AST: often you can express the same thing in several ways, each with different performance characteristics.
If you loop over input using a generator rather than `findMatches`, you are kinder to the event loop, because CPU-intensive work is broken up, and you may be able to stop once you find what you're looking for.
For example, consider these two alternatives:
```typescript
const matches = microgrammar.findMatches(content);
```

```typescript
const matches = microgrammar.matchIterator(content);
for (const m of matches) {
    // Do work
}
```
The first finds all matches in one operation. The second consumes the input one match at a time; if you break out of the loop, processing ceases.
If you can begin a grammar with a literal string rather than a regex, do so. This enables prefix scanning: an important optimization in which the parser can discard input until the literal is found, without even attempting a match.
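For instance, here is a minimal sketch in the definition style used later in this section; the grammar, its term names, and its regex are hypothetical:

```typescript
import { microgrammar } from "@atomist/microgrammar";

// Because the grammar opens with the fixed string "import", the parser can
// scan ahead for that literal and discard everything before it without
// attempting a match. Opening with a regex instead would force a match
// attempt at every position.
const importGrammar = microgrammar<{ module: string }>({
    _keyword: "import",
    module: /[a-zA-Z_][a-zA-Z0-9_.]*/,
});
```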
Often you want to fail a match based on semantics rather than syntax: a structurally valid match may contain illegal values.
For example, the following grammar matches a method that has a `@ChangeControlled` annotation. Because this annotation could occur at any point among the other annotations, it's impossible to express that requirement in the grammar itself. Instead, we check promptly that such an annotation was present and veto the match if it wasn't.
```typescript
microgrammar<ChangeControlledMethod>({
    annotations: new Rep1(AnyAnnotation),
    // Veto the match early if no @ChangeControlled annotation was found
    _check(ctx: any) {
        const found = ctx.annotations.filter(a => a.name === "ChangeControlled");
        if (found.length === 0) {
            return false;
        }
        ctx.changeControlledAnnotation = found;
        return true;
    },
    _visibilityModifier: "public",
    type: JAVA_IDENTIFIER,
    name: JAVA_IDENTIFIER,
    parameterContent: JavaParenthesizedExpression,
    body: JavaBlock,
});
```
The `_check` function vetoes methods that lack the required annotation before the remainder of the method is parsed, minimizing backtracking.
If you require at least one match of a repeated element, specify that cardinality, as the grammar above does with `Rep1`. This may enable prefix scanning and reduces the possible match space.
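As a sketch, assuming `Rep` is the zero-or-more counterpart of the `Rep1` used above, both importable from the package root, and reusing the `AnyAnnotation` matcher from the example:

```typescript
import { Rep, Rep1 } from "@atomist/microgrammar";

// Zero or more: the grammar must be tried even where no annotation occurs,
// and the first fixed content of the match is uncertain.
const annotations = new Rep(AnyAnnotation);

// One or more: an annotation must be present, so the match fails fast and
// the parser may be able to scan for the repeated element's prefix.
const atLeastOne = new Rep1(AnyAnnotation);
```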
Using `Alt` is an invitation to backtracking. It's often necessary, but consider where the alternatives sit in your grammars.
It's easier for the parser to optimize when it understands the alternatives it is matching than when they are hidden inside a regular expression. Furthermore, it's against the spirit of microgrammars to stuff too much into regular expressions: use them to match tokens such as identifiers.
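A sketch of the distinction, assuming `Alt` is importable from the package root and accepts literal alternatives:

```typescript
import { Alt } from "@atomist/microgrammar";

// Token-level regex: the kind of job regular expressions should do here.
const identifier = /[a-zA-Z_$][a-zA-Z0-9_$]*/;

// Alternatives expressed as structure the parser can reason about,
// rather than buried in a regex such as /public|private|protected/.
const visibility = new Alt("public", "private", "protected");
```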
`MatchingMachine` allows the specification of an observer that works alongside the grammar. This can be very useful, but if the matcher and the observer do not share a start literal, prefix scanning is impossible, which can dramatically increase the cost of parsing large files.
If you have a grammar that is not amenable to prefix scanning and is prone to heavy backtracking, you may want to reduce the size of the input with a preprocessing step: for example, call `canonicalize` on Java source code before parsing it. Note that this approach won't work if you want to update matches, as it loses positional information.
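A sketch of the preprocessing step; the import path for `canonicalize` is an assumption, and `javaSource` and `grammar` stand in for your own input and grammar:

```typescript
// Assumed import path; check the library's Java support for the actual one.
import { canonicalize } from "@atomist/microgrammar/lib/java/JavaBody";

// Reduce the size of the Java source before matching. Positional
// information is lost, so these matches cannot be used to update the
// original source.
const cleaned = canonicalize(javaSource);
const matches = grammar.findMatches(cleaned);
```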