Parser performance for the same grammar across code generation targets varies a great deal for two reasons:
- the raw performance of the underlying implementation language, such as Python versus C++
- how carefully the runtime support library has been tuned, as we did for the Java target; e.g., the hash function used has to be appropriate for our unusual use case
It's also the case that different grammars for the same language can exhibit radically different performance. For example, in the OOPSLA paper we compared the performance of two different grammars for Java. The grammar from the Java language specification, converted one-to-one to ANTLR notation, performed much worse than one we hand-tuned to reduce lookahead requirements.
A note on testing the performance of ANTLR parsers: ANTLR v4 generates ALL(*) parsers, which use a form of decision caching to improve future performance on the same or similar input statements. That implies there is a warm-up period before the parsers reach their final throughput, and of course Java's JIT also has a warm-up period (if using the Java target).
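In practice that means a benchmark should parse the same input several times and discard the first few measurements, which is what the test rig below does. For illustration only, here is a minimal sketch of that pattern, assuming the generated JavaLexer/JavaParser classes from the Java grammar (start rule compilationUnit) are on the classpath; this is not the TestANTLR4 rig that produced the numbers below:

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import java.nio.file.Paths;

public class WarmupTimer {
    public static void main(String[] args) throws Exception {
        int runs = 10, skip = 3;  // discard early runs while the DFA cache and JIT warm up
        long total = 0;
        for (int i = 0; i < runs; i++) {
            // The ALL(*) DFA cache is static in the generated classes, so it
            // survives across parser instances and later runs get faster.
            JavaLexer lexer = new JavaLexer(CharStreams.fromPath(Paths.get(args[0])));
            JavaParser parser = new JavaParser(new CommonTokenStream(lexer));
            long start = System.nanoTime();
            parser.compilationUnit();                  // start rule of the Java grammar
            long elapsed = System.nanoTime() - start;
            if (i >= skip) total += elapsed;
            System.out.printf("run %d: %.1f ms%n", i, elapsed / 1e6);
        }
        System.out.printf("average after warmup: %.1f ms%n", total / 1e6 / (runs - skip));
    }
}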
At this point, the test script compares the performance of a tuned Java grammar (parsing all of the Java code generated from the three grammars), a PostgreSQL grammar, and a SPARQL grammar on sample input. For convenience, a snapshot of the upcoming ANTLR 4.10 release is in the lib dir and is used by the scripts.
cd /tmp
git clone [email protected]:antlr/performance.git
cd performance/java
Then you can run the build script, which will download the grammar repository, pull out the three grammars of interest, and generate code via ANTLR:
$ ./build.sh
Download sample grammars
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 133 100 133 0 0 565 0 --:--:-- --:--:-- --:--:-- 568
100 27.1M 0 27.1M 0 0 5110k 0 --:--:-- 0:00:05 --:--:-- 6674k
Unzip sample grammars
error: cannot create grammars-v4-master/molecule/examples/NiC2O4 -? 2H2O.txt
Illegal byte sequence
Copy sample grammars and input
Building parsers
warning(146): PostgreSQLLexer.g4:2610:0: non-fragment lexer rule AfterEscapeStringConstantMode_NotContinued can match the empty string
warning(146): PostgreSQLLexer.g4:2629:0: non-fragment lexer rule AfterEscapeStringConstantWithNewlineMode_NotContinued can match the empty string
Compiling parsers and test rig
Note: TestANTLR4.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Here is how to test the throughput:
$ ./throughput.sh
SPARQL
4 files
Parsed 4 files 24 lines 721 bytes in 151ms at 158 lines/sec 4,774 chars/sec
Parsed 4 files 24 lines 721 bytes in 7ms at 3,428 lines/sec 103,000 chars/sec
Parsed 4 files 24 lines 721 bytes in 6ms at 4,000 lines/sec 120,166 chars/sec
Parsed 4 files 24 lines 721 bytes in 17ms at 1,411 lines/sec 42,411 chars/sec
Parsed 4 files 24 lines 721 bytes in 7ms at 3,428 lines/sec 103,000 chars/sec
Parsed 4 files 24 lines 721 bytes in 9ms at 2,666 lines/sec 80,111 chars/sec
Parsed 4 files 24 lines 721 bytes in 7ms at 3,428 lines/sec 103,000 chars/sec
Parsed 4 files 24 lines 721 bytes in 10ms at 2,400 lines/sec 72,100 chars/sec
Parsed 4 files 24 lines 721 bytes in 6ms at 4,000 lines/sec 120,166 chars/sec
Parsed 4 files 24 lines 721 bytes in 6ms at 4,000 lines/sec 120,166 chars/sec
average parse 8.857ms, min 6.582ms, stddev=3.891ms (First 3 runs skipped for JIT warmup)
Java
16 files
Parsed 16 files 137,280 lines 4,341,964 bytes in 1455ms at 94,350 lines/sec 2,984,167 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 429ms at 320,000 lines/sec 10,121,128 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 312ms at 440,000 lines/sec 13,916,551 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 327ms at 419,816 lines/sec 13,278,177 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 290ms at 473,379 lines/sec 14,972,289 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 284ms at 483,380 lines/sec 15,288,605 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 363ms at 378,181 lines/sec 11,961,333 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 433ms at 317,043 lines/sec 10,027,630 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 232ms at 591,724 lines/sec 18,715,362 chars/sec
Parsed 16 files 137,280 lines 4,341,964 bytes in 245ms at 560,326 lines/sec 17,722,302 chars/sec
average parse 310.571ms, min 232.870ms, stddev=70.249ms (First 3 runs skipped for JIT warmup)
postgresql
3 files
Parsed 3 files 5,232 lines 181,261 bytes in 5506ms at 950 lines/sec 32,920 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 1203ms at 4,349 lines/sec 150,674 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 812ms at 6,443 lines/sec 223,227 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 683ms at 7,660 lines/sec 265,389 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 737ms at 7,099 lines/sec 245,944 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 694ms at 7,538 lines/sec 261,182 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 619ms at 8,452 lines/sec 292,828 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 620ms at 8,438 lines/sec 292,356 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 623ms at 8,398 lines/sec 290,948 chars/sec
Parsed 3 files 5,232 lines 181,261 bytes in 603ms at 8,676 lines/sec 300,598 chars/sec
average parse 654.143ms, min 603.979ms, stddev=50.453ms (First 3 runs skipped for JIT warmup)
You'll notice that the SQL parser has much lower throughput than the Java parser. Part of this is due to the complexity of the grammars:
$ wc grammars/*.g4
241 723 6877 grammars/JavaLexer.g4
750 1825 16397 grammars/JavaParser.g4
2650 6672 30064 grammars/PostgreSQLLexer.g4
5327 12002 103840 grammars/PostgreSQLParser.g4
505 1217 9337 grammars/Sparql.g4
But there is usually room to improve performance by left-factoring common grammatical prefixes. Using the IntelliJ plugin's built-in profiler, we can see that the amount of lookahead for even a small create statement is quite large. Consider the highlighted section here:
(Open PostgreSQLParser.g4 in IntelliJ, right-click on rule root and select "Test rule root", enter sample input in the ANTLR Preview tool pane, then click on the profiler; click on the various column headers to sort forward and backwards.)
The parser needs 12 tokens of lookahead because of the way the grammar is expressed:
There are multiple statements that either start with the same left prefix or are variations on a create statement. In this case, it looks like the parser is scanning the entire statement until the semicolon before deciding between the various grammatical forms. SQL is a very complex (or at least big) language, so merging grammatical rules might put a larger burden on a semantic analyzer phase and might also make the grammar less readable. This is a trade-off to keep in mind when either designing languages or implementing grammars. :)
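The same lookahead numbers can also be collected without the IDE by turning on the Java runtime's profiling mode. Here is a rough sketch, assuming the PostgreSQLLexer/PostgreSQLParser classes generated by the build script are on the classpath (start rule root, as above):

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.atn.DecisionInfo;
import java.nio.file.Paths;

public class ProfileLookahead {
    public static void main(String[] args) throws Exception {
        PostgreSQLLexer lexer = new PostgreSQLLexer(CharStreams.fromPath(Paths.get(args[0])));
        PostgreSQLParser parser = new PostgreSQLParser(new CommonTokenStream(lexer));
        parser.setProfile(true);   // use the profiling ATN simulator
        parser.root();             // start rule of the PostgreSQL grammar
        for (DecisionInfo d : parser.getParseInfo().getDecisionInfo()) {
            if (d.SLL_MaxLook > 1) {   // show only decisions that needed real lookahead
                int ruleIndex = parser.getATN().getDecisionState(d.decision).ruleIndex;
                System.out.printf("decision %d in rule %s: max lookahead %d%n",
                                  d.decision, parser.getRuleNames()[ruleIndex], d.SLL_MaxLook);
            }
        }
    }
}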
For example, let's say we have a simple rule in isolation whose alternatives look very similar from the left edge:
create
: 'create' 'table' ID '(' ID 'integer' ')' ';'
| 'create' 'table' ID '(' ')' ';'
;
ALL(*) has no problem with that rule; it just dynamically scans ahead until it can distinguish the alternatives of any decision it faces. The cost is having to look ahead to the token beyond '(' in order to make a decision. In this case, it looks ahead 5 tokens.
Without changing the language recognized, we can left-factor and collapse the alternatives of that rule so that it needs only one symbol of lookahead to match the optional (ID 'integer')? subrule:
create
: 'create' 'table' ID '(' (ID 'integer')? ')' ';'
;
But, of course, this grammar might be harder to read. The language is the same but we have expressed it differently to improve performance; the usual trade-off in optimizations.
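To make the trade-off concrete: with the alternatives collapsed, telling the two original forms apart becomes the job of whatever walks the parse tree. A small hypothetical sketch of that kind of check, assuming ANTLR's usual generated listener and context classes for the rule above (MyGrammarParser and MyGrammarBaseListener are illustrative names, not from this repository):

// Hypothetical listener over the left-factored create rule; ctx.ID() follows
// ANTLR's normal code generation for a rule with repeated ID token references.
public class CreateListener extends MyGrammarBaseListener {
    @Override
    public void exitCreate(MyGrammarParser.CreateContext ctx) {
        // One ID means the empty-column form; two IDs mean the integer-column form.
        // This distinction used to be encoded directly in the two grammar alternatives.
        if (ctx.ID().size() == 1) {
            System.out.println("create empty table " + ctx.ID(0).getText());
        } else {
            System.out.println("create table " + ctx.ID(0).getText()
                               + " with integer column " + ctx.ID(1).getText());
        }
    }
}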