ANTLR to recognize a language with a predefined alphabet? #4385
Replies: 2 comments
-
Why can't you use an ordinary grammar and split the symbols during parse tree traversal using a Visitor? I mean you can treat
-
I think this question is not very well framed. I need to think about it further. I may open it later if I get a clearer idea.
-
Hi all,
I'm wondering whether ANTLR supports the following task.
Given this simple grammar:
ANTLR4 will generate a parser that can parse expressions like this:
and produce an AST like this:
But what if I want to generate a parser that has a predefined alphabet, for example:
And we define a special symbol ~ that glues two numbers together. For example:
With this alphabet, the parser should be able to parse expressions like this:
and produce multiple ASTs like this:
At a high level, we have an ambiguous grammar, and we want to generate a parser that can parse it and produce multiple ASTs.
The ambiguity comes from the lexer level: the same string can be tokenized in different ways. For example, 12 can be tokenized as 12 or as 1 ~ 2.
Does this question make sense?
Can ANTLR support this? If not, is this a feature that is interesting to you?
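To make the lexer-level ambiguity concrete, here is a minimal Python sketch (not ANTLR code) that enumerates every way a string can be split into tokens from a fixed alphabet. The alphabet `ALPHABET` and the helper `tokenizations` are hypothetical and chosen only for illustration; they are not the alphabet from the example above.

```python
from typing import Iterator

# Hypothetical alphabet, assumed for illustration only.
ALPHABET = {"1", "2", "12", "+"}

def tokenizations(text: str, alphabet: set[str]) -> Iterator[list[str]]:
    """Yield every way to split `text` into tokens drawn from `alphabet`."""
    if not text:
        # The empty string has exactly one tokenization: no tokens.
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in alphabet:
            # Try every tokenization of the remainder after this token.
            for rest in tokenizations(text[end:], alphabet):
                yield [head] + rest

if __name__ == "__main__":
    for tokens in tokenizations("12+1", ALPHABET):
        print(tokens)
```

Under this assumed alphabet, the input 12+1 has two token sequences, one that starts with the tokens 1 and 2 and one that starts with the single token 12. Each sequence would then have to be parsed separately to obtain one AST per tokenization.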
I don't come from a PL background, so my description may be inaccurate or confusing. Please let me know if anything is unclear.
P.S.
The scenario above arises when we want to use an LLM to generate a sentence of a formal language. An LLM has a predefined alphabet (its token list), typically around 30K tokens. Current LLM tokenizer implementations allow ambiguity, i.e. a string can be tokenized in different ways, and in practice the longest match is returned.
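As a rough illustration of the longest-match behavior mentioned above, here is a simplified greedy tokenizer in Python. It is only a sketch: the vocabulary and the function `longest_match_tokenize` are hypothetical, and it is not meant to reproduce how any particular LLM tokenizer is implemented.

```python
def longest_match_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedily pick the longest vocabulary entry at each position."""
    max_len = max(map(len, vocab))
    tokens: list[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                tokens.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens

if __name__ == "__main__":
    # Hypothetical vocabulary, assumed for illustration only.
    vocab = {"1", "2", "12", "+"}
    print(longest_match_tokenize("12+1", vocab))
```

With this assumed vocabulary the greedy strategy returns a single tokenization that begins with the token 12, even though splitting into 1 and 2 is also valid. That is the gap between the tokenizer's output and the full set of tokenizations an ambiguity-aware parser would need to consider.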
Related work in this direction: