ANTLR to recognize language with a predefined alphabet ? #4385

Saibo-creator · 2023-08-15T08:53:21Z

Hi all,

I'm wondering whether ANTLR supports the following task.

Given this simple grammar:

grammar Expr;		
prog:	expr EOF ;
expr:	expr ('*'|'/') expr
    |	expr ('+'|'-') expr
    |	INT
    |	'(' expr ')'
    ;
NEWLINE : [\r\n]+ -> skip;
INT     : [0-9]+ ;

ANTLR4 will generate a parser that can parse expressions like this:

(1+2)*3

and produce an AST like this:

(* (+ 1 2) 3)

But what if I want to generate a parser that works over a predefined alphabet, for example:

alphabet = {1,2,3,4,5,6,7,8,9,0,+,-,*,/,(,), 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}

We also define a special symbol ~ that glues two numbers together. For example:

1 ~ 2 = 12
1 ~ 3 = 13

With this alphabet, the parser should be able to parse expressions like this:

(12+13)*14

and produce multiple ASTs like this:

(* (+ 12 13) 14)
(* (+ 12 13 ) (1 ~ 4) )
(* (+ 12 (1 ~ 3) ) (1 ~ 4) )
(* (+ (1 ~ 2) (1 ~ 3) ) (1 ~ 4) )
...

At a high level, we have an ambiguous grammar, and we want to generate a parser that can parse it and produce multiple ASTs.

The ambiguity arises at the lexer level: the same string can be tokenized in different ways. For example, 12 can be tokenized as 12 or as 1 ~ 2.

Does this question make sense?

Can ANTLR support this? If not, is this a feature that is interesting to you?

I don't come from a PL background, so my description may be inaccurate or confusing. Please let me know if anything is unclear.

P.S.
The scenario above arises when we want to use an LLM to generate a sentence of a formal language. An LLM has a predefined alphabet (its token list), typically around 30K tokens. Current LLM tokenizers allow ambiguity, i.e. a string can be tokenized in different ways, and in practice the longest match is returned.
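As a rough illustration of the longest-match rule mentioned here, the following Python sketch tokenizes an expression greedily over the example alphabet. `VOCAB` and `longest_match_tokenize` are hypothetical names, and real LLM tokenizers (for example BPE-based ones) follow their own merge rules, so this only shows the simple greedy idea.

```python
# Example alphabet from this post: digits, 12..23, operators and parentheses.
VOCAB = {str(d) for d in range(10)} | {str(n) for n in range(12, 24)} | set("+-*/()")


def longest_match_tokenize(text, vocab=VOCAB):
    """Greedy tokenizer: at each position take the longest token in `vocab`."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token matches at position {i}: {text[i]!r}")
    return tokens


print(longest_match_tokenize("(12+13)*14"))
# ['(', '12', '+', '13', ')', '*', '14']
```

The greedy result is just one of the possible tokenizations; 1 ~ 2 for 12 is equally valid over this alphabet but is never produced by this rule.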

Related work in this direction:

https://github.com/mkuchnik/relm (they support regex)

KvanTTT · 2023-08-15T10:25:20Z

Why can't you use an ordinary grammar and split the symbols during parse tree traversal with a Visitor? I mean you can treat 13 either as just 13 or as 1 ~ 3.
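One way to read this suggestion is sketched below in Python: parse once with the ordinary grammar, then expand each number leaf into its alternatives while walking the tree. Nested tuples stand in for the ANTLR parse tree, and `splits`, `variants` and `NUMBER_TOKENS` are hypothetical helpers, not ANTLR API.

```python
from itertools import product

# Numeric part of the example alphabet: single digits plus 12..23.
NUMBER_TOKENS = {str(d) for d in range(10)} | {str(n) for n in range(12, 24)}


def splits(number):
    """Return all ways to split a digit string into alphabet tokens."""
    if not number:
        return [[]]
    result = []
    for end in range(1, len(number) + 1):
        head = number[:end]
        if head in NUMBER_TOKENS:
            result += [[head] + rest for rest in splits(number[end:])]
    return result


def variants(tree):
    """Expand a tree (operator tuple or number string) into all alternative trees."""
    if isinstance(tree, str):
        # A number leaf: keep it whole, or glue smaller tokens with "~".
        return [seg[0] if len(seg) == 1 else ("~", *seg) for seg in splits(tree)]
    op, *children = tree
    # Combine every alternative of every child.
    return [(op, *combo) for combo in product(*(variants(c) for c in children))]


tree = ("*", ("+", "12", "13"), "14")  # stands for (* (+ 12 13) 14)
for alternative in variants(tree):
    print(alternative)
```

For this input each of the three numbers has two readings, so the loop prints eight trees, including the original one and the fully split one.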

Saibo-creator · 2023-08-15T14:02:57Z

I think this question is not very well framed. I need to think about it further. I may reopen it later once I have a clearer idea.
