Combining C grammar with partial PL/SQL grammar for parsing ProC #4170

raffian · 2023-03-10T17:00:56Z

Our goal is to create a grammar for parsing source code that consists of ANSI C with embedded PL/SQL 11g/12c. This language, called Pro*C, was developed by Oracle in the 90s.

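#include <stdio.h>
int main() {
   varchar firstname[15];   /* Pro*C VARCHAR: a struct with .len and .arr members */

   EXEC SQL
      SELECT fname
      INTO :firstname
      FROM EMPLOYEES
      WHERE id = 999;

   /* VARCHAR data is not null-terminated, so print exactly .len characters */
   printf("Hello, %.*s", firstname.len, (char *)firstname.arr);
   return 0;
}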

Embedded SQL in Pro*C always starts with "EXEC SQL", is followed by an arbitrary PL/SQL block, and ends with ';'. There's a bit more to it than that, but for our needs that single rule covers 95% of our Pro*C legacy code.
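To make the "EXEC SQL ... ;" rule concrete, here is a minimal Python sketch that locates such spans in a source string. The helper name find_exec_sql and the sample text are hypothetical. It assumes a statement ends at the first ';' outside a single-quoted SQL literal, so a statement that contains its own ';' (for example an embedded anonymous block) would be cut short, and C comments or C string literals that happen to contain "EXEC SQL" are not handled.

```python
import re

# Hypothetical helper, not part of ANTLR or the grammars-v4 repository.
EXEC_SQL = re.compile(r"\bEXEC\s+SQL\b", re.IGNORECASE)


def find_exec_sql(source):
    """Return (start, end, body) for each 'EXEC SQL ... ;' span; end is exclusive."""
    spans = []
    pos = 0
    while True:
        match = EXEC_SQL.search(source, pos)
        if match is None:
            break
        i = match.end()
        in_string = False
        # Scan forward to the first ';' that is not inside a '...' literal.
        while i < len(source):
            ch = source[i]
            if ch == "'":
                in_string = not in_string
            elif ch == ";" and not in_string:
                break
            i += 1
        if i >= len(source):
            break  # unterminated statement: leave it to the host-language parser
        # Collapse whitespace for display (this also touches literal contents).
        body = " ".join(source[match.end():i].split())
        spans.append((match.start(), i + 1, body))
        pos = i + 1
    return spans


SAMPLE = """int main() {
   EXEC SQL
      SELECT fname
      INTO :firstname
      FROM EMPLOYEES
      WHERE id = 999;
   EXEC SQL INSERT INTO notes VALUES ('a;b');
   return 0;
}
"""

for start, end, body in find_exec_sql(SAMPLE):
    print(body)
```

Each returned span could then be handed to a separate SQL parser while the rest of the text goes to the C parser.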

Our theory for achieving this is to extract the subset of rules/tokens from the plsql grammar that make up the "body" portion of plsql procedures and functions; we don't need the remaining plsql rules. Once identified, we merge those rules into the C grammar, thus creating a Pro-C.g4 grammar. Based on our analysis, the starting point in the plsql grammar is the rule called body:

https://github.com/antlr/grammars-v4/blob/master/sql/plsql/PlSqlParser.g4#L5457

We have a few weeks of experience with ANTLR at most, but here's what we've done so far to extract the body rule and its dependencies from the pl/sql grammar:

1. Use ANTLR to generate the parser, lexer, and visitor Java classes for the ANTLRv4Parser.g4 grammar. We did this because extracting a subset of rules from the plsql grammar requires parsing the grammar itself.
2. Parse the PlSqlParser.g4 grammar with a custom visitor method, visitParserRuleSpec(ANTLRv4Parser.ParserRuleSpecContext ctx).
3. Walk the tree looking for the target node body, using if(ctx.getText().startsWith("body:")).
4. On finding the body: node, grab the raw text of the rule and extract all constituent rules/tokens from its alternatives. We don't yet understand ANTLR well enough to do this step with the tree API (perhaps getChild()/Nodes, etc.), so instead we extract the rules/tokens manually with regex and push them onto a stack.
5. Repeat steps 3 and 4 recursively, saving all rules/tokens found along the way, until the stack is empty.
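The stack-based extraction in the steps above can be sketched in plain Python on a toy grammar. This is not our Java visitor code. The rule names in TOY_GRAMMAR and the helpers parse_rules and closure are hypothetical, and the sketch assumes one rule per line with no labels, actions, or comments, which the real PlSqlParser.g4 does not guarantee. It relies on the ANTLR convention that parser rule names start with a lowercase letter and token names with an uppercase letter.

```python
import re

# Toy stand-in for PlSqlParser.g4; every rule name here is hypothetical.
TOY_GRAMMAR = """
body : BEGIN seq_of_statements END ;
seq_of_statements : (statement ';')+ ;
statement : assignment | null_statement ;
assignment : ID ASSIGN expression ;
expression : ID | NUMBER ;
null_statement : NULL_ ;
create_table : CREATE TABLE ID ;
"""


def parse_rules(grammar_text):
    """Map each rule name to the identifiers referenced in its body."""
    rules = {}
    for line in grammar_text.splitlines():
        name, sep, rhs = line.partition(":")
        if not sep:
            continue  # blank line or anything that is not a rule definition
        rhs = re.sub(r"'[^']*'", " ", rhs)  # drop quoted literals such as ';'
        rules[name.strip()] = re.findall(r"[A-Za-z_]\w*", rhs)
    return rules


def closure(rules, start):
    """Collect every rule and token reachable from start, using an explicit stack."""
    seen_rules, seen_tokens = set(), set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen_rules:
            continue
        seen_rules.add(name)
        for ref in rules.get(name, []):
            if ref[0].isupper():
                seen_tokens.add(ref)   # uppercase initial: lexer token
            elif ref not in seen_rules:
                stack.append(ref)      # lowercase initial: parser rule to visit
    return seen_rules, seen_tokens


needed_rules, needed_tokens = closure(parse_rules(TOY_GRAMMAR), "body")
print(sorted(needed_rules))
print(sorted(needed_tokens))
```

In this toy input, create_table and its tokens are never reached from body, so they drop out of the extracted subset.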

Result
Of the 927 rules and 2325 tokens in the PL/SQL grammar, the body rule depends on 285 rules and nearly all of the tokens.

The next step will be to parse plsql blocks from our legacy code using this partial grammar, without any C grammar rules. Assuming those tests are successful, the last step will be to merge all extracted plsql rules and tokens into the C grammar. This step must be done carefully to avoid name conflicts. An easy solution we're exploring is to prefix all plsql rules and tokens with a qualifier, something like this:

plsql_<rule>
PLSQL_<TOKEN>
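A rough Python sketch of that renaming pass, applied to grammar text. The function add_prefix and the sample rule are hypothetical. It assumes identifiers only need renaming outside single-quoted literals, and it ignores grammar comments and embedded actions. The lowercase and uppercase prefixes keep renamed parser rules starting with a lowercase letter and renamed tokens with an uppercase letter, as ANTLR requires.

```python
import re


def add_prefix(grammar_text, rule_names, token_names):
    """Prefix the given rule/token names wherever they occur as whole identifiers."""
    def rename(match):
        word = match.group(0)
        if word in rule_names:
            return "plsql_" + word
        if word in token_names:
            return "PLSQL_" + word
        return word

    # Split on quoted literals so that text inside '...' is never rewritten.
    parts = re.split(r"('[^']*')", grammar_text)
    for i in range(0, len(parts), 2):  # even indexes hold the non-literal pieces
        parts[i] = re.sub(r"[A-Za-z_]\w*", rename, parts[i])
    return "".join(parts)


SAMPLE_RULE = "body : BEGIN statement 'BEGIN' END ;"
RENAMED = add_prefix(SAMPLE_RULE, {"body", "statement"}, {"BEGIN", "END"})
print(RENAMED)
```

Feeding it the rule and token sets collected during extraction would rename definitions and references in one pass.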

So that's it - that's our plan. Is this approach worthwhile, or is it plagued with landmines and pitfalls not worth pursuing?

Update:
We're concerned about this note in the PL/SQL grammar readme. If case insensitivity matters to the PlSql grammar but not to the C grammar, what implications will this have when we merge the two grammars?

"As SQL grammar are normally not case sensitive but this grammar implementation is, you must use a custom character stream that converts all characters to uppercase before sending them to the lexer."

https://github.com/antlr/grammars-v4/tree/master/sql/plsql#readme

Thanks for listening,
Raffi

jimidle · 2023-03-11T00:21:36Z

I have actually done this exact thing before, many years ago. Before ANTLR.

With ANTLR, you can switch modes in the lexer and combine parsers. But I found for this that it was easier to just match the entire EXEC SQL statement and pass it off to a different parser. It is all doable, but needs a bit of care.

I'm looking for work if you want to contract it out ;)

For the case sensitivity/insensitivity, you can code the lexer tokens for PL/SQL in a separate mode from the C lexer. Or just match the EXEC SQL in the C lexer and call a different lexer/parser from that point in the input stream up until the end of the EXEC SQL statement. How to do that depends on what your performance requirements are.
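As an illustration of the hand-off idea only, here is a small Python sketch that splits the input into C and SQL segments and applies case-insensitive keyword matching to the SQL segments alone. It is not ANTLR lexer modes. The names split_segments, sql_tokens, and SQL_KEYWORDS are hypothetical, the keyword set is a tiny placeholder, and the sketch assumes each statement ends at the first ';' after EXEC SQL.

```python
import re

# Assumption: a statement runs from EXEC SQL to the first ';' that follows it.
EXEC_SQL_STMT = re.compile(r"\bEXEC\s+SQL\b(.*?);", re.IGNORECASE | re.DOTALL)

# Placeholder keyword set; a real SQL lexer would know far more.
SQL_KEYWORDS = {"SELECT", "INTO", "FROM", "WHERE"}


def split_segments(source):
    """Split source into ('c', text) and ('sql', text) segments, in order."""
    segments = []
    pos = 0
    for match in EXEC_SQL_STMT.finditer(source):
        if match.start() > pos:
            segments.append(("c", source[pos:match.start()]))
        segments.append(("sql", match.group(1).strip()))
        pos = match.end()
    if pos < len(source):
        segments.append(("c", source[pos:]))
    return segments


def sql_tokens(text):
    """Toy stand-in for the SQL lexer: keywords match regardless of case."""
    tokens = []
    for word in re.findall(r":?\w+|[^\s\w]", text):
        if word.upper() in SQL_KEYWORDS:
            tokens.append(("KEYWORD", word.upper()))
        else:
            tokens.append(("OTHER", word))
    return tokens


MIXED = "int Select = 1;\nexec sql select fname Into :n From EMPLOYEES;\nreturn Select;"
SEGMENTS = split_segments(MIXED)
for kind, text in SEGMENTS:
    print(kind, sql_tokens(text) if kind == "sql" else repr(text))
```

The C segments pass through untouched, so an identifier such as Select stays case-sensitive there while select, Into, and From are still recognized inside the SQL segment.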

If you have taken it as far as a tree, then you probably went too far IMO - you can do this at lexing/parsing and then end up with a tree that represents both languages.

1 reply

raffian Mar 12, 2023
Author

Hello Jim,

So you've actually done a Pro*C conversion? Interesting, what was the target language?
I'll reach out to you on LinkedIn, we should talk.

"If you have taken it as far as a tree, then you probably went too far IMO - you can do this at lexing/parsing and then end up with a tree that represents both languages."

We used parse trees to parse the plsql grammar, not Pro*C code. We did that to extract a subset of grammar rules for handling the embedded SQL commands in Pro*C. I don't believe we need the entire plsql grammar for parsing Pro*C; we need to do more analysis.

KvanTTT · 2023-03-11T12:54:41Z

"As SQL grammar are normally not case sensitive but this grammar implementation is, you must use a custom character stream that converts all characters to uppercase before sending them to the lexer."

That info is outdated because options { caseInsensitive = true;} is used. The custom CharStream is no longer needed. It should be removed from the README.

2 replies

raffian Mar 11, 2023
Author

Thank you for pointing that out.
I saw options { caseInsensitive = true;} in the grammar, which seemed to contradict the README, but I wasn't sure what to make of it.

KvanTTT Mar 12, 2023

I've updated the README and removed the outdated info.

jimidle · 2023-03-12T06:25:53Z

Yes. It was slightly different because it was a compiler for my own company’s language; a form of BASIC that generated C. I put the EXEC SQL in that language so I could parse it and extract target variables etc. then generate the C equivalent. Same basic idea though - recognize the exec SQL, then switch lexers and parsers using the same input stream.

For sure, let’s talk. First thing to know is what you are looking to achieve of course.

I see what you mean with the tree now. There is probably a simpler approach but I would need to see where you’re headed. It’s a solvable situation though with a little discussion.

https://www.LinkedIn.com/jimidle

Talk later,
Jim


seychelles111 · 2023-04-18T12:40:43Z

Hey @raffian @KvanTTT @jimidle @aphyr @tonyarnold,

Please check my bounty at #4239. I am offering 25 UST, or $50 for a charity of your liking.
