This repository contains a framework that lets you build parsers for languages in just a few lines. Specifically, it contains:
- A regex engine that supports a subset of regex, based on minimal DFA (Min-DFA).
- An LR(1) transition table generator.
- A portable scanner + parser, with built-in CFG production action support (i.e. functions to be called when a production is reduced).

This repository doesn't contain:
- LR error detection and recovery (you will get cryptic error messages when feeding it ill-formed input, or when the CFG is problematic)
- Lossless parse trees (spaces, tabs, comments, and token spans are not recorded)
- EBNF support
- An interface to control token types' priority in the regex set (problems may occur when the languages described by two regexes overlap)

So, it's capable of building an AST or evaluating an expression tree. You may use the AST for checks like linting or formatting. However, it cannot be used to build syntax highlighters or language servers.
There are no dependencies if you just want to use the repository. However, if you want to contribute to the framework, you need to install `pytest` and `pre-commit` using `poetry install --only dev`.
To start, build a language definition object using the `LangDefBuilder` class method:
```python
from lang_def_builder import LangDefBuilder

ld = LangDefBuilder.new(
    """
    START -> E
    E -> E "+" T | E "-" T | T
    T -> T "*" F | F
    F -> "(" E ")" | int_const
    int_const -> r"0|(-?)[1-9][0-9]*"
    """
)
```
`ld` has type `LangDef`, which stores all the information required to parse the language you defined.
You can then define actions to be executed when a production is recognized by decorating functions with `ld`:
@ld.production("E -> T", "T -> F", "F -> int_const")
def __identity(_, e: int) -> int:
return e
@ld.production('E -> E "+" T')
def __add(_, e: int, _p: str, t: int) -> int:
return e + t
@ld.production('E -> E "-" T')
def __sub(_, e: int, _m: str, t: int) -> int:
return e - t
@ld.production('T -> T "*" F')
def __mul(_, t: int, _m: str, f: int) -> int:
return t * f
@ld.production('F -> "(" E ")"')
def __par(_, _l, e: int, _r) -> int:
return e
@ld.production('int_const -> r"0|(-?)[1-9][0-9]*"')
def __int(_, int_const: str) -> int:
return int(int_const)
You can now host a calculator within a few lines:
```python
while True:
    try:
        print(ld.eval(input(">>> "), {}))
    except EOFError:
        break
```
Save these code snippets into a file, and you should get a working calculator:
```
$ python examples/calc.py
>>> 5 + 6 * 7 - (8 - 9 * 10)
129
>>> (5 + 6) * 7 - 8 - 9 * 10
-21
>>>
```
You can find the above code in `examples/calc.py` and try running it yourself. There are also other simple examples that might interest you.
You don't have to copy & paste heavy dependencies such as `TypeDefinition`, `ContextFreeGrammar`, and `dfa_utils` whenever you need to use a parser. If you want to use a parser in your project, you only need to copy `lang_def.py`, where all transition tables are stored and all parsing algorithms are defined.
You can do this by calling `ld.to_json()`, which dumps all internal transition tables of the `LangDef` into a single JSON object:
```python
json.dump(ld.to_json(), open("calc_v2.json", "w"))
```
Note that the associated action functions are not saved. You must define them manually after loading the `LangDef` with `LangDef.from_json()`.
For example:
```python
import json

from lang_def import LangDef

ld = LangDef.from_json(json.load(open("calc_v2.json", "r")))


@ld.production("E -> T", "T -> F", "F -> int_const")
def __identity(_, e: int) -> int:
    return e


@ld.production('E -> E "+" T')
def __add(_, e: int, _p: str, t: int) -> int:
    return e + t


@ld.production('E -> E "-" T')
def __sub(_, e: int, _m: str, t: int) -> int:
    return e - t


@ld.production('T -> T "*" F')
def __mul(_, t: int, _m: str, f: int) -> int:
    return t * f


@ld.production('F -> "(" E ")"')
def __par(_, _l, e: int, _r) -> int:
    return e


@ld.production('int_const -> r"0|(-?)[1-9][0-9]*"')
def __int(_, int_const: str) -> int:
    return int(int_const)


while True:
    try:
        print(ld.eval(input(">>> "), {}))
    except EOFError:
        break
```
This lets you avoid rebuilding transition tables from definitions each time, which can be time-consuming when the definition is large.
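For instance, a minimal caching pattern (a sketch assuming the `LangDefBuilder`/`LangDef` API shown above; the cache path is arbitrary) could look like this:

```python
import json
import os

from lang_def import LangDef
from lang_def_builder import LangDefBuilder

CACHE_PATH = "calc_v2.json"  # arbitrary cache location

if os.path.exists(CACHE_PATH):
    # Fast path: reuse the prebuilt transition tables.
    with open(CACHE_PATH) as f:
        ld = LangDef.from_json(json.load(f))
else:
    # Slow path: build from the grammar, then cache for next time.
    ld = LangDefBuilder.new(
        """
        START -> E
        E -> E "+" T | E "-" T | T
        T -> T "*" F | F
        F -> "(" E ")" | int_const
        int_const -> r"0|(-?)[1-9][0-9]*"
        """
    )
    with open(CACHE_PATH, "w") as f:
        json.dump(ld.to_json(), f)

# Either way, action functions must still be attached, as shown above.
```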
We provide a web app that prints out the contents of each LR(1) item set and their transition tables. On top of that, you can parse input and see the result in real time.
The screenshot shows the parsing result of this expression:
```
(define (even? x)
  (& x 1))
```
The underlying CFG is:
```
START -> expression
expression -> function_call | val
function_call -> "(" id arguments ")"
arguments -> expression arguments | expression
val -> int_const | id
int_const -> r"0|(-?)[1-9][0-9]*"
id -> r".|([a-zA-Z]|_)([a-zA-Z]|[0-9]|_|\?)*"
```
Run the web app locally with:

```sh
flask --app server.py run
```
Currently, the regex engine only supports `*`, `|`, `?`, `+`, `[a-z]`, and `[^x]`. You may use `\` to escape these special chars, including `\` itself.
The core underlying mechanism is range matching: an edge in the DFA can accept multiple ranges of chars.
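As a rough illustration only (the class below is hypothetical, not the repository's actual data structure), an edge carrying multiple character ranges might behave like this:

```python
# Hypothetical sketch of range matching; not the repository's actual API.
from dataclasses import dataclass


@dataclass
class RangeEdge:
    # Inclusive (lo, hi) character ranges, e.g. [("a", "z"), ("0", "9")]
    # for the character class [a-z0-9].
    ranges: list[tuple[str, str]]

    def accepts(self, ch: str) -> bool:
        # The edge accepts a char if it falls within any of its ranges.
        return any(lo <= ch <= hi for lo, hi in self.ranges)


edge = RangeEdge([("a", "z"), ("0", "9")])
assert edge.accepts("k") and edge.accepts("7")
assert not edge.accepts("?")
```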
We use a slight variant of BNF:
- `::=` is replaced by `->`.
- Text surrounded by quotes, e.g. `"xxx"`, represents a terminal. On top of BNF, we introduce `r"..."` to indicate that a terminal should be parsed by the regex engine. `''` means epsilon.
- Everything else that's not wrapped in quotes is considered a non-terminal. If you prefer wrapping them with `<>`, you are free to do so.
- You must use `START` as the entry non-terminal.
Here's an example:
```
START -> E
E -> E "+" T | E "-" T | T
T -> "(" E ")" | int_const
int_const -> r"0|(-?)[1-9][0-9]*"
```
In this CFG, `"+"`, `"-"`, `"("`, `")"`, and `r"0|(-?)[1-9][0-9]*"` are terminals; `START`, `E`, `T`, and `int_const` are non-terminals.
Notice that in plain terminals (`"+"`, `"-"`, `"("`, `")"`), you don't need `\` to alter their meaning. They are all treated as literal chars to be matched.
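As a contrast (an illustrative snippet; the `lparen` production is hypothetical and not part of the examples above), these two productions match the same character, but only the regex form needs the escape:

```
F -> "(" E ")" | int_const
lparen -> r"\("
```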
- Rewrite lexer using DFA instead of NFA-based `re`
  - Impl regex parser that generates NFA
    - `*`
    - `|`
    - `?`
    - `()`
    - `+` (requires impl `Copy` trait for `FiniteAutomata`)
    - `\`
    - `[x-x]`
    - `.`
  - Impl NFA to DFA algo
    - Add tests
  - Impl minimize DFA algo
    - Add tests
    - Write a hash function that encodes the structural information of arbitrary finite automata for auto tests
  - Implement a lexer that's based on DFA
  - Add FA serialization & de-serialization
  - Add accept state id propagation & merge for faster scanning
- Migrate to typed python
- Make parser a module (now it's `LangDef`)
- Optimize parser memory usage by eliminating building an actual parse tree
- Add unit tests for each module
- (Cancelled) Define interfaces for CFG parser & implement other parsers (LL1, etc)
- Improve the `ContextFreeGrammar.from_string` parser, make it support `r"[^ ]*"`
- Further optimize the performance of `LangDef`
- Add parser generator
- Web app optimizations:
  - (Cancelled) Add custom `TypeDefinition` to web app (you don't have to provide a separate type definition file now)
  - Add example `TypeDefinition` and `ContextFreeGrammar` to web app
- Language definition QOL issues:
  - Unify `TypeDefinition` and `ContextFreeGrammar` into one file (CFG section, typedef section). Note: not by dividing into sections, but by integrating literals into the CFG
  - Remove boilerplate definitions in `typedef`, i.e. things that don't need regex power (e.g. `public public`, `( \(`)
  - (Cancelled) Make `ContextFreeGrammar` support BNF notation (or maybe extended BNF?)
    - My personal preference is `->` over `::=`; I also need regex support
    - Adding brackets, like `<non_terminal>`, is supported as long as they don't contain spaces
- Move `simpleSQL` and `simpleJava` to other repositories
  - For `simpleSQL`, implement an operator-based SQL engine in a separate project
  - For `simpleJava`, implement IR passes & maybe generate LLVM IRs in a separate project
    - Control flow graph generation?
    - SSA generation?
    - Partial redundancy elimination?
- (Cancelled) Add CLI support for generating a standalone front end parser (.py) given type definition, CFG definition, and tree actions
  - Already fulfilled by the `LangDef` class