This repository contains a framework that lets you build parsers for languages in just a few lines. Specifically, it contains:
- A regex engine that supports a subset of regex, based on minimal DFA (Min-DFA).
- An LR(1) transition table generator.
- A portable scanner + parser, with built-in CFG production action support (i.e. functions to be called when a production is reduced).

This repository doesn't contain:
- LR error detection and recovery (you will get cryptic error messages when feeding it ill-formed input, or when the CFG is problematic)
- Lossless parse trees (spaces, tabs, comments, and token spans are not recorded)
- EBNF support
- An interface to control token types' priority in the regex set (problems may occur when the languages described by two regexes overlap)

So, it's capable of building an AST or evaluating an expression tree. You may use the AST for checks like linting or formatting. However, it cannot be used to build syntax highlighters or language servers.
There are no dependencies if you just want to use the repository. However, if you want to contribute to the framework, you need to install `pytest` and `pre-commit` using `poetry install --only dev`.
To start, build a language definition object using the `LangDefBuilder` class method:
```python
from lang_def_builder import LangDefBuilder

ld = LangDefBuilder.new(
    """
    START -> E
    E -> E "+" T | E "-" T | T
    T -> T "*" F | F
    F -> "(" E ")" | int_const
    int_const -> r"0|(-?)[1-9][0-9]*"
    """
)
```
`ld` has type `LangDef`, which stores all the information required to parse the language you defined.
You can then define actions to be executed when a production is recognized by decorating functions with `ld`:
@ld.production("E -> T", "T -> F", "F -> int_const")
def __identity(_, e: int) -> int:
return e
@ld.production('E -> E "+" T')
def __add(_, e: int, _p: str, t: int) -> int:
return e + t
@ld.production('E -> E "-" T')
def __sub(_, e: int, _m: str, t: int) -> int:
return e - t
@ld.production('T -> T "*" F')
def __mul(_, t: int, _m: str, f: int) -> int:
return t * f
@ld.production('F -> "(" E ")"')
def __par(_, _l, e: int, _r) -> int:
return e
@ld.production('int_const -> r"0|(-?)[1-9][0-9]*"')
def __int(_, int_const: str) -> int:
return int(int_const)
You can now host a calculator within a few lines:
```python
while True:
    try:
        print(ld.eval(input(">>> "), {}))
    except EOFError:
        break
```
Save these code snippets into a file, and you should get a working calculator:
```
$ python examples/calc.py
>>> 5 + 6 * 7 - (8 - 9 * 10)
129
>>> (5 + 6) * 7 - 8 - 9 * 10
-21
>>>
```
You can find the above code in `examples/calc.py` and try running it yourself. There are also other simple examples that might interest you.
You don't have to copy & paste heavy dependencies such as `TypeDefinition`, `ContextFreeGrammar`, and `dfa_utils` whenever you need to use a parser. If you want to use a parser in your project, you only need to copy `lang_def.py`, where all transition tables are stored and all parsing algorithms are defined.
You can do this by calling `ld.to_json()`, which dumps all internal transition tables of the `LangDef` into a single JSON object:
```python
json.dump(ld.to_json(), open("calc_v2.json", "w"))
```
Note that the associated action functions are not saved. You must define them manually after loading the `LangDef` with `LangDef.from_json()`.
For example:
```python
import json

from lang_def import LangDef

ld = LangDef.from_json(json.load(open("calc_v2.json", "r")))


@ld.production("E -> T", "T -> F", "F -> int_const")
def __identity(_, e: int) -> int:
    return e


@ld.production('E -> E "+" T')
def __add(_, e: int, _p: str, t: int) -> int:
    return e + t


@ld.production('E -> E "-" T')
def __sub(_, e: int, _m: str, t: int) -> int:
    return e - t


@ld.production('T -> T "*" F')
def __mul(_, t: int, _m: str, f: int) -> int:
    return t * f


@ld.production('F -> "(" E ")"')
def __par(_, _l, e: int, _r) -> int:
    return e


@ld.production('int_const -> r"0|(-?)[1-9][0-9]*"')
def __int(_, int_const: str) -> int:
    return int(int_const)


while True:
    try:
        print(ld.eval(input(">>> "), {}))
    except EOFError:
        break
```
This lets you avoid rebuilding transition tables from definitions each time, which can be time-consuming when the definition is large.
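For instance, a minimal caching pattern (a sketch assuming the `LangDefBuilder`/`LangDef` API shown above; the cache path is arbitrary) could look like this:

```python
import json
import os

from lang_def import LangDef
from lang_def_builder import LangDefBuilder

CACHE_PATH = "calc_v2.json"  # arbitrary cache location

if os.path.exists(CACHE_PATH):
    # Fast path: reuse the prebuilt transition tables.
    with open(CACHE_PATH) as f:
        ld = LangDef.from_json(json.load(f))
else:
    # Slow path: build from the grammar, then cache for next time.
    ld = LangDefBuilder.new(
        """
        START -> E
        E -> E "+" T | E "-" T | T
        T -> T "*" F | F
        F -> "(" E ")" | int_const
        int_const -> r"0|(-?)[1-9][0-9]*"
        """
    )
    with open(CACHE_PATH, "w") as f:
        json.dump(ld.to_json(), f)

# Either way, action functions must still be attached, as shown above.
```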
We provide a web app that prints out the contents of each LR(1) item set and their transition tables. On top of that, you can parse input and see the result in real time.
The screenshot shows the parsing result of this expression:
```
(define (even? x)
  (& x 1))
```
The underlying CFG is:
```
START -> expression
expression -> function_call | val
function_call -> "(" id arguments ")"
arguments -> expression arguments | expression
val -> int_const | id
int_const -> r"0|(-?)[1-9][0-9]*"
id -> r".|([a-zA-Z]|_)([a-zA-Z]|[0-9]|_|\?)*"
```
Run the web app locally with:

```sh
flask --app server.py run
```
Currently, the regex engine only supports `*`, `|`, `?`, `+`, `[a-z]`, and `[^x]`. You may use `\` to escape these special chars, including `\` itself.
The core underlying mechanism is range matching: an edge in the DFA can accept multiple ranges of chars.
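As a rough illustration only (the class below is hypothetical, not the repository's actual data structure), an edge carrying multiple character ranges might behave like this:

```python
# Hypothetical sketch of range matching; not the repository's actual API.
from dataclasses import dataclass


@dataclass
class RangeEdge:
    # Inclusive (lo, hi) character ranges, e.g. [("a", "z"), ("0", "9")]
    # for the character class [a-z0-9].
    ranges: list[tuple[str, str]]

    def accepts(self, ch: str) -> bool:
        # The edge accepts a char if it falls within any of its ranges.
        return any(lo <= ch <= hi for lo, hi in self.ranges)


edge = RangeEdge([("a", "z"), ("0", "9")])
assert edge.accepts("k") and edge.accepts("7")
assert not edge.accepts("?")
```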
We use a slight variant of BNF:
- `::=` is replaced by `->`.
- Text surrounded by quotes, e.g. `"xxx"`, represents a terminal. On top of BNF, we introduce `r"..."` to indicate that a terminal should be parsed by the regex engine. `''` means epsilon.
- Everything else that's not wrapped in quotes is considered a non-terminal. If you prefer wrapping them with `<>`, you are free to do so.
- You must use `START` as the entry non-terminal.
Here's an example:
```
START -> E
E -> E "+" T | E "-" T | T
T -> "(" E ")" | int_const
int_const -> r"0|(-?)[1-9][0-9]*"
```
In this CFG, `"+"`, `"-"`, `"("`, `")"`, and `r"0|(-?)[1-9][0-9]*"` are terminals; `START`, `E`, `T`, and `int_const` are non-terminals.
Notice that in plain terminals (`"+"`, `"-"`, `"("`, `")"`), you don't need `\` to alter their meaning. They are all treated as literal chars to be matched.
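As a contrast (an illustrative snippet; the `lparen` production is hypothetical and not part of the examples above), these two productions match the same character, but only the regex form needs the escape:

```
F -> "(" E ")" | int_const
lparen -> r"\("
```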
- Rewrite lexer using DFA instead of NFA-based `re`
  - Impl regex parser that generates NFA
    - `*`
    - `|`
    - `?`
    - `()`
    - `+` (requires impl `Copy` trait for `FiniteAutomata`)
    - `\`
    - `[x-x]`
    - `.`
  - Impl NFA to DFA algo
    - Add tests
  - Impl minimize DFA algo
    - Add tests
    - Write a hash function that encodes the structural information of arbitrary finite automata for auto tests
  - Implement a lexer that's based on DFA
  - Add FA serialization & de-serialization
  - Add accept state id propagation & merge for faster scanning
- Migrate to typed python
- Make parser a module (now it's `LangDef`)
- Optimize parser memory usage by eliminating building an actual parse tree
- Add unit tests for each module
- (Cancelled) Define interfaces for CFG parser & implement other parsers (LL1, etc)
- Improve the `ContextFreeGrammar.from_string` parser, make it support `r"[^ ]*"`
- Further optimize the performance of `LangDef`
- Add parser generator
- Web app optimizations:
  - (Cancelled) Add custom `TypeDefinition` to web app (you don't have to provide a separate type definition file now)
  - Add example `TypeDefinition` and `ContextFreeGrammar` to web app
- Language definition QOL issues:
  - Unify `TypeDefinition` and `ContextFreeGrammar` into one file (CFG section, typedef section). Note: not by dividing into sections, but by integrating literals into the CFG
  - Remove boilerplate definitions in `typedef`, i.e. things that don't need regex power (e.g. `public public`, `( \(`)
  - (Cancelled) Make `ContextFreeGrammar` support BNF notation (or maybe extended BNF?)
    - My personal preference is `->` over `::=`; I also need regex support
    - Adding brackets, like `<non_terminal>`, is supported as long as they don't contain spaces
- Move `simpleSQL` and `simpleJava` to other repositories
  - For `simpleSQL`, implement an operator-based SQL engine in a separate project
  - For `simpleJava`, implement IR passes & maybe generate LLVM IRs in a separate project
    - Control flow graph generation?
    - SSA generation?
    - Partial redundancy elimination?
- (Cancelled) Add CLI support for generating a standalone front end parser (.py) given type definition, CFG definition, and tree actions
  - Already fulfilled by the `LangDef` class