This defines a simple grammar (cmd/ILPLang.g4) and a command-line tool which can be used to lint for problems in dataset formatting.
The target is a linter binary to help point out issues when tokenizing or parsing a dataset.
Example 1: No Errors
When the dataset is well-formatted, the linter prints nothing:
smokes(person1).
friends(person1,person2).
friends(person2,person1).
./linter -tokens -file=examples/pos/pos1.txt
./linter -file=examples/pos/pos1.txt
# (No output for either case)
Example 2: Bad Data
When there is something in the data that cannot be recognized, problems are directed to stderr:
friends(person1,person2).
Bad Data.
./linter -tokens -file=examples/neg/neg1.txt
line 2:0 token recognition error at: 'B'
line 2:3 token recognition error at: ' '
line 2:4 token recognition error at: 'D'
./linter -file=examples/neg/neg1.txt
line 2:0 token recognition error at: 'B'
line 2:3 token recognition error at: ' '
line 2:4 token recognition error at: 'D'
line 2:5 missing '(' at 'ata'
line 2:8 mismatched input '.' expecting {')', ','}
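The messages above all follow a "line L:C message" shape, so a script that wraps the linter could split them into structured fields. The following is a minimal sketch under that assumption; the Diagnostic type and the parseDiagnostic function are hypothetical names and are not part of this repository.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Diagnostic is a hypothetical container for one message reported on stderr.
type Diagnostic struct {
	Line    int
	Column  int
	Message string
}

// diagnosticPattern matches the "line L:C message" shape seen in the examples.
var diagnosticPattern = regexp.MustCompile(`^line (\d+):(\d+) (.+)$`)

// parseDiagnostic splits one stderr line into line number, column, and message.
// The boolean is false when the input does not follow the expected shape.
func parseDiagnostic(s string) (Diagnostic, bool) {
	m := diagnosticPattern.FindStringSubmatch(s)
	if m == nil {
		return Diagnostic{}, false
	}
	line, err := strconv.Atoi(m[1])
	if err != nil {
		return Diagnostic{}, false
	}
	col, err := strconv.Atoi(m[2])
	if err != nil {
		return Diagnostic{}, false
	}
	return Diagnostic{Line: line, Column: col, Message: m[3]}, true
}

func main() {
	d, ok := parseDiagnostic("line 2:5 missing '(' at 'ata'")
	fmt.Println(d.Line, d.Column, d.Message, ok)
}
```

Keeping the message text as a single field avoids guessing at the structure of individual error messages, which may vary.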
Example 3: Regression Examples
The parser also recognizes regressionExample values, which are used in regression datasets.
It does not check whether an entire dataset is correct (regressionExample entries in the file labeled as positive, empty negative examples, and facts), but that check could be accomplished fairly easily elsewhere.
regressionExample(medv(id100),33.2).
regressionExample(medv(id101),27.5).
regressionExample(medv(id10),18.9).
regressionExample(medv(id102),26.5).
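A dataset-level check of the kind mentioned above might look roughly like the sketch below. It assumes that "correct" means every non-empty line of the positive examples is a regressionExample and that the negative examples are empty; the checkRegressionDataset function is a hypothetical helper and is not part of the linter, and it does not inspect the facts.

```go
package main

import (
	"fmt"
	"strings"
)

// checkRegressionDataset is a hypothetical dataset-level check.
// It returns one message for each line that breaks the assumed layout:
// positives contain only regressionExample entries, negatives are empty.
func checkRegressionDataset(pos, neg []string) []string {
	var problems []string
	for i, line := range pos {
		trimmed := strings.TrimSpace(line)
		if trimmed == "" {
			continue // blank lines are ignored in this sketch
		}
		if !strings.HasPrefix(trimmed, "regressionExample(") {
			problems = append(problems,
				fmt.Sprintf("pos line %d: expected a regressionExample", i+1))
		}
	}
	for i, line := range neg {
		if strings.TrimSpace(line) != "" {
			problems = append(problems,
				fmt.Sprintf("neg line %d: negative examples should be empty", i+1))
		}
	}
	return problems
}

func main() {
	pos := []string{
		"regressionExample(medv(id100),33.2).",
		"regressionExample(medv(id101),27.5).",
	}
	var neg []string
	for _, p := range checkRegressionDataset(pos, neg) {
		fmt.Println(p)
	}
}
```

This only looks at the start of each line; whether each line is well-formed is left to the linter itself.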
Precompiled binaries are listed on the GitHub Releases page, and the latest version can be downloaded with these links:
Platform | Link |
---|---|
Linux/amd64 | Download |
macOS/amd64 | Download |
Windows/amd64 | Download |
Building requires a Go compiler.
cd cmd
go build
A copy of the generated ANTLR parser files is committed to the repository; rebuilding them requires the ANTLR parser generator.
make clean
make linter
This grammar is extremely conservative currently: the only tokens allowed are lowercase characters, integers, and underscores.
a(x_1,y_1).
b(x_1).
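The restriction can be approximated with a regular expression for quick checks outside the linter. This is only a sketch: the pattern below is an assumption based on the description above, not a translation of cmd/ILPLang.g4, it covers plain facts only (not regressionExample lines), and looksLikeFact is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// factPattern approximates a plain fact: a predicate name, a comma-separated
// argument list in parentheses, and a closing period. Names and arguments are
// limited to lowercase letters, digits, and underscores (an assumption).
var factPattern = regexp.MustCompile(`^[a-z0-9_]+\([a-z0-9_]+(,[a-z0-9_]+)*\)\.$`)

// looksLikeFact reports whether a single line matches the approximate pattern.
func looksLikeFact(line string) bool {
	return factPattern.MatchString(line)
}

func main() {
	for _, line := range []string{"a(x_1,y_1).", "b(x_1).", "Bad Data."} {
		fmt.Println(line, looksLikeFact(line))
	}
}
```

The linter remains the authority on what the grammar accepts; a pattern like this is only useful as a fast pre-filter.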
- Alexander L. Hayes - Indiana University, Bloomington
Some ideas were taken from the FOPC_MLN_ILP_Parser developed by Jude Shavlik and Trevor Walker (and possibly contributed to by many others who went unnamed in the source code). There are a few versions of their Tokenizers (StreamTokenizerJWS and StreamTokenizerTAW) and Parser currently used in other projects.