astminer supports multiple parsers for various programming languages.
Here we describe the integrated parsers and their peculiarities.
ANTLR provides an infrastructure to generate lexers and parsers for languages based on grammars.
For now, astminer supports ANTLR-based parsers for Java, Python, JS, and PHP.
GumTree is a framework to work with source code as trees and to compute the differences between the trees in different versions of code.
It also builds language-agnostic representations of code.
For now, astminer supports GumTree-based parsers for Java and Python.
Running GumTree with Python requires python-parser.
You can set it up as follows:
- Download the sources from GitHub
- Install the dependencies:
  pip install -r requirements.txt
- Make the python-parser script executable:
  chmod +x src/main/python/pythonparser/pythonparser_3.py
- Add python-parser to PATH (PATH entries are directories, so add the directory that contains the copied script):
  cp src/main/python/pythonparser/pythonparser_3.py src/main/python/pythonparser/pythonparser
  export PATH="<path>/src/main/python/pythonparser:${PATH}"
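Once the steps above are done, a quick check along the following lines can confirm that the copied script is reachable by name. This is only a sketch: it assumes the script was copied under the name pythonparser, as in the last step, and it uses the standard `command -v` lookup rather than anything specific to GumTree.

```shell
# Resolve the name through PATH, the same lookup the shell performs for any command.
parser_path="$(command -v pythonparser || true)"

if [ -n "$parser_path" ] && [ -x "$parser_path" ]; then
    echo "pythonparser resolves to $parser_path"
else
    echo "pythonparser is not reachable through PATH; re-check the steps above"
fi
```

If the second message is printed, the usual causes are a missing executable bit or a PATH that was exported in a different shell session.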
Many of the languages in GumTree are additionally supported through the srcML backend, so astminer uses GumTree with srcML as a separate parser.
Running it requires installing srcML: https://www.srcml.org/
If you have any problems with the installation, check the Dockerfile in the project root.
Originally fuzzyc2cpg, Fuzzy is now part of codepropertygraph.
astminer uses it to parse C/C++ code. g++ is required for this parser.
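The srcML-backed parser and Fuzzy both rely on external tools, so it can help to check for them before running astminer. The sketch below assumes the srcML command-line client is installed under the name srcml and that the compiler is available as g++. The helper `missing_tools` is a hypothetical name introduced here for illustration and is not part of astminer.

```shell
# Print the name of every given tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools srcml g++)"
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "srcml and g++ are available"
fi
```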
JavaParser is a parser for Java that is used to obtain trees for the Code2seq and Code2vec models, and it is also used in many other studies that collect and process trees.
When working with JavaParser, astminer implements an algorithm similar to the one in the JavaExtractor module of the Code2Vec repository in order to get similar trees.
javalang is a Java parser written in pure Python. To work with it, you need to install our translator package, which converts the javalang internal AST into a JSON AST that astminer can understand.
To install this package, run the following in the root of the project:
pip install src/main/python/parse/javalang
Support for a new programming language can be implemented in a few simple steps.
If there is an ANTLR grammar for the language:
- Add the corresponding ANTLR4 grammar file to the antlr directory.
- Run the generateGrammarSource Gradle task to generate the parser.
- Implement a small wrapper around the generated parser.
See JavaParser, AntlrJavaParsingResultFactory, and getParsingResultFactory for an example of building such a wrapper and integrating it in the pipeline.
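The first two ANTLR steps can be scripted roughly as follows. This is a sketch under stated assumptions: `add_grammar` is a hypothetical helper, not something shipped with astminer, and it assumes the Gradle wrapper `./gradlew` is present in the project root (override it through the `GRADLE` variable otherwise). The location of the antlr directory is passed in explicitly because it depends on your checkout.

```shell
# Copy a grammar into the antlr directory and regenerate the parser sources.
# Usage: add_grammar <grammar-file> <antlr-dir>
add_grammar() {
    grammar="$1"
    antlr_dir="$2"
    # Assumption: the Gradle wrapper lives in the current directory.
    gradle_cmd="${GRADLE:-./gradlew}"

    [ -f "$grammar" ] || { echo "grammar not found: $grammar" >&2; return 1; }
    [ -d "$antlr_dir" ] || { echo "antlr directory not found: $antlr_dir" >&2; return 1; }

    cp "$grammar" "$antlr_dir/" || return 1
    # generateGrammarSource is the Gradle task named in the steps above.
    "$gradle_cmd" generateGrammarSource
}
```

For example, `add_grammar MyLanguage.g4 <path-to-antlr-directory>` run from the project root, where `MyLanguage.g4` is a placeholder for the grammar of the language being added. The wrapper around the generated parser still has to be written by hand.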
If the language has a parsing tool that is available as a Java library:
- Add the library as a dependency in build.gradle.kts.
- Implement a wrapper for the parsing tool. See FuzzyCppParser for an example of such a wrapper.