Lumberjack

Exploring automated BPE-like approaches to reduce the size of parse-trees

The idea behind Lumberjack

In machine learning for software engineering (ML4SE), we often deal with large vocabularies of code tokens (10,000+ or even 100,000+). Meanwhile, the vocabulary of node types in ASTs or parse trees is much smaller: 100-200 for most programming languages. This has two consequences:

Embeddings of node types (which we use in many ML models) are not very meaningful
Trees are quite large (since simple operations like assigning to a variable require a whole subtree of 5-6 nodes), and models struggle to learn from them

This leads to the following idea: what if we could automatically detect frequently occurring code constructs in ASTs and compress them into new nodes? In NLP, people do something similar with BPE-like (byte pair encoding) approaches: they start from a vocabulary of individual characters and iteratively merge the most frequent adjacent pairs into larger subword units.
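To make the NLP analogy concrete, here is a minimal sketch of character-level BPE in Python. It is a textbook-style illustration, not code from Lumberjack: the function name `bpe_merges` and the tie-breaking rule are assumptions chosen for this example.

```python
from collections import Counter


def bpe_merges(words, num_merges):
    """Learn up to `num_merges` pair merges, starting from single characters.

    Hypothetical helper for illustration only.
    """
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

Each merge adds one symbol to the vocabulary and shortens the sequences that contain the merged pair. Lumberjack applies the same trade-off to trees instead of strings.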

What Lumberjack can do

Lumberjack is currently a prototype tool that can:

Parse datasets in Java and Python
Compress the most frequent edge into a single new node, repeated K times, to get smaller trees and a larger node type vocabulary
Handle up to millions of files
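The edge compression step above could look roughly like the following Python sketch. It is not Lumberjack's implementation (the tool itself is written in Kotlin), and several details are assumptions made for illustration: the `Node` class, the `Parent|Child` naming of merged types, merging only the first matching child of each node, and the tie-breaking rule.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a type label plus ordered children."""
    type: str
    children: list = field(default_factory=list)


def count_edges(trees):
    """Count (parent type, child type) pairs over all trees."""
    counts = Counter()
    stack = list(trees)
    while stack:
        node = stack.pop()
        for child in node.children:
            counts[(node.type, child.type)] += 1
            stack.append(child)
    return counts


def merge_edge(node, edge):
    """Return a copy of the tree with `edge` collapsed into single nodes.

    Assumption: only the first matching child of a node is absorbed;
    its children are spliced into its former position.
    """
    parent_type, child_type = edge
    new_type, children = node.type, node.children
    if node.type == parent_type:
        for i, child in enumerate(children):
            if child.type == child_type:
                new_type = f"{parent_type}|{child_type}"
                children = children[:i] + child.children + children[i + 1:]
                break
    return Node(new_type, [merge_edge(c, edge) for c in children])


def compress(trees, k):
    """Merge the most frequent edge, k times in a row."""
    merges = []
    for _ in range(k):
        counts = count_edges(trees)
        if not counts:
            break
        # Ties are broken lexicographically (arbitrary, for determinism).
        edge = max(counts, key=lambda e: (counts[e], e))
        merges.append(edge)
        trees = [merge_edge(t, edge) for t in trees]
    return trees, merges


def size(node):
    """Number of nodes in a tree."""
    return 1 + sum(size(c) for c in node.children)


trees = [
    Node("Assign", [Node("Name"), Node("Call", [Node("Name")])]),
    Node("Return", [Node("Call", [Node("Name")])]),
    Node("Expr", [Node("Call", [Node("Name")])]),
]
compressed, merges = compress(trees, k=1)
print(merges, sum(size(t) for t in trees), sum(size(t) for t in compressed))
```

In this toy input, the edge from `Call` to `Name` occurs three times, so it is merged first: the three trees shrink from 10 nodes to 7, and the vocabulary gains one node type, `Call|Name`. The recursive helpers are kept for brevity; very deep trees would need an iterative traversal.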
