If you're reading this, there's a chance you'd like to contribute to Elastiknn. Very nice!
You need at least the following software installed: git, Java 14, Python 3, Docker, docker-compose, and task. I'm assuming you're running Linux or macOS; I have no idea if any of this will work on Windows. This list may be incomplete. If you find something missing, please submit an issue or PR.
Once you have the prerequisites installed, clone the project and run:
task jvm:run:gradle
This starts a local instance of Elasticsearch with the plugin installed. It can take about five minutes the first time you run it.
Once you see "EXECUTING", open another shell and run `curl localhost:9200`.
You should see the usual Elasticsearch JSON response containing the version, cluster name, etc.
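
If you'd rather check the response from a script than read it by eye, something like the following sketch might help. It assumes the server is listening on `http://localhost:9200` without authentication, and it only reads the `name`, `cluster_name`, and `version.number` fields of the standard root response. The function names are made up for this example and are not part of Elastiknn.

```python
import json
import urllib.request


def summarize_root_response(body: str) -> dict:
    """Pick out the fields most useful for a sanity check from the root endpoint's JSON."""
    info = json.loads(body)
    return {
        "node_name": info.get("name"),
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_local_elasticsearch(url: str = "http://localhost:9200") -> dict:
    """Fetch the root endpoint (the same request as `curl localhost:9200`) and summarize it.

    The default URL is an assumption matching the local setup described above.
    """
    with urllib.request.urlopen(url, timeout=5) as response:
        return summarize_root_response(response.read().decode("utf-8"))


# Usage, once the server is running:
#   print(check_local_elasticsearch())
```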
Elastiknn currently consists of several subprojects managed by Task and Gradle:
- client-python - Python client.
- elastiknn-api4s - Gradle project containing Scala case classes that model the Elastiknn API.
- elastiknn-benchmarks - Gradle project containing Scala code and infrastructure for benchmarking.
- elastiknn-client-elastic4s - Gradle project containing a Scala client based on Elastic4s.
- elastiknn-lucene - Gradle project containing custom Lucene queries implemented in Java.
- elastiknn-models - Gradle project containing custom similarity models implemented in Java.
- elastiknn-plugin - Gradle project containing the actual plugin implementation.
- elastiknn-testing - Gradle project containing Scala tests for all of the other Gradle subprojects.
The `lucene` and `models` subprojects are implemented in Java for a few reasons:
- It makes it easier to ask questions on the Lucene issue tracker and mailing list.
- They are the most CPU-bound parts of the codebase. While Scala's abstractions are nicer than Java's, they sometimes have a surprising performance cost (e.g., boxing).
- It makes them more likely to be useful to other JVM developers. In particular, the `models` project can be used to hash vectors and compute similarities in any JVM app.
Gradle manages the plugin and all of the JVM (i.e. Java and Scala) subprojects.
Task is used to define command aliases with simple dependencies. This makes it relatively easy to run tests, generate docs, publish artifacts, etc. all from one file.
I recommend using IntelliJ IDEA to work on the Gradle projects and PyCharm to work on the client-python project.
IntelliJ should immediately recognize the Gradle project when you open the `elastiknn` directory.
PyCharm can be a bit of a different story.
You should first create a virtual environment in `client-python/venv`.
You can do this by running `task py:venv`. Even if the tests fail, it will still create the virtual environment.
Then you should set up PyCharm to use the interpreter in `client-python/venv`.
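
To confirm that PyCharm really picked up that interpreter, you could paste a small check like this into PyCharm's Python console. This is just a sketch: the helper names are invented for illustration, and the relative path assumes the console's working directory is the repository root.

```python
import sys
from pathlib import Path


def running_in_any_venv() -> bool:
    """Inside a virtual environment, sys.prefix differs from the base installation's prefix."""
    return sys.prefix != sys.base_prefix


def interpreter_is_in(venv_dir: str) -> bool:
    """Return True if the running interpreter's environment root is venv_dir."""
    return Path(sys.prefix).resolve() == Path(venv_dir).resolve()


# Usage from the repository root (path taken from the instructions above):
#   print(running_in_any_venv(), interpreter_is_in("client-python/venv"))
```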
Elastiknn has a fairly thorough test suite.
To run it, you'll first need to run `task cluster:run` or `task jvm:run:gradle` to start a local Elasticsearch server.
Then, run `task jvm:test` to run the Gradle test suite, or `task py:test` to run the smaller Python test suite.
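
Since the server can take a while to come up, it may be convenient to wait for it before kicking off the tests. Here is a minimal sketch of that idea. It assumes the server answers on `http://localhost:9200` and that `task` is on your `PATH`; the function names, retry count, and delay are arbitrary choices for this example, not anything defined by the project.

```python
import subprocess
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_up(url: str = "http://localhost:9200") -> bool:
    """Return True if the root endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe: Callable[[], bool], attempts: int = 60, delay_seconds: float = 5.0) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


def run_suite(task_name: str = "jvm:test") -> int:
    """Run `task <task_name>` once the server responds and return its exit code."""
    if not wait_until(server_is_up):
        raise RuntimeError("Elasticsearch did not respond on localhost:9200")
    return subprocess.run(["task", task_name]).returncode


# Usage, with the server starting in another shell:
#   run_suite("jvm:test")   # or run_suite("py:test")
```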
You can attach IntelliJ's debugger to a local Elasticsearch process. This can be immensely helpful when dealing with bugs or just figuring out how the code is structured.
First, open your project in IntelliJ and run the `Debug Elasticsearch` target (usually in the upper right corner).
Then just run `task jvm:run:debug` in your terminal.
Now you should be able to set and hit breakpoints in IntelliJ.
Use `task cluster:run` to run a local cluster with one master node and one data node (using docker-compose).
There are a couple of parts of the codebase that deal with serializing queries for use in a distributed environment.
Running this small local cluster exercises those code paths.
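
To verify that both nodes actually joined the cluster, you can ask Elasticsearch's `_cat/nodes` API. The sketch below assumes the cluster is reachable on `http://localhost:9200` and relies on the `name`, `node.role`, and `master` columns of the JSON output. The exact role letters vary between Elasticsearch versions, so treat them as informational. The function names are hypothetical.

```python
import json
import urllib.request


def describe_nodes(cat_nodes_json: str) -> list:
    """Turn the JSON from GET /_cat/nodes?format=json into
    (name, role letters, is_elected_master) tuples."""
    rows = json.loads(cat_nodes_json)
    return [
        # The `master` column is "*" for the elected master and "-" otherwise.
        (row.get("name"), row.get("node.role"), row.get("master") == "*")
        for row in rows
    ]


def fetch_nodes(base_url: str = "http://localhost:9200") -> list:
    """Query the running cluster; the default URL is an assumption about the local setup."""
    url = base_url + "/_cat/nodes?format=json"
    with urllib.request.urlopen(url, timeout=5) as response:
        return describe_nodes(response.read().decode("utf-8"))


# Usage, once `task cluster:run` has finished starting:
#   for node in fetch_nodes():
#       print(node)
```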
TODO
Nearest neighbors search is a large topic. Some good places to start are:
- Chapter 3 of Mining of Massive Datasets by Leskovec et al.
- Lectures 13-20 of this lecture series from IIT Kharagpur
- Assignment 1 of Stanford's CS231n course
- This work-in-progress literature review of nearest neighbor search methods related to Elasticsearch
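
For orientation before diving into those resources, here is the simplest possible baseline: exact nearest neighbor search by brute force with L2 distance. This is only an illustrative sketch in plain Python with made-up names. It is not how Elastiknn implements search; it just shows the problem that the approximate methods covered in the material above try to solve more cheaply.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_nearest_neighbors(
    query: Sequence[float], corpus: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Compare the query against every vector and return the k closest
    as (index, distance) pairs, nearest first. Cost grows linearly with corpus size."""
    scored = [(i, l2_distance(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Usage:
#   exact_nearest_neighbors([0.9, 0.9], [[0, 0], [1, 1], [5, 5]], k=2)
```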