This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

faiss interface refactoring to support multiple methods #344

@jmazanec15 jmazanec15 commented Apr 20, 2021

Issue #, if available:
#225

Description of changes:
This PR refactors the interface of the current faiss-support branch to support several additional features, including:

  1. IVF index type - a cell-probe method that reduces the search space by partitioning vectors with a k-means clustering algorithm. It takes "ncentroids" and "nprobes" as parameters
  2. Product quantization - a method that encodes vectors to reduce their size. It takes "code_size" as a parameter
  3. Composite indices - the ability to combine different faiss features into a single index

The interface looks like:

{
   "my_vector":{
      "type":"knn_vector",
      "dimension":4,
      "method":{
         "name":"ivf",
         "engine":"faiss",
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
               "ncentroids":15
            }
         },
         "encoder":{
            "name":"pq",
            "parameters":{
               "code_size":8
            }
         },
         "parameters":{
            "ncentroids":128
         }
      }
   }
}
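As a rough illustration of how a nested "method" object like the one above could be held in memory, here is a minimal sketch. The class names `MethodComponentSketch` and `MethodContextSketch` are hypothetical and are not the PR's KNNMethodContext; treating "parameters", "coarse_quantizer" and "encoder" as optional is also an assumption.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical holder for one named component ("ivf", "pq", ...) and its parameters.
final class MethodComponentSketch {
    final String name;
    final Map<String, Object> parameters;

    MethodComponentSketch(String name, Map<String, Object> parameters) {
        this.name = name;
        this.parameters = Collections.unmodifiableMap(parameters);
    }

    // Parses {"name": ..., "parameters": {...}}; "parameters" is assumed optional.
    @SuppressWarnings("unchecked")
    static MethodComponentSketch parse(Object raw) {
        if (!(raw instanceof Map)) {
            throw new IllegalArgumentException("component must be an object");
        }
        Map<String, Object> map = (Map<String, Object>) raw;
        Object name = map.get("name");
        if (!(name instanceof String)) {
            throw new IllegalArgumentException("component requires a string \"name\"");
        }
        Object params = map.get("parameters");
        if (params == null) {
            params = Collections.emptyMap();
        } else if (!(params instanceof Map)) {
            throw new IllegalArgumentException("\"parameters\" must be an object");
        }
        return new MethodComponentSketch((String) name, (Map<String, Object>) params);
    }
}

// Hypothetical holder for the whole "method" object from the mapping.
final class MethodContextSketch {
    final String engine;
    final MethodComponentSketch method;          // top-level "name" and "parameters"
    final MethodComponentSketch coarseQuantizer; // null when not configured
    final MethodComponentSketch encoder;         // null when not configured

    private MethodContextSketch(String engine, MethodComponentSketch method,
                                MethodComponentSketch coarseQuantizer,
                                MethodComponentSketch encoder) {
        this.engine = engine;
        this.method = method;
        this.coarseQuantizer = coarseQuantizer;
        this.encoder = encoder;
    }

    static MethodContextSketch parse(Map<String, Object> raw) {
        Object engine = raw.get("engine");
        if (!(engine instanceof String)) {
            throw new IllegalArgumentException("method requires a string \"engine\"");
        }
        Object cq = raw.get("coarse_quantizer");
        Object enc = raw.get("encoder");
        return new MethodContextSketch(
            (String) engine,
            MethodComponentSketch.parse(raw),
            cq == null ? null : MethodComponentSketch.parse(cq),
            enc == null ? null : MethodComponentSketch.parse(enc));
    }
}
```

Keeping every component in the same name-plus-parameters shape means a composite index is just a context with more than one component filled in.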

The main logic where the interface has been refactored can be found in:

  1. KNNVectorFieldMapper - where the user-provided method is parsed for use by the plugin
  2. KNNMethodContext - stored structure of the user provided method configuration
  3. KNNMethod - structure of a given method supported by a particular engine
  4. KNNLibrary - interface for a particular library. Includes implementations for nmslib and faiss
  5. KNNEngine - enum mapping name to KNNLibrary
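The relationship between the last two items can be pictured with a short sketch: an enum whose constants each bind an engine name to a library object. `LibrarySketch` and `EngineSketch` are hypothetical stand-ins, not the PR's KNNLibrary or KNNEngine, and the method lists per engine are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical stand-in for a library interface: one implementation per native library.
interface LibrarySketch {
    Set<String> supportedMethods();
}

// Hypothetical stand-in for an engine enum: each constant maps a name to a library.
enum EngineSketch {
    // Assumption: the method names listed per engine are illustrative only.
    NMSLIB("nmslib", () -> Set.of("hnsw")),
    FAISS("faiss", () -> Set.of("hnsw", "ivf"));

    private final String engineName;
    private final LibrarySketch library;

    EngineSketch(String engineName, LibrarySketch library) {
        this.engineName = engineName;
        this.library = library;
    }

    // Resolves the "engine" string from a mapping to an enum constant.
    static EngineSketch fromName(String name) {
        return Arrays.stream(values())
            .filter(e -> e.engineName.equalsIgnoreCase(name))
            .findFirst()
            .orElseThrow(() -> new IllegalArgumentException("Unsupported engine: " + name));
    }

    // Delegates the capability check to the library behind this engine.
    boolean supports(String method) {
        return library.supportedMethods().contains(method);
    }
}
```

With this shape, `EngineSketch.fromName("faiss").supports("ivf")` answers whether a requested method is valid for the chosen engine without the caller knowing which library sits behind it.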

A lot of code was changed in order to support these additional features:

  1. Because we use faiss's index factory, only some of the parameters can be configured through the index factory description string. To support additional parameters (for example, ef_construction for HNSW), this PR adds functionality to pass an extra parameter map to the JNI layer, where it is parsed.
  2. Because IVF and PQ require training, this PR implements a training approach in the JNI save index function where a subset of the data to be indexed is used for training. This is inherently inefficient because each segment must be trained before data can be added to it. In the future, we will introduce a train API that trains before indexing to work around this.
  3. Several other minor changes to make the refactor cleaner and easier
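The split described in the first item, between what goes into the description string and what travels in the extra map, might look roughly like the sketch below. Description strings of the form `IVF<n>,PQ<m>` and `IVF<n>,Flat` are accepted by faiss's index factory, but the class `IndexDescriptionSketch`, the choice of which keys are expressible in the string, and the mapping of "code_size" onto the `PQ` number are all assumptions, not the PR's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class IndexDescriptionSketch {
    // Assumption: only these keys are folded into the description string.
    private static final Set<String> IN_DESCRIPTION = Set.of("ncentroids", "code_size");

    // Builds a factory-style description such as "IVF15,PQ8" or "IVF128,Flat".
    static String describe(Map<String, Object> params) {
        Object ncentroids = params.get("ncentroids");
        if (ncentroids == null) {
            throw new IllegalArgumentException("ncentroids is required for IVF");
        }
        Object codeSize = params.get("code_size");
        // Assumption: without an encoder the vectors are stored unencoded ("Flat").
        String encoding = codeSize == null ? "Flat" : "PQ" + codeSize;
        return "IVF" + ncentroids + "," + encoding;
    }

    // Collects everything the description cannot carry (e.g. "nprobes") so it
    // can be handed to the native layer as a separate map.
    static Map<String, Object> extraParameters(Map<String, Object> params) {
        Map<String, Object> extra = new LinkedHashMap<>();
        params.forEach((key, value) -> {
            if (!IN_DESCRIPTION.contains(key)) {
                extra.put(key, value);
            }
        });
        return extra;
    }
}
```

For a parameter map containing `ncentroids=15`, `code_size=8` and `nprobes=4`, `describe` yields `IVF15,PQ8` and `extraParameters` yields a map holding only `nprobes`.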

Testing
For testing, this PR focuses on adding tests that exercise the interface, as opposed to end-to-end tests of each JNI library's functionality, because that functionality will change in the future. Right now, it is just a placeholder to get the interface working. That being said, the following test refactoring was done:

  1. Added unit tests for the faiss interface
  2. Refactored old tests so that the Gradle build passes

Future Development

  1. Introduce a training API
  2. Add additional end-to-end tests
  3. Investigate storing data exclusively with faiss (as opposed to storing vectors in doc values in Lucene)

Notes
We are in the process of migrating from ODFE to OpenSearch. Included in this will be porting over the faiss-support branch to OpenSearch. Because porting requires significant refactoring, we will merge this PR and then port the faiss-support branch to OpenSearch.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

* @return length of the file in kilobytes
*/
public static long getFileSizeInKB(String filePath) {
if (filePath == null || filePath.isEmpty()) {
Member

This will not differentiate between an empty file and an invalid or null file path. Is this intended?

Member Author

So I guess it would say an empty file has a size of 1 KB, whereas a non-existent file has a size of 0.
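To make the distinction in this thread concrete, here is a minimal sketch. The rest of the method body is not shown in the excerpt, so the `length / 1024 + 1` rounding is an assumption chosen only because it reproduces the behavior described in the reply; `FileSizeSketch` and `findFileSizeInKB` are hypothetical names.

```java
import java.io.File;
import java.util.OptionalLong;

final class FileSizeSketch {
    // Mirrors the described behavior: 0 for a null, empty, or non-existent path,
    // and at least 1 for any existing file (assumed rounding: length / 1024 + 1).
    static long getFileSizeInKB(String filePath) {
        if (filePath == null || filePath.isEmpty()) {
            return 0;
        }
        File file = new File(filePath);
        if (!file.exists() || !file.isFile()) {
            return 0;
        }
        return file.length() / 1024 + 1;
    }

    // Hypothetical variant that keeps "no such file" separate from any size,
    // for callers that need to tell the two cases apart explicitly.
    static OptionalLong findFileSizeInKB(String filePath) {
        if (filePath == null || filePath.isEmpty()) {
            return OptionalLong.empty();
        }
        File file = new File(filePath);
        if (!file.exists() || !file.isFile()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(file.length() / 1024 + 1);
    }
}
```

Under the assumed rounding, a return value of 0 can only mean an invalid or missing path, so the two cases are distinguishable, although only implicitly; the optional-returning variant makes that explicit.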

@jmazanec15
Member Author

Closing PR now. Will continue work on OpenSearch repo.

@jmazanec15 jmazanec15 closed this May 18, 2021
Labels
Features New functionality added