Skip to content

ozgurdemir/golsh

Repository files navigation

A simple go library for approximate nearest neighbours (ANN).

Background

In many computational problems such as NLP, Recommendation Systems and Search, items (e.g. words) are represented as vectors in a multidimensional space. Then given a specific item it's nearest neighbours need to be find e.g. given a query find the most similar ones. A naive liner scan over the data set might be too slow for most data sets.

Hence, more efficient algorithms are needed. One of the most widely used approaches is Locality Sensitive Hashing (LSH). This family of algorithms are very fast but might not give the exact solution and are hence called approximate nearest neighbours (ANN). The trade off between accuracy and speed is set via parameters of the algorithm.

GoLsh

is a library that finds approximate nearest neighbours based on the cosine distance between two vectors (Wikipedia). It is meant to be used as an online algorithm. That is whenever queries are comming in by user requests.

Algorithm

the basic idea of the algorithm is very simple. Find a hash encodind for every vector such that similar vectors have the same hash value. Hence, finding similar items boils down to finding vectors with the same hash value which can be done very efficiently using a hash table.

The hashes are generated by randomly splitting the space using d hyperplanes. For every hyperplane it is determined if the current vector lies to the left (0) or to the right (1). Doing so every vector can be represented as a d dimensional bit hashes. Vectors lying nearby will most probably fall into the same hash representation but still a single bit flip may lead to neighbours not found. Hence, the above step is repeated numEmbedding times to be more accurate.

The algorithm proceeds as follows:

  1. Init the lsh: for every vector in the corpus generate an embedding by using d randomly generated hyperplanes.
    • the number of hyperplanes used is defined by d
    • the larger d the less vectors have to be searched
    • however if this number is set too high collisions will rarely happen and no less results will be found
  2. in order to increase the recall (number of found neighbours) step 1. is repeated numEmbedding times. After this step every vector in the corpus is represented as numEmedding bitstrings of length d.

Usage

the library is initialized as follows:

golsh.NewLsh(vectors *map[int][]float32, numEmbeddings int, d int) Lsh

vectors is a simple go map from an user defined int (id) to the input vectors to be searched against. This function will return an golsh.Lsh object which is used for all subsequent operations. The two parameters numEmbeddings and d controll the trade off between speed and accuracy (see above).

Get vector based on id

lsh.Vector(id int) ([]float32, bool)

given an id will return the vector stored. This is just a convinience function and may be used if the vector to search with is part of the corpus itself.

Find approximate nearest neighbours

lsh.Ann(vector []float32, k int, threshold float32) ([]Hit, int, error)

vector the vector to search nearest neighbours for. k max number of neighbours returned. threshold min cosine similarity that a neighbour needs to have. This parameter is used to filter false positives.

Result

the result is of type []golsh.Hit where Hit consists of:

type Hit struct {
	ID     int
	Vector *[]float32
	Cosine float32
}

where ID is the id of the result vector. Vector is the result vector itself and Cosine is the exact cosine distance between the query and this result vector. The result array is sorted by similarity that is most similar vector is pos 1.

About

Approximate Nearest Neighbors in Go

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages