Skip to content

Parallel LSH based high dimensional clustering for documents

License

Notifications You must be signed in to change notification settings

englehardt/pLSH-HDC

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pLSH-HDC : Parallel Locality-Sensitive Hashing based High Dimensional Clustering

Based on Rajamaran, "Mining of Massive Datasets" - Section 3.4

A parallel implementation of LSH for High Dimensional Clustering.

  • Documents or sets are represented by a MinHash signature.
  • LSH is used to map similar signatures to similar bins.
  • Items which map to the same bin are considered candidate pairs for clustering.
  • A constraint function (currently Levenshtein distance) is applied to candidate pairs.
  • Items which satisfy constraint function are clustered via UnionFind.

Signatures can be pre-computed (in parallel) and stored using the MinHasher. Clusters should be built from MinHash signatures. Constraint checking currently uses the Levenshtein distance of the actual documents stored in a LevelDB database via a JSON interface.

Summary of changes:

  • Updated to use murmur3 hashing (for signatures and LSH)
  • Unicode support
  • De-coupled clustering from signature creation to allow parallel and pre-computation of signatures
  • Ability to dump/load signer state to disk
  • Constraint function checking for candidate pairs
  • Native parallel processing for constraint checks
  • Methods to help serialize cluster state to disk

Requires the C/C++ based pyhash, python-Levenshtein, and leveldb libraries. These can be installed via pip:

pip install pyhash python-Levenshtein leveldb

TODO: Remove LevelDB dependency, improve generality of constraint checking, update tests.

About

Parallel LSH based high dimensional clustering for documents

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%