Clusters a set of WordNet synsets using k-means. The WordNet pair-wise distance (semantic relatedness) between word senses, computed with the edge-counting method of Wu & Palmer (1994), is mapped to Euclidean distance so that k-means can converge while preserving the original pair-wise relationships.
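
As a rough illustration of the Wu & Palmer measure, the sketch below computes it on a tiny hand-made hierarchy rather than on WordNet itself. The `PARENT` table and the function names are hypothetical, and `1 - similarity` is only an assumed starting point for a distance; how the script actually maps these values into Euclidean space is not shown here.

```python
# Toy is-a hierarchy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wup_similarity(a, b):
    """Wu & Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common subsumer.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def wup_distance(a, b):
    """Assumed conversion of a similarity in (0, 1] to a dissimilarity in [0, 1)."""
    return 1.0 - wup_similarity(a, b)
```

In this toy tree, "dog" and "cat" share the ancestor "animal" and so score higher than "dog" and "car", which only meet at the root.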
Setting use_wordnet = False (instead of True) switches the distance metric between words to the word2vec similarity value computed from a GloVe model, glove.6B.300d_word2vec.txt (this file must be in the word2vec format).
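
For context, the word2vec text format starts with a header line giving the vocabulary size and vector dimension, which GloVe's native text files lack, and word2vec-style similarity is the cosine of two word vectors. The following is a minimal sketch of both ideas on made-up two-dimensional vectors; `parse_word2vec_text`, `cosine_similarity` and `toy_model` are illustrative names, not part of this repository.

```python
import math


def parse_word2vec_text(lines):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vN' rows."""
    rows = iter(lines)
    _count, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up vectors, only to show the file layout (header first, then one word per row).
toy_model = ["3 2", "dog 1.0 0.0", "puppy 0.8 0.6", "car 0.0 1.0"]
vectors = parse_word2vec_text(toy_model)
dog_puppy = cosine_similarity(vectors["dog"], vectors["puppy"])
dog_car = cosine_similarity(vectors["dog"], vectors["car"])
```

A real 300-dimensional model would normally be loaded with a vector library rather than parsed by hand; the parser here only makes the expected layout explicit.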
The extras folder contains proof-of-concept code and experiments.
- Create a newline-delimited file with a list of WordNet senses (e.g. data/example_tags.txt).
- To use WordNet, set use_wordnet=True; to use word2vec, set use_wordnet=False.
python generate-tag-clusters.py data/example_tags.txt 25 0.7
- 25 is the number of clusters to segment the list of WordNet senses into.
- 0.7 is the similarity threshold; words with a similarity below this value are considered not similar.
- Results are placed in the results folder as a JSON file.
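
The input and output steps above can be sketched as follows. This is not the script's actual code: `read_tags`, `write_clusters`, the output file name `clusters.json`, the sense names such as `dog.n.01` and the shape of the JSON are all assumptions made for illustration, and the demo writes to a temporary directory instead of the real data and results folders.

```python
import json
import os
import tempfile


def read_tags(path):
    """Read a newline-delimited file, dropping blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_clusters(clusters, out_dir, name="clusters.json"):
    """Write a {cluster_id: [senses]} mapping as JSON and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, name)
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(clusters, handle, indent=2)
    return out_path


# Demo in a temporary directory with a hypothetical three-sense input file.
workdir = tempfile.mkdtemp()
tags_path = os.path.join(workdir, "example_tags.txt")
with open(tags_path, "w", encoding="utf-8") as handle:
    handle.write("dog.n.01\ncat.n.01\n\ncar.n.01\n")

tags = read_tags(tags_path)
# Hand-assigned clusters, standing in for the k-means output.
result_path = write_clusters(
    {"0": tags[:2], "1": tags[2:]}, os.path.join(workdir, "results")
)
```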