Commit
Samuel Smith authored and committed Jun 6, 2017
1 parent 2153484, commit ee6e252
Showing 1 changed file with 206 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,206 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# So, show me how to align two vector spaces for myself!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"No problem. We're going to run through the example given in the README again, and show you how to learn your own transformation to align the French vector space to the Russian vector space.\n", | ||
"\n", | ||
"First, let's define a few simple functions..." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import numpy as np\n", | ||
"from fasttext import FastVector\n", | ||
"\n", | ||
"# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy\n", | ||
"def normalized(a, axis=-1, order=2):\n", | ||
"    \"\"\"Utility function to normalize the rows of a numpy array.\"\"\"\n", | ||
"    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))\n", | ||
"    l2[l2==0] = 1\n", | ||
"    return a / np.expand_dims(l2, axis)\n", | ||
"\n", | ||
"def make_training_matrices(source_dictionary, target_dictionary, bilingual_dictionary):\n", | ||
"    \"\"\"source and target dictionaries are the FastVector objects of source/target languages.\n", | ||
"    bilingual_dictionary is a list of translation pair tuples [(source_word, target_word), ...].\"\"\"\n", | ||
"    source_matrix = []\n", | ||
"    target_matrix = []\n", | ||
"\n", | ||
"    for (source, target) in bilingual_dictionary:\n", | ||
"        if source in source_dictionary and target in target_dictionary:\n", | ||
"            source_matrix.append(source_dictionary[source])\n", | ||
"            target_matrix.append(target_dictionary[target])\n", | ||
"\n", | ||
"    # return training matrices\n", | ||
"    return np.array(source_matrix), np.array(target_matrix)\n", | ||
"\n", | ||
"def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):\n", | ||
"    \"\"\"source and target matrices are numpy arrays, shape (dictionary_length, embedding_dimension).\n", | ||
"    These contain paired word vectors from the bilingual dictionary.\"\"\"\n", | ||
"    # optionally normalize the training vectors\n", | ||
"    if normalize_vectors:\n", | ||
"        source_matrix = normalized(source_matrix)\n", | ||
"        target_matrix = normalized(target_matrix)\n", | ||
"\n", | ||
"    # perform the SVD; np.linalg.svd returns the right factor already transposed\n", | ||
"    product = np.matmul(source_matrix.transpose(), target_matrix)\n", | ||
"    U, s, Vt = np.linalg.svd(product)\n", | ||
"\n", | ||
"    # return orthogonal transformation which aligns source language to the target\n", | ||
"    return np.matmul(U, Vt)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we load the French and Russian word vectors, and evaluate the similarity of \"chat\" and \"кот\":" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"reading word vectors from wiki.fr.vec\n", | ||
"reading word vectors from wiki.ru.vec\n", | ||
"0.0238620125217\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"fr_dictionary = FastVector(vector_file='wiki.fr.vec')\n", | ||
"ru_dictionary = FastVector(vector_file='wiki.ru.vec')\n", | ||
"\n", | ||
"fr_vector = fr_dictionary[\"chat\"]\n", | ||
"ru_vector = ru_dictionary[\"кот\"]\n", | ||
"print(FastVector.cosine_similarity(fr_vector, ru_vector))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\"chat\" and \"кот\" both mean \"cat\", so they should be highly similar; clearly the two word vector spaces are not yet aligned. To align them, we need a bilingual dictionary of French and Russian translation pairs. As it happens, this is a great opportunity to show you something truly amazing...\n", | ||
"\n", | ||
"Many words appear in the vocabularies of more than one language, for example \"alberto\", \"london\" and \"presse\". These words usually mean similar things in each language, so we can form a bilingual dictionary simply by extracting every word that appears in both the French and Russian vocabularies." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"ru_words = set(ru_dictionary.word2id.keys())\n", | ||
"fr_words = set(fr_dictionary.word2id.keys())\n", | ||
"overlap = list(ru_words & fr_words)\n", | ||
"bilingual_dictionary = [(entry, entry) for entry in overlap]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's align the French vectors to the Russian vectors, using only this \"free\" dictionary that we acquired without any bilingual expert knowledge." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# form the training matrices\n", | ||
"source_matrix, target_matrix = make_training_matrices(\n", | ||
" fr_dictionary, ru_dictionary, bilingual_dictionary)\n", | ||
"\n", | ||
"# learn and apply the transformation\n", | ||
"transform = learn_transformation(source_matrix, target_matrix)\n", | ||
"fr_dictionary.apply_transform(transform)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, we re-evaluate the similarity of \"chat\" and \"кот\":" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"0.377368048895\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"fr_vector = fr_dictionary[\"chat\"]\n", | ||
"ru_vector = ru_dictionary[\"кот\"]\n", | ||
"print(FastVector.cosine_similarity(fr_vector, ru_vector))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\"chat\" and \"кот\" are pretty similar after all :)\n", | ||
"\n", | ||
"Use this simple \"identical strings\" trick to align other language pairs for yourself, or prepare your own expert bilingual dictionaries for optimal performance." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.5.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 1 | ||
} |
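
The SVD step in `learn_transformation` can be sanity-checked without downloading any word vectors. The sketch below is not part of the commit: it assumes synthetic data (random `source` vectors and a made-up orthogonal matrix `true_rotation`, both hypothetical names) and leaves out the optional row normalization. If the target vectors are an exact rotation of the source vectors, the learned transformation should recover that rotation.

```python
import numpy as np


def learn_transformation(source_matrix, target_matrix):
    """Orthogonal W that best maps source_matrix @ W onto target_matrix."""
    product = source_matrix.T @ target_matrix
    # np.linalg.svd returns the right singular vectors already transposed
    U, _, Vt = np.linalg.svd(product)
    return U @ Vt


# Synthetic stand-ins for paired word vectors (assumption: 200 pairs, 5 dims)
rng = np.random.default_rng(0)
dim, n_pairs = 5, 200
source = rng.standard_normal((n_pairs, dim))

# A random orthogonal matrix, taken from the QR decomposition of a random matrix
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
target = source @ true_rotation

W = learn_transformation(source, target)

# W should match the hidden rotation and be orthogonal
recovered = np.allclose(W, true_rotation)
orthogonal = np.allclose(W @ W.T, np.eye(dim))
print(recovered, orthogonal)
```

With real embeddings the two spaces are only approximately related by a rotation, so the aligned similarity improves without reaching 1, as the "chat"/"кот" result in the notebook shows.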