example_alignment_notebook

babylonhealth · Jun 6, 2017 · ee6e252
1 parent 2153484
Showing 1 changed file with 206 additions and 0 deletions.
diff --git a/align_your_own.ipynb b/align_your_own.ipynb
@@ -0,0 +1,206 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# So, show me how to align two vector spaces for myself!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "No problem. We're going to run through the example given in the README again, and show you how to learn your own transformation to align the French vector space to the Russian vector space.\n",
+    "\n",
+    "First, let's define a few simple functions..."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from fasttext import FastVector\n",
+    "\n",
+    "# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy\n",
+    "def normalized(a, axis=-1, order=2):\n",
+    "    \"\"\"Utility function to normalize the rows of a numpy array.\"\"\"\n",
+    "    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))\n",
+    "    l2[l2==0] = 1\n",
+    "    return a / np.expand_dims(l2, axis)\n",
+    "\n",
+    "def make_training_matrices(source_dictionary, target_dictionary, bilingual_dictionary):\n",
+    "    \"\"\"source and target dictionaries are the FastVector objects of source/target languages.\n",
+    "    bilingual_dictionary is a list of translation pair tuples [(source_word, target_word), ...].\"\"\"\n",
+    "    source_matrix = []\n",
+    "    target_matrix = []\n",
+    "\n",
+    "    for (source, target) in bilingual_dictionary:\n",
+    "        if source in source_dictionary and target in target_dictionary:\n",
+    "            source_matrix.append(source_dictionary[source])\n",
+    "            target_matrix.append(target_dictionary[target])\n",
+    "\n",
+    "    # return training matrices\n",
+    "    return np.array(source_matrix), np.array(target_matrix)\n",
+    "\n",
+    "def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):\n",
+    "    \"\"\"source and target matrices are numpy arrays, shape (dictionary_length, embedding_dimension).\n",
+    "    These contain paired word vectors from the bilingual dictionary.\"\"\"\n",
+    "    # optionally normalize the training vectors\n",
+    "    if normalize_vectors:\n",
+    "        source_matrix = normalized(source_matrix)\n",
+    "        target_matrix = normalized(target_matrix)\n",
+    "\n",
+    "    # perform the SVD; np.linalg.svd returns the third factor already transposed\n",
+    "    product = np.matmul(source_matrix.transpose(), target_matrix)\n",
+    "    U, s, Vt = np.linalg.svd(product)\n",
+    "\n",
+    "    # return orthogonal transformation which aligns source language to the target\n",
+    "    return np.matmul(U, Vt)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we load the French and Russian word vectors, and evaluate the similarity of \"chat\" and \"кот\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "reading word vectors from wiki.fr.vec\n",
+      "reading word vectors from wiki.ru.vec\n",
+      "0.0238620125217\n"
+     ]
+    }
+   ],
+   "source": [
+    "fr_dictionary = FastVector(vector_file='wiki.fr.vec')\n",
+    "ru_dictionary = FastVector(vector_file='wiki.ru.vec')\n",
+    "\n",
+    "fr_vector = fr_dictionary[\"chat\"]\n",
+    "ru_vector = ru_dictionary[\"кот\"]\n",
+    "print(FastVector.cosine_similarity(fr_vector, ru_vector))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\"chat\" and \"кот\" both mean \"cat\", so they should be highly similar; clearly the two word vector spaces are not yet aligned. To align them, we need a bilingual dictionary of French and Russian translation pairs. As it happens, this is a great opportunity to show you something truly amazing...\n",
+    "\n",
+    "Many words appear in the vocabularies of more than one language: words like \"alberto\", \"london\" and \"presse\". These words usually mean similar things in each language, so we can form a bilingual dictionary simply by extracting every word that appears in both the French and Russian vocabularies."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "ru_words = set(ru_dictionary.word2id.keys())\n",
+    "fr_words = set(fr_dictionary.word2id.keys())\n",
+    "overlap = list(ru_words & fr_words)\n",
+    "bilingual_dictionary = [(entry, entry) for entry in overlap]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's align the French vectors to the Russian vectors, using only this \"free\" dictionary that we acquired without any bilingual expert knowledge."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# form the training matrices\n",
+    "source_matrix, target_matrix = make_training_matrices(\n",
+    "    fr_dictionary, ru_dictionary, bilingual_dictionary)\n",
+    "\n",
+    "# learn and apply the transformation\n",
+    "transform = learn_transformation(source_matrix, target_matrix)\n",
+    "fr_dictionary.apply_transform(transform)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we re-evaluate the similarity of \"chat\" and \"кот\":"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.377368048895\n"
+     ]
+    }
+   ],
+   "source": [
+    "fr_vector = fr_dictionary[\"chat\"]\n",
+    "ru_vector = ru_dictionary[\"кот\"]\n",
+    "print(FastVector.cosine_similarity(fr_vector, ru_vector))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After alignment the cosine similarity of \"chat\" and \"кот\" rises from about 0.02 to about 0.38, so the two words are pretty similar after all :)\n",
+    "\n",
+    "Use this simple \"identical strings\" trick to align other language pairs for yourself, or prepare your own expert bilingual dictionaries for optimal performance."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
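
The SVD step in `learn_transformation` above solves what is usually called the orthogonal Procrustes problem: given paired rows, find the orthogonal matrix that best maps the source rows onto the target rows. The sketch below checks this on synthetic data, assuming only NumPy. The names `learn_orthogonal_map` and `true_rotation`, and the random matrices, are illustrative stand-ins and are not part of the notebook or of `FastVector`.

```python
import numpy as np


def learn_orthogonal_map(source_matrix, target_matrix):
    """Return the orthogonal W that minimizes ||source_matrix @ W - target_matrix||."""
    # Cross-covariance between the paired source and target rows.
    product = source_matrix.T @ target_matrix
    # np.linalg.svd returns the right singular vectors already transposed.
    U, _, Vt = np.linalg.svd(product)
    return U @ Vt


# Synthetic stand-in for word vectors: 200 paired "words" in 5 dimensions.
rng = np.random.default_rng(0)
dim = 5
source = rng.standard_normal((200, dim))

# A random orthogonal matrix (via QR) plays the role of the unknown alignment.
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
target = source @ true_rotation

learned = learn_orthogonal_map(source, target)

# With noise-free pairs, the learned map matches the hidden rotation.
print(np.allclose(learned, true_rotation))
# The learned map is orthogonal, so it preserves vector norms and angles.
print(np.allclose(learned @ learned.T, np.eye(dim)))
```

With real word vectors the translation pairs are noisy, so the learned map only approximately aligns the two spaces. That is consistent with the similarity of "chat" and "кот" rising to about 0.38 rather than to 1.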