Merge pull request #17 from Babylonpartners/updated_alignment_matrices
Updated alignment matrices
nhammerla authored Feb 7, 2018
2 parents 451ee74 + 78215a3 commit f81f3e4
Showing 79 changed files with 22,921 additions and 22,843 deletions.
87 changes: 44 additions & 43 deletions README.md
@@ -57,80 +57,80 @@ First things first, let's test the translation performance from English into eve
|-----------------|--------------|--------------|---------------|
| fr | 0.73 | 0.86 | 0.88 |
| pt | 0.73 | 0.86 | 0.89 |
| es | 0.73 | 0.85 | 0.88 |
| it | 0.71 | 0.86 | 0.89 |
| nl | 0.69 | 0.83 | 0.86 |
| no | 0.68 | 0.85 | 0.88 |
| ca | 0.67 | 0.82 | 0.86 |
| da | 0.66 | 0.83 | 0.88 |
| es | 0.72 | 0.85 | 0.88 |
| it | 0.70 | 0.86 | 0.89 |
| nl | 0.68 | 0.83 | 0.86 |
| no | 0.68 | 0.85 | 0.89 |
| da | 0.66 | 0.84 | 0.88 |
| ca | 0.66 | 0.81 | 0.86 |
| sv | 0.65 | 0.82 | 0.86 |
| cs | 0.64 | 0.81 | 0.85 |
| ro | 0.63 | 0.81 | 0.85 |
| hu | 0.62 | 0.80 | 0.85 |
| pl | 0.62 | 0.79 | 0.82 |
| fi | 0.61 | 0.79 | 0.84 |
| de | 0.61 | 0.75 | 0.78 |
| ru | 0.61 | 0.77 | 0.82 |
| de | 0.62 | 0.75 | 0.78 |
| pl | 0.62 | 0.79 | 0.83 |
| hu | 0.61 | 0.80 | 0.84 |
| fi | 0.61 | 0.80 | 0.84 |
| eo | 0.61 | 0.80 | 0.85 |
| ru | 0.60 | 0.78 | 0.82 |
| gl | 0.60 | 0.77 | 0.82 |
| id | 0.58 | 0.81 | 0.86 |
| mk | 0.58 | 0.79 | 0.84 |
| bg | 0.58 | 0.77 | 0.82 |
| id | 0.58 | 0.81 | 0.86 |
| bg | 0.57 | 0.77 | 0.82 |
| ms | 0.57 | 0.81 | 0.86 |
| sh | 0.56 | 0.77 | 0.82 |
| uk | 0.56 | 0.75 | 0.79 |
| uk | 0.57 | 0.75 | 0.79 |
| sh | 0.56 | 0.77 | 0.81 |
| hr | 0.56 | 0.75 | 0.80 |
| tr | 0.56 | 0.77 | 0.81 |
| sl | 0.56 | 0.77 | 0.82 |
| hr | 0.55 | 0.75 | 0.80 |
| el | 0.55 | 0.75 | 0.80 |
| el | 0.54 | 0.75 | 0.80 |
| sk | 0.54 | 0.75 | 0.81 |
| et | 0.53 | 0.73 | 0.78 |
| sk | 0.53 | 0.75 | 0.81 |
| sr | 0.53 | 0.72 | 0.77 |
| af | 0.52 | 0.75 | 0.80 |
| lt | 0.50 | 0.72 | 0.79 |
| ar | 0.48 | 0.69 | 0.75 |
| bs | 0.48 | 0.70 | 0.76 |
| bs | 0.47 | 0.70 | 0.77 |
| lv | 0.47 | 0.68 | 0.75 |
| eu | 0.46 | 0.68 | 0.75 |
| fa | 0.45 | 0.68 | 0.75 |
| hy | 0.43 | 0.66 | 0.73 |
| be | 0.43 | 0.64 | 0.71 |
| sq | 0.42 | 0.65 | 0.71 |
| zh | 0.41 | 0.67 | 0.74 |
| sq | 0.43 | 0.65 | 0.71 |
| be | 0.43 | 0.64 | 0.70 |
| zh | 0.40 | 0.68 | 0.75 |
| ka | 0.40 | 0.63 | 0.71 |
| hi | 0.40 | 0.58 | 0.63 |
| cy | 0.39 | 0.63 | 0.71 |
| hi | 0.39 | 0.58 | 0.63 |
| az | 0.38 | 0.60 | 0.67 |
| ko | 0.37 | 0.58 | 0.66 |
| te | 0.36 | 0.56 | 0.63 |
| ko | 0.36 | 0.58 | 0.66 |
| kk | 0.35 | 0.60 | 0.68 |
| he | 0.33 | 0.45 | 0.48 |
| fy | 0.33 | 0.52 | 0.61 |
| vi | 0.31 | 0.53 | 0.61 |
| fy | 0.33 | 0.52 | 0.60 |
| vi | 0.31 | 0.53 | 0.62 |
| ta | 0.31 | 0.50 | 0.56 |
| bn | 0.30 | 0.49 | 0.56 |
| ur | 0.29 | 0.52 | 0.61 |
| is | 0.28 | 0.51 | 0.59 |
| is | 0.29 | 0.51 | 0.59 |
| tl | 0.28 | 0.51 | 0.59 |
| kn | 0.28 | 0.43 | 0.46 |
| gu | 0.25 | 0.44 | 0.51 |
| mn | 0.25 | 0.48 | 0.58 |
| mn | 0.25 | 0.49 | 0.58 |
| uz | 0.24 | 0.43 | 0.51 |
| si | 0.22 | 0.40 | 0.45 |
| ml | 0.21 | 0.35 | 0.39 |
| th | 0.21 | 0.33 | 0.38 |
| ky | 0.20 | 0.40 | 0.49 |
| mr | 0.20 | 0.37 | 0.44 |
| th | 0.20 | 0.33 | 0.38 |
| la | 0.19 | 0.34 | 0.42 |
| ja | 0.18 | 0.43 | 0.56 |
| ja | 0.18 | 0.44 | 0.56 |
| ne | 0.16 | 0.33 | 0.38 |
| pa | 0.16 | 0.32 | 0.38 |
| tg | 0.15 | 0.31 | 0.39 |
| km | 0.12 | 0.26 | 0.31 |
| my | 0.10 | 0.20 | 0.23 |
| lb | 0.10 | 0.18 | 0.21 |
| mg | 0.07 | 0.19 | 0.25 |
| ceb | 0.06 | 0.14 | 0.18 |
| tg | 0.14 | 0.31 | 0.39 |
| km | 0.12 | 0.26 | 0.30 |
| my | 0.10 | 0.19 | 0.23 |
| lb | 0.09 | 0.18 | 0.21 |
| mg | 0.07 | 0.18 | 0.25 |
| ceb | 0.06 | 0.13 | 0.18 |

As you can see, the alignment is consistently much better than random! In general, the procedure works best for other European languages like French, Portuguese and Spanish. We use 2500 word pairs because, of the 5000 words in the test dictionary, not all the translations returned by the Google Translate API are actually present in the fastText vocabulary.
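
The precision @k columns above can be read as follows: for each test pair, rank every target-language word by similarity to the aligned source vector, and count a hit if the reference translation is among the top k. The sketch below is a minimal pure-Python illustration, assuming cosine similarity as the ranking score; the function name `precision_at_k` and the toy two-dimensional vectors are hypothetical and are not taken from the repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def precision_at_k(source_vecs, target_vecs, word_pairs, k):
    """Fraction of (source, target) pairs whose reference translation
    appears among the k target words closest to the source vector."""
    hits = 0
    for src_word, tgt_word in word_pairs:
        query = source_vecs[src_word]
        # Rank the whole target vocabulary, most similar first.
        ranked = sorted(
            target_vecs,
            key=lambda w: cosine(query, target_vecs[w]),
            reverse=True,
        )
        if tgt_word in ranked[:k]:
            hits += 1
    return hits / len(word_pairs)

# Toy vectors (hypothetical), assumed to be already aligned to one space.
en = {"cat": [1.0, 0.1], "dog": [0.1, 1.0], "car": [0.7, 0.7]}
fr = {"chat": [0.9, 0.2], "chien": [0.2, 0.9], "voiture": [1.0, 0.05]}
pairs = [("cat", "chat"), ("dog", "chien"), ("car", "voiture")]

for k in (1, 2, 3):
    print(k, precision_at_k(en, fr, pairs, k))
```

With a real vocabulary the full sort per query would be replaced by a vectorised top-k search, but the metric itself is unchanged. Precision can only stay the same or increase as k grows, which matches the pattern across the three columns.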

@@ -147,23 +147,24 @@ Intriguingly, even though we only directly aligned the languages to English, som
| Language 1 | Language 2 | Inter-pair precision @1 | English-pair precision @1 |
|:----------:|:----------:|:-----------------------:|:-------------------------:|
| bs | sh | 0.88 | 0.52 |
| ru | uk | 0.84 | 0.59 |
| ru | uk | 0.84 | 0.58 |
| ca | es | 0.82 | 0.69 |
| cs | sk | 0.82 | 0.59 |
| ca | es | 0.82 | 0.70 |
| hr | sh | 0.78 | 0.56 |
| be | uk | 0.77 | 0.49 |
| be | uk | 0.77 | 0.50 |
| gl | pt | 0.76 | 0.66 |
| bs | hr | 0.74 | 0.52 |
| be | ru | 0.73 | 0.52 |
| sr | sh | 0.73 | 0.54 |
| be | ru | 0.73 | 0.51 |
| da | no | 0.73 | 0.67 |
| sr | sh | 0.73 | 0.54 |
| pt | es | 0.72 | 0.72 |
| ca | pt | 0.70 | 0.69 |
| gl | es | 0.70 | 0.66 |
| hr | sr | 0.69 | 0.54 |
| ca | gl | 0.68 | 0.63 |
| bs | sr | 0.67 | 0.50 |
| mk | sr | 0.57 | 0.55 |
| mk | sr | 0.56 | 0.55 |
| kk | ky | 0.30 | 0.28 |
| kk | uz | 0.29 | 0.29 |

All of these language pairs share very close linguistic roots. For instance, the first pair above is Bosnian and Serbo-Croatian; Bosnian is a variant of Serbo-Croatian. The second pair is Russian and Ukrainian, both East Slavic languages. It seems that the more similar two languages are, the more similar the geometry of their fastText vectors, leading to improved translation performance.
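
To see why two languages that were each aligned only to English can still be compared with one another, consider the toy sketch below. It assumes each alignment matrix is orthogonal (here a plain 2D rotation) and that both languages encode the same underlying meaning up to a rotation; all names and angles are hypothetical and chosen only for illustration.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, standing in for an orthogonal alignment matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply(matrix, vec):
    """Multiply a matrix by a column vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One shared "meaning", seen through two differently rotated embedding spaces.
shared_meaning = [1.0, 0.0]
ru_raw = apply(rotation(0.5), shared_meaning)   # toy Russian vector
uk_raw = apply(rotation(-1.2), shared_meaning)  # toy Ukrainian vector

# Each language has its own matrix mapping it into the English space.
ru_to_en = rotation(-0.5)
uk_to_en = rotation(1.2)

raw_similarity = cosine(ru_raw, uk_raw)
aligned_similarity = cosine(apply(ru_to_en, ru_raw), apply(uk_to_en, uk_raw))

print("before alignment:", raw_similarity)
print("after alignment: ", aligned_similarity)
```

In this idealised setting the unaligned vectors point in unrelated directions, while the aligned ones coincide. Real embedding spaces are only approximately related by a rotation, which is one plausible reason the inter-pair precision varies with how similar the two languages are.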
