update edit distance tutorial
bezzazi abir committed Aug 24, 2022
1 parent 363e2a9 commit 5cb27ad
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions wiki/Tutorials/edit-distances-tutorial.md
@@ -27,20 +27,20 @@ For calculating the cost of each cell we proceed by using this function:

The function above defines the restricted Damerau-Levenshtein distance. The first four conditions alone define the Levenshtein distance.

- Okay so for each cell, as said we apply the function bellow. Let's take the cell (5,4) for example. Note that in Pharo we start at 1 and not 0. The indexes i are for the rows and j are for the columns.
- Okay, so for each cell we apply the function above, as said. Let's take the cell (4,3) as an example. Note that in Pharo indexes start at 1, not 0. The index i is for the rows and j is for the columns.
- So we calculate the min of the upper cell value + 1, the left cell value + 1, and the upper-left cell value + (0 if the characters of this cell are equal, 1 if they are not; here they are not). These are the cells indicated by the red arrows: min(2+1, 3+1, 2+1) = 3. If we stopped here, we would have calculated the edit distance using the Levenshtein distance.
- To use the restricted Damerau-Levenshtein distance we also have to account for the transposition operation (swapping two consecutive characters). This is what the last condition of our function above does, using the value of the cell (i-2,j-2). Here its value is 1, so we proceed as before and calculate min(2+1, 3+1, 2+1, 1+1) = 2. It's the cell coloured in green. This cell gives us the value of the restricted Damerau-Levenshtein distance between our first string "a cat" and our second string "an act".
- _Note_ : Edit distance has properties of dynamic programming. Because at each stage of the algorithm we have the optimal choice. Back to our example of the cell (5,4), if you see this cell represents the substrings `"a ca"` and `"an "`. The edit distance between these two is 3 - since we have to add "n" after the "a" and add "c" and "a" after the space character " ". Three edit operations to transform "a ca" into "an ".
- _Continuity of note_: If you take any cell of the matrix you'll notice that the value of it is the edit distance of the substrings that is related to. So obviously, the last cell in green is the value of the edit distance of our two starting strings.
- _Note_: Edit distance is a dynamic programming problem, because at each stage of the algorithm we already hold the optimal choice. Back to our example of the cell (4,3): this cell represents the substrings `"a ca"` and `"an "`. The edit distance between these two is 3, since we have to insert "n" after the "a" and delete the "c" and the "a" that follow the space character " ". That makes three edit operations to transform "a ca" into "an ".
- _Continuation of the note_: If you take any cell of the matrix, you'll notice that its value is the edit distance between the substrings it relates to. So the last cell, in green, holds the edit distance between our two starting strings.
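
The steps above can be sketched in code. The following is a minimal Python sketch, not the tutorial's Pharo implementation; the function name `osa_distance` is an assumption made for illustration. Python indexes strings from 0, so cell (i, j) of the matrix compares `a[i - 1]` with `b[j - 1]`.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance (illustrative sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the distance between the first i characters of a
    # and the first j characters of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete i characters to reach the empty string
    for j in range(cols):
        d[0][j] = j  # insert j characters starting from the empty string

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # upper cell + 1 (deletion)
                d[i][j - 1] + 1,         # left cell + 1 (insertion)
                d[i - 1][j - 1] + cost,  # upper-left cell + 0 or 1
            )
            # Last condition: the two previous characters are swapped,
            # so the cell (i-2, j-2) + 1 is also a candidate.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("a cat", "an act"))  # 2
print(osa_distance("a ca", "an "))      # 3
```

Dropping the final `if` block leaves only the first four conditions, which is the plain Levenshtein distance.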

<img src="./img/RDL.png" alt="formulaRDL" width="50%" height="50%"><br>


## Damerau-Levenshtein distance :

For 2 words, such as 'a cat' and 'an abct', a matrix of size 5x6 is created as shown in the matrix bellow. Note that the rows and columns surrounded in dark and light purple are not part of the matrix they are just added to count correctly s.
For 2 words, such as 'a cat' and 'an abct', a matrix of size 5x6 is created as shown in the matrix below. Note that the rows and columns surrounded in dark and light purple are not part of the matrix; they are just added to count correctly.

The main difference between the retricted algorithm and this one is that in the first we couldn't calculate a transposion of non-adjacents characters. In other words, we couldn't edit a substring that was already edited. In the example of the matrix bellow we have the first string `"a cat"` and second string `"a abct"`. "ca" could be swapped (swap = tanspose) with "ac". But we couldn't re-edit this substring to add "b" to it and have "abc". On the contrary, in the non-restricted Damerau-Levenshtein distance we can ! Let's see how.
The main difference between the restricted algorithm and this one is that in the first we couldn't calculate a transposition of non-adjacent characters. In other words, we couldn't edit a substring that was already edited. In the example of the matrix below we have the first string `"a cat"` and the second string `"a abct"`. "ca" could be swapped (swap = transpose) with "ac", but we couldn't then re-edit this substring to add "b" to it and get "abc". In the non-restricted Damerau-Levenshtein distance, on the contrary, we can! Let's see how.
- Let's take the example of cell (6,7). We apply the Levenshtein function (the first four conditions of the equation above) to calculate the edit distance value of the cell. In addition to that, we calculate the value of the transposition. To know whether we have a transposition or not, we:
- First, go backwards through the rows from our current cell (Matrix: red arrow from the blue cell to the cell with a star) and stop when we find the row with the current column's character. Here we're looking for the character `"c"`.
- Second, go backwards through the columns from the cell we stopped in (here the blue one) and look for the column in this row where the characters match the characters of our current cell (remember, our current cell is still (6,7)). One column back (j-1) we have (c,b), not a match; two columns back (j-2) we have (a,c), it's a match!
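
The backwards search described above can be sketched in Python. This is a minimal sketch of the commonly used formulation of the unrestricted distance, not necessarily the tutorial's Pharo code; the names `damerau_levenshtein`, `last_row` and `last_col` are assumptions made for illustration. Instead of rescanning the matrix, it remembers the last row where each character was seen and the last column of the current row that produced a match. The extra border row and column play the role of the added rows and columns that are not part of the matrix proper.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    max_dist = len(a) + len(b)
    # Extra border row and column (index 0) hold max_dist so that a
    # transposition candidate with no earlier match can never win.
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j

    last_row = {}  # character -> last row (1-based) of a where it occurred
    for i in range(1, len(a) + 1):
        last_col = 0  # last column (1-based) in this row with equal characters
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)  # row found by searching backwards
            l = last_col                   # column found by searching backwards
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # match or substitution
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # transposition, plus the edits made between the swapped characters
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein("ca", "abc"))        # 2
print(damerau_levenshtein("a cat", "a abct"))  # 2
```

For "ca" and "abc" the restricted variant gives 3, while this one gives 2, because the swapped substring can be edited again to receive the "b".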