Talk:Levenshtein distance: Difference between revisions

Content added Content deleted

Inline

Revision as of 18:13, 13 January 2011

J Implementation

This is some documentation and elaboration for:

First, we setup up an matrix of costs, with 0 or 1 for unexplored cells (1 being where the character pair corresponding to that cell position has two different characters). Note that the "cost to reach the empty string" cells go on the bottom and the right, instead of the top and the left, because this works better with J's "reduce" operator.

Then we reduce the rows of that matrix using an operation that treats those two rows as columns and reduces the rows of this derived matrix with an operation that gives the (unexplored cell + the minumum of the other cells) followed by (the cell adjacent to the previously unexplored cell.

In other words, levD does this:

<lang j> 'kitten' levD 'sitting' 1 1 1 1 1 1 1 6 1 0 1 1 0 1 1 5 1 1 0 0 1 1 1 4 1 1 0 0 1 1 1 3 1 1 1 1 1 1 1 2 1 1 1 1 1 0 1 1 7 6 5 4 3 2 1 0</lang>

The rightmost column and bottom row indicate the edit distance from those substrings to the empty string. In other words, 6 is the edit distance from 'kitten' to the empty string, while 3 is the edit distance from 'ten' to the empty string.

The J operator / inserts its verb between each item in an array, so 'kitten' lev 'sitting' is equivalent to:

Or, breaking it down and avoiding problems with long lines:

<lang j>d=. 7 6 5 4 3 2 1 0 d=. 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 1 1 1 1 1 2 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 0 0 1 1 1 3 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 0 0 1 1 1 4 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 0 1 1 0 1 1 5 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t

  {.d  NB. the first number in the final instance of d is our result

</lang>

And, here, each instance of d corresponds to a fully computed row of D in the wikipedia implementation.

So, let's take a specific example: <lang j>1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>

Examining this sentence, bottom up, the first thing we do is join together these two lists, as columns:

And then we use insert again, between each row, to compute our final result. However, we also use the \. operator so we keep a running list of each of these intermediate results (and the last row will be kept as-is, in this accumulated result). Finally, after this is all done, we will be pulling out the first column of this accumulated result, using {."1

That leaves us with a sentence like:

Or, breaking it down, again:

So, for example, finding the value for dd4 is 2 3 given that we know that dd5 is 1 2

<lang j> 1 3, 1 2 NB. the four numbers of interest 1 3 1 2

  }. 1 3 1 2  NB. find previous delete,insert and substitute costs

3 1 2

  {. 1 3 1 2  NB. find if current character matches

1

  <./3 1 2  NB. find cheapest of previous delete, insert or substitute

1

  1 + 1  NB. add cost of fixing current character to cheapest previous edit

2

  1&{1 3 1 2  NB. retain current delete cost to be used as next substitute cost

3

  3,~ 2  NB. new values

2 3</lang>

In other words, each instance of this operation is computing the new insert and substitute costs based on the previous values. (That said, note that "insert" and "delete" are arbitrary designations -- an "insert" on the string 'kitten' will be a delete on the string 'sitting', and vice versa.)

The downside, of these kinds of dynamic algorithms is that while the individual operations can be easy or trivial to picture, the cascading dependencies make comprehending the data as a whole rather slippery. But perhaps viewing the data as a whole, with significant intermediate results, can help:

<lang j> 'kitten' {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,./\.@levD 'sitting' 3 3 4 4 5 6 6 6 3 2 3 3 4 5 5 5 4 3 2 2 4 4 4 4 4 3 2 2 3 3 3 3 6 5 4 3 2 2 2 2 6 5 4 3 2 1 1 1 7 6 5 4 3 2 1 0</lang>

Here, the rows correspond to suffixes of the string 'kitten' and the columns correspond to suffixes of the string 'sitting', and the numbers represent the edit cost of matching the two corresponding suffixes.