Talk:Levenshtein distance: Difference between revisions

From Rosetta Code
Content added Content deleted
(J: follow bug fix on main page)
Line 4: Line 4:


<lang j>levD=: i.@-@>:@#@] ,~ >:@i.@-@#@[ ,.~ ~:/
<lang j>levD=: i.@-@>:@#@] ,~ >:@i.@-@#@[ ,.~ ~:/
lev=: [: {. {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,./@levD</lang>
lev=: [: {. {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,./@levD</lang>


''First, we setup up an matrix of costs, with 0 or 1 for unexplored cells (1 being where the character pair corresponding to that cell position has two different characters). Note that the "cost to reach the empty string" cells go on the bottom and the right, instead of the top and the left, because this works better with J's "reduce" operator.''
''First, we setup up an matrix of costs, with 0 or 1 for unexplored cells (1 being where the character pair corresponding to that cell position has two different characters). Note that the "cost to reach the empty string" cells go on the bottom and the right, instead of the top and the left, because this works better with J's "reduce" operator.''
Line 25: Line 25:
The J operator / inserts its verb between each item in an array, so <code>'kitten' lev 'sitting'</code> is equivalent to:
The J operator / inserts its verb between each item in an array, so <code>'kitten' lev 'sitting'</code> is equivalent to:


<lang j>{. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. ... 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>
<lang j>{. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. ... 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>


Or, breaking it down and avoiding problems with long lines:
Or, breaking it down and avoiding problems with long lines:


<lang j>d=. 7 6 5 4 3 2 1 0
<lang j>d=. 7 6 5 4 3 2 1 0
d=. 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 2 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 2 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 3 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 3 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 4 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 4 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
d=. 1 0 1 1 0 1 1 5 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 0 1 1 0 1 1 5 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t
{.d NB. the first number in the final instance of d is our result
{.d NB. the first number in the final instance of d is our result
</lang>
</lang>
Line 42: Line 42:


So, let's take a specific example:
So, let's take a specific example:
<lang j>1 1 1 1 1 0 1 1 {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>
<lang j>1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>


Examining this sentence, bottom up, the first thing we do is join together these two lists, as columns:
Examining this sentence, bottom up, the first thing we do is join together these two lists, as columns:
Line 60: Line 60:
That leaves us with a sentence like:
That leaves us with a sentence like:


<lang j>1 7 (1&{ ,~ {. + <./@}.)@, ... 1 1 (1&{ ,~ {. + <./@}.)@, 1 0</lang>
<lang j>1 7 (1&{ ,~ (1 1,{.) <./@:+ }.)@, ... 1 1 (1&{ ,~ (1 1,{.) <./@:+ }.)@, 1 0</lang>


Or, breaking it down, again:
Or, breaking it down, again:


<lang j>dd7=.1 0
<lang j>dd7=.1 0
dd6=.1 1 (1&{ ,~ {. + <./@}.)@, dd7
dd6=.1 1 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd7
dd5=.0 2 (1&{ ,~ {. + <./@}.)@, dd6
dd5=.0 2 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd6
dd4=.1 3 (1&{ ,~ {. + <./@}.)@, dd5
dd4=.1 3 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd5
dd3=.1 4 (1&{ ,~ {. + <./@}.)@, dd4
dd3=.1 4 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd4
dd2=.1 5 (1&{ ,~ {. + <./@}.)@, dd3
dd2=.1 5 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd3
dd1=.1 6 (1&{ ,~ {. + <./@}.)@, dd2
dd1=.1 6 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd2
dd0=.1 7 (1&{ ,~ {. + <./@}.)@, dd1</lang>
dd0=.1 7 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd1</lang>


So, for example, finding the value for dd4 is 2 3 given that we know that dd5 is 1 2
So, for example, finding the value for dd4 is 2 3 given that we know that dd5 is 1 2
Line 94: Line 94:
The downside, of these kinds of dynamic algorithms is that while the individual operations can be easy or trivial to picture, the cascading dependencies make comprehending the data as a whole rather slippery. But perhaps viewing the data as a whole, with significant intermediate results, can help:
The downside, of these kinds of dynamic algorithms is that while the individual operations can be easy or trivial to picture, the cascading dependencies make comprehending the data as a whole rather slippery. But perhaps viewing the data as a whole, with significant intermediate results, can help:


<lang j> 'kitten' {."1@((1&{ ,~ {. + <./@}.)@,/\.)@,./\.@levD 'sitting'
<lang j> 'kitten' {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,./\.@levD 'sitting'
3 3 4 4 5 6 6 6
3 3 4 4 5 6 6 6
3 2 3 3 4 5 5 5
3 2 3 3 4 5 5 5

Revision as of 18:13, 13 January 2011

J Implementation

This is some documentation and elaboration for:

<lang j>levD=: i.@-@>:@#@] ,~ >:@i.@-@#@[ ,.~ ~:/ lev=: [: {. {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,./@levD</lang>

First, we setup up an matrix of costs, with 0 or 1 for unexplored cells (1 being where the character pair corresponding to that cell position has two different characters). Note that the "cost to reach the empty string" cells go on the bottom and the right, instead of the top and the left, because this works better with J's "reduce" operator.

Then we reduce the rows of that matrix using an operation that treats those two rows as columns and reduces the rows of this derived matrix with an operation that gives the (unexplored cell + the minumum of the other cells) followed by (the cell adjacent to the previously unexplored cell.

In other words, levD does this:

<lang j> 'kitten' levD 'sitting' 1 1 1 1 1 1 1 6 1 0 1 1 0 1 1 5 1 1 0 0 1 1 1 4 1 1 0 0 1 1 1 3 1 1 1 1 1 1 1 2 1 1 1 1 1 0 1 1 7 6 5 4 3 2 1 0</lang>

The rightmost column and bottom row indicate the edit distance from those substrings to the empty string. In other words, 6 is the edit distance from 'kitten' to the empty string, while 3 is the edit distance from 'ten' to the empty string.

The J operator / inserts its verb between each item in an array, so 'kitten' lev 'sitting' is equivalent to:

<lang j>{. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. ... 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>

Or, breaking it down and avoiding problems with long lines:

<lang j>d=. 7 6 5 4 3 2 1 0 d=. 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 1 1 1 1 1 2 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 0 0 1 1 1 3 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 0 0 1 1 1 4 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 0 1 1 0 1 1 5 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t d=. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. t

  {.d  NB. the first number in the final instance of d is our result

</lang>

And, here, each instance of d corresponds to a fully computed row of D in the wikipedia implementation.

So, let's take a specific example: <lang j>1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>

Examining this sentence, bottom up, the first thing we do is join together these two lists, as columns:

<lang j> 1 1 1 1 1 0 1 1 ,. 7 6 5 4 3 2 1 0 1 7 1 6 1 5 1 4 1 3 0 2 1 1 1 0</lang>

And then we use insert again, between each row, to compute our final result. However, we also use the \. operator so we keep a running list of each of these intermediate results (and the last row will be kept as-is, in this accumulated result). Finally, after this is all done, we will be pulling out the first column of this accumulated result, using {."1

That leaves us with a sentence like:

<lang j>1 7 (1&{ ,~ (1 1,{.) <./@:+ }.)@, ... 1 1 (1&{ ,~ (1 1,{.) <./@:+ }.)@, 1 0</lang>

Or, breaking it down, again:

<lang j>dd7=.1 0 dd6=.1 1 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd7 dd5=.0 2 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd6 dd4=.1 3 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd5 dd3=.1 4 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd4 dd2=.1 5 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd3 dd1=.1 6 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd2 dd0=.1 7 (1&{ ,~ (1 1,{.) <./@:+ }.)@, dd1</lang>

So, for example, finding the value for dd4 is 2 3 given that we know that dd5 is 1 2

<lang j> 1 3, 1 2 NB. the four numbers of interest 1 3 1 2

  }. 1 3 1 2  NB. find previous delete,insert and substitute costs

3 1 2

  {. 1 3 1 2  NB. find if current character matches

1

  <./3 1 2  NB. find cheapest of previous delete, insert or substitute

1

  1 + 1  NB. add cost of fixing current character to cheapest previous edit

2

  1&{1 3 1 2  NB. retain current delete cost to be used as next substitute cost

3

  3,~ 2  NB. new values

2 3</lang>

In other words, each instance of this operation is computing the new insert and substitute costs based on the previous values. (That said, note that "insert" and "delete" are arbitrary designations -- an "insert" on the string 'kitten' will be a delete on the string 'sitting', and vice versa.)

The downside, of these kinds of dynamic algorithms is that while the individual operations can be easy or trivial to picture, the cascading dependencies make comprehending the data as a whole rather slippery. But perhaps viewing the data as a whole, with significant intermediate results, can help:

<lang j> 'kitten' {."1@((1&{ ,~ (1 1,{.) <./@:+ }.)@,/\.)@,./\.@levD 'sitting' 3 3 4 4 5 6 6 6 3 2 3 3 4 5 5 5 4 3 2 2 4 4 4 4 4 3 2 2 3 3 3 3 6 5 4 3 2 2 2 2 6 5 4 3 2 1 1 1 7 6 5 4 3 2 1 0</lang>

Here, the rows correspond to suffixes of the string 'kitten' and the columns correspond to suffixes of the string 'sitting', and the numbers represent the edit cost of matching the two corresponding suffixes.