Talk:Levenshtein distance: Difference between revisions

m
→‎Java: l/c i
(verbose details for J)
m (→‎Java: l/c i)
 
(10 intermediate revisions by 4 users not shown)
Line 1:
 
==J Implementation==
 
Line 4 ⟶ 5:
 
<lang j>levD=: i.@-@>:@#@] ,~ >:@i.@-@#@[ ,.~ ~:/
lev=: [: {. {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,./@levD</lang>
 
''First, we setup up an matrix of costs, with 0 or 1 for unexplored cells (1 being where the character pair corresponding to that cell position has two different characters). Note that the "cost to reach the empty string" cells go on the bottom and the right, instead of the top and the left, because this works better with J's "reduce" operator.''
Line 25 ⟶ 26:
The J operator / inserts its verb between each item in an array, so <code>'kitten' lev 'sitting'</code> is equivalent to:
 
<lang j>{. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. ... 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>
 
Or, breaking it down and avoiding problems with long lines:
 
<lang j>d=. 7 6 5 4 3 2 1 0
d=. 1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 2 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 3 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
d=. 1 1 0 0 1 1 1 4 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
d=. 1 0 1 1 0 1 1 5 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
d=. 1 1 1 1 1 1 1 6 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. t
{.d NB. the first number in the final instance of d is our result
</lang>
Line 42 ⟶ 43:
 
So, let's take a specific example:
<lang j>1 1 1 1 1 0 1 1 {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,. 7 6 5 4 3 2 1 0</lang>
 
Examining this sentence, bottom up, the first thing we do is join together these two lists, as columns:
Line 60 ⟶ 61:
That leaves us with a sentence like:
 
<lang j>1 7 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, ... 1 1 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, 1 0</lang>
 
Or, breaking it down, again:
 
<lang j>dd7=.1 0
dd6=.1 1 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd7
dd5=.0 2 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd6
dd4=.1 3 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd5
dd3=.1 4 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd4
dd2=.1 5 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd3
dd1=.1 6 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd2
dd0=.1 7 (1&{ ,~ (1 1,{. +) <./@:+ }.)@, dd1</lang>
 
So, for example, finding the value for dd4 is 2 3 given that we know that dd5 is 1 2
Line 81 ⟶ 82:
{. 1 3 1 2 NB. find if current character matches
1
<./31 1, 1 2 NB. find cheapestedit ofcost previousdelta delete,from insertprevious oredit substitutecosts
1
1 1 1+3 1 2 NB. addfind costedit ofcosts fixingfrom currentadjacent characterstates to cheapestthis previous editstate
4 2 3
<./4 2 3 NB. find cheapest of previous delete, insert or substitute
2
1&{1 3 1 2 NB. retain current delete cost to be used as next substitute cost
Line 92 ⟶ 95:
In other words, each instance of this operation is computing the new insert and substitute costs based on the previous values. (That said, note that "insert" and "delete" are arbitrary designations -- an "insert" on the string 'kitten' will be a delete on the string 'sitting', and vice versa.)
 
The downside, of these kinds of dynamic algorithms is that while the individual operations can be easy or trivial to picture, the cascading dependencies make comprehending the data as a whole rather difficultslippery. But perhaps viewing the data as a whole, with significant intermediate results, can help:
 
<lang j> 'kitten' {."1@((1&{ ,~ (1 1,{. +) <./@:+ }.)@,/\.)@,./\.@levD 'sitting'
3 3 4 45 5 6 6 6
3 2 3 34 4 5 5 5
4 3 2 23 4 4 4 4
5 4 3 2 2 3 3 3 3
6 5 4 3 2 2 2 2
6 5 4 3 2 1 1 1
Line 104 ⟶ 107:
 
Here, the rows correspond to suffixes of the string 'kitten' and the columns correspond to suffixes of the string 'sitting', and the numbers represent the edit cost of matching the two corresponding suffixes.
 
== Changed definition ==
 
The [http://rosettacode.org/mw/index.php?title=Levenshtein_distance&oldid=211835 2015-09-15] edits to the task description change the task. Prior to those edits, substitutions cost the same as an insert or a delete. Those edits instead declare that a substitution should cost the same as an insert plus a delete. And, those edits include a reference to a paper which describes this same system.
 
But that's a different task from what this was. A [[https://en.wikipedia.org/wiki/Edit_distance#Formal_definition_and_properties|wikipedia entry]] currently seems to suggest that this would make the task equivalent to the [[http://rosettacode.org/wiki/Longest_Common_Subsequence|longest common subsequence]] task. I don't think that's precise, but there is a pretty strong relationship there.
 
Anyways, I am reverting these edits, but did not want to lose track of them.
 
Perhaps this is worth doing a new task for? I'm not sure... --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 13:56, 15 September 2015 (UTC)
 
== Java ==
 
The Java Iterative space optimized (even bounded) version translated from Python was missing the initialization of the cost array. Implicitly done in Python, but missing for Java. This results in subtle bugs for a subset of strings. One such example being ["abcdefgh", "defghabc"] with the correct answer being 6 and the previously translated version here returning 5. I updated the source code with the correct initialization, (i.e. cost[i] = i;). --[[User:Caislen|Caislen]] ([[User talk:Caislen|talk]]) 20:41, 16 November 2020 (UTC)
7,805

edits