Levenshtein distance

From Rosetta Code
Revision as of 20:49, 11 January 2011 by rosettacode>Dingowolf (add note to verifyoutput)
Levenshtein distance is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.
This page uses content from Wikipedia. The original article was at Levenshtein distance. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e. an edit distance). The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of 'k' with 's')
  2. sitten → sittin (substitution of 'e' with 'i')
  3. sittin → sitting (insert 'g' at the end).

The Levenshtein distance between "rosettacode", "raisethysword" is 8; The distance between two strings is same as that when both strings is reversed.

Task : Implements a Levenshtein distance function, or uses a library function, to show the Levenshtein distance between "kitten" and "sitting".

Other edit distance at Rosetacode.org :

D

The standard library std.algorithm module includes a Levenshtein distance function. <lang d>import std.algorithm ;

   writeln(levenshteinDistance("kitten", "sitting")) ; // print 3</lang>

PureBasic

Based on Wikipedia's pseudocode. <lang PureBasic>Procedure LevenshteinDistance(A_string$, B_String$)

 Protected m, n, i, j, min, k, l
 m = Len(A_string$)
 n = Len(B_String$)
 Dim D(m, n)
 
 For i=0 To m: D(i,0)=i: Next
 For j=0 To n: D(0,j)=j: Next
 
 For j=1 To n
   For i=1 To m
     If Mid(A_string$,i,1) = Mid(B_String$,j,1)
       D(i,j) = D(i-1, j-1); no operation required
     Else
       min = D(i-1, j)+1   ; a deletion
       k   = D(i, j-1)+1   ; an insertion
       l   = D(i-1, j-1)+1 ; a substitution
       If k < min: min=k: EndIf
       If l < min: min=l: EndIf
       D(i,j) = min
     EndIf
   Next
 Next
 ProcedureReturn D(m,n)

EndProcedure

- Testing

n = LevenshteinDistance("kitten", "sitting") MessageRequester("Info","Levenshtein Distance= "+Str(n))</lang>

Tcl

<lang tcl>proc levenshteinDistance {s t} {

   # Edge cases
   if {![set n [string length $t]]} {

return [string length $s]

   } elseif {![set m [string length $s]]} {

return $n

   }
   # Fastest way to initialize
   for {set i 0} {$i <= $m} {incr i} {

lappend d 0 lappend p $i

   }
   # Loop, computing the distance table (well, a moving section)
   for {set j 0} {$j < $n} {} {

set tj [string index $t $j] lset d 0 [incr j] for {set i 0} {$i < $m} {} { set a [expr {[lindex $d $i]+1}] set b [expr {[lindex $p $i]+([string index $s $i] ne $tj)}] set c [expr {[lindex $p [incr i]]+1}] # Faster than min($a,$b,$c) lset d $i [expr {$a<$b ? $c<$a ? $c : $a : $c<$b ? $c : $b}] } # Swap set nd $p; set p $d; set d $nd

   }
   # The score is at the end of the last-computed row
   return [lindex $p end]

}</lang> Usage: <lang tcl>puts [levenshteinDistance "kitten" "sitting"]; # Prints 3</lang>