User:Mwn3d/Levenshtein distance

From Rosetta Code

This is wrong. It's an attempt at a recursive version of Levenshtein distance. I'll be working on it myself on the side, but if anyone wants to help they can talk about it or directly edit it here.

It tries an exhaustive search of possibilities from string a to get to string b. It will either insert a character from string b into the right position in a, delete an unneeded character from a, or change a character in a to the corresponding on from string b to generate possibilities. It does this one possibility at a time and calls itself to see if it got any closer. <lang java5>import java.util.HashMap; import java.util.HashSet; import java.util.Map; import java.util.Set;

public class Dist {

   public static int lDist(String a, String b) throws Exception{
       if(a == null || b == null){
           return -1; //error
       }
       if(a.equals(b)) return 0;
       if(a.length() == 0){
           return b.length();
       }
       if(b.length() == 0){
           return a.length();
       }
       //keep track of the scores for each candidate
       Map<String, Integer> mem = new HashMap<String, Integer>();
       StringBuilder aBuff = new StringBuilder(a);
       //set so we don't test the same candidate twice
       Set<String> candidates = new HashSet<String>();
       if (a.length() < b.length()) {//if a is longer than b generate all possible deletion possibilities
           for (int i = 0; i < a.length(); i++) {
               //insert
               candidates.add(aBuff.insert(i + 1, b.charAt(i + 1)).toString());
               aBuff.deleteCharAt(i + 1);
           }
       }
       if(a.length() > b.length()){//if a is short than b generate all insertion possibilities
           for(int i = 0; i < a.length(); i++){
               //remove
               char temp = aBuff.charAt(i);
               candidates.add(aBuff.deleteCharAt(i).toString());
               aBuff.insert(i, temp);
           }
       }
       if (a.length() == b.length()) {//if they are the same length generate all possible change possibilities
           for (int i = 0; i < a.length(); i++) {
               //change
               if (aBuff.charAt(i) != b.charAt(i)) {
                   char temp = aBuff.charAt(i);
                   candidates.add(aBuff.replace(i, i + 1,
                           Character.toString(b.charAt(i))).toString());
                   aBuff.replace(i, i + 1, Character.toString(temp));
               }
           }
       }
       for(String candidate: candidates){
           mem.put(candidate, 1 + lDist(candidate, b));
       }
       int ans = Integer.MAX_VALUE;
       for(String candidate: mem.keySet()){
           if(mem.get(candidate) < ans){
               ans = mem.get(candidate);
           }
       }
       return ans;
   }
   public static void main(String[] args) throws Exception{
       System.out.println(lDist("kitten", "sitting")); //prints 3, which is correct
       System.out.println(lDist("nettik", "gnittis")); //prints 4, which doesn't make sense
   }

}</lang>

Part of the problem is that you're not actually computing the distance function correctly. For example, the Levenshtein distance between “stop” and “tops” is 2 (delete the “s” at the front and add an “s” at the end). Because you have to consider inserts and deletes even when the length is the same, you've got a much larger explosion of possibilities. OTOH, there's a algorithm that only has a linear amount of state; see the Tcl solution of the problem for how to do it. –Donal Fellows 16:26, 13 January 2011 (UTC)
I realize this isn't the best algorithm. I was doing this more as an exercise. Thanks for your hints about stop/tops, though. That will be a helpful test case (maybe it should be added ot the task page). --Mwn3d 18:15, 13 January 2011 (UTC)
There's also a far greater number of insertion possibilities than what you have here. You could insert any character from "b" at any position in "a" (including before the start of "a"). --Paul.miner 16:59, 13 January 2011 (UTC)
Thinking about it before i fell asleep last night I realized that I forgot inserting at the start. I guess I will have to re-work some things. Also, I bet the D recursive solution is wrong just like this one is. Thanks for your help. --Mwn3d 18:15, 13 January 2011 (UTC)