Most frequent k chars distance
Revision as of 07:13, 10 April 2016
This page uses content from Wikipedia. The original article was at Most frequent k characters. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)
In information theory, the MostFreqKDistance is a string metric for quickly estimating how similar two ordered sets or strings are. The scheme was invented by Sadi Evren SEKER,[1] and was initially used in text mining applications such as author recognition.[1] The method was originally based on a hashing function, MaxFreqKChars,[2] for the classical author recognition problem, and the idea first came out while studying data stream mining.[3]
Definition
The method has two steps:
- Hash the input strings str1 and str2 separately using MostFreqKHashing, producing hstr1 and hstr2 respectively.
- Calculate the string distance (or string similarity coefficient) of the two hash outputs, hstr1 and hstr2, and output an integer value.
Most Frequent K Hashing
The first step of the algorithm is to calculate the hash based on the most frequent k characters. The hashing algorithm has the following steps:
 string function MostFreqKHashing (string inputString, int K)
     def string outputString
     for each distinct character
         count occurrence of each character
     for i from 0 to K-1
         char c = next most frequent ith character (if two chars have the same frequency, then take the one that occurs first in inputString)
         int count = number of occurrences of the character
         append c and count to outputString
     end for
     return outputString
The aim of the 'Most Frequent K Hashing' function is to count the occurrences of each character and return the K most frequent characters together with their counts. The rules of the hash are:
- The output holds each character followed by its count.
- More frequent characters, with their counts, appear before less frequent ones in the output.
- If two characters have equal frequency, the one appearing first in the input appears first in the output.
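The three rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the name `most_freq_k_hashing` is chosen here for the example, and the sketch simply returns fewer than K entries when the input has fewer than K distinct characters (it does not pad the output).

```python
from collections import Counter

def most_freq_k_hashing(text, k):
    """Return the k most frequent characters of text with their counts.

    Illustrative sketch: ties are broken by first appearance in text.
    """
    counts = Counter(text)
    # Sort by descending count, then by position of first occurrence.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], text.index(ch)))
    # Emit each character followed by its count, most frequent first.
    return "".join(f"{ch}{counts[ch]}" for ch in ranked[:k])

print(most_freq_k_hashing("research", 2))  # r2e2
print(most_freq_k_hashing("seeking", 2))   # e2s1
```

The explicit sort key makes the tie-breaking rule visible instead of relying on the insertion order of the counting dictionary.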
Like most hashing functions, Most Frequent K Hashing is also a wp:one way function.
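One way to see why the hash cannot be inverted is that different inputs can map to the same output. The short Python sketch below (with a hypothetical helper `k_hash`, written only for this illustration) hashes two different strings that share the same character counts.

```python
from collections import Counter

def k_hash(text, k):
    # Hypothetical helper: top-k characters with counts,
    # ties broken by first appearance in text.
    counts = Counter(text)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], text.index(ch)))
    return "".join(f"{ch}{counts[ch]}" for ch in ranked[:k])

first = k_hash("aaaaabbbb", 2)
second = k_hash("ababababa", 2)
# Both inputs collapse to the same hash, so the input cannot be recovered.
print(first, second, first == second)  # a5b4 a5b4 True
```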
Most Frequent K Distance
The distance between two strings is calculated from their hash outputs.
 int function MostFreqKSimilarity (string inputStr1, string inputStr2)
     def int similarity
     for each c = next character from inputStr1
         lookup c in inputStr2
         if c is not null
             similarity += frequency of c in inputStr1 + frequency of c in inputStr2
     return similarity
The function above takes two input strings that were previously output by the MostFreqKHashing function, which returns characters together with their frequencies. The similarity function therefore calculates the similarity from the characters and their frequencies, by checking whether the same character appears in both strings and whether their frequencies are equal.
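As a rough Python sketch, the similarity pseudocode can be read literally as "add both frequencies for every character that occurs in both hashes". The helper names `parse_hash` and `most_freq_k_similarity` are assumptions made for this example, and the parser assumes that the hashed characters themselves are not digits.

```python
import re

def parse_hash(hashed):
    """Split a hash string such as 'L9T8' into {'L': 9, 'T': 8}.

    Assumes hashed characters are non-digits, so multi-digit counts work.
    """
    return {ch: int(n) for ch, n in re.findall(r"(\D)(\d+)", hashed)}

def most_freq_k_similarity(hash1, hash2):
    """Sum both frequencies of every character present in both hashes."""
    freq1, freq2 = parse_hash(hash1), parse_hash(hash2)
    return sum(n + freq2[ch] for ch, n in freq1.items() if ch in freq2)

print(most_freq_k_similarity("L9T8", "F9L8"))  # 17 (L: 9 + 8)
```

Parsing the hash back into a dictionary avoids the nested character scan and keeps working when a count has more than one digit.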
Some implementations require a distance metric instead of a similarity coefficient. To convert the similarity coefficient above into a distance metric, it can be subtracted from a constant value (such as the maximum possible output value). For this purpose it is also possible to implement a wp:wrapper function over the two functions above.
String Distance Wrapper Function
To calculate the distance between two strings, the following function can be implemented:
 int function MostFreqKSDF (string inputStr1, string inputStr2, int K, int maxDistance)
     return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
Any call to the string distance function above supplies two input strings, the value of K, and a maximum distance value. The function calculates the similarity and subtracts it from the maximum possible distance. It can be considered a simple wp:additive inverse of the similarity.
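The wrapper can be sketched in Python as below. This is an illustration under stated assumptions: the names `top_k` and `most_freq_k_sdf` are invented for the example, the intermediate hash is kept as a dictionary instead of a string, and the similarity follows the pseudocode literally (both frequencies are added for each shared character).

```python
from collections import Counter

def top_k(text, k):
    """Hypothetical helper: the k most frequent characters as {char: count},
    ties broken by first appearance in text."""
    counts = Counter(text)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], text.index(ch)))
    return {ch: counts[ch] for ch in ranked[:k]}

def most_freq_k_sdf(str1, str2, k, max_distance):
    """Distance = max_distance minus the similarity of the two top-k hashes."""
    top1, top2 = top_k(str1, k), top_k(str2, k)
    similarity = sum(n + top2[ch] for ch, n in top1.items() if ch in top2)
    return max_distance - similarity

s1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
s2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
print(most_freq_k_sdf(s1, s2, 2, 100))   # 83
print(most_freq_k_sdf("my", "a", 2, 10)) # 10 (no shared characters)
```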
Examples
Consider hashing the two most frequent characters (K=2) of the strings 'research' and 'seeking'.
<lang javascript>MostFreqKHashing('research',2) = 'r2e2'</lang>
because 'r' and 'e' each occur twice, which is the highest frequency, and they are returned in the order in which they first appear in the string.
<lang javascript>MostFreqKHashing('seeking',2) = 'e2s1'</lang>
Again the character 'e' has the highest frequency. The remaining characters all have a frequency of 1, so the first of them, 's', is returned. Finally, the two hashes are compared:
<lang javascript>MostFreqKSimilarity('r2e2','e2s1') = 2</lang>
The only character occurring in both hashes is 'e', and it occurs twice in each. Instead of running the sample step by step as above, the string distance wrapper function can be called directly:
<lang javascript>MostFreqKSDF('research', 'seeking',2) = 2</lang>
The table below holds some sample runs on example inputs for K=2:
Inputs | Hash Outputs | SDF Output (max from 10) |
---|---|---|
'night', 'nacht' | n1i1, n1a1 | 9 |
'my', 'a' | m1y1, a1NULL0 | 10 |
'research', 'research' | r2e2, r2e2 | 8 |
'aaaaabbbb', 'ababababa' | a5b4, a5b4 | 1 |
'significant', 'capabilities' | i3n2, i3a2 | 5 |
The method is also suitable for bioinformatics, for comparing genetic strings such as those in wp:fasta format:
 str1 = LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
 str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
 MostFreqKHashing(str1,2) = L9T8
 MostFreqKHashing(str2,2) = F9L8
 MostFreqKSDF(str1,str2,2,100) = 83
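For sequences stored in FASTA format, where each record starts with a `>` header line followed by one or more sequence lines, the hashes could be computed along the following lines. This is only a sketch: the parser `read_fasta`, the helper `k_hash`, and the record headers `seq1` and `seq2` are hypothetical and were made up for this example.

```python
from collections import Counter

def read_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]          # header line starts a new record
            records[name] = ""
        elif line and name is not None:
            records[name] += line    # sequence may span several lines
    return records

def k_hash(seq, k):
    # Top-k characters with counts, ties broken by first appearance.
    counts = Counter(seq)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], seq.index(ch)))
    return "".join(f"{ch}{counts[ch]}" for ch in ranked[:k])

# Hypothetical FASTA text holding the two sequences from the example above.
FASTA = """>seq1
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQ
MSFWGATVITNLFSAIPYIGTNLV
>seq2
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
"""

hashes = {name: k_hash(seq, 2) for name, seq in read_fasta(FASTA).items()}
print(hashes)
```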
Implementations
C++
<lang cpp>#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <algorithm>
#include <utility>
#include <sstream>

std::string mostFreqKHashing ( const std::string & input , int k ) {
   std::ostringstream oss ;
   std::map<char, int> frequencies ;
   for ( char c : input ) {
      frequencies[ c ] = std::count ( input.begin( ) , input.end( ) , c ) ;
   }
   std::vector<std::pair<char , int>> letters ( frequencies.begin( ) , frequencies.end( ) ) ;
   std::sort ( letters.begin( ) , letters.end( ) , [input] ( std::pair<char, int> a ,
	    std::pair<char, int> b ) {
	 char fc = std::get<0>( a ) ;
	 char fs = std::get<0>( b ) ;
	 int o = std::get<1>( a ) ;
	 int p = std::get<1>( b ) ;
	 if ( o != p ) {
	    return o > p ;
	 }
	 else {
	    return input.find_first_of( fc ) < input.find_first_of ( fs ) ;
	 }
      } ) ;
   for ( int i = 0 ; i < letters.size( ) ; i++ ) {
      oss << std::get<0>( letters[ i ] ) ;
      oss << std::get<1>( letters[ i ] ) ;
   }
   std::string output ( oss.str( ).substr( 0 , 2 * k ) ) ;
   if ( letters.size( ) >= k ) {
      return output ;
   }
   else {
      return output.append( "NULL0" ) ;
   }
}

int mostFreqKSimilarity ( const std::string & first , const std::string & second ) {
   int i = 0 ;
   while ( i < first.length( ) - 1 ) {
      auto found = second.find_first_of( first.substr( i , 2 ) ) ;
      if ( found != std::string::npos )
	 return std::stoi ( first.substr( i , 2 )) ;
      else
	 i += 2 ;
   }
   return 0 ;
}

int mostFreqKSDF ( const std::string & firstSeq , const std::string & secondSeq , int num ) {
   return mostFreqKSimilarity ( mostFreqKHashing( firstSeq , num ) , mostFreqKHashing( secondSeq , num ) ) ;
}

int main( ) {
   std::string s1("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" ) ;
   std::string s2( "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" ) ;
   std::cout << "MostFreqKHashing( s1 , 2 ) = " << mostFreqKHashing( s1 , 2 ) << '\n' ;
   std::cout << "MostFreqKHashing( s2 , 2 ) = " << mostFreqKHashing( s2 , 2 ) << '\n' ;
   return 0 ;
}
</lang>
- Output:
 MostFreqKHashing( s1 , 2 ) = L9T8
 MostFreqKHashing( s2 , 2 ) = F9L8
Haskell
<lang Haskell>module MostFrequentK
   where
import Data.List ( nub , sortBy )
import qualified Data.Set as S

count :: Eq a => [a] -> a -> Int
count [] x = 0
count ( x:xs ) k
   |x == k = 1 + count xs k
   |otherwise = count xs k

orderedStatistics :: String -> [(Char , Int)]
orderedStatistics s = sortBy myCriterion $ nub $ zip s ( map (\c -> count s c ) s )
   where
      myCriterion :: (Char , Int) -> (Char , Int) -> Ordering
      myCriterion (c1 , n1) (c2, n2)
         |n1 > n2 = LT
         |n1 < n2 = GT
         |n1 == n2 = compare ( found c1 s ) ( found c2 s )
      found :: Char -> String -> Int
      found e s = length $ takeWhile (/= e ) s

mostFreqKHashing :: String -> Int -> String
mostFreqKHashing s n = foldl ((++)) [] $ map toString $ take n $ orderedStatistics s
   where
      toString :: (Char , Int) -> String
      toString ( c , i ) = c : show i

mostFreqKSimilarity :: String -> String -> Int
mostFreqKSimilarity s t = snd $ head $ S.toList $ S.fromList ( doublets s ) `S.intersection`
                          S.fromList ( doublets t )
   where
      toPair :: String -> (Char , Int)
      toPair s = ( head s , fromEnum ( head $ tail s ) - 48 )
      doublets :: String -> [(Char , Int)]
      doublets str = map toPair [take 2 $ drop start str | start <- [0 , 2 ..length str - 2]]

mostFreqKSDF :: String -> String -> Int ->Int
mostFreqKSDF s t n = mostFreqKSimilarity ( mostFreqKHashing s n ) (mostFreqKHashing t n )
</lang>
- Output:
 *MostFrequentK> mostFreqKHashing "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" 2
 "L9T8"
 *MostFrequentK> mostFreqKHashing "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" 2
 "F9L8"
J
Solution:<lang j>NB.  String Distance Wrapper Function
mfksDF  =:  {:@:[ - (mfks@:(mfkh&.>)~ {.)~

NB.  Most Frequent K Distance
mfks    =:  score@:(charMap@:[ {"1 charVals@:])/@:kHashes

  score    =.  ([ +/ .* =)/   NB. (+ +/ .* *.&:*)/ for sum += freq_in_left + freq_in_right
  charMap  =.  (,&< i.&> <@:~.@:,)&;/
  charVals =.  (; , 0:)"1
  kHashes  =.  0 1 |: ({.&>~ [: <./ #&>)

NB.  Most Frequent K Hashing
mfkh    =:  _&$: : (takeK freqHash)   NB.  Default LHA of _ means "return complete frequency table"
  takeK    =.  (<.#) {. ]
  freqHash =.  ~. (] \:~ ,.&:(<"0)) #/.~

NB. No need to fix mfksDF
mfkh    =:  mfkh f.
mfks    =:  mfks f.</lang>
Examples:<lang j>verb define ''
  fkh =. ;@:,@:(":&.>)  NB. format k hash

  assert. 'r2e2 e2s1' (-: [: fkh 2&mfkh)&>&;: 'research seeking'
  assert. 2 = mfks 2 mfkh&.> 'research';'seeking'

  assert. 'n1i1 n1a1' (-: [: fkh 2&mfkh)&>&;: 'night nacht'
  assert. 9 = 2 10 mfksDF 'night';'nacht'

  assert. 'm1y1 a1' (-: [: fkh 2&mfkh)&>&;: 'my a'
  assert. 10 = 2 10 mfksDF 'my';'a'

  assert. 'r2e2' -: fkh 2 mfkh 'research'
  assert. 6 = 2 10 mfksDF 'research';'research'  NB. task says 8; right answer is 6

  assert. 'a5b4 a5b4' (-: [: fkh 2&mfkh)&>&;: 'aaaaabbbb ababababa'
  assert. 1 = 2 10 mfksDF 'aaaaabbbb';'ababababa'

  assert. 'i3n2 i3a2' (-: [: fkh 2&mfkh)&>&;: 'significant capabilities'
  assert. 7 = 2 10 mfksDF 'significant';'capabilities'  NB. task says 5; right answer is 7

  assert. 'L9T8 F9L8' (-: [: fkh 2&mfkh)&>&;: 'LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG'
  assert. 100 = 2 100 mfksDF 'LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV';'EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG'  NB. task says 83; right answer is 100

  'pass'
)
pass</lang>
Notes: As of press time, there are significant discrepancies between the task description, its pseudocode, the test cases provided, and the two other existing implementations. See the talk page for the assumptions made in this implementation to reconcile these discrepancies (in particular, in the scoring function).
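To make the scoring discrepancy concrete, the Python sketch below contrasts two possible readings of the scoring step. Both function names are assumptions made for this illustration, and the second reading is only an interpretation of the J scoring verb, inferred from the values asserted in the J examples above.

```python
def score_sum(freq1, freq2):
    """Pseudocode reading: add both frequencies for every shared character."""
    return sum(n + freq2[ch] for ch, n in freq1.items() if ch in freq2)

def score_equal_only(freq1, freq2):
    """Alternative reading: count a shared character once, and only when
    its frequency is the same in both hashes."""
    return sum(n for ch, n in freq1.items() if freq2.get(ch) == n)

# K=2 hashes from the task's table, written as dictionaries.
cases = {
    "night/nacht": ({"n": 1, "i": 1}, {"n": 1, "a": 1}),
    "research/research": ({"r": 2, "e": 2}, {"r": 2, "e": 2}),
    "significant/capabilities": ({"i": 3, "n": 2}, {"i": 3, "a": 2}),
}
# Distances with a maximum of 10 under each reading.
results = {
    name: (10 - score_sum(h1, h2), 10 - score_equal_only(h1, h2))
    for name, (h1, h2) in cases.items()
}
for name, (by_sum, by_equal) in results.items():
    print(name, by_sum, by_equal)
```

Under the first reading the three distances are 8, 2 and 4; under the second they are 9, 6 and 7, which are the values the J examples assert.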
Java
A translation of the pseudocode of the Wikipedia article wp:Most frequent k characters to wp:java. The three functions given in the definition section are implemented below with wp:JavaDoc comments:
<lang java>import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SDF {

    /** Counting the number of occurrences of each character
     * @param character array
     * @return hashmap : Key = char, Value = num of occurrence
     */
    public static HashMap<Character, Integer> countElementOcurrences(char[] array) {

        HashMap<Character, Integer> countMap = new HashMap<Character, Integer>();

        for (char element : array) {
            Integer count = countMap.get(element);
            count = (count == null) ? 1 : count + 1;
            countMap.put(element, count);
        }

        return countMap;
    }

    /**
     * Sorts the counted numbers of characters (keys, values) by java Collection List
     * @param HashMap (with key as character, value as number of occurrences)
     * @return sorted HashMap
     */
    private static <K, V extends Comparable<? super V>>
            HashMap<K, V> descendingSortByValues(HashMap<K, V> map) {

        List<Map.Entry<K, V>> list = new ArrayList<Map.Entry<K, V>>(map.entrySet());
        // Defined Custom Comparator here
        Collections.sort(list, new Comparator<Map.Entry<K, V>>() {
            public int compare(Map.Entry<K, V> o1, Map.Entry<K, V> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        // Here I am copying the sorted list in HashMap
        // using LinkedHashMap to preserve the insertion order
        HashMap<K, V> sortedHashMap = new LinkedHashMap<K, V>();
        for (Map.Entry<K, V> entry : list) {
            sortedHashMap.put(entry.getKey(), entry.getValue());
        }
        return sortedHashMap;
    }

    /**
     * get most frequent k characters
     * @param array of characters
     * @param limit of k
     * @return hashed String
     */
    public static String mostOcurrencesElement(char[] array, int k) {
        HashMap<Character, Integer> countMap = countElementOcurrences(array);
        System.out.println(countMap);
        Map<Character, Integer> map = descendingSortByValues(countMap);
        System.out.println(map);
        int i = 0;
        String output = "";
        for (Map.Entry<Character, Integer> pairs : map.entrySet()) {
            if (i++ >= k)
                break;
            output += "" + pairs.getKey() + pairs.getValue();
        }
        return output;
    }

    /**
     * Calculates the similarity between two input strings
     * @param input string 1
     * @param input string 2
     * @param maximum possible limit value
     * @return distance as integer
     */
    public static int getDiff(String str1, String str2, int limit) {
        int similarity = 0;
        int k = 0;
        for (int i = 0; i < str1.length() ; i = k) {
            k ++;
            if (Character.isLetter(str1.charAt(i))) {
                int pos = str2.indexOf(str1.charAt(i));

                if (pos >= 0) {
                    String digitStr1 = "";
                    while ( k < str1.length() && !Character.isLetter(str1.charAt(k))) {
                        digitStr1 += str1.charAt(k);
                        k++;
                    }

                    int k2 = pos+1;
                    String digitStr2 = "";
                    while (k2 < str2.length() && !Character.isLetter(str2.charAt(k2)) ) {
                        digitStr2 += str2.charAt(k2);
                        k2++;
                    }

                    similarity += Integer.parseInt(digitStr2) + Integer.parseInt(digitStr1);
                }
            }
        }
        return Math.abs(limit - similarity);
    }

    /**
     * Wrapper function
     * @param input string 1
     * @param input string 2
     * @param maximum possible limit value
     * @return distance as integer
     */
    public static int SDFfunc(String str1, String str2, int limit) {
        return getDiff(mostOcurrencesElement(str1.toCharArray(), 2), mostOcurrencesElement(str2.toCharArray(), 2), limit);
    }

    public static void main(String[] args) {
        String input1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV";
        String input2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG";
        System.out.println(SDF.SDFfunc(input1,input2,100));
    }

}</lang>
Perl
<lang Perl>#!/usr/bin/perl
use strict ;
use warnings ;

sub mostFreqHashing {
   my $inputstring = shift ;
   my $howmany = shift ;
   my $outputstring ;
   my %letterfrequencies = findFrequencies ( $inputstring ) ;
   my @orderedChars = sort { $letterfrequencies{$b} <=> $letterfrequencies{$a} ||
      index( $inputstring , $a ) <=> index ( $inputstring , $b ) } keys %letterfrequencies ;
   for my $i ( 0..$howmany - 1 ) {
      $outputstring .= ( $orderedChars[ $i ] . $letterfrequencies{$orderedChars[ $i ]} ) ;
   }
   return $outputstring ;
}

sub findFrequencies {
   my $input = shift ;
   my %letterfrequencies ;
   for my $i ( 0..length( $input ) - 1 ) {
      $letterfrequencies{substr( $input , $i , 1 ) }++ ;
   }
   return %letterfrequencies ;
}

sub mostFreqKSimilarity {
   my $first = shift ;
   my $second = shift ;
   my $similarity = 0 ;
   my %frequencies_first = findFrequencies( $first ) ;
   my %frequencies_second = findFrequencies( $second ) ;
   foreach my $letter ( keys %frequencies_first ) {
      if ( exists ( $frequencies_second{$letter} ) ) {
         $similarity += ( $frequencies_second{$letter} + $frequencies_first{$letter} ) ;
      }
   }
   return $similarity ;
}

sub mostFreqKSDF {
   (my $input1 , my $input2 , my $k , my $maxdistance ) = @_ ;
   return $maxdistance - mostFreqKSimilarity( mostFreqHashing( $input1 , $k) ,
      mostFreqHashing( $input2 , $k) ) ;
}

my $firststring = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" ;
my $secondstring = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" ;
print "MostFreqKHashing ( " . '$firststring , 2)' . " is " . mostFreqHashing( $firststring , 2 ) . "\n" ;
print "MostFreqKHashing ( " . '$secondstring , 2)' . " is " . mostFreqHashing( $secondstring , 2 ) . "\n" ;
</lang>
- Output:
 MostFreqKHashing ( $firststring , 2) is L9T8
 MostFreqKHashing ( $secondstring , 2) is F9L8
Python
unoptimized and limited
<lang python>import collections
def MostFreqKHashing(inputString, K):
    occuDict = collections.defaultdict(int)
    for c in inputString:
        occuDict[c] += 1
    occuList = sorted(occuDict.items(), key = lambda x: x[1], reverse = True)
    outputStr = ''.join(c + str(cnt) for c, cnt in occuList[:K])
    return outputStr

#If number of occurrence of the character is not more than 9
def MostFreqKSimilarity(inputStr1, inputStr2):
    similarity = 0
    for i in range(0, len(inputStr1), 2):
        c = inputStr1[i]
        cnt1 = int(inputStr1[i + 1])
        for j in range(0, len(inputStr2), 2):
            if inputStr2[j] == c:
                cnt2 = int(inputStr2[j + 1])
                similarity += cnt1 + cnt2
                break
    return similarity

def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
    return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))</lang>
optimized
A version that replaces the intermediate string with an OrderedDict to reduce the time complexity of the lookup operation:
<lang python>import collections
def MostFreqKHashing(inputString, K):
    occuDict = collections.defaultdict(int)
    for c in inputString:
        occuDict[c] += 1
    occuList = sorted(occuDict.items(), key = lambda x: x[1], reverse = True)
    outputDict = collections.OrderedDict(occuList[:K])
    #Return OrderedDict instead of string for faster lookup.
    return outputDict

def MostFreqKSimilarity(inputStr1, inputStr2):
    similarity = 0
    for c, cnt1 in inputStr1.items():
        #Reduce the time complexity of lookup operation to about O(1).
        if c in inputStr2:
            cnt2 = inputStr2[c]
            similarity += cnt1 + cnt2
    return similarity

def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
    return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))</lang>
Test:
<lang python>str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
K = 2
maxDistance = 100
dict1 = MostFreqKHashing(str1, 2)
print("%s:"%dict1)
print(''.join(c + str(cnt) for c, cnt in dict1.items()))
dict2 = MostFreqKHashing(str2, 2)
print("%s:"%dict2)
print(''.join(c + str(cnt) for c, cnt in dict2.items()))
print(MostFreqKSDF(str1, str2, K, maxDistance))</lang>
- Output:
 OrderedDict([('L', 9), ('T', 8)]):
 L9T8
 OrderedDict([('F', 9), ('L', 8)]):
 F9L8
 83
Racket
<lang Racket>#lang racket

(define (MostFreqKHashing inputString K)
  (define t (make-hash))
  (for ([c (in-string inputString)] [i (in-naturals)])
    (define b (cdr (hash-ref! t c (λ() (cons i (box 0))))))
    (set-box! b (add1 (unbox b))))
  (define l (for/list ([(k v) (in-hash t)]) (list (car v) k (unbox (cdr v)))))
  (map cdr (take (sort (sort l < #:key car) > #:key caddr) K)))

(define (MostFreqKSimilarity inputStr1 inputStr2) ; not strings in this impl.
  (for*/sum ([c1 (in-list inputStr1)]
             [c2 (in-value (assq (car c1) inputStr2))]
             #:when c2)
    (+ (cadr c1) (cadr c2))))

(define (MostFreqKSDF inputStr1 inputStr2 K maxDistance)
  (- maxDistance (MostFreqKSimilarity (MostFreqKHashing inputStr1 K)
                                      (MostFreqKHashing inputStr2 K))))

(MostFreqKSDF
 "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
 "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
 2 100)
;; => 83

;; (Should add more tests, but it looks like there's a bunch of mistakes
;; in the given tests...)
</lang>
Tcl
<lang tcl>package require Tcl 8.6

proc MostFreqKHashing {inputString k} {
    foreach ch [split $inputString ""] {dict incr count $ch}
    join [lrange [lsort -stride 2 -index 1 -integer -decreasing $count] 0 [expr {$k*2-1}]] ""
}
proc MostFreqKSimilarity {hashStr1 hashStr2} {
    while {$hashStr2 ne ""} {
        regexp {^(.)(\d+)(.*)$} $hashStr2 -> ch n hashStr2
        set lookup($ch) $n
    }
    set similarity 0
    while {$hashStr1 ne ""} {
        regexp {^(.)(\d+)(.*)$} $hashStr1 -> ch n hashStr1
        if {[info exist lookup($ch)]} {
            incr similarity $n
            incr similarity $lookup($ch)
        }
    }
    return $similarity
}
proc MostFreqKSDF {inputStr1 inputStr2 k limit} {
    set h1 [MostFreqKHashing $inputStr1 $k]
    set h2 [MostFreqKHashing $inputStr2 $k]
    expr {$limit - [MostFreqKSimilarity $h1 $h2]}
}</lang>
Demonstrating:
<lang tcl>set str1 "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
set str2 "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
puts [MostFreqKHashing $str1 2]
puts [MostFreqKHashing $str2 2]
puts [MostFreqKSDF $str1 $str2 2 100]</lang>
- Output:
 L9T8
 F9L8
 83
- A more efficient metric calculator
This version is appreciably more efficient because it does not compute the intermediate string representation “hash”, instead working directly on the intermediate dictionaries and lists:
<lang tcl>proc MostFreqKSDF {inputStr1 inputStr2 k limit} {
    set c1 [set c2 {}]
    foreach ch [split $inputStr1 ""] {dict incr c1 $ch}
    foreach ch [split $inputStr2 ""] {dict incr c2 $ch}
    set c2 [lrange [lsort -stride 2 -index 1 -integer -decreasing $c2[set c2 {}]] 0 [expr {$k*2-1}]]
    set s 0
    foreach {ch n} [lrange [lsort -stride 2 -index 1 -integer -decreasing $c1[set c1 {}]] 0 [expr {$k*2-1}]] {
        if {[dict exists $c2 $ch]} {
            incr s [expr {$n + [dict get $c2 $ch]}]
        }
    }
    return [expr {$limit - $s}]
}</lang>
It computes the identical value on the identical inputs.
References
- [1] SEKER, Sadi E.; Altun, Oguz; Ayan, Ugur; Mert, Cihan (2014), "A Novel String Distance Function based on Most Frequent K Characters", International Journal of Machine Learning and Computing (IJMLC), 4, International Association of Computer Science and Information Technology Press (IACSIT Press), pp. 177-183, http://arxiv.org/abs/1401.6596.
- [2] Seker, Sadi E.; Mert, Cihan (2013), "A Novel Feature Hashing For Text Mining", Journal of Technical Science and Technologies, 2, International Black Sea University, pp. 37-41, http://journal.ibsu.edu.ge/index.php/jtst/article/view/428.