Jump to content

Most frequent k chars distance: Difference between revisions

Try to tidy up
(Try to tidy up)
Line 1:
{{clarify task}}
{{draft task}}{{Wikipedia}}
In [[wp:information theory|information theory]], the '''MostFreqKDistance''' is a [[wp:String metric| string metric]] technique for quickly estimating how [[wp:Similarity measure|similar]] two [[wp:Order theory|ordered sets]] or [[wp:String (computer science)|strings]] are. The scheme was invented by Sadi Evren SEKER,<ref name="mfkc"/> and initially used in [[wp:text mining|text mining]] applications like [[wp:author recognition|author recognition]].<ref name="mfkc">{{citation
| last1 = SEKER | first1 = Sadi E. | author1-link = Sadi Evren SEKER
| last2 = Altun | first2 = Oguz
Line 44 ⟶ 45:
===Most Frequent K Hashing===
The first step of algorithm is calculating the hashing based on the most frequent k characters. The hashing algorithm has below steps:
String '''string function''' ''MostFreqKHashing'' (String'''string''' inputString, '''int''' K)
<lang java>
'''def string''' outputString
String function MostFreqKHashing (String inputString, int K)
'''for each''' districtdistinct characters
def string outputString
'''count''' occurrence of each character
for each district characters
'''for''' i :='''from''' 0 '''to''' K
count occurrence of each character
'''char''' c = '''next''' most freqfrequent ith''i''<sup>th</sup> character (if two chars have same frequency than get the first occurrence in inputString)
for i := 0 to K
'''int''' count = number of occurrence of the character
char c = next most freq ith character (if two chars have same frequency than get the first occurrence in inputString)
'''append''' to ''outputString'', c and count
int count = number of occurrence of the character
'''end for'''
append to outputString, c and count
'''return''' outputString
end for
return outputString
</lang>
 
Aim of 'Most Frequent K Hashing' function is calculating the most count of each character and returning the K most frequent character with the character and count. Rules of hash can be listed as below:
Line 66 ⟶ 65:
===Most Frequent K Distance===
Distance calculation between two strings is based on the hash outputs of two strings.
'''int function''' ''MostFreqKSimilarity'' (String'''string''' inputStr1, String'''string''' inputStr2)
<lang java>
'''def int''' similarity
int function MostFreqKSimilarity (String inputStr1, String inputStr2)
'''for each''' c = '''next''' character '''from''' inputStr1
def int similarity
'''lookup''' c '''in''' inputStr2
for each c = next character from inputStr1
lookup '''if''' c in'''is not inputStr2null'''
similarity '''+=''' frequency of c in inputStr1 + frequency of c in inputStr2
if c is null
'''return''' similarity
continue
similarity += frequency of c in inputStr1 + frequency of c in inputStr2
return similarity
</lang>
Above function, simply gets two input strings, previously outputted from the MostFreqKHashing function. From the most frequent k hashing function, the characters and their frequencies are returned. So, the similarity function calculates the similarity based on characters and their frequencies by checking if the same character appears on both strings and if their frequencies are equal.
 
Line 82 ⟶ 78:
===String Distance Wrapper Function===
In order to calculate the distance between two strings, below function can be implemented
'''int function''' MostFreqKSDF (String'''string''' inputStr1, String'''string''' inputStr2, '''int''' K, '''int''' maxDistance)
<lang java>
'''return''' maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
int function MostFreqKSDF (String inputStr1, String inputStr2, int K, int maxDistance)
return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
</lang>
 
Any call to above string distance function will supply two input strings and a maximum distance value. The function will calculate the similarity and subtract that value from the maximum possible distance. It can be considered as a simple [[wp:additive inverse]] of similarity.
Line 91 ⟶ 85:
==Examples==
Let's consider maximum 2 frequent hashing over two strings ‘research’ and ‘seeking’.
<lang javascript>MostFreqKHashing('research',2) = 'r2e2'</lang>
because we have 2 'r' and 2 'e' characters with the highest frequency and we return in the order they appear in the string.
<lang javascript>MostFreqKHashing('seeking',2) = 'e2s1'</lang>
Again we have character 'e' with highest frequency and rest of the characters have same frequency of 1, so we return the first character of equal frequencies, which is 's'.
Finally we make the comparison:
<lang javascript>MostFreqKSimilarity('r2e2','e2s1') = 2</lang>
We simply compared the outputs and only character occurring in both input is character 'e' and the occurrence in both input is 2.
Instead running the sample step by step as above, we can simply run by using the string distance wrapper function as below:
<lang javascript>MostFreqKSDF('research', 'seeking',2) = 2</lang>
 
Below table holds some sample runs between example inputs for K=2:
Line 140 ⟶ 134:
 
Method is also suitable for bioinformatics to compare the genetic strings like in [[wp:fasta]] format
<pre>
 
Str1= LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
 
Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
 
MostFreqKHashing(str1,2) = L9T8
 
MostFreqKHashing(str2,2) = F9L8
 
MostFreqKSDF(str1,str2,2,100) = 83
</langpre>
 
=Implementations=
=={{header|Java}}==
Translation of the pseudo-code of the Wikipedia article [[wp:Most frequent k characters]] to [[wp:java]] implementation of three functions given in the definition section are given below with [[wp:JavaDoc]] comments:
Line 266 ⟶ 257:
}</lang>
 
==References==
{{reflist}}
Anonymous user
Cookies help us deliver our services. By using our services, you agree to our use of cookies.