Most frequent k chars distance
The string distance function (SDF) that is the subject of this entry was proposed in a paper published in 2014 as a method for quickly estimating how similar two ordered sets or strings are. This SDF has also been proposed as being suitable in bioinformatics, e.g. for comparing sequences in the wp:fasta format.
Unfortunately, the original paper by Sadi Evren Seker et al. (arXiv:1401.6596 [cs.DS]) [1] has a number of internal inconsistencies and for this and other reasons is not entirely clear in several key respects, but the paper does give some worked examples and several of these (notably the two given under the "Worked Examples" heading below) agree with the interpretation used by the authors of the Python entry herein, so that interpretation is used in the present task description.
- The Task
(a) Show a time-efficient implementation of the SDF as if by a function mostFreqKSDF(s1, s2, k, n), where s1 and s2 are arbitrary strings, and k and n are non-negative integers as explained below;
(b) Show the values produced by the SDF for the following values of s1 and s2, with k = 2 and n = 10:
String Inputs | "Top Two" | SDF Output (k=2, n=10) |
---|---|---|
'night'
'nacht' |
n:1 i:1
n:1 a:1 |
8 |
'my'
'a' |
m:1 y:1
a:1 |
10 |
‘research’
‘research’ |
r:2 e:2
r:2 e:2 |
2 |
‘significant’
‘capabilities’ |
i:3 n:2
i:3 a:2 |
4 |
(c) Show the value produced by the SDF for k = 2, n = 100,
and the two strings:
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
- Reference Algorithm
The original paper employs an unusual scheme for encoding the frequencies of single letters as a string of letters and numbers. This scheme has several disadvantages, and so in the following we take a more abstract view and simply assume, solely for the purposes of this description, that two functions are available:
1) MostFreqK(s, k) returns (in decreasing order of their frequency of occurrence) the k most frequently occurring letters in s, with any ties broken by the position of the first occurrence in s;
2) For any letter c in MostFreqK(s, k), Freq(s,c) returns the frequency of c in s.
- MFKsimilarity
Next define MFKsimilarlity(s1, s2, k) as if by the following pseudo-code:
int function MFKsimilarity(string s1, string s2, int k): let similarity = 0 let mfk1 = MostFreqK(s1, k) let mfk2 = MostFreqK(s2, k) for each c in mfk1 if (c occurs in mfk2) then similarity += Freq(s1, c) + Freq(s2, c) end return similarity
- String Distance Wrapper Function
Now define MostFreqKSDF as if by:
int function MostFreqKSDF(string s1, string s2, int k, int n) return n - MFKsimilarity(s1, s2, k)
- Worked Examples
The original paper gives several worked examples, including these two:
(i) MostFreqKSDF("night", "nacht", 2, 10)
For "night", the "top two" letters with their frequencies are: n:1 and i:1.
For "nacht", the "top two" letters with their frequencies are: n:1 and a:1.
Since "n" is the only shared letter amongst these candidates, the SDF is 10 - 1 - 1 = 8, as per the original paper.
(2) MostFreqKSDF( "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV", "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG", 2, 100)
For the first string, the "top two" letters with their frequencies are: L:9 and T:8; and for the second, the "top two" are F:9 and L:8.
Since "L" is the only shared letter amongst these top-k characters, and since the sum of the corresponding frequencies is 9+8 = 17, the SDF is 83.
11l
F most_freq_khashing(inputString, K)
DefaultDict[Char, Int] occuDict
L(c) inputString
occuDict[c]++
V occuList = sorted(occuDict.items(), key' x -> x[1], reverse' 1B)
V outputDict = Dict(occuList[0 .< K])
R outputDict
F most_freq_ksimilarity(inputStr1, inputStr2)
V similarity = 0
L(c, cnt1) inputStr1
I c C inputStr2
V cnt2 = inputStr2[c]
similarity += cnt1 + cnt2
R similarity
F most_freq_ksdf(inputStr1, inputStr2, K, maxDistance)
R maxDistance - most_freq_ksimilarity(most_freq_khashing(inputStr1, K), most_freq_khashing(inputStr2, K))
V str1 = ‘LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV’
V str2 = ‘EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG’
V K = 2
V maxDistance = 100
V dict1 = most_freq_khashing(str1, 2)
print(dict1, end' ":\n")
print(dict1.map((c, cnt) -> c‘’String(cnt)).join(‘’))
V dict2 = most_freq_khashing(str2, 2)
print(dict2, end' ":\n")
print(dict2.map((c, cnt) -> c‘’String(cnt)).join(‘’))
print(most_freq_ksdf(str1, str2, K, maxDistance))
- Output:
[L = 9, T = 8]: L9T8 [F = 9, L = 8]: F9L8 83
C++
#include <string>
#include <vector>
#include <map>
#include <iostream>
#include <algorithm>
#include <utility>
#include <sstream>
std::string mostFreqKHashing ( const std::string & input , int k ) {
std::ostringstream oss ;
std::map<char, int> frequencies ;
for ( char c : input ) {
frequencies[ c ] = std::count ( input.begin( ) , input.end( ) , c ) ;
}
std::vector<std::pair<char , int>> letters ( frequencies.begin( ) , frequencies.end( ) ) ;
std::sort ( letters.begin( ) , letters.end( ) , [input] ( std::pair<char, int> a ,
std::pair<char, int> b ) { char fc = std::get<0>( a ) ; char fs = std::get<0>( b ) ;
int o = std::get<1>( a ) ; int p = std::get<1>( b ) ; if ( o != p ) { return o > p ; }
else { return input.find_first_of( fc ) < input.find_first_of ( fs ) ; } } ) ;
for ( int i = 0 ; i < letters.size( ) ; i++ ) {
oss << std::get<0>( letters[ i ] ) ;
oss << std::get<1>( letters[ i ] ) ;
}
std::string output ( oss.str( ).substr( 0 , 2 * k ) ) ;
if ( letters.size( ) >= k ) {
return output ;
}
else {
return output.append( "NULL0" ) ;
}
}
int mostFreqKSimilarity ( const std::string & first , const std::string & second ) {
int i = 0 ;
while ( i < first.length( ) - 1 ) {
auto found = second.find_first_of( first.substr( i , 2 ) ) ;
if ( found != std::string::npos )
return std::stoi ( first.substr( i , 2 )) ;
else
i += 2 ;
}
return 0 ;
}
int mostFreqKSDF ( const std::string & firstSeq , const std::string & secondSeq , int num ) {
return mostFreqKSimilarity ( mostFreqKHashing( firstSeq , num ) , mostFreqKHashing( secondSeq , num ) ) ;
}
int main( ) {
std::string s1("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" ) ;
std::string s2( "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" ) ;
std::cout << "MostFreqKHashing( s1 , 2 ) = " << mostFreqKHashing( s1 , 2 ) << '\n' ;
std::cout << "MostFreqKHashing( s2 , 2 ) = " << mostFreqKHashing( s2 , 2 ) << '\n' ;
return 0 ;
}
- Output:
MostFreqKHashing( s1 , 2 ) = L9T8 MostFreqKHashing( s2 , 2 ) = F9L8
FreeBASIC
Type Occurrence
char As String
count As Integer
End Type
Sub MostFreqKHashing(inputString As String, K As Integer, outputDict() As Occurrence)
Dim As Integer i, j, c, cnt, index
Dim As Integer occuDict(65 To 90) ' ASCII values for A-Z
For i = 65 To 90: occuDict(i) = 0: Next
For i = 1 To Len(inputString)
c = Asc(Ucase(Mid(inputString, i, 1)))
If c >= 65 And c <= 90 Then occuDict(c) += 1
Next
cnt = 0
For i = 65 To 90
If occuDict(i) > 0 Then cnt += 1
Next
Redim outputDict(cnt - 1)
index = 0
For i = 65 To 90
If occuDict(i) > 0 Then
outputDict(index).char = Chr(i)
outputDict(index).count = occuDict(i)
index += 1
End If
Next
' Sort the outputDict by cnt in descending order
For i = 0 To Ubound(outputDict) - 1
For j = i + 1 To Ubound(outputDict)
If outputDict(j).count > outputDict(i).count Then
Swap outputDict(i), outputDict(j)
End If
Next
Next
' Keep only the top K elements
If Ubound(outputDict) + 1 > K Then Redim Preserve outputDict(K - 1)
End Sub
Function MostFreqKSimilarity(inputStr1() As Occurrence, inputStr2() As Occurrence) As Integer
Dim As Integer i, j, similarity
similarity = 0
For i = 0 To Ubound(inputStr1)
For j = 0 To Ubound(inputStr2)
If inputStr1(i).char = inputStr2(j).char Then
similarity += inputStr1(i).count + inputStr2(j).count
End If
Next
Next
Return similarity
End Function
Function MostFreqKSDF(inputStr1 As String, inputStr2 As String, K As Integer, maxDistance As Integer) As Integer
Dim As Occurrence outputDict1(), outputDict2()
MostFreqKHashing(inputStr1, K, outputDict1())
MostFreqKHashing(inputStr2, K, outputDict2())
Return maxDistance - MostFreqKSimilarity(outputDict1(), outputDict2())
End Function
Sub PrintDict(dict() As Occurrence)
For i As Integer = 0 To Ubound(dict)
Print dict(i).char & dict(i).count;
Next
Print
End Sub
' test cases
Dim As String str1, str2
str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
Dim As Integer K = 2
Dim As Integer maxDistance = 100
Dim As Occurrence dict1(), dict2()
MostFreqKHashing(str1, K, dict1())
Print "dict1: ";
PrintDict(dict1())
MostFreqKHashing(str2, K, dict2())
Print "dict2: ";
PrintDict(dict2())
Print "MostFreqKSDF: "; MostFreqKSDF(str1, str2, K, maxDistance)
Sleep
- Output:
dict1: L9T8 dict2: F9L8 MostFreqKSDF: 83
Go
package main
import (
"fmt"
"sort"
)
type cf struct {
c rune
f int
}
func reverseStr(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
func indexOfCf(cfs []cf, r rune) int {
for i, cf := range cfs {
if cf.c == r {
return i
}
}
return -1
}
func minOf(i, j int) int {
if i < j {
return i
}
return j
}
func mostFreqKHashing(input string, k int) string {
var cfs []cf
for _, r := range input {
ix := indexOfCf(cfs, r)
if ix >= 0 {
cfs[ix].f++
} else {
cfs = append(cfs, cf{r, 1})
}
}
sort.SliceStable(cfs, func(i, j int) bool {
return cfs[i].f > cfs[j].f // descending, preserves order when equal
})
acc := ""
min := minOf(len(cfs), k)
for _, cf := range cfs[:min] {
acc += fmt.Sprintf("%c%c", cf.c, cf.f)
}
return acc
}
func mostFreqKSimilarity(input1, input2 string) int {
similarity := 0
runes1, runes2 := []rune(input1), []rune(input2)
for i := 0; i < len(runes1); i += 2 {
for j := 0; j < len(runes2); j += 2 {
if runes1[i] == runes2[j] {
freq1, freq2 := runes1[i+1], runes2[j+1]
if freq1 != freq2 {
continue // assuming here that frequencies need to match
}
similarity += int(freq1)
}
}
}
return similarity
}
func mostFreqKSDF(input1, input2 string, k, maxDistance int) {
fmt.Println("input1 :", input1)
fmt.Println("input2 :", input2)
s1 := mostFreqKHashing(input1, k)
s2 := mostFreqKHashing(input2, k)
fmt.Printf("mfkh(input1, %d) = ", k)
for i, c := range s1 {
if i%2 == 0 {
fmt.Printf("%c", c)
} else {
fmt.Printf("%d", c)
}
}
fmt.Printf("\nmfkh(input2, %d) = ", k)
for i, c := range s2 {
if i%2 == 0 {
fmt.Printf("%c", c)
} else {
fmt.Printf("%d", c)
}
}
result := maxDistance - mostFreqKSimilarity(s1, s2)
fmt.Printf("\nSDF(input1, input2, %d, %d) = %d\n\n", k, maxDistance, result)
}
func main() {
pairs := [][2]string{
{"research", "seeking"},
{"night", "nacht"},
{"my", "a"},
{"research", "research"},
{"aaaaabbbb", "ababababa"},
{"significant", "capabilities"},
}
for _, pair := range pairs {
mostFreqKSDF(pair[0], pair[1], 2, 10)
}
s1 := "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
s2 := "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
mostFreqKSDF(s1, s2, 2, 100)
s1 = "abracadabra12121212121abracadabra12121212121"
s2 = reverseStr(s1)
mostFreqKSDF(s1, s2, 2, 100)
}
- Output:
input1 : research input2 : seeking mfkh(input1, 2) = r2e2 mfkh(input2, 2) = e2s1 SDF(input1, input2, 2, 10) = 8 input1 : night input2 : nacht mfkh(input1, 2) = n1i1 mfkh(input2, 2) = n1a1 SDF(input1, input2, 2, 10) = 9 input1 : my input2 : a mfkh(input1, 2) = m1y1 mfkh(input2, 2) = a1 SDF(input1, input2, 2, 10) = 10 input1 : research input2 : research mfkh(input1, 2) = r2e2 mfkh(input2, 2) = r2e2 SDF(input1, input2, 2, 10) = 6 input1 : aaaaabbbb input2 : ababababa mfkh(input1, 2) = a5b4 mfkh(input2, 2) = a5b4 SDF(input1, input2, 2, 10) = 1 input1 : significant input2 : capabilities mfkh(input1, 2) = i3n2 mfkh(input2, 2) = i3a2 SDF(input1, input2, 2, 10) = 7 input1 : LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV input2 : EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG mfkh(input1, 2) = L9T8 mfkh(input2, 2) = F9L8 SDF(input1, input2, 2, 100) = 100 input1 : abracadabra12121212121abracadabra12121212121 input2 : 12121212121arbadacarba12121212121arbadacarba mfkh(input1, 2) = 112a10 mfkh(input2, 2) = 112210 SDF(input1, input2, 2, 100) = 88
Haskell
module MostFrequentK
where
import Data.List ( nub , sortBy )
import qualified Data.Set as S
count :: Eq a => [a] -> a -> Int
count [] x = 0
count ( x:xs ) k
|x == k = 1 + count xs k
|otherwise = count xs k
orderedStatistics :: String -> [(Char , Int)]
orderedStatistics s = sortBy myCriterion $ nub $ zip s ( map (\c -> count s c ) s )
where
myCriterion :: (Char , Int) -> (Char , Int) -> Ordering
myCriterion (c1 , n1) (c2, n2)
|n1 > n2 = LT
|n1 < n2 = GT
|n1 == n2 = compare ( found c1 s ) ( found c2 s )
found :: Char -> String -> Int
found e s = length $ takeWhile (/= e ) s
mostFreqKHashing :: String -> Int -> String
mostFreqKHashing s n = foldl ((++)) [] $ map toString $ take n $ orderedStatistics s
where
toString :: (Char , Int) -> String
toString ( c , i ) = c : show i
mostFreqKSimilarity :: String -> String -> Int
mostFreqKSimilarity s t = snd $ head $ S.toList $ S.fromList ( doublets s ) `S.intersection`
S.fromList ( doublets t )
where
toPair :: String -> (Char , Int)
toPair s = ( head s , fromEnum ( head $ tail s ) - 48 )
doublets :: String -> [(Char , Int)]
doublets str = map toPair [take 2 $ drop start str | start <- [0 , 2 ..length str - 2]]
mostFreqKSDF :: String -> String -> Int ->Int
mostFreqKSDF s t n = mostFreqKSimilarity ( mostFreqKHashing s n ) (mostFreqKHashing t n )
- Output:
mostFrequentKHashing "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" 2 "L9T8" *MostFrequentK> mostFrequentKHashing "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" 2 "F9L8"
J
Solution:
NB. String Distance Wrapper Function
mfksDF =: {:@:[ - (mfks@:(mfkh&.>)~ {.)~
NB. Most Frequent K Distance
mfks =: score@:(charMap@:[ {"1 charVals@:])/@:kHashes
score =. ([ +/ .* =)/ NB. (+ +/ .* *.&:*)/ for sum += freq_in_left + freq_in_right
charMap =. (,&< i.&> <@:~.@:,)&;/
charVals =. (; , 0:)"1
kHashes =. 0 1 |: ({.&>~ [: <./ #&>)
NB. Most Frequent K Hashing
mfkh =: _&$: : (takeK freqHash) NB. Default LHA of _ means "return complete frequency table"
takeK =. (<.#) {. ]
freqHash =. ~. (] \:~ ,.&:(<"0)) #/.~
NB. No need to fix mfksDF
mfkh =: mfkh f.
mfks =: mfks f.
Examples:
verb define ''
fkh =. ;@:,@:(":&.>) NB. format k hash
assert. 'r2e2 e2s1' (-: [: fkh 2&mfkh)&>&;: 'research seeking'
assert. 2 = mfks 2 mfkh&.> 'research';'seeking'
assert. 'n1i1 n1a1' (-: [: fkh 2&mfkh)&>&;: 'night nacht'
assert. 9 = 2 10 mfksDF 'night';'nacht'
assert. 'm1y1 a1' (-: [: fkh 2&mfkh)&>&;: 'my a'
assert. 10 = 2 10 mfksDF 'my';'a'
assert. 'r2e2' -: fkh 2 mfkh 'research'
assert. 6 = 2 10 mfksDF 'research';'research' NB. task says 8; right answer is 6
assert. 'a5b4 a5b4' (-: [: fkh 2&mfkh)&>&;: 'aaaaabbbb ababababa'
assert. 1 = 2 10 mfksDF 'aaaaabbbb';'ababababa'
assert. 'i3n2 i3a2' (-: [: fkh 2&mfkh)&>&;: 'significant capabilities'
assert. 7 = 2 10 mfksDF 'significant';'capabilities' NB. task says 5; right answer is 7
assert. 'L9T8 F9L8' (-: [: fkh 2&mfkh)&>&;: 'LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG'
assert. 100 = 2 100 mfksDF 'LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV';'EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG'
NB. task says 83; right answer is 100
'pass'
)
pass
Notes: As of press time, there are significant discrepancies between the task description, its pseudocode, the test cases provided, and the two other existing implementations. See the talk page for the assumptions made in this implementation to reconcile these discrepancies (in particular, in the scoring function).
Java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public final class MostFrequentKCharsDistance {
public static void main(String[] args) {
record Pair(String first, String second) {}
List<Pair> pairs = List.of( new Pair("night", "nacht"), new Pair("my", "a"),
new Pair("research", "research"), new Pair("significant", "capabilities"),
new Pair("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG") );
for ( Pair pair : pairs ) {
System.out.print(pair.first + ", " + pair.second + " -> ");
System.out.print(mostFrequent(pair.first, 2) + ", " + mostFrequent(pair.second, 2) + " -> ");
final int maximum = ( pair.first.length() < 20 ) ? 10 : 100;
System.out.println(mostFrequentKSDF(pair.first, pair.second, 2, maximum));
}
}
private static int mostFrequentKSDF(String text1, String text2, int limit, int maximum) {
return maximum - mostFrequentKsimilarity(text1, text2, limit);
}
private static int mostFrequentKsimilarity(String text1, String text2, int limit) {
int similarity = 0;
Map<Character, Integer> mostFrequent1 = mostFrequent(text1, limit);
Map<Character, Integer> mostFrequent2 = mostFrequent(text2, limit);
for ( Character ch : mostFrequent1.keySet() ) {
if ( mostFrequent2.containsKey(ch) ) {
similarity += mostFrequent1.get(ch) + mostFrequent2.get(ch);
}
}
return similarity;
}
private static Map<Character, Integer> mostFrequent(String text, int limit) {
Map<Character, Integer> charCount = text.chars().boxed().collect(
Collectors.toMap(k -> (char) k.intValue(), v -> 1, Integer::sum, LinkedHashMap::new));
return charCount.entrySet().stream().sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.limit(limit).collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
(oldValue, newValue) -> oldValue, LinkedHashMap::new));
}
}
- Output:
night, nacht -> {n=1, i=1}, {n=1, a=1} -> 8 my, a -> {m=1, y=1}, {a=1} -> 10 research, research -> {r=2, e=2}, {r=2, e=2} -> 2 significant, capabilities -> {i=3, n=2}, {i=3, a=2} -> 4 LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV, EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG -> {L=9, T=8}, {F=9, L=8} -> 83
JavaScript
Defines the functions for counting and frequency outside of the hashing function, as it is easier to calculate the comparison from a structured result rather than parsing a hash.
//returns an object of counts keyed by character
const kCounts = str => {
const counts = {};
for (let char of str) {
counts[char] = counts[char] ? counts[char] + 1 : 1;
}
return counts;
};
//returns an array of length k containing the characters with the highest counts
const frequentK = (counts, k) => {
//note that this is written for clarity rather than speed,
//as it sorts all of the counts when only the top k are needed
return Object.keys(counts)
.sort((a, b) => counts[b] - counts[a])
.slice(0, k);
};
//returns a hashed string of the most frequent k characters and their frequencies
const mostFreqKHashing = (str, k) => {
const counts = kCounts(str);
return frequentK(counts, k)
.map(char => char + counts[char])
.join("");
};
//numeric score of similarity based on the sum of counts of characters appearing in the top k of both strings
const mostFreqKSimilarity = (str1, str2, k) => {
const counts1 = kCounts(str1);
const counts2 = kCounts(str2);
const freq1 = frequentK(counts1, k);
const freq2 = frequentK(counts2, k);
let similarity = 0;
for (let char of freq1) {
//only considers a character if it is in the top k of both strings
if (freq2.includes(char)) {
//the examples on this page are inconsistent in whether the similarity
//should be the sum of the two counts or only the shared count (the minimum of the two)
//this code uses the sum
similarity += counts1[char] + counts2[char];
}
}
return similarity;
};
//subtracts the similarity score from the maxDifference
const mostFreqKSDF = (str1, str2, k, maxDistance) => {
return maxDistance - mostFreqKSimilarity(str1, str2, k);
};
Testing with Jest
test('hash of "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" is "L9T8"', () => {
expect(mostFreqKHashing(str1, 2)).toBe("L9T8");
});
test('hash of "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" is "F9L8"', () => {
expect(mostFreqKHashing(str2, 2)).toBe("F9L8");
});
test("SDF of strings 1 and 2 with k=2 and max=100 is 83", () => {
expect(mostFreqKSDF(str1, str2, 2, 100)).toBe(83);
});
Testing in Console
const str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV";
const str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG";
const K = 2;
const maxDistance = 100;
console.log(mostFreqKHashing(str1, K));
console.log(mostFreqKHashing(str2, K));
console.log(mostFreqKSDF(str1, str2, K, maxDistance));
- Output:
L9T8 F9L8 83
jq
Works with gojq, the Go implementation of jq
Preliminaries
# bag of words
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
# Like sort_by(f) but for items that compare equal, retain the original order
def sort_by_decreasing(f):
# Add $n-$i
def enumerate: . as $in | length as $n | reduce range(0;$n) as $i ([]; . + [ [$n-$i, $in[$i] ] ]);
enumerate
| sort_by((.[1]|f), .[0])
| reverse
| map(.[1]);
The Tasks
# Output: { characters: array_of_characters_in_decreasing_order_of_frequency, frequency: object}
def MostFreqKHashing($K):
def chars: explode | range(0;length) as $i | [.[$i]] | implode;
. as $in
| bow(tostring | chars) as $bow
| $bow
| to_entries
| sort_by_decreasing(.value) # if two chars have same frequency than get the first occurrence in $in
| (reduce .[0:$K][] as $kv ({}; .[$kv.key] = $kv.value) ) as $frequency
| {characters: map(.key)[:$K], $frequency};
def MostFreqKSimilarity($in1; $in2; $K):
[$in1, $in2] | map( MostFreqKHashing($K)) as [$s1, $s2]
| reduce $s1.characters[] as $c (0;
$s2.frequency[$c] as $f
| if $f then . + $s1.frequency[$c] + $f
else . end) ;
def MostFreqKSDF($inputStr1; $inputStr2; $K; $maxDistance):
$maxDistance - MostFreqKSimilarity($inputStr1; $inputStr2; $K);
def MostFreqKSDF($K; $maxDistance):
. as [$inputStr1, $inputStr2]
| $maxDistance - MostFreqKSimilarity($inputStr1; $inputStr2; $K);
def task2:
["night", "nacht"],
["my", "a"],
["research", "research"],
["research", "seeking"],
["significant", "capabilities"]
| MostFreqKSDF(2; 10) as $sdk
| [., $sdk] ;
def task100:
["LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"]
| MostFreqKSDF(2; 100);
task2, task100
- Output:
[["night","nacht"],8] [["my","a"],10] [["research","research"],2] [["research","seeking"],6] [["significant","capabilities"],4] 83
Kotlin
The code for the MostFreqKSimilarity function differs in this task from that in the associated Wikipedia article. Also the description is inconsistent with the code in both cases.
What I've decided to assume is that the frequency for commonly occurring characters must be the same in both strings for it to be added to the 'similarity' variable which seems to be the implication of both descriptions. This gives the same results as the Wikipedia article for all except the last example where it gives 100 rather than 83.
It's also evident that you can't just add the frequency of each character to the output string of the MostFreqKHashing function and then expect to be able to parse it afterwards. This is because, in the general case, any printing characters (including digits) could occur in the input string and the frequency could be more than 9. I've therefore encoded the frequency as the character with the same unicode value rather than the frequency itself.
I have no idea what NULL0 is supposed to mean so I've ignored that.
// version 1.1.51
fun mostFreqKHashing(input: String, k: Int): String =
input.groupBy { it }.map { Pair(it.key, it.value.size) }
.sortedByDescending { it.second } // preserves original order when equal
.take(k)
.fold("") { acc, v -> acc + "${v.first}${v.second.toChar()}" }
fun mostFreqKSimilarity(input1: String, input2: String): Int {
var similarity = 0
for (i in 0 until input1.length step 2) {
for (j in 0 until input2.length step 2) {
if (input1[i] == input2[j]) {
val freq1 = input1[i + 1].toInt()
val freq2 = input2[j + 1].toInt()
if (freq1 != freq2) continue // assuming here that frequencies need to match
similarity += freq1
}
}
}
return similarity
}
fun mostFreqKSDF(input1: String, input2: String, k: Int, maxDistance: Int) {
println("input1 : $input1")
println("input2 : $input2")
val s1 = mostFreqKHashing(input1, k)
val s2 = mostFreqKHashing(input2, k)
print("mfkh(input1, $k) = ")
for ((i, c) in s1.withIndex()) print(if (i % 2 == 0) c.toString() else c.toInt().toString())
print("\nmfkh(input2, $k) = ")
for ((i, c) in s2.withIndex()) print(if (i % 2 == 0) c.toString() else c.toInt().toString())
val result = maxDistance - mostFreqKSimilarity(s1, s2)
println("\nSDF(input1, input2, $k, $maxDistance) = $result\n")
}
fun main(args: Array<String>) {
val pairs = listOf(
Pair("research", "seeking"),
Pair("night", "nacht"),
Pair("my", "a"),
Pair("research", "research"),
Pair("aaaaabbbb", "ababababa"),
Pair("significant", "capabilities")
)
for (pair in pairs) mostFreqKSDF(pair.first, pair.second, 2, 10)
var s1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
var s2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
mostFreqKSDF(s1, s2, 2, 100)
s1 = "abracadabra12121212121abracadabra12121212121"
s2 = s1.reversed()
mostFreqKSDF(s1, s2, 2, 100)
}
- Output:
input1 : research input2 : seeking mfkh(input1, 2) = r2e2 mfkh(input2, 2) = e2s1 SDF(input1, input2, 2, 10) = 8 input1 : night input2 : nacht mfkh(input1, 2) = n1i1 mfkh(input2, 2) = n1a1 SDF(input1, input2, 2, 10) = 9 input1 : my input2 : a mfkh(input1, 2) = m1y1 mfkh(input2, 2) = a1 SDF(input1, input2, 2, 10) = 10 input1 : research input2 : research mfkh(input1, 2) = r2e2 mfkh(input2, 2) = r2e2 SDF(input1, input2, 2, 10) = 6 input1 : aaaaabbbb input2 : ababababa mfkh(input1, 2) = a5b4 mfkh(input2, 2) = a5b4 SDF(input1, input2, 2, 10) = 1 input1 : significant input2 : capabilities mfkh(input1, 2) = i3n2 mfkh(input2, 2) = i3a2 SDF(input1, input2, 2, 10) = 7 input1 : LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV input2 : EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG mfkh(input1, 2) = L9T8 mfkh(input2, 2) = F9L8 SDF(input1, input2, 2, 100) = 100 input1 : abracadabra12121212121abracadabra12121212121 input2 : 12121212121arbadacarba12121212121arbadacarba mfkh(input1, 2) = 112a10 mfkh(input2, 2) = 112210 SDF(input1, input2, 2, 100) = 88
Nim
It is impossible to get consistent results. We chose to implement Wikipedia algorithm (note that the page is no longer available, except in Internet Archive). With this algorithm, we get the same results as those of Wikipedia except for the last example where we get 91 instead of 83.
In order to avoid the limitation to 10 occurrences, we chose to use an ordered table (equivalent to Python OrderedDict). We could have used Kotlin solution or used a sequence of tuples (char, count) but the ordered table gives better performance.
import algorithm, sugar, tables
type CharCounts = OrderedTable[char, int]
func `$`(counts: CharCounts): string =
for (c, count) in counts.pairs:
result.add $c & $count
func mostFreqKHashing(str: string; k: Positive): CharCounts =
var counts: CharCounts # To get the counts in apparition order.
for c in str: inc counts.mgetOrPut(c, 0)
counts.sort((x, y) => cmp(x[1], y[1]), Descending) # Note that sort is stable.
var count = 0
for c, val in counts.pairs:
inc count
result[c] = val
if count == k: break
func mostFreqKSimilarity(input1, input2: CharCounts): int =
for c, count in input1.pairs:
if c in input2:
result += count
func mostFreqKSDF(str1, str2: string; k, maxDistance: Positive): int =
maxDistance - mostFreqKSimilarity(mostFreqKHashing(str1, k), mostFreqKHashing(str2, k))
const
Pairs = [("night", "nacht"),
("my", "a"),
("research", "research"),
("aaaaabbbb", "ababababa"),
("significant", "capabilities")]
for (str1, str2) in Pairs:
echo "str1: ", str1
echo "str2: ", str2
echo "mostFreqKHashing(str1, 2) = ", mostFreqKHashing(str1, 2)
echo "mostFreqKHashing(str2, 2) = ", mostFreqKHashing(str2, 2)
echo "mostFreqKSDF(str1, str2, 2, 10) = ", mostFreqKSDF(str1, str2, 2, 10)
echo()
const
S1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
S2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
echo "str1: ", S1
echo "str2: ", S2
echo "mostFreqKHashing(str1, 2) = ", mostFreqKHashing(S1, 2)
echo "mostFreqKHashing(str2, 2) = ", mostFreqKHashing(S2, 2)
echo "mostFreqKSDF(str1, str2, 2, 100) = ", mostFreqKSDF(S1, S2, 2, 100)
- Output:
str1: night str2: nacht mostFreqKHashing(str1, 2) = n1i1 mostFreqKHashing(str2, 2) = n1a1 mostFreqKSDF(str1, str2, 2, 10) = 9 str1: my str2: a mostFreqKHashing(str1, 2) = m1y1 mostFreqKHashing(str2, 2) = a1 mostFreqKSDF(str1, str2, 2, 10) = 10 str1: research str2: research mostFreqKHashing(str1, 2) = r2e2 mostFreqKHashing(str2, 2) = r2e2 mostFreqKSDF(str1, str2, 2, 10) = 6 str1: aaaaabbbb str2: ababababa mostFreqKHashing(str1, 2) = a5b4 mostFreqKHashing(str2, 2) = a5b4 mostFreqKSDF(str1, str2, 2, 10) = 1 str1: significant str2: capabilities mostFreqKHashing(str1, 2) = i3n2 mostFreqKHashing(str2, 2) = i3a2 mostFreqKSDF(str1, str2, 2, 10) = 7 str1: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV str2: EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG mostFreqKHashing(str1, 2) = L9T8 mostFreqKHashing(str2, 2) = F9L8 mostFreqKSDF(str1, str2, 2, 100) = 91
Perl
#!/usr/bin/perl
use strict ;
use warnings ;
sub mostFreqHashing {
my $inputstring = shift ;
my $howmany = shift ;
my $outputstring ;
my %letterfrequencies = findFrequencies ( $inputstring ) ;
my @orderedChars = sort { $letterfrequencies{$b} <=> $letterfrequencies{$a} ||
index( $inputstring , $a ) <=> index ( $inputstring , $b ) } keys %letterfrequencies ;
for my $i ( 0..$howmany - 1 ) {
$outputstring .= ( $orderedChars[ $i ] . $letterfrequencies{$orderedChars[ $i ]} ) ;
}
return $outputstring ;
}
sub findFrequencies {
my $input = shift ;
my %letterfrequencies ;
for my $i ( 0..length( $input ) - 1 ) {
$letterfrequencies{substr( $input , $i , 1 ) }++ ;
}
return %letterfrequencies ;
}
sub mostFreqKSimilarity {
my $first = shift ;
my $second = shift ;
my $similarity = 0 ;
my %frequencies_first = findFrequencies( $first ) ;
my %frequencies_second = findFrequencies( $second ) ;
foreach my $letter ( keys %frequencies_first ) {
if ( exists ( $frequencies_second{$letter} ) ) {
$similarity += ( $frequencies_second{$letter} + $frequencies_first{$letter} ) ;
}
}
return $similarity ;
}
sub mostFreqKSDF {
(my $input1 , my $input2 , my $k , my $maxdistance ) = @_ ;
return $maxdistance - mostFreqKSimilarity( mostFreqHashing( $input1 , $k) ,
mostFreqHashing( $input2 , $k) ) ;
}
my $firststring = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" ;
my $secondstring = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" ;
print "MostFreqKHashing ( " . '$firststring , 2)' . " is " . mostFreqHashing( $firststring , 2 ) . "\n" ;
print "MostFreqKHashing ( " . '$secondstring , 2)' . " is " . mostFreqHashing( $secondstring , 2 ) . "\n" ;
- Output:
MostFreqKHashing ( $firststring , 2) is L9T8 MostFreqKHashing ( $secondstring , 2) is F9L8
Phix
with javascript_semantics function mostFreqKHashing(string input, integer k) string cfs = "" sequence cfsnx = {} for i=1 to length(input) do integer r = input[i], ix = find(r,cfs) if ix>0 then cfsnx[ix][1] -= 1 else cfs &= r cfsnx = append(cfsnx,{-1,length(cfs)}) end if end for cfsnx = sort(cfsnx) -- (aside: the idx forces stable sort) sequence acc := {} for i=1 to min(length(cfs),k) do integer {n,ix} = cfsnx[i] acc &= {cfs[ix], -n} end for return acc end function function mostFreqKSimilarity(sequence input1, input2) integer similarity := 0 for i=1 to length(input1) by 2 do for j=1 to length(input2) by 2 do if input1[i] == input2[j] then integer freq1 = input1[i+1], freq2 = input2[j+1] if freq1=freq2 then similarity += freq1 end if end if end for end for return similarity end function function flat(sequence s) string res = "" for i=1 to length(s) by 2 do res &= sprintf("%c%d",s[i..i+1]) end for return res end function procedure mostFreqKSDF(string input1, input2, integer k, maxDistance) printf(1,"input1 : %s\n", {input1}) printf(1,"input2 : %s\n", {input2}) sequence s1 := mostFreqKHashing(input1, k), s2 := mostFreqKHashing(input2, k) printf(1,"mfkh(input1, %d) = %s\n", {k,flat(s1)}) printf(1,"mfkh(input2, %d) = %s\n", {k,flat(s2)}) integer result := maxDistance - mostFreqKSimilarity(s1, s2) printf(1,"SDF(input1, input2, %d, %d) = %d\n\n", {k, maxDistance, result}) end procedure constant tests = {{"research", "seeking"}, {"night", "nacht"}, {"my", "a"}, {"research", "research"}, {"aaaaabbbb", "ababababa"}, {"significant", "capabilities"}} for i=1 to length(tests) do string {t1,t2} = tests[i] mostFreqKSDF(t1, t2, 2, 10) end for string s1 := "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV", s2 := "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" mostFreqKSDF(s1, s2, 2, 100) s1 = "abracadabra12121212121abracadabra12121212121" s2 = reverse(s1) mostFreqKSDF(s1, s2, 2, 100)
Output matches Go and Kotlin.
Python
unoptimized and limited
import collections
def MostFreqKHashing(inputString, K):
occuDict = collections.defaultdict(int)
for c in inputString:
occuDict[c] += 1
occuList = sorted(occuDict.items(), key = lambda x: x[1], reverse = True)
outputStr = ''.join(c + str(cnt) for c, cnt in occuList[:K])
return outputStr
#If number of occurrence of the character is not more than 9
def MostFreqKSimilarity(inputStr1, inputStr2):
similarity = 0
for i in range(0, len(inputStr1), 2):
c = inputStr1[i]
cnt1 = int(inputStr1[i + 1])
for j in range(0, len(inputStr2), 2):
if inputStr2[j] == c:
cnt2 = int(inputStr2[j + 1])
similarity += cnt1 + cnt2
break
return similarity
def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
optimized
A version that replaces the intermediate string with OrderedDict to reduce the time complexity of lookup operation:
import collections
def MostFreqKHashing(inputString, K):
occuDict = collections.defaultdict(int)
for c in inputString:
occuDict[c] += 1
occuList = sorted(occuDict.items(), key = lambda x: x[1], reverse = True)
outputDict = collections.OrderedDict(occuList[:K])
#Return OrdredDict instead of string for faster lookup.
return outputDict
def MostFreqKSimilarity(inputStr1, inputStr2):
similarity = 0
for c, cnt1 in inputStr1.items():
#Reduce the time complexity of lookup operation to about O(1).
if c in inputStr2:
cnt2 = inputStr2[c]
similarity += cnt1 + cnt2
return similarity
def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
Test:
str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
K = 2
maxDistance = 100
dict1 = MostFreqKHashing(str1, 2)
print("%s:"%dict1)
print(''.join(c + str(cnt) for c, cnt in dict1.items()))
dict2 = MostFreqKHashing(str2, 2)
print("%s:"%dict2)
print(''.join(c + str(cnt) for c, cnt in dict2.items()))
print(MostFreqKSDF(str1, str2, K, maxDistance))
- Output:
OrderedDict([('L', 9), ('T', 8)]): L9T8 OrderedDict([('F', 9), ('L', 8)]): F9L8 83
Racket
#lang racket
(define (MostFreqKHashing inputString K)
(define t (make-hash))
(for ([c (in-string inputString)] [i (in-naturals)])
(define b (cdr (hash-ref! t c (λ() (cons i (box 0))))))
(set-box! b (add1 (unbox b))))
(define l (for/list ([(k v) (in-hash t)]) (list (car v) k (unbox (cdr v)))))
(map cdr (take (sort (sort l < #:key car) > #:key caddr) K)))
(define (MostFreqKSimilarity inputStr1 inputStr2) ; not strings in this impl.
(for*/sum ([c1 (in-list inputStr1)] [c2 (in-value (assq (car c1) inputStr2))]
#:when c2)
(+ (cadr c1) (cadr c2))))
(define (MostFreqKSDF inputStr1 inputStr2 K maxDistance)
(- maxDistance (MostFreqKSimilarity (MostFreqKHashing inputStr1 K)
(MostFreqKHashing inputStr2 K))))
(MostFreqKSDF
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
2 100)
;; => 83
;; (Should add more tests, but it looks like there's a bunch of mistakes
;; in the given tests...)
Raku
(formerly Perl 6)
My initial impressions of this task are registered on the discussion page under "Prank Page?".
The "most frequent k characters" hashing function is straightforward enough to implement. The distance function is incomprehensible though. The description doesn't match the pseudo-code and neither of them match the examples. I'm not going to bother trying to figure it out unless there is some possible value.
Maybe I am too hasty though. Lets give it a try. Implement a MFKC routine and run an assortment of words through it to get a feel for how it hashes different words.
# Fairly straightforward implementation, actually returns a list of pairs
# which can be joined to make a string or manipulated further.
sub mfkh ($string, \K = 2) {
my %h;
$string.comb.kv.map: { %h{$^v}[1] //= $^k; %h{$^v}[0]++ };
%h.sort( { -$_.value[0], $_.value[1] } ).head(K).map( { $_.key => $_.value[0] } );
}
# lets try running 150 or so words from unixdic.txt through it to see
# how many unique hash values it comes up with.
my @test-words = <
aminobenzoic arginine asinine biennial biennium brigantine brinkmanship
britannic chinquapin clinging corinthian declination dickinson dimension
dinnertime dionysian diophantine dominican financial financier finessing
fingernail fingerprint finnish giovanni hopkinsian inaction inalienable
inanimate incaution incendiary incentive inception incident incidental
incinerate incline inclusion incommunicable incompletion inconceivable
inconclusive incongruity inconsiderable inconsiderate inconspicuous
incontrovertible inconvertible incurring incursion indefinable indemnify
indemnity indeterminacy indian indiana indicant indifferent indigene
indigenous indigent indispensable indochina indochinese indoctrinate
indonesia inequivalent inexplainable infantile inferential inferring
infestation inflammation inflationary influential information infringe
infusion ingenious ingenuity ingestion ingredient inhabitant inhalation
inharmonious inheritance inholding inhomogeneity inkling inoffensive
inordinate inorganic inputting inseminate insensible insincere insinuate
insistent insomnia insomniac insouciant installation instinct instinctual
insubordinate insubstantial insulin insurrection intangible intelligent
intensify intensive interception interruption intestinal intestine
intoxicant introduction introversion intrusion invariant invasion inventive
inversion involution justinian kidnapping kingpin lineprinter liniment
livingston mainline mcginnis minion minneapolis minnie pigmentation
pincushion pinion quinine quintessential resignation ruination seminarian
triennial wilkinson wilmington wilsonian wineskin winnie winnipeg
>;
say @test-words.map( { join '', mfkh($_)».kv } ).Bag;
- Output:
Bag(i2n2(151))
One... Nope, I was right, it is pretty much worthless.
Ring
# Project : Most frequent k chars distance
load "stdlib.ring"
str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
see "Str1 = " + str1 + nl
see "Str2 = " + str2 + nl
MostFreqKHashing(str1,"str1")
MostFreqKHashing(str2,"str2")
func MostFreqKHashing(str3,strp)
chr = newlist(26,2)
for n = 1 to 26
str = char(n+64)
cstr = count(str3,str)
chr[n][1] = str
chr[n][2] = cstr
next
chr = sortsecond(chr)
chr = reverse(chr)
see "MostFreqKHashing(" + strp + ",2) = "
see chr[1][1] + chr[1][2] + chr[2][1] + chr[2][2] + nl
func count(cString,dString)
sum = 0
while substr(cString,dString) > 0
sum++
cString = substr(cString,substr(cString,dString)+len(string(sum)))
end
return sum
func sortsecond(alist)
aList = sort(alist,2)
for n=1 to len(alist)-1
for m=n to len(aList)-1
if alist[m+1][2] = alist[m][2] and alist[m+1][1] < alist[m][1]
temp = alist[m+1]
alist[m+1] = alist[m]
alist[m] = temp ok
next
next
return aList
Output:
Str1 = LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG MostFreqKHashing(str1,2) = L9T8 MostFreqKHashing(str2,2) = F9L8
Sidef
func _MostFreqKHashing(string, k) {
var seen = Hash()
var chars = string.chars
var freq = chars.freq
var schars = freq.keys.sort_by {|c| -freq{c} }
var mfkh = []
for i in ^k {
chars.each { |c|
seen{c} && next
if (freq{c} == freq{schars[i]}) {
seen{c} = true
mfkh << Hash(c => c, f => freq{c})
break
}
}
}
mfkh << (k-seen.len -> of { Hash(c => :NULL, f => 0) }...)
mfkh
}
func MostFreqKSDF(a, b, k, d) {
var mfkh_a = _MostFreqKHashing(a, k);
var mfkh_b = _MostFreqKHashing(b, k);
d - gather {
mfkh_a.each { |s|
s{:c} == :NULL && next
mfkh_b.each { |t|
s{:c} == t{:c} &&
take(s{:f} + (s{:f} == t{:f} ? 0 : t{:f}))
}
}
}.sum
}
func MostFreqKHashing(string, k) {
gather {
_MostFreqKHashing(string, k).each { |h|
take("%s%d" % (h{:c}, h{:f}))
}
}.join
}
var str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
var str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
say "str1 = #{str1.dump}"
say "str2 = #{str2.dump}"
say ''
say("MostFreqKHashing(str1, 2) = ", MostFreqKHashing(str1, 2))
say("MostFreqKHashing(str2, 2) = ", MostFreqKHashing(str2, 2))
say("MostFreqKSDF(str1, str2, 2, 100) = ", MostFreqKSDF(str1, str2, 2, 100))
say ''
var arr = [
%w(night nacht),
%w(my a),
%w(research research),
%w(aaaaabbbb ababababa),
%w(significant capabilities),
]
var k = 2
var limit = 10
for s,t in arr {
"mfkh(%s, %s, #{k}) = [%s, %s] (SDF: %d)\n".printf(
s.dump, t.dump,
MostFreqKHashing(s, k).dump,
MostFreqKHashing(t, k).dump,
MostFreqKSDF(s, t, k, limit),
)
}
- Output:
str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" MostFreqKHashing(str1, 2) = L9T8 MostFreqKHashing(str2, 2) = F9L8 MostFreqKSDF(str1, str2, 2, 100) = 83 mfkh("night", "nacht", 2) = ["n1i1", "n1a1"] (SDF: 9) mfkh("my", "a", 2) = ["m1y1", "a1NULL0"] (SDF: 10) mfkh("research", "research", 2) = ["r2e2", "r2e2"] (SDF: 6) mfkh("aaaaabbbb", "ababababa", 2) = ["a5b4", "a5b4"] (SDF: 1) mfkh("significant", "capabilities", 2) = ["i3n2", "i3a2"] (SDF: 7)
Tcl
package require Tcl 8.6
proc MostFreqKHashing {inputString k} {
foreach ch [split $inputString ""] {dict incr count $ch}
join [lrange [lsort -stride 2 -index 1 -integer -decreasing $count] 0 [expr {$k*2-1}]] ""
}
proc MostFreqKSimilarity {hashStr1 hashStr2} {
while {$hashStr2 ne ""} {
regexp {^(.)(\d+)(.*)$} $hashStr2 -> ch n hashStr2
set lookup($ch) $n
}
set similarity 0
while {$hashStr1 ne ""} {
regexp {^(.)(\d+)(.*)$} $hashStr1 -> ch n hashStr1
if {[info exist lookup($ch)]} {
incr similarity $n
incr similarity $lookup($ch)
}
}
return $similarity
}
proc MostFreqKSDF {inputStr1 inputStr2 k limit} {
set h1 [MostFreqKHashing $inputStr1 $k]
set h2 [MostFreqKHashing $inputStr2 $k]
expr {$limit - [MostFreqKSimilarity $h1 $h2]}
}
Demonstrating:
set str1 "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
set str2 "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
puts [MostFreqKHashing $str1 2]
puts [MostFreqKHashing $str2 2]
puts [MostFreqKSDF $str1 $str2 2 100]
- Output:
L9T8 F9L8 83
- A more efficient metric calculator
This version is appreciably more efficient because it does not compute the intermediate string representation “hash”, instead working directly on the intermediate dictionaries and lists:
proc MostFreqKSDF {inputStr1 inputStr2 k limit} {
set c1 [set c2 {}]
foreach ch [split $inputStr1 ""] {dict incr c1 $ch}
foreach ch [split $inputStr2 ""] {dict incr c2 $ch}
set c2 [lrange [lsort -stride 2 -index 1 -integer -decreasing $c2[set c2 {}]] 0 [expr {$k*2-1}]]
set s 0
foreach {ch n} [lrange [lsort -stride 2 -index 1 -integer -decreasing $c1[set c1 {}]] 0 [expr {$k*2-1}]] {
if {[dict exists $c2 $ch]} {
incr s [expr {$n + [dict get $c2 $ch]}]
}
}
return [expr {$limit - $s}]
}
It computes the identical value on the identical inputs.
Typescript
with added type annotations
//returns an object of counts keyed by character
const kCounts = (str: string): Record<string, number> => {
const counts: Record<string, number> = {};
for (let char of str) {
counts[char] = counts[char] ? counts[char] + 1 : 1;
}
return counts;
};
//returns an array of length k containing the characters with the highest counts
const frequentK = ( counts: Record<string, number>, k: number ): string[] => {
//note that this is written for clarity rather than speed,
//as it sorts all of the counts when only the top k are needed
return Object.keys(counts)
.sort((a, b) => counts[b] - counts[a])
.slice(0, k);
};
//returns a hashed string of the most frequent k characters and their frequencies
const mostFreqKHashing = (str: string, k: number): string => {
const counts = kCounts(str);
return frequentK(counts, k)
.map(char => char + counts[char])
.join("");
};
//numeric score of similarity based on the sum of counts of characters appearing in the top k of both strings
const mostFreqKSimilarity = ( str1: string, str2: string, k: number ): number => {
const counts1 = kCounts(str1);
const counts2 = kCounts(str2);
const freq1 = frequentK(counts1, k);
const freq2 = frequentK(counts2, k);
let similarity = 0;
for (let char of freq1) {
//only considers a character if it is in the top k of both strings
if (freq2.includes(char)) {
//the examples on this page are inconsistent in whether the similarity
//should be the sum of the two counts or only the shared count (the minimum of the two)
//this code uses the sum
similarity += counts1[char] + counts2[char];
}
}
return similarity;
};
//subtracts the similarity score from the maxDifference
const mostFreqKSDF = ( str1: string, str2: string, k: number, maxDistance: number ): number => {
return maxDistance - mostFreqKSimilarity(str1, str2, k);
};
Wren
import "./seq" for Lst
import "./sort" for Sort
import "./iterate" for Stepped
var mostFreqKHashing = Fn.new { |input, k|
var indivs = Lst.individuals(input.toList).map { |indiv| [indiv[0], indiv[1]] }.toList
var cmp = Fn.new { |p1, p2| (p2[1] - p1[1]).sign }
Sort.insertion(indivs, cmp)
return indivs.take(k).reduce("") { |acc, p| acc + "%(p[0])%(String.fromByte(p[1]))" }
}
var mostFreqKSimilarity = Fn.new { |input1, input2|
var similarity = 0
for (i in Stepped.new(0...input1.count, 2)) {
for (j in Stepped.new(0...input2.count, 2)) {
if (input1[i] == input2[j]) {
var freq1 = input1[i + 1].bytes[0]
var freq2 = input2[j + 1].bytes[0]
if (freq1 == freq2) similarity = similarity + freq1
}
}
}
return similarity
}
var mostFreqKSDF = Fn.new { |input1, input2, k, maxDistance|
System.print("input1 : %(input1)")
System.print("input2 : %(input2)")
var s1 = mostFreqKHashing.call(input1, k)
var s2 = mostFreqKHashing.call(input2, k)
System.write("mfkh(input1, %(k)) = ")
var i = 0
for (c in s1) {
System.write((i % 2 == 0) ? c : c.bytes[0])
i = i + 1
}
System.write("\nmfkh(input2, %(k)) = ")
i = 0
for (c in s2) {
System.write((i % 2 == 0) ? c : c.bytes[0])
i = i + 1
}
var result = maxDistance - mostFreqKSimilarity.call(s1, s2)
System.print("\nSDF(input1, input2, %(k), %(maxDistance)) = %(result)\n")
}
var pairs = [
["research", "seeking"],
["night", "nacht"],
["my", "a"],
["research", "research"],
["aaaaabbbb", "ababababa"],
["significant", "capabilities"]
]
for (pair in pairs) mostFreqKSDF.call(pair[0], pair[1], 2, 10)
var s1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
var s2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
mostFreqKSDF.call(s1, s2, 2, 100)
s1 = "abracadabra12121212121abracadabra12121212121"
s2 = s1[-1..0]
mostFreqKSDF.call(s1, s2, 2, 100)
- Output:
Sane as Kotlin entry.
- References
- ↑ SEKER, Sadi E.; Altun, Oguz; Ayan, Ugur; Mert, Cihan (2014), "A Novel String Distance Function based on Most Frequent K Characters", wp:International Journal of Machine Learning and Computing (IJMLC), 4, wp:International Association of Computer Science and Information Technology Press (IACSIT Press), pp. 177-183, https://arxiv.org/abs/1401.6596.