Jaro-Winkler distance
You are encouraged to solve this task according to the task description, using any language you may know.
The Jaro-Winkler distance is a metric for measuring the edit distance between words. It is similar to the more basic Levenshtein distance but the Jaro distance also accounts for transpositions between letters in the words. With the Winkler modification to the Jaro metric, the Jaro-Winkler distance also adds an increase in similarity for words which start with the same letters (prefix).
The Jaro-Winkler distance is a modification of the Jaro similarity metric, which measures the similarity between two strings. The Jaro similarity is 1.0 when strings are identical and 0 when strings have no letters in common. Distance measures such as the Jaro distance or Jaro-Winkler distance, on the other hand, are 0 when strings are identical and 1 when they have no letters in common.
The Jaro similarity between two strings s1 and s2, simj, is defined as
- simj = 0 if m is 0.
- simj = ( (m / length of s1) + (m / length of s2) + (m - t) / m ) / 3 otherwise.
Where:
- is the number of matching characters (the same character within max(|s1|, |s2|)/2 - 1 of one another);
- is half the number of transpositions (a shared character placed in different positions).
The Winkler modification to Jaro is to check for identical prefixes of the strings.
If we define the number of initial (prefix) characters in common as:
l = the length of a common prefix between strings, up to 4 characters
and, additionally, select a multiplier (Winkler suggested 0.1) for the relative importance of the prefix for the word similarity:
p = 0.1
The Jaro-Winkler similarity can then be defined as
simw = simj + lp(1 - simj)
Where:
- simj is the Jaro similarity.
- l is the number of matching characters at the beginning of the strings, up to 4.
- p is a factor to modify the amount to which the prefix similarity affects the metric.
Winkler suggested this be 0.1.
The Jaro-Winkler distance between strings, which is 0.0 for identical strings, is then defined as
dw = 1 - simw
String metrics such as Jaro-Winkler distance are useful in applications such as spelling checkers, because letter transpositions are common typing errors and humans tend to misspell the middle portions of words more often than their beginnings. This may help a spelling checker program to generate better alternatives for misspelled word replacement.
- The task
Using a dictionary of your choice and the following list of 9 commonly misspelled words:
"accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich"
- Calculate the Jaro-Winkler distance between the misspelled word and words in the dictionary.
- Use this distance to list close alternatives (at least two per word) to the misspelled words.
- Show the calculated distances between the misspelled words and their potential replacements.
- See also
-
- Wikipedia page: Jaro–Winkler distance.
- Comparing string similarity algorithms. Comparison of algorithms on Medium
11l
V WORDS = File(‘linuxwords.txt’).read_lines()
V MISSPELLINGS = [‘accomodate’,
‘definately’,
‘goverment’]
F jaro_winkler_distance(=st1, =st2)
I st1.len < st2.len
(st1, st2) = (st2, st1)
V len1 = st1.len
V len2 = st2.len
I len2 == 0
R 0.0
V delta = max(0, len2 I/ 2 - 1)
V flag = (0 .< len2).map(_ -> 0B)
[Char] ch1_match
L(ch1) st1
V idx1 = L.index
L(ch2) st2
V idx2 = L.index
I idx2 <= idx1 + delta & idx2 >= idx1 - delta & ch1 == ch2 & !(flag[idx2])
flag[idx2] = 1B
ch1_match.append(ch1)
L.break
V matches = ch1_match.len
I matches == 0
R 1.0
V transpositions = 0
V idx1 = 0
L(ch2) st2
V idx2 = L.index
I flag[idx2]
transpositions += (ch2 != ch1_match[idx1])
idx1++
V jaro = (Float(matches) / len1 + Float(matches) / len2 + (matches - transpositions / 2) / matches) / 3.0
V commonprefix = 0
L(i) 0 .< min(4, len2)
commonprefix += (st1[i] == st2[i])
R 1.0 - (jaro + commonprefix * 0.1 * (1 - jaro))
F within_distance(maxdistance, stri, maxtoreturn)
V arr = :WORDS.filter(w -> jaro_winkler_distance(@stri, w) <= @maxdistance)
arr.sort(key' x -> jaro_winkler_distance(@stri, x))
R I arr.len <= maxtoreturn {arr} E arr[0 .< maxtoreturn]
L(STR) MISSPELLINGS
print("\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to \" "STR" \" are:\n Word | Distance")
L(w) within_distance(0.15, STR, 5)
print(‘#14 | #.4’.format(w, jaro_winkler_distance(STR, w)))
- Output:
Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " accomodate " are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " definately " are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " goverment " are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833
Elm
Author: zh5
module JaroWinkler exposing (similarity)
commonPrefixLength : List a -> List a -> Int -> Int
commonPrefixLength xs ys counter =
case ( xs, ys ) of
( x :: xs_, y :: ys_ ) ->
if x == y then
commonPrefixLength xs_ ys_ (counter + 1)
else
counter
_ ->
counter
similarity : String -> String -> Float
similarity s1 s2 =
let
chars1 =
String.toList s1
chars2 =
String.toList s2
jaroScore =
jaro chars1 chars2
l =
toFloat <| min (commonPrefixLength chars1 chars2 0) 4
p =
0.1
in
jaroScore + (l * p * (1.0 - jaroScore))
containtsInNextN : Int -> a -> List a -> Bool
containtsInNextN i a items =
case ( i, items ) of
( 0, _ ) ->
False
( _, [] ) ->
False
( _, item :: rest ) ->
if item == a then
True
else
containtsInNextN (i - 1) a rest
exists : Int -> Int -> List a -> a -> Bool
exists startAt endAt items i =
if endAt < startAt then
False
else if startAt == 0 then
case items of
first :: rest ->
if i == first then
True
else
exists 0 (endAt - 1) rest i
[] ->
False
else
exists 0 (endAt - startAt) (List.drop startAt items) i
existsInWindow : a -> List a -> Int -> Int -> Bool
existsInWindow item items offset radius =
let
startAt =
max 0 (offset - radius)
endAt =
min (offset + radius) (List.length items - 1)
in
exists startAt endAt items item
transpositions : List a -> List a -> Int -> Int
transpositions xs ys counter =
case ( xs, ys ) of
( [], _ ) ->
counter
( _, [] ) ->
counter
( x :: xs_, y :: ys_ ) ->
if x /= y then
transpositions xs_ ys_ (counter + 1)
else
transpositions xs_ ys_ counter
commonItems : List a -> List a -> Int -> List a
commonItems items1 items2 radius =
items1
|> List.indexedMap
(\index item ->
if existsInWindow item items2 index radius then
[ item ]
else
[]
)
|> List.concat
jaro : List Char -> List Char -> Float
jaro chars1 chars2 =
let
minLenth =
min (List.length chars1) (List.length chars2)
matchRadius =
minLenth // 2 + (minLenth |> modBy 2)
c1 =
commonItems chars1 chars2 matchRadius
c2 =
commonItems chars2 chars1 matchRadius
c1length =
toFloat (List.length c1)
c2length =
toFloat (List.length c2)
mismatches =
transpositions c1 c2 0
transpositionScore =
(toFloat mismatches + abs (c1length - c2length)) / 2.0
s1length =
toFloat (List.length chars1)
s2length =
toFloat (List.length chars2)
tLength =
max c1length c2length
result =
(c1length / s1length + c2length / s2length + (tLength - transpositionScore) / tLength) / 3.0
in
if isNaN result then
0.0
else
result
ALGOL 68
- the actual distance routines are translated from the Wren sample, the file reading and asociative arrays etc. are based on similar Algol 68 task solutions.
Uses unixdict.txt - possibly a different version to those used by some other solutions, as this finds a slightly different list of matches for "seperate" (assuming I got the translation correct!).
Prints the 6 closest matches regarddless of their distance (i.e. we don't restrict it to matches closer that 0.15).
PROC jaro sim = ( STRING sp1, sp2 )REAL:
IF STRING s1 = sp1[ AT 0 ];
STRING s2 = sp2[ AT 0 ];
INT le1 = ( UPB s1 - LWB s1 ) + 1;
INT le2 = ( UPB s2 - LWB s2 ) + 1;
le1 < 1 AND le2 < 1
THEN # both strings are empty # 1
ELIF le1 < 1 OR le2 < 1
THEN # one of the strings is empty # 0
ELSE # both strings are non-empty #
INT dist := IF le2 > le1 THEN le2 ELSE le1 FI;
dist OVERAB 2 -:= 1;
[ 0 : le1 ]BOOL matches1; FOR i FROM LWB matches1 TO UPB matches1 DO matches1[ i ] := FALSE OD;
[ 0 : le2 ]BOOL matches2; FOR i FROM LWB matches2 TO UPB matches2 DO matches2[ i ] := FALSE OD;
INT matches := 0;
INT transpos := 0;
FOR i FROM LWB s1 TO UPB s1 DO
INT start := i - dist;
IF start < 0 THEN start := 0 FI;
INT end := i + dist + 1;
IF end > le2 THEN end := le2 FI;
FOR k FROM start TO end - 1
WHILE IF matches2[ k ]
THEN TRUE
ELIF s1[ i ] /= s2[ k ]
THEN TRUE
ELSE
matches2[ k ] := matches1[ i ] := TRUE;
matches +:= 1;
FALSE
FI
DO SKIP OD
OD;
IF matches = 0
THEN 0
ELSE
INT k := 0;
FOR i FROM LWB s1 TO UPB s1 DO
IF matches1[ i ] THEN
WHILE NOT matches2[ k ] DO k +:= 1 OD;
IF s1[ i ] /= s2[ k ] THEN transpos +:= 1 FI;
k +:= 1
FI
OD;
transpos OVERAB 2;
( ( matches / le1 )
+ ( matches / le2 )
+ ( ( matches - transpos ) / matches )
) / 3
FI
FI # jaro sim # ;
PROC jaro winkler dist = ( STRING sp, tp )REAL:
BEGIN
STRING s = sp[ AT 0 ];
STRING t = tp[ AT 0 ];
INT ls = ( UPB s - LWB s ) + 1;
INT lt = ( UPB t - LWB t ) + 1;
INT l max := IF ls < lt THEN ls ELSE lt FI;
IF l max > 4 THEN l max := 4 FI;
INT l := 0;
FOR i FROM 0 TO l max - 1 DO IF s[ i ] = t[ i ] THEN l +:= 1 FI OD;
REAL js = jaro sim( s, t );
REAL p = 0.1;
REAL ws = js + ( l * p * ( 1 - js ) );
1 - ws
END # jaro winkler dist # ;
# include the Associative Array code #
PR read "aArray.a68" PR
# test cases #
[]STRING misspelt = ( "accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich" );
IF FILE input file;
STRING file name = "unixdict.txt";
open( input file, file name, stand in channel ) /= 0
THEN
# failed to open the file #
print( ( "Unable to open """ + file name + """", newline ) )
ELSE
# file opened OK #
BOOL at eof := FALSE;
# set the EOF handler for the file #
on logical file end( input file, ( REF FILE f )BOOL:
BEGIN
# note that we reached EOF on the #
# latest read #
at eof := TRUE;
# return TRUE so processing can continue #
TRUE
END
);
REF AARRAY words := INIT LOC AARRAY;
STRING word;
WHILE NOT at eof
DO
STRING word;
get( input file, ( word, newline ) );
words // word := ""
OD;
# close the file #
close( input file );
# look for near matches to the misspelt words #
INT max closest = 6; # max number of closest matches to show #
FOR m pos FROM LWB misspelt TO UPB misspelt DO
[ max closest ]STRING closest word;
[ max closest ]REAL closest jwd;
FOR i TO max closest DO closest word[ i ] := ""; closest jwd[ i ] := 999 999 999 OD;
REF AAELEMENT e := FIRST words;
WHILE e ISNT nil element DO
STRING word = key OF e;
REAL jwd = jaro winkler dist( misspelt[ m pos ], word );
BOOL found better match := FALSE;
FOR i TO max closest WHILE NOT found better match DO
IF jwd <= closest jwd[ i ] THEN
# found a new closer match #
found better match := TRUE;
# shuffle the others down 1 and insert the new match #
FOR j FROM max closest BY - 1 TO i + 1 DO
closest word[ j ] := closest word[ j - 1 ];
closest jwd[ j ] := closest jwd[ j - 1 ]
OD;
closest word[ i ] := word;
closest jwd[ i ] := jwd
FI
OD;
e := NEXT words
OD;
print( ( "Misspelt word: ", misspelt[ m pos ], ":", newline ) );
FOR i TO max closest DO
print( ( fixed( closest jwd[ i ], -8, 4 ), " ", closest word[ i ], newline ) )
OD;
print( ( newline ) )
OD
FI
- Output:
Misspelt word: accomodate: 0.0182 accommodate 0.1044 accordant 0.1136 accolade 0.1219 acclimate 0.1327 accompanist 0.1333 accost Misspelt word: definately: 0.0800 define 0.0850 definite 0.0886 defiant 0.1200 definitive 0.1219 designate 0.1267 deflate Misspelt word: goverment: 0.0667 govern 0.1167 governor 0.1175 governess 0.1330 governance 0.1361 coverlet 0.1367 sovereignty Misspelt word: occured: 0.0250 occurred 0.0571 occur 0.0952 occurrent 0.1056 occlude 0.1217 concurred 0.1429 cure Misspelt word: publically: 0.0800 public 0.1325 pullback 0.1327 publication 0.1400 pull 0.1556 pulley 0.1571 publish Misspelt word: recieve: 0.0333 receive 0.0667 relieve 0.0762 reeve 0.0852 recessive 0.0852 receptive 0.0905 recipe Misspelt word: seperate: 0.0708 desperate 0.0917 separate 0.1042 temperate 0.1048 repartee 0.1167 sewerage 0.1167 selenate Misspelt word: untill: 0.0333 until 0.1111 till 0.1333 huntsville 0.1357 instill 0.1422 unital 0.1511 unilateral Misspelt word: wich: 0.0533 witch 0.0533 winch 0.0600 which 0.0857 wichita 0.1111 twitch 0.1111 switch
C++
#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>
auto load_dictionary(const std::string& path) {
std::ifstream in(path);
if (!in)
throw std::runtime_error("Cannot open file " + path);
std::string line;
std::vector<std::string> words;
while (getline(in, line))
words.push_back(line);
return words;
}
double jaro_winkler_distance(std::string str1, std::string str2) {
size_t len1 = str1.size();
size_t len2 = str2.size();
if (len1 < len2) {
std::swap(str1, str2);
std::swap(len1, len2);
}
if (len2 == 0)
return len1 == 0 ? 0.0 : 1.0;
size_t delta = std::max(size_t(1), len1/2) - 1;
std::vector<bool> flag(len2, false);
std::vector<char> ch1_match;
ch1_match.reserve(len1);
for (size_t idx1 = 0; idx1 < len1; ++idx1) {
char ch1 = str1[idx1];
for (size_t idx2 = 0; idx2 < len2; ++idx2) {
char ch2 = str2[idx2];
if (idx2 <= idx1 + delta && idx2 + delta >= idx1
&& ch1 == ch2 && !flag[idx2]) {
flag[idx2] = true;
ch1_match.push_back(ch1);
break;
}
}
}
size_t matches = ch1_match.size();
if (matches == 0)
return 1.0;
size_t transpositions = 0;
for (size_t idx1 = 0, idx2 = 0; idx2 < len2; ++idx2) {
if (flag[idx2]) {
if (str2[idx2] != ch1_match[idx1])
++transpositions;
++idx1;
}
}
double m = matches;
double jaro = (m/len1 + m/len2 + (m - transpositions/2.0)/m)/3.0;
size_t common_prefix = 0;
len2 = std::min(size_t(4), len2);
for (size_t i = 0; i < len2; ++i) {
if (str1[i] == str2[i])
++common_prefix;
}
return 1.0 - (jaro + common_prefix * 0.1 * (1.0 - jaro));
}
auto within_distance(const std::vector<std::string>& words,
double max_distance, const std::string& str,
size_t max_to_return) {
using pair = std::pair<std::string, double>;
std::vector<pair> result;
for (const auto& word : words) {
double jaro = jaro_winkler_distance(word, str);
if (jaro <= max_distance)
result.emplace_back(word, jaro);
}
std::stable_sort(result.begin(), result.end(),
[](const pair& p1, const pair& p2) { return p1.second < p2.second; });
if (result.size() > max_to_return)
result.resize(max_to_return);
return result;
}
int main() {
try {
auto words(load_dictionary("linuxwords.txt"));
std::cout << std::fixed << std::setprecision(4);
for (auto str : {"accomodate", "definately", "goverment",
"occured", "publically", "recieve",
"seperate", "untill", "wich"}) {
std::cout << "Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to '"
<< str << "' are:\n Word | Distance\n";
for (const auto& pair : within_distance(words, 0.15, str, 5)) {
std::cout << std::setw(14) << pair.first << " | "
<< std::setw(6) << pair.second << '\n';
}
std::cout << '\n';
}
} catch (const std::exception& ex) {
std::cerr << ex.what() << '\n';
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
- Output:
Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'definately' are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'goverment' are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'occured' are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'publically' are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'recieve' are: Word | Distance receive | 0.0333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.0667 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'seperate' are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'untill' are: Word | Distance until | 0.0333 untie | 0.1067 untimely | 0.1083 till | 0.1111 Antilles | 0.1264 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'wich' are: Word | Distance witch | 0.0533 which | 0.0600 switch | 0.1111 twitch | 0.1111 witches | 0.1143
F#
This task uses Jaro Distance (F#)
// Calculate Jaro-Winkler Similarity of 2 Strings. Nigel Galloway: August 7th., 2020
let Jw P n g=let L=float(let i=Seq.map2(fun n g->n=g) n g in (if Seq.length i>4 then i|>Seq.take 4 else i)|>Seq.takeWhile id|>Seq.length)
let J=J n g in J+P*L*(1.0-J)
let dict=System.IO.File.ReadAllLines("linuxwords.txt")
let fN g=let N=Jw 0.1 g in dict|>Array.map(fun n->(n,1.0-(N n)))|>Array.sortBy snd
["accomodate";"definately";"goverment";"occured";"publically";"recieve";"seperate";"untill";"wich"]|>
List.iter(fun n->printfn "%s" n;fN n|>Array.take 5|>Array.iter(fun n->printf "%A" n);printfn "\n")
- Output:
accomodate ("accommodate", 0.01818181818)("accommodated", 0.03333333333)("accommodates", 0.03333333333)("accommodation", 0.08153846154)("accommodating", 0.08153846154) definately ("definitely", 0.04)("defiantly", 0.04222222222)("define", 0.08)("definite", 0.085)("definable", 0.08722222222) goverment ("government", 0.05333333333)("govern", 0.06666666667)("governments", 0.0696969697)("governmental", 0.08333333333)("governs", 0.09523809524) occured ("occurred", 0.025)("occur", 0.05714285714)("occupied", 0.07857142857)("occurs", 0.09047619048)("cured", 0.09523809524) publically ("publicly", 0.04)("public", 0.08)("publicity", 0.1044444444)("publication", 0.1327272727)("politically", 0.1418181818) recieve ("receive", 0.03333333333)("received", 0.0625)("receives", 0.0625)("receiver", 0.0625)("relieve", 0.07619047619) seperate ("desperate", 0.0787037037)("separate", 0.09166666667)("separated", 0.1143518519)("separates", 0.1143518519)("temperate", 0.1157407407) untill ("until", 0.03333333333)("untie", 0.1066666667)("untimely", 0.1083333333)("till", 0.1111111111)("Huntsville", 0.1333333333) wich ("witch", 0.05333333333)("which", 0.06)("switch", 0.1111111111)("twitch", 0.1111111111)("witches", 0.1142857143)
Go
This uses unixdict and borrows code from the Jaro_distance#Go task. Otherwise it is a translation of the Wren entry.
package main
import (
"bytes"
"fmt"
"io/ioutil"
"log"
"sort"
)
func jaroSim(str1, str2 string) float64 {
if len(str1) == 0 && len(str2) == 0 {
return 1
}
if len(str1) == 0 || len(str2) == 0 {
return 0
}
match_distance := len(str1)
if len(str2) > match_distance {
match_distance = len(str2)
}
match_distance = match_distance/2 - 1
str1_matches := make([]bool, len(str1))
str2_matches := make([]bool, len(str2))
matches := 0.
transpositions := 0.
for i := range str1 {
start := i - match_distance
if start < 0 {
start = 0
}
end := i + match_distance + 1
if end > len(str2) {
end = len(str2)
}
for k := start; k < end; k++ {
if str2_matches[k] {
continue
}
if str1[i] != str2[k] {
continue
}
str1_matches[i] = true
str2_matches[k] = true
matches++
break
}
}
if matches == 0 {
return 0
}
k := 0
for i := range str1 {
if !str1_matches[i] {
continue
}
for !str2_matches[k] {
k++
}
if str1[i] != str2[k] {
transpositions++
}
k++
}
transpositions /= 2
return (matches/float64(len(str1)) +
matches/float64(len(str2)) +
(matches-transpositions)/matches) / 3
}
func jaroWinklerDist(s, t string) float64 {
ls := len(s)
lt := len(t)
lmax := lt
if ls < lt {
lmax = ls
}
if lmax > 4 {
lmax = 4
}
l := 0
for i := 0; i < lmax; i++ {
if s[i] == t[i] {
l++
}
}
js := jaroSim(s, t)
p := 0.1
ws := js + float64(l)*p*(1-js)
return 1 - ws
}
type wd struct {
word string
dist float64
}
func main() {
misspelt := []string{
"accomodate", "definately", "goverment", "occured", "publically",
"recieve", "seperate", "untill", "wich",
}
b, err := ioutil.ReadFile("unixdict.txt")
if err != nil {
log.Fatal("Error reading file")
}
words := bytes.Fields(b)
for _, ms := range misspelt {
var closest []wd
for _, w := range words {
word := string(w)
if word == "" {
continue
}
jwd := jaroWinklerDist(ms, word)
if jwd < 0.15 {
closest = append(closest, wd{word, jwd})
}
}
fmt.Println("Misspelt word:", ms, ":")
sort.Slice(closest, func(i, j int) bool { return closest[i].dist < closest[j].dist })
for i, c := range closest {
fmt.Printf("%0.4f %s\n", c.dist, c.word)
if i == 5 {
break
}
}
fmt.Println()
}
}
- Output:
Misspelt word: accomodate : 0.0182 accommodate 0.1044 accordant 0.1136 accolade 0.1219 acclimate 0.1327 accompanist 0.1333 accord Misspelt word: definately : 0.0800 define 0.0850 definite 0.0886 defiant 0.1200 definitive 0.1219 designate 0.1267 deflate Misspelt word: goverment : 0.0667 govern 0.1167 governor 0.1175 governess 0.1330 governance 0.1361 coverlet 0.1367 sovereignty Misspelt word: occured : 0.0250 occurred 0.0571 occur 0.0952 occurrent 0.1056 occlude 0.1217 concurred 0.1429 cure Misspelt word: publically : 0.0800 public 0.1327 publication 0.1400 pull 0.1492 pullback Misspelt word: recieve : 0.0333 receive 0.0667 relieve 0.0762 reeve 0.0852 receptive 0.0852 recessive 0.0905 recife Misspelt word: seperate : 0.0708 desperate 0.0917 separate 0.1042 temperate 0.1167 selenate 0.1167 sewerage 0.1167 sept Misspelt word: untill : 0.0333 until 0.1111 till 0.1333 huntsville 0.1357 instill 0.1422 unital Misspelt word: wich : 0.0533 winch 0.0533 witch 0.0600 which 0.0857 wichita 0.1111 switch 0.1111 twitch
J
Implementation:
jaro=: {{
Eq=. (x=/y)*(<.<:-:x>.&#y)>:|x -/&i.&# y
xM=. (+./"1 Eq)#x
yM=. (+./"2 Eq)#y
M=. xM <.&# yM
T=. -: +/ xM ~:&(M&{.) yM
3%~ (M%#x) + (M%#y) + (M-T)%M
}}
jarowinkler=: {{
p=. 0.1
l=. +/*/\x =&((4<.x<.&#y)&{.) y
simj=. x jaro y
-.simj + l*p*-.simj
}}
Task example:
task=: {{
words=. <;._2 fread '/usr/share/dict/words'
for_word. ;:'accomodate definately goverment occured publically recieve seperate untill wich' do.
b=.d<:close=. 2{/:~d=. word jarowinkler every words
echo (;word),':'
echo ' ',.(":,.b#d),.' ',.>b#words
echo''
end.
}}
task''
accomodate:
0.0681818 accommodate
0.0945455 accorporate
0.0703704 commodate
definately:
0.0422222 defiantly
0.0622222 definably
0.0622222 definedly
goverment:
0.0833333 govern
0.0644444 government
0.0944444 governmental
occured:
0.105556 occlude
0.0571429 occur
0.0952381 occursive
publically:
0.08 public
0.0747222 publicity
0.0525 publicly
recieve:
0.0592593 reachieve
0.0333333 receive
0.0392857 recidive
seperate:
0.0145833 separate
0.0405093 separates
0.0458333 septate
untill:
0.0333333 until
0 untill
0.0333333 untrill
wich:
0.04 wicht
0.0533333 winch
0.0533333 witch
Java
import java.io.*;
import java.util.*;
public class JaroWinkler {
public static void main(String[] args) {
try {
List<String> words = loadDictionary("linuxwords.txt");
String[] strings = {
"accomodate", "definately", "goverment", "occured",
"publically", "recieve", "seperate", "untill", "wich"
};
for (String string : strings) {
System.out.printf("Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to '%s' are:\n"
+ " Word | Distance\n", string);
for (StringDistance s : withinDistance(words, 0.15, string, 5)) {
System.out.printf("%14s | %.4f\n", s.word, s.distance);
}
System.out.println();
}
} catch (Exception e) {
e.printStackTrace();
}
}
private static class StringDistance implements Comparable<StringDistance> {
private StringDistance(String word, double distance) {
this.word = word;
this.distance = distance;
}
public int compareTo(StringDistance s) {
return Double.compare(distance, s.distance);
}
private String word;
private double distance;
}
private static List<StringDistance> withinDistance(List<String> words,
double maxDistance, String string, int max) {
List<StringDistance> result = new ArrayList<>();
for (String word : words) {
double distance = jaroWinklerDistance(word, string);
if (distance <= maxDistance)
result.add(new StringDistance(word, distance));
}
Collections.sort(result);
if (result.size() > max)
result = result.subList(0, max);
return result;
}
private static double jaroWinklerDistance(String string1, String string2) {
int len1 = string1.length();
int len2 = string2.length();
if (len1 < len2) {
String s = string1;
string1 = string2;
string2 = s;
int tmp = len1;
len1 = len2;
len2 = tmp;
}
if (len2 == 0)
return len1 == 0 ? 0.0 : 1.0;
int delta = Math.max(1, len1 / 2) - 1;
boolean[] flag = new boolean[len2];
Arrays.fill(flag, false);
char[] ch1Match = new char[len1];
int matches = 0;
for (int i = 0; i < len1; ++i) {
char ch1 = string1.charAt(i);
for (int j = 0; j < len2; ++j) {
char ch2 = string2.charAt(j);
if (j <= i + delta && j + delta >= i && ch1 == ch2 && !flag[j]) {
flag[j] = true;
ch1Match[matches++] = ch1;
break;
}
}
}
if (matches == 0)
return 1.0;
int transpositions = 0;
for (int i = 0, j = 0; j < len2; ++j) {
if (flag[j]) {
if (string2.charAt(j) != ch1Match[i])
++transpositions;
++i;
}
}
double m = matches;
double jaro = (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0;
int commonPrefix = 0;
len2 = Math.min(4, len2);
for (int i = 0; i < len2; ++i) {
if (string1.charAt(i) == string2.charAt(i))
++commonPrefix;
}
return 1.0 - (jaro + commonPrefix * 0.1 * (1.0 - jaro));
}
private static List<String> loadDictionary(String path) throws IOException {
try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
List<String> words = new ArrayList<>();
String word;
while ((word = reader.readLine()) != null)
words.add(word);
return words;
}
}
}
- Output:
Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'definately' are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'goverment' are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'occured' are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'publically' are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'recieve' are: Word | Distance receive | 0.0333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.0667 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'seperate' are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'untill' are: Word | Distance until | 0.0333 untie | 0.1067 untimely | 0.1083 till | 0.1111 Antilles | 0.1264 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'wich' are: Word | Distance witch | 0.0533 which | 0.0600 switch | 0.1111 twitch | 0.1111 witches | 0.1143
jq
Works with gojq, the Go implementation of jq
This entry, which uses unixdict.txt, borrows the implementation in jq of the Jaro similarity measure as defined at Jaro_similarity#jq; since it is quite long, it is not repeated here.
# See [[Jaro_similarity#jq]] for the implementation of jaro/2
def length_of_common_prefix($s1; $s2):
if ($s1|length) > ($s2|length) then length_of_common_prefix($s2; $s1)
else ($s1|explode) as $x1
| ($s2|explode) as $x2
| first( range(0;$x1|length) | select( $x1[.] != $x2[.] )) // ($x1|length)
end;
# Output: the Jaro-WInkler distance using 0.1 as the common-prefix multiplier
def jaro_winkler($s1; $s2):
if $s1 == $s2 then 0
else jaro($s1; $s2) as $j
| length_of_common_prefix($s1[:4]; $s2[:4]) as $l
| 1 - ($j + 0.1 * $l * (1 - $j))
end ;
# Input: an array of words
# Output: [[match, distance] ...]
def candidates($word; $threshold):
map(jaro_winkler($word; . ) as $x | select($x <= $threshold) | [., $x] );
def lpad($len): tostring | ($len - length) as $l | (" " * $l)[:$l] + .;
def task:
[inputs] # the dictionary
| ("accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich") as $word
| candidates($word; 0.15) | sort_by(.[-1]) | .[:5]
| "Matches for \($word|lpad(10)): Distance",
(.[] | "\(.[0] | lpad(21)) : \(.[-1] * 1000 | round / 1000)") ;
task
- Output:
Invocation: jq -rRn -f program.jq unixdict.txt
Matches for accomodate: Distance accommodate : 0.018 accordant : 0.104 accolade : 0.114 acclimate : 0.122 accompanist : 0.133 Matches for definately: Distance define : 0.08 definite : 0.085 defiant : 0.089 definitive : 0.12 deflate : 0.127 Matches for goverment: Distance govern : 0.08 governor : 0.13 governess : 0.133 governance : 0.149 Matches for occured: Distance occurred : 0.025 occur : 0.057 occurrent : 0.095 occlude : 0.106 concurred : 0.122 Matches for publically: Distance public : 0.08 publication : 0.133 Matches for recieve: Distance receive : 0.063 reeve : 0.1 relieve : 0.105 recife : 0.108 recipe : 0.108 Matches for seperate: Distance desperate : 0.079 separate : 0.092 temperate : 0.116 sept : 0.117 septate : 0.131 Matches for untill: Distance until : 0.033 till : 0.111 huntsville : 0.133 unital : 0.142 Matches for wich: Distance winch : 0.107 witch : 0.107 which : 0.12 wichita : 0.126
Julia
# download("http://users.cs.duke.edu/~ola/ap/linuxwords", "linuxwords.txt")
const words = read("linuxwords.txt", String) |> split .|> strip
function jarowinklerdistance(s1, s2)
if length(s1) < length(s2)
s1, s2 = s2, s1
end
len1, len2 = length(s1), length(s2)
len2 == 0 && return 0.0
delta = max(0, len2 ÷ 2 - 1)
flag = zeros(Bool, len2) # flags for possible transpositions, begin as false
ch1_match = eltype(s1)[]
for (i, ch1) in enumerate(s1)
for (j, ch2) in enumerate(s2)
if (j <= i + delta) && (j >= i - delta) && (ch1 == ch2) && !flag[j]
flag[j] = true
push!(ch1_match, ch1)
break
end
end
end
matches = length(ch1_match)
matches == 0 && return 1.0
transpositions, i = 0, 0
for (j, ch2) in enumerate(s2)
if flag[j]
i += 1
transpositions += (ch2 != ch1_match[i])
end
end
jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0
commonprefix = count(i -> s1[i] == s2[i], 1:min(len2, 4))
return 1 - (jaro + commonprefix * 0.1 * (1 - jaro))
end
function closewords(s, maxdistance, maxtoreturn)
jw = 0.0
arr = [(w, jw) for w in words if (jw = jarowinklerdistance(s, w)) <= maxdistance]
sort!(arr, lt=(x, y) -> x[2] < y[2])
return length(arr) <= maxtoreturn ? arr : arr[1:maxtoreturn]
end
for s in ["accomodate", "definately", "goverment", "occured", "publically",
"recieve", "seperate", "untill", "wich"]
println("\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to '$s' are:")
println(" Word | Distance")
for (w, jw) in closewords(s, 0.15, 5)
println(rpad(w, 14), "| ", Float16(jw))
end
end
- Output:
Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are: Word | Distance accommodate | 0.01819 accommodated | 0.03333 accommodates | 0.03333 accommodating | 0.08154 accommodation | 0.08154 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'definately' are: Word | Distance definitely | 0.04 defiantly | 0.04224 define | 0.08 definite | 0.085 definable | 0.0872 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'goverment' are: Word | Distance government | 0.05334 govern | 0.06665 governments | 0.0697 movement | 0.081 governmental | 0.0833 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'occured' are: Word | Distance occurred | 0.025 occur | 0.05713 occupied | 0.07855 occurs | 0.09045 accursed | 0.0917 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'publically' are: Word | Distance publicly | 0.04 public | 0.08 publicity | 0.10443 publication | 0.1327 biblically | 0.14 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'recieve' are: Word | Distance receive | 0.03333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.06665 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'seperate' are: Word | Distance desperate | 0.07086 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'untill' are: Word | Distance until | 0.03333 untie | 0.1067 untimely | 0.10834 Antilles | 0.1263 untidy | 0.1333 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to 'wich' are: Word | Distance witch | 0.05334 which | 0.06 witches | 0.11426 rich | 0.11664 wick | 0.11664
Mathematica /Wolfram Language
ClearAll[JWD]
JWD[a_][b_]:=Experimental`JaroWinklerDistance[a,b]
dict=DictionaryLookup[];
TakeSmallestBy[dict->{"Element","Value"},JWD["accomodate"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["definately"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["goverment"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["occured"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["publically"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["recieve"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["seperate"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["untill"],5]//Grid
TakeSmallestBy[dict->{"Element","Value"},JWD["wich"],5]//Grid
- Output:
accommodate 0.0181818 accommodated 0.0333333 accommodates 0.0333333 accommodation 0.0815385 accommodating 0.0815385 definitely 0.04 defiantly 0.0422222 definably 0.0622222 definitively 0.07 define 0.08 government 0.0422222 governments 0.0585859 govern 0.0666667 governmental 0.0722222 governs 0.0952381 occurred 0.025 occur 0.0571429 occupied 0.0785714 occurs 0.0904762 cured 0.0952381 publicly 0.04 public 0.08 publican 0.085 publicans 0.104444 publicity 0.10444 receive 0.0333333 receives 0.0625 received 0.0625 receiver 0.0625 reeve 0.0761905 desperate 0.0787037 separate 0.0916667 separateness 0.106944 sprat 0.1125 separated 0.114352 until 0.0333333 untiled 0.0904762 untiles 0.0904762 unlit 0.0977778 untypically 0.106061 winch 0.0533333 witch 0.0533333 which 0.06 switch 0.111111 twitch 0.111111
Nim
import lenientops
func jaroSim(s1, s2: string): float =
if s1.len == 0 and s2.len == 0: return 1
if s1.len == 0 or s2.len == 0: return 0
let matchDistance = max(s1.len, s2.len) div 2 - 1
var s1Matches = newSeq[bool](s1.len)
var s2Matches = newSeq[bool](s2.len)
var matches = 0
for i in 0..s1.high:
for j in max(0, i - matchDistance)..min(i + matchDistance, s2.high):
if not s2Matches[j] and s1[i] == s2[j]:
s1Matches[i] = true
s2Matches[j] = true
inc matches
break
if matches == 0: return 0
var transpositions = 0.0
var k = 0
for i in ..s1.high:
if not s1Matches[i]: continue
while not s2Matches[k]: inc k
if s1[i] != s2[k]: transpositions += 0.5
inc k
result = (matches / s1.len + matches / s2.len + (matches - transpositions) / matches) / 3
func jaroWinklerDist(s, t: string): float =
let ls = s.len
let lt = t.len
var lmax = if ls < lt: ls else: lt
if lmax > 4: lmax = 4
var l = 0
for i in 0..<lmax:
if s[i] == t[i]: inc l
let js = jaroSim(s, t)
let p = 0.1
let ws = js + float(l) * p * (1 - js)
result = 1 - ws
when isMainModule:
import algorithm, sequtils, strformat
type Wd = tuple[word: string; dist: float]
func `<`(w1, w2: Wd): bool =
if w1.dist < w2.dist: true
elif w1.dist == w2.dist: w1.word < w2.word
else: false
const Misspelt = ["accomodate", "definately", "goverment", "occured",
"publically", "recieve", "seperate", "untill", "wich"]
let words = toSeq("unixdict.txt".lines)
for ms in Misspelt:
var closest: seq[Wd]
for word in words:
if word.len == 0: continue
let jwd = jaroWinklerDist(ms, word)
if jwd < 0.15:
closest.add (word, jwd)
echo "Misspelt word: ", ms, ":"
closest.sort()
for i, c in closest:
echo &"{c.dist:0.4f} {c.word}"
if i == 5: break
echo()
- Output:
Misspelt word: accomodate: 0.0182 accommodate 0.1044 accordant 0.1136 accolade 0.1219 acclimate 0.1327 accompanist 0.1333 accord Misspelt word: definately: 0.0800 define 0.0850 definite 0.0886 defiant 0.1200 definitive 0.1219 designate 0.1267 deflate Misspelt word: goverment: 0.0667 govern 0.1167 governor 0.1175 governess 0.1330 governance 0.1361 coverlet 0.1367 sovereignty Misspelt word: occured: 0.0250 occurred 0.0571 occur 0.0952 occurrent 0.1056 occlude 0.1217 concurred 0.1429 cure Misspelt word: publically: 0.0800 public 0.1327 publication 0.1400 pull 0.1492 pullback Misspelt word: recieve: 0.0333 receive 0.0667 relieve 0.0762 reeve 0.0852 receptive 0.0852 recessive 0.0905 recife Misspelt word: seperate: 0.0708 desperate 0.0917 separate 0.1042 temperate 0.1167 selenate 0.1167 sept 0.1167 sewerage Misspelt word: untill: 0.0333 until 0.1111 till 0.1333 huntsville 0.1357 instill 0.1422 unital Misspelt word: wich: 0.0533 winch 0.0533 witch 0.0600 which 0.0857 wichita 0.1111 switch 0.1111 twitch
Perl
use strict;
use warnings;
use List::Util qw(min max head);
sub jaro_winkler {
my($s, $t) = @_;
my(@s_matches, @t_matches, $matches);
return 0 if $s eq $t;
my $s_len = length $s; my @s = split //, $s;
my $t_len = length $t; my @t = split //, $t;
my $match_distance = int (max($s_len,$t_len)/2) - 1;
for my $i (0 .. $#s) {
my $start = max(0, $i - $match_distance);
my $end = min($i + $match_distance, $t_len - 1);
for my $j ($start .. $end) {
next if $t_matches[$j] or $s[$i] ne $t[$j];
($s_matches[$i], $t_matches[$j]) = (1, 1);
$matches++ and last;
}
}
return 1 unless $matches;
my($k, $transpositions) = (0, 0);
for my $i (0 .. $#s) {
next unless $s_matches[$i];
$k++ until $t_matches[$k];
$transpositions++ if $s[$i] ne $t[$k];
$k++;
}
my $prefix = 0;
$s[$_] eq $t[$_] and ++$prefix for 0 .. -1 + min 5, $s_len, $t_len;
my $jaro = ($matches / $s_len + $matches / $t_len +
(($matches - $transpositions / 2) / $matches)) / 3;
1 - ($jaro + $prefix * .1 * ( 1 - $jaro) )
}
my @words = split /\n/, `cat ./unixdict.txt`;
for my $word (<accomodate definately goverment occured publically recieve seperate untill wich>) {
my %J;
$J{$_} = jaro_winkler($word, $_) for @words;
print "\nClosest 5 dictionary words with a Jaro-Winkler distance < .15 from '$word':\n";
printf "%15s : %0.4f\n", $_, $J{$_}
for head 5, sort { $J{$a} <=> $J{$b} or $a cmp $b } grep { $J{$_} < 0.15 } keys %J;
}
- Output:
Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'accomodate': accommodate : 0.0152 accordant : 0.1044 accompanist : 0.1106 accolade : 0.1136 accompaniment : 0.1183 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'definately': define : 0.0667 definite : 0.0708 defiant : 0.0886 definitive : 0.1000 designate : 0.1219 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'goverment': govern : 0.0556 governor : 0.0972 governess : 0.0979 governance : 0.1108 coverlet : 0.1167 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'occured': occurred : 0.0208 occur : 0.0476 occurrent : 0.0794 occlude : 0.1056 occurring : 0.1217 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'publically': public : 0.0667 publication : 0.1106 publish : 0.1310 pull : 0.1400 pullback : 0.1492 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'recieve': receive : 0.0333 relieve : 0.0571 reeve : 0.0667 receptive : 0.0852 recessive : 0.0852 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'seperate': desperate : 0.0708 separate : 0.0786 sewerage : 0.1000 repartee : 0.1083 repeater : 0.1083 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'untill': until : 0.0278 till : 0.1111 tilt : 0.1111 huntsville : 0.1333 instill : 0.1357 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from 'wich': winch : 0.0533 witch : 0.0533 which : 0.0600 wichita : 0.0857 switch : 0.1111
Phix
Uses jaro() from Jaro_distance#Phix (reproduced below for your convenience) and the standard unix_dict()
function jaro(string str1, str2) str1 = trim(upper(str1)) str2 = trim(upper(str2)) integer len1 = length(str1), len2 = length(str2), match_distance = floor(max(len1,len2)/2)-1, match_count = 0, half_transposed = 0 if len1==0 then return len2==0 end if -- count the number of matches sequence m1 = repeat(false,len1), m2 = repeat(false,len2) for i=1 to len1 do for k=max(1,i-match_distance) to min(len2,i+match_distance) do if not m2[k] then if str1[i]=str2[k] then m1[i] = true m2[k] = true match_count += 1 exit end if end if end for end for if match_count==0 then return 0 end if -- count the number of half-transpositions integer k = 1 for i=1 to len1 do if m1[i] then while not m2[k] do k += 1 end while half_transposed += (str1[i]!=str2[k]) k += 1 end if end for integer transpositions = floor(half_transposed/2), not_transposed = match_count - transpositions -- -- return the average of: -- percentage/fraction of the first string matched, -- percentage/fraction of the second string matched, and -- percentage/fraction of matches that were not transposed. -- return (match_count/len1 + match_count/len2 + not_transposed/match_count)/3 end function with javascript_semantics function jaroWinklerDist(string s, t) integer lm = min({length(s),length(t),4}), l = sum(sq_eq(s[1..lm],t[1..lm])) return (1-jaro(s, t))*(1-l*0.1) end function constant mispelt = {"accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich"}, words = unix_dict() sequence jwds = repeat(0,length(words)) for m=1 to length(mispelt) do string ms = mispelt[m] printf(1,"\nMisspelt word: %s :\n", ms) for w=1 to length(words) do jwds[w] = jaroWinklerDist(ms,words[w]) end for sequence tags = custom_sort(jwds,tagset(length(words))) for j=1 to 6 do integer tj = tags[j] -- if jwds[tj]>0.15 then exit end if printf(1,"%0.4f %s\n", {jwds[tj], words[tj]}) end for end for
Output identical to Go/Wren Algol68
Python
"""
Test Jaro-Winkler distance metric.
linuxwords.txt is from http://users.cs.duke.edu/~ola/ap/linuxwords
"""
WORDS = [s.strip() for s in open("linuxwords.txt").read().split()]
MISSPELLINGS = [
"accomodate",
"definately",
"goverment",
"occured",
"publically",
"recieve",
"seperate",
"untill",
"wich",
]
def jaro_winkler_distance(st1, st2):
"""
Compute Jaro-Winkler distance between two strings.
"""
if len(st1) < len(st2):
st1, st2 = st2, st1
len1, len2 = len(st1), len(st2)
if len2 == 0:
return 0.0
delta = max(0, len2 // 2 - 1)
flag = [False for _ in range(len2)] # flags for possible transpositions
ch1_match = []
for idx1, ch1 in enumerate(st1):
for idx2, ch2 in enumerate(st2):
if idx2 <= idx1 + delta and idx2 >= idx1 - delta and ch1 == ch2 and not flag[idx2]:
flag[idx2] = True
ch1_match.append(ch1)
break
matches = len(ch1_match)
if matches == 0:
return 1.0
transpositions, idx1 = 0, 0
for idx2, ch2 in enumerate(st2):
if flag[idx2]:
transpositions += (ch2 != ch1_match[idx1])
idx1 += 1
jaro = (matches / len1 + matches / len2 + (matches - transpositions/2) / matches) / 3.0
commonprefix = 0
for i in range(min(4, len2)):
commonprefix += (st1[i] == st2[i])
return 1.0 - (jaro + commonprefix * 0.1 * (1 - jaro))
def within_distance(maxdistance, stri, maxtoreturn):
"""
Find words in WORDS of closeness to stri within maxdistance, return up to maxreturn of them.
"""
arr = [w for w in WORDS if jaro_winkler_distance(stri, w) <= maxdistance]
arr.sort(key=lambda x: jaro_winkler_distance(stri, x))
return arr if len(arr) <= maxtoreturn else arr[:maxtoreturn]
for STR in MISSPELLINGS:
print('\nClose dictionary words ( distance < 0.15 using Jaro-Winkler distance) to "',
STR, '" are:\n Word | Distance')
for w in within_distance(0.15, STR, 5):
print('{:>14} | {:6.4f}'.format(w, jaro_winkler_distance(STR, w)))
- Output:
Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " accomodate " are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " definately " are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " goverment " are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " occured " are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " publically " are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " recieve " are: Word | Distance receive | 0.0333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.0667 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " seperate " are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " untill " are: Word | Distance until | 0.0333 untie | 0.1067 untimely | 0.1083 Antilles | 0.1264 untidy | 0.1333 Close dictionary words ( distance < 0.15 using Jaro-Winkler distance) to " wich " are: Word | Distance witch | 0.0533 which | 0.0600 witches | 0.1143 rich | 0.1167 wick | 0.1167
Raku
A minor modification of the Jaro distance task entry.
using the unixdict.txt file from www.puzzlers.org
sub jaro-winkler ($s, $t) {
return 0 if $s eq $t;
my $s_len = + my @s = $s.comb;
my $t_len = + my @t = $t.comb;
my $match_distance = ($s_len max $t_len) div 2 - 1;
my @s_matches;
my @t_matches;
my $matches = 0;
for ^@s -> $i {
my $start = 0 max $i - $match_distance;
my $end = $i + $match_distance min ($t_len - 1);
for $start .. $end -> $j {
@t_matches[$j] and next;
@s[$i] eq @t[$j] or next;
@s_matches[$i] = 1;
@t_matches[$j] = 1;
$matches++;
last;
}
}
return 1 if $matches == 0;
my $k = 0;
my $transpositions = 0;
for ^@s -> $i {
@s_matches[$i] or next;
until @t_matches[$k] { ++$k }
@s[$i] eq @t[$k] or ++$transpositions;
++$k;
}
my $prefix = 0;
++$prefix if @s[$_] eq @t[$_] for ^(min 4, $s_len, $t_len);
my $jaro = ($matches / $s_len + $matches / $t_len +
(($matches - $transpositions / 2) / $matches)) / 3;
1 - ($jaro + $prefix * .1 * ( 1 - $jaro) )
}
my @words = './unixdict.txt'.IO.slurp.words;
for <accomodate definately goverment occured publically recieve seperate untill wich>
-> $word {
my %result = @words.race.map: { $_ => jaro-winkler($word, $_) };
say "\nClosest 5 dictionary words with a Jaro-Winkler distance < .15 from $word:";
printf "%15s : %0.4f\n", .key, .value for %result.grep({ .value < .15 }).sort({+.value, ~.key}).head(5);
}
- Output:
Closest 5 dictionary words with a Jaro-Winkler distance < .15 from accomodate: accommodate : 0.0182 accordant : 0.1044 accolade : 0.1136 acclimate : 0.1219 accompanist : 0.1327 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from definately: define : 0.0800 definite : 0.0850 defiant : 0.0886 definitive : 0.1200 designate : 0.1219 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from goverment: govern : 0.0667 governor : 0.1167 governess : 0.1175 governance : 0.1330 coverlet : 0.1361 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from occured: occurred : 0.0250 occur : 0.0571 occurrent : 0.0952 occlude : 0.1056 concurred : 0.1217 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from publically: public : 0.0800 publication : 0.1327 pull : 0.1400 pullback : 0.1492 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from recieve: receive : 0.0333 relieve : 0.0667 reeve : 0.0762 receptive : 0.0852 recessive : 0.0852 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from seperate: desperate : 0.0708 separate : 0.0917 temperate : 0.1042 selenate : 0.1167 sept : 0.1167 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from untill: until : 0.0333 till : 0.1111 huntsville : 0.1333 instill : 0.1357 unital : 0.1422 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from wich: winch : 0.0533 witch : 0.0533 which : 0.0600 wichita : 0.0857 switch : 0.1111
Rust
use std::fs::File;
use std::io::{self, BufRead};
fn load_dictionary(filename: &str) -> std::io::Result<Vec<String>> {
let file = File::open(filename)?;
let mut dict = Vec::new();
for line in io::BufReader::new(file).lines() {
dict.push(line?);
}
Ok(dict)
}
fn jaro_winkler_distance(string1: &str, string2: &str) -> f64 {
let mut st1 = string1;
let mut st2 = string2;
let mut len1 = st1.chars().count();
let mut len2 = st2.chars().count();
if len1 < len2 {
std::mem::swap(&mut st1, &mut st2);
std::mem::swap(&mut len1, &mut len2);
}
if len2 == 0 {
return if len1 == 0 { 0.0 } else { 1.0 };
}
let delta = std::cmp::max(1, len1 / 2) - 1;
let mut flag = vec![false; len2];
let mut ch1_match = vec![];
for (idx1, ch1) in st1.chars().enumerate() {
for (idx2, ch2) in st2.chars().enumerate() {
if idx2 <= idx1 + delta && idx2 + delta >= idx1 && ch1 == ch2 && !flag[idx2] {
flag[idx2] = true;
ch1_match.push(ch1);
break;
}
}
}
let matches = ch1_match.len();
if matches == 0 {
return 1.0;
}
let mut transpositions = 0;
let mut idx1 = 0;
for (idx2, ch2) in st2.chars().enumerate() {
if flag[idx2] {
transpositions += (ch2 != ch1_match[idx1]) as i32;
idx1 += 1;
}
}
let m = matches as f64;
let jaro =
(m / (len1 as f64) + m / (len2 as f64) + (m - (transpositions as f64) / 2.0) / m) / 3.0;
let mut commonprefix = 0;
for (c1, c2) in st1.chars().zip(st2.chars()).take(std::cmp::min(4, len2)) {
commonprefix += (c1 == c2) as i32;
}
1.0 - (jaro + commonprefix as f64 * 0.1 * (1.0 - jaro))
}
fn within_distance<'a>(
dict: &'a Vec<String>,
max_distance: f64,
stri: &str,
max_to_return: usize,
) -> Vec<(&'a String, f64)> {
let mut arr: Vec<(&String, f64)> = dict
.iter()
.map(|w| (w, jaro_winkler_distance(stri, w)))
.filter(|x| x.1 <= max_distance)
.collect();
// The trait std::cmp::Ord is not implemented for f64, otherwise
// we could just do this:
// arr.sort_by_key(|x| x.1);
let compare_distance = |d1, d2| {
use std::cmp::Ordering;
if d1 < d2 {
Ordering::Less
} else if d1 > d2 {
Ordering::Greater
} else {
Ordering::Equal
}
};
arr.sort_by(|x, y| compare_distance(x.1, y.1));
arr[0..std::cmp::min(max_to_return, arr.len())].to_vec()
}
fn main() {
match load_dictionary("linuxwords.txt") {
Ok(dict) => {
for word in &[
"accomodate",
"definately",
"goverment",
"occured",
"publically",
"recieve",
"seperate",
"untill",
"wich",
] {
println!("Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to '{}' are:", word);
println!(" Word | Distance");
for (w, dist) in within_distance(&dict, 0.15, word, 5) {
println!("{:>14} | {:6.4}", w, dist)
}
println!();
}
}
Err(error) => eprintln!("{}", error),
}
}
- Output:
Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'definately' are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'goverment' are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'occured' are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'publically' are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'recieve' are: Word | Distance receive | 0.0333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.0667 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'seperate' are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'untill' are: Word | Distance until | 0.0333 untie | 0.1067 untimely | 0.1083 till | 0.1111 Antilles | 0.1264 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'wich' are: Word | Distance witch | 0.0533 which | 0.0600 switch | 0.1111 twitch | 0.1111 witches | 0.1143
Sidef
Reusing the jaro() function from the Jaro similarity task.
func jaro_winkler_distance(s,t) {
var jaro_similarity = jaro(s, t)
var prefix = 0
for i in (0 .. min(3, s.len, t.len)) {
s.char(i) == t.char(i) ? ++prefix : break
}
1 - (prefix * 0.1 * (1 - jaro_similarity) + jaro_similarity)
}
# usage:
# sidef script.sf < unixdict.txt
var words = ARGF.slurp.words
%w(accomodate definately goverment occured publically recieve seperate untill wich).each {|word|
var result = Hash(words.map { (_, jaro_winkler_distance(word, _)) }...)
say "\nClosest 5 dictionary words with a Jaro-Winkler distance < .15 from #{word}:"
result.grep {|_,v| v < .15 }.sort_by{|_,v| v }.head(5).each_2d {|k,v|
printf("%15s : %0.4f\n", k, v)
}
}
- Output:
Closest 5 dictionary words with a Jaro-Winkler distance < .15 from accomodate: accommodate : 0.0182 accordant : 0.1044 accolade : 0.1136 acclimate : 0.1219 accompanist : 0.1327 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from definately: define : 0.0800 definite : 0.0850 defiant : 0.0886 definitive : 0.1200 deflate : 0.1267 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from goverment: govern : 0.0667 governor : 0.1167 governess : 0.1175 governance : 0.1330 goer : 0.1481 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from occured: occurred : 0.0250 occur : 0.0571 occurrent : 0.0952 occlude : 0.1056 concurred : 0.1217 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from publically: public : 0.0800 publication : 0.1327 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from recieve: receive : 0.0333 reeve : 0.0762 relieve : 0.0762 recessive : 0.0852 receptive : 0.0852 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from seperate: desperate : 0.0787 separate : 0.0917 temperate : 0.1157 sept : 0.1167 septate : 0.1306 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from untill: until : 0.0333 till : 0.1111 huntsville : 0.1333 unital : 0.1422 Closest 5 dictionary words with a Jaro-Winkler distance < .15 from wich: witch : 0.0533 winch : 0.0533 which : 0.0600 wichita : 0.0857 twitch : 0.1111
Swift
import Foundation
func loadDictionary(_ path: String) throws -> [String] {
let contents = try String(contentsOfFile: path, encoding: String.Encoding.ascii)
return contents.components(separatedBy: "\n")
}
func jaroWinklerDistance(string1: String, string2: String) -> Double {
var st1 = Array(string1)
var st2 = Array(string2)
var len1 = st1.count
var len2 = st2.count
if len1 < len2 {
swap(&st1, &st2)
swap(&len1, &len2)
}
if len2 == 0 {
return len1 == 0 ? 0.0 : 1.0
}
let delta = max(1, len1 / 2) - 1
var flag = Array(repeating: false, count: len2)
var ch1Match: [Character] = []
ch1Match.reserveCapacity(len1)
for idx1 in 0..<len1 {
let ch1 = st1[idx1]
for idx2 in 0..<len2 {
let ch2 = st2[idx2]
if idx2 <= idx1 + delta && idx2 + delta >= idx1 && ch1 == ch2 && !flag[idx2] {
flag[idx2] = true
ch1Match.append(ch1)
break
}
}
}
let matches = ch1Match.count
if matches == 0 {
return 1.0
}
var transpositions = 0
var idx1 = 0
for idx2 in 0..<len2 {
if flag[idx2] {
if st2[idx2] != ch1Match[idx1] {
transpositions += 1
}
idx1 += 1
}
}
let m = Double(matches)
let jaro =
(m / Double(len1) + m / Double(len2) + (m - Double(transpositions) / 2.0) / m) / 3.0
var commonPrefix = 0
for i in 0..<min(4, len2) {
if st1[i] == st2[i] {
commonPrefix += 1
}
}
return 1.0 - (jaro + Double(commonPrefix) * 0.1 * (1.0 - jaro))
}
func withinDistance(words: [String], maxDistance: Double, string: String,
maxToReturn: Int) -> [(String, Double)] {
var arr = Array(words.map{($0, jaroWinklerDistance(string1: string, string2: $0))}
.filter{$0.1 <= maxDistance})
arr.sort(by: { x, y in return x.1 < y.1 })
return Array(arr[0..<min(maxToReturn, arr.count)])
}
func pad(string: String, width: Int) -> String {
if string.count >= width {
return string
}
return String(repeating: " ", count: width - string.count) + string
}
do {
let dict = try loadDictionary("linuxwords.txt")
for word in ["accomodate", "definately", "goverment", "occured",
"publically", "recieve", "seperate", "untill", "wich"] {
print("Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to '\(word)' are:")
print(" Word | Distance")
for (w, dist) in withinDistance(words: dict, maxDistance: 0.15,
string: word, maxToReturn: 5) {
print("\(pad(string: w, width: 14)) | \(String(format: "%6.4f", dist))")
}
print()
}
} catch {
print(error.localizedDescription)
}
- Output:
Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'accomodate' are: Word | Distance accommodate | 0.0182 accommodated | 0.0333 accommodates | 0.0333 accommodating | 0.0815 accommodation | 0.0815 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'definately' are: Word | Distance definitely | 0.0400 defiantly | 0.0422 define | 0.0800 definite | 0.0850 definable | 0.0872 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'goverment' are: Word | Distance government | 0.0533 govern | 0.0667 governments | 0.0697 movement | 0.0810 governmental | 0.0833 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'occured' are: Word | Distance occurred | 0.0250 occur | 0.0571 occupied | 0.0786 occurs | 0.0905 accursed | 0.0917 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'publically' are: Word | Distance publicly | 0.0400 public | 0.0800 publicity | 0.1044 publication | 0.1327 biblically | 0.1400 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'recieve' are: Word | Distance receive | 0.0333 received | 0.0625 receiver | 0.0625 receives | 0.0625 relieve | 0.0667 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'seperate' are: Word | Distance desperate | 0.0708 separate | 0.0917 temperate | 0.1042 separated | 0.1144 separates | 0.1144 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'untill' are: Word | Distance until | 0.0333 untie | 0.1067 untimely | 0.1083 till | 0.1111 Antilles | 0.1264 Close dictionary words (distance < 0.15 using Jaro-Winkler distance) to 'wich' are: Word | Distance witch | 0.0533 which | 0.0600 switch | 0.1111 twitch | 0.1111 witches | 0.1143
Typescript
var fs = require('fs')
// Jaro Winkler Distance Formula
function jaroDistance(string1: string, string2: string): number{
// Compute Jaro-Winkler distance between two string
// Swap strings if string1 is shorter than string 2
if (string1.length < string2.length){
const tempString: string = string1;
string1 = string2;
string2 = tempString
}
let len1: number = string1.length
let len2: number = string2.length
if (!len2){
return 0.0
}
const delta: number = Math.max(1, len1 / 2.0) - 1.0;
// Flags for transpositions
let flag: boolean[] = Array(len2).fill(false)
let ch1Match: string[] = Array(len1).fill('')
// Count number of matching characters
let matches = 0
// Check if characters on both string matches
for (let i: number = 0; i < len1; i++){
const ch1: string = string1[i]
for (let j = 0; j < len2; j++){
const ch2: string = string2[j]
if (j <= i + delta && j + delta >= 1 && ch1 == ch2 && !flag[j]){
flag[j] = true
ch1Match[matches++] = ch1;
break;
}
}
}
if (!matches){
return 1.0
}
// Count number of transpositions (shared characters placed in different positions)
let transpositions: number = 0.0
for (let i: number = 0, j: number = 0; j < len2; j++){
if (flag[j]){
if (string2[j] != ch1Match[i]){
transpositions++
}
i++
}
}
const m: number = matches
// Jaro Similarity Formula simj = ( (m / length of s1) + (m / length of s2) + (m - t) / m ) / 3
const jaro: number = (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0
// Length of common prefix between string up to 4 characters
let commonPrefix: number = 0.0
len2 = Math.min(4, len2)
for (let i: number = 0; i < len2; i++){
if (string1[i] == string2[i]){
commonPrefix++
}
}
// Jaro Winkler Distance Formula simw = simj + lp(1 - simj)
return 1.0 - (jaro + commonPrefix * 0.1 * (1.0 - jaro))
}
// Compute Jaro Winkler Distance for every word on the dictionary against the misspelled word
function withinDistance(words: string[] ,maxDistance: number, string: string, maxToReturn: number): (string | number)[][]{
let result: (string | number)[][] = new Array()
words.forEach(word =>{
const distance = jaroDistance(word, string)
// check if computed jaro winkler distance is within the set distance parameter
if (distance <= maxDistance){
const tuple = [distance, word]
result.push(tuple)
}
})
result.sort()
// Limit of matches set to maxtoReturn
return result.length <= maxToReturn ? result : result.slice(0, maxToReturn)
}
function loadDictionary(fileName: string): string[]{
let words: string[] = new Array()
try{
//attacomsian.com/blog/reading-a-file-line-by-line-in-nodejs
const data = fs.readFileSync(fileName, 'utf-8')
const lines: string[] = data.split(/\r?\n/)
lines.forEach(line => {
words.push(line)
})
return words
}
catch(error){
console.log("Error reading dictionary")
}
}
function main(): void{
try {
const misspellings = [
"accomodate",
"definately",
"goverment",
"occured",
"publically",
"recieve",
"seperate",
"untill",
"wich"
]
//unixdict.txt from users.cs.duke.edu/~ola/ap/linuxwords
let words: string[] = loadDictionary("unixdict.txt")
misspellings.forEach(spelling =>{
console.log("Misspelling:", spelling)
const closeWord = withinDistance(words, 0.15, spelling, 5)
closeWord.forEach(word =>{
console.log((word[0] as number).toFixed(4) + " " + word[1])
})
console.log("")
})
}
catch(error) {
console.log("Error on main")
}
}
main();
- Output:
Misspelling: accomodate 0.0364 accommodate 0.0515 accommodated 0.0515 accommodates 0.0979 accommodating 0.0979 accommodation Misspelling: definately 0.0564 definitely 0.0586 defiantly 0.0909 define 0.0977 definite 0.1013 defiant Misspelling: goverment 0.0533 government 0.0667 govern 0.0697 governments 0.0833 governmental 0.0952 governs Misspelling: occured 0.0250 occurred 0.0571 occur 0.0786 occupied 0.0905 occurs 0.0917 accursed Misspelling: publically 0.0400 publicly 0.0800 public 0.1044 publicity 0.1327 publication 0.1400 biblically Misspelling: recieve 0.0333 receive 0.0625 received 0.0625 receiver 0.0625 receives 0.0667 relieve Misspelling: seperate 0.0708 desperate 0.1042 temperate 0.1083 separate 0.1167 repeated 0.1167 sept Misspelling: untill 0.0333 until 0.1067 untie 0.1083 untimely 0.1111 till 0.1264 Antilles Misspelling: wich 0.0533 witch 0.0600 which 0.1111 switch 0.1111 twitch 0.1143 witches
V (Vlang)
import os
fn jaro_sim(str1 string, str2 string) f64 {
if str1.len == 0 && str2.len == 0 {
return 1
}
if str1.len == 0 || str2.len == 0 {
return 0
}
mut match_distance := str1.len
if str2.len > match_distance {
match_distance = str2.len
}
match_distance = match_distance/2 - 1
mut str1_matches := []bool{len: str1.len}
mut str2_matches := []bool{len: str2.len}
mut matches := 0.0
mut transpositions := 0.0
for i in 0..str1.len {
mut start := i - match_distance
if start < 0 {
start = 0
}
mut end := i + match_distance + 1
if end > str2.len {
end = str2.len
}
for k in start..end {
if str2_matches[k] {
continue
}
if str1[i] != str2[k] {
continue
}
str1_matches[i] = true
str2_matches[k] = true
matches++
break
}
}
if matches == 0 {
return 0
}
mut k := 0
for i in 0.. str1.len {
if !str1_matches[i] {
continue
}
for !str2_matches[k] {
k++
}
if str1[i] != str2[k] {
transpositions++
}
k++
}
transpositions /= 2
return (matches/f64(str1.len) +
matches/f64(str2.len) +
(matches-transpositions)/matches) / 3
}
fn jaro_winkler_dist(s string, t string) f64 {
ls := s.len
lt := t.len
mut lmax := lt
if ls < lt {
lmax = ls
}
if lmax > 4 {
lmax = 4
}
mut l := 0
for i in 0 .. lmax {
if s[i] == t[i] {
l++
}
}
js := jaro_sim(s, t)
p := 0.1
ws := js + f64(l)*p*(1-js)
return 1 - ws
}
struct Wd {
word string
dist f64
}
fn main() {
misspelt := [
"accomodate", "definately", "goverment", "occured", "publically",
"recieve", "seperate", "untill", "wich",
]
words := os.read_lines('unixdict.txt')?
for ms in misspelt {
mut closest := []Wd{}
for word in words {
if word == "" {
continue
}
jwd := jaro_winkler_dist(ms, word)
if jwd < 0.15 {
closest << Wd{word, jwd}
}
}
println("Misspelt word: $ms:")
closest.sort(a.dist<b.dist)
for i, c in closest {
println("${c.dist:.4f} ${c.word}")
if i == 5 {
break
}
}
println('')
}
}
- Output:
Misspelt word: accomodate : 0.0182 accommodate 0.1044 accordant 0.1136 accolade 0.1219 acclimate 0.1327 accompanist 0.1333 accord Misspelt word: definately : 0.0800 define 0.0850 definite 0.0886 defiant 0.1200 definitive 0.1219 designate 0.1267 deflate Misspelt word: goverment : 0.0667 govern 0.1167 governor 0.1175 governess 0.1330 governance 0.1361 coverlet 0.1367 sovereignty Misspelt word: occured : 0.0250 occurred 0.0571 occur 0.0952 occurrent 0.1056 occlude 0.1217 concurred 0.1429 cure Misspelt word: publically : 0.0800 public 0.1327 publication 0.1400 pull 0.1492 pullback Misspelt word: recieve : 0.0333 receive 0.0667 relieve 0.0762 reeve 0.0852 receptive 0.0852 recessive 0.0905 recife Misspelt word: seperate : 0.0708 desperate 0.0917 separate 0.1042 temperate 0.1167 selenate 0.1167 sewerage 0.1167 sept Misspelt word: untill : 0.0333 until 0.1111 till 0.1333 huntsville 0.1357 instill 0.1422 unital Misspelt word: wich : 0.0533 winch 0.0533 witch 0.0600 which 0.0857 wichita 0.1111 switch 0.1111 twitch
Wren
This uses unixdict and borrows code from the Jaro_distance#Wren task.
import "io" for File
import "./fmt" for Fmt
import "./sort" for Sort
var jaroSim = Fn.new { |s1, s2|
var le1 = s1.count
var le2 = s2.count
if (le1 == 0 && le2 == 0) return 1
if (le1 == 0 || le2 == 0) return 0
var dist = (le2 > le1) ? le2 : le1
dist = (dist/2).floor - 1
var matches1 = List.filled(le1, false)
var matches2 = List.filled(le2, false)
var matches = 0
var transpos = 0
for (i in 0...s1.count) {
var start = i - dist
if (start < 0) start = 0
var end = i + dist + 1
if (end > le2) end = le2
var k = start
while (k < end) {
if (!(matches2[k] || s1[i] != s2[k])) {
matches1[i] = true
matches2[k] = true
matches = matches + 1
break
}
k = k + 1
}
}
if (matches == 0) return 0
var k = 0
for (i in 0...s1.count) {
if (matches1[i]) {
while(!matches2[k]) k = k + 1
if (s1[i] != s2[k]) transpos = transpos + 1
k = k + 1
}
}
transpos = transpos / 2
return (matches/le1 + matches/le2 + (matches - transpos)/matches) / 3
}
var jaroWinklerDist = Fn.new { |s, t|
var ls = s.count
var lt = t.count
var lmax = (ls < lt) ? ls : lt
if (lmax > 4) lmax = 4
var l = 0
for (i in 0...lmax) {
if (s[i] == t[i]) l = l + 1
}
var js = jaroSim.call(s, t)
var p = 0.1
var ws = js + l*p*(1 - js)
return 1 - ws
}
var misspelt = ["accomodate", "definately", "goverment", "occured", "publically", "recieve", "seperate", "untill", "wich"]
var words = File.read("unixdict.txt").split("\n").map { |w| w.trim() }.where { |w| w != "" }
for (ms in misspelt) {
var closest = []
for (word in words) {
var jwd = jaroWinklerDist.call(ms, word)
if (jwd < 0.15) closest.add([word, jwd])
}
System.print("Misspelt word: %(ms):")
var cmp = Fn.new { |n1, n2| (n1[1]-n2[1]).sign }
Sort.insertion(closest, cmp)
for (c in closest.take(6)) Fmt.print("$0.4f $s", c[1], c[0])
System.print()
}
- Output:
Misspelt word: accomodate: 0.0182 accommodate 0.1044 accordant 0.1136 accolade 0.1219 acclimate 0.1327 accompanist 0.1333 accord Misspelt word: definately: 0.0800 define 0.0850 definite 0.0886 defiant 0.1200 definitive 0.1219 designate 0.1267 deflate Misspelt word: goverment: 0.0667 govern 0.1167 governor 0.1175 governess 0.1330 governance 0.1361 coverlet 0.1367 sovereignty Misspelt word: occured: 0.0250 occurred 0.0571 occur 0.0952 occurrent 0.1056 occlude 0.1217 concurred 0.1429 cure Misspelt word: publically: 0.0800 public 0.1327 publication 0.1400 pull 0.1492 pullback Misspelt word: recieve: 0.0333 receive 0.0667 relieve 0.0762 reeve 0.0852 receptive 0.0852 recessive 0.0905 recife Misspelt word: seperate: 0.0708 desperate 0.0917 separate 0.1042 temperate 0.1167 selenate 0.1167 sept 0.1167 sewerage Misspelt word: untill: 0.0333 until 0.1111 till 0.1333 huntsville 0.1357 instill 0.1422 unital Misspelt word: wich: 0.0533 winch 0.0533 witch 0.0600 which 0.0857 wichita 0.1111 switch 0.1111 twitch