Sorensen–Dice coefficient

You are encouraged to solve this task according to the task description, using any language you may know.
The Sørensen–Dice coefficient, also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice) is a statistic used to gauge the similarity of two population samples.
The original use was in botany as a measure of similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the Levenshtein edit distance function, though its characteristics are quite different.
Levenshtein can be useful for spelling correction, but relies on the tested word or phrase being quite similar to the desired one, and can be very slow for long words or phrases.
Sørensen–Dice is more useful for 'fuzzy' matching partial and poorly spelled words or phrases, possibly in improper order.
There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
For instance, the word 'differ' would be tokenized to the group:
di if ff fe er
Different implementations may do slightly different transforms. For our purposes, fold case so that all characters are the same case, split words, and ignore white-space, but keep punctuation.
The phrase "Don't Panic!" will be tokenized to the bi-grams:
do on n' 't pa an ni ic c!
Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups:
SDC = 2 × |A∩B| / (|A| + |B|)
where A, B and A∩B are to be understood as multisets, and that if an item, x, has multiplicity a in A and b in B, then it will have multiplicity min(a,b) in A∩B.
The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations.
SDC can by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.
- Task
- Use the list of Rosetta Code task and draft task names as your "dictionary" to search.
- Using that dictionary, search for the mangled task names: 'Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers'.
- Show the search term and the coefficient / match for the five closest, most similar matches.
How you get the task names is peripheral to the task. You can web-scrape them or download them to a file, whatever.
If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that with a pointer to where it may be obtained.
ALGOL 68
Some of the close matches with equal coefficients appear in reverse order compared to some other samples.
Uses the definition in the task, so if a bi-graph appears twice, it is treated as two bi-graphs.
Note, the sources of the RC Algol 68 file and sort libraries are on separate pages of Rosetta Code - see the above links.
BEGIN # use the Sorensen–Dice coefficient to find the nearest matches for #
# some task names in a list of tasks #
PR read "files.incl.a68" PR # include file utilities, including READLINES #
PR read "sort.incl.a68" PR # include sort utilities, including QUICKSORT #
# details of a bi-graph #
MODE BIGRAPH = STRUCT( STRING bg, INT count );
# returns the length of s #
OP LENGTH = ( STRING s )INT: ( UPB s - LWB s ) + 1;
# returns the bi-graphs from line, spaces and control-characters delimit #
# words but are otherwise ignored #
OP BIGRAPHS = ( STRING line )[]BIGRAPH:
BEGIN
INT c len = LENGTH line;
OP NORMALISE = ( INT pos )CHAR:
IF pos > c len THEN " "
ELIF CHAR c = line[ pos ]; c <= " " THEN " "
ELIF c >= "a" AND c <= "z" THEN REPR ( ( ABS c - ABS "a" ) + ABS "A" )
ELSE c
FI # NORMALISE # ;
# split into bi-graphs, handling word boundaries #
# a one-character word forms a bi-graph containing the character #
# followed by a space, space does not appear in the bi-graphs of #
# longer words #
[ 1 : c len - 1 ]STRING bg;
BOOL in word := line[ LWB line ] > " ";
INT bg len := 0;
CHAR this char := NORMALISE LWB line;
FOR bg pos FROM LWB line TO UPB line DO
CHAR next char = NORMALISE ( bg pos + 1 );
IF this char = " " THEN
in word := FALSE
ELIF in word AND next char > " " THEN
bg[ bg len +:= 1 ] := this char + next char
ELIF NOT in word THEN
bg[ bg len +:= 1 ] := this char + next char;
in word := next char > " "
FI;
this char := next char
OD;
bg QUICKSORT ELEMENTS( 1, bg len ); # sort the bi-graphs #
# count the occurances of each bi-graph #
[ 1 : bg len ]BIGRAPH bg set;
IF bg len < 1
THEN bg set # no bi-graphs afterall #
ELSE INT set len := 1;
bg set[ 1 ] := BIGRAPH( bg[ 1 ], 1 );
FOR bg pos FROM LWB bg + 1 TO bg len DO
IF bg[ bg pos ] = bg OF bg set[ set len ]
THEN count OF bg set[ set len ] +:= 1
ELSE bg set[ set len +:= 1 ] := ( bg[ bg pos ], 1 )
FI
OD;
bg set[ 1 : set len ]
FI
END # BIGRAPHS # ;
# returns the Sorensen–Dice coefficient of a compared to b #
PRIO SDCOEFFICIENT = 1;
OP SDCOEFFICIENT = ( []BIGRAPH bga, bgb )REAL:
BEGIN
INT i len := 0, a len := 0, b len := 0;
FOR b pos FROM LWB bgb TO UPB bgb DO b len +:= count OF bgb[ b pos ] OD;
FOR a pos FROM LWB bga TO UPB bga DO
BIGRAPH bg = bga[ a pos ];
a len +:= count OF bg;
FOR b pos FROM LWB bgb TO UPB bgb WHILE bg OF bgb[ b pos ] <= bg OF bg DO
IF bg OF bg = bg OF bgb[ b pos ]
THEN i len +:= IF count OF bg < count OF bgb[ b pos ]
THEN count OF bg
ELSE count OF bgb[ b pos ]
FI
FI
OD
OD;
( 2 * i len ) / ( a len + b len )
END # SDCOEFFICIENT # ;
# task #
IF REF[]STRING tasks = READLINES "tasks.txt";
UPB tasks < LWB tasks
THEN print( ( "Unable to read tasks.txt", newline ) )
ELSE # tasks.txt read OK #
# find the closest matches for the specified non-task names #
[]STRING tests = ( "Primordial primes"
, "Sunkist-Giuliani formula"
, "Sieve of Euripides"
, "Chowder numbers"
);
MODE SDMATCH = STRUCT( REAL sd, STRING name );
FOR t pos FROM LWB tests TO UPB tests DO
[ 1 : 5 ]SDMATCH closest := ( ( 0, "" ), ( 0, "" ), ( 0, "" ), ( 0, "" ), ( 0, "" ) );
INT max closest = UPB closest;
STRING test = tests[ t pos ];
[]BIGRAPH tbg = BIGRAPHS test;
print( ( test, ":", newline ) );
FOR d pos FROM LWB tasks TO UPB tasks DO
STRING task = tasks[ d pos ];
IF REAL sd = tbg SDCOEFFICIENT BIGRAPHS task; sd > sd OF closest[ max closest ] THEN
BOOL added := FALSE;
FOR pos TO UPB closest - 1 WHILE NOT added DO
IF added := sd > sd OF closest[ pos ] THEN
closest[ pos + 1 : max closest ] := closest[ pos : max closest - 1 ];
closest[ pos ] := ( sd, task )
FI
OD;
IF NOT added THEN closest[ max closest ] := ( sd, task ) FI
FI
OD;
FOR c pos TO UPB closest DO
print( ( " ", fixed( sd OF closest[ c pos ], -8, 4 ) ) );
print( ( ": ", name OF closest[ c pos ], newline ) )
OD
OD
FI
END
- Output:
Primordial primes: 0.6857: Sequence of primorial primes 0.6667: Factorial primes 0.5714: Primorial numbers 0.5455: Prime words 0.5217: Almost prime Sunkist-Giuliani formula: 0.5652: Almkvist-Giullera formula for pi 0.3784: Faulhaber's formula 0.3429: Haversine formula 0.3333: Check Machin-like formulas 0.3077: Resistance calculator Sieve of Euripides: 0.4615: Four sides of square 0.4615: Sieve of Pritchard 0.4138: Sieve of Eratosthenes 0.4000: Piprimes 0.3846: Sierpinski curve Chowder numbers: 0.7826: Chowla numbers 0.6400: Powerful numbers 0.6087: Fermat numbers 0.6087: Rhonda numbers 0.6000: Lah numbers
C++
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>
using bigram = std::pair<char, char>;
std::multiset<bigram> bigrams(const std::string& phrase) {
std::multiset<bigram> result;
std::istringstream is(phrase);
std::string word;
while (is >> word) {
for (char& ch : word) {
ch = std::tolower(static_cast<unsigned char>(ch));
}
size_t length = word.size();
if (length == 1) {
result.emplace(word[0], '\0');
} else {
for (size_t i = 0; i + 1 < length; ++i) {
result.emplace(word[i], word[i + 1]);
}
}
}
return result;
}
double sorensen(const std::string& s1, const std::string& s2) {
auto a = bigrams(s1);
auto b = bigrams(s2);
std::multiset<bigram> c;
std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
std::inserter(c, c.begin()));
return (2.0 * c.size()) / (a.size() + b.size());
}
int main() {
std::vector<std::string> tasks;
std::ifstream is("tasks.txt");
if (!is) {
std::cerr << "Cannot open tasks file.\n";
return EXIT_FAILURE;
}
std::string task;
while (getline(is, task)) {
tasks.push_back(task);
}
const size_t tc = tasks.size();
const std::string tests[] = {"Primordial primes",
"Sunkist-Giuliani formula",
"Sieve of Euripides", "Chowder numbers"};
std::vector<std::pair<double, size_t>> sdi(tc);
std::cout << std::fixed;
for (const std::string& test : tests) {
for (size_t i = 0; i != tc; ++i) {
sdi[i] = std::make_pair(sorensen(tasks[i], test), i);
}
std::partial_sort(sdi.begin(), sdi.begin() + 5, sdi.end(),
[](const std::pair<double, size_t>& a,
const std::pair<double, size_t>& b) {
return a.first > b.first;
});
std::cout << test << " >\n";
for (size_t i = 0; i < 5 && i < tc; ++i) {
std::cout << " " << sdi[i].first << ' ' << tasks[sdi[i].second]
<< '\n';
}
std::cout << '\n';
}
return EXIT_SUCCESS;
}
- Output:
Primordial primes > 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula > 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides > 0.461538 Four sides of square 0.461538 Sieve of Pritchard 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve Chowder numbers > 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Rhonda numbers 0.608696 Fermat numbers 0.600000 Lah numbers
FreeBASIC
Type Bigram
primero As Ubyte
segundo As Ubyte
End Type
Type BigramArray
items(Any) As Bigram
cnt As Integer
End Type
Function createBigrams(phrase As String) As BigramArray
Dim As BigramArray result
Redim result.items(0)
result.cnt = 0
Dim As String word = ""
Dim As Integer i = 1, j
Dim As Integer length = Len(phrase)
While i <= length
' Skip spaces
While i <= length Andalso phrase[i-1] = Asc(" ")
i += 1
Wend
' Extract word
word = ""
While i <= length Andalso phrase[i-1] <> Asc(" ")
word &= Chr(phrase[i-1])
i += 1
Wend
If Len(word) > 0 Then
word = Lcase(word)
Dim As Integer wordLen = Len(word)
If wordLen = 1 Then
Redim Preserve result.items(result.cnt)
result.items(result.cnt).primero = Asc(Mid(word, 1, 1))
result.items(result.cnt).segundo = 0
result.cnt += 1
Else
For j = 1 To wordLen - 1
Redim Preserve result.items(result.cnt)
result.items(result.cnt).primero = Asc(Mid(word, j, 1))
result.items(result.cnt).segundo = Asc(Mid(word, j + 1, 1))
result.cnt += 1
Next
End If
End If
Wend
Return result
End Function
Function countIntersection(a As BigramArray, b As BigramArray) As Integer
Dim As Integer cnt, i, j
cnt = 0
For i = 0 To a.cnt - 1
For j = 0 To b.cnt - 1
If a.items(i).primero = b.items(j).primero Andalso a.items(i).segundo = b.items(j).segundo Then
cnt += 1
Exit For
End If
Next
Next
Return cnt
End Function
Function Sorensen(s1 As String, s2 As String) As Double
Dim As BigramArray a = createBigrams(s1)
Dim As BigramArray b = createBigrams(s2)
Return (2.0 * countIntersection(a, b)) / (a.cnt + b.cnt)
End Function
Sub main()
Dim As String tasks()
Dim As Integer taskCnt = 0
Dim As Integer ff = Freefile
' Open tasks file
If Open("i:\tasks.txt" For Input As #ff) <> 0 Then
Print "Cannot open tasks file."
Exit Sub
End If
' Read tasks from file
Do Until Eof(ff)
Redim Preserve tasks(taskCnt)
Line Input #ff, tasks(taskCnt)
taskCnt += 1
Loop
Close #ff
Dim As Integer i, j, k
Dim As String tests(3) = {"Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers"}
For i = 0 To Ubound(tests)
Print tests(i); " >"
Dim As Double scores(taskCnt - 1)
Dim As Integer indices(taskCnt - 1)
' Calculate scores
For j = 0 To taskCnt - 1
scores(j) = Sorensen(tasks(j), tests(i))
indices(j) = j
Next
' Simple bubble sort for top 5
For j = 0 To 4
For k = j + 1 To taskCnt - 1
If scores(k) > scores(j) Then
Swap scores(j), scores(k)
Swap indices(j), indices(k)
End If
Next
Next
' Print top 5 results
For j = 0 To 4
If j < taskCnt Then Print Using " ##.#### &"; scores(j); tasks(indices(j))
Next
Print
Next
End Sub
main()
Sleep
- Output:
Primordial primes > 0.7429 Sequence of primorial primes 0.6667 Factorial primes 0.5882 Safe primes and unsafe primes 0.5763 Prime numbers p for which the sum of primes less than or equal to p is prime 0.5714 Primorial numbers Sunkist-Giuliani formula > 0.6522 Almkvist-Giullera formula for pi 0.3830 Shoelace formula for polygonal area 0.3784 Faulhaber's formula 0.3500 Sort primes from list to a list 0.3429 Haversine formula Sieve of Euripides > 0.4615 Four sides of square 0.4615 Sieve of Pritchard 0.4390 Seven-sided dice from five-sided dice 0.4138 Sieve of Eratosthenes 0.4138 Law of cosines - triples Chowder numbers > 0.7826 Chowla numbers 0.6957 Humble numbers 0.6667 Bell numbers 0.6400 Powerful numbers 0.6154 Bernoulli numbers
J
Tentative implementation:
TASKS=: fread '~/tasks.txt' NB. from Sorensen–Dice_coefficient/Tasks
sdtok=: [: (#~ ' '*/ .~:~])2]\ 7 u: tolower@rplc&(LF,' ')
sdinter=: {{
all=. ~.x,y
X=. <:#/.~all,x
Y=. <:#/.~all,y
+/X<.Y
}}
sdunion=: #@,
SDC=: (2 * sdinter % sdunion)&sdtok S:0
nearest=: {{ m{.\:~ x (] ;"0~ SDC) cutLF y }}
fmt=: ((8j6": 0{::]),' ',1{::])"1
The trick here is the concept of "intersection" which we must use. We can't use set intersection -- the current draft task description suggests that SDI = 2 × (A ∩ B) / (A ⊎ B)
produces a number between 0 and 1. Because we're using division to produce this number, we must be using cardinality of the intersection rather than the intersection itself.
But if A and B are sets, each containing the same tokens, the result here using cardinality of sets would be 2 rather than 1.
Instead, we treat A and B as sequences of tokens (so repeated copies of a token are distinct), for the cardinality of the intersection we count the number of times that each token appears in either A and in B and sum the minimum of the two counts. (So, tokens which only appear in A count 0 times, for example, where a token which appears 3 times in A and 2 times in B would contribute 2 to the sum.)
With this implementation, here's the task examples:
fmt 'Primordial prime' 5 nearest TASKS 0.647059 Sequence of primorial primes 0.615385 Factorial primes 0.592593 Primorial numbers 0.571429 Prime words 0.545455 Almost prime fmt 'Sunkist-Giuliani formula' 5 nearest TASKS 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator fmt 'Sieve of Euripides' 5 nearest TASKS 0.461538 Sieve of Pritchard 0.461538 Four sides of square 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve fmt 'Chowder numbers' 5 nearest TASKS 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Rhonda numbers 0.608696 Fermat numbers 0.600000 Lah numbers
Java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public final class SorensenDiceCoefficient {
public static void main(String[] args) throws IOException {
List<String> tasks = Files.readAllLines(Path.of("Rosetta Code Tasks.dat"), StandardCharsets.UTF_8);
List<String> tests = List.of(
"Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers" );
record TaskValue(String task, double value) {}
for ( String test : tests ) {
List<TaskValue> taskValues = new ArrayList<TaskValue>();
Map<String, Integer> bigramsTest = createBigrams(test);
for ( String task : tasks ) {
double value = sorensenDice(bigramsTest, createBigrams(task));
taskValues.add( new TaskValue(task, value) );
}
Collections.sort(taskValues, (one, two) -> Double.compare(two.value, one.value));
System.out.println(test + ":");
for ( int i = 0; i < 5; i++ ) {
System.out.println(String.format("%s%.4f%s%s",
" ", taskValues.get(i).value, " ", taskValues.get(i).task));
}
System.out.println();
}
}
private static double sorensenDice(Map<String, Integer> bigramsOne, Map<String, Integer> bigramsTwo) {
int intersectionSize = 0;
for ( Map.Entry<String, Integer> entry : bigramsOne.entrySet() ) {
if ( bigramsTwo.keySet().contains(entry.getKey()) ) {
intersectionSize += Math.min(entry.getValue(), bigramsTwo.get(entry.getKey()));
}
}
return 2.0 * intersectionSize / ( size(bigramsOne) + size(bigramsTwo) );
}
private static Map<String, Integer> createBigrams(String text) {
Map<String, Integer> result = new HashMap<String, Integer>();
for ( String word : text.toLowerCase().split(" ") ) {
if ( word.length() == 1 ) {
result.merge(word, 1, Integer::sum);
} else {
for ( int i = 0; i < word.length() - 1; i++ ) {
result.merge(word.substring(i, i + 2), 1, Integer::sum);
}
}
}
return result;
}
private static int size(Map<String, Integer> map) {
return map.values().stream().mapToInt(Integer::intValue).sum();
}
}
- Output:
Primordial primes: 0.6857 Sequence of primorial primes 0.6667 Factorial primes 0.5714 Primorial numbers 0.5455 Prime words 0.5217 Almost prime Sunkist-Giuliani formula: 0.5652 Almkvist-Giullera formula for pi 0.3784 Faulhaber's formula 0.3429 Haversine formula 0.3333 Check Machin-like formulas 0.3077 Resistance calculator Sieve of Euripides: 0.4615 Four sides of square 0.4615 Sieve of Pritchard 0.4138 Sieve of Eratosthenes 0.4000 Piprimes 0.3846 Sierpinski curve Chowder numbers: 0.7826 Chowla numbers 0.6400 Powerful numbers 0.6087 Fermat numbers 0.6087 Rhonda numbers 0.6000 Lah numbers
jq
Works with gojq, the Go implementation of jq
Works with jaq, the Rust implementation of jq
Adapted from Wren
### Generic preliminaries
def count(s): reduce s as $x (0; .+1);
def lpad($len): tostring | ($len - length) as $l | (" " * $l) + .;
# Emit the count of the common items in the two given sorted arrays
# viewed as multisets
def count_commonality_of_multisets($A; $B):
# Returns a stream of the common elements
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then 1, ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
count([0,0] | pop);
# Emit an array of the normalized bigrams of the input string
def bigrams:
# Emit a stream of the bigrams of the input string blindly
def bg: . as $in | range(0;length-1 ) | $in[.:.+2];
ascii_downcase | [splits(" *") | bg];
### The Sorensen-Dice coefficient
def sorensen($a; $b):
($a | bigrams | sort) as $A
| ($b | bigrams | sort) as $B
| 2 * count_commonality_of_multisets($A; $B) / (($A|length) + ($B|length));
### Exercises
def exercises:
"Primordial primes",
"Sunkist-Giuliani formula",
"Sieve of Euripides",
"Chowder numbers"
;
[inputs] as $phrases
| exercises as $test
| [ range(0; $phrases|length) as $i
| [sorensen($phrases[$i]; $test), $phrases[$i] ] ]
| sort_by(first)
| .[-5:]
| reverse
| "\($test) >",
map( " \(first|tostring|.[:4]|lpad(4)) \(.[1])")[],
""
- Output:
Invocation: jq -nrR -f sorensen-dice-coefficient.jq rc_tasks_2022_09_24.txt
Primordial primes > 0.68 Sequence of primorial primes 0.66 Factorial primes 0.57 Primorial numbers 0.54 Prime words 0.52 Almost prime Sunkist-Giuliani formula > 0.56 Almkvist-Giullera formula for pi 0.37 Faulhaber's formula 0.34 Haversine formula 0.33 Check Machin-like formulas 0.30 Resistance calculator Sieve of Euripides > 0.46 Sieve of Pritchard 0.46 Four sides of square 0.41 Sieve of Eratosthenes 0.4 Piprimes 0.38 Sierpinski curve Chowder numbers > 0.78 Chowla numbers 0.64 Powerful numbers 0.60 Rhonda numbers 0.60 Fermat numbers 0.6 Lah numbers
Julia
using Multisets
""" convert a phrase into a count of bigram tokens of its words """
function tokenizetext(txt)
tokens = Multiset{String}()
words = split(lowercase(txt), r"\s+")
for w in words
a = collect(w)
if length(a) < 3
push!(tokens, w)
else
for i in 1:length(a)-1
push!(tokens, String(a[i:i+1]))
end
end
end
return tokens
end
""" Sorenson-Dice similarity of multisets """
function sorenson_dice(text1, text2)
bc1, bc2 = tokenizetext(text1), tokenizetext(text2)
return 2 * length(bc1 ∩ bc2) / (length(bc1) + length(bc2))
end
const alltasks = split(read("onedrive/documents/julia programs/tasks.txt", String), "\n")
# run tests
for test in ["Primordial primes", "Sunkist-Giuliani formula",
"Sieve of Euripides", "Chowder numbers"]
taskvalues = sort!([(sorenson_dice(test, t), t) for t in alltasks], rev = true)
println("\n$test:")
for (val, task) in taskvalues[begin:begin+4]
println(lpad(Float16(val), 8), " ", task)
end
end
- Output:
Primordial primes: 0.6855 Sequence of primorial primes 0.6665 Factorial primes 0.5713 Primorial numbers 0.5454 Prime words 0.522 Almost prime Sunkist-Giuliani formula: 0.5654 Almkvist-Giullera formula for pi 0.3784 Faulhaber's formula 0.3428 Haversine formula 0.3333 Check Machin-like formulas 0.3076 Resistance calculator Sieve of Euripides: 0.4614 Sieve of Pritchard 0.4614 Four sides of square 0.4138 Sieve of Eratosthenes 0.4 Piprimes 0.3845 Sierpinski curve Chowder numbers: 0.7827 Chowla numbers 0.64 Powerful numbers 0.609 Rhonda numbers 0.609 Fermat numbers 0.6 Lah numbers
Nim
import std/[algorithm, strutils, sugar, tables]
func bigrams(text: string): CountTable[string] =
## Extract the bigrams from a text.
for word in text.toLower.split(' '):
if word.len == 1:
result.inc(word)
else:
for i in 0..(word.len - 2):
result.inc(word[i..(i+1)])
func intersectionCount(a, b: CountTable[string]): int =
## Compute the cardinal of the intersection of two
## count tables.
for key, count in a:
if key in b:
inc result, min(count, b[key])
func card(a: CountTable[string]): int =
## Return the cardinal of a count table (i.e. the sum of counts).
for count in a.values:
inc result, count
func sorensenDice(text1, text2: string): float =
## Compute the Sorensen-dice coefficient of "text1" and "text2".
let ct1 = text1.bigrams
let ct2 = text2.bigrams
result = 2 * intersectionCount(ct1, ct2) / (ct1.card + ct2.card)
# Build the list of tasks.
let tasks = collect:
for line in lines("Sorensen-Dice.txt"):
line
const Tests = ["Primordial primes", "Sunkist-Giuliani formula",
"Sieve of Euripides", "Chowder numbers"]
for test in Tests:
echo test
var scores: seq[(float, string)]
for task in tasks:
scores.add (sorensenDice(test, task), task)
scores.sort(Descending)
for i in 0..4:
echo " ", scores[i][0].formatFloat(ffDecimal, 6), ' ', scores[i][1]
echo()
- Output:
Primordial primes 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides 0.461538 Sieve of Pritchard 0.461538 Four sides of square 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve Chowder numbers 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Rhonda numbers 0.608696 Fermat numbers 0.600000 Lah numbers
Perl
use v5.036;
use Path::Tiny;
use List::Util <uniq head>;
sub bi_gram {
my $line = lc shift;
uniq map { substr $line,$_,2 } 0..length($line)-2;
}
sub score {
my($phrase, $word) = @_;
my %count;
my @match = bi_gram $phrase;
$count{$_}++ for @match, @$word;
2 * (grep { $count{$_} > 1 } keys %count) / (@match + @$word);
}
sub sorensen {
my($dict,$word,$cutoff) = @_; $cutoff //= 0.00;
my(%matches,$s);
($s = score($word, $$dict{$_})) > $cutoff and $matches{$_} = $s for keys %$dict;
%matches;
}
my %dict = map { $_ => [ bi_gram($_) ] } path('ref/Sorensen-Dice-Tasks.txt')->slurp =~ /.{10,}/gm;
for my $word ( ('Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers') ) {
my(%scored,@ranked);
%scored = sorensen(\%dict,$word);
push @ranked, sprintf "%.3f $_", $scored{$_} for sort { $scored{$b} <=> $scored{$a} || $a cmp $b } keys %scored;
say "\n$word:\n" . join("\n", head 5, @ranked);
}
- Output:
Primordial primes: 0.741 Factorial primes 0.629 Sequence of primorial primes 0.583 Almost prime 0.581 Next special primes 0.571 Pandigital prime Sunkist-Giuliani formula: 0.542 Almkvist-Giullera formula for pi 0.368 Haversine formula 0.359 Faulhaber's formula 0.348 Check Machin-like formulas 0.303 FASTA format Sieve of Euripides: 0.541 Sieve of Eratosthenes 0.529 Sieve of Pritchard 0.457 Four sides of square 0.457 The sieve of Sundaram 0.387 Sum of a series Chowder numbers: 0.769 Chowla numbers 0.615 Rhonda numbers 0.609 Bell numbers 0.609 Lah numbers 0.593 Kaprekar numbers
Phix
with javascript_semantics function bigram(string s) sequence words = split(lower(s)), res = {} for word in words do for i=1 to length(word)-1 do res = append(res,word[i..i+1]) end for end for res = sort(res) return res end function function intrasect(sequence s1, s2) integer l1 = length(s1), l2 = length(s2), i1 = 1, i2 = 1, res = 0 while i1<=l1 and i2<=l2 do integer c = compare(s1[i1],s2[i2]) res += (c=0) i1 += (c<=0) i2 += (c>=0) end while return res end function procedure test(string s, sequence dictionary) sequence scores = {}, s1 = bigram(s) for phrase in dictionary do sequence s2 = bigram(phrase) scores &= 2*intrasect(s1,s2)/(length(s1)+length(s2)) end for printf(1,"%s >\n",s) sequence tags = reverse(custom_sort(scores,tagset(length(scores)))[-5..-1]) for t in tags do printf(1,"%f %s\n",{scores[t],dictionary[t]}) end for printf(1,"\n") end procedure constant tests = split(""" Primordial primes Sunkist-Giuliani formula Sieve of Euripides Chowder numbers """,'\n'), tasks = split(""" Almkvist-Giullera formula for pi Almost prime Check Machin-like formulas Chowla numbers Factorial primes Faulhaber's formula Fermat numbers Four sides of square Haversine formula Lah numbers Piprimes Powerful numbers Prime words Primorial numbers Resistance calculator Rhonda numbers Sequence of primorial primes Sierpinski curve Sieve of Eratosthenes Sieve of Pritchard """,'\n') papply(true,test,{tests,{tasks}})
- Output:
Extending the task list to the full 1577 entries changes nothing.
Primordial primes > 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula > 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides > 0.461538 Sieve of Pritchard 0.461538 Four sides of square 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve Chowder numbers > 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Rhonda numbers 0.608696 Fermat numbers 0.600000 Lah numbers
Python
Of the several Python string similarity libraries implementing Sorenson-Dice similarity, none give the same results as the original example's Raku library, so this was imitated using Multisets, as per the C++ and Wren examples.
''' Rosetta Code task rosettacode.org/wiki/Sorensen–Dice_coefficient '''
from multiset import Multiset
def tokenizetext(txt):
''' convert a phrase into a count of bigram tokens of its words '''
arr = []
for wrd in txt.lower().split(' '):
arr += ([wrd] if len(wrd) == 1 else [wrd[i:i+2]
for i in range(len(wrd)-1)])
return Multiset(arr)
def sorenson_dice(text1, text2):
''' Sorenson-Dice similarity of Multisets '''
bc1, bc2 = tokenizetext(text1), tokenizetext(text2)
return 2 * len(bc1 & bc2) / (len(bc1) + len(bc2))
with open('tasklist_sorenson.txt', 'r') as fd:
alltasks = fd.read().split('\n')
for testtext in ['Primordial primes', 'Sunkist-Giuliani formula',
'Sieve of Euripides', 'Chowder numbers']:
taskvalues = sorted([(sorenson_dice(testtext, t), t)
for t in alltasks], reverse=True)
print(f'\n{testtext}:')
for (val, task) in taskvalues[:5]:
print(f' {val:.6f} {task}')
- Output:
Primordial primes: 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula: 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides: 0.461538 Sieve of Pritchard 0.461538 Four sides of square 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve Chowder numbers: 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Rhonda numbers 0.608696 Fermat numbers 0.600000 Lah numbers
Raku
Using the library Text::Sorensen from the Raku ecosystem.
See the Raku entry for Text completion for a bespoke implementation of Sorenson-Dice. (Which is very similar to the library implementation.)
use Text::Sorensen :sdi;
my %tasks = './tasks.txt'.IO.slurp.lines.race.map: { $_ => .&bi-gram };
sub fuzzy-search (Str $string) { sdi($string, %tasks, :ge(.2) ).head(5).join: "\n" }
say "\n$_ >\n" ~ .&fuzzy-search for
'Primordial primes',
'Sunkist-Giuliani formula',
'Sieve of Euripides',
'Chowder numbers';
- Output:
Primordial primes > 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula > 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides > 0.461538 Four sides of square 0.461538 Sieve of Pritchard 0.413793 Sieve of Eratosthenes 0.4 Piprimes 0.384615 Sierpinski curve Chowder numbers > 0.782609 Chowla numbers 0.64 Powerful numbers 0.608696 Fermat numbers 0.608696 Rhonda numbers 0.6 Lah numbers
Wren
This assumes that a one letter word is treated as a bigram. It also assumes that all bigrams are matched whether duplicates or not.
The results on this basis are the same as the Raku example.
import "io" for File
import "./str" for Str
import "./set" for Bag
import "./fmt" for Fmt
var bigrams = Fn.new { |phrase|
var words = Str.splitNoEmpty(phrase, " ")
var res = []
for (word in words) {
var chars = Str.lower(word).toList
if (chars.count == 1) {
res.add(chars[0])
} else {
for (i in 0...chars.count-1) {
res.add(chars[i] + chars[i+1])
}
}
}
return res
}
var sorensen = Fn.new { |a, b|
var abi = Bag.new(bigrams.call(a))
var bbi = Bag.new(bigrams.call(b))
var common = abi.intersect(bbi)
return 2 * common.count / (abi.count + bbi.count)
}
var fileName = "rc_tasks_2022_09_24.txt" // local copy
var tasks = File.read(fileName).trimEnd().split("\n")
var tc = tasks.count
var tests = [
"Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers"
]
var sdis = List.filled(tc, null)
for (test in tests) {
for (i in 0...tasks.count) sdis[i] = [tasks[i], sorensen.call(tasks[i], test)]
var top5 = sdis.sort { |e1, e2| e1[1] >= e2[1] }.take(5).toList
System.print("%(test) >")
for (e in top5) Fmt.print(" $f $s", e[1], e[0])
System.print()
}
- Output:
Primordial primes > 0.685714 Sequence of primorial primes 0.666667 Factorial primes 0.571429 Primorial numbers 0.545455 Prime words 0.521739 Almost prime Sunkist-Giuliani formula > 0.565217 Almkvist-Giullera formula for pi 0.378378 Faulhaber's formula 0.342857 Haversine formula 0.333333 Check Machin-like formulas 0.307692 Resistance calculator Sieve of Euripides > 0.461538 Four sides of square 0.461538 Sieve of Pritchard 0.413793 Sieve of Eratosthenes 0.400000 Piprimes 0.384615 Sierpinski curve Chowder numbers > 0.782609 Chowla numbers 0.640000 Powerful numbers 0.608696 Fermat numbers 0.608696 Rhonda numbers 0.600000 Lah numbers