Sorensen–Dice coefficient: Difference between revisions

m
(→‎Julia: file name)
m (→‎{{header|jq}}: simplify)
 
(8 intermediate revisions by 4 users not shown)
Line 1:
{{task}}
 
The [[wp:Sørensen–Dice coefficient|Sørensen–Dice coefficient]], also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice,) is a statistic used to gauge the similarity of two poulationpopulation samples.
 
The original use was in botany, indexingas a measure of similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the [[Levenshtein distance|Levenshtein edit distance]] function, though it'sits strengthcharacteristics liesare in aquite different area.
 
[[Levenshtein distance|Levenshtein]] iscan be gooduseful for findingspelling misspellingscorrection, but relies on the tested word /or phrase being prettyquite similar to the desired one, and can be very slow for long words /or phrases.
 
Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words /or phrases, possibly in improper order.
 
There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
Line 22:
 
Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups.:
 
SDC = 2 × |A∩B| / (|A| + |B|)
For items(objects, populations) A and B:
 
where A, B and A∩B are to be understood as multisets, and that if an item, x, has multiplicity a in A and b in B, then it will have multiplicity min(a,b) in A∩B.
SDI = 2 × (A ∩ B) / (A ⊎ B)
 
The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations between 0.0 and 1.0.
 
SDISDC ''can'' by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.
 
 
Line 41:
How you get the task names is peripheral to the task. You can [[:Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever.
 
If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that with a pointer to where it may be obtained.
 
 
Line 201:
0.608696 Fermat numbers
0.600000 Lah numbers </pre>
 
=={{header|Java}}==
<syntaxhighlight lang="java">
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
 
public final class SorensenDiceCoefficient {
 
public static void main(String[] args) throws IOException {
List<String> tasks = Files.readAllLines(Path.of("Rosetta Code Tasks.dat"), StandardCharsets.UTF_8);
List<String> tests = List.of(
"Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers" );
record TaskValue(String task, double value) {}
for ( String test : tests ) {
List<TaskValue> taskValues = new ArrayList<TaskValue>();
Map<String, Integer> bigramsTest = createBigrams(test);
for ( String task : tasks ) {
double value = sorensenDice(bigramsTest, createBigrams(task));
taskValues.add( new TaskValue(task, value) );
}
Collections.sort(taskValues, (one, two) -> Double.compare(two.value, one.value));
System.out.println(test + ":");
for ( int i = 0; i < 5; i++ ) {
System.out.println(String.format("%s%.4f%s%s",
" ", taskValues.get(i).value, " ", taskValues.get(i).task));
}
System.out.println();
}
}
private static double sorensenDice(Map<String, Integer> bigramsOne, Map<String, Integer> bigramsTwo) {
int intersectionSize = 0;
for ( Map.Entry<String, Integer> entry : bigramsOne.entrySet() ) {
if ( bigramsTwo.keySet().contains(entry.getKey()) ) {
intersectionSize += Math.min(entry.getValue(), bigramsTwo.get(entry.getKey()));
}
}
return 2.0 * intersectionSize / ( size(bigramsOne) + size(bigramsTwo) );
}
private static Map<String, Integer> createBigrams(String text) {
Map<String, Integer> result = new HashMap<String, Integer>();
for ( String word : text.toLowerCase().split(" ") ) {
if ( word.length() == 1 ) {
result.merge(word, 1, Integer::sum);
} else {
for ( int i = 0; i < word.length() - 1; i++ ) {
result.merge(word.substring(i, i + 2), 1, Integer::sum);
}
}
}
return result;
}
private static int size(Map<String, Integer> map) {
return map.values().stream().mapToInt(Integer::intValue).sum();
}
 
}
</syntaxhighlight>
{{ out }}
<pre>
Primordial primes:
0.6857 Sequence of primorial primes
0.6667 Factorial primes
0.5714 Primorial numbers
0.5455 Prime words
0.5217 Almost prime
 
Sunkist-Giuliani formula:
0.5652 Almkvist-Giullera formula for pi
0.3784 Faulhaber's formula
0.3429 Haversine formula
0.3333 Check Machin-like formulas
0.3077 Resistance calculator
 
Sieve of Euripides:
0.4615 Four sides of square
0.4615 Sieve of Pritchard
0.4138 Sieve of Eratosthenes
0.4000 Piprimes
0.3846 Sierpinski curve
 
Chowder numbers:
0.7826 Chowla numbers
0.6400 Powerful numbers
0.6087 Fermat numbers
0.6087 Rhonda numbers
0.6000 Lah numbers
</pre>
 
=={{header|jq}}==
{{Works with|jq}}
 
'''Works with gojq, the Go implementation of jq'''
 
'''Works with jaq, the Rust implementation of jq'''
 
'''Adapted from [[#Wren|Wren]]'''
<syntaxhighlight lang="jq">
### Generic preliminaries
 
def count(s): reduce s as $x (0; .+1);
 
def lpad($len): tostring | ($len - length) as $l | (" " * $l) + .;
 
# Emit the count of the common items in the two given sorted arrays
# viewed as multisets
def count_commonality_of_multisets($A; $B):
# Returns a stream of the common elements
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then 1, ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
count([0,0] | pop);
 
# Emit an array of the normalized bigrams of the input string
def bigrams:
# Emit a stream of the bigrams of the input string blindly
def bg: . as $in | range(0;length-1 ) | $in[.:.+2];
ascii_downcase | [splits(" *") | bg];
 
 
### The Sorensen-Dice coefficient
 
def sorensen($a; $b):
($a | bigrams | sort) as $A
| ($b | bigrams | sort) as $B
| 2 * count_commonality_of_multisets($A; $B) / (($A|length) + ($B|length));
 
 
### Exercises
 
def exercises:
"Primordial primes",
"Sunkist-Giuliani formula",
"Sieve of Euripides",
"Chowder numbers"
;
 
[inputs] as $phrases
| exercises as $test
| [ range(0; $phrases|length) as $i
| [sorensen($phrases[$i]; $test), $phrases[$i] ] ]
| sort_by(first)
| .[-5:]
| reverse
| "\($test) >",
map( " \(first|tostring|.[:4]|lpad(4)) \(.[1])")[],
""
</syntaxhighlight>
{{output}}
Invocation: jq -nrR -f sorensen-dice-coefficient.jq rc_tasks_2022_09_24.txt
<pre>
Primordial primes >
0.68 Sequence of primorial primes
0.66 Factorial primes
0.57 Primorial numbers
0.54 Prime words
0.52 Almost prime
 
Sunkist-Giuliani formula >
0.56 Almkvist-Giullera formula for pi
0.37 Faulhaber's formula
0.34 Haversine formula
0.33 Check Machin-like formulas
0.30 Resistance calculator
 
Sieve of Euripides >
0.46 Sieve of Pritchard
0.46 Four sides of square
0.41 Sieve of Eratosthenes
0.4 Piprimes
0.38 Sierpinski curve
 
Chowder numbers >
0.78 Chowla numbers
0.64 Powerful numbers
0.60 Rhonda numbers
0.60 Fermat numbers
0.6 Lah numbers
</pre>
 
=={{header|Julia}}==
 
Note that the code handles 2-byte characters such that a 2-char sequence containing one has 3 bytes, which causes a change in the fifth choice in the last example because "Erdős" has the multibyte char 'ő'.
<syntaxhighlight lang="julia">using Multisets
 
Line 211 ⟶ 409:
words = split(lowercase(txt), r"\s+")
for w in words
ifa length= collect(w) < 3
if length(a) < 3
push!(tokens, w)
else
for i in 1:length(wa)-1
isvalid(w, i) && isvalid(w, i + 1) && push!(tokens, wString(a[i:i+1]))
end
end
Line 223 ⟶ 422:
 
""" Sorenson-Dice similarity of multisets """
function sorensondicesorenson_dice(text1, text2)
bc1, bc2 = tokenizetext(text1), tokenizetext(text2)
return 2 * length(bc1 ∩ bc2) / (length(bc1) + length(bc2))
end
 
const alltasks = split(read("onedrive/documents/julia programs/tasks.txt", String), "\n")
 
# run tests
for test in ["Primordial primes", "Sunkist-Giuliani formula",
"Sieve of Euripides", "Chowder numbers"]
taskvalues = sort!([(sorensondicesorenson_dice(test, t), t) for t in alltasks], rev = true)
println("\n$test:")
for (val, task) in taskvalues[begin:begin+4]
Line 244 ⟶ 443:
Primordial primes:
0.6855 Sequence of primorial primes
0.6665 Factorial primes
0.5713 Primorial numbers
0.5454 Prime words
0.522 Almost prime
 
Sunkist-Giuliani formula:
Line 268 ⟶ 467:
0.609 Rhonda numbers
0.609 Fermat numbers
0.6096 Erdős–WoodsLah numbers
</pre>
 
Line 633 ⟶ 832:
 
The results on this basis are the same as the Raku example.
<syntaxhighlight lang="ecmascriptwren">import "io" for File
import "./str" for Str
import "./set" for Bag
2,461

edits