Sorensen–Dice coefficient: Difference between revisions
(New post.)
m (→{{header|jq}}: simplify)
(5 intermediate revisions by 2 users not shown)
Line 1:
{{task}}
The [[wp:Sørensen–Dice coefficient|Sørensen–Dice coefficient]], also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice
The original use was in botany
[[Levenshtein distance|Levenshtein]]
Sørensen–Dice is more useful for 'fuzzy' matching partial
There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
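As a rough illustration of the bi-gram scheme just described, here is a minimal Python sketch. Lowercasing the text and splitting on whitespace are assumptions made for this example (the task text only says that words are broken into groups of two consecutive letters), and the name `bigrams` is purely illustrative.

```python
def bigrams(text):
    """Return the letter bi-grams of each word in text, in order.

    Assumptions for this sketch: the text is lowercased first and
    words are separated by whitespace. A one-letter word yields no
    bi-grams.
    """
    grams = []
    for word in text.lower().split():
        # Slide a window of width 2 across the word.
        grams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return grams


print(bigrams("Lah numbers"))
# ['la', 'ah', 'nu', 'um', 'mb', 'be', 'er', 'rs']
```

Note that bi-grams never span a word boundary in this sketch, so "Lah numbers" contributes no "hn" token.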
Line 22:
Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups:
 SDI = 2 × (A ∩ B) / (A ⊎ B)
where A, B and A∩B are to be understood as multisets: if an item, x, has multiplicity a in A and b in B, then it has multiplicity min(a,b) in A∩B.
The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations
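The multiset definition can be sketched in Python with `collections.Counter`, whose `&` operator keeps the minimum multiplicity of each item. The function name `sorensen_dice` and the choice to return 0.0 when both token lists are empty are assumptions of this sketch, not part of the task description.

```python
from collections import Counter


def sorensen_dice(tokens_a, tokens_b):
    """Sørensen–Dice coefficient of two token lists, treated as multisets."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Assumption: two empty token lists are given similarity 0.0
        # rather than raising a division-by-zero error.
        return 0.0
    # Counter & Counter keeps min(a[x], b[x]) for every item x.
    common = sum((a & b).values())
    return 2 * common / total


# "ab" occurs twice on both sides, so the intersection has 2 tokens
# out of 3 + 3 = 6 in total: 2 * 2 / 6.
print(round(sorensen_dice(["ab", "bc", "ab"], ["ab", "ab", "cd"]), 4))
# 0.6667
```

Identical non-empty token lists give 1.0 and lists with no token in common give 0.0, matching the stated range.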
Line 41:
How you get the task names is peripheral to the task. You can [[:Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever.
If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that with a pointer to where it may be obtained.
Line 302:
0.6087 Rhonda numbers
0.6000 Lah numbers
</pre>
=={{header|jq}}==
{{Works with|jq}}
'''Works with gojq, the Go implementation of jq'''
'''Works with jaq, the Rust implementation of jq'''
'''Adapted from [[#Wren|Wren]]'''
<syntaxhighlight lang="jq">
### Generic preliminaries

def count(s): reduce s as $x (0; .+1);

def lpad($len): tostring | ($len - length) as $l | (" " * $l) + .;

# Emit the count of the common items in the two given sorted arrays
# viewed as multisets
def count_commonality_of_multisets($A; $B):
  # Emit a 1 for each common element, walking both sorted arrays in step
  def pop:
    .[0] as $i
    | .[1] as $j
    | if $i == ($A|length) or $j == ($B|length) then empty
      elif $A[$i] == $B[$j] then 1, ([$i+1, $j+1] | pop)
      elif $A[$i] < $B[$j] then [$i+1, $j] | pop
      else [$i, $j+1] | pop
      end;
  count([0,0] | pop);

# Emit an array of the normalized bigrams of the input string
def bigrams:
  # Emit a stream of the bigrams of the input string blindly
  def bg: . as $in | range(0; length-1) | $in[.:.+2];
  # Split on runs of one or more spaces so that each word is tokenized separately
  ascii_downcase | [splits(" +") | bg];

### The Sorensen-Dice coefficient
def sorensen($a; $b):
  ($a | bigrams | sort) as $A
  | ($b | bigrams | sort) as $B
  | 2 * count_commonality_of_multisets($A; $B) / (($A|length) + ($B|length));

### Exercises
def exercises:
  "Primordial primes",
  "Sunkist-Giuliani formula",
  "Sieve of Euripides",
  "Chowder numbers"
;

[inputs] as $phrases
| exercises as $test
| [ range(0; $phrases|length) as $i
    | [sorensen($phrases[$i]; $test), $phrases[$i] ] ]
| sort_by(first)
| .[-5:]
| reverse
| "\($test) >",
  map( " \(first|tostring|.[:4]|lpad(4)) \(.[1])")[],
  ""
</syntaxhighlight>
{{output}}
Invocation: jq -nrR -f sorensen-dice-coefficient.jq rc_tasks_2022_09_24.txt
<pre>
Primordial primes >
0.68 Sequence of primorial primes
0.66 Factorial primes
0.57 Primorial numbers
0.54 Prime words
0.52 Almost prime
Sunkist-Giuliani formula >
0.56 Almkvist-Giullera formula for pi
0.37 Faulhaber's formula
0.34 Haversine formula
0.33 Check Machin-like formulas
0.30 Resistance calculator
Sieve of Euripides >
0.46 Sieve of Pritchard
0.46 Four sides of square
0.41 Sieve of Eratosthenes
0.4 Piprimes
0.38 Sierpinski curve
Chowder numbers >
0.78 Chowla numbers
0.64 Powerful numbers
0.60 Rhonda numbers
0.60 Fermat numbers
0.6 Lah numbers
</pre>
Line 736 ⟶ 832:
The results on this basis are the same as the Raku example.
<syntaxhighlight lang="
import "./str" for Str
import "./set" for Bag