Sorensen–Dice coefficient: Difference between revisions

m
(tidy up task description)
m (→‎{{header|jq}}: simplify)
 
(3 intermediate revisions by the same user not shown)
Line 24:
Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups:
 
SDC = 2 × |A ∩ BA∩B| / (|A| + |B|)
 
where A, B and A∩B are to be understood as multisets, and that if an item, x, has multiplicity a in A and b in B, then it will have multiplicity min(a,b) in A∩B.
 
The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations.
Line 300 ⟶ 302:
0.6087 Rhonda numbers
0.6000 Lah numbers
</pre>
 
=={{header|jq}}==
{{Works with|jq}}
 
'''Works with gojq, the Go implementation of jq'''
 
'''Works with jaq, the Rust implementation of jq'''
 
'''Adapted from [[#Wren|Wren]]'''
<syntaxhighlight lang="jq">
### Generic preliminaries
 
def count(s): reduce s as $x (0; .+1);
 
def lpad($len): tostring | ($len - length) as $l | (" " * $l) + .;
 
# Emit the count of the common items in the two given sorted arrays
# viewed as multisets
def count_commonality_of_multisets($A; $B):
# Returns a stream of the common elements
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then 1, ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
count([0,0] | pop);
 
# Emit an array of the normalized bigrams of the input string
def bigrams:
# Emit a stream of the bigrams of the input string blindly
def bg: . as $in | range(0;length-1 ) | $in[.:.+2];
ascii_downcase | [splits(" *") | bg];
 
 
### The Sorensen-Dice coefficient
 
def sorensen($a; $b):
($a | bigrams | sort) as $A
| ($b | bigrams | sort) as $B
| 2 * count_commonality_of_multisets($A; $B) / (($A|length) + ($B|length));
 
 
### Exercises
 
def exercises:
"Primordial primes",
"Sunkist-Giuliani formula",
"Sieve of Euripides",
"Chowder numbers"
;
 
[inputs] as $phrases
| exercises as $test
| [ range(0; $phrases|length) as $i
| [sorensen($phrases[$i]; $test), $phrases[$i] ] ]
| sort_by(first)
| .[-5:]
| reverse
| "\($test) >",
map( " \(first|tostring|.[:4]|lpad(4)) \(.[1])")[],
""
</syntaxhighlight>
{{output}}
Invocation: jq -nrR -f sorensen-dice-coefficient.jq rc_tasks_2022_09_24.txt
<pre>
Primordial primes >
0.68 Sequence of primorial primes
0.66 Factorial primes
0.57 Primorial numbers
0.54 Prime words
0.52 Almost prime
 
Sunkist-Giuliani formula >
0.56 Almkvist-Giullera formula for pi
0.37 Faulhaber's formula
0.34 Haversine formula
0.33 Check Machin-like formulas
0.30 Resistance calculator
 
Sieve of Euripides >
0.46 Sieve of Pritchard
0.46 Four sides of square
0.41 Sieve of Eratosthenes
0.4 Piprimes
0.38 Sierpinski curve
 
Chowder numbers >
0.78 Chowla numbers
0.64 Powerful numbers
0.60 Rhonda numbers
0.60 Fermat numbers
0.6 Lah numbers
</pre>
 
2,461

edits