Sorensen–Dice coefficient: Difference between revisions

2,525 bytes added , 2 months ago

m

→‎{{header|jq}}: simplify

Peak

2,461

edits

Revision as of 07:57, 10 March 2024 by Peak (tidy up task description)
Latest revision as of 20:04, 10 March 2024 by Peak (minor edit; →{{header|jq}}: simplify)
(3 intermediate revisions by the same user not shown)
Line 24:

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups:

 SDC = 2 × |A∩B| / (|A| + |B|)

where A, B and A∩B are to be understood as multisets: if an item x has multiplicity a in A and b in B, then it has multiplicity min(a,b) in A∩B.

The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations.

Line 300 ⟶ 302:

0.6087 Rhonda numbers
0.6000 Lah numbers
</pre>

=={{header|jq}}==
{{Works with|jq}}

'''Works with gojq, the Go implementation of jq'''

'''Works with jaq, the Rust implementation of jq'''

'''Adapted from [[#Wren|Wren]]'''
<syntaxhighlight lang="jq">
### Generic preliminaries

def count(s): reduce s as $x (0; .+1);

def lpad($len): tostring | ($len - length) as $l | (" " * $l) + .;

# Emit the count of the common items in the two given sorted arrays
# viewed as multisets
def count_commonality_of_multisets($A; $B):
  # Returns a stream of the common elements
  def pop:
    .[0] as $i
    | .[1] as $j
    | if $i == ($A|length) or $j == ($B|length) then empty
      elif $A[$i] == $B[$j] then 1, ([$i+1, $j+1] | pop)
      elif $A[$i] < $B[$j] then [$i+1, $j] | pop
      else [$i, $j+1] | pop
      end;
  count([0,0] | pop);

# Emit an array of the normalized bigrams of the input string
def bigrams:
  # Emit a stream of the bigrams of the input string blindly
  def bg: . as $in | range(0; length-1) | $in[.:.+2];
  ascii_downcase | [splits(" ") | bg];

### The Sorensen-Dice coefficient
def sorensen($a; $b):
  ($a | bigrams | sort) as $A
  | ($b | bigrams | sort) as $B
  | 2 * count_commonality_of_multisets($A; $B) / (($A|length) + ($B|length));

### Exercises

def exercises:
  "Primordial primes",
  "Sunkist-Giuliani formula",
  "Sieve of Euripides",
  "Chowder numbers"
;

[inputs] as $phrases
| exercises as $test
| [ range(0; $phrases|length) as $i
    | [sorensen($phrases[$i]; $test), $phrases[$i] ] ]
| sort_by(first)
| .[-5:]
| reverse
| "\($test) >",
  map( " \(first|tostring|.[:4]|lpad(4)) \(.[1])")[],
  ""
</syntaxhighlight>
{{output}}
Invocation: jq -nrR -f sorensen-dice-coefficient.jq rc_tasks_2022_09_24.txt
<pre>
Primordial primes >
 0.68 Sequence of primorial primes
 0.66 Factorial primes
 0.57 Primorial numbers
 0.54 Prime words
 0.52 Almost prime

Sunkist-Giuliani formula >
 0.56 Almkvist-Giullera formula for pi
 0.37 Faulhaber's formula
 0.34 Haversine formula
 0.33 Check Machin-like formulas
 0.30 Resistance calculator

Sieve of Euripides >
 0.46 Sieve of Pritchard
 0.46 Four sides of square
 0.41 Sieve of Eratosthenes
  0.4 Piprimes
 0.38 Sierpinski curve

Chowder numbers >
 0.78 Chowla numbers
 0.64 Powerful numbers
 0.60 Rhonda numbers
 0.60 Fermat numbers
  0.6 Lah numbers
</pre>
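
As a cross-check of the multiset formula in the task description, here is a minimal Python sketch; it is not part of the jq entry. The names <code>bigrams</code> and <code>sorensen_dice</code> are illustrative. It assumes that words are separated by runs of whitespace (the jq entry splits on single spaces) and that two inputs with no bigrams at all score 0.0. <code>Counter</code>'s <code>&</code> operator keeps the minimum multiplicity of each item, which matches the definition of A∩B.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of lower-cased character bigrams, taken word by word."""
    counts = Counter()
    for word in text.lower().split():
        # A word of length n contributes n-1 overlapping bigrams.
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts


def sorensen_dice(a, b):
    """Sorensen-Dice coefficient of two strings, treating bigrams as multisets."""
    A, B = bigrams(a), bigrams(b)
    total = sum(A.values()) + sum(B.values())
    if total == 0:
        # Assumption: define the coefficient as 0.0 when neither input has a bigram.
        return 0.0
    # Counter intersection keeps min(a, b) copies of each bigram.
    common = sum((A & B).values())
    return 2 * common / total
```

For "Chowder numbers" against "Chowla numbers" the multisets hold 12 and 11 bigrams with 9 in common, so the sketch returns 18/23 ≈ 0.7826, which agrees with the 0.78 shown in the jq output.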