Sorensen–Dice coefficient: Difference between revisions

Content added Content deleted
(Added Wren)
(J second draft)
Line 47: Line 47:
Tentative implementation:
<syntaxhighlight lang=J>TASKS=: fread '~/tasks.txt' NB. from Sorensen–Dice_coefficient/Tasks
stok=: [: (#~ ' '*/ .~:~])2]\ 7 u: tolower@rplc&(LF,' ')
union=: ,
inter=: [-.-.
SDI=: ((inter +&#~ inter~) % #@union)&stok S:0
nearest=: {{ m{.\:~ x (] ;"0~ SDI) cutLF y }}</syntaxhighlight>
<syntaxhighlight lang=J>TASKS=: fread '~/tasks.txt' NB. from Sorensen–Dice_coefficient/Tasks
sdtok=: [: (#~ ' '*/ .~:~])2]\ 7 u: tolower@rplc&(LF,' ')
sdinter=: {{
  all=. ~.x,y
  X=. <:#/.~all,x
  Y=. <:#/.~all,y
  +/X<.Y
}}
sdunion=: #@,
SDC=: (2 * sdinter % sdunion)&sdtok S:0
nearest=: {{ m{.\:~ x (] ;"0~ SDC) cutLF y }}
fmt=: ((8j6": 0{::]),' ',1{::])"1</syntaxhighlight>


This is slightly different from the current draft task description, which suggests that <code>SDI = 2 × (A ∩ B) / (A ⊎ B)</code> produces a number between 0 and 1. If A and B are sets, each containing the same tokens, the result here would be 2 rather than 1. But we can make sense of this by assuming that the original algorithm was working with sequences rather than sets. The sequence difference is not commutative, so if <code>∩</code> represents sequence difference, and <code>⊎</code> represents sequence addition, it would make sense to define <code>SDI = ((A ∩ B) + (B ∩ A)) / (A ⊎ B)</code>, which is what we have done here (note that this change also includes an implicit shift from tokens to token counts somewhere in that calculation, as division only makes sense with numbers).

The trick here is the concept of "intersection" which we must use. We can't use set intersection: the current draft task description suggests that <code>SDI = 2 × (A ∩ B) / (A ⊎ B)</code> produces a number between 0 and 1. Because we're using division to produce this number, we must be using the cardinality of the intersection rather than the intersection itself.
But if A and B are sets, each containing the same tokens, the result using cardinality of sets would be 2 rather than 1.

Instead, we treat A and B as sequences of tokens (so repeated copies of a token are distinct), and for the cardinality of the intersection we count the number of times that each token appears in A and in B and sum the minimum of the two counts. (So a token which appears only in A contributes 0, for example, whereas a token which appears 3 times in A and 2 times in B contributes 2 to the sum.)
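
The counting rule described here can be sketched in Python. This is only an illustration and not part of the J entry; the function name <code>multiset_intersection_size</code> and the sample tokens are hypothetical.

```python
from collections import Counter

def multiset_intersection_size(a, b):
    """Sum, over every distinct token in a, the smaller of its counts in a and b."""
    count_a = Counter(a)
    count_b = Counter(b)
    # A token missing from b has count 0 there, so it contributes 0.
    return sum(min(count_a[tok], count_b[tok]) for tok in count_a)

# "pr" appears 3 times in A and 2 times in B, so it contributes 2.
# "xy" appears only in A, so it contributes 0.
A = ["pr", "pr", "pr", "xy"]
B = ["pr", "pr", "zz"]
print(multiset_intersection_size(A, B))  # 2
```

Tokens that occur only in B never need to be visited, since their minimum count is 0 as well.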


With this implementation, here are the task examples:
<pre>   'Primordial primes' 5 nearest TASKS
┌────────┬────────────────────────────┐
│0.740741│Factorial primes            │
├────────┼────────────────────────────┤
│0.714286│Sequence of primorial primes│
├────────┼────────────────────────────┤
│0.681818│Prime words                 │
├────────┼────────────────────────────┤
│0.652174│Almost prime                │
├────────┼────────────────────────────┤
│0.642857│Primorial numbers           │
└────────┴────────────────────────────┘
   'Sunkist-Giuliani formula' 5 nearest TASKS
┌────────┬───────────────────────────────────┐
│0.608696│Almkvist-Giullera formula for pi   │
├────────┼───────────────────────────────────┤
│0.378378│Faulhaber's formula                │
├────────┼───────────────────────────────────┤
│0.371429│Haversine formula                  │
├────────┼───────────────────────────────────┤
│0.357143│Check Machin-like formulas         │
├────────┼───────────────────────────────────┤
│0.340426│Shoelace formula for polygonal area│
└────────┴───────────────────────────────────┘
   'Sieve of Euripides' 5 nearest TASKS
┌────────┬────────────────────────┐
│0.461538│Sieve of Pritchard      │
├────────┼────────────────────────┤
│0.461538│Four sides of square    │
├────────┼────────────────────────┤
│0.413793│Sieve of Eratosthenes   │
├────────┼────────────────────────┤
│0.4     │Piprimes                │
├────────┼────────────────────────┤
│0.392857│Law of cosines - triples│
└────────┴────────────────────────┘
   'Chowder numbers' 5 nearest TASKS
┌────────┬──────────────┐
│0.826087│Chowla numbers│
├────────┼──────────────┤
│0.666667│Bell numbers  │
├────────┼──────────────┤
│0.652174│Rhonda numbers│
├────────┼──────────────┤
│0.652174│Humble numbers│
├────────┼──────────────┤
│0.65    │Lah numbers   │
└────────┴──────────────┘</pre>
<pre>   fmt 'Primordial prime' 5 nearest TASKS
0.647059 Sequence of primorial primes
0.615385 Factorial primes
0.592593 Primorial numbers
0.571429 Prime words
0.545455 Almost prime
   fmt 'Sunkist-Giuliani formula' 5 nearest TASKS
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula
0.342857 Haversine formula
0.333333 Check Machin-like formulas
0.307692 Resistance calculator
   fmt 'Sieve of Euripides' 5 nearest TASKS
0.461538 Sieve of Pritchard
0.461538 Four sides of square
0.413793 Sieve of Eratosthenes
0.400000 Piprimes
0.384615 Sierpinski curve
   fmt 'Chowder numbers' 5 nearest TASKS
0.782609 Chowla numbers
0.640000 Powerful numbers
0.608696 Rhonda numbers
0.608696 Fermat numbers
0.600000 Lah numbers</pre>
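
As a cross-check of the <code>fmt</code> scores for 'Primordial prime', here is a minimal Python sketch of the same calculation. It assumes, from a reading of the J code, that tokens are lowercase two-character pairs containing no space and that the denominator is the combined token count of both strings. The names <code>bigrams</code>, <code>sdc</code> and <code>nearest</code> and the three-entry task list are illustrative only, not part of the J entry.

```python
from collections import Counter

def bigrams(text):
    """Lowercase the text; return its two-character tokens, skipping pairs with a space."""
    s = text.lower().replace("\n", " ")
    return [s[i:i + 2] for i in range(len(s) - 1) if " " not in s[i:i + 2]]

def sdc(a, b):
    """Coefficient as 2 * (shared token count) / (total token count)."""
    ta, tb = bigrams(a), bigrams(b)
    if not ta and not tb:
        return 0.0  # assumption: two strings without any tokens score 0
    # Counter & Counter keeps the minimum count of each token.
    shared = sum((Counter(ta) & Counter(tb)).values())
    return 2 * shared / (len(ta) + len(tb))

def nearest(query, tasks, n=5):
    """Return the n best (score, name) pairs, highest score first."""
    return sorted(((sdc(query, t), t) for t in tasks), reverse=True)[:n]

tasks = ["Factorial primes", "Sequence of primorial primes", "Almost prime"]
for score, name in nearest("Primordial prime", tasks):
    print(f"{score:8.6f} {name}")
```

Counting by hand, 'Primordial prime' and 'Factorial primes' each yield 13 pairs and share 8 of them, so the score is 16/26 ≈ 0.615385, which agrees with the listing.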


=={{header|Phix}}==