Sorensen–Dice coefficient: Difference between revisions

Content added Content deleted

Inline

Revision as of 17:29, 24 September 2022

The Sørensen–Dice coefficient, also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice,) is a statistic used to gauge the similarity of two poulation samples.

The original use was in botany, indexing similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the Levenshtein edit distance function, though it's strength lies in a different area.

Levenshtein is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long words / phrases.

Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words / phrases, possibly in improper order.

There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.

For instance, the word 'differ' would be tokenized to the group:

   di if ff fe er

Different implementation may do slightly different transforms. For our purposes, fold case so that all characters are the same case, split words, and ignore white-space, but keep punctuation.

The phrase "Don't Panic!" will be tokenized to the bi-grams:

   do on n' 't pa an ni ic c!

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups.

For items(objects, populations) A and B:

   SDI = 2 × (A ∩ B) / (A ⊎ B)

The Sørensen–Dice coefficient is a "percent similarity" between the two population between 0.0 and 1.0.

SDI can by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.

Task

Use the list of Rosetta Code task and draft task names as your "dictionary" to search.
Using that dictionary, search for the mangled task names: 'Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers'.
Show the search term and the coefficient / match for the five closest, most similar matches.

How you get the task names is peripheral to the task. You can web-scrape them or download them to a file, whatever.

If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.

Raku

Using the library Text::Sorensen from the Raku ecosystem.

See the Raku entry for Text completion for a bespoke implementation of Sorenson-Dice. (Which is very similar to the library implementation.)

use Text::Sorensen :sdi;

my %tasks = './tasks.txt'.IO.slurp.lines.race.map: { $_ => .&bi-gram };

sub fuzzy-search (Str $string) { sdi($string, %tasks, :ge(.2) ).head(5).join: "\n" }

say "\n$_ >\n" ~ .&fuzzy-search for
  'Primordial primes',
  'Sunkist-Giuliani formula',
  'Sieve of Euripides',
  'Chowder numbers';

Output:

Primordial primes >
0.685714 Sequence of primorial primes
0.666667 Factorial primes
0.571429 Primorial numbers
0.545455 Prime words
0.521739 Almost prime

Sunkist-Giuliani formula >
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula
0.342857 Haversine formula
0.333333 Check Machin-like formulas
0.307692 Resistance calculator

Sieve of Euripides >
0.461538 Four sides of square
0.461538 Sieve of Pritchard
0.413793 Sieve of Eratosthenes
0.4 Piprimes
0.384615 Sierpinski curve

Chowder numbers >
0.782609 Chowla numbers
0.64 Powerful numbers
0.608696 Fermat numbers
0.608696 Rhonda numbers
0.6 Lah numbers

@@ Line 5: / Line 5: @@
 The original use was in botany, indexing similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the [[Levenshtein distance|Levenshtein edit distance]] function, though it's strength lies in a different area.
-[[Levenshtein distance|Levenshtein]] is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long word / phrases.
+[[Levenshtein distance|Levenshtein]] is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long words / phrases.
-Sørensen–Dice is more useful for 'fuzzy' matching. Partial, and poorly spelled words / phrases, possibly in improper order.
+Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words / phrases, possibly in improper order.
 There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
@@ Line 30: / Line 30: @@
 The Sørensen–Dice coefficient is a "percent similarity" between the two population between 0.0 and 1.0.
-SDI ''can'' by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often get remarkably close with some pretty poor inputs.
+SDI ''can'' by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.
@@ Line 38: / Line 38: @@
 * Show the search term and the coefficient / match for the five closest, most similar matches.
-How you get the task names is peripheral to the task. You can [[Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever. If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.
+How you get the task names is peripheral to the task. You can [[:Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever.
+If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.
 =={{header|Raku}}==
-Using the library [https://raku.land/github:thundergnat/Text::Sorensen Text::Sorensen] for the Raku ecosystem.
+Using the library [https://raku.land/github:thundergnat/Text::Sorensen Text::Sorensen] from the Raku ecosystem.
-See the Raku entry for [[Text_completion#Sorenson-Dice|Text completion]] for a bespoke implementation of Sorenson-Dice.
+See the Raku entry for [[Text_completion#Sorenson-Dice|Text completion]] for a bespoke implementation of Sorenson-Dice. (Which is ''very'' similar to the library implementation.)
 <syntaxhighlight lang="raku" line>use Text::Sorensen :sdi;