Sorensen–Dice coefficient

From Rosetta Code

Sorensen–Dice coefficient is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

The Sørensen–Dice coefficient, also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice), is a statistic used to gauge the similarity of two population samples.

The original use was in botany, as an index of the similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function, somewhat like the Levenshtein edit distance function, though its strength lies in a different area.

Levenshtein is good for finding misspellings, but it relies on the tested word or phrase being fairly similar to the desired one, and it can be very slow for long words or phrases.

Sørensen–Dice is more useful for 'fuzzy' matching of partial and poorly spelled words or phrases, possibly given in the wrong order.

There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.

For instance, the word 'differ' would be tokenized to the group:

   di if ff fe er

Different implementations may apply slightly different transforms. For our purposes: fold case so that comparisons are case-insensitive, split the text into words, discard the white-space, but keep punctuation.

The phrase "Don't Panic!" will be tokenized to the bi-grams:

   do on n' 't pa an ni ic c! 
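
A minimal Python sketch of this tokenization step (purely illustrative; the function name bigrams is my own, and it keeps only unique bi-grams, which is one of several possible conventions):

def bigrams(text):
    # fold case, split on white-space, and collect the unique
    # two-letter bi-grams of each word (punctuation is kept)
    grams = set()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            grams.add(word[i:i + 2])
    return grams

print(sorted(bigrams("Don't Panic!")))
# ["'t", 'an', 'c!', 'do', 'ic', "n'", 'ni', 'on', 'pa']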

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups.

For items (objects, populations) A and B:

   SDI = 2 × |A ∩ B| / (|A| + |B|)

The Sørensen–Dice coefficient is a "percent similarity" between the two populations, ranging from 0.0 (no bi-grams in common) to 1.0 (identical bi-gram sets).
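
Continuing the sketch above (still only as an illustration, reusing the bigrams helper and its unique-bi-gram convention), the coefficient itself is a one-liner on those sets:

def sdi(a, b):
    # twice the shared bi-gram count over the total bi-gram count of both sets
    return 2 * len(a & b) / (len(a) + len(b))

# worked example: 'differ' vs 'difference'
a = bigrams("differ")      # {'di', 'if', 'ff', 'fe', 'er'}
b = bigrams("difference")  # the above plus 're', 'en', 'nc', 'ce' -> 9 bi-grams
print(sdi(a, b))           # 2 * 5 / (5 + 9) = 0.714...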

SDI can be used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but it often gets remarkably close given some pretty poor inputs.


Task
  • Use the list of Rosetta Code task and draft task names as your "dictionary" to search.
  • Using that dictionary, search for the mangled task names: 'Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers'.
  • For each search term, show the coefficient and the name of the five closest, most similar matches.


How you get the task names is peripheral to the task. You can web-scrape them or download them to a file, whatever.

If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that, with a pointer to where it may be obtained.
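
As a hypothetical end-to-end sketch (not one of the language entries that follow), the task can be wired together in Python by reusing the bigrams and sdi helpers above; the five task names here are just a stand-in for the full dictionary:

from heapq import nlargest

def fuzzy_search(query, dictionary, n=5):
    # rank dictionary entries by Sørensen–Dice coefficient against the query
    q = bigrams(query)
    return nlargest(n, ((sdi(q, bigrams(entry)), entry) for entry in dictionary))

tasks = ["Factorial primes", "Sequence of primorial primes",
         "Prime words", "Almost prime", "Primorial numbers"]
print("Primordial primes >")
for score, name in fuzzy_search("Primordial primes", tasks):
    print(f"{score:.6f} {name}")

With unique bi-grams this should reproduce the first block of the Phix output below (match_raku = false).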


Phix

with javascript_semantics
-- when true, keep duplicate bi-grams (uniquified with '1' suffixes) to match the Raku results
constant match_raku = false
include sets.e

-- collect the bi-grams of each word of s (lower-cased and split on white-space)
function bigram(string s)
    sequence words = split(lower(s)),
             res = {}
    for word in words do
        for i=1 to length(word)-1 do
            if not match_raku then
                res = add_member(res,word[i..i+1])
            else
                string c2 = word[i..i+1]
                while is_member(res,c2) do
                    c2 &= '1'
                end while
                res = add_member(res,c2)
            end if
        end for
    end for
    return res
end function

-- score s against every dictionary entry and show the five best matches
procedure test(string s, sequence dictionary)
    sequence scores = {},
             s1 = bigram(s)
    integer l1 = length(s1)
    for phrase in dictionary do
        sequence s2 = bigram(phrase)
        integer l2 = length(s2)
        s2 = intersection(s2,s1)
        scores &= 2*length(s2)/(l1+l2)
    end for
    printf(1,"%s >\n",s)
    sequence tags = reverse(custom_sort(scores,tagset(length(scores)))[-5..-1])
    for t in tags do
        printf(1,"%f %s\n",{scores[t],dictionary[t]})
    end for
end procedure

constant tests = split("""
Primordial primes
Sunkist-Giuliani formula
Sieve of Euripides
Chowder numbers
""",'\n'),
tasks = split("""
Almkvist-Giullera formula for pi
Almost prime
Bell numbers
Check Machin-like formulas
Chowla numbers
Factorial primes
Faulhaber's formula
Fermat numbers
Four sides of square
Haversine formula
Lah numbers
Piprimes
Powerful numbers
Prime words
Primorial numbers
Resistance calculator
Rhonda numbers
Sequence of primorial primes
Sierpinski curve
Sieve of Eratosthenes
Sieve of Pritchard
""",'\n')
papply(true,test,{tests,{tasks}})
You can get the same results as Raku by setting match_raku to true, i.e. "uniqueify" all the bi-grams, so for instance "Primordial primes" adds both "pr" and "pr1", whereas when match_raku is false it just adds "pr" once, and of course the same goes for "ri", "im", etc. Also, extending the task list to the full 1577 entries changes nothing either way.

Output:

Primordial primes >
0.695652 Factorial primes
0.642857 Sequence of primorial primes
0.631579 Prime words
0.600000 Almost prime
0.583333 Primorial numbers
Sunkist-Giuliani formula >
0.571429 Almkvist-Giullera formula for pi
0.352941 Haversine formula
0.350000 Check Machin-like formulas
0.342857 Faulhaber's formula
0.315789 Resistance calculator
Sieve of Euripides >
0.461538 Sieve of Pritchard
0.461538 Four sides of square
0.413793 Sieve of Eratosthenes
0.400000 Piprimes
0.384615 Sierpinski curve
Chowder numbers >
0.818182 Chowla numbers
0.636364 Rhonda numbers
0.631579 Lah numbers
0.631579 Bell numbers
0.608696 Powerful numbers

Raku

Using the library Text::Sorensen from the Raku ecosystem.

See the Raku entry for Text completion for a bespoke implementation of Sørensen-Dice (which is very similar to the library implementation).

use Text::Sorensen :sdi;

# pre-compute the bi-grams of every task name in the dictionary
my %tasks = './tasks.txt'.IO.slurp.lines.race.map: { $_ => .&bi-gram };

# keep candidates scoring at least 0.2 and return the five best
sub fuzzy-search (Str $string) { sdi($string, %tasks, :ge(.2) ).head(5).join: "\n" }

say "\n$_ >\n" ~ .&fuzzy-search for
  'Primordial primes',
  'Sunkist-Giuliani formula',
  'Sieve of Euripides',
  'Chowder numbers';
Output:
Primordial primes >
0.685714 Sequence of primorial primes
0.666667 Factorial primes
0.571429 Primorial numbers
0.545455 Prime words
0.521739 Almost prime

Sunkist-Giuliani formula >
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula
0.342857 Haversine formula
0.333333 Check Machin-like formulas
0.307692 Resistance calculator

Sieve of Euripides >
0.461538 Four sides of square
0.461538 Sieve of Pritchard
0.413793 Sieve of Eratosthenes
0.4 Piprimes
0.384615 Sierpinski curve

Chowder numbers >
0.782609 Chowla numbers
0.64 Powerful numbers
0.608696 Fermat numbers
0.608696 Rhonda numbers
0.6 Lah numbers