Sorensen–Dice coefficient

From Rosetta Code
Sorensen–Dice coefficient is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

The Sørensen–Dice coefficient, also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice,) is a statistic used to gauge the similarity of two poulation samples.

The original use was in botany, indexing similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the Levenshtein edit distance function, though it's strength lies in a different area.

Levenshtein is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long words / phrases.

Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words / phrases, possibly in improper order.

There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.

For instance, the word 'differ' would be tokenized to the group:

   di if ff fe er

Different implementations may do slightly different transforms. For our purposes, fold case so that all characters are the same case, split words, and ignore white-space, but keep punctuation.

The phrase "Don't Panic!" will be tokenized to the bi-grams:

   do on n' 't pa an ni ic c! 

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups.

For items(objects, populations) A and B:

   SDI = 2 × (A ∩ B) / (A ⊎ B)

The Sørensen–Dice coefficient is a "percent similarity" between the two populations between 0.0 and 1.0.

SDI can by used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.


Task
  • Use the list of Rosetta Code task and draft task names as your "dictionary" to search.
  • Using that dictionary, search for the mangled task names: 'Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers'.
  • Show the search term and the coefficient / match for the five closest, most similar matches.


How you get the task names is peripheral to the task. You can web-scrape them or download them to a file, whatever.

If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.


C++[edit]

Translation of: Wren
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using bigram = std::pair<char, char>;

std::multiset<bigram> bigrams(const std::string& phrase) {
    std::multiset<bigram> result;
    std::istringstream is(phrase);
    std::string word;
    while (is >> word) {
        for (char& ch : word) {
            ch = std::tolower(static_cast<unsigned char>(ch));
        }
        size_t length = word.size();
        if (length == 1) {
            result.emplace(word[0], '\0');
        } else {
            for (size_t i = 0; i + 1 < length; ++i) {
                result.emplace(word[i], word[i + 1]);
            }
        }
    }
    return result;
}

double sorensen(const std::string& s1, const std::string& s2) {
    auto a = bigrams(s1);
    auto b = bigrams(s2);
    std::multiset<bigram> c;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(c, c.begin()));
    return (2.0 * c.size()) / (a.size() + b.size());
}

int main() {
    std::vector<std::string> tasks;
    std::ifstream is("tasks.txt");
    if (!is) {
        std::cerr << "Cannot open tasks file.\n";
        return EXIT_FAILURE;
    }
    std::string task;
    while (getline(is, task)) {
        tasks.push_back(task);
    }
    const size_t tc = tasks.size();
    const std::string tests[] = {"Primordial primes",
                                 "Sunkist-Giuliani formula",
                                 "Sieve of Euripides", "Chowder numbers"};
    std::vector<std::pair<double, size_t>> sdi(tc);
    std::cout << std::fixed;
    for (const std::string& test : tests) {
        for (size_t i = 0; i != tc; ++i) {
            sdi[i] = std::make_pair(sorensen(tasks[i], test), i);
        }
        std::partial_sort(sdi.begin(), sdi.begin() + 5, sdi.end(),
                          [](const std::pair<double, size_t>& a,
                             const std::pair<double, size_t>& b) {
                              return a.first > b.first;
                          });
        std::cout << test << " >\n";
        for (size_t i = 0; i < 5 && i < tc; ++i) {
            std::cout << "  " << sdi[i].first << ' ' << tasks[sdi[i].second]
                      << '\n';
        }
        std::cout << '\n';
    }
    return EXIT_SUCCESS;
}
Output:
Primordial primes >
  0.685714 Sequence of primorial primes
  0.666667 Factorial primes
  0.571429 Primorial numbers
  0.545455 Prime words
  0.521739 Almost prime

Sunkist-Giuliani formula >
  0.565217 Almkvist-Giullera formula for pi
  0.378378 Faulhaber's formula
  0.342857 Haversine formula
  0.333333 Check Machin-like formulas
  0.307692 Resistance calculator

Sieve of Euripides >
  0.461538 Four sides of square
  0.461538 Sieve of Pritchard
  0.413793 Sieve of Eratosthenes
  0.400000 Piprimes
  0.384615 Sierpinski curve

Chowder numbers >
  0.782609 Chowla numbers
  0.640000 Powerful numbers
  0.608696 Rhonda numbers
  0.608696 Fermat numbers
  0.600000 Lah numbers

J[edit]

Tentative implementation:

TASKS=: fread '~/tasks.txt' NB. from Sorensen–Dice_coefficient/Tasks
sdtok=: [: (#~  ' '*/ .~:~])2]\ 7 u: tolower@rplc&(LF,' ')
sdinter=: {{
  all=. ~.x,y
  X=. <:#/.~all,x
  Y=. <:#/.~all,y
  +/X<.Y
}}
sdunion=: #@,
SDC=: (2 * sdinter % sdunion)&sdtok S:0
nearest=: {{ m{.\:~ x (] ;"0~ SDC) cutLF y }}
fmt=: ((8j6": 0{::]),' ',1{::])"1

The trick here is the concept of "intersection" which we must use. We can't use set intersection -- the current draft task description suggests that SDI = 2 × (A ∩ B) / (A ⊎ B) produces a number between 0 and 1. Because we're using division to produce this number, we must be using cardinality of the intersection rather than the intersection itself. But if A and B are sets, each containing the same tokens, the result here using cardinality of sets would be 2 rather than 1.

Instead, we treat A and B as sequences of tokens (so repeated copies of a token are distinct), for the cardinality of the intersection we count the number of times that each token appears in either A and in B and sum the minimum of the two counts. (So, tokens which only appear in A count 0 times, for example, where a token which appears 3 times in A and 2 times in B would contribute 2 to the sum.)

With this implementation, here's the task examples:

   fmt 'Primordial prime' 5 nearest TASKS
0.647059 Sequence of primorial primes
0.615385 Factorial primes            
0.592593 Primorial numbers           
0.571429 Prime words                 
0.545455 Almost prime                
   fmt 'Sunkist-Giuliani formula' 5 nearest TASKS
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula             
0.342857 Haversine formula               
0.333333 Check Machin-like formulas      
0.307692 Resistance calculator           
   fmt 'Sieve of Euripides' 5 nearest TASKS
0.461538 Sieve of Pritchard   
0.461538 Four sides of square 
0.413793 Sieve of Eratosthenes
0.400000 Piprimes             
0.384615 Sierpinski curve     
   fmt 'Chowder numbers' 5 nearest TASKS
0.782609 Chowla numbers  
0.640000 Powerful numbers
0.608696 Rhonda numbers  
0.608696 Fermat numbers  
0.600000 Lah numbers     

Perl[edit]

use v5.036;
use Path::Tiny;
use List::Util <uniq head>;

sub bi_gram {
    my $line = lc shift;
    uniq map { substr $line,$_,2 } 0..length($line)-2;
}

sub score {
    my($phrase, $word) = @_;
    my %count;
    my @match = bi_gram $phrase;
    $count{$_}++ for @match, @$word;
    2 * (grep { $count{$_} > 1 } keys %count) / (@match + @$word);
}

sub sorensen {
    my($dict,$word,$cutoff) = @_; $cutoff //= 0.00;
    my(%matches,$s);
    ($s = score($word, $$dict{$_})) > $cutoff and $matches{$_} = $s for keys %$dict;
    %matches;
}

my %dict = map { $_ => [ bi_gram($_) ] } path('ref/Sorensen-Dice-Tasks.txt')->slurp =~ /.{10,}/gm;

for my $word ( ('Primordial primes', 'Sunkist-Giuliani formula', 'Sieve of Euripides', 'Chowder numbers') ) {
    my(%scored,@ranked);
    %scored = sorensen(\%dict,$word);
    push @ranked, sprintf "%.3f $_", $scored{$_} for sort { $scored{$b} <=> $scored{$a} || $a cmp $b } keys %scored;
    say "\n$word:\n" . join("\n", head 5, @ranked);
}
Output:
Primordial primes:
0.741 Factorial primes
0.629 Sequence of primorial primes
0.583 Almost prime
0.581 Next special primes
0.571 Pandigital prime

Sunkist-Giuliani formula:
0.542 Almkvist-Giullera formula for pi
0.368 Haversine formula
0.359 Faulhaber's formula
0.348 Check Machin-like formulas
0.303 FASTA format

Sieve of Euripides:
0.541 Sieve of Eratosthenes
0.529 Sieve of Pritchard
0.457 Four sides of square
0.457 The sieve of Sundaram
0.387 Sum of a series

Chowder numbers:
0.769 Chowla numbers
0.615 Rhonda numbers
0.609 Bell numbers
0.609 Lah numbers
0.593 Kaprekar numbers

Phix[edit]

with javascript_semantics
function bigram(string s)
    sequence words = split(lower(s)),
             res = {}
    for word in words do
        for i=1 to length(word)-1 do
            res = append(res,word[i..i+1])
        end for
    end for
    res = sort(res)
    return res
end function

function intrasect(sequence s1, s2)
    integer l1 = length(s1),
            l2 = length(s2),
            i1 = 1, i2 = 1,
            res = 0
    while i1<=l1 and i2<=l2 do
        integer c = compare(s1[i1],s2[i2])
        res += (c=0)
        i1 += (c<=0)
        i2 += (c>=0)
    end while
    return res
end function

procedure test(string s, sequence dictionary)
    sequence scores = {},
             s1 = bigram(s)
    for phrase in dictionary do
        sequence s2 = bigram(phrase)
        scores &= 2*intrasect(s1,s2)/(length(s1)+length(s2))
    end for
    printf(1,"%s >\n",s)
    sequence tags = reverse(custom_sort(scores,tagset(length(scores)))[-5..-1])
    for t in tags do
        printf(1,"%f %s\n",{scores[t],dictionary[t]})
    end for
    printf(1,"\n")
end procedure

constant tests = split("""
Primordial primes
Sunkist-Giuliani formula
Sieve of Euripides
Chowder numbers
""",'\n'),
tasks = split("""
Almkvist-Giullera formula for pi
Almost prime
Check Machin-like formulas
Chowla numbers
Factorial primes
Faulhaber's formula
Fermat numbers
Four sides of square
Haversine formula
Lah numbers
Piprimes
Powerful numbers
Prime words
Primorial numbers
Resistance calculator
Rhonda numbers
Sequence of primorial primes
Sierpinski curve
Sieve of Eratosthenes
Sieve of Pritchard
""",'\n')
papply(true,test,{tests,{tasks}})
Output:

Extending the task list to the full 1577 entries changes nothing.

Primordial primes >
0.685714 Sequence of primorial primes
0.666667 Factorial primes
0.571429 Primorial numbers
0.545455 Prime words
0.521739 Almost prime

Sunkist-Giuliani formula >
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula
0.342857 Haversine formula
0.333333 Check Machin-like formulas
0.307692 Resistance calculator

Sieve of Euripides >
0.461538 Sieve of Pritchard
0.461538 Four sides of square
0.413793 Sieve of Eratosthenes
0.400000 Piprimes
0.384615 Sierpinski curve

Chowder numbers >
0.782609 Chowla numbers
0.640000 Powerful numbers
0.608696 Rhonda numbers
0.608696 Fermat numbers
0.600000 Lah numbers

Python[edit]

Of the several Python string similarity libraries implementing Sorenson-Dice similarity, none give the same results as the original example's Raku library, so this was imitated using Multisets, as per the C++ and Wren examples.

''' Rosetta Code task rosettacode.org/wiki/Sorensen–Dice_coefficient '''

from multiset import Multiset


def tokenizetext(txt):
    ''' convert a phrase into a count of bigram tokens of its words '''
    arr = []
    for wrd in txt.lower().split(' '):
        arr += ([wrd] if len(wrd) == 1 else [wrd[i:i+2]
                for i in range(len(wrd)-1)])
    return Multiset(arr)


def sorenson_dice(text1, text2):
    ''' Sorenson-Dice similarity of Multisets '''
    bc1, bc2 = tokenizetext(text1), tokenizetext(text2)
    return 2 * len(bc1 & bc2) / (len(bc1) + len(bc2))


with open('tasklist_sorenson.txt', 'r') as fd:
    alltasks = fd.read().split('\n')

for testtext in ['Primordial primes', 'Sunkist-Giuliani formula',
                 'Sieve of Euripides', 'Chowder numbers']:
    taskvalues = sorted([(sorenson_dice(testtext, t), t)
                        for t in alltasks], reverse=True)
    print(f'\n{testtext}:')
    for (val, task) in taskvalues[:5]:
        print(f'  {val:.6f}  {task}')
Output:
Primordial primes:
  0.685714  Sequence of primorial primes
  0.666667  Factorial primes
  0.571429  Primorial numbers
  0.545455  Prime words
  0.521739  Almost prime

Sunkist-Giuliani formula:
  0.565217  Almkvist-Giullera formula for pi
  0.378378  Faulhaber's formula
  0.342857  Haversine formula
  0.333333  Check Machin-like formulas
  0.307692  Resistance calculator

Sieve of Euripides:
  0.461538  Sieve of Pritchard
  0.461538  Four sides of square
  0.413793  Sieve of Eratosthenes
  0.400000  Piprimes
  0.384615  Sierpinski curve

Chowder numbers:
  0.782609  Chowla numbers
  0.640000  Powerful numbers
  0.608696  Rhonda numbers
  0.608696  Fermat numbers
  0.600000  Lah numbers

Raku[edit]

Using the library Text::Sorensen from the Raku ecosystem.

See the Raku entry for Text completion for a bespoke implementation of Sorenson-Dice. (Which is very similar to the library implementation.)

use Text::Sorensen :sdi;

my %tasks = './tasks.txt'.IO.slurp.lines.race.map: { $_ => .&bi-gram };

sub fuzzy-search (Str $string) { sdi($string, %tasks, :ge(.2) ).head(5).join: "\n" }

say "\n$_ >\n" ~ .&fuzzy-search for
  'Primordial primes',
  'Sunkist-Giuliani formula',
  'Sieve of Euripides',
  'Chowder numbers';
Output:
Primordial primes >
0.685714 Sequence of primorial primes
0.666667 Factorial primes
0.571429 Primorial numbers
0.545455 Prime words
0.521739 Almost prime

Sunkist-Giuliani formula >
0.565217 Almkvist-Giullera formula for pi
0.378378 Faulhaber's formula
0.342857 Haversine formula
0.333333 Check Machin-like formulas
0.307692 Resistance calculator

Sieve of Euripides >
0.461538 Four sides of square
0.461538 Sieve of Pritchard
0.413793 Sieve of Eratosthenes
0.4 Piprimes
0.384615 Sierpinski curve

Chowder numbers >
0.782609 Chowla numbers
0.64 Powerful numbers
0.608696 Fermat numbers
0.608696 Rhonda numbers
0.6 Lah numbers

Wren[edit]

Library: Wren-str
Library: Wren-set
Library: Wren-fmt

This assumes that a one letter word is treated as a bigram. It also assumes that all bigrams are matched whether duplicates or not.

The results on this basis are the same as the Raku example.

import "io" for File
import "./str" for Str
import "./set" for Bag
import "./fmt" for Fmt

var bigrams = Fn.new { |phrase|
    var words = Str.splitNoEmpty(phrase, " ")
    var res = []
    for (word in words) {
        var chars = Str.lower(word).toList
        if (chars.count == 1) {
            res.add(chars[0])
        } else {
            for (i in 0...chars.count-1) {
                res.add(chars[i] + chars[i+1])
            }
        }
    }
    return res
}

var sorensen = Fn.new { |a, b|
    var abi = Bag.new(bigrams.call(a))
    var bbi = Bag.new(bigrams.call(b))
    var common = abi.intersect(bbi)
    return 2 * common.count / (abi.count + bbi.count)
}

var fileName = "rc_tasks_2022_09_24.txt"  // local copy
var tasks = File.read(fileName).trimEnd().split("\n")
var tc = tasks.count
var tests = [
    "Primordial primes", "Sunkist-Giuliani formula", "Sieve of Euripides", "Chowder numbers"
]
var sdis = List.filled(tc, null)
for (test in tests) {
    for (i in 0...tasks.count) sdis[i] = [tasks[i], sorensen.call(tasks[i], test)]
    var top5 = sdis.sort { |e1, e2| e1[1] >= e2[1] }.take(5).toList
    System.print("%(test) >")
    for (e in top5) Fmt.print("  $f $s", e[1], e[0])
    System.print()
}
Output:
Primordial primes >
  0.685714 Sequence of primorial primes
  0.666667 Factorial primes
  0.571429 Primorial numbers
  0.545455 Prime words
  0.521739 Almost prime

Sunkist-Giuliani formula >
  0.565217 Almkvist-Giullera formula for pi
  0.378378 Faulhaber's formula
  0.342857 Haversine formula
  0.333333 Check Machin-like formulas
  0.307692 Resistance calculator

Sieve of Euripides >
  0.461538 Four sides of square
  0.461538 Sieve of Pritchard
  0.413793 Sieve of Eratosthenes
  0.400000 Piprimes
  0.384615 Sierpinski curve

Chowder numbers >
  0.782609 Chowla numbers
  0.640000 Powerful numbers
  0.608696 Fermat numbers
  0.608696 Rhonda numbers
  0.600000 Lah numbers