Sorensen–Dice coefficient: Difference between revisions

(New draft task and Raku entry)
 
m (fix links, formatting, typos, grammar)
Line 5:
The original use was in botany, indexing similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the [[Levenshtein distance|Levenshtein edit distance]] function, though its strength lies in a different area.
 
[[Levenshtein distance|Levenshtein]] is good for finding misspellings, but relies on the tested word / phrase being pretty similar to the desired one, and can be very slow for long words / phrases.
 
Sørensen–Dice is more useful for 'fuzzy' matching of partial and poorly spelled words / phrases, possibly in improper order.
 
There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
Line 30:
The Sørensen–Dice coefficient is a "percent similarity" between the two populations, ranging from 0.0 to 1.0.
 
SDI ''can'' be used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.
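
The bi-gram tokenizing and the coefficient described above can be sketched roughly as follows. This is only an illustration, not the reference implementation for the task: the function names (<code>bigrams</code>, <code>sorensen_dice</code>, <code>closest</code>) are made up for this example, and it assumes lowercased, whitespace-split words, bi-grams counted as a multiset, and the usual formula 2 × |A ∩ B| / (|A| + |B|). Single-letter words yield no bi-grams under this scheme.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count the bi-grams (two consecutive letters) of every word in text.

    Assumption: words are lowercased and split on whitespace.
    """
    grams = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            grams[word[i:i + 2]] += 1
    return grams


def sorensen_dice(a: str, b: str) -> float:
    """Return 2 * |A ∩ B| / (|A| + |B|) over bi-gram multisets (0.0 to 1.0)."""
    ga, gb = bigrams(a), bigrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Neither input produced any bi-grams; treat as no similarity.
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2 * shared / total


def closest(term: str, candidates, n: int = 5):
    """Return the n best (coefficient, candidate) pairs, highest first."""
    scored = ((sorensen_dice(term, c), c) for c in candidates)
    return sorted(scored, reverse=True)[:n]


if __name__ == "__main__":
    # "night" and "nacht" share only the bi-gram "ht": 2 * 1 / (4 + 4)
    print(sorensen_dice("night", "nacht"))  # 0.25
```

Because only the counts of shared bi-grams matter, word order has no effect on the result, which is why the measure tolerates phrases given in improper order.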
 
 
Line 38:
* Show the search term and the coefficient / match for the five closest, most similar matches.
 
 
How you get the task names is peripheral to the task. You can [[:Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever.

If there is a built-in or easily and freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that with a pointer to where it may be obtained.
 
 
=={{header|Raku}}==
Using the library [https://raku.land/github:thundergnat/Text::Sorensen Text::Sorensen] from the Raku ecosystem.
 
See the Raku entry for [[Text_completion#Sorenson-Dice|Text completion]] for a bespoke implementation of Sørensen–Dice, which is ''very'' similar to the library implementation.
<syntaxhighlight lang="raku" line>use Text::Sorensen :sdi;
 