Sorensen–Dice coefficient: Difference between revisions

Line 1:

The [[wp:Sørensen–Dice coefficient|Sørensen–Dice coefficient]], also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice,) is a statistic used to gauge the similarity of two ~~poulation~~ samples.

The [[wp:Sørensen–Dice coefficient|Sørensen–Dice coefficient]], also known as the Sørensen–Dice index (or sdi, or sometimes by one of the individual names: sorensen or dice), is a statistic used to gauge the similarity of two population samples.

The original use was in botany, ~~indexing~~ similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the [[Levenshtein distance|Levenshtein edit distance]] function, though ~~it's~~ ~~strength~~ ~~lies~~ ~~in a~~ different ~~area~~.

The original use was in botany as a measure of similarity between populations of flora and fauna in different areas, but it has uses in other fields as well. It can be used as a text similarity function somewhat similar to the [[Levenshtein distance|Levenshtein edit distance]] function, though its characteristics are quite different.

[[Levenshtein distance|Levenshtein]] is ~~good~~ for ~~finding~~ ~~misspellings~~, but relies on the tested word / phrase being ~~pretty~~ similar to the desired one, and can be very slow for long words / phrases.

[[Levenshtein distance|Levenshtein]] can be useful for spelling correction, but relies on the tested word or phrase being quite similar to the desired one, and can be very slow for long words or phrases.

Sørensen–Dice is more useful for 'fuzzy' matching partial, and poorly spelled words / phrases, possibly in improper order.

Sørensen–Dice is more useful for 'fuzzy' matching of partial or poorly spelled words or phrases, possibly with the words out of order.

There are several different methods to tokenize objects for Sørensen–Dice comparisons. The most typical tokenizing scheme for text is to break the words up into bi-grams: groups of two consecutive letters.
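The bi-gram scheme described above can be sketched as follows. This is only an illustration, not a prescribed implementation: the function name <code>bigrams</code>, the lowercasing, and the choice to split on whitespace so that bi-grams never span two words are all assumptions made for the example.

```python
def bigrams(text):
    """Split text into words, then each word into groups of two consecutive letters.

    Assumptions for this sketch: input is lowercased, words are separated by
    whitespace, and single-letter words contribute no tokens.
    """
    tokens = []
    for word in text.lower().split():
        tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


print(bigrams("night owl"))  # ['ni', 'ig', 'gh', 'ht', 'ow', 'wl']
```

Because tokens are taken per word, reordering the words of a phrase leaves the collection of bi-grams unchanged.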

Line 22:

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups.

Sørensen–Dice measures the similarity of two groups by dividing twice the intersection token count by the total token count of both groups:

SDC = 2 × |A ∩ B| / (|A| + |B|)

For items (objects, populations) A and B:

The Sørensen–Dice coefficient is thus a ratio between 0.0 and 1.0 giving the "percent similarity" between the two populations.
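As a rough sketch of the calculation, twice the shared token count divided by the total token count of both groups, the following treats each group as a multiset of bi-grams. The names <code>bigrams</code> and <code>sorensen_dice</code>, the multiset (rather than set) treatment, and the 0.0 result when neither input yields any tokens are assumptions of this example, not requirements of the task.

```python
from collections import Counter


def bigrams(text):
    """Bi-grams of each whitespace-separated word, lowercased (assumed scheme)."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


def sorensen_dice(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) over multisets of bi-grams."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 0.0  # assumption: no tokens on either side means no similarity
    # Counter & Counter keeps the minimum count of each shared token
    shared = sum((count_a & count_b).values())
    return 2 * shared / total


print(sorensen_dice("night", "nacht"))  # 0.25 (one shared bi-gram, 'ht', out of 4 + 4)
```

Identical inputs score 1.0 and inputs with no bi-grams in common score 0.0, matching the stated range.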

~~SDI~~ = 2 × (A ∩ B) / (A ⊎ B)

SDC ''can'' be used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.
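To illustrate the fuzzy title matching mentioned above, here is a small sketch that picks the best-scoring candidate for a sloppy query. The title list, the query, and the helper names (<code>bigrams</code>, <code>sorensen_dice</code>, <code>best_match</code>) are hypothetical and chosen only for this example; the tokenizing scheme is the assumed per-word bi-gram one.

```python
from collections import Counter


def bigrams(text):
    """Bi-grams of each whitespace-separated word, lowercased (assumed scheme)."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


def sorensen_dice(a, b):
    """Return 2 * |A ∩ B| / (|A| + |B|) over multisets of bi-grams."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 0.0  # assumption: no tokens on either side means no similarity
    return 2 * sum((count_a & count_b).values()) / total


def best_match(query, candidates):
    """Return the candidate with the highest coefficient against the query."""
    return max(candidates, key=lambda title: sorensen_dice(query, title))


# Hypothetical candidate list and an incomplete query
titles = ["The Old Man and the Sea", "Of Mice and Men", "A Tale of Two Cities"]
print(best_match("old man an sea", titles))  # The Old Man and the Sea
```

In practice one would usually report the top few candidates with their scores rather than a single winner, since close scores are common with short inputs.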

The Sørensen–Dice coefficient is a "percent similarity" between the two populations ~~between 0.0 and 1.0~~.

~~SDI~~ ''can'' be used for spellchecking, but it's not really good at it, especially for short words. Where it really shines is for fuzzy matching of short phrases like book or movie titles. It may not return exactly what you are looking for, but often gets remarkably close with some pretty poor inputs.

Line 41:

Line 39:

How you get the task names is peripheral to the task. You can [[:Category:Programming_Tasks|web-scrape]] them or [[Sorensen–Dice coefficient/Tasks|download them to a file]], whatever.

If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations it is acceptable to use that with a pointer to where it may be obtained.

If there is a built-in or easily, freely available library implementation for Sørensen–Dice coefficient calculations, it is acceptable to use that with a pointer to where it may be obtained.