Talk:Sorensen–Dice coefficient

From Rosetta Code

Time varying numbers are difficult to compare

The current draft task description asks us to use task and draft task names here and suggests web scraping them from rosettacode category pages.

But the category pages change over time. Over the course of a decade, they change significantly (we should expect a lot of migration from draft tasks to tasks, for example).

This isn't going to be an immediate problem, but perhaps language should be added to the task acknowledging this issue? --Rdm (talk) 06:46, 25 September 2022 (UTC)

There's a page Sorensen–Dice_coefficient/Tasks which contains a task list - I assume that is what we are supposed to use, but it doesn't seem to be mentioned in the Task. --Tigerofdarkness (talk) 09:40, 25 September 2022 (UTC)
That's what comes up when you click the 'download them to a file' link on the main page, so that's what I've used, though I made a local copy first. --PureFox (talk) 10:00, 25 September 2022 (UTC)

SDI = 2 × (A ∩ B) / (A ⊎ B)

Where has this definition of Sorensen–Dice coefficient come from?--Nigel Galloway (talk) 13:04, 18 October 2023 (UTC)
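
For reference, one possible reading of the formula in the heading, assuming A and B are multisets of bigrams and ⊎ denotes the multiset sum (so that its size is simply |A| + |B|). This only restates the heading under that assumption; it says nothing about where the definition originated.

```latex
\[
  \mathrm{SDI}(A, B)
  = \frac{2\,\lvert A \cap B \rvert}{\lvert A \uplus B \rvert}
  = \frac{2\,\lvert A \cap B \rvert}{\lvert A \rvert + \lvert B \rvert}
\]
```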

ChatGPT is wobbly on the issue

ChatGPT: To calculate the Dice-Sørensen coefficient of similarity between the strings "gg" and "gggg", we first need to find the bigrams present in each string.

For "gg", there is only one bigram: "gg".
For "gggg", there are three bigrams: "gg", "gg", "gg".

Now, we calculate the intersection and union of the sets of bigrams:

Intersection: There is one common bigram, which is "gg".
Union: There are three distinct bigrams in total.
Using the formula for the Dice-Sørensen coefficient:

Dice("gg","gggg") = 2 × |{"gg"}| / (|{"gg"}| + |{"gg", "gg", "gg"}|)
                  = 2 / 4
                  = 0.5


So, the Dice-Sørensen coefficient of similarity between the strings "gg" and "gggg" using bigrams is 0.5.

User
But {gg,gg,gg} as a set only has one element, so the formula yields 2 * 1 / (1+1), that is, 1, no?

ChatGPT: You are correct, I apologize for the oversight. Let's correct the calculation.

For the given strings "gg" and "gggg":

Intersection: There is one common bigram, "gg".
Union: There are two distinct bigrams, "gg" and "gg".
Using the formula for the Dice-Sørensen coefficient:

[...] 

Dice("gg","gggg")= 1

So, the Dice-Sørensen coefficient of similarity between the strings "gg" and "gggg" using bigrams is indeed 1. Thank you for catching that mistake.
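
The two answers in the transcript above appear to correspond to two different readings of the formula: counting repeated bigrams (multisets) or collapsing them (sets). A minimal Python sketch of both readings follows. The function names `bigrams`, `dice_set` and `dice_multiset` are made up for illustration and are not taken from the task description.

```python
from collections import Counter


def bigrams(text):
    """Return the overlapping two-character substrings of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dice_set(a, b):
    """Dice coefficient with bigrams treated as sets (duplicates collapse)."""
    x, y = set(bigrams(a)), set(bigrams(b))
    # Assumes each string has at least two characters; otherwise the
    # denominator can be zero.
    return 2 * len(x & y) / (len(x) + len(y))


def dice_multiset(a, b):
    """Dice coefficient with bigrams treated as multisets (duplicates count)."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    # Counter & Counter keeps the minimum count of each shared bigram.
    shared = sum((x & y).values())
    return 2 * shared / (sum(x.values()) + sum(y.values()))


print(dice_set("gg", "gggg"))       # 1.0
print(dice_multiset("gg", "gggg"))  # 0.5
```

Under the set reading, "gg" and "gggg" are indistinguishable (both reduce to {"gg"}), which gives 1. Under the multiset reading, the three repeated bigrams of "gggg" are counted, which gives 0.5.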