Jaro similarity: Difference between revisions
Content added Content deleted
m (Markjreed moved page Jaro distance to Jaro similarity: Described task calculates the similarity (1=identical) rather than the distance (0=identical)) |
(Correct distance to similarity; tried to clarify definition of transpositions as well.) |
||
Line 1: | Line 1: | ||
{{task}} |
{{task}} |
||
The Jaro distance is a measure of edit distance between two strings; its inverse, called the ''Jaro similarity'', is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that '''0''' equates to no similarities and '''1''' is an exact match. |
|||
The Jaro distance is a measure of similarity between two strings. |
|||
The higher the Jaro distance for two strings is, the more similar the strings are. |
|||
The score is normalized such that '''0''' equates to no similarity and '''1''' is an exact match. |
|||
;;Definition |
;;Definition |
||
The Jaro |
The Jaro similarity <math>d_j</math> of two given strings <math>s_1</math> and <math>s_2</math> is |
||
: <math>d_j = \left\{ |
: <math>d_j = \left\{ |
||
Line 24: | Line 20: | ||
Two characters from <math>s_1</math> and <math>s_2</math> respectively, are considered ''matching'' only if they are the same and not farther than <math>\left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1</math>. |
Two characters from <math>s_1</math> and <math>s_2</math> respectively, are considered ''matching'' only if they are the same and not farther apart than <math>\left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1</math> characters. |
||
Each character of <math>s_1</math> is compared with all its matching |
|||
characters in <math>s_2</math>. |
|||
Each character of <math>s_1</math> is compared with all its matching characters in <math>s_2</math>. Each difference in position is half a ''transposition''; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one |
|||
The number of matching (but different sequence order) characters |
|||
divided by 2 defines the number of ''transpositions''. |
|||
Line 50: | Line 42: | ||
;Task |
;Task |
||
Implement the Jaro |
Implement the Jaro algorithm and show the similarity scores for each of the following pairs: |
||
* ("MARTHA", "MARHTA") |
* ("MARTHA", "MARHTA") |