Jaro similarity: Difference between revisions

m (Markjreed moved page Jaro distance to Jaro similarity: Described task calculates the similarity (1=identical) rather than the distance (0=identical))
(Correct distance to similarity; tried to clarify definition of transpositions as well.)
Line 1: Line 1:
{{task}}
{{task}}


The Jaro distance is a measure of edit distance between two strings; its complement, called the ''Jaro similarity'', is a measure of two strings' similarity: the higher the value, the more similar the strings are. The score is normalized such that   '''0'''   equates to no similarities and   '''1'''   is an exact match.
The Jaro distance is a measure of similarity between two strings.

The higher the Jaro distance for two strings is, the more similar the strings are.

The score is normalized such that   '''0'''   equates to no similarity and   '''1'''   is an exact match.




;;Definition
;;Definition


The Jaro distance &nbsp; <math>d_j</math> &nbsp; of two given strings &nbsp; <math>s_1</math> &nbsp; and &nbsp; <math>s_2</math> &nbsp; is
The Jaro similarity &nbsp; <math>d_j</math> &nbsp; of two given strings &nbsp; <math>s_1</math> &nbsp; and &nbsp; <math>s_2</math> &nbsp; is


: <math>d_j = \left\{
: <math>d_j = \left\{
Line 24: Line 20:




Two characters from &nbsp; <math>s_1</math> &nbsp; and &nbsp; <math>s_2</math> &nbsp; respectively, are considered ''matching'' only if they are the same and not farther than &nbsp; <math>\left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1</math>.
Two characters from &nbsp; <math>s_1</math> &nbsp; and &nbsp; <math>s_2</math> &nbsp; respectively, are considered ''matching'' only if they are the same and not farther apart than &nbsp; <math>\left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1</math> characters.

Each character of &nbsp; <math>s_1</math> &nbsp; is compared with all its matching
characters in &nbsp; <math>s_2</math>.


Each character of &nbsp; <math>s_1</math> &nbsp; is compared with all its matching characters in &nbsp; <math>s_2</math>. Each difference in position is half a ''transposition''; that is, the number of transpositions is half the number of characters which are common to the two strings but occupy different positions in each one.
The number of matching (but different sequence order) characters
divided by 2 defines the number of ''transpositions''.
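
The matching and transposition rules above can be sketched in Python. This is a minimal illustration, not a reference solution: the full formula is not visible in this excerpt, so the sketch assumes the commonly cited form, 0 when there are no matches and otherwise one third of m/|s1| + m/|s2| + (m - t)/m, where m is the number of matching characters and t the number of transpositions. The name <code>jaro_similarity</code> and the handling of two empty strings are assumptions made for the example.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of s1 and s2: 0.0 = nothing in common, 1.0 = identical."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # assumption: two empty strings are treated as identical

    # Matching characters may be at most floor(max(|s1|, |s2|) / 2) - 1 apart.
    # Clamp at 0 so that very short strings can still match in place.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 can be claimed by only one character of s1.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Read the matched characters in order from each string; every position
    # where the two sequences disagree counts as half a transposition.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(f"{jaro_similarity('MARTHA', 'MARHTA'):.4f}")  # 0.9444
```

For ("MARTHA", "MARHTA") all six characters match within a window of 2, and the swapped T and H give two out-of-order characters, i.e. one transposition, so the score is (1 + 1 + 5/6) / 3.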




Line 50: Line 42:
;Task
;Task


Implement the Jaro-distance algorithm and show the distances for each of the following pairs:
Implement the Jaro algorithm and show the similarity scores for each of the following pairs:


* ("MARTHA", "MARHTA")
* ("MARTHA", "MARHTA")