Longest common subsequence: Difference between revisions
Content added Content deleted
m (Preferred # over <> to represent incomparable matches.) |
(Moved Background and References sections from the C++ sample to the main Introduction for the page.) |
||
Line 10: | Line 10: | ||
Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2. |
Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2. |
||
If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be |
If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be ''incomparable''. Defining (#) to denote this case, we can write m1 # m2. |
||
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''. |
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''. |
||
Finding an '''LCS''' can then be restated as the problem of finding a chain of maximum cardinality over the set of matches '''M'''. |
Finding an '''LCS''' can then be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''. |
||
The set of matches '''M''' can be visualized as an m*n bit matrix, where each bit '''M[i, j]''' is True iff there is a match at the corresponding positions of strings A and B. |
The set of matches '''M''' can be visualized as an m*n bit matrix, where each bit '''M[i, j]''' is True iff there is a match at the corresponding positions of strings A and B. |
||
Line 20: | Line 20: | ||
Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True. |
Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True. |
||
⚫ | |||
⚫ | |||
⚫ | Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing. |
||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
'''Examples''' |
|||
⚫ | |||
'''<u>1234</u>''' |
'''<u>1234</u>''' |
||
'''<u>12</u>'''245'''<u>3</u>'''332'''<u>4</u>''' |
'''<u>12</u>'''245'''<u>3</u>'''332'''<u>4</u>''' |
||
For a string example, consider the sequences "thisisatest" and "testing123testing". An LCS would be "tsitest": |
For a string example, consider the sequences "thisisatest" and "testing123testing". An LCS would be "tsitest": |
||
'''<u>t</u>'''hi'''<u>si</u>'''sa'''<u>test</u>''' |
'''<u>t</u>'''hi'''<u>si</u>'''sa'''<u>test</u>''' |
||
Line 389: | Line 420: | ||
=={{header|C++}}== |
=={{header|C++}}== |
||
'''The Longest Common Subsequence (LCS) Problem''' |
|||
Defining a subsequence to be a string obtained by deleting zero or more symbols from an input string, the LCS Problem is to find a subsequence of maximum length that is common to two input strings. |
|||
⚫ | |||
⚫ | |||
This occurs, for example, in Bioinformatics applications of nucleotide and protein sequencing. |
|||
⚫ | |||
⚫ | |||
⚫ | |||
Assuming a uniform distribution of symbols, the number of matches may tend only towards linear, O(m+n) growth. |
|||
⚫ | A binary search optimization due to Hunt and Szymanski can be applied in this case, which results in expected performance of O(n log m) |
||
⚫ | |||
⚫ | |||
⚫ | |||
by Daniel S. Hirschberg, published June 1975<br /> |
|||
⚫ | |||
"An Algorithm for Differential File Comparison"<br /> |
|||
⚫ | |||
⚫ | |||
⚫ | |||
by James W. Hunt and Thomas G. Szymanski, published May 1977<br /> |
|||
⚫ | |||
⚫ | |||
by Claus Rick, received 17 March 2000, Information Processing Letters,<br /> |
|||
⚫ | |||
'''Hunt and Szymanski algorithm''' |
'''Hunt and Szymanski algorithm''' |
||
<lang cpp>#include <stdint.h> |
<lang cpp>#include <stdint.h> |