Longest common subsequence: Difference between revisions
Content added Content deleted
m (Corrected definition of the strict Cartesian product-order.) |
(Updated Introduction.) |
||
Line 4: | Line 4: | ||
Define a subsequence to be any string obtained by deleting zero or more symbols from an input string. |
Define a subsequence to be any string obtained by deleting zero or more symbols from an input string. |
||
The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] |
The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings. |
||
Let ''A'' = ''A''[0] |
Let ''A'' = ''A''[0]… ''A''[m - 1] and ''B'' = ''B''[0]… ''B''[n - 1], m ≤ n be strings drawn from an alphabet Σ of size s, containing every distinct symbol in A + B. |
||
An ordered pair (i, j) will be called a match if ''A''[i] == ''B''[j], where 0 |
An ordered pair (i, j) will be called a match if ''A''[i] == ''B''[j], where 0 ≤ i < m and 0 ≤ j < n. |
||
Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) |
Define the strict Cartesian [https://en.wikipedia.org/wiki/Product_order product-order] (<) over matches, such that (i1, j1) < (i2, j2) ↔ i1 < i2 and j1 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2. |
||
We write m1 <> m2 to mean that either m1 < m2 or m1 > m2 holds, ''i.e.'', that m1, m2 are ''comparable''. |
|||
If i1 ≤ i2 and j2 ≤ j1 (or i2 ≤ i1 and j1 ≤ j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are ''incomparable''. |
|||
Defining (#) to denote this case, we write m1 # m2. Because the underlying product order is strict, m1 == m2 (''i.e.'', i1 == i2 and j1 == j2) implies m1 # m2. |
|||
Defining (#) to denote this case, we write m1 # m2. Because the underlying product order is strict, m1 == m2 (''i.e.'', i1 == i2 and j1 == j2) implies m1 # m2. Thus, the (<>) operator is the inverse of (#). |
|||
Because the product order is strict, m1 <> m2 implies m1 |
Because the product order is strict, m1 <> m2 implies m1 ≠ m2, ''i.e.'', that the two tuples differ in some component. |
||
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where m1 <> m2 for every pair of distinct elements m1 and m2 of '''C'''. |
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where m1 <> m2 for every pair of distinct elements m1 and m2 of '''C'''. Similarly, an antichain '''D''' is any subset of '''M''' where m1 # m2 for every pair of distinct elements m1 and m2 of '''D'''. |
||
Finding an |
Finding an LCS can then be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''. |
||
According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique. |
|||
The set of matches '''M''' can be visualized as an m*n bit matrix, where each bit '''M[i, j]''' is True iff the ordered pair (i, j) is in the set. |
|||
Then |
The set of matches '''M''' can be interpreted as an m*n bit matrix such that '''M'''[i, j] == True ↔ (i, j) ∈ M. Then a chain '''C''' can be visualized as a strictly increasing curve through those match bits which are True. |
||
'''Background''' |
'''Background''' |
||
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing. |
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing. |
||
The "divide and conquer" approach |
The "divide and conquer" approach of [Hirschberg 1975] limits the space required to O(''m + n''). However, this approach requires O(''m*n'') time even in the best case. |
||
This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions. |
This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions. |
||
In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols approaches the length of the LCS. In this case |
In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols approaches the length of the LCS. In this case the number of matches reduces to linear, O(''m + n'') growth. |
||
A binary search optimization due to [Hunt and Szymanski 1977] can be applied to the basic Dynamic Programming approach, resulting in an expected performance of O(''n log m''). Performance can degrade to O(''m*n log m'') time in the worst case, as the number of matches grows to O(''m*n''). |
|||
'''Note''' |
|||
⚫ | |||
⚫ | |||
'''Legend''' |
|||
A, B are input strings of lengths m, n respectively |
|||
p is the length of the LCS |
|||
M is the set of match pairs (i, j) such that x[i] == y[j] |
|||
r is the magnitude of M |
|||
s is the magnitude of the alphabet of distinct symbols in x + y |
|||
'''References''' |
'''References''' |
||
[Dilworth 1950] "A decomposition theorem for partially ordered sets" |
|||
# "A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975 Communications of the ACM [Volume 18, Number 6, pp. 341–343] |
|||
by Robert P. Dilworth, published January 1950, |
|||
# "An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976 Computing Science Technical Report, Bell Laboratories 41 |
|||
Annals of Mathematics [Volume 51, Number 1, ''pp.'' 161-166] |
|||
# "A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977 Communications of the ACM [Volume 20, Number 5, pp. 350-353] |
|||
# "Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters, Elsevier Science [Volume 75, pp. 275–281] |
|||
[Goeman and Clausen 2002] "A New Practical Linear Space Algorithm for the Longest Common |
|||
Subsequence Problem" by Heiko Goeman and Michael Clausen, |
|||
published 2002, Kybernetika [Volume 38, Issue 1, ''pp.'' 45-66] |
|||
[Hirschberg 1975] "A linear space algorithm for computing maximal common subsequences" |
|||
by Daniel S. Hirschberg, published June 1975 |
|||
Communications of the ACM [Volume 18, Number 6, ''pp.'' 341-343] |
|||
[Hunt and McIlroy 1976] "An Algorithm for Differential File Comparison" |
|||
by James W. Hunt and M. Douglas McIlroy, June 1976 |
|||
Computing Science Technical Report, Bell Laboratories 41 |
|||
[Hunt and Szymanski 1977] "A Fast Algorithm for Computing Longest Common Subsequences" |
|||
by James W. Hunt and Thomas G. Szymanski, published May 1977 |
|||
Communications of the ACM [Volume 20, Number 5, ''pp.'' 350-353] |
|||
[Rick 2000] "Simple and fast linear space computation of longest common subsequences" |
|||
by Claus Rick, received 17 March 2000, Information Processing Letters, |
|||
Elsevier Science [Volume 75, ''pp.'' 275–281] |
|||
<br /> |
|||
'''Examples''' |
|||
The sequences "1234" and "1224533324" have an LCS of "1234": |
|||
'''<u>1234</u>''' |
|||
'''<u>12</u>'''245'''<u>3</u>'''332'''<u>4</u>''' |
|||
For a string example, consider the sequences "thisisatest" and "testing123testing". An LCS would be "tsitest": |
|||
'''<u>t</u>'''hi'''<u>si</u>'''sa'''<u>test</u>''' |
|||
'''<u>t</u>'''e'''<u>s</u>'''t'''<u>i</u>'''ng123'''<u>test</u>'''ing |
|||
In this puzzle, your code only needs to deal with strings. Write a function which returns an LCS of two strings (case-sensitive). You don't need to show multiple LCS's. |
|||
For more information on this problem please see [[wp:Longest_common_subsequence_problem|Wikipedia]]. |
|||
<br /> |
<br /> |
||
'''Examples''' |
'''Examples''' |