Longest common subsequence: Difference between revisions
Content added Content deleted
m (Clarified naming of the FindLCS() method, traceLCS bool, and chains deque.) |
(Described Forward and Reverse Contours.) |
||
Line 2: | Line 2: | ||
'''Introduction''' |
'''Introduction''' |
||
Define a subsequence to be any string obtained by deleting zero or more symbols from an input string. |
Define a ''subsequence'' to be any output string obtained by deleting zero or more symbols from an input string. |
||
The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings. |
The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings. |
||
Line 10: | Line 10: | ||
An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 < i ≤ m and 0 < j ≤ n. |
An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 < i ≤ m and 0 < j ≤ n. |
||
Define |
Define a ''non-strict'' [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. We define (≥) similarly. |
||
We say |
We say m1, m2 are ''comparable'' if either m1 ≤ m2 or m1 ≥ m2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1 ≤ m2 nor m1 ≥ m2 are possible; and we say m1, m2 are ''incomparable''. |
||
We also define the ''strict'' product-order (<) over ordered pairs, such that (i1, j1) < (i2, j2) ⇔ i1 < i2 and j1 < j2. We define (>) similarly. |
|||
⚫ | |||
⚫ | Given a set of matches '''M''', a chain '''C''' is a subset of '''M''' consisting of at least one element m; and where either m1 < m2 or m1 > m2 for every pair of distinct elements m1 and m2. An antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable. |
||
⚫ | |||
⚫ | |||
Finding an LCS can be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''. |
Finding an LCS can be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''. |
||
According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique. |
According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique. |
||
'''Contours''' |
|||
Forward Contours FC[''k''] of ''class k'' are defined inductively, as follows: |
|||
FC[0] consists of those elements m1 for which there exists no element m2 such that m2 < m1. |
|||
FC[''k''] consists of those elements m1 for which there exists no element m2 such that m2 < m1; and where neither m1 nor m2 are contained in FC[''l''] for any ''class l'' < ''k''. |
|||
Reverse Contours RC[''k''] of ''class k'' are defined similarly. |
|||
Members of the Meet (∧), or ''Infimum'' of a Forward Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 < m1. |
|||
Members of the Join (∨), or ''Supremum'' of a Reverse Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 > m1. |
|||
Where multiple Dominant Matches exist within a Meet (or within a Join, respectively) the Dominant Matches will be incomparable to each other. |
|||
'''Background''' |
'''Background''' |
||
Line 26: | Line 44: | ||
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing. |
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing. |
||
The |
The divide-and-conquer approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case. |
||
This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions. |
This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions. |