Longest common subsequence: Difference between revisions

Content added Content deleted
m (Clarified naming of the FindLCS() method, traceLCS bool, and chains deque.)
(Described Forward and Reverse Contours.)
Line 2: Line 2:
'''Introduction'''
'''Introduction'''


Define a subsequence to be any string obtained by deleting zero or more symbols from an input string.
Define a ''subsequence'' to be any output string obtained by deleting zero or more symbols from an input string.


The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings.
The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings.
Line 10: Line 10:
An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 < i ≤ m and 0 < j ≤ n.
An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 < i ≤ m and 0 < j ≤ n.


Define the [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. Defining (≥) similarly, we can write m2 ≤ m1 as m1 ≥ m2.
Define a ''non-strict'' [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. We define (≥) similarly.


We say that m1, m2 are ''comparable'' if either m1 ≤ m2 or m1 ≥ m2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1 ≤ m2 nor m1 ≥ m2 are possible; and we say that m1, m2 are ''incomparable''.
We say m1, m2 are ''comparable'' if either m1 ≤ m2 or m1 ≥ m2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1 ≤ m2 nor m1 ≥ m2 are possible; and we say m1, m2 are ''incomparable''.


We also define the ''strict'' product-order (<) over ordered pairs, such that (i1, j1) < (i2, j2) ⇔ i1 < i2 and j1 < j2. We define (>) similarly.
Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are comparable. Similarly, an antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable.


Given a set of matches '''M''', a chain '''C''' is a subset of '''M''' consisting of at least one element m; and where either m1 < m2 or m1 > m2 for every pair of distinct elements m1 and m2. An antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable.
The set '''M''' represents a relation over match pairs: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''. Any chain '''C''' can be visualized as a curve which strictly increases as it passes through each match pair in the m*n coordinate space.

The set '''M''' represents a relation over match pairs: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''. A chain '''C''' can be visualized as a curve which strictly increases as it passes through each match pair in the m*n coordinate space.


Finding an LCS can be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''.
Finding an LCS can be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''.


According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique.
According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique.

'''Contours'''

Forward Contours FC[''k''] of ''class k'' are defined inductively, as follows:

FC[0] consists of those elements m1 for which there exists no element m2 such that m2 < m1.

FC[''k''] consists of those elements m1 for which there exists no element m2 such that m2 < m1; and where neither m1 nor m2 are contained in FC[''l''] for any ''class l'' < ''k''.

Reverse Contours RC[''k''] of ''class k'' are defined similarly.

Members of the Meet (∧), or ''Infimum'' of a Forward Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 < m1.

Members of the Join (∨), or ''Supremum'' of a Reverse Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 > m1.

Where multiple Dominant Matches exist within a Meet (or within a Join, respectively) the Dominant Matches will be incomparable to each other.


'''Background'''
'''Background'''
Line 26: Line 44:
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.


The "divide and conquer" approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case.
The divide-and-conquer approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case.


This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.
This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.