Longest common subsequence: Difference between revisions

Corrected definitions made in the introduction.
(Removed discussion of Forward and Reverse Contours pending correction.)
(Corrected definitions made in the introduction.)
Line 8:
Let ''A'' ≡ ''A''[0]… ''A''[m - 1] and ''B'' ≡ ''B''[0]… ''B''[n - 1], m < n be strings drawn from an alphabet Σ of size s, containing every distinct symbol in A + B.
 
An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 &ltle; i &lelt; m and 0 &ltle; j &lelt; n.
 
The set of matches '''M''' defines a relation over matches: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''.
 
Define a ''non-strict'' [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. We define (≥) similarly.
 
We say m1,ordered pairs p1 and m2p2 are ''comparable'' if either m1p1 ≤ m2p2 or m1p1 ≥ m2p2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1p1 ≤ m2p2 nor m1p1 ≥ m2p2 are possible;, and we say m1,p1 and m2p2 are ''incomparable''.
 
We also defineDefine the ''strict'' product-order (<) over ordered pairs, such that (i1, j1) < (i2, j2) ⇔ i1 < i2 and j1 < j2. We define (>) similarly.
 
Given a set of matches '''M''', aA chain '''C''' is a subset of '''M''' consisting of at least one element m; and where either m1 < m2 or m1 > m2 for every pair of distinct elements m1 and m2. An antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable.
 
Every Common Sequence of length ''q'' corresponds to a chain of cardinality ''q'', over the set of matches '''M'''. Thus, finding an LCS can be restated as the problem of finding a chain of maximum cardinality ''p''.
The set '''M''' represents a relation over match pairs: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''. A chain '''C''' can be visualized as a curve which strictly increases as it passes through each match pair in the m*n coordinate space.
 
FindingA an LCSchain can be restatedvisualized as thea problemstrictly ofincreasing findingcurve athat chainpasses ofthrough maximummatches cardinality(i, pj) overin the setm*n coordinate space of matches '''M'''[i, j].
 
According to [Dilworth 1950], this cardinality ''p'' equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique.
 
'''Background'''
 
Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') quadratic growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.
 
The divide-and-conquer approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case.
Line 44 ⟶ 46:
A, B are input strings of lengths m, n respectively
p is the length of the LCS
M is the set of match pairsmatches (i, j) such that A[i] = B[j]
r is the magnitude of M
s is the magnitude of the alphabet Σ of distinct symbols in A + B
159

edits