Longest common subsequence: Difference between revisions

Line 2:

'''Introduction'''

Define a subsequence to be any string obtained by deleting zero or more symbols from an input string.

Define a ''subsequence'' to be any output string obtained by deleting zero or more symbols from an input string.

The [http://en.wikipedia.org/wiki/Longest_common_subsequence_problem '''Longest Common Subsequence'''] ('''LCS''') is a subsequence of maximum length common to two or more strings.

Line 10:

An ordered pair (i, j) will be referred to as a match if ''A''[i] = ''B''[j], where 0 < i ≤ m and 0 < j ≤ n.

Define ~~the~~ [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. ~~Defining~~ (≥) similarly~~, we can write m2 ≤ m1 as m1 ≥ m2~~.

Define a ''non-strict'' [https://en.wikipedia.org/wiki/Product_order product-order] (≤) over ordered pairs, such that (i1, j1) ≤ (i2, j2) ⇔ i1 ≤ i2 and j1 ≤ j2. We define (≥) similarly.

We say ~~that~~ m1, m2 are ''comparable'' if either m1 ≤ m2 or m1 ≥ m2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1 ≤ m2 nor m1 ≥ m2 are possible; and we say ~~that~~ m1, m2 are ''incomparable''.

We say m1, m2 are ''comparable'' if either m1 ≤ m2 or m1 ≥ m2 holds. If i1 < i2 and j2 < j1 (or i2 < i1 and j1 < j2) then neither m1 ≤ m2 nor m1 ≥ m2 are possible; and we say m1, m2 are ''incomparable''.

We also define the ''strict'' product-order (<) over ordered pairs, such that (i1, j1) < (i2, j2) ⇔ i1 < i2 and j1 < j2. We define (>) similarly.

⚫

Given a ~~product-order over the~~ set of matches '''M''', a chain '''C''' is ~~any~~ subset of '''M''' in ~~which~~ every pair of distinct elements m1 and m2 ~~are comparable~~. ~~Similarly, an~~ antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable.

⚫

Given a set of matches '''M''', a chain '''C''' is a subset of '''M''' consisting of at least one element m; and where either m1 < m2 or m1 > m2 for every pair of distinct elements m1 and m2. An antichain '''D''' is any subset of '''M''' in which every pair of distinct elements m1 and m2 are incomparable.

⚫

The set '''M''' represents a relation over match pairs: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''. ~~Any~~ chain '''C''' can be visualized as a curve which strictly increases as it passes through each match pair in the m*n coordinate space.

⚫

The set '''M''' represents a relation over match pairs: '''M'''[i, j] ⇔ (i, j) ∈ '''M'''. A chain '''C''' can be visualized as a curve which strictly increases as it passes through each match pair in the m*n coordinate space.

Finding an LCS can be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''.

According to [Dilworth 1950], this cardinality p equals the minimum number of disjoint antichains into which '''M''' can be decomposed. Note that such a decomposition into the minimal number p of disjoint antichains may not be unique.

'''Contours'''

Forward Contours FC[''k''] of ''class k'' are defined inductively, as follows:

FC[0] consists of those elements m1 for which there exists no element m2 such that m2 < m1.

FC[''k''] consists of those elements m1 for which there exists no element m2 such that m2 < m1; and where neither m1 nor m2 are contained in FC[''l''] for any ''class l'' < ''k''.

Reverse Contours RC[''k''] of ''class k'' are defined similarly.

Members of the Meet (&and;), or ''Infimum'' of a Forward Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 < m1.

Members of the Join (&or;), or ''Supremum'' of a Reverse Contour are referred to as its Dominant Matches: those m1 for which there exists no m2 such that m2 > m1.

Where multiple Dominant Matches exist within a Meet (or within a Join, respectively) the Dominant Matches will be incomparable to each other.

'''Background'''

Line 26:

Line 44:

Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(''m*n'') growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.

The "divide and conquer" approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case.

The divide-and-conquer approach of [Hirschberg 1975] limits the space required to O(''n''). However, this approach requires O(''m*n'') time even in the best case.

This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.