Longest common subsequence: Difference between revisions

Line 10:

Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2.

If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be "incomparable". Defining (#) to denote this case, we can write m1 # m2.

If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be ''incomparable''. Defining (#) to denote this case, we can write m1 # m2.

Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''.

Finding an '''LCS''' can then be restated as the problem of finding a chain of maximum cardinality over the set of matches '''M'''.

Finding an '''LCS''' can then be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''.

The set of matches '''M''' can be visualized as an m*n bit matrix, where each bit '''M[i, j]''' is True iff there is a match at the corresponding positions of strings A and B.

Line 20:

Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True.

⚫

'''Background'''

⚫

~~For example, the~~ sequences "1234" and "1224533324" have an LCS of "1234":

⚫

Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols increases; and the number of matches will tend towards quadratic, O(m*n) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.

⚫

The "divide and conquer" approach by Hirschberg limits the space required to O(m+n). However, this approach requires O(m*n) time even in the best case.

⚫

This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.

⚫

In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols may be reduced to the length of the LCS and the number of matches may tend towards linear, O(m+n) growth.

⚫

A binary search optimization due to Hunt and Szymanski can be applied in this case, which results in expected performance of O(n log m). In the worst case, performance may degrade to O(m*n log m) time if the number of matches, and the space required to represent them, should grow to O(m*n).

⚫

Claus Rick has also described a linear-space algorithm with a time bound of O(n*s + min(p*m, p*(n-p))).

⚫

'''References'''

⚫

"A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975

⚫

Communications of the ACM [Volume 18, Number 6, pp. 341–343]

⚫

"An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976

⚫

Computing Science Technical Report, Bell Laboratories 41

⚫

"A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977

⚫

Communications of the ACM [Volume 20, Number 5, pp. 350-353]

⚫

"Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters,

⚫

Elsevier Science [Volume 75, pp. 275–281]

'''Examples'''

⚫

The sequences "1234" and "1224533324" have an LCS of "1234":

'''1234'''

'''12'''245'''3'''332'''4'''

For a string example, consider the sequences "thisisatest" and "testing123testing". An LCS would be "tsitest":

'''t'''hi'''si'''sa'''test'''

Line 389:

Line 420:

=={{header|C++}}==

'''The Longest Common Subsequence (LCS) Problem'''

Defining a subsequence to be a string obtained by deleting zero or more symbols from an input string, the LCS Problem is to find a subsequence of maximum length that is common to two input strings.

⚫

'''Background'''

⚫

Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols ~~necessarily~~ increases; and the number of matches will tend towards quadratic, O(m*n) growth.

This occurs, for example, in Bioinformatics applications of nucleotide and protein sequencing.

⚫

~~Here the~~ "divide and conquer" approach of Hirschberg limits the space required to O(m+n). However, this approach requires O(m*n) time even in the best case.

⚫

This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.

⚫

In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols ~~used~~ ~~for~~ ~~matches~~ ~~may approach~~ the length of the LCS.

Assuming a uniform distribution of symbols, the number of matches may tend only towards linear, O(m+n) growth.

⚫

A binary search optimization due to Hunt and Szymanski can be applied in this case, which results in expected performance of O(n log m)~~, given m <= n~~. In the worst case, performance ~~degrades~~ to O(m*n log m) time if the number of matches, and the space required to represent them, should grow to O(m*n).

⚫

Claus Rick has also described a linear-space algorithm with a time bound of O(n*s + min(p*m, p*(n-p)))~~, where the alphabet is of size s and the LCS is of length p~~.

⚫

'''References'''

⚫

"A linear space algorithm for computing maximal common subsequences"

by Daniel S. Hirschberg, published June 1975

⚫

Communications of the ACM [Volume 18, Number 6, pp. 341–343]

"An Algorithm for Differential File Comparison"

⚫

by James W. Hunt and M. Douglas McIlroy, June 1976

⚫

Computing Science Technical Report, Bell Laboratories 41