Longest common subsequence: Difference between revisions

Longest common subsequence (view source)

Revision as of 19:06, 26 September 2021

411 bytes removed , 2 years ago

Moved Background and References sections from the C++ sample to the main Introduction for the page.

CNHume

159

edits

Revision as of 07:49, 26 September 2021 (view source) CNHume (talk \| contribs) m (Preferred # over <> to represent incomparable matches.) ← Older edit		Revision as of 19:06, 26 September 2021 (view source) CNHume (talk \| contribs) (Moved Background and References sections from the C++ sample to the main Introduction for the page.) Newer edit →
Line 10: Define the strict Cartesian product-order (<) over matches, such that (i1, j1) < (i2, j2) iff i1 < j1 and i2 < j2. Defining (>) similarly, we can write m2 < m1 as m1 > m2. If i1 <= j1 and i2 >= j2 (or if i1 >= i2 and i2 <= j2) then neither m1 < m2 nor m1 > m2 are possible; and m1, m2 are said to be "''incomparable"''. Defining (#) to denote this case, we can write m1 # m2. Given a product-order over the set of matches '''M''', a chain '''C''' is any subset of '''M''' where either m1 < m2 or m2 < m1 for every pair of distinct elements m1 and m2 of '''C'''. Finding an '''LCS''' can then be restated as the problem of finding a chain of maximum cardinality p over the set of matches '''M'''. The set of matches '''M''' can be visualized as an mn bit matrix, where each bit '''M[i, j]''' is True iff there is a match at the corresponding positions of strings A and B. Line 20: Then any chain '''C''' can be visualized as a monotonically increasing curve through those match bits which are set to True. '''Background'''▼ For example, the sequences "1234" and "1224533324" have an LCS of "1234":▼ Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols ~~necessarily~~ increases; and the number of matches will tend towards quadratic, O(mn) growth. This occurs, for example, in the Bioinformatics application of nucleotide and protein sequencing.▼ ~~Here the~~The "divide and conquer" approach ofby Hirschberg limits the space required to O(m+n). However, this approach requires O(mn) time even in the best case.▼ This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions.▼ In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols ~~used~~may ~~for~~be ~~matches~~reduced ~~may approach~~to the length of the LCS and the number of matches may tend towards linear, O(m+n) growth.▼ A binary search optimization due to Hunt and Szymanski can be applied in this case, which results in expected performance of O(n log m)~~, given m <= n~~. In the worst case, performance ~~degrades~~may degrade to O(mn log m) time if the number of matches, and the space required to represent them, should grow to O(mn).▼ Claus Rick has also described a linear-space algorithm with a time bound of O(ns + min(pm, p(n-p)))~~, where the alphabet is of size s and the LCS is of length p~~.▼ '''References'''▼ "A linear space algorithm for computing maximal common subsequences" by Daniel S. Hirschberg, published June 1975<br />▼ Communications of the ACM [Volume 18, Number 6, pp. 341–343]▼ "An Algorithm for Differential File Comparison" by James W. Hunt and M. Douglas McIlroy, June 1976<br />▼ Computing Science Technical Report, Bell Laboratories 41▼ "A Fast Algorithm for Computing Longest Common Subsequences" by James W. Hunt and Thomas G. Szymanski, published May 1977<br />▼ Communications of the ACM [Volume 20, Number 5, pp. 350-353]▼ "Simple and fast linear space computation of longest common subsequences" by Claus Rick, received 17 March 2000, Information Processing Letters,<br />▼ Elsevier Science [Volume 75, pp. 275–281]▼ '''Examples''' ▲~~For example, the~~The sequences "1234" and "1224533324" have an LCS of "1234": '''<u>1234</u>''' '''<u>12</u>'''245'''<u>3</u>'''332'''<u>4</u>''' For a string example, consider the sequences "thisisatest" and "testing123testing". An LCS would be "tsitest": '''<u>t</u>'''hi'''<u>si</u>'''sa'''<u>test</u>''' Line 389 ⟶ 420: =={{header\|C++}}== ~~'''The Longest Common Subsequence (LCS) Problem'''~~ ~~Defining a subsequence to be a string obtained by deleting zero or more symbols from an input string, the LCS Problem is to find a subsequence of maximum length that is common to two input strings.~~ ▲'''Background''' ▲Where the number of symbols appearing in matches is small relative to the length of the input strings, reuse of the symbols necessarily increases; and the number of matches will tend towards quadratic, O(mn) growth. ~~This occurs, for example, in Bioinformatics applications of nucleotide and protein sequencing.~~ ▲Here the "divide and conquer" approach of Hirschberg limits the space required to O(m+n). However, this approach requires O(mn) time even in the best case. ▲This quadratic time dependency may become prohibitive, given very long input strings. Thus, heuristics are often favored over optimal Dynamic Programming solutions. ▲In the application of comparing file revisions, records from the input files form a large symbol space; and the number of symbols used for matches may approach the length of the LCS. ~~Assuming a uniform distribution of symbols, the number of matches may tend only towards linear, O(m+n) growth.~~ ▲A binary search optimization due to Hunt and Szymanski can be applied in this case, which results in expected performance of O(n log m), given m <= n. In the worst case, performance degrades to O(mn log m) time if the number of matches, and the space required to represent them, should grow to O(mn). ▲Claus Rick has also described a linear-space algorithm with a time bound of O(ns + min(pm, p*(n-p))), where the alphabet is of size s and the LCS is of length p. ▲'''References''' ▲"A linear space algorithm for computing maximal common subsequences"<br /> ~~by Daniel S. Hirschberg, published June 1975<br />~~ ▲Communications of the ACM [Volume 18, Number 6, pp. 341–343] ~~"An Algorithm for Differential File Comparison"<br />~~ ▲by James W. Hunt and M. Douglas McIlroy, June 1976<br /> ▲Computing Science Technical Report, Bell Laboratories 41 ▲"A Fast Algorithm for Computing Longest Common Subsequences"<br /> ~~by James W. Hunt and Thomas G. Szymanski, published May 1977<br />~~ ▲Communications of the ACM [Volume 20, Number 5, pp. 350-353] ▲"Simple and fast linear space computation of longest common subsequences"<br /> ~~by Claus Rick, received 17 March 2000, Information Processing Letters,<br />~~ ▲Elsevier Science [Volume 75, pp. 275–281] '''Hunt and Szymanski algorithm''' <lang cpp>#include <stdint.h>