Revision as of 04:48, 27 October 2022 (view source) Avk (talk \| contribs) m (→‎{{header\|Pascal}}: cosmetic) ← Older edit		Revision as of 05:43, 23 December 2022 (view source) Alextretyak (talk \| contribs) (Added 11l) Newer edit →
Line 10: To properly test this algorithm, it would be good to search for a string which contains repeated subsequences, such as <code>alfalfa</code> and the text being searched should not only include a match but that match should be preceded by words which partially match, such as <code>alfredo</code>, <code>behalf</code>, <code>calfskin</code>, <code>halfhearted</code>, <code>malfunction</code> or <code>severalfold</code>. =={{header\|11l}}== {{trans\|Python}} <syntaxhighlight lang="11l">V ALPHABET_SIZE = 256 F alphabet_index(Char c) -> Int ‘Return the index of the given ASCII character. ’ V val = c.code assert(Int(val) C 0 .< :ALPHABET_SIZE) R val F match_length(String _s_, Int =idx1, Int =idx2) -> Int ‘Return the length of the match of the substrings of S beginning at idx1 and idx2.’ I idx1 == idx2 R _s_.len - idx1 V match_count = 0 L idx1 < _s_.len & idx2 < _s_.len & _s_[idx1] == _s_[idx2] match_count++ idx1++ idx2++ R match_count F fundamental_preprocess(String _s_) -> [Int] ‘Return Z, the Fundamental Preprocessing of S. Z[i] is the length of the substring beginning at i which is also a prefix of S. This pre-processing is done in O(n) time, where n is the length of S. ’ I _s_.empty R [] I _s_.len == 1 R [1] V z = _s_.map(x -> 0) z[0] = _s_.len z[1] = match_length(_s_, 0, 1) L(i) 2 .< 1 + z[1] z[i] = z[1] - i + 1 V l = 0 V r = 0 L(i) 2 + z[1] .< _s_.len I i <= r V k = i - l V b = z[k] V a = r - i + 1 I b < a z[i] = b E z[i] = a + match_length(_s_, a, r + 1) l = i r = i + z[i] - 1 E z[i] = match_length(_s_, 0, i) I z[i] > 0 l = i r = i + z[i] - 1 R z F bad_character_table(String _s_) -> [[Int]] ‘ Generates R for S, which is an array indexed by the position of some character c in the ASCII table. At that index in R is an array of length \|S\|+1, specifying for each index i in S (plus the index after S) the next location of character c encountered when traversing S from right to left starting at i. This is used for a constant-time lookup for the bad character rule in the Boyer-Moore string search algorithm, although it has a much larger size than non-constant-time solutions. ’ I _s_.empty R [[Int]()] * :ALPHABET_SIZE V _r_ = (0 .< :ALPHABET_SIZE).map(a -> [-1]) V alpha = (0 .< :ALPHABET_SIZE).map(a -> -1) L(c) _s_ V i = L.index alpha[alphabet_index(c)] = i L(a) alpha V j = L.index _r_[j].append(a) R _r_ F good_suffix_table(String _s_) -> [Int] ‘ Generates L for S, an array used in the implementation of the strong good suffix rule. L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]] matches the substring of T matched by a suffix of P in the previous match attempt. Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used. Since only proper suffixes matter, L[0] = -1. ’ V _l_ = _s_.map(c -> -1) V _n_ = fundamental_preprocess(reversed(_s_)) _n_.reverse() L(j) 0 .< _s_.len - 1 V i = _s_.len - _n_[j] I i != _s_.len _l_[i] = j R _l_ F full_shift_table(String _s_) -> [Int] ‘ Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the text T is len(P) - F[i] for a mismatch occurring at i-1. ’ V _f_ = _s_.map(c -> 0) V Z = fundamental_preprocess(_s_) V longest = 0 L(zv) reversed(Z) V i = L.index longest = I zv == i + 1 {max(zv, longest)} E longest _f_[(len)-i - 1] = longest R _f_ F string_search(P, _t_) -> [Int] ‘ Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal amount to shift the string and skip comparisons. In practice it runs in O(m) (and even sublinear) time, where m is the length of T. This implementation performs a case-sensitive search on ASCII characters. P must be ASCII as well. ’ I P.empty \| _t_.empty \| _t_.len < P.len R [] [Int] matches V _r_ = bad_character_table(P) V _l_ = good_suffix_table(P) V _f_ = full_shift_table(P) V k = P.len - 1 V previous_k = -1 L k < _t_.len V i = P.len - 1 V h = k L i >= 0 & h > previous_k & P[i] == _t_[h] i-- h-- I i == -1 \| h == previous_k matches.append(k - P.len + 1) k += I P.len > 1 {P.len - _f_[1]} E 1 E V char_shift = i - _r_[alphabet_index(_t_[h])][i] Int suffix_shift I i + 1 == P.len suffix_shift = 1 E I _l_[i + 1] == -1 suffix_shift = P.len - _f_[i + 1] E suffix_shift = P.len - 1 - _l_[i + 1] V shift = max(char_shift, suffix_shift) previous_k = I shift >= i + 1 {k} E previous_k k += shift R matches V TEXT1 = ‘InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented’ V TEXT2 = ‘Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.’ V (PAT1, PAT2, PAT3) = (‘put’, ‘and’, ‘alfalfa’) print(‘Found ’PAT1‘ at: ’string_search(PAT1, TEXT1)) print(‘Found ’PAT2‘ at: ’string_search(PAT2, TEXT1)) print(‘Found ’PAT3‘ at: ’string_search(PAT3, TEXT2))</syntaxhighlight> {{out}} <pre> Found put at: [26, 90] Found and at: [101, 128, 171] Found alfalfa at: [33, 87] </pre> =={{header\|Emacs Lisp}}==

Boyer-Moore string search: Difference between revisions

Boyer-Moore string search (view source)

Revision as of 05:43, 23 December 2022