Boyer-Moore string search: Difference between revisions

← Older edit

Boyer-Moore string search (view source)

Revision as of 18:56, 9 November 2023

33,636 bytes added , 6 months ago

Added Yabasic

Jjuanhdez

2,130

edits

Revision as of 17:23, 18 October 2022 (view source) Avk (talk \| contribs) (added Pascal example) ← Older edit		Latest revision as of 18:56, 9 November 2023 (view source) Jjuanhdez (talk \| contribs) (Added Yabasic)
(16 intermediate revisions by 7 users not shown)
Line 10: To properly test this algorithm, it would be good to search for a string which contains repeated subsequences, such as <code>alfalfa</code> and the text being searched should not only include a match but that match should be preceded by words which partially match, such as <code>alfredo</code>, <code>behalf</code>, <code>calfskin</code>, <code>halfhearted</code>, <code>malfunction</code> or <code>severalfold</code>. =={{header\|11l}}== {{trans\|Python}} <syntaxhighlight lang="11l">V ALPHABET_SIZE = 256 F alphabet_index(Char c) -> Int ‘Return the index of the given ASCII character.’ V val = c.code assert(Int(val) C 0 .< :ALPHABET_SIZE) R val F match_length(String s, Int =idx1, Int =idx2) -> Int ‘Return the length of the match of the substrings of S beginning at idx1 and idx2.’ I idx1 == idx2 R s.len - idx1 V match_count = 0 L idx1 < s.len & idx2 < s.len & s[idx1] == s[idx2] match_count++ idx1++ idx2++ R match_count F fundamental_preprocess(String s) -> [Int] ‘Return Z, the Fundamental Preprocessing of S. Z[i] is the length of the substring beginning at i which is also a prefix of S. This pre-processing is done in O(n) time, where n is the length of S.’ I s.empty R [] I s.len == 1 R [1] V z = [0] * s.len z[0] = s.len z[1] = match_length(s, 0, 1) L(i) 2 .< 1 + z[1] z[i] = z[1] - i + 1 V l = 0 V r = 0 L(i) 2 + z[1] .< s.len I i <= r V k = i - l V b = z[k] V a = r - i + 1 I b < a z[i] = b E z[i] = a + match_length(s, a, r + 1) l = i r = i + z[i] - 1 E z[i] = match_length(s, 0, i) I z[i] > 0 l = i r = i + z[i] - 1 R z F bad_character_table(String s) -> [[Int]] ‘ Generates R for S, which is an array indexed by the position of some character c in the ASCII table. At that index in R is an array of length \|S\|+1, specifying for each index i in S (plus the index after S) the next location of character c encountered when traversing S from right to left starting at i. This is used for a constant-time lookup for the bad character rule in the Boyer-Moore string search algorithm, although it has a much larger size than non-constant-time solutions. ’ I s.empty R [[Int]()] * :ALPHABET_SIZE V r = [[-1]] * :ALPHABET_SIZE V alpha = [-1] * :ALPHABET_SIZE L(c) s alpha[alphabet_index(c)] = L.index L(a) alpha r[L.index].append(a) R r F good_suffix_table(String s) -> [Int] ‘ Generates L for S, an array used in the implementation of the strong good suffix rule. L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]] matches the substring of T matched by a suffix of P in the previous match attempt. Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used. Since only proper suffixes matter, L[0] = -1. ’ V l = [-1] * s.len V _n_ = fundamental_preprocess(reversed(s)) _n_.reverse() L(j) 0 .< s.len - 1 V i = s.len - _n_[j] I i != s.len l[i] = j R l F full_shift_table(String s) -> [Int] ‘ Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the text T is len(P) - F[i] for a mismatch occurring at i-1. ’ V f = [0] * s.len V Z = fundamental_preprocess(s) V longest = 0 L(zv) reversed(Z) V i = L.index longest = I zv == i + 1 {max(zv, longest)} E longest f[(len)-i - 1] = longest R f F string_search(P, _t_) -> [Int] ‘ Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal amount to shift the string and skip comparisons. In practice it runs in O(m) (and even sublinear) time, where m is the length of T. This implementation performs a case-sensitive search on ASCII characters. P must be ASCII as well. ’ I P.empty \| _t_.empty \| _t_.len < P.len R [] [Int] matches V r = bad_character_table(P) V l = good_suffix_table(P) V f = full_shift_table(P) V k = P.len - 1 V previous_k = -1 L k < _t_.len V i = P.len - 1 V h = k L i >= 0 & h > previous_k & P[i] == _t_[h] i-- h-- I i == -1 \| h == previous_k matches.append(k - P.len + 1) k += I P.len > 1 {P.len - f[1]} E 1 E V char_shift = i - r[alphabet_index(_t_[h])][i] Int suffix_shift I i + 1 == P.len suffix_shift = 1 E I l[i + 1] == -1 suffix_shift = P.len - f[i + 1] E suffix_shift = P.len - 1 - l[i + 1] V shift = max(char_shift, suffix_shift) previous_k = I shift >= i + 1 {k} E previous_k k += shift R matches V TEXT1 = ‘InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented’ V TEXT2 = ‘Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.’ V (PAT1, PAT2, PAT3) = (‘put’, ‘and’, ‘alfalfa’) print(‘Found ’PAT1‘ at: ’string_search(PAT1, TEXT1)) print(‘Found ’PAT2‘ at: ’string_search(PAT2, TEXT1)) print(‘Found ’PAT3‘ at: ’string_search(PAT3, TEXT2))</syntaxhighlight> {{out}} <pre> Found put at: [26, 90] Found and at: [101, 128, 171] Found alfalfa at: [33, 87] </pre> =={{header\|C++}}== <syntaxhighlight lang="c++"> #include <algorithm> #include <cstdint> #include <experimental/iterator> #include <functional> #include <iomanip> #include <iostream> #include <string> #include <vector> void display(const std::vector<int32_t>& numbers) { std::cout << "["; std::copy(std::begin(numbers), std::end(numbers), std::experimental::make_ostream_joiner(std::cout, ", ")); std::cout << "]" << std::endl; } int32_t string_search_single(const std::string& haystack, const std::string& needle) { auto it = std::search(haystack.begin(), haystack.end(), std::boyer_moore_searcher(needle.begin(), needle.end())); if ( it != haystack.end() ) { return std::distance(haystack.begin(), it); } return -1; } std::vector<int32_t> string_search(const std::string& haystack, const std::string& needle) { std::vector<int32_t> result = {}; uint64_t start = 0; int32_t index = 0; while ( index >= 0 && start < haystack.length() ) { std::string haystackReduced = haystack.substr(start); index = string_search_single(haystackReduced, needle); if ( index >= 0 ) { result.push_back(start + index); start += index + needle.length(); } } return result; } int main() { const std::vector<std::string> texts = { "GCTAGCTCTACGAGTCTA", "GGCTATAATGCGTA", "there would have been a time for such a word", "needle need noodle needle", "DKnuthusesandprogramsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguages", "Nearby farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk." }; const std::vector<std::string> patterns = { "TCTA", "TAATAAA", "word", "needle", "and", "alfalfa" }; for ( uint64_t i = 0; i < texts.size(); ++i ) { std::cout << "text" << ( i + 1 ) << " = " << texts[i] << std::endl; } std::cout << std::endl; for ( uint64_t i = 0; i < texts.size(); ++i ) { std::vector<int32_t> indexes = string_search(texts[i], patterns[i]); std::cout << "Found " << std::quoted(patterns[i]) << " in 'text" << ( i + 1 ) << "' at indexes "; display(string_search(texts[i], patterns[i])); } } </syntaxhighlight> {{ out }} <pre> text1 = GCTAGCTCTACGAGTCTA text2 = GGCTATAATGCGTA text3 = there would have been a time for such a word text4 = needle need noodle needle text5 = DKnuthusesandprogramsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguages text6 = Nearby farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk. Found "TCTA" in 'text1' at indexes [6, 14] Found "TAATAAA" in 'text2' at indexes [] Found "word" in 'text3' at indexes [40] Found "needle" in 'text4' at indexes [0, 19] Found "and" in 'text5' at indexes [10, 46, 73] Found "alfalfa" in 'text6' at indexes [29, 79] </pre> =={{header\|Emacs Lisp}}== Line 100 ⟶ 350: <pre>33</pre> </pre> =={{header\|Fortran}}== <syntaxhighlight lang="Fortran"> module bm_str implicit none private public :: bmstr ! ! PARAMETER definitions ! integer , private , parameter :: NO_OF_CHARS = 256, SIZEX=256 contains pure subroutine badcharheuristic(Str,M,Badchar) implicit none ! Dummy arguments ! integer :: M integer , dimension(NO_OF_CHARS) :: Badchar character(1) , dimension(M) :: Str intent (in) M , Str intent (out) Badchar ! ! Local variables ! integer :: i ! Code starts here Badchar(NO_OF_CHARS) = -1 do i = 1 , M Badchar(iachar(Str(i))) = i enddo return end subroutine badcharheuristic function bmstr(Pat,M,Str,N) result(found) implicit none ! ! Dummy arguments ! integer :: M integer :: N character(len=m) :: Pat character(len=n) :: Str intent (in) M , N , Pat , Str ! ! Local variables ! integer , dimension(NO_OF_CHARS) :: badchar integer :: found integer :: i integer :: j integer :: s ! Code starts here ! found = -1 if ( (M==0) .OR. (N==0) .OR. (M>N) ) return badchar = 0 call badcharheuristic(Pat,M,badchar) i = 0 s = 0 do while ( s<=(N-M) ) j = M do j = m,1,-1 if((Pat(j:j) /= Str(s+j:s+j)) )exit ! Leave, the pattern doesn't match. enddo if ( j < 1 ) then ! Found, let's leave found = s + 1 return endif i = badchar(iachar(Str(s+j:s+j))) s = s + MAX(1,j-i) enddo found = -1 return end function bmstr end module bm_str </syntaxhighlight> </pre> =={{header\|FreeBASIC}}== <syntaxhighlight lang="vb">#define max(a, b) iif((a) > (b), (a), (b)) Dim Shared As Integer suff(), bmBc(), bmGs() Sub preBmBc(pat As String) Dim As Integer m = Len(pat), i Redim bmBc(m) For i = 1 To m-1 bmBc(Cint(pat[i])) = m - i Next End Sub Sub suffixes(pat As String) Dim As Integer m = Len(pat), g = m, i, f Redim suff(m) suff(m) = m For i = m-1 To 1 Step -1 If i > g And suff(i + m - f) < i - g Then suff(i) = suff(i + m - f) Else If i < g Then g = i f = i While g >= 1 And pat[g] = pat[g + m - f] g -= 1 Wend suff(i) = f - g End If Next End Sub Sub preBmGs(pat As String) Dim As Integer m = Len(pat), j = 1, i Redim suff(m) Redim bmGs(m) For i = m To 1 Step -1 If suff(i) = i Then While j < m - i If bmGs(j) = m Then bmGs(j) = m - i j += 1 Wend End If Next For i = 1 To m-1 bmGs(m - suff(i)) = m - i Next End Sub Sub BM(pat As String, s As String, case_insensitive As Boolean = False) Dim As String pins = "'" & pat & "' in " & "'" & s & "'" If case_insensitive Then pat = Lcase(pat) s = Lcase(s) End If ' Preprocessing preBmGs(pat) preBmBc(pat) ' Searching Dim As Integer j = 0, n = Len(s), m = Len(pat), i = m While j <= n - m i -= 1 If pat[i] <> s[i+j] Then Exit While j += Iif(i < 1, bmGs(0), max(bmGs(i), bmBc(Len(s[i+j]) - m + i))) Wend Dim As Integer many = Instr(s, pat) Dim As String tmp = "" If Not many > 0 Then Print "No "; pins Else Do While many > 0 'if not found loop will be skipped tmp &= Str(many) & "," many = Instr(many + 1, s, pat) Loop Print Using "Found & at indices [&]"; pins; tmp & Chr(8) End If End Sub BM("GCAGAGAG","GCATCGCAGAGAGTATACAGTACG") BM("TCTA","GCTAGCTCTACGAGTCTA") BM("TAATAAA","GGCTATAATGCGTA") BM("word","there would have been a time for such a word") BM("needle","needle need noodle needle") Const book = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesley" + _ "DKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeand" + _ "assemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented" BM("put",book) BM("and",book) Const farm = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with " + _ "bales of all that alfalfa exchanged for milk." BM("alfalfa",farm) Sleep</syntaxhighlight> {{out}} <pre>Found 'GCAGAGAG' in 'GCATCGCAGAGAGTATACAGTACG' at indices [6] Found 'TCTA' in 'GCTAGCTCTACGAGTCTA' at indices [7,15] No 'TAATAAA' in 'GGCTATAATGCGTA' Found 'word' in 'there would have been a time for such a word' at indices [41] Found 'needle' in 'needle need noodle needle' at indices [1,20] Found 'put' in 'InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented' at indices [27,91] Found 'and' in InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented' at indices [102,129,172] Found 'alfalfa' in 'Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.' at indices [34,88]</pre> =={{header\|J}}== Line 184 ⟶ 619: In other words: <tt>bmsearch0</tt> takes slightly over twice as long as <tt>bmsearch1</tt> for a somewhat randomly picked example search (though the results are the same). =={{header\|Java}}== <syntaxhighlight lang="java"> import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; import java.util.List; import java.util.stream.Collectors; import java.util.stream.IntStream; /** * An implementation of the Boyer-Moore string search algorithm. * It finds all occurrences of a pattern in a text, performing a case-insensitive search on ASCII characters. * * For all full description of the algorithm visit: * https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm / public final class BoyerMooreStringSearch { public static void main(String[] aArgs) { List<String> texts = List.of( "GCTAGCTCTACGAGTCTA", "GGCTATAATGCGTA", "there would have been a time for such a word", "needle need noodle needle", "DKnuthusesandshowsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustrate", "Farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk."); List<String> patterns = List.of( "TCTA", "TAATAAA", "word", "needle", "put", "and", "alfalfa" ); for ( int i = 0; i < texts.size(); i++ ) { System.out.println("text" + ( i + 1 ) + " = " + texts.get(i)); } System.out.println(); for ( int i = 0; i < patterns.size(); i++ ) { final int j = ( i < 5 ) ? i : i - 1; System.out.println("Found \"" + patterns.get(i) + "\" in 'text" + ( j + 1 ) + "' at indexes " + stringSearch(texts.get(j), patterns.get(i))); } } /* * Return a list of indexes at which the given pattern matches the given text. / private static List<Integer> stringSearch(String aText, String aPattern) { if ( aPattern.isEmpty() \|\| aText.isEmpty() \|\| aText.length() < aPattern.length() ) { return Collections.emptyList(); } List<Integer> matches = new ArrayList<Integer>(); // Preprocessing List<List<Integer>> R = badCharacterTable(aPattern); int[] L = goodSuffixTable(aPattern); int[] F = fullShiftTable(aPattern); int k = aPattern.length() - 1; // Represents the alignment of the end of aPattern relative to aText int previousK = -1; // Represents the above alignment in the previous phase while ( k < aText.length() ) { int i = aPattern.length() - 1; // Index of the character to compare in aPattern int h = k; // Index of the character to compare in aText while ( i >= 0 && h > previousK && aPattern.charAt(i) == aText.charAt(h) ) { i -= 1; h -= 1; } if ( i == -1 \|\| h == previousK ) { // Match has been found matches.add(k - aPattern.length() + 1); k += ( aPattern.length() > 1 ) ? aPattern.length() - F[1] : 1; } else { // No match, shift by the maximum of the bad character and good suffix rules int suffixShift; int charShift = i - R.get(alphabetIndex(aText.charAt(h))).get(i); if ( i + 1 == aPattern.length() ) { // Mismatch occurred on the first character suffixShift = 1; } else if ( L[i + 1] == -1 ) { // Matched suffix does not appear anywhere in aPattern suffixShift = aPattern.length() - F[i + 1]; } else { // Matched suffix appears in aPattern suffixShift = aPattern.length() - 1 - L[i + 1]; } int shift = Math.max(charShift, suffixShift); if ( shift >= i + 1 ) { // Galil's rule previousK = k; } k += shift; } } return matches; } /* * Create the shift table, F, for the given text, which is an array such that * F[i] is the length of the longest suffix of, aText.substring(i), which is also a prefix of the given text. * * Use case: If a mismatch occurs at character index i - 1 in the pattern, * the magnitude of the shift of the pattern, P, relative to the text is: P.length() - F[i]. / private static int[] fullShiftTable(String aText) { int[] F = new int[aText.length()]; int[] Z = fundamentalPreprocess(aText); int longest = 0; Collections.reverse(Arrays.asList(Z)); for ( int i = 0; i < Z.length; i++ ) { int zv = Z[i]; if ( zv == i + 1 ) { longest = Math.max(zv, longest); } F[F.length - i - 1] = longest; } return F; } /* * Create the good suffix table, L, for the given text, which is an array such that * L[i] = k, is the largest index in the given text, S, * such that S.substring(i) matches a suffix of S.substring(0, k). * * Use case: If a mismatch of characters occurs at index i - 1 in the pattern, P, * then a shift of magnitude, P.length() - L[i], is such that no instances of the pattern in the text are omitted. * Furthermore, a suffix of P.substring(0, L[i]) matches the same substring of the text that was matched by a * suffix in the pattern on the previous matching attempt. * In the case that L[i] = -1, the full shift table must be used. / private static int[] goodSuffixTable(String aText) { int[] L = IntStream.generate( () -> -1 ).limit(aText.length()).toArray(); String reversed = new StringBuilder(aText).reverse().toString(); int[] N = fundamentalPreprocess(reversed); Collections.reverse(Arrays.asList(N)); for ( int j = 0; j < aText.length() - 1; j++ ) { int i = aText.length() - N[j]; if ( i != aText.length() ) { L[i] = j; } } return L; } /* * Create the bad character table, R, for the given text, * which is a list indexed by the ASCII value of a character, C, in the given text. * * Use case: The entry at index i of R is a list of size: 1 + length of the given text. * This inner list gives at each index j the next position at which character C will be found * while traversing the given text from right to left starting from index j. / private static List<List<Integer>> badCharacterTable(String aText) { if ( aText.isEmpty() ) { return Collections.emptyList(); } List<List<Integer>> R = IntStream.range(0, ALPHABET_SIZE).boxed() .map( i -> new ArrayList<Integer>(Collections.nCopies(1,-1)) ).collect(Collectors.toList()); List<Integer> alpha = new ArrayList<Integer>(Collections.nCopies(ALPHABET_SIZE, -1)); for ( int i = 0; i < aText.length(); i++ ) { alpha.set(alphabetIndex(aText.charAt(i)), i); for ( int j = 0; j < alpha.size(); j++ ) { R.get(j).add(alpha.get(j)); } } return R; } /* * Create the fundamental preprocess, Z, of the given text, which is an array such that * Z[i] is the length of the substring of the given text beginning at index i which is also a prefix of the text. / private static int[] fundamentalPreprocess(String aText) { if ( aText.isEmpty() ) { return new int[0]; } if ( aText.length() == 1 ) { return new int[] { 1 }; } int[] Z = new int[aText.length()]; Z[0] = aText.length(); Z[1] = matchLength(aText, 0, 1); for ( int i = 2; i <= Z[1]; i++ ) { Z[i] = Z[1] - i + 1; } // Define the left and right limits of the z-box int left = 0; int right = 0; for ( int i = 2 + Z[1]; i < aText.length(); i++ ) { if ( i <= right ) { // i falls within existing z-box final int k = i - left; final int b = Z[k]; final int a = right - i + 1; if ( b < a ) { // b ends within existing z-box Z[i] = b; } else { // b ends at or after the end of the z-box, // an explicit match to the right of the z-box is required Z[i] = a + matchLength(aText, a, right + 1); left = i; right = i + Z[i] - 1; } } else { // i does not fall within existing z-box Z[i] = matchLength(aText, 0, i); if ( Z[i] > 0 ) { left = i; right = i + Z[i] - 1; } } } return Z; } /* * Return the length of the match of the two substrings of the given text beginning at each of the given indexes. / private static int matchLength(String aText, int aIndexOne, int aIndexTwo) { if ( aIndexOne == aIndexTwo ) { return aText.length() - aIndexOne; } int matchCount = 0; while ( aIndexOne < aText.length() && aIndexTwo < aText.length() && aText.charAt(aIndexOne) == aText.charAt(aIndexTwo) ) { matchCount += 1; aIndexOne += 1; aIndexTwo += 1; } return matchCount; } /* * Return the ASCII index of the given character, if it is such, otherwise throw an illegal argument exception. / private static int alphabetIndex(char aChar) { final int result = (int) aChar; if ( result >= ALPHABET_SIZE ) { throw new IllegalArgumentException("Not an ASCII character:" + aChar); } return result; } / The range of ASCII characters is 0..255, both inclusive. / private static final int ALPHABET_SIZE = 256; } </syntaxhighlight> <pre> text1 = GCTAGCTCTACGAGTCTA text2 = GGCTATAATGCGTA text3 = there would have been a time for such a word text4 = needle need noodle needle text5 = DKnuthusesandshowsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustrate text6 = Farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk. Found "TCTA" in 'text1' at indexes [6, 14] Found "TAATAAA" in 'text2' at indexes [] Found "word" in 'text3' at indexes [40] Found "needle" in 'text4' at indexes [0, 19] Found "put" in 'text5' at indexes [32] Found "and" in 'text5' at indexes [10, 43, 70] Found "alfalfa" in 'text6' at indexes [22, 72] </pre> =={{header\|Julia}}== Some of this code is derived from Python code by Ben Langmead. Because of that ancestry, it returns Line 515 ⟶ 1,210: (occurrences, num_alignments, num_character_comparisons) = ([101, 128, 171], 71, 79) (occurrences, num_alignments, num_character_comparisons) = ([33, 87], 20, 42) </pre> =={{header\|Nim}}== {{trans\|Python}} <syntaxhighlight lang = Nim>#[ This version uses ASCII for case-sensitive matching. For Unicode you may want to match in UTF-8 bytes instead of creating a 0x10FFFF-sized table. ]# import std/[algorithm, sequtils, strutils] const AlphabetSize = 256 func reversed(s: string): string = ## Return the reverse of an ASCII string. for i in countdown(s.high, 0): result.add s[i] proc alphabetIndex(c: char): int = ## Return the index of the given ASCII character. result = ord(c) assert result in 0..<ALPHABET_SIZE proc matchLength(s: string; idx1, idx2: int): int = ## Return the length of the match of the substrings of "s" beginning at "idx1" and "idx2". if idx1 == idx2: return s.len - idx1 var idx1 = idx1 var idx2 = idx2 while idx1 < s.len and idx2 < s.len and s[idx1] == s[idx2]: inc result inc idx1 inc idx2 proc fundamentalPreprocess(s: string): seq[int] = ## Return "z", the Fundamental Preprocessing of "s". # "z[i]" is the length of the substring beginning at "i" which is also a prefix of "s". # This preprocessing is done in O(n) time, where n is the length of "s". if s.len == 0: return if s.len == 1: return @[1] result = repeat(0, s.len) result[0] = s.len result[1] = s.matchLength(0, 1) for i in 2..result[1]: result[i] = result[1] - i + 1 # Defines lower and upper limits of z-box. var l, r = 0 for i in (2 + result[1])..s.high: if i <= r: # "i" falls within existing z-box. let k = i - l let b = result[k] let a = r - i + 1 if b < a: # "b" ends within existing z-box. result[i] = b else: # "b" ends at or after the end of the z-box. # We need to do an explicit match to the right of the z-box. result[i] = a + s.matchLength(a, r + 1) l = i r = i + result[i] - 1 else: # "i" does not reside within existing z-box. result[i] = s.matchLength(0, i) if result[i] > 0: l = i r = i + result[i] - 1 proc badCharacterTable(s: string): seq[seq[int]] = ## Generates "R" for "s", which is an array indexed by the position of some character "c" ## in the ASCII table. # At that index in "R" is an array of length \|s\|+1, specifying for each index "i" in "s" # (plus the index after "s") the next location of character "c" encountered when traversing # "S" from right to left starting at "i". This is used for a constant-time lookup for the bad # character rule in the Boyer-Moore string search algorithm, although it has a much larger # size than non-constant-time solutions. if s.len == 0: return newSeqWith(AlphabetSize, newSeq[int]()) result = repeat(@[-1], AlphabetSize) var alpha = repeat(-1, AlphabetSize) for i, c in s: alpha[alphabetIndex(c)] = i for j, a in alpha: result[j].add a proc goodSuffixTable(s: string): seq[int] = ## Generates "L" for "s", an array used in the implementation of the strong good suffix rule. # "L[i] = k", the largest position in S such that "s[i:]" (the suffix of "s" starting at "i") # matches a suffix of "s[:k]" (a substring in "s" ending at "k"). Used in Boyer-Moore, "L" # gives an amount to shift "P" relative to "T" such that no instances of "P" in "T" are skipped # and a suffix of "P[:L[i]]" matches the substring of "T" matched by a suffix of "P" in the # previous match attempt. # Specifically, if the mismatch took place at position "i-1" in "P", the shift magnitude is # given by the formula "len(P) - L[i]". In the case that "L[i] = -1", the full shift table # is used. Since only proper suffixes matter, "L[0] = -1". result =repeat(-1, s.len) var n = fundamentalPreprocess(reversed(s)) n.reverse() for j in 0..(s.len - 2): let i = s.len - n[j] if i != s.len: result[i] = j proc fullShiftTable(s: string): seq[int] = ## Generates "F" for "s", an array used in a special case of the good suffix rule in the ## Boyer-Moore string search algorithm. # "F[i]" is the length of the longest suffix of "s[i:]" that is also a prefix of "s". In # the cases it is used, the shift magnitude of the pattern "P" relative to the text "T" is # "len(P) - F[i]" for a mismatch occurring at "i-1". result = repeat(0, s.len) let z = fundamentalPreprocess(s) var longest = 0 for i, zv in reversed(z): if zv == i + 1: longest = max(zv, longest) result[^(i + 1)] = longest proc stringSearch(p, t: string): seq[int] = ## Implementation of the Boyer-Moore string search algorithm. # This finds all occurrences of "p" in "t", and incorporates numerous ways of preprocessing # the pattern to determine the optimal amount to shift the string and skip comparisons. # In practice it runs in O(m) (and even sublinear) time, where "m" is the length of "t". # This implementation performs a case-sensitive search on ASCII characters. "p" must be # ASCII as well. if p.len == 0 or t.len == 0 or t.len < p.len: return # Preprocessing let r = badCharacterTable(p) let l = goodSuffixTable(p) let f = fullShiftTable(p) var k = p.len - 1 # Represents alignment of end of "p" relative to "t". var prevk = -1 # Represents alignment in previous phase (Galil's rule). while k < t.len: var i = p.len - 1 # Character to compare in "p". var h = k # Character to compare in "t". while i >= 0 and h > prevk and p[i] == t[h]: # Matches starting from end of "p". dec i dec h if i == -1 or h == prevk: # Match has been found (Galil's rule). result.add k - p.len + 1 inc k, (if p.len > 1: p.len - f[1] else: 1) else: # No match: shift by max of bad character and good suffix rules. let charShift = i - r[alphabetIndex(t[h])][i] let suffixShift = if i + 1 == p.len: # Mismatch happened on first attempt. 1 elif l[i + 1] == -1: # Matched suffix does not appear anywhere in "p". p.len - f[i + 1] else: # Matched suffix appears in "p". p.len - 1 - l[i + 1] let shift = max(charShift, suffixShift) if shift >= i + 1: prevk = k # Galil's rule inc k, shift const Text1 = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesley" & "DKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassembly" & "languagestoillustratetheconceptsandalgorithmsastheyarepresented" Text2 = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, " & "with bales of all that alfalfa exchanged for milk." (Pat1, Pat2, Pat3) = ("put", "and", "alfalfa") echo "Found '", Pat1, "' at: ", Pat1.stringSearch(Text1).join(", ") echo "Found '", Pat2, "' at: ", Pat2.stringSearch(Text1).join(", ") echo "Found '", Pat3, "' at: ", Pat3.stringSearch(Text2).join(", ") </syntaxhighlight> {{out}} <pre>Found 'put' at: 26, 90 Found 'and' at: 101, 128, 171 Found 'alfalfa' at: 33, 87 </pre> Line 528 ⟶ 1,394: type TSearchFun = reference to function(const s: rawbytestring): specialize TArray<SizeInt>; ~~TIntArray = array of SizeInt;~~ ~~TSearchFun = reference to function(const s: rawbytestring): TIntArray;~~ { returns a function that performs a case-sensitive search for all occurrences(1-based) Line 541 ⟶ 1,406: I: Integer; begin FillDWord(BcTable, ~~Succ~~Length(~~High(Byte)~~BcTable), DWord(Length(aPattern))); for I := 0 to Length(aPattern) - 2 do BcTable[p[I]] := Pred(Length(aPattern) - I); Line 581 ⟶ 1,446: var Needle: rawbytestring; ~~const~~ ~~MatchInitLen = 4;~~ begin if aPattern <> '' then begin Line 591 ⟶ 1,454: Needle := ''; { returns an empty array if there are no matches or the pattern is empty } Result := function(const aHaystack: rawbytestring): ~~TIntArray~~specialize TArray<SizeInt> var ~~Matches: TIntArray;~~ pNeedle: PByte absolute Needle; pHaystack: PByte absolute aHaystack; I, J, NeedleLast, ~~MatchPos~~OutPos, OldPfxEnd: SizeInt; const OutInitLen = 4; begin ~~Matches~~Result := nil; if (Needle = '') or (Length(aHaystack) < Length(Needle)) then exit~~(Matches)~~; SetLength(~~Matches~~Result, ~~MatchInitLen~~OutInitLen); ~~MatchPos~~OutPos := 0; NeedleLast := Pred(Length(Needle)); I := NeedleLast; Line 611 ⟶ 1,475: end; if J < OldPfxEnd then begin if ~~MatchPos~~OutPos = Length(~~Matches~~Result) then SetLength(~~Matches~~Result, ~~MatchPos~~OutPos 2); ~~Matches~~Result[~~MatchPos~~OutPos] := I - OldPfxEnd + 2; Inc(~~MatchPos~~OutPos); Inc(I, Succ(GsTable[0] - OldPfxEnd)); OldPfxEnd := Length(Needle)2 - GsTable[0]; Line 621 ⟶ 1,485: end; end; SetLength(~~Matches~~Result, ~~MatchPos~~OutPos); ~~Result := Matches;~~ end; end; Line 650 ⟶ 1,513: WriteArray(BmgSearch(Text3)); WriteArray(BmgSearch(Text4)); ~~ReadLn;~~ end. </syntaxhighlight> Line 1,114 ⟶ 1,976: The same examples have been used as in the Julia entry above. <syntaxhighlight lang="~~ecmascript~~wren">class BoyerMoore { /* * Returns the index within this string of the first occurrence of the Line 1,260 ⟶ 2,122: Found 'alfalfa' in 'text6' at indices [33, 87] </pre> =={{header\|Yabasic}}== {{trans\|FreeBASIC}} <syntaxhighlight lang="vb">case_insensitive = false sub preBmBc(pat$) local m, i m = len(pat$) redim bmBc(m) for i = 1 to m-1 bmBc(val(mid$(pat$, i, 1))) = m - i next end sub sub suffixes(pat$) local m, g, i, f m = len(pat$) g = m redim suff(m) suff(m) = m for i = m-1 to 1 step -1 if i > g and suff(i + m - f) < i - g then suff(i) = suff(i + m - f) else if i < g g = i f = i while g >= 1 and mid$(pat$, g, 1) = mid$(pat$, g + m - f, 1) g = q - 1 wend suff(i) = f - g fi next end sub sub preBmGs(pat$) local m, j, i m = len(pat$) j = 1 redim suff(m) redim bmGs(m) for i = m to 1 step -1 if suff(i) = i then while j < m - i if bmGs(j) = m bmGs(j) = m - i j = j + 1 wend fi next for i = 1 to m-1 bmGs(m - suff(i)) = m - i next end sub sub BM(pat$, s$, case_insensitive) local pins pins$ = "'" + pat$ + "' in " + "'" + s$ + "'" if case_insensitive then pat$ = lower$(pat$) s$ = lower$(s$) fi //* Preprocessing // preBmGs(pat$) preBmBc(pat$) // Searching *// j = 0 n = len(s$) m = len(pat$) i = m while j <= n - m i = i - 1 if mid$(pat$,i,1) <> mid$(s$,i+j,1) break if i < 1 then j = j + bmGs(0) else j = j + max(bmGs(i), bmBc(len(mid$(s$,i+j,1)) - m + i)) : fi wend many = instr(s$, pat$) tmp$ = "" if not many > 0 then print "No ", pins$ else while many > 0 //if not found loop will be skipped tmp$ = tmp$ + str$(many) + "," many = instr(s$, pat$, many + 1) wend print "Found ", pins$, " at indices [", tmp$, chr$(8), "]" fi end sub BM("GCAGAGAG","GCATCGCAGAGAGTATACAGTACG") BM("TCTA","GCTAGCTCTACGAGTCTA") BM("TAATAAA","GGCTATAATGCGTA") BM("word","there would have been a time for such a word") BM("needle","needle need noodle needle") book$ = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented" BM("put",book$) BM("and",book$) farm$ = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk." BM("alfalfa",farm$)</syntaxhighlight> {{out}} <pre>Same as QBasic entry.</pre>