Boyer-Moore string search: Difference between revisions

Content added Content deleted

Inline

Revision as of 16:43, 25 August 2022

Task

This algorithm is designed for pattern searching on certain types of devices which are backtracking-unfriendly such as Tape drives and Hard disks.

Its works by first caching a segment of data from storage and match it against the pattern from the highest position backward to the lowest, if the matching fails, cache next segment of data, and move the start point forward to skip the portion of data which will definitely fail to match, until we successfully match the pattern or the pointer exceeds the data length.

Follow this link for more information about this algorithm.

To properly test this algorithm, it would be good to search for a string which contains repeated subsequences, such as alfalfa and the text being searched should not only include a match but that match should be preceded by words which partially match, such as alfredo, behalf, calfskin, halfhearted, malfunction or severalfold.

Emacs Lisp

(defun bm_compile_pattern (pattern)
  "Compile the pattern to a right most position map"
  (let ((patLen (length pattern))
        (rightMap (make-vector 256 -1))
        (j -1))
    (while (> patLen (setq j (1+ j)))
      (aset rightMap (elt pattern j) j) )
    rightMap
    )
  )

(defun bm_substring_search (pattern text)
  "Boyer-Moore string search"
  (let ((patLen (length pattern))
        (txtLen (length text))
        (startPos 0)
        (result nil)
        (rightMap (bm_compile_pattern pattern)))
    ;; Continue this loop when no result and not exceed the text length
    (while (and (not result) (<= (+ startPos patLen) txtLen))

      (let ((idx patLen)
            (skip 0))
        (while (and (= 0 skip) (<= 0 (setq idx (1- idx))))
          ;; skip when the character at position idx is different
          (when (/= (elt pattern idx) (elt text (+ startPos idx)))
            ;; looking up the right most position in pattern
            (setq skip (max 1 (- idx (aref rightMap (elt text (+ startPos idx))))))
            )
          )
        (if (< 0 skip)
            (setq startPos (+ startPos skip))
          (setq result startPos)
          )
        )
      )
    result
    )
  )


(let ((pattern "alfalfa")
      (full_text "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk."))
  
  (bm_substring_search pattern full_text)
)

outputs:

J

Translation of: Emacs Lisp

<lang J>bmsearch0=: {{

  startPos=. 0
  rightMap=. (i.#x) (3 u: x)} 256#_1
  while. (#y) >: (#x)+startPos do.
    skip=. 0
    idx=. #x
    while. 0 <: idx=. idx-1 do.
      if. (idx{x) ~: (startPos+idx) { y do.
        skip=. 1 >. idx - rightMap {~ 3 u: y {~ startPos + idx
        break.
      end.
    end.
    if. skip do.
      startPos=. startPos+skip
    else.
      startPos return.
    end.
  end.

}}</lang>

Example use: <lang J> 'alfalfa' bmsearch0 'On behalf of an alfredo calfskin malfunction we donate alfalfa.' 55</lang>

Caution: bmsearch0 is an incomplete implementation of the algorithm. See talk page.

Here's a complete implementation of the algorithm:

<lang J>Z=: {{ y Template:1 i.~y~:(\.y }}

bmsearch1=: {{

 mx=. <:nx=. #x
 my=. <:ny=. #y
 R=. |:>./\_1,(i.@# ([`]`((256#_1)"0))}&> 3&u:) x
 L=.{{ 
   j=.I.*y
   j ((#y)-y{~j)} _1#~#y }} Z&.|. x
 F=. >./\.(*]=#-i.@#) Z x
 k=. mx
 k0=. _1
 while. k < ny do.
   i=. mx
   h=. k
   while. (0<:i) * (k0<h) * (i{x) = h{y do.
     i=. i-1
     h=. h-1
   end.
   if. (_1=i)+.k0=h do.
     1+k-nx return.
   else.
     if. mx=i do.
       suffix_shift=. 1
     elseif. _1=L{~i1=. i+1 do.
       suffix_shift=. nx-F{~i1
     else.
       suffix_shift=. mx-L{~i1
     end.
     shift=. suffix_shift >. i-R{~<(3 u: h{y),i
     if. shift > i do. k0=. k end.
     k=. k + shift
   end.
 end.
 EMPTY

}}</lang>

<lang J> 'alfalfa' bmsearch1 'On behalf of an alfredo calfskin malfunction we donate alfalfa.' 55</lang>

Testing performance on a relatively long text (and pattern):

<lang J> text=: 'acgt'{~?.1e7#4

  pat=. text{~9e6+i.1e5
  pat bmsearch0 text

9000000

  pat bmsearch1 text

9000000

  'pat bmsearch0 text' %&timex 'pat bmsearch1 text'

2.33477</lang>

In other words: bmsearch0 takes slightly over twice as long as bmsearch1 for a somewhat randomly picked example search (though the results are the same).

Julia

Some of this code is derived from Python code by Ben Langmead. Because of that ancestry, it returns results that are 0-based in indexing, though Julia native 1-based indexing is used within the individual functions. The theorem references in the comments are to the definitive work of Daniel M. Gusfield, "Algorithms on Strings, Trees, and Sequences", 1997, available online as a pdf in several publicly indexed archives on the internet. <lang ruby>""" Rosetta Code task at rosettacode.org/mw/index.php?title=Boyer-Moore_string_search """

const ASCIIchars = String([Char(i) for i = 0:255]) # any standard or extended ASCII char const NAChars = String(['A', 'C', 'G', 'T', 'U', 'N']) # RNA and DNA bases const lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz " # a-z and space

""" Use Z algorithm to preprocess s """ function z_array(s)

   len = length(s)
   @assert len > 1
   z = [len; zeros(Int, len - 1)]
   # Initial comparison of s[1:] with prefix
   for i = 1:len-1
       if s[i+1] == s[i]
           z[2] += 1
       else
           break
       end
   end
   r, l = 0, 0
   if z[2] > 0
       r, l = z[2], 1
   end
   for k = 2:len-1
       @assert z[k+1] == 0
       if k > r
           # Case 1
           for i = k:len-1
               if s[i+1] == s[i-k+1]
                   z[k+1] += 1
               else
                   break
               end
           end
           r, l = k + z[k+1] - 1, k
       else
           # Case 2
           # Calculate length of beta
           nbeta = r - k + 1
           zkp = z[k-l+1]
           if nbeta > zkp
               # Case 2a: Zkp wins
               z[k+1] = zkp
           else
               # Case 2b: Compare characters just past r
               nmatch = 0
               for i = r+1:len-1
                   if s[i+1] == s[i-k+1]
                       nmatch += 1
                   else
                       break
                   end
               end
               l, r = k, r + nmatch
               z[k+1] = r - k + 1
           end
       end
   end
   return z

end

""" Compile the N array (Gusfield theorem 2.2.2) from the Z array """ function n_array(s)

   return reverse(z_array(reverse(s)))

end

""" Compile L' array (Gusfield theorem 2.2.2) using p and N array. L'[i] = largest index j less than n such that N[j] = |P[i:]| """ function big_l_prime_array(p, n)

   len = length(p)
   lp = zeros(Int, len)
   for j = 1:len-1
       i = len - n[j]
       if i < len
           lp[i+1] = j + 1
       end
   end
   return lp

end

""" Compile L array (Gusfield theorem 2.2.2) using p and L' array. L[i] = largest index j less than n such that N[j] >= |P[i:]| """ function big_l_array(p, lp)

   l = zeros(Int, length(p))
   l[2] = lp[2]
   for i = 3:length(p)
       l[i] = max(l[i-1], lp[i])
   end
   return l

end

""" Compile lp' array (Gusfield theorem 2.2.4) using N array. """ function small_l_prime_array(narray)

   len = length(narray)
   small_lp = zeros(Int, len)
   for i in eachindex(narray)
       if narray[i] == i  # prefix matching a suffix
           small_lp[len-i+1] = i
       end
   end
   for i = len-1:-1:1  # "smear" them out to the left
       if small_lp[i] == 0
           small_lp[i] = small_lp[i+1]
       end
   end
   return small_lp

end

""" Return tables needed to apply good suffix rule. """ function good_suffix_table(p)

   n = n_array(p)
   lp = big_l_prime_array(p, n)
   return lp, big_l_array(p, lp), small_l_prime_array(n)

end

""" Given a mismatch at offset i, and given L/L' and l' arrays, return amount to shift as determined by good suffix rule. """ function good_suffix_mismatch(i, big_l_prime, small_l_prime)

   len = length(big_l_prime)
   @assert i < len
   if i == len - 1
       return 0
   end
   i += 1  # i points to leftmost matching position of P
   if big_l_prime[i] > 0
       return len - big_l_prime[i]
   end
   return len - small_l_prime[i]

end

""" Given a full match of P to T, return amount to shift as determined by good suffix rule. """ good_suffix_match(small_l_prime) = length(small_l_prime) - small_l_prime[2]

""" Given pattern string and list with ordered alphabet characters, create and return a dense bad character table. Table is indexed by offset then by character position in alphabet. """ function dense_bad_char_tab(p, amap)

   tab = Vector{Int}[]
   nxt = zeros(Int, length(amap))
   for i in eachindex(p)
       c = p[i]
       @assert haskey(amap, c)
       push!(tab, nxt[:])
       nxt[amap[c]] = i
   end
   return tab

end

""" Encapsulates pattern and associated Boyer-Moore preprocessing. """ struct BoyerMoore

   pat::String
   alphabet::String
   amap::Dict{Char,Int}
   bad_char::Vector{Vector{Int}}
   big_l::Vector{Int}
   small_l_prime::Vector{Int}

end

function BoyerMoore(p, alphabet = "ACGT")

   # Create map from alphabet characters to integers
   amap = Dict(alphabet[i] => i for i in eachindex(alphabet))
   # Make bad character rule table
   bad_char = dense_bad_char_tab(p, amap)
   # Create good suffix rule table
   _, big_l, small_l_prime = good_suffix_table(p)
   return BoyerMoore(p, alphabet, amap, bad_char, big_l, small_l_prime)

end

""" Return # skips given by bad character rule at offset i """ function bad_character_rule(bm, i, c)

   @assert haskey(bm.amap, c)
   ci = bm.amap[c]
   @assert i > bm.bad_char[i+1][ci] - 1
   return i - (bm.bad_char[i+1][ci] - 1)

end

""" Given a mismatch at offset i, return amount to shift per (weak) good suffix rule. """ function good_suffix_rule(bm, i)

   len = length(bm.big_l)
   @assert i < len
   if i == len - 1
       return 0
   end
   i += 1  # i points to leftmost matching position of P
   if bm.big_l[i+1] > 0
       return len - bm.big_l[i+1]
   end
   return len - bm.small_l_prime[i+1]

end

""" Return amount to shift in case where P matches T """ match_skip(bm) = length(bm.small_l_prime) - bm.small_l_prime[2]

Let's make sure our rules give the expected results.
GCTAGCTCTACGAGTCTA

p = "TCAA" p_bm = BoyerMoore(p, "ACGT") @show p_bm.amap, p_bm.bad_char @show bad_character_rule(p_bm, 2, 'T') # 2

GCTAGCTCTACGAGTCTA
ACTA

p = "ACTA" p_bm = BoyerMoore(p, "ACGT") @show good_suffix_rule(p_bm, 0) # 3

ACACGCTCTACGAGTCTA
ACAC

p = "ACAC" p_bm = BoyerMoore(p, "ACGT") @show match_skip(p_bm) # 2

""" Do Boyer-Moore matching """ function boyer_moore(p, p_bm, t)

   i = 0
   occurrences = Int[]
   while i < length(t) - length(p) + 1
       shift = 1
       mismatched = false
       for j = length(p)-1:-1:0
           if p[j+1] != t[i+j+1]
               skip_bc = bad_character_rule(p_bm, j, t[i+j+1])
               skip_gs = good_suffix_rule(p_bm, j)
               shift = max(shift, skip_bc, skip_gs)
               mismatched = true
               break
           end
       end
       if !mismatched
           push!(occurrences, i)
           skip_gs = match_skip(p_bm)
           shift = max(shift, skip_gs)
       end
       i += shift
   end
   return occurrences

end

""" Do Boyer-Moore matching counts """ function boyer_moore_with_counts(p, p_bm, t)

   i = 0
   occurrences = Int[]
   alignments_tried, comparison_count = 0, 0
   while i < length(t) - length(p) + 1
       alignments_tried += 1
       shift = 1
       mismatched = false
       for j = length(p)-1:-1:0
           comparison_count += 1
           if p[j+1] != t[i+j+1]
               skip_bc = bad_character_rule(p_bm, j, t[i+j+1])
               skip_gs = good_suffix_rule(p_bm, j)
               shift = max(shift, skip_bc, skip_gs)
               mismatched = true
               break
           end
       end
       if !mismatched
           push!(occurrences, i)
           skip_gs = match_skip(p_bm)
           shift = max(shift, skip_gs)
       end
       i += shift
   end
   return occurrences, alignments_tried, comparison_count

end

const t1 = "GCTAGCTCTACGAGTCTA" const p1 = "TCTA" const p_bm1 = BoyerMoore(p1, "ACGT") @show boyer_moore(p1, p_bm1, t1)

const t2 = "GGCTATAATGCGTA" const p2 = "TAATAAA" const p_bm2 = BoyerMoore(p2, "ACGT") @show bad_character_rule(p_bm2, 1, 'T')

const p3 = "word" const t3 = "there would have been a time for such a word" const p_bm3 = BoyerMoore(p3, lowercase_alphabet) occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p3, p_bm3, t3) @show occurrences, num_alignments, num_character_comparisons

const p4 = "needle" const t4 = "needle need noodle needle" const p_bm4 = BoyerMoore(p4, lowercase_alphabet) occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p4, p_bm4, t4) @show occurrences, num_alignments, num_character_comparisons

const p5 = "put" const t5 = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented" const p_bm5 = BoyerMoore(p5, ASCIIchars) occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p5, p_bm5, t5) @show occurrences, num_alignments, num_character_comparisons

p6 = "and" p_bm6 = BoyerMoore(p6, ASCIIchars) occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p6, p_bm6, t5) @show occurrences, num_alignments, num_character_comparisons

p7 = "alfalfa" t7 = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk." p_bm7 = BoyerMoore(p7, ASCIIchars) occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p7, p_bm7, t7) @show occurrences, num_alignments, num_character_comparisons

</lang>

Output:

(p_bm.amap, p_bm.bad_char) = (Dict('A' => 1, 'G' => 3, 'T' => 4, 'C' => 2), [[0, 0, 0, 0], [0, 0, 0, 1], [0, 2, 0, 1], [3, 2, 0, 1]])
bad_character_rule(p_bm, 2, 'T') = 2
good_suffix_rule(p_bm, 0) = 3       
match_skip(p_bm) = 2
boyer_moore(p1, p_bm1, t1) = [6, 14] 
bad_character_rule(p_bm2, 1, 'T') = 1
(occurrences, num_alignments, num_character_comparisons) = ([40], 12, 15)
(occurrences, num_alignments, num_character_comparisons) = ([0, 19], 5, 18)
(occurrences, num_alignments, num_character_comparisons) = ([26, 90], 70, 78)       
(occurrences, num_alignments, num_character_comparisons) = ([101, 128, 171], 71, 79)
(occurrences, num_alignments, num_character_comparisons) = ([33, 87], 20, 42)

Phix

--
-- demo\rosetta\BoyerMoore.exw
-- ===========================
--
-- based on https://www-igm.univ-mlv.fr/~lecroq/string/node14.html
--
with javascript_semantics
function preBmBc(string pat)
    integer m = length(pat)
    sequence bmBc = repeat(m,255)
    for i=1 to m-1 do
      bmBc[pat[i]] = m - i
    end for
    return bmBc
end function
 
function suffixes(string pat)
    integer m = length(pat), g = m, f
    sequence suff = repeat(0,m)
    suff[m] = m;
    for i=m-1 to 1 by -1 do
        if i > g and suff[i + m - f] < i - g then
            suff[i] = suff[i + m - f]
        else
            if i < g then g = i end if
            f = i
            while g >= 1 and pat[g] == pat[g + m - f] do
                g -= 1;
            end while
            suff[i] = f - g
        end if
    end for
    return suff
end function
 
function preBmGs(string pat)
    integer m = length(pat), j = 1
    sequence suff = suffixes(pat),
             bmGs = repeat(m,m)
    for i=m to 1 by -1 do
        if (suff[i] == i) then
            while j < m - i do
                if (bmGs[j] == m) then
                    bmGs[j] = m - i;
                end if
                j += 1
            end while
        end if
    end for
    for i=1 to m-1 do
        bmGs[m - suff[i]] = m - i;
    end for
    return bmGs
end function
 
procedure BM(string pat, s, bool case_insensitive = false)
    string pins = sprintf("`%s` in `%s`",{pat,shorten(s,"characters",10)})
    if case_insensitive then
        pat = lower(pat)
        s = lower(s) 
    end if
    /* Preprocessing */
    sequence bmGs = preBmGs(pat),
             bmBc = preBmBc(pat)
    /* Searching */
    sequence res = {}
    integer i, j = 0,
            n = length(s),
            m = length(pat),
            cc = 0
    while j <= n - m do
        for i=m to 1 by -1 do
            cc += 1
            if pat[i]!=s[i+j] then exit end if
        end for
        if i<1 then
            res &= j+1
            j += bmGs[1];
        else
            j += max(bmGs[i], bmBc[s[i + j]] - m + i);
        end if
    end while
    string ccs = sprintf("(%d character comparisons)",cc)
    if length(res) then
        string many = ordinant(length(res))
        printf(1,"Found %s %s at %v %s\n",{pins,many,res,ccs})
    else
        printf(1,"No %s %s\n",{pins,ccs})
    end if
end procedure
   
BM("GCAGAGAG","GCATCGCAGAGAGTATACAGTACG")
BM("TCTA","GCTAGCTCTACGAGTCTA")
BM("TAATAAA","GGCTATAATGCGTA")
BM("word","there would have been a time for such a word")
BM("needle","needle need noodle needle")
constant book = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesley"&
                "DKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeand"&
                "assemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented"
BM("put",book)
BM("and",book)
constant farm = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with "&
                "bales of all that alfalfa exchanged for milk."
BM("alfalfa",farm)
--BM("aLfAlfa",farm)        -- none
--BM("aLfAlfa",farm,true)   -- as -2

Output:

Significantly lower character comparison counts than Knuth-Morris-Pratt_string_search#Phix.

Found `GCAGAGAG` in `GCATCGCAGAGAGTATACAGTACG` once at {6} (17 character comparisons)
Found `TCTA` in `GCTAGCTCTACGAGTCTA` twice at {7,15} (14 character comparisons)
No `TAATAAA` in `GGCTATAATGCGTA` (4 character comparisons)
Found `word` in `there woul...uch a word (44 characters)` once at {41} (15 character comparisons)
Found `needle` in `needle need noodle needle` twice at {1,20} (18 character comparisons)
Found `put` in `Inhisbooks...epresented (202 characters)` twice at {27,91} (78 character comparisons)
Found `and` in `Inhisbooks...epresented (202 characters)` three times at {102,129,172} (79 character comparisons)
Found `alfalfa` in `Nearby far... for milk. (114 characters)` twice at {34,88} (42 character comparisons)

Python

Slightly modified from en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#Python_implementation. <lang python>""" This version uses ASCII for case-sensitive matching. For Unicode you may want to match in UTF-8 bytes instead of creating a 0x10FFFF-sized table. """

from typing import List

ALPHABET_SIZE = 256

def alphabet_index(c: str) -> int:

   """Return the index of the given ASCII character. """
   val = ord(c)
   assert 0 <= val < ALPHABET_SIZE
   return val

def match_length(S: str, idx1: int, idx2: int) -> int:

   """Return the length of the match of the substrings of S beginning at idx1 and idx2."""
   if idx1 == idx2:
       return len(S) - idx1
   match_count = 0
   while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
       match_count += 1
       idx1 += 1
       idx2 += 1
   return match_count

def fundamental_preprocess(S: str) -> List[int]:

   """Return Z, the Fundamental Preprocessing of S.

   Z[i] is the length of the substring beginning at i which is also a prefix of S.
   This pre-processing is done in O(n) time, where n is the length of S.
   """
   if len(S) == 0:  # Handles case of empty string
       return []
   if len(S) == 1:  # Handles case of single-character string
       return [1]
   z = [0 for x in S]
   z[0] = len(S)
   z[1] = match_length(S, 0, 1)
   for i in range(2, 1 + z[1]):  # Optimization from exercise 1-5
       z[i] = z[1] - i + 1
   # Defines lower and upper limits of z-box
   l = 0
   r = 0
   for i in range(2 + z[1], len(S)):
       if i <= r:  # i falls within existing z-box
           k = i - l
           b = z[k]
           a = r - i + 1
           if b < a:  # b ends within existing z-box
               z[i] = b
           else:  # b ends at or after the end of the z-box, we need to do an explicit match to the right of the z-box
               z[i] = a + match_length(S, a, r + 1)
               l = i
               r = i + z[i] - 1
       else:  # i does not reside within existing z-box
           z[i] = match_length(S, 0, i)
           if z[i] > 0:
               l = i
               r = i + z[i] - 1
   return z

def bad_character_table(S: str) -> List[List[int]]:

   """
   Generates R for S, which is an array indexed by the position of some character c in the
   ASCII table. At that index in R is an array of length |S|+1, specifying for each
   index i in S (plus the index after S) the next location of character c encountered when
   traversing S from right to left starting at i. This is used for a constant-time lookup
   for the bad character rule in the Boyer-Moore string search algorithm, although it has
   a much larger size than non-constant-time solutions.
   """
   if len(S) == 0:
       return [[] for a in range(ALPHABET_SIZE)]
   R = [[-1] for a in range(ALPHABET_SIZE)]
   alpha = [-1 for a in range(ALPHABET_SIZE)]
   for i, c in enumerate(S):
       alpha[alphabet_index(c)] = i
       for j, a in enumerate(alpha):
           R[j].append(a)
   return R

def good_suffix_table(S: str) -> List[int]:

   """
   Generates L for S, an array used in the implementation of the strong good suffix rule.
   L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches
   a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to
   shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]]
   matches the substring of T matched by a suffix of P in the previous match attempt.
   Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given
   by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used.
   Since only proper suffixes matter, L[0] = -1.
   """
   L = [-1 for c in S]
   N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
   N.reverse()
   for j in range(0, len(S) - 1):
       i = len(S) - N[j]
       if i != len(S):
           L[i] = j
   return L

def full_shift_table(S: str) -> List[int]:

   """
   Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore
   string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a
   prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the
   text T is len(P) - F[i] for a mismatch occurring at i-1.
   """
   F = [0 for c in S]
   Z = fundamental_preprocess(S)
   longest = 0
   for i, zv in enumerate(reversed(Z)):
       longest = max(zv, longest) if zv == i + 1 else longest
       F[-i - 1] = longest
   return F

def string_search(P, T) -> List[int]:

   """
   Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
   in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
   amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
   sublinear) time, where m is the length of T. This implementation performs a case-sensitive
   search on ASCII characters. P must be ASCII as well.
   """
   if len(P) == 0 or len(T) == 0 or len(T) < len(P):
       return []

   matches = []

   # Preprocessing
   R = bad_character_table(P)
   L = good_suffix_table(P)
   F = full_shift_table(P)

   k = len(P) - 1      # Represents alignment of end of P relative to T
   previous_k = -1     # Represents alignment in previous phase (Galil's rule)
   while k < len(T):
       i = len(P) - 1  # Character to compare in P
       h = k           # Character to compare in T
       while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
           i -= 1
           h -= 1
       if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
           matches.append(k - len(P) + 1)
           k += len(P) - F[1] if len(P) > 1 else 1
       else:  # No match, shift by max of bad character and good suffix rules
           char_shift = i - R[alphabet_index(T[h])][i]
           if i + 1 == len(P):  # Mismatch happened on first attempt
               suffix_shift = 1
           elif L[i + 1] == -1:  # Matched suffix does not appear anywhere in P
               suffix_shift = len(P) - F[i + 1]
           else:               # Matched suffix appears in P
               suffix_shift = len(P) - 1 - L[i + 1]
           shift = max(char_shift, suffix_shift)
           previous_k = k if shift >= i + 1 else previous_k  # Galil's rule
           k += shift
   return matches

TEXT1 = 'InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented' TEXT2 = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk." PAT1, PAT2, PAT3 = 'put', 'and', 'alfalfa'

print("Found", PAT1, "at:", string_search(PAT1, TEXT1)) print("Found", PAT2, "at:", string_search(PAT2, TEXT1)) print("Found", PAT3, "at:", string_search(PAT3, TEXT2))

</lang>

Output:

Found put at: [26, 90]
Found and at: [101, 128, 171]
Found alfalfa at: [33, 87]

Raku

Translation of: Phix

<lang perl6># 20220818 Raku programming solution

sub suffixes (@pat,\m) {

  loop (my ($i,$f,$g,@suff)=m-2, 0, m-1, |flat 0 xx m-1,m; $i >= 0; --$i) {
     if $i > $g and @suff[$i + m - 1 - $f] < $i - $g {
        @suff[$i] = @suff[$i + m - 1 - $f]
     } else {

$g = $i if $i < $g;

        $f = $i;

while $g >= 0 and @pat[$g] eq @pat[$g + m - 1 - $f] { $g-- }

        @suff[$i] = $f - $g
     }
  }
  return @suff;

}

sub preBmGs (@pat,\m) {

  my (@suff, @bmGs) := suffixes(@pat,m), [m xx m];

  for m-1 ... 0 -> \k {
     if @suff[k] == k + 1 {
        loop (my $j=0; $j < m-1-k; $j++) { @bmGs[k]=m-1-k if @bmGs[$j] == m }
     }
  }
  for ^(m-1) { @bmGs[m - 1 - @suff[$_]] = m - 1 - $_ }
  return @bmGs;

}

sub BM (@txt,@pat) {

  my (\m, \n, $j)    = +@pat, +@txt, 0;
  my (@bmGs, %bmBc) := preBmGs(@pat,m), hash @pat Z=> ( m-1 ... 1 );

  return gather while $j <= n - m {
     loop (my $i = m - 1; $i >= 0 and @pat[$i] eq @txt[$i + $j]; ) { $i-- }
     if $i < 0 {

take $j;

        $j += @bmGs[0]
     } else {
        $j += max @bmGs[$i], (%bmBc{@txt[$i + $j]} // m)-m+$i
     }
  }

}

my @texts = [

  "GCTAGCTCTACGAGTCTA",
  "GGCTATAATGCGTA",
  "there would have been a time for such a word",
  "needle need noodle needle",
  "BharôtভাৰতBharôtভারতIndiaBhāratભારતBhāratभारतBhārataಭಾರತBhāratभारतBhāratamഭാരതംBhāratभारतBhāratभारतBharôtôଭାରତBhāratਭਾਰਤBhāratamभारतम्Bārataபாரதம்BhāratamഭാരതംBhāratadēsamభారతదేశం",
  "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented",
  "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.",

];

my @pats = [ "TCTA", "TAATAAA", "word", "needle", "ഭാരതം", "put", "and", "alfalfa"];

say "text$_ = @texts[$_]" for @texts.keys ; say();

for @pats.keys {

  my \j = $_ < 6 ?? $_ !! $_-1 ; # for searching text5 twice
  say "Found '@pats[$_]' in 'text{j}' at indices ", BM @texts[j].comb, @pats[$_].comb

}</lang> Output is the same as the Knuth-Morris-Pratt_string_search#Raku entry.

Wren

I've worked here from the Java code in the Wikipedia article though, as this version only finds the index of the first occurrence of the substring, I've added a routine to find all non-overlapping occurrences by repeatedly calling the BoyerMoore.indexOf method. Indices are zero-based.

The same examples have been used as in the Julia entry above. <lang ecmascript>class BoyerMoore {

   /**
    * Returns the index within this string of the first occurrence of the
    * specified substring. If it is not a substring, return -1.
    *
    * There is no Galil because it only generates one match.
    *
    * @param haystack The string to be scanned
    * @param needle The target string to search
    * @return The start index of the substring
    */
   static indexOf(haystack, needle) {
       haystack = haystack.bytes.toList
       needle = needle.bytes.toList
       var nc = needle.count
       if (nc == 0) return 0
       var charTable = makeCharTable_(needle)
       var offsetTable = makeOffsetTable_(needle)
       var i = nc - 1
       while (i < haystack.count) {
           var j = nc - 1
           while (needle[j] == haystack[i]) {
               if (j == 0) return i
               i = i - 1
               j = j - 1
           }
           i = i + offsetTable[nc - 1 - j].max(charTable[haystack[i]])
       }
       return -1
   }

   /**
    * Makes the jump table based on the mismatched character information.
    */
   static makeCharTable_(needle) {
       var ALPHABET_SIZE = 256 // use bytes rather than codepoints
       var nc = needle.count
       var table = List.filled(ALPHABET_SIZE, nc)
       for (i in 0...nc) table[needle[i]] = nc - 1 - i
       return table
   }

   /**
    * Makes the jump table based on the scan offset which mismatch occurs.
    * (bad character rule).
    */
   static makeOffsetTable_(needle) {
       var nc = needle.count
       var table = List.filled(nc, 0)
       var lastPrefixPosition = nc
       for (i in nc...0) {
           if (isPrefix_(needle, i)) lastPrefixPosition = i
           table[nc-1] = lastPrefixPosition - i - nc
       }
       for (i in 0...nc-1) {
           var slen = suffixLength_(needle, i)
           table[slen] = nc - 1 - i + slen
       }
       return table
   }

   /**
    * Is needle[p..-1] a prefix of needle?
    */
   static isPrefix_(needle, p) {
       var i = p
       var j = 0
       var nc = needle.count
       while (i < nc) {
           if (needle[i] != needle[j]) return false
           i = i + 1
           j = j + 1
       }
       return true
   }

   /**
    * Returns the maximum length of the substring ends at p and is a suffix.
    * (good suffix rule)
    */
  static suffixLength_(needle, p) {
       var len = 0
       var nc = needle.count
       var i = p
       var j = nc - 1
       while (i >= 0 && needle[i] == needle[j]) {
           len = len + 1
           i = i - 1
           j = j - 1
       }
       return len
   }

}

/*

* Uses the BoyerMoore class to find the indices of ALL non-overlapping matches of the specified substring
* and return a list of them. Returns an empty list if it's not a substring.
*/

var indicesOf = Fn.new { |haystack, needle|

   var indices = []
   var hc = haystack.bytes.count
   var bc = needle.bytes.count
   var start = 0
   while (true) {
       var haystack2 = haystack[start..-1]
       var index = BoyerMoore.indexOf(haystack2, needle)
       if (index == -1) return indices
       indices.add(start + index)
       start = start + index + bc
       if (start >= hc) return indices
   }

}

var texts = [

   "GCTAGCTCTACGAGTCTA",
   "GGCTATAATGCGTA",
   "there would have been a time for such a word",
   "needle need noodle needle",

"InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented",

   "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk."

] var pats = ["TCTA", "TAATAAA", "word", "needle", "put", "and", "alfalfa"] for (i in 0...texts.count) System.print("text%(i+1) = %(texts[i])") System.print() for (i in 0...pats.count) {

   var j = (i < 5) ? i : i-1
   System.print("Found '%(pats[i])' in 'text%(j+1)' at indices %(indicesOf.call(texts[j], pats[i]))")

}</lang>

Output:

text1 = GCTAGCTCTACGAGTCTA
text2 = GGCTATAATGCGTA
text3 = there would have been a time for such a word
text4 = needle need noodle needle
text5 = InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented
text6 = Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.

Found 'TCTA' in 'text1' at indices [6, 14]
Found 'TAATAAA' in 'text2' at indices []
Found 'word' in 'text3' at indices [40]
Found 'needle' in 'text4' at indices [0, 19]
Found 'put' in 'text5' at indices [26, 90]
Found 'and' in 'text5' at indices [101, 128, 171]
Found 'alfalfa' in 'text6' at indices [33, 87]