Boyer-Moore string search: Difference between revisions

Added Yabasic
(added Pascal example)
(Added Yabasic)
 
(16 intermediate revisions by 7 users not shown)
Line 10:
 
To properly test this algorithm, it would be good to search for a string which contains repeated subsequences, such as <code>alfalfa</code> and the text being searched should not only include a match but that match should be preceded by words which partially match, such as <code>alfredo</code>, <code>behalf</code>, <code>calfskin</code>, <code>halfhearted</code>, <code>malfunction</code> or <code>severalfold</code>.
=={{header|11l}}==
{{trans|Python}}
 
<syntaxhighlight lang="11l">V ALPHABET_SIZE = 256
 
F alphabet_index(Char c) -> Int
‘Return the index of the given ASCII character.’
V val = c.code
assert(Int(val) C 0 .< :ALPHABET_SIZE)
R val
 
F match_length(String s, Int =idx1, Int =idx2) -> Int
‘Return the length of the match of the substrings of S beginning at idx1 and idx2.’
I idx1 == idx2
R s.len - idx1
V match_count = 0
L idx1 < s.len & idx2 < s.len & s[idx1] == s[idx2]
match_count++
idx1++
idx2++
R match_count
 
F fundamental_preprocess(String s) -> [Int]
‘Return Z, the Fundamental Preprocessing of S.
 
Z[i] is the length of the substring beginning at i which is also a prefix of S.
This pre-processing is done in O(n) time, where n is the length of S.’
I s.empty
R []
I s.len == 1
R [1]
V z = [0] * s.len
z[0] = s.len
z[1] = match_length(s, 0, 1)
L(i) 2 .< 1 + z[1]
z[i] = z[1] - i + 1
V l = 0
V r = 0
L(i) 2 + z[1] .< s.len
I i <= r
V k = i - l
V b = z[k]
V a = r - i + 1
I b < a
z[i] = b
E
z[i] = a + match_length(s, a, r + 1)
l = i
r = i + z[i] - 1
E
z[i] = match_length(s, 0, i)
I z[i] > 0
l = i
r = i + z[i] - 1
R z
 
F bad_character_table(String s) -> [[Int]]
Generates R for S, which is an array indexed by the position of some character c in the
ASCII table. At that index in R is an array of length |S|+1, specifying for each
index i in S (plus the index after S) the next location of character c encountered when
traversing S from right to left starting at i. This is used for a constant-time lookup
for the bad character rule in the Boyer-Moore string search algorithm, although it has
a much larger size than non-constant-time solutions.
I s.empty
R [[Int]()] * :ALPHABET_SIZE
V r = [[-1]] * :ALPHABET_SIZE
V alpha = [-1] * :ALPHABET_SIZE
L(c) s
alpha[alphabet_index(c)] = L.index
L(a) alpha
r[L.index].append(a)
R r
 
F good_suffix_table(String s) -> [Int]
Generates L for S, an array used in the implementation of the strong good suffix rule.
L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches
a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to
shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]]
matches the substring of T matched by a suffix of P in the previous match attempt.
Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given
by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used.
Since only proper suffixes matter, L[0] = -1.
V l = [-1] * s.len
V _n_ = fundamental_preprocess(reversed(s))
_n_.reverse()
L(j) 0 .< s.len - 1
V i = s.len - _n_[j]
I i != s.len
l[i] = j
R l
 
F full_shift_table(String s) -> [Int]
Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore
string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a
prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the
text T is len(P) - F[i] for a mismatch occurring at i-1.
V f = [0] * s.len
V Z = fundamental_preprocess(s)
V longest = 0
L(zv) reversed(Z)
V i = L.index
longest = I zv == i + 1 {max(zv, longest)} E longest
f[(len)-i - 1] = longest
R f
 
F string_search(P, _t_) -> [Int]
Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
sublinear) time, where m is the length of T. This implementation performs a case-sensitive
search on ASCII characters. P must be ASCII as well.
I P.empty | _t_.empty | _t_.len < P.len
R []
 
[Int] matches
 
V r = bad_character_table(P)
V l = good_suffix_table(P)
V f = full_shift_table(P)
 
V k = P.len - 1
V previous_k = -1
L k < _t_.len
V i = P.len - 1
V h = k
L i >= 0 & h > previous_k & P[i] == _t_[h]
i--
h--
I i == -1 | h == previous_k
matches.append(k - P.len + 1)
k += I P.len > 1 {P.len - f[1]} E 1
E
V char_shift = i - r[alphabet_index(_t_[h])][i]
Int suffix_shift
I i + 1 == P.len
suffix_shift = 1
E I l[i + 1] == -1
suffix_shift = P.len - f[i + 1]
E
suffix_shift = P.len - 1 - l[i + 1]
V shift = max(char_shift, suffix_shift)
previous_k = I shift >= i + 1 {k} E previous_k
k += shift
R matches
 
V TEXT1 = ‘InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented’
V TEXT2 = ‘Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.’
V (PAT1, PAT2, PAT3) = (‘put’, ‘and’, ‘alfalfa’)
 
print(‘Found ’PAT1‘ at: ’string_search(PAT1, TEXT1))
print(‘Found ’PAT2‘ at: ’string_search(PAT2, TEXT1))
print(‘Found ’PAT3‘ at: ’string_search(PAT3, TEXT2))</syntaxhighlight>
 
{{out}}
<pre>
Found put at: [26, 90]
Found and at: [101, 128, 171]
Found alfalfa at: [33, 87]
</pre>
 
=={{header|C++}}==
<syntaxhighlight lang="c++">
#include <algorithm>
#include <cstdint>
#include <experimental/iterator>
#include <functional>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>
 
void display(const std::vector<int32_t>& numbers) {
std::cout << "[";
std::copy(std::begin(numbers), std::end(numbers), std::experimental::make_ostream_joiner(std::cout, ", "));
std::cout << "]" << std::endl;
}
 
int32_t string_search_single(const std::string& haystack, const std::string& needle) {
auto it = std::search(haystack.begin(), haystack.end(), std::boyer_moore_searcher(needle.begin(), needle.end()));
 
if ( it != haystack.end() ) {
return std::distance(haystack.begin(), it);
}
return -1;
}
 
std::vector<int32_t> string_search(const std::string& haystack, const std::string& needle) {
std::vector<int32_t> result = {};
uint64_t start = 0;
int32_t index = 0;
while ( index >= 0 && start < haystack.length() ) {
std::string haystackReduced = haystack.substr(start);
index = string_search_single(haystackReduced, needle);
if ( index >= 0 ) {
result.push_back(start + index);
start += index + needle.length();
}
}
return result;
}
 
int main() {
const std::vector<std::string> texts = {
"GCTAGCTCTACGAGTCTA",
"GGCTATAATGCGTA",
"there would have been a time for such a word",
"needle need noodle needle",
"DKnuthusesandprogramsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguages",
"Nearby farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk."
};
 
const std::vector<std::string> patterns = { "TCTA", "TAATAAA", "word", "needle", "and", "alfalfa" };
 
for ( uint64_t i = 0; i < texts.size(); ++i ) {
std::cout << "text" << ( i + 1 ) << " = " << texts[i] << std::endl;
}
std::cout << std::endl;
 
for ( uint64_t i = 0; i < texts.size(); ++i ) {
std::vector<int32_t> indexes = string_search(texts[i], patterns[i]);
std::cout << "Found " << std::quoted(patterns[i]) << " in 'text" << ( i + 1 ) << "' at indexes ";
display(string_search(texts[i], patterns[i]));
}
}
</syntaxhighlight>
{{ out }}
<pre>
text1 = GCTAGCTCTACGAGTCTA
text2 = GGCTATAATGCGTA
text3 = there would have been a time for such a word
text4 = needle need noodle needle
text5 = DKnuthusesandprogramsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguages
text6 = Nearby farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk.
 
Found "TCTA" in 'text1' at indexes [6, 14]
Found "TAATAAA" in 'text2' at indexes []
Found "word" in 'text3' at indexes [40]
Found "needle" in 'text4' at indexes [0, 19]
Found "and" in 'text5' at indexes [10, 46, 73]
Found "alfalfa" in 'text6' at indexes [29, 79]
</pre>
 
=={{header|Emacs Lisp}}==
 
Line 100 ⟶ 350:
 
<pre>33</pre>
</pre>
 
=={{header|Fortran}}==
<syntaxhighlight lang="Fortran">
module bm_str
implicit none
private
public :: bmstr
!
! PARAMETER definitions
!
integer , private , parameter :: NO_OF_CHARS = 256, SIZEX=256
contains
pure subroutine badcharheuristic(Str,M,Badchar)
implicit none
! Dummy arguments
!
integer :: M
integer , dimension(NO_OF_CHARS) :: Badchar
character(1) , dimension(M) :: Str
intent (in) M , Str
intent (out) Badchar
!
! Local variables
!
integer :: i
! Code starts here
Badchar(NO_OF_CHARS) = -1
do i = 1 , M
Badchar(iachar(Str(i))) = i
enddo
return
end subroutine badcharheuristic
function bmstr(Pat,M,Str,N) result(found)
implicit none
!
! Dummy arguments
!
integer :: M
integer :: N
character(len=m) :: Pat
character(len=n) :: Str
intent (in) M , N , Pat , Str
!
! Local variables
!
integer , dimension(NO_OF_CHARS) :: badchar
integer :: found
integer :: i
integer :: j
integer :: s
! Code starts here
!
found = -1
if ( (M==0) .OR. (N==0) .OR. (M>N) ) return
badchar = 0
call badcharheuristic(Pat,M,badchar)
i = 0
s = 0
do while ( s<=(N-M) )
j = M
do j = m,1,-1
if((Pat(j:j) /= Str(s+j:s+j)) )exit ! Leave, the pattern doesn't match.
enddo
if ( j < 1 ) then ! Found, let's leave
found = s + 1
return
endif
i = badchar(iachar(Str(s+j:s+j)))
s = s + MAX(1,j-i)
enddo
found = -1
return
end function bmstr
end module bm_str
</syntaxhighlight>
</pre>
 
=={{header|FreeBASIC}}==
<syntaxhighlight lang="vb">#define max(a, b) iif((a) > (b), (a), (b))
 
Dim Shared As Integer suff(), bmBc(), bmGs()
 
Sub preBmBc(pat As String)
Dim As Integer m = Len(pat), i
Redim bmBc(m)
For i = 1 To m-1
bmBc(Cint(pat[i])) = m - i
Next
End Sub
 
Sub suffixes(pat As String)
Dim As Integer m = Len(pat), g = m, i, f
Redim suff(m)
suff(m) = m
For i = m-1 To 1 Step -1
If i > g And suff(i + m - f) < i - g Then
suff(i) = suff(i + m - f)
Else
If i < g Then g = i
f = i
While g >= 1 And pat[g] = pat[g + m - f]
g -= 1
Wend
suff(i) = f - g
End If
Next
End Sub
 
Sub preBmGs(pat As String)
Dim As Integer m = Len(pat), j = 1, i
Redim suff(m)
Redim bmGs(m)
For i = m To 1 Step -1
If suff(i) = i Then
While j < m - i
If bmGs(j) = m Then bmGs(j) = m - i
j += 1
Wend
End If
Next
For i = 1 To m-1
bmGs(m - suff(i)) = m - i
Next
End Sub
 
Sub BM(pat As String, s As String, case_insensitive As Boolean = False)
Dim As String pins = "'" & pat & "' in " & "'" & s & "'"
If case_insensitive Then
pat = Lcase(pat)
s = Lcase(s)
End If
' Preprocessing
preBmGs(pat)
preBmBc(pat)
' Searching
Dim As Integer j = 0, n = Len(s), m = Len(pat), i = m
While j <= n - m
i -= 1
If pat[i] <> s[i+j] Then Exit While
j += Iif(i < 1, bmGs(0), max(bmGs(i), bmBc(Len(s[i+j]) - m + i)))
Wend
Dim As Integer many = Instr(s, pat)
Dim As String tmp = ""
If Not many > 0 Then
Print "No "; pins
Else
Do While many > 0 'if not found loop will be skipped
tmp &= Str(many) & ","
many = Instr(many + 1, s, pat)
Loop
Print Using "Found & at indices [&]"; pins; tmp & Chr(8)
End If
End Sub
 
BM("GCAGAGAG","GCATCGCAGAGAGTATACAGTACG")
BM("TCTA","GCTAGCTCTACGAGTCTA")
BM("TAATAAA","GGCTATAATGCGTA")
BM("word","there would have been a time for such a word")
BM("needle","needle need noodle needle")
Const book = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesley" + _
"DKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeand" + _
"assemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented"
BM("put",book)
BM("and",book)
Const farm = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with " + _
"bales of all that alfalfa exchanged for milk."
BM("alfalfa",farm)
 
Sleep</syntaxhighlight>
{{out}}
<pre>Found 'GCAGAGAG' in 'GCATCGCAGAGAGTATACAGTACG' at indices [6]
Found 'TCTA' in 'GCTAGCTCTACGAGTCTA' at indices [7,15]
No 'TAATAAA' in 'GGCTATAATGCGTA'
Found 'word' in 'there would have been a time for such a word' at indices [41]
Found 'needle' in 'needle need noodle needle' at indices [1,20]
Found 'put' in 'InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented' at indices [27,91]
Found 'and' in InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented' at indices [102,129,172]
Found 'alfalfa' in 'Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk.' at indices [34,88]</pre>
 
=={{header|J}}==
Line 184 ⟶ 619:
 
In other words: <tt>bmsearch0</tt> takes slightly over twice as long as <tt>bmsearch1</tt> for a somewhat randomly picked example search (though the results are the same).
 
=={{header|Java}}==
<syntaxhighlight lang="java">
 
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
 
/**
* An implementation of the Boyer-Moore string search algorithm.
* It finds all occurrences of a pattern in a text, performing a case-insensitive search on ASCII characters.
*
* For all full description of the algorithm visit:
* https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
*/
public final class BoyerMooreStringSearch {
public static void main(String[] aArgs) {
List<String> texts = List.of(
"GCTAGCTCTACGAGTCTA",
"GGCTATAATGCGTA",
"there would have been a time for such a word",
"needle need noodle needle",
"DKnuthusesandshowsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustrate",
"Farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk.");
List<String> patterns = List.of( "TCTA", "TAATAAA", "word", "needle", "put", "and", "alfalfa" );
for ( int i = 0; i < texts.size(); i++ ) {
System.out.println("text" + ( i + 1 ) + " = " + texts.get(i));
}
System.out.println();
for ( int i = 0; i < patterns.size(); i++ ) {
final int j = ( i < 5 ) ? i : i - 1;
System.out.println("Found \"" + patterns.get(i) + "\" in 'text" + ( j + 1 ) + "' at indexes "
+ stringSearch(texts.get(j), patterns.get(i)));
}
}
/**
* Return a list of indexes at which the given pattern matches the given text.
*/
private static List<Integer> stringSearch(String aText, String aPattern) {
if ( aPattern.isEmpty() || aText.isEmpty() || aText.length() < aPattern.length() ) {
return Collections.emptyList();
}
 
List<Integer> matches = new ArrayList<Integer>();
 
// Preprocessing
List<List<Integer>> R = badCharacterTable(aPattern);
int[] L = goodSuffixTable(aPattern);
int[] F = fullShiftTable(aPattern);
 
int k = aPattern.length() - 1; // Represents the alignment of the end of aPattern relative to aText
int previousK = -1; // Represents the above alignment in the previous phase
while ( k < aText.length() ) {
int i = aPattern.length() - 1; // Index of the character to compare in aPattern
int h = k; // Index of the character to compare in aText
while ( i >= 0 && h > previousK && aPattern.charAt(i) == aText.charAt(h) ) {
i -= 1;
h -= 1;
}
if ( i == -1 || h == previousK ) { // Match has been found
matches.add(k - aPattern.length() + 1);
k += ( aPattern.length() > 1 ) ? aPattern.length() - F[1] : 1;
} else { // No match, shift by the maximum of the bad character and good suffix rules
int suffixShift;
int charShift = i - R.get(alphabetIndex(aText.charAt(h))).get(i);
if ( i + 1 == aPattern.length() ) { // Mismatch occurred on the first character
suffixShift = 1;
} else if ( L[i + 1] == -1 ) { // Matched suffix does not appear anywhere in aPattern
suffixShift = aPattern.length() - F[i + 1];
} else { // Matched suffix appears in aPattern
suffixShift = aPattern.length() - 1 - L[i + 1];
}
int shift = Math.max(charShift, suffixShift);
if ( shift >= i + 1 ) { // Galil's rule
previousK = k;
}
k += shift;
}
}
return matches;
}
/**
* Create the shift table, F, for the given text, which is an array such that
* F[i] is the length of the longest suffix of, aText.substring(i), which is also a prefix of the given text.
*
* Use case: If a mismatch occurs at character index i - 1 in the pattern,
* the magnitude of the shift of the pattern, P, relative to the text is: P.length() - F[i].
*/
private static int[] fullShiftTable(String aText) {
int[] F = new int[aText.length()];
int[] Z = fundamentalPreprocess(aText);
int longest = 0;
Collections.reverse(Arrays.asList(Z));
for ( int i = 0; i < Z.length; i++ ) {
int zv = Z[i];
if ( zv == i + 1 ) {
longest = Math.max(zv, longest);
}
F[F.length - i - 1] = longest;
}
return F;
}
/**
* Create the good suffix table, L, for the given text, which is an array such that
* L[i] = k, is the largest index in the given text, S,
* such that S.substring(i) matches a suffix of S.substring(0, k).
*
* Use case: If a mismatch of characters occurs at index i - 1 in the pattern, P,
* then a shift of magnitude, P.length() - L[i], is such that no instances of the pattern in the text are omitted.
* Furthermore, a suffix of P.substring(0, L[i]) matches the same substring of the text that was matched by a
* suffix in the pattern on the previous matching attempt.
* In the case that L[i] = -1, the full shift table must be used.
*/
private static int[] goodSuffixTable(String aText) {
int[] L = IntStream.generate( () -> -1 ).limit(aText.length()).toArray();
String reversed = new StringBuilder(aText).reverse().toString();
int[] N = fundamentalPreprocess(reversed);
Collections.reverse(Arrays.asList(N));
for ( int j = 0; j < aText.length() - 1; j++ ) {
int i = aText.length() - N[j];
if ( i != aText.length() ) {
L[i] = j;
}
}
return L;
}
/**
* Create the bad character table, R, for the given text,
* which is a list indexed by the ASCII value of a character, C, in the given text.
*
* Use case: The entry at index i of R is a list of size: 1 + length of the given text.
* This inner list gives at each index j the next position at which character C will be found
* while traversing the given text from right to left starting from index j.
*/
private static List<List<Integer>> badCharacterTable(String aText) {
if ( aText.isEmpty() ) {
return Collections.emptyList();
}
List<List<Integer>> R = IntStream.range(0, ALPHABET_SIZE).boxed()
.map( i -> new ArrayList<Integer>(Collections.nCopies(1,-1)) ).collect(Collectors.toList());
List<Integer> alpha = new ArrayList<Integer>(Collections.nCopies(ALPHABET_SIZE, -1));
for ( int i = 0; i < aText.length(); i++ ) {
alpha.set(alphabetIndex(aText.charAt(i)), i);
for ( int j = 0; j < alpha.size(); j++ ) {
R.get(j).add(alpha.get(j));
}
}
return R;
}
/**
* Create the fundamental preprocess, Z, of the given text, which is an array such that
* Z[i] is the length of the substring of the given text beginning at index i which is also a prefix of the text.
*/
private static int[] fundamentalPreprocess(String aText) {
if ( aText.isEmpty() ) {
return new int[0];
}
if ( aText.length() == 1 ) {
return new int[] { 1 };
}
int[] Z = new int[aText.length()];
Z[0] = aText.length();
Z[1] = matchLength(aText, 0, 1);
for ( int i = 2; i <= Z[1]; i++ ) {
Z[i] = Z[1] - i + 1;
}
// Define the left and right limits of the z-box
int left = 0;
int right = 0;
for ( int i = 2 + Z[1]; i < aText.length(); i++ ) {
if ( i <= right ) { // i falls within existing z-box
final int k = i - left;
final int b = Z[k];
final int a = right - i + 1;
if ( b < a ) { // b ends within existing z-box
Z[i] = b;
} else { // b ends at or after the end of the z-box,
// an explicit match to the right of the z-box is required
Z[i] = a + matchLength(aText, a, right + 1);
left = i;
right = i + Z[i] - 1;
}
} else { // i does not fall within existing z-box
Z[i] = matchLength(aText, 0, i);
if ( Z[i] > 0 ) {
left = i;
right = i + Z[i] - 1;
}
}
}
return Z;
}
/**
* Return the length of the match of the two substrings of the given text beginning at each of the given indexes.
*/
private static int matchLength(String aText, int aIndexOne, int aIndexTwo) {
if ( aIndexOne == aIndexTwo ) {
return aText.length() - aIndexOne;
}
int matchCount = 0;
while ( aIndexOne < aText.length() && aIndexTwo < aText.length()
&& aText.charAt(aIndexOne) == aText.charAt(aIndexTwo) ) {
matchCount += 1;
aIndexOne += 1;
aIndexTwo += 1;
}
return matchCount;
}
/**
* Return the ASCII index of the given character, if it is such, otherwise throw an illegal argument exception.
*/
private static int alphabetIndex(char aChar) {
final int result = (int) aChar;
if ( result >= ALPHABET_SIZE ) {
throw new IllegalArgumentException("Not an ASCII character:" + aChar);
}
return result;
}
 
/* The range of ASCII characters is 0..255, both inclusive. */
private static final int ALPHABET_SIZE = 256;
}
</syntaxhighlight>
<pre>
text1 = GCTAGCTCTACGAGTCTA
text2 = GGCTATAATGCGTA
text3 = there would have been a time for such a word
text4 = needle need noodle needle
text5 = DKnuthusesandshowsanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustrate
text6 = Farms grew an acre of alfalfa on the dairy's behalf, with bales of that alfalfa exchanged for milk.
 
Found "TCTA" in 'text1' at indexes [6, 14]
Found "TAATAAA" in 'text2' at indexes []
Found "word" in 'text3' at indexes [40]
Found "needle" in 'text4' at indexes [0, 19]
Found "put" in 'text5' at indexes [32]
Found "and" in 'text5' at indexes [10, 43, 70]
Found "alfalfa" in 'text6' at indexes [22, 72]
</pre>
 
=={{header|Julia}}==
Some of this code is derived from Python code by Ben Langmead. Because of that ancestry, it returns
Line 515 ⟶ 1,210:
(occurrences, num_alignments, num_character_comparisons) = ([101, 128, 171], 71, 79)
(occurrences, num_alignments, num_character_comparisons) = ([33, 87], 20, 42)
</pre>
 
=={{header|Nim}}==
{{trans|Python}}
<syntaxhighlight lang = Nim>#[ This version uses ASCII for case-sensitive matching. For Unicode you may want to match in UTF-8
bytes instead of creating a 0x10FFFF-sized table.
]#
 
import std/[algorithm, sequtils, strutils]
 
const AlphabetSize = 256
 
func reversed(s: string): string =
## Return the reverse of an ASCII string.
for i in countdown(s.high, 0):
result.add s[i]
 
proc alphabetIndex(c: char): int =
## Return the index of the given ASCII character.
result = ord(c)
assert result in 0..<ALPHABET_SIZE
 
proc matchLength(s: string; idx1, idx2: int): int =
## Return the length of the match of the substrings of "s" beginning at "idx1" and "idx2".
if idx1 == idx2: return s.len - idx1
var idx1 = idx1
var idx2 = idx2
while idx1 < s.len and idx2 < s.len and s[idx1] == s[idx2]:
inc result
inc idx1
inc idx2
 
proc fundamentalPreprocess(s: string): seq[int] =
## Return "z", the Fundamental Preprocessing of "s".
 
# "z[i]" is the length of the substring beginning at "i" which is also a prefix of "s".
# This preprocessing is done in O(n) time, where n is the length of "s".
if s.len == 0: return
if s.len == 1: return @[1]
result = repeat(0, s.len)
result[0] = s.len
result[1] = s.matchLength(0, 1)
for i in 2..result[1]:
result[i] = result[1] - i + 1
# Defines lower and upper limits of z-box.
var l, r = 0
for i in (2 + result[1])..s.high:
if i <= r: # "i" falls within existing z-box.
let k = i - l
let b = result[k]
let a = r - i + 1
if b < a: # "b" ends within existing z-box.
result[i] = b
else: # "b" ends at or after the end of the z-box.
# We need to do an explicit match to the right of the z-box.
result[i] = a + s.matchLength(a, r + 1)
l = i
r = i + result[i] - 1
else: # "i" does not reside within existing z-box.
result[i] = s.matchLength(0, i)
if result[i] > 0:
l = i
r = i + result[i] - 1
 
proc badCharacterTable(s: string): seq[seq[int]] =
## Generates "R" for "s", which is an array indexed by the position of some character "c"
## in the ASCII table.
 
# At that index in "R" is an array of length |s|+1, specifying for each index "i" in "s"
# (plus the index after "s") the next location of character "c" encountered when traversing
# "S" from right to left starting at "i". This is used for a constant-time lookup for the bad
# character rule in the Boyer-Moore string search algorithm, although it has a much larger
# size than non-constant-time solutions.
if s.len == 0: return newSeqWith(AlphabetSize, newSeq[int]())
result = repeat(@[-1], AlphabetSize)
var alpha = repeat(-1, AlphabetSize)
for i, c in s:
alpha[alphabetIndex(c)] = i
for j, a in alpha:
result[j].add a
 
proc goodSuffixTable(s: string): seq[int] =
## Generates "L" for "s", an array used in the implementation of the strong good suffix rule.
 
# "L[i] = k", the largest position in S such that "s[i:]" (the suffix of "s" starting at "i")
# matches a suffix of "s[:k]" (a substring in "s" ending at "k"). Used in Boyer-Moore, "L"
# gives an amount to shift "P" relative to "T" such that no instances of "P" in "T" are skipped
# and a suffix of "P[:L[i]]" matches the substring of "T" matched by a suffix of "P" in the
# previous match attempt.
# Specifically, if the mismatch took place at position "i-1" in "P", the shift magnitude is
# given by the formula "len(P) - L[i]". In the case that "L[i] = -1", the full shift table
# is used. Since only proper suffixes matter, "L[0] = -1".
result =repeat(-1, s.len)
var n = fundamentalPreprocess(reversed(s))
n.reverse()
for j in 0..(s.len - 2):
let i = s.len - n[j]
if i != s.len:
result[i] = j
 
proc fullShiftTable(s: string): seq[int] =
## Generates "F" for "s", an array used in a special case of the good suffix rule in the
## Boyer-Moore string search algorithm.
 
# "F[i]" is the length of the longest suffix of "s[i:]" that is also a prefix of "s". In
# the cases it is used, the shift magnitude of the pattern "P" relative to the text "T" is
# "len(P) - F[i]" for a mismatch occurring at "i-1".
result = repeat(0, s.len)
let z = fundamentalPreprocess(s)
var longest = 0
for i, zv in reversed(z):
if zv == i + 1:
longest = max(zv, longest)
result[^(i + 1)] = longest
 
proc stringSearch(p, t: string): seq[int] =
## Implementation of the Boyer-Moore string search algorithm.
 
# This finds all occurrences of "p" in "t", and incorporates numerous ways of preprocessing
# the pattern to determine the optimal amount to shift the string and skip comparisons.
# In practice it runs in O(m) (and even sublinear) time, where "m" is the length of "t".
# This implementation performs a case-sensitive search on ASCII characters. "p" must be
# ASCII as well.
 
if p.len == 0 or t.len == 0 or t.len < p.len: return
 
# Preprocessing
let r = badCharacterTable(p)
let l = goodSuffixTable(p)
let f = fullShiftTable(p)
 
var k = p.len - 1 # Represents alignment of end of "p" relative to "t".
var prevk = -1 # Represents alignment in previous phase (Galil's rule).
while k < t.len:
var i = p.len - 1 # Character to compare in "p".
var h = k # Character to compare in "t".
while i >= 0 and h > prevk and p[i] == t[h]: # Matches starting from end of "p".
dec i
dec h
if i == -1 or h == prevk: # Match has been found (Galil's rule).
result.add k - p.len + 1
inc k, (if p.len > 1: p.len - f[1] else: 1)
else: # No match: shift by max of bad character and good suffix rules.
let charShift = i - r[alphabetIndex(t[h])][i]
let suffixShift = if i + 1 == p.len: # Mismatch happened on first attempt.
1
elif l[i + 1] == -1: # Matched suffix does not appear anywhere in "p".
p.len - f[i + 1]
else: # Matched suffix appears in "p".
p.len - 1 - l[i + 1]
let shift = max(charShift, suffixShift)
if shift >= i + 1: prevk = k # Galil's rule
inc k, shift
 
const
Text1 = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesley" &
"DKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassembly" &
"languagestoillustratetheconceptsandalgorithmsastheyarepresented"
Text2 = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, " &
"with bales of all that alfalfa exchanged for milk."
(Pat1, Pat2, Pat3) = ("put", "and", "alfalfa")
 
echo "Found '", Pat1, "' at: ", Pat1.stringSearch(Text1).join(", ")
echo "Found '", Pat2, "' at: ", Pat2.stringSearch(Text1).join(", ")
echo "Found '", Pat3, "' at: ", Pat3.stringSearch(Text2).join(", ")
</syntaxhighlight>
 
{{out}}
<pre>Found 'put' at: 26, 90
Found 'and' at: 101, 128, 171
Found 'alfalfa' at: 33, 87
</pre>
 
Line 528 ⟶ 1,394:
 
type
TSearchFun = reference to function(const s: rawbytestring): specialize TArray<SizeInt>;
TIntArray = array of SizeInt;
TSearchFun = reference to function(const s: rawbytestring): TIntArray;
 
{ returns a function that performs a case-sensitive search for all occurrences(1-based)
Line 541 ⟶ 1,406:
I: Integer;
begin
FillDWord(BcTable, SuccLength(High(Byte)BcTable), DWord(Length(aPattern)));
for I := 0 to Length(aPattern) - 2 do
BcTable[p[I]] := Pred(Length(aPattern) - I);
Line 581 ⟶ 1,446:
var
Needle: rawbytestring;
const
MatchInitLen = 4;
begin
if aPattern <> '' then begin
Line 591 ⟶ 1,454:
Needle := '';
{ returns an empty array if there are no matches or the pattern is empty }
Result := function(const aHaystack: rawbytestring): TIntArrayspecialize TArray<SizeInt>
var
Matches: TIntArray;
pNeedle: PByte absolute Needle;
pHaystack: PByte absolute aHaystack;
I, J, NeedleLast, MatchPosOutPos, OldPfxEnd: SizeInt;
const
OutInitLen = 4;
begin
MatchesResult := nil;
if (Needle = '') or (Length(aHaystack) < Length(Needle)) then exit(Matches);
SetLength(MatchesResult, MatchInitLenOutInitLen);
MatchPosOutPos := 0;
NeedleLast := Pred(Length(Needle));
I := NeedleLast;
Line 611 ⟶ 1,475:
end;
if J < OldPfxEnd then begin
if MatchPosOutPos = Length(MatchesResult) then SetLength(MatchesResult, MatchPosOutPos * 2);
MatchesResult[MatchPosOutPos] := I - OldPfxEnd + 2;
Inc(MatchPosOutPos);
Inc(I, Succ(GsTable[0] - OldPfxEnd));
OldPfxEnd := Length(Needle)*2 - GsTable[0];
Line 621 ⟶ 1,485:
end;
end;
SetLength(MatchesResult, MatchPosOutPos);
Result := Matches;
end;
end;
Line 650 ⟶ 1,513:
WriteArray(BmgSearch(Text3));
WriteArray(BmgSearch(Text4));
ReadLn;
end.
</syntaxhighlight>
Line 1,114 ⟶ 1,976:
 
The same examples have been used as in the Julia entry above.
<syntaxhighlight lang="ecmascriptwren">class BoyerMoore {
/**
* Returns the index within this string of the first occurrence of the
Line 1,260 ⟶ 2,122:
Found 'alfalfa' in 'text6' at indices [33, 87]
</pre>
 
=={{header|Yabasic}}==
{{trans|FreeBASIC}}
<syntaxhighlight lang="vb">case_insensitive = false
 
sub preBmBc(pat$)
local m, i
 
m = len(pat$)
redim bmBc(m)
for i = 1 to m-1
bmBc(val(mid$(pat$, i, 1))) = m - i
next
end sub
 
sub suffixes(pat$)
local m, g, i, f
m = len(pat$)
g = m
redim suff(m)
suff(m) = m
for i = m-1 to 1 step -1
if i > g and suff(i + m - f) < i - g then
suff(i) = suff(i + m - f)
else
if i < g g = i
f = i
while g >= 1 and mid$(pat$, g, 1) = mid$(pat$, g + m - f, 1)
g = q - 1
wend
suff(i) = f - g
fi
next
end sub
 
sub preBmGs(pat$)
local m, j, i
m = len(pat$)
j = 1
redim suff(m)
redim bmGs(m)
for i = m to 1 step -1
if suff(i) = i then
while j < m - i
if bmGs(j) = m bmGs(j) = m - i
j = j + 1
wend
fi
next
for i = 1 to m-1
bmGs(m - suff(i)) = m - i
next
end sub
 
sub BM(pat$, s$, case_insensitive)
local pins
 
pins$ = "'" + pat$ + "' in " + "'" + s$ + "'"
if case_insensitive then
pat$ = lower$(pat$)
s$ = lower$(s$)
fi
//* Preprocessing *//
preBmGs(pat$)
preBmBc(pat$)
//* Searching *//
j = 0
n = len(s$)
m = len(pat$)
i = m
while j <= n - m
i = i - 1
if mid$(pat$,i,1) <> mid$(s$,i+j,1) break
if i < 1 then j = j + bmGs(0) else j = j + max(bmGs(i), bmBc(len(mid$(s$,i+j,1)) - m + i)) : fi
wend
many = instr(s$, pat$)
tmp$ = ""
if not many > 0 then
print "No ", pins$
else
while many > 0 //if not found loop will be skipped
tmp$ = tmp$ + str$(many) + ","
many = instr(s$, pat$, many + 1)
wend
print "Found ", pins$, " at indices [", tmp$, chr$(8), "]"
fi
end sub
 
BM("GCAGAGAG","GCATCGCAGAGAGTATACAGTACG")
BM("TCTA","GCTAGCTCTACGAGTCTA")
BM("TAATAAA","GGCTATAATGCGTA")
BM("word","there would have been a time for such a word")
BM("needle","needle need noodle needle")
book$ = "InhisbookseriesTheArtofComputerProgrammingpublishedbyAddisonWesleyDKnuthusesanimaginarycomputertheMIXanditsassociatedmachinecodeandassemblylanguagestoillustratetheconceptsandalgorithmsastheyarepresented"
BM("put",book$)
BM("and",book$)
farm$ = "Nearby farms grew a half acre of alfalfa on the dairy's behalf, with bales of all that alfalfa exchanged for milk."
BM("alfalfa",farm$)</syntaxhighlight>
{{out}}
<pre>Same as QBasic entry.</pre>
2,130

edits