Talk:Boyer-Moore string search

I am not sure if this is intended to be a task.

I also am not sure how this algorithm's performance fairs against the caching mechanisms of recent machines. (Other than branch prediction issues, boyer-moore is probably not penalized by caching implementations -- unlike tree search algorithms which experience poor cache locality -- but there might be some minimum search string length necessary to see significant gains from this approach, especially in uncached contexts.) --Rdm (talk) 01:13, 6 July 2022 (UTC)

For strings I do not know, but variants of this algorithm are commonly used to search DNA databases. See for example this paper: http://www.ijsetr.com/uploads/625413IJSETR2868-162.pdf --Wherrera (talk) 03:03, 6 July 2022 (UTC)

Emacs Lisp incorrect?

No output for one and no indication of how it might be used for two. It certainly doesn't run on TIO with the message

Trailing garbage following expression: 
;;(print (bm_compile_pattern "abcdb"))

and replit just seems to run an editor... Then again I've never used Emacs, so what do I know? --Pete Lomax (talk) 15:55, 9 July 2022 (UTC)

https://www.gnu.org/software/emacs/manual/html_node/efaq/Evaluating-Emacs-Lisp-code.html explains how to run emacs lisp code. --Rdm (talk) 15:28, 17 July 2022 (UTC)

That said, currently, the Emacs Lisp implementation (and the current J implementation which copies it) does not implement the good suffix rule. So the algorithm exhibits poor performance in certain contexts.

This isn't the sort of thing which impacts the correctness of the returned result -- it's strictly a performance issue (it's a performance improvement when "good suffixes" are rare or absent in the text being searched, it's a performance problem when relatively long "good suffixes" appear relatively often in the text being searched (lazy construction of the corresponding lookup tables would tend to reduce the cost of that part of the implementation, but would also tend to increase cache pressure)). I've got other things to do right now, but I'll circle back on this when I have time. --Rdm (talk) 13:53, 18 July 2022 (UTC)

And... my simplistic tests suggest that the performance advantages of the full Boyer-Moore algorithm are significant (performance improvement greater than a factor of 2) in at least some example cases. So, the emacs lisp implementation should probably be marked as incorrect. (But that leaves us with the question of how to determine whether an arbitrary implementation in an arbitrary programming language is correct -- this isn't about the correctness of the result, it's about the relative performance of the implementation.)

Near as I can tell, to properly judge implementations, we would have to work out search patterns which give us adequate code coverage (like 'ctcat') and ask for a display of some relevant intermediate results. --Rdm (talk) 17:54, 19 July 2022 (UTC)

What do you think about the character comparison counts as shown by Julia/Phix, as a proof/measure of correctness? --Pete Lomax (talk) 16:23, 21 July 2022 (UTC)

I guess it can be a decent indicator given well-chosen test cases, though it's arguably mislabeled -- bad_character_rule, for example, engages in slightly abstracted character comparisons, but these are not counted in those implementations -- so this count and its use would need careful description, if it were made to be part of the task. --Rdm (talk) 17:06, 21 July 2022 (UTC)