stoimen's web log

Introduction

Have you ever asked yourself which is the algorithm used to find a word after clicking Ctrl+F and typing something? Well I guess you know the answer from the title, but in this article you’ll find out how exactly this is done.

As we saw from the Morris-Pratt string searching we don’t need to compare the text and the pattern character by character. Some comparisons can be skipped in order to improve the performance of the string searching. Indeed the brute force string searching and the Rabin-Karp algorithm are quite slow only because they compare the pattern and the text character by character.

In the other hand the Morris-Pratt algorithm is a very good improvement of the brute force string searching, but the question remains. Is there any algorithm that is faster than Morris-Pratt – is there any way to skip more comparisons and to move the pattern faster.

It’s clear that if we have to find whether a single character is contained into a text we need at least “n” steps, where n is the length of the text. Once we have to find whether a pattern with the length of “m” is contained into a text with length of “n” the case is getting a little more complex.

However the answer is that there is such algorithm that is faster and more suitable than Morris-Pratt. This is the Boyer-Moore string searching.

Overview

Boyer-Moore is an algorithm that improves the performance of pattern searching into a text by considering some observations. It is defined in 1977 by Robert S. Boyer and J Strother Moore and it consist of some specific features.

First of all this algorithm starts comparing the pattern from the leftmost part of text and moves it to the right, as on the picture below.

Boyer-Moore Shifting Direction — In Boyer-Moore the pattern is shifted from left to right!

Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.

Boyer-Moore Comparison Model — Unlike other algorithms the letters of the pattern are compared from right to left!

The main idea of Boyer-Moore in order to improve the performance are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see by the following examples what they are standing for.

Good-suffix Shifts

Just like the Morris-Pratt algorithm we start to compare the pattern against some portion of the text where a possible match will occur. In Boyer-Moore as I said this is done from the rightmost letter of the pattern. After some characters have matched we find a mismatch.

So how can we move the pattern to the right in order to skip unusual comparisons. To answer this question we need to explore the pattern. Let’s say there is a portion of the pattern that is repeated inside the pattern itself, like it is shown on the picture below.

Boyer-Moore Good-suffix Shift 1 — The pattern may consist of repeating portions of characters!

In this case we must move the pattern thus the repeated portion must now align with its first occurrence in the pattern.

A variation of this case is when the portion from the pattern A overlaps with another portion that consists of the same characters.

Boyer-Moore Good Suffix Shift 2 — Sometimes these portions may overlap!

Yet again the shift must align the second portion with its first occurrence.

Finally only a portion of A, let’s say “B”, can happen to occur in the very beginning of the pattern, as on the diagram below.

Boyer-Moore Good Suffix Shift 3 — Only a sub-string of the pattern may re-occur at its front!

Now we must align the left end of the pattern with the rightmost occurrence of “B”.

Bad Character Shifts

Beside the good-suffix shifts the Boyer-Moore algorithm make use of the so called bad-character shifts. In case of a mismatch we can skip comparisons in case the character in the text doesn’t happen to appear in the pattern. To become clearer let’s see the following examples.

Boyer-Moore Bad Character 1 — If the mismatched letter of the text appears in the pattern only in its front we can align it easily!

In the picture above we see that the mismatched character “B” from the text appears only in the beginning of the pattern. Thus we can simply shift the pattern to the right and align both characters B, skipping comparisons. An even better case is described by the following diagram where the mismatched letter isn’t contained into the pattern at all. Then we can shift forward the whole pattern.

Boyer-Moore Bad Character 2 — In case the mismatched letter isn't contained into the pattern we move forward the pattern!

Maximum of Good-suffix and Bad-Character shifts

Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch the maximum of both is considered in order to move the pattern to the right.

Complexity

It’s clear that Boyer-Moore is faster than Morris-Pratt, but actually its worst-case complexity is O(n+m). The thing is that in natural language search Boyer-Moore does pretty well.

Implementation

Finally let’s see the implementation in PHP, which can be easily “transcribed” into any other programming language. The only thing we need is the structures for bad-character shifts and good-suffixes shifts.

<?php
 
/**
 * Pattern we're searching for
 *
 * @var string
 */
$pattern = 'gloria';
 
/**
 * The text we're searching in
 *
 * @var string
 */
$text = 'Sic transit gloria mundi, non transit gloria Gundi!';
 
/**
 * Calculates the suffixes for a given pattern
 *
 * @param string $pattern
 * @param array  $suffixes
 */
function suffixes($pattern, &$suffixes)
{
   $m = strlen($pattern);
 
   $suffixes[$m - 1] = $m;
   $g = $m - 1;
 
   for ($i = $m - 2; $i >= 0; --$i) {
      if ($i > $g && $suffixes[$i + $m - 1 - $f] < $i - $g) {
         $suffixes[$i] = $suffixes[$i + $m - 1 - $f];
      } else {
         if ($i < $g) {
            $g = $i;
         }
         $f = $i;
 
         while ($g >= 0 && $pattern[$g] == $pattern[$g + $m - 1 - $f]) {
            $g--;
         }
         $suffixes[$i] = $f - $g;
      }
   }
}
 
/**
 * Fills in the array of bad characters.
 *
 * @param string $pattern
 * @param array  $badChars
 */
function badCharacters($pattern, &$badChars)
{
   $m = strlen($pattern);
 
   for ($i = 0; $i < $m - 1; ++$i) {
      $badChars[$pattern{$i}] = $m - $i - 1;
   }
}
 
/**
 * Fills in the array of good suffixes
 *
 * @param string $pattern
 * @param array  $goodSuffixes
 */
function goodSuffixes($pattern, &$goodSuffixes)
{
   $m 		= strlen($pattern);
   $suff 	= array();
 
   suffixes($pattern, $suff);
 
   for ($i = 0; $i < $m; $i++) {
      $goodSuffixes[$i] = $m;
   }
 
   for ($i = $m - 1; $i >= 0; $i--) {
      if ($suff[$i] == $i + 1) {
         for ($j = 0; $j < $m - $i - 1; $j++) {
            if ($goodSuffixes[$j] == $m) {
               $goodSuffixes[$j] = $m - $i - 1;
            }
         }
      }
   }
 
   for ($i = 0; $i < $m - 2; $i++) {
      $goodSuffixes[$m - 1 - $suff[$i]] = $m - $i - 1;
   }
}
 
/**
 * Performs a search of the pattern into a given text
 *
 * @param string $pattern
 * @param string $text
 */
function boyer_moore($pattern, $text)
{
   $n = strlen($text);
   $m = strlen($pattern);
 
   $goodSuffixes 	= array();
   $badCharacters 	= array();
 
   goodSuffixes($pattern, &$goodSuffixes);
   badCharacters($pattern, &$badCharacters);
 
   $j = 0;
   while ($j < $n - $m) {
      for ($i = $m - 1; $i >= 0 && $pattern[$i] == $text[$i + $j]; $i--);
      if ($i < 0) {
         // note that if the substring occurs more
         // than once into the text, the algorithm will
         // print out each position of the substring
         echo $j;
         $j += $goodSuffixes[0];
      } else {
         $j += max($goodSuffixes[$i], $badCharacters[$text[$i + $j]] - $m + $i + 1);
      }
   }
}
 
// search using Boyer-Moore
// will return 12 and 38
boyer_moore($pattern, $text);

<?php /** * Pattern we're searching for * * @var string */ $pattern = 'gloria'; /** * The text we're searching in * * @var string */ $text = 'Sic transit gloria mundi, non transit gloria Gundi!'; /** * Calculates the suffixes for a given pattern * * @param string $pattern * @param array $suffixes */ function suffixes($pattern, &$suffixes) { $m = strlen($pattern); $suffixes[$m - 1] = $m; $g = $m - 1; for ($i = $m - 2; $i >= 0; --$i) { if ($i > $g && $suffixes[$i + $m - 1 - $f] < $i - $g) { $suffixes[$i] = $suffixes[$i + $m - 1 - $f]; } else { if ($i < $g) { $g = $i; } $f = $i; while ($g >= 0 && $pattern[$g] == $pattern[$g + $m - 1 - $f]) { $g--; } $suffixes[$i] = $f - $g; } } } /** * Fills in the array of bad characters. * * @param string $pattern * @param array $badChars */ function badCharacters($pattern, &$badChars) { $m = strlen($pattern); for ($i = 0; $i < $m - 1; ++$i) { $badChars[$pattern{$i}] = $m - $i - 1; } } /** * Fills in the array of good suffixes * * @param string $pattern * @param array $goodSuffixes */ function goodSuffixes($pattern, &$goodSuffixes) { $m = strlen($pattern); $suff = array(); suffixes($pattern, $suff); for ($i = 0; $i < $m; $i++) { $goodSuffixes[$i] = $m; } for ($i = $m - 1; $i >= 0; $i--) { if ($suff[$i] == $i + 1) { for ($j = 0; $j < $m - $i - 1; $j++) { if ($goodSuffixes[$j] == $m) { $goodSuffixes[$j] = $m - $i - 1; } } } } for ($i = 0; $i < $m - 2; $i++) { $goodSuffixes[$m - 1 - $suff[$i]] = $m - $i - 1; } } /** * Performs a search of the pattern into a given text * * @param string $pattern * @param string $text */ function boyer_moore($pattern, $text) { $n = strlen($text); $m = strlen($pattern); $goodSuffixes = array(); $badCharacters = array(); goodSuffixes($pattern, &$goodSuffixes); badCharacters($pattern, &$badCharacters); $j = 0; while ($j < $n - $m) { for ($i = $m - 1; $i >= 0 && $pattern[$i] == $text[$i + $j]; $i--); if ($i < 0) { // note that if the substring occurs more // than once into the text, the algorithm will // print out each position of the substring echo $j; $j += $goodSuffixes[0]; } else { $j += max($goodSuffixes[$i], $badCharacters[$text[$i + $j]] - $m + $i + 1); } } } // search using Boyer-Moore // will return 12 and 38 boyer_moore($pattern, $text);

Application

Boyer-Moore is one of the most used string searching algorithm in practice. It is intuitively clear where it can be useful, but yet again I’ll say only that this algorithm is considered as the mostly used in practice for search and replace operations in text editors.

21 thoughts on “Computer Algorithms: Boyer-Moore String Searching”

terry says:

May 12, 2012 at 5:39 pm

Very nice article! But, I think there is a mistake that the original Boyer-Moore algorithm is O(mn) in the worst case. A simple case is the text and the pattern both contain only one repeated charactor. It’s guaranteed to be O(m+n) only if the Galil Rule is applied!
Ashwin Jayaprakash says:

May 31, 2012 at 12:50 am

Thanks for writing these tutorials. I love the algo series!
tej says:

September 11, 2012 at 9:26 am

Thank you so much! This helped me a lot. Plus you explained it well.. thanks dude!
chris says:

October 4, 2012 at 1:02 pm

thanks, helped a lot, bloody confusing though
naresh kumar says:

October 29, 2012 at 9:48 am

please send me the boyer moore string search algo in C# language code.
thank you sir
Meroo says:

December 17, 2012 at 9:24 am

Thanks so much, your explanation is very well, please if you have the boyer moore string search algorithm in C# language, it’ll help me a lot.
thank you again
Stephan says:

December 19, 2012 at 2:44 am

I think there is a problem, when the matching part is at the end. When you call boyer_moore(“tester”, “supertester”); does not match. I think the reason is “while ($j < $n – $m) {" of boyer_moore().
I changed it to "while ($j <= $n – $m)". Will there be side-effects?
Yu says:

May 5, 2013 at 6:53 pm

great!
I like the algorithm
Bill says:

June 11, 2013 at 9:31 pm

I agree with Stephen and his solution. I tried it will adding 1 to $n – $m, but less than or equal has the same effect.
samsul says:

September 17, 2013 at 6:34 pm

great!
pattern matching boyer moore.
i have problem – application text searching with php use boyer moore….
please send me a example source code

thanks…
samsul says:

September 17, 2013 at 6:37 pm

great!
pattern matching boyer moore.
i have problem – application text searching with php use boyer moore….
please send me a example source code
samsuljambrong@gmail.com

thanks…
Li Yu says:

October 12, 2013 at 12:12 pm

“In case the mismatched letter isn’t contained into the pattern we move forward the pattern!”
Your example is:
text: …bA
pattern: …cA , pattern has no b
So you jump over A.
Is that right?

For example, let A=”123″
text: …b123c123
pattern: 123c123
next should compare from first A( =”123″), not letter c, as you said, over whold pattern.
Am I right?
samsul says:

December 5, 2013 at 6:27 am

with this algoritm i think have a problem…
if less from 3 character can’t find solutin so what we do…?
fatima says:

January 5, 2014 at 10:17 pm

great information….i want the code in c++ language
SK Ahsan says:

October 28, 2014 at 4:47 pm

Sir Its Awesome Work It Is Very Helpful For A Person Who Wanna to Research Thesis For MS In Algorithm Analysis(String Matching)
Bogdans says:

December 26, 2015 at 3:45 am

hello. you’ve got a relatively little mistake which can cause the algorithm to skip searches

it’s located here
for ($i = 0; $i < $m – 2; $i++) {
$goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
}

the code should be replaced with
for ($i = 0; $i <= $m – 2; $i++) {
$goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
}

testcase is
pattern: abbccab
text: abaccabaabbccababbccab

the result should be 8 and 15
with the old version it's 15 only

PS: thanks for the PHP solution and the article. nicely done!
kyohandi says:

May 12, 2016 at 12:45 pm

hello bro, please tell me how to running this string matching code ? or about the implementation ?
need response urgently
Rishan says:

July 7, 2016 at 5:38 am

Great help for Boyer Moore. Thanks
loki says:

November 9, 2017 at 3:59 am

Thanks
agus says:

April 24, 2018 at 12:57 pm

how to call the function for text searching for my application?
i don’t really understand. please help me
cathy says:

April 26, 2018 at 9:34 am

Call-time pass-by-reference has been removed on line 108.. how to fixed it ?