Computer Algorithms: Boyer-Moore String Searching

Introduction

Have you ever asked yourself which is the algorithm used to find a word after clicking Ctrl+F and typing something? Well I guess you know the answer from the title, but in this article you’ll find out how exactly this is done.

As we saw from the Morris-Pratt string searching we don’t need to compare the text and the pattern character by character. Some comparisons can be skipped in order to improve the performance of the string searching. Indeed the brute force string searching and the Rabin-Karp algorithm are quite slow only because they compare the pattern and the text character by character.

In the other hand the Morris-Pratt algorithm is a very good improvement of the brute force string searching, but the question remains. Is there any algorithm that is faster than Morris-Pratt – is there any way to skip more comparisons and to move the pattern faster.

It’s clear that if we have to find whether a single character is contained into a text we need at least “n” steps, where n is the length of the text. Once we have to find whether a pattern with the length of “m” is contained into a text with length of “n” the case is getting a little more complex.

However the answer is that there is such algorithm that is faster and more suitable than Morris-Pratt. This is the Boyer-Moore string searching.

Overview

Boyer-Moore is an algorithm that improves the performance of pattern searching into a text by considering some observations. It is defined in 1977 by Robert S. Boyer and J Strother Moore and it consist of some specific features.

First of all this algorithm starts comparing the pattern from the leftmost part of text and moves it to the right, as on the picture below.

Boyer-Moore Shifting Direction
In Boyer-Moore the pattern is shifted from left to right!

Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.

Boyer-Moore Comparison Model
Unlike other algorithms the letters of the pattern are compared from right to left!

The main idea of Boyer-Moore in order to improve the performance are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see by the following examples what they are standing for.

Good-suffix Shifts

Just like the Morris-Pratt algorithm we start to compare the pattern against some portion of the text where a possible match will occur. In Boyer-Moore as I said this is done from the rightmost letter of the pattern. After some characters have matched we find a mismatch.

Boyer-Moore a Mismatch

So how can we move the pattern to the right in order to skip unusual comparisons. To answer this question we need to explore the pattern. Let’s say there is a portion of the pattern that is repeated inside the pattern itself, like it is shown on the picture below.

Boyer-Moore Good-suffix Shift 1
The pattern may consist of repeating portions of characters!

In this case we must move the pattern thus the repeated portion must now align with its first occurrence in the pattern.

A variation of this case is when the portion from the pattern A overlaps with another portion that consists of the same characters.

Boyer-Moore Good Suffix Shift 2
Sometimes these portions may overlap!

Yet again the shift must align the second portion with its first occurrence.

Finally only a portion of A, let’s say “B”, can happen to occur in the very beginning of the pattern, as on the diagram below.

Boyer-Moore Good Suffix Shift 3
Only a sub-string of the pattern may re-occur at its front!

Now we must align the left end of the pattern with the rightmost occurrence of “B”.

Bad Character Shifts

Beside the good-suffix shifts the Boyer-Moore algorithm make use of the so called bad-character shifts. In case of a mismatch we can skip comparisons in case the character in the text doesn’t happen to appear in the pattern. To become clearer let’s see the following examples.

Boyer-Moore Bad Character 1
If the mismatched letter of the text appears in the pattern only in its front we can align it easily!

In the picture above we see that the mismatched character “B” from the text appears only in the beginning of the pattern. Thus we can simply shift the pattern to the right and align both characters B, skipping comparisons. An even better case is described by the following diagram where the mismatched letter isn’t contained into the pattern at all. Then we can shift forward the whole pattern.

Boyer-Moore Bad Character 2
In case the mismatched letter isn't contained into the pattern we move forward the pattern!

Maximum of Good-suffix and Bad-Character shifts

Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch the maximum of both is considered in order to move the pattern to the right.

Complexity

It’s clear that Boyer-Moore is faster than Morris-Pratt, but actually its worst-case complexity is O(n+m). The thing is that in natural language search Boyer-Moore does pretty well.

Boyer-Moore Complexity
Worst-case scenario of Boyer-Moore - O(m+n)

Implementation

Finally let’s see the implementation in PHP, which can be easily “transcribed” into any other programming language. The only thing we need is the structures for bad-character shifts and good-suffixes shifts.

<?php
 
/**
 * Pattern we're searching for
 *
 * @var string
 */
$pattern = 'gloria';
 
/**
 * The text we're searching in
 *
 * @var string
 */
$text = 'Sic transit gloria mundi, non transit gloria Gundi!';
 
/**
 * Calculates the suffixes for a given pattern
 *
 * @param string $pattern
 * @param array  $suffixes
 */
function suffixes($pattern, &$suffixes)
{
   $m = strlen($pattern);
 
   $suffixes[$m - 1] = $m;
   $g = $m - 1;
 
   for ($i = $m - 2; $i >= 0; --$i) {
      if ($i > $g && $suffixes[$i + $m - 1 - $f] < $i - $g) {
         $suffixes[$i] = $suffixes[$i + $m - 1 - $f];
      } else {
         if ($i < $g) {
            $g = $i;
         }
         $f = $i;
 
         while ($g >= 0 && $pattern[$g] == $pattern[$g + $m - 1 - $f]) {
            $g--;
         }
         $suffixes[$i] = $f - $g;
      }
   }
}
 
/**
 * Fills in the array of bad characters.
 *
 * @param string $pattern
 * @param array  $badChars
 */
function badCharacters($pattern, &$badChars)
{
   $m = strlen($pattern);
 
   for ($i = 0; $i < $m - 1; ++$i) {
      $badChars[$pattern{$i}] = $m - $i - 1;
   }
}
 
/**
 * Fills in the array of good suffixes
 *
 * @param string $pattern
 * @param array  $goodSuffixes
 */
function goodSuffixes($pattern, &$goodSuffixes)
{
   $m 		= strlen($pattern);
   $suff 	= array();
 
   suffixes($pattern, $suff);
 
   for ($i = 0; $i < $m; $i++) {
      $goodSuffixes[$i] = $m;
   }
 
   for ($i = $m - 1; $i >= 0; $i--) {
      if ($suff[$i] == $i + 1) {
         for ($j = 0; $j < $m - $i - 1; $j++) {
            if ($goodSuffixes[$j] == $m) {
               $goodSuffixes[$j] = $m - $i - 1;
            }
         }
      }
   }
 
   for ($i = 0; $i < $m - 2; $i++) {
      $goodSuffixes[$m - 1 - $suff[$i]] = $m - $i - 1;
   }
}
 
/**
 * Performs a search of the pattern into a given text
 *
 * @param string $pattern
 * @param string $text
 */
function boyer_moore($pattern, $text)
{
   $n = strlen($text);
   $m = strlen($pattern);
 
   $goodSuffixes 	= array();
   $badCharacters 	= array();
 
   goodSuffixes($pattern, &$goodSuffixes);
   badCharacters($pattern, &$badCharacters);
 
   $j = 0;
   while ($j < $n - $m) {
      for ($i = $m - 1; $i >= 0 && $pattern[$i] == $text[$i + $j]; $i--);
      if ($i < 0) {
         // note that if the substring occurs more
         // than once into the text, the algorithm will
         // print out each position of the substring
         echo $j;
         $j += $goodSuffixes[0];
      } else {
         $j += max($goodSuffixes[$i], $badCharacters[$text[$i + $j]] - $m + $i + 1);
      }
   }
}
 
// search using Boyer-Moore
// will return 12 and 38
boyer_moore($pattern, $text);

Application

Boyer-Moore is one of the most used string searching algorithm in practice. It is intuitively clear where it can be useful, but yet again I’ll say only that this algorithm is considered as the mostly used in practice for search and replace operations in text editors.

Boyer-Moore Application

21 thoughts on “Computer Algorithms: Boyer-Moore String Searching

  1. Very nice article! But, I think there is a mistake that the original Boyer-Moore algorithm is O(mn) in the worst case. A simple case is the text and the pattern both contain only one repeated charactor. It’s guaranteed to be O(m+n) only if the Galil Rule is applied!

  2. Thanks so much, your explanation is very well, please if you have the boyer moore string search algorithm in C# language, it’ll help me a lot.
    thank you again

  3. I think there is a problem, when the matching part is at the end. When you call boyer_moore(“tester”, “supertester”); does not match. I think the reason is “while ($j < $n – $m) {" of boyer_moore().
    I changed it to "while ($j <= $n – $m)". Will there be side-effects?

  4. I agree with Stephen and his solution. I tried it will adding 1 to $n – $m, but less than or equal has the same effect.

  5. great!
    pattern matching boyer moore.
    i have problem – application text searching with php use boyer moore….
    please send me a example source code

    thanks…

  6. “In case the mismatched letter isn’t contained into the pattern we move forward the pattern!”
    Your example is:
    text: …bA
    pattern: …cA , pattern has no b
    So you jump over A.
    Is that right?

    For example, let A=”123″
    text: …b123c123
    pattern: 123c123
    next should compare from first A( =”123″), not letter c, as you said, over whold pattern.
    Am I right?

  7. with this algoritm i think have a problem…
    if less from 3 character can’t find solutin so what we do…?

  8. Sir Its Awesome Work It Is Very Helpful For A Person Who Wanna to Research Thesis For MS In Algorithm Analysis(String Matching)

  9. hello. you’ve got a relatively little mistake which can cause the algorithm to skip searches

    it’s located here
    for ($i = 0; $i < $m – 2; $i++) {
    $goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
    }

    the code should be replaced with
    for ($i = 0; $i <= $m – 2; $i++) {
    $goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
    }

    testcase is
    pattern: abbccab
    text: abaccabaabbccababbccab

    the result should be 8 and 15
    with the old version it's 15 only

    PS: thanks for the PHP solution and the article. nicely done!

  10. hello bro, please tell me how to running this string matching code ? or about the implementation ?
    need response urgently

  11. how to call the function for text searching for my application?
    i don’t really understand. please help me

Leave a Reply

Your email address will not be published. Required fields are marked *