Introduction
Have you ever asked yourself which is the algorithm used to find a word after clicking Ctrl+F and typing something? Well I guess you know the answer from the title, but in this article you’ll find out how exactly this is done.
As we saw from the Morris-Pratt string searching we don’t need to compare the text and the pattern character by character. Some comparisons can be skipped in order to improve the performance of the string searching. Indeed the brute force string searching and the Rabin-Karp algorithm are quite slow only because they compare the pattern and the text character by character.
In the other hand the Morris-Pratt algorithm is a very good improvement of the brute force string searching, but the question remains. Is there any algorithm that is faster than Morris-Pratt – is there any way to skip more comparisons and to move the pattern faster.
It’s clear that if we have to find whether a single character is contained into a text we need at least “n” steps, where n is the length of the text. Once we have to find whether a pattern with the length of “m” is contained into a text with length of “n” the case is getting a little more complex.
However the answer is that there is such algorithm that is faster and more suitable than Morris-Pratt. This is the Boyer-Moore string searching.
Overview
Boyer-Moore is an algorithm that improves the performance of pattern searching into a text by considering some observations. It is defined in 1977 by Robert S. Boyer and J Strother Moore and it consist of some specific features.
First of all this algorithm starts comparing the pattern from the leftmost part of text and moves it to the right, as on the picture below.
Unlike other string searching algorithms though, Boyer-Moore compares the pattern against a possible match from right to left as shown below.
The main idea of Boyer-Moore in order to improve the performance are some observations of the pattern. In the terminology of this algorithm they are called good-suffix and bad-character shifts. Let’s see by the following examples what they are standing for.
Good-suffix Shifts
Just like the Morris-Pratt algorithm we start to compare the pattern against some portion of the text where a possible match will occur. In Boyer-Moore as I said this is done from the rightmost letter of the pattern. After some characters have matched we find a mismatch.
So how can we move the pattern to the right in order to skip unusual comparisons. To answer this question we need to explore the pattern. Let’s say there is a portion of the pattern that is repeated inside the pattern itself, like it is shown on the picture below.
In this case we must move the pattern thus the repeated portion must now align with its first occurrence in the pattern.
A variation of this case is when the portion from the pattern A overlaps with another portion that consists of the same characters.
Yet again the shift must align the second portion with its first occurrence.
Finally only a portion of A, let’s say “B”, can happen to occur in the very beginning of the pattern, as on the diagram below.
Now we must align the left end of the pattern with the rightmost occurrence of “B”.
Bad Character Shifts
Beside the good-suffix shifts the Boyer-Moore algorithm make use of the so called bad-character shifts. In case of a mismatch we can skip comparisons in case the character in the text doesn’t happen to appear in the pattern. To become clearer let’s see the following examples.
In the picture above we see that the mismatched character “B” from the text appears only in the beginning of the pattern. Thus we can simply shift the pattern to the right and align both characters B, skipping comparisons. An even better case is described by the following diagram where the mismatched letter isn’t contained into the pattern at all. Then we can shift forward the whole pattern.
Maximum of Good-suffix and Bad-Character shifts
Boyer-Moore needs both good-suffix and bad-character shifts in order to speed up searching performance. After a mismatch the maximum of both is considered in order to move the pattern to the right.
Complexity
It’s clear that Boyer-Moore is faster than Morris-Pratt, but actually its worst-case complexity is O(n+m). The thing is that in natural language search Boyer-Moore does pretty well.
Implementation
Finally let’s see the implementation in PHP, which can be easily “transcribed” into any other programming language. The only thing we need is the structures for bad-character shifts and good-suffixes shifts.
<?php /** * Pattern we're searching for * * @var string */ $pattern = 'gloria'; /** * The text we're searching in * * @var string */ $text = 'Sic transit gloria mundi, non transit gloria Gundi!'; /** * Calculates the suffixes for a given pattern * * @param string $pattern * @param array $suffixes */ function suffixes($pattern, &$suffixes) { $m = strlen($pattern); $suffixes[$m - 1] = $m; $g = $m - 1; for ($i = $m - 2; $i >= 0; --$i) { if ($i > $g && $suffixes[$i + $m - 1 - $f] < $i - $g) { $suffixes[$i] = $suffixes[$i + $m - 1 - $f]; } else { if ($i < $g) { $g = $i; } $f = $i; while ($g >= 0 && $pattern[$g] == $pattern[$g + $m - 1 - $f]) { $g--; } $suffixes[$i] = $f - $g; } } } /** * Fills in the array of bad characters. * * @param string $pattern * @param array $badChars */ function badCharacters($pattern, &$badChars) { $m = strlen($pattern); for ($i = 0; $i < $m - 1; ++$i) { $badChars[$pattern{$i}] = $m - $i - 1; } } /** * Fills in the array of good suffixes * * @param string $pattern * @param array $goodSuffixes */ function goodSuffixes($pattern, &$goodSuffixes) { $m = strlen($pattern); $suff = array(); suffixes($pattern, $suff); for ($i = 0; $i < $m; $i++) { $goodSuffixes[$i] = $m; } for ($i = $m - 1; $i >= 0; $i--) { if ($suff[$i] == $i + 1) { for ($j = 0; $j < $m - $i - 1; $j++) { if ($goodSuffixes[$j] == $m) { $goodSuffixes[$j] = $m - $i - 1; } } } } for ($i = 0; $i < $m - 2; $i++) { $goodSuffixes[$m - 1 - $suff[$i]] = $m - $i - 1; } } /** * Performs a search of the pattern into a given text * * @param string $pattern * @param string $text */ function boyer_moore($pattern, $text) { $n = strlen($text); $m = strlen($pattern); $goodSuffixes = array(); $badCharacters = array(); goodSuffixes($pattern, &$goodSuffixes); badCharacters($pattern, &$badCharacters); $j = 0; while ($j < $n - $m) { for ($i = $m - 1; $i >= 0 && $pattern[$i] == $text[$i + $j]; $i--); if ($i < 0) { // note that if the substring occurs more // than once into the text, the algorithm will // print out each position of the substring echo $j; $j += $goodSuffixes[0]; } else { $j += max($goodSuffixes[$i], $badCharacters[$text[$i + $j]] - $m + $i + 1); } } } // search using Boyer-Moore // will return 12 and 38 boyer_moore($pattern, $text); |
Application
Boyer-Moore is one of the most used string searching algorithm in practice. It is intuitively clear where it can be useful, but yet again I’ll say only that this algorithm is considered as the mostly used in practice for search and replace operations in text editors.
Very nice article! But, I think there is a mistake that the original Boyer-Moore algorithm is O(mn) in the worst case. A simple case is the text and the pattern both contain only one repeated charactor. It’s guaranteed to be O(m+n) only if the Galil Rule is applied!
Thanks for writing these tutorials. I love the algo series!
Thank you so much! This helped me a lot. Plus you explained it well.. thanks dude!
thanks, helped a lot, bloody confusing though
please send me the boyer moore string search algo in C# language code.
thank you sir
Thanks so much, your explanation is very well, please if you have the boyer moore string search algorithm in C# language, it’ll help me a lot.
thank you again
I think there is a problem, when the matching part is at the end. When you call boyer_moore(“tester”, “supertester”); does not match. I think the reason is “while ($j < $n – $m) {" of boyer_moore().
I changed it to "while ($j <= $n – $m)". Will there be side-effects?
great!
I like the algorithm
I agree with Stephen and his solution. I tried it will adding 1 to $n – $m, but less than or equal has the same effect.
great!
pattern matching boyer moore.
i have problem – application text searching with php use boyer moore….
please send me a example source code
thanks…
great!
pattern matching boyer moore.
i have problem – application text searching with php use boyer moore….
please send me a example source code
samsuljambrong@gmail.com
thanks…
“In case the mismatched letter isn’t contained into the pattern we move forward the pattern!”
Your example is:
text: …bA
pattern: …cA , pattern has no b
So you jump over A.
Is that right?
For example, let A=”123″
text: …b123c123
pattern: 123c123
next should compare from first A( =”123″), not letter c, as you said, over whold pattern.
Am I right?
with this algoritm i think have a problem…
if less from 3 character can’t find solutin so what we do…?
great information….i want the code in c++ language
Sir Its Awesome Work It Is Very Helpful For A Person Who Wanna to Research Thesis For MS In Algorithm Analysis(String Matching)
hello. you’ve got a relatively little mistake which can cause the algorithm to skip searches
it’s located here
for ($i = 0; $i < $m – 2; $i++) {
$goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
}
the code should be replaced with
for ($i = 0; $i <= $m – 2; $i++) {
$goodSuffixes[$m – 1 – $suff[$i]] = $m – $i – 1;
}
testcase is
pattern: abbccab
text: abaccabaabbccababbccab
the result should be 8 and 15
with the old version it's 15 only
PS: thanks for the PHP solution and the article. nicely done!
hello bro, please tell me how to running this string matching code ? or about the implementation ?
need response urgently
Great help for Boyer Moore. Thanks
Thanks
how to call the function for text searching for my application?
i don’t really understand. please help me
Call-time pass-by-reference has been removed on line 108.. how to fixed it ?