stoimen's web log

Introduction

Brute force string matching is the a very basic sub-string matching algorithm, but it’s good for some reasons. For example it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow. That is why in many cases brute force matching can’t be very useful. For pattern matching we need something faster, but to understand other sub-string matching algorithms let’s take a look once again on brute force matching.

In brute force sub-string matching we checked every single character from the text with the first character of the pattern. Once we have a match between them we shift the comparison between the second character of the pattern with the next character of the text, as shown on the picture below.

Brute Froce Principles — Brute force string matching is slow because it compares every single character from the pattern and the text!

This algorithm is slow for mainly two reasons. First we have to check every single character from the text. On the other hand even if we find a match between a text character and the first character of the pattern we continue to check step by step (character by character) every single symbol of the pattern in order to find whether it is in the text. So is there any other approach to find whether the text contains the pattern?

In fact there is a “faster” approach. In this case in order to avoid the comparison between the pattern and the text character by character, we’ll try to compare them at once, so we need a good hash function. With its help we can hash the pattern and check against hashed sub-strings of the text. We must be sure that the hash function is returning “small” hash codes for larger sub-strings. Another problem is that for larger patterns we can’t expect to have short hashes. But besides this the approach should be quite effective compared to the brute force string matching.

That approach is known as Rabin-Karp algorithm.

Overview

Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and to check it against a hashed sub-string from the text in 1987. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. Such hash function, for instance, may use the ASCII codes for every character, but we must be careful for multi-lingual support.

Rabin-Karp Basic Principles — Rabin-Karp hashes the pattern and the sub-string in order to compare them quickly!

The hash function may vary depending on many things, so it may consist of ASCII char to number converting, but it can be also anything else. The only thing we need is to convert a string (pattern) into some hash that is faster to compare. Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”. Thus on every step we take from the text a sub-string with the length of m, where m is the pattern length. Thus we hash this sub-string and we can directly compare it to the hashed pattern, as on the picture above.

Implementation

So far we saw some diagrams explaining the Rabin-Karp algorithm, but let’s take a look on its implementation. Here in this very basic example where a simple hash table is used in order to convert the characters into integers. The code is PHP and it’s used only to illustrate the principles of this algorithm.

function hash_string($str, $len)
{
	$hash = '';
 
	$hash_table = array(
		'h' => 1,
		'e' => 2,
		'l' => 3,
		'o' => 4,
		'w' => 5,
		'r' => 6,
		'd' => 7,
	);
 
	for ($i = 0; $i < $len; $i++) {
		$hash .= $hash_table[$str{$i}];
	}
 
	return (int)$hash;
}
 
function rabin_karp($text, $pattern)
{
	$n = strlen($text);
	$m = strlen($pattern);
 
	$text_hash = hash_string(substr($text, 0, $m), $m);
	$pattern_hash = hash_string($pattern, $m);
 
	for ($i = 0; $i < $n-$m+1; $i++) {
		if ($text_hash == $pattern_hash) {
			return $i;
		}
 
		$text_hash = hash_string(substr($text, $i, $m), $m);
	}
 
	return -1;
}
 
// 2
echo rabin_karp('hello world', 'ello');

Multiple Pattern Match

It’s great to say that the Rabin-Karp algorithm is great for multiple pattern match. Indeed its nature is supposed to support such functionality, which is its advantage in compare to other string searching algorithms.

Complexity

The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. So where it is compared to brute-force matching? Well, brute force matching complexity is O(nm), so as it seems there’s no much gain in performance. However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice, and that makes it a bit faster, as shown on the chart below.

Note that the Rabin-Karp algorithm also needs O(m) preprocessing time.

Application

As we saw Rabin-Karp is not so faster than brute force matching. So where we should use it?

3 Reasons Why Rabin-Karp is Cool

1. Good for plagiarism, because it can deal with multiple pattern matching!

Application of Rabin-Karp — Rabin-Karp can detect plagiarism efficiently!

2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)!
3. With a good hashing function it can be quite effective and it’s easy to implement!

2 Reasons Why Rabin-Karp is Not Cool

1. There are lots of string matching algorithms that are faster than O(n+m)
2. It’s practically as slow as brute force matching and it requires additional space

Final Words

Rabin-Karp is a great algorithm for one simple reason – it can be used to match against multiple pattern. This makes it perfect to detect plagiarism even for larger phrases.

13 thoughts on “Computer Algorithms: Rabin-Karp String Searching”

Sathish says:

April 3, 2012 at 9:13 pm

Nice post. What about other algos that are faster than Rabin-Karp?
Daniel Lemire says:

April 7, 2012 at 5:22 am

A closely related wikipedia article might be of interest to the readers of this post:

http://en.wikipedia.org/wiki/Rolling_hash
Ignatz Ratzkewatzke says:

April 7, 2012 at 7:00 am

This article is incorrect in two important ways:

(1) The hash function provided is O(m), where m is the length of the pattern, and therefore the algorithm, as presented, is O(mn).

(2) The algorithm returns any string with the same hash as the pattern.

Students who might otherwise be confused by this article are invited to check out the Wikipedia article on the technique; its coverage is much more accurate.
Brian Cardarella says:

April 7, 2012 at 11:50 am

Very nice post!
Theo Vosse says:

April 9, 2012 at 10:15 am

Your implementation is wrong. You need a “rolling” hash. The complexity in your version is O(n*m), since you’re repeating the inner loop of the hash function n * m times. You can improve your code by first taking out the oldest character from the hash, then shifting all remaining characters one position, then adding the new one. That takes constant time, so complexity will become O(n+m).
May Thu Zar Soe says:

June 24, 2012 at 5:04 am

This algorithm is very useful for my thesis.
Dishan says:

December 3, 2012 at 5:51 am

Easy to understand and very clear.. keep it up…… 🙂
thank you & Lord buddhas bless you…!!!!
ATec says:

December 14, 2012 at 1:06 pm

Ignatz Ratzkewatzke says:
…
(2) The algorithm returns any string with the same hash as the pattern.
…
answer:

test with hash pattern not only used for return match ok.,it is useful for not equal patterns and so
eliminate possible other variations. after hashed check must be checked also normal pattern.
Syed Arshad says:

May 2, 2013 at 12:42 pm

Very nice post sir. I am referring T.H.Cormen’s “Introduction to Algorithms”. This post helped me to implement RK algorithm in my program 🙂

Thank you 🙂
Tomas says:

October 12, 2013 at 3:30 pm

A terrible, terrible hash function…
Bryan says:

October 23, 2013 at 9:40 am

thanks a lot, very good explanation especially to those who are trying to understand it for the first time. dont mind the haterzzzz
Mahnoor says:

April 30, 2017 at 5:50 pm

It’s really nice. It help me in my presentation. I understand it very well.
Thanks
Lynelle Peeden says:

February 19, 2018 at 6:26 pm

Sure thing & done.I’ve only hacked it so everything is converted to lowercase but the end result is that both UPPER & lower cases are accepted.