Sunday, October 21, 2012

Rabin Carp String Matching Algorithm

     Rabin Carp Algorithm is also one of the string matching algorithm. This algorithm is also an improved version of Naive or Brute Force String Matching Algorithm. Because this algorithm is the a very basic sub-string matching algorithm, but it’s good for some reasons. For example it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow. That is why in many cases brute force matching can’t be very useful.
     Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and to check it against a hashed sub-string from the text in 1987.Rabin Carp algorithm is one of the better string matching algorithm than Brute Force algorithm. This algorithm avoid comparison of every character of pattern with characters of text in each position. To do that this algorithm uses hashing. Hashing is the process of converting our data in to numerical value. To convert data in to numerical value we should have assign a numerical value for each text in the alphabet and using any hashing functions we can calculate hash value of any pattern. 
As an example if our pattern P=AABCB and we get ascii values of characters.
Then ascii value of A=65,B=66,C=67 and we can get hash value of above pattern using any hashing functions.
Hash function 1
     hash value h(P) = 65+65+66+67+66 = 329
Hash function 2
     hash value h(P) = 65*1+65*2+66*3+67*4+66*5 =991
Like above you can use any hash functions to calculate hash values.
We can say that if two strings are equal, then hash values of these two string must be same. But if hash values of two strings are equal then we can't say these two strings are equal. I may be or not.  This is the basic idea of the Rabin Carp algorithm.

Algorithm
 Compute hash value of pattern[h(P)]  
    Compute hash value of sub string of text[h(t)]  
    If h(P)=h(t)  
      Compare pattern and sub string character by character  
      If mismatch found  
         move by one position and go to second step  
      Else  
         matching sub string found in the text  
    Else  
      move by one position and go to second step  
Lets learn the algorithm using an example.
     Alphabet S = {A,B,C,D,E,F,G,H}
     Text T = AABDCFFACGABBCABHGADEAADG
     Pattern P = BBCABHG

Here i am going to assign value of character in the alphabet instead of assigning ascii values.
Value(A) = 1
Value(B) = 2
Value(C) = 3
Value(D) = 4
Value(E) = 5
Value(F) = 6
Value(G) = 7
Value(H) = 8
     Then hash value of pattern  h(P) = 2+2+3+1+2+8+7 = 25




Hash value of [AABDCFF] is 23. Hash values are not equal. Sub string is not matched. Move by one position.
Hash value of [ABDCFFA] is 23. Hash values are not equal. Sub string is not matched. Move by one position. 

Hash value of [BDCFFAC] is 25. Hash values are equal. Sub string may be matched. Compare pattern with sub string in the text.
First letter B matched with sub string. But second one mismatched. Move by one position.
Hash value of [DCFFACG] is 29. Hash values are not equal. Sub string is not matched. Move by one position. 
Hash value of [CFFACGA] is 26. Hash values are not equal. Sub string is not matched. Move by one position. 
Hash value of [FFACGAB] is 25. Hash values are equal. Sub string may be matched. Compare pattern with sub string in the text.
First letter B  mismatched. Move by one position. Like these you can find matching sub string.
C implementation of Rabin Carp Algorithm

Complexity

The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. So where it is compared to brute-force matching? Well, brute force matching complexity is O(nm), so as it seems there’s no much gain in performance. However it’s considered that Rabin-Karp’s complexity is O(n+m) in practice, and that makes it a bit faster, as shown on the chart below.
Rabin-Karp's complexity is O(nm), but in practice it's O(n+m)!
Note that the Rabin-Karp algorithm also needs O(m) preprocessing time.

Advantages

  1. Not faster than brute force matching in theory, but in practice its complexityis O(n+m)
  2. Good hashing function it can be quite effective and it’s easy to implement!
  3. Multiple pattern matching support
  4. Good for plagiarism, because it can deal with multiple pattern matching!

Disadvantages

  1. There are lots of string matching algorithms that are faster than O(n+m)
  2. It’s practically as slow as brute force matching and it requires additional space
Rabin-Karp is a great algorithm for one simple reason – it can be used to match against multiple pattern. This makes it perfect to detect plagiarism even for larger phrases.

0 comments:

Post a Comment