The Rabin-Karp string search algorithm is used to search text for a pattern.
A practical application of the algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.
Given a text of "The big dog jumped over the fox" and a search pattern of "ump" this will return 13. It starts by hashing "ump" then hashing "The". If hashed don't match then it slides the window a character at a time (e.g. "he ") and subtracts out the previous hash from the "T".
The Rabin-Karp algorithm uses a sliding window the size of the search pattern. It starts by hashing the search pattern, then hashing the first x characters of the text string where x is the length of the search pattern. It then slides the window one character over and uses the previous hash value to calculate the new hash faster. Only when it finds a hash that matches the hash of the search pattern will it compare the two strings it see if they are the same (to prevent a hash collision from producing a false positive).
The major search method is next. More implementation details are in rabin-karp.swift
public func search(text: String , pattern: String) -> Int {
// convert to array of ints
let patternArray = pattern.flatMap { $0.asInt }
let textArray = text.flatMap { $0.asInt }
if textArray.count < patternArray.count {
return -1
}
let patternHash = hash(array: patternArray)
var endIdx = patternArray.count - 1
let firstChars = Array(textArray[0...endIdx])
let firstHash = hash(array: firstChars)
if (patternHash == firstHash) {
// Verify this was not a hash collision
if firstChars == patternArray {
return 0
}
}
var prevHash = firstHash
// Now slide the window across the text to be searched
for idx in 1...(textArray.count - patternArray.count) {
endIdx = idx + (patternArray.count - 1)
let window = Array(textArray[idx...endIdx])
let windowHash = nextHash(prevHash: prevHash, dropped: textArray[idx - 1], added: textArray[endIdx], patternSize: patternArray.count - 1)
if windowHash == patternHash {
if patternArray == window {
return idx
}
}
prevHash = windowHash
}
return -1
}
This code can be tested in a playground using the following:
search(text: "The big dog jumped"", "ump")
This will return 13 since ump is in the 13 position of the zero based string.
Written by Bill Barbour