Basically, I have an array of keywords, and a piece of text. I am wondering what would be the best way to find out if any of those keywords are present in the text, bearing in m
Depending on the size of the string, you could use a hash to make it faster.
First iterate over the text, using each word as a key in an array:
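$string = array();
foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word)
{
    // use the word as a key so it can be looked up with isset()
    $string[$word] = 1;
}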
Then iterate over the keywords, checking each one against $string:
foreach ($keywords as $keyword)
{
    if (isset($string[$keyword]))
    {
        // $keyword exists in string
    }
}
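For what it's worth, the two loops above can be rolled into one function. This is only a minimal sketch: the name findKeywords is made up for illustration, and lowercasing both sides (so the match is case-insensitive) is my assumption, not something the question asks for. Words with punctuation attached, like "cat.", will not match with a plain whitespace split.

```php
<?php
// Hypothetical helper: returns the keywords that occur as whole words in $text.
function findKeywords(array $keywords, string $text): array
{
    // Build a set of the words in the text (word => true), skipping empty pieces.
    // Lowercasing is an assumption made here for case-insensitive matching.
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $seen = array_fill_keys($words, true);

    $found = array();
    foreach ($keywords as $keyword) {
        // isset() on an array key is a hash lookup, not a scan of the text.
        if (isset($seen[strtolower($keyword)])) {
            $found[] = $keyword;
        }
    }
    return $found;
}

$hits = findKeywords(array('dog', 'cat', 'bird'), 'There is a dog chasing a cat');
// $hits is array('dog', 'cat')
```

The text is split once and every keyword check after that is a single key lookup, which is where the speedup should come from when there are many keywords.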
EDIT: If your text is much smaller than your keyword list, do it backwards: check each word of the text against the keywords. This would likely be faster than the above if the text is pretty short.
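// isset() tests keys, so flip the keyword list into keyword => index
$keywordSet = array_flip($keywords);
foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word)
{
    if (isset($keywordSet[$word]))
    {
        // $word is a keyword; might be faster if the text has fewer words than there are keywords
    }
}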
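If you only need a yes/no answer, the reverse direction can also stop at the first hit. A rough sketch, assuming $keywords is a plain list of strings; the function name textContainsAnyKeyword is hypothetical.

```php
<?php
// Hypothetical helper: true as soon as any word of $text is a keyword.
function textContainsAnyKeyword(array $keywords, string $text): bool
{
    // array_flip turns array('dog', 'cat') into array('dog' => 0, 'cat' => 1),
    // so isset() can test membership by key.
    $keywordSet = array_flip($keywords);

    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (isset($keywordSet[$word])) {
            return true; // stop at the first hit
        }
    }
    return false;
}

$hit = textContainsAnyKeyword(array('dog', 'cat'), 'a dog barks');
// $hit is true
```

Flipping the keyword list costs one pass over the keywords, so this mostly pays off when the same keyword list is reused or the text is short.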
Assuming the text is plainly formatted and you only care whether any (not which) of the keywords are present, you could try something like:
$keywords = array( "dog", "cat" );
// get a valid regex; preg_match needs the surrounding / delimiters
$test = "/(\b".implode( "\b)|(\b", $keywords )."\b)/";
if( preg_match( $test, "there is a dog chasing a cat down the road" ) )
    print "keyword hit";
I really don't know if it is more efficient, but you could try putting them all in one regex like this: (keyword1|keyword2|...). With the preg_quote function you can escape the keywords for the regex. PHP has no explicit "compiled" option, but it caches compiled patterns internally, so reusing the same pattern on multiple strings should avoid recompiling it.
Working off eWolf's idea...
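foreach($keywords as &$keyword) {
    // pass the delimiter so a "/" inside a keyword is escaped too
    $keyword = preg_quote($keyword, '/');
}
unset($keyword); // break the reference left over from the loop
$regex = "/(". implode('|', $keywords) .")/";
return preg_match($regex, $str);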
You don't have to check for word boundaries if you don't want to, but if you do, just surround the group (the parentheses) with \b and it will match only whole words. You'll also want to make sure every member of the array has been through preg_quote, for safety.
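Putting the boundary and quoting advice together, something like the following should work. It is just a sketch: matchesAnyKeyword is a made-up name, and the guard for an empty keyword list is my own addition.

```php
<?php
// Hypothetical helper: true if any keyword appears as a whole word in $text.
function matchesAnyKeyword(array $keywords, string $text): bool
{
    if (count($keywords) === 0) {
        return false; // an empty group would match at any word boundary
    }
    $quoted = array();
    foreach ($keywords as $keyword) {
        // Pass '/' so the pattern delimiter is escaped as well.
        $quoted[] = preg_quote($keyword, '/');
    }
    // \b on both sides of the group restricts matches to whole words.
    $regex = '/\b(?:' . implode('|', $quoted) . ')\b/';
    return preg_match($regex, $text) === 1;
}

$hit = matchesAnyKeyword(array('dog', 'cat'), 'there is a dog chasing a cat down the road');
// $hit is true
```

Note that \b only behaves as expected when a keyword starts and ends with a word character (letter, digit or underscore); "dog" will not match inside "dogma", for example.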
You could split the text into an array of words and run array_intersect on the two arrays (or array_intersect_key, if both arrays are keyed by word). I am not sure of the performance of this though...
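A small sketch of the array-intersection idea, with sample data of my own. I have not benchmarked either variant, so treat it as an illustration of the calls rather than a performance recommendation.

```php
<?php
// Sample data (assumed for illustration).
$text = 'there is a dog chasing a cat down the road';
$keywords = array('dog', 'cat', 'bird');

// Split the text into words; the words become the array values.
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// array_intersect compares values and keeps the keys of its first argument,
// so array_values() is used to reindex the result.
$byValue = array_values(array_intersect($keywords, $words));

// array_intersect_key compares keys, so both arrays are flipped first.
$byKey = array_keys(array_intersect_key(array_flip($keywords), array_flip($words)));

// Both $byValue and $byKey are array('dog', 'cat')
```

Either result is empty when no keyword is present, so a plain count() on it answers the "any keyword?" question.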