Peer to SamGoody and Lego Stormtroopr comments.
If you are looking for a PHP algorithm to rank search results based on proximity/relevance of multiple words
here comes a quick and easy way of generating search results with PHP only:
Issues with the other boolean search methods such as strpos()
, preg_match()
, strstr()
or stristr()
- can't search for multiple words
- results are unranked
PHP method based on Vector Space Model and tf-idf (term frequency–inverse document frequency):
It sounds difficult but is surprisingly easy.
If we want to search for multiple words in a string the core problem is how we assign a weight to each one of them?
If we could weight the terms in a string based on how representative they are of the string as a whole,
we could order our results by the ones that best match the query.
This is the idea of the vector space model, not far from how SQL full-text search works:
function get_corpus_index($corpus = array(), $separator=' ') {
$dictionary = array();
$doc_count = array();
foreach($corpus as $doc_id => $doc) {
$terms = explode($separator, $doc);
$doc_count[$doc_id] = count($terms);
// tf–idf, short for term frequency–inverse document frequency,
// according to wikipedia is a numerical statistic that is intended to reflect
// how important a word is to a document in a corpus
foreach($terms as $term) {
if(!isset($dictionary[$term])) {
$dictionary[$term] = array('document_frequency' => 0, 'postings' => array());
}
if(!isset($dictionary[$term]['postings'][$doc_id])) {
$dictionary[$term]['document_frequency']++;
$dictionary[$term]['postings'][$doc_id] = array('term_frequency' => 0);
}
$dictionary[$term]['postings'][$doc_id]['term_frequency']++;
}
//from http://phpir.com/simple-search-the-vector-space-model/
}
return array('doc_count' => $doc_count, 'dictionary' => $dictionary);
}
function get_similar_documents($query='', $corpus=array(), $separator=' '){
$similar_documents=array();
if($query!=''&&!empty($corpus)){
$words=explode($separator,$query);
$corpus=get_corpus_index($corpus, $separator);
$doc_count=count($corpus['doc_count']);
foreach($words as $word) {
if(isset($corpus['dictionary'][$word])){
$entry = $corpus['dictionary'][$word];
foreach($entry['postings'] as $doc_id => $posting) {
//get term frequency–inverse document frequency
$score=$posting['term_frequency'] * log($doc_count + 1 / $entry['document_frequency'] + 1, 2);
if(isset($similar_documents[$doc_id])){
$similar_documents[$doc_id]+=$score;
}
else{
$similar_documents[$doc_id]=$score;
}
}
}
}
// length normalise
foreach($similar_documents as $doc_id => $score) {
$similar_documents[$doc_id] = $score/$corpus['doc_count'][$doc_id];
}
// sort from high to low
arsort($similar_documents);
}
return $similar_documents;
}
CASE 1
$query = 'are';
$corpus = array(
1 => 'How are you?',
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULT
Array
(
[1] => 0.52832083357372
)
CASE 2
$query = 'are';
$corpus = array(
1 => 'how are you today?',
2 => 'how do you do',
3 => 'here you are! how are you? Are we done yet?'
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULTS
Array
(
[1] => 0.54248125036058
[3] => 0.21699250014423
)
CASE 3
$query = 'we are done';
$corpus = array(
1 => 'how are you today?',
2 => 'how do you do',
3 => 'here you are! how are you? Are we done yet?'
);
$match_results=get_similar_documents($query,$corpus);
echo '<pre>';
print_r($match_results);
echo '</pre>';
RESULTS
Array
(
[3] => 0.6813781191217
[1] => 0.54248125036058
)
There are plenty of improvements to be made
but the model provides a way of getting good results from natural queries,
which don't have boolean operators such as strpos()
, preg_match()
, strstr()
or stristr()
.
NOTA BENE
Optionally eliminating redundancy prior to search the words
1. Normalisation
- Convert all text to lower case
2. Stopword elimination
- Eliminate words from the text which carry no real meaning (like 'and', 'or', 'the', 'for', etc.)
3. Dictionary substitution
Replace words with others which have an identical or similar meaning.
(ex:replace instances of 'hungrily' and 'hungry' with 'hunger')
Further algorithmic measures (snowball) may be performed to further reduce words to their essential meaning.
The replacement of colour names with their hexadecimal equivalents
The reduction of numeric values by reducing precision are other ways of normalising the text.
RESOURCES
- http://linuxgazette.net/164/sephton.html
- http://snowball.tartarus.org/
- MySQL Fulltext Search Score Explained
- http://dev.mysql.com/doc/internals/en/full-text-search.html
- http://en.wikipedia.org/wiki/Vector_space_model
- http://en.wikipedia.org/wiki/Tf%E2%80%93idf
- http://phpir.com/simple-search-the-vector-space-model/