For a web application I'm building I need to analyze a website, retrieve and rank its most important keywords and display those.
Getting all words, their density and
@ refining 'Steps'
Regarding these many steps, I would go with a slightly 'enhanced' solution, stitching some of your steps together.
I'm not sure whether a full lexer is better; it is if you design it perfectly to fit your needs, e.g. to look only for text within hX etc. But you would have to mean serious business, since it can be a headache to implement. That said, I will make my point and say that a Flex / Bison solution in another language (PHP offers poor support here, as it is such a high-level language) would be an 'insane' speed boost.
However, luckily libxml provides magnificent features and, as the following should show, you will end up having multiple steps in one. Before the point where you analyse the contents, set up the language (stopwords), minify the NodeList set and work from there: put the parts you keep into separate fields and free the rest along the way, e.g. unset($fullpage);
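As a tiny sketch of the stopword step (the helper name and the word list are placeholders; in practice you would load the list for the detected language):

// hedged sketch: strip stopwords before counting; $stopwords and the helper
// name are placeholders, not part of the code further down
function filter_stopwords(array $words, array $stopwords) {
    return array_values(array_diff($words, $stopwords));
}

// e.g. a tiny English list; in practice load one per detected language
$stopwords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in');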
While using DOM parsers, one should realize that, depending on the setup, the href and src attributes may need further validation, e.g. with parse_url and the like.
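For example, a minimal sanity check on an href value could look like this (the helper name and the rules are only an illustration):

// hedged sketch: basic sanity check of an href/src value using parse_url
function is_followable_url($href) {
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    // only follow http(s) links; skip mailto:, javascript:, bare fragments etc.
    // note: relative links have no scheme and would need resolving against the base URL first
    return isset($parts['scheme']) && in_array($parts['scheme'], array('http', 'https'));
}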
Another way of getting around the timeout / memory consumption issues is to call php-cli (this also works on a Windows host), 'get on with business' and start the next document. See this question for more info.
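A rough sketch of that hand-off (the script path and argument are hypothetical):

// hedged sketch: hand the next document off to php-cli so the current request
// is not bound by max_execution_time
$cmd = 'php ' . escapeshellarg('/path/to/analyse_page.php')
     . ' ' . escapeshellarg($nextPageId);
// on a POSIX host: discard output and background the process;
// on Windows, something like 'start /B' would be used instead
exec($cmd . ' > /dev/null 2>&1 &');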
If you scroll down a bit, look at the proposed schema - the initial crawl would only put the body in the database (and additionally lang in your case), and then a cron script would fill in the ft_index columns using the following function
function analyse() {
    // suppress libxml parse warnings on malformed HTML instead of buffering output
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML("<html>" . $this->html_entity_decode("UTF-8") . "</html>");
    libxml_clear_errors();

    // one bucket per weight, matching the ft_index5 / ft_index10 / ft_index15 columns below
    $weighted_ft = array('5' => "", '10' => "", '15' => "");

    // relevance weight 5 (highest): main headlines
    foreach ($doc->getElementsByTagName('h1') as $h) {
        $text = $h->textContent;
        // check/filter stopwords and uniqueness here
        // do so with the other weights as well, basically narrow it down before counting
        $weighted_ft['5'] .= " " . $text;
    }

    // relevance weight 10 (medium): sub-headlines
    foreach ($doc->getElementsByTagName('h2') as $h) {
        $weighted_ft['10'] .= " " . $h->textContent;
    }

    // relevance weight 15 (lesser): paragraph text
    foreach ($doc->getElementsByTagName('p') as $p) {
        $weighted_ft['15'] .= " " . $p->textContent;
    }

    // count word frequency / prominence per weight bucket
    $frequencies = array();
    foreach ($weighted_ft as $weight => $text) {
        $words = preg_split('/\W+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($frequencies[$weight][$word])) {
                $frequencies[$weight][$word] = 0;
            }
            $frequencies[$weight][$word]++;
        }
    }

    return array('ft' => $weighted_ft, 'frequencies' => $frequencies);
}

function html_entity_decode($toEncoding) {
    $encoding = mb_detect_encoding($this->body, "ASCII,JIS,UTF-8,ISO-8859-1,ISO-8859-15,EUC-JP,SJIS");
    $body = mb_convert_encoding($this->body, $toEncoding, ($encoding != "" ? $encoding : "auto"));
    return html_entity_decode($body, ENT_QUOTES, $toEncoding);
}
The above belongs to a class resembling your database row, which has the page 'body' field loaded beforehand.
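A minimal sketch of how such a class could be wired up (the class name, constructor and PDO usage are purely illustrative placeholders; only analyse(), html_entity_decode() and the body field come from above):

// hedged sketch of the surrounding class; everything except analyse(),
// html_entity_decode() and the 'body' field is a hypothetical placeholder
class PageAnalyser {
    private $body;

    public function __construct($body) {
        $this->body = $body; // the 'body' column fetched from oo_pages
    }

    // analyse() and html_entity_decode() from above go here
}

// hypothetical usage: fetch one row and analyse it
// $row = $pdo->query("SELECT id, body FROM oo_pages LIMIT 1")->fetch();
// $analyser = new PageAnalyser($row['body']);
// $result = $analyser->analyse();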
Again, as far as database handling goes, I ended up inserting the above parsed result into full-text flagged table columns so that future lookups go seamlessly. This is a huge advantage for the db engine.
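As a rough sketch of that update step (assuming a PDO connection in $pdo, the page id in $pageId and the array returned by analyse() above in $result - all placeholder names):

// hedged sketch: write the three weighted strings back into the ft_index columns
$stmt = $pdo->prepare(
    "UPDATE oo_pages
        SET ft_index5 = :ft5, ft_index10 = :ft10, ft_index15 = :ft15,
            ft_lastmodified = NOW()
      WHERE id = :id"
);
$stmt->execute(array(
    ':ft5'  => $result['ft']['5'],
    ':ft10' => $result['ft']['10'],
    ':ft15' => $result['ft']['15'],
    ':id'   => $pageId,
));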
Note on full-text indexing:
When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.
Your indexing algorithm filters out some words, ok. But these are ranked by how much weight they carry - there is a strategy to think out here, since a single full-text string does not carry over the weights given. That is why the example uses, as a basic strategy, splitting the text into 3 different strings.
Once put into the database, the columns should then resemble this, so a schema could look like the following, where we maintain the weights - and still offer a superfast query method
CREATE TABLE IF NOT EXISTS `oo_pages` (
  `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `alias` varchar(255) COLLATE utf8_danish_ci NOT NULL COMMENT 'Unique page alias / URL slug',
  `body` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'PageBody entity encoded html',
  `title` varchar(31) COLLATE utf8_danish_ci NOT NULL,
  `ft_index5` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted highest',
  `ft_index10` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted medium',
  `ft_index15` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted lesser',
  `ft_lastmodified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'last cron run',
  PRIMARY KEY (`id`),
  UNIQUE KEY `alias` (`alias`),
  FULLTEXT KEY `ft_index5` (`ft_index5`),
  FULLTEXT KEY `ft_index10` (`ft_index10`),
  FULLTEXT KEY `ft_index15` (`ft_index15`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
One may add an index like so:
ALTER TABLE `oo_pages` ADD FULLTEXT (
`named_column`
)
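To carry the weights over into ranking at query time, one could combine the three MATCH scores with multipliers, roughly like this (the 3 / 2 / 1 factors are just an example to tune; 'keyword' stands for the search term):

-- hedged sketch: rank pages by combining the per-column relevance scores,
-- weighting the ft_index5 column highest, as per the schema comments above
SELECT id, title,
       (MATCH (ft_index5)  AGAINST ('keyword') * 3) +
       (MATCH (ft_index10) AGAINST ('keyword') * 2) +
       (MATCH (ft_index15) AGAINST ('keyword'))      AS relevance
  FROM oo_pages
 WHERE MATCH (ft_index5)  AGAINST ('keyword')
    OR MATCH (ft_index10) AGAINST ('keyword')
    OR MATCH (ft_index15) AGAINST ('keyword')
 ORDER BY relevance DESC;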
The thing about detecting the language and then selecting your stopword database from that point is a feature I myself have left out, but it's nifty - and by the book! So kudos for your efforts and this answer :)
Also, keep in mind there's not only the title tag, but also anchor / img title attributes. If for some reason your analytics goes into a spider-like state, I would suggest combining the referring link's title and textContent with the target page.
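A rough sketch of collecting those hints while spidering (assuming the $doc DOMDocument from the analyse() example; the variable names are placeholders):

// hedged sketch: collect anchor title attributes and link text, keyed by target href,
// so they can later be merged into the target page's keyword pool
$linkHints = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href === '') {
        continue;
    }
    $linkHints[$href][] = trim($a->getAttribute('title') . ' ' . $a->textContent);
}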