For a web application I'm building I need to analyze a website, retrieve and rank its most important keywords and display those.
Getting all words, their density and
@ refining 'Steps'
Regarding these many steps, I would go with a slightly 'enhanced' solution, stitching some of your steps together.
I'm not sure whether a full lexer is better; it is if you design it perfectly to fit your needs, e.g. to look only for text within hX etc. But you would have to mean serious business, since it can be a headache to implement. That said, I will make my point and say that a Flex / Bison solution in another language (PHP offers poor support here, as it is such a high-level language) would be an 'insane' speed boost.
However, luckily libxml provides magnificent features and, as the following should show, you will end up having multiple steps in one. Before the point where you analyse the contents, set up the language (stopwords), minify the NodeList set and work from there: put the parts you keep into separate fields and free the rest along the way, e.g. unset($fullpage);
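As a tiny sketch of the stopword step (the helper name and the word list are placeholders; in practice you would load the list for the detected language):

// hedged sketch: strip stopwords before counting; $stopwords and the helper
// name are placeholders, not part of the code further down
function filter_stopwords(array $words, array $stopwords) {
    return array_values(array_diff($words, $stopwords));
}

// e.g. a tiny English list; in practice load one per detected language
$stopwords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in');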
While using DOM parsers, one should realize that, depending on the setup, the href and src attributes may need further validation, e.g. with parse_url and the like.
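For example, a minimal sanity check on an href value could look like this (the helper name and the rules are only an illustration):

// hedged sketch: basic sanity check of an href/src value using parse_url
function is_followable_url($href) {
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    // only follow http(s) links; skip mailto:, javascript:, bare fragments etc.
    // note: relative links have no scheme and would need resolving against the base URL first
    return isset($parts['scheme']) && in_array($parts['scheme'], array('http', 'https'));
}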
Another way of getting around the timeout / memory consumption issues is to call php-cli (this also works on a Windows host), 'get on with business' and start the next document. See this question for more info.
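A rough sketch of that hand-off (the script path and argument are hypothetical):

// hedged sketch: hand the next document off to php-cli so the current request
// is not bound by max_execution_time
$cmd = 'php ' . escapeshellarg('/path/to/analyse_page.php')
     . ' ' . escapeshellarg($nextPageId);
// on a POSIX host: discard output and background the process;
// on Windows, something like 'start /B' would be used instead
exec($cmd . ' > /dev/null 2>&1 &');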
If you scroll down a bit, look at the proposed schema - the initial crawl would only put the body in the database (and additionally lang in your case), and then a cron script would fill in the ft_index columns using the following function
function analyse() {
    // suppress libxml parse warnings on malformed HTML instead of buffering output
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML("<html>" . $this->html_entity_decode("UTF-8") . "</html>");
    libxml_clear_errors();

    // one bucket per weight, matching the ft_index5 / ft_index10 / ft_index15 columns below
    $weighted_ft = array('5' => "", '10' => "", '15' => "");

    // relevance weight 5 (highest): main headlines
    foreach ($doc->getElementsByTagName('h1') as $h) {
        $text = $h->textContent;
        // check/filter stopwords and uniqueness here
        // do so with the other weights as well, basically narrow it down before counting
        $weighted_ft['5'] .= " " . $text;
    }

    // relevance weight 10 (medium): sub-headlines
    foreach ($doc->getElementsByTagName('h2') as $h) {
        $weighted_ft['10'] .= " " . $h->textContent;
    }

    // relevance weight 15 (lesser): paragraph text
    foreach ($doc->getElementsByTagName('p') as $p) {
        $weighted_ft['15'] .= " " . $p->textContent;
    }

    // count word frequency / prominence per weight bucket
    $frequencies = array();
    foreach ($weighted_ft as $weight => $text) {
        $words = preg_split('/\W+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($frequencies[$weight][$word])) {
                $frequencies[$weight][$word] = 0;
            }
            $frequencies[$weight][$word]++;
        }
    }

    return array('ft' => $weighted_ft, 'frequencies' => $frequencies);
}

function html_entity_decode($toEncoding) {
    $encoding = mb_detect_encoding($this->body, "ASCII,JIS,UTF-8,ISO-8859-1,ISO-8859-15,EUC-JP,SJIS");
    $body = mb_convert_encoding($this->body, $toEncoding, ($encoding != "" ? $encoding : "auto"));
    return html_entity_decode($body, ENT_QUOTES, $toEncoding);
}
The above belongs to a class resembling your database row, which has the page 'body' field loaded beforehand.
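A minimal sketch of how such a class could be wired up (the class name, constructor and PDO usage are purely illustrative placeholders; only analyse(), html_entity_decode() and the body field come from above):

// hedged sketch of the surrounding class; everything except analyse(),
// html_entity_decode() and the 'body' field is a hypothetical placeholder
class PageAnalyser {
    private $body;

    public function __construct($body) {
        $this->body = $body; // the 'body' column fetched from oo_pages
    }

    // analyse() and html_entity_decode() from above go here
}

// hypothetical usage: fetch one row and analyse it
// $row = $pdo->query("SELECT id, body FROM oo_pages LIMIT 1")->fetch();
// $analyser = new PageAnalyser($row['body']);
// $result = $analyser->analyse();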
Again, as far as database handling goes, I ended up inserting the above parsed result into full-text flagged table columns so that future lookups go seamlessly. This is a huge advantage for the db engine.
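As a rough sketch of that update step (assuming a PDO connection in $pdo, the page id in $pageId and the array returned by analyse() above in $result - all placeholder names):

// hedged sketch: write the three weighted strings back into the ft_index columns
$stmt = $pdo->prepare(
    "UPDATE oo_pages
        SET ft_index5 = :ft5, ft_index10 = :ft10, ft_index15 = :ft15,
            ft_lastmodified = NOW()
      WHERE id = :id"
);
$stmt->execute(array(
    ':ft5'  => $result['ft']['5'],
    ':ft10' => $result['ft']['10'],
    ':ft15' => $result['ft']['15'],
    ':id'   => $pageId,
));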
Note on full-text indexing:
When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.
Your indexing algorithm filters out some words, ok. But these are ranked by how much weight they carry - there is a strategy to think out here, since a single full-text string does not carry over the weights given. That is why the example uses, as a basic strategy, splitting the text into 3 different strings.
Once put into the database, the columns should then resemble this, so a schema could look like the following, where we maintain the weights - and still offer a superfast query method
CREATE TABLE IF NOT EXISTS `oo_pages` (
  `id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `alias` varchar(255) COLLATE utf8_danish_ci NOT NULL COMMENT 'Unique page alias / URL slug',
  `body` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'PageBody entity encoded html',
  `title` varchar(31) COLLATE utf8_danish_ci NOT NULL,
  `ft_index5` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted highest',
  `ft_index10` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted medium',
  `ft_index15` mediumtext COLLATE utf8_danish_ci NOT NULL COMMENT 'Regenerated cron-wise, weighted lesser',
  `ft_lastmodified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'last cron run',
  PRIMARY KEY (`id`),
  UNIQUE KEY `alias` (`alias`),
  FULLTEXT KEY `ft_index5` (`ft_index5`),
  FULLTEXT KEY `ft_index10` (`ft_index10`),
  FULLTEXT KEY `ft_index15` (`ft_index15`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
One may add an index like so:
ALTER TABLE `oo_pages` ADD FULLTEXT (
`named_column`
)
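To carry the weights over into ranking at query time, one could combine the three MATCH scores with multipliers, roughly like this (the 3 / 2 / 1 factors are just an example to tune; 'keyword' stands for the search term):

-- hedged sketch: rank pages by combining the per-column relevance scores,
-- weighting the ft_index5 column highest, as per the schema comments above
SELECT id, title,
       (MATCH (ft_index5)  AGAINST ('keyword') * 3) +
       (MATCH (ft_index10) AGAINST ('keyword') * 2) +
       (MATCH (ft_index15) AGAINST ('keyword'))      AS relevance
  FROM oo_pages
 WHERE MATCH (ft_index5)  AGAINST ('keyword')
    OR MATCH (ft_index10) AGAINST ('keyword')
    OR MATCH (ft_index15) AGAINST ('keyword')
 ORDER BY relevance DESC;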
The thing about detecting the language and then selecting your stopword database from that point is a feature I myself have left out, but it's nifty - and by the book! So kudos for your efforts and this answer :)
Also, keep in mind there's not only the title tag, but also anchor / img title attributes. If for some reason your analytics goes into a spider-like state, I would suggest combining the referring link's title and textContent with the target page.
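A rough sketch of collecting those hints while spidering (assuming the $doc DOMDocument from the analyse() example; the variable names are placeholders):

// hedged sketch: collect anchor title attributes and link text, keyed by target href,
// so they can later be merged into the target page's keyword pool
$linkHints = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href === '') {
        continue;
    }
    $linkHints[$href][] = trim($a->getAttribute('title') . ' ' . $a->textContent);
}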