Extract Relevant Tag/Keywords from Text block

Submitted by 蹲街弑〆低调 on 2019-12-03 10:39:08

Question


I want an implementation in which the user provides a block of text like:

"Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable."

I want to automatically select the relevant keywords and create tags from them. For the text above, the relevant tags would be: mysql, php, json, jquery, version control, oop, web2.0, javascript

How can I go about doing this in PHP, JavaScript, etc.? A head start would be really helpful.


Answer 1:


A very naive method is to remove common stopwords from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise, however, so you may want to consider a service like OpenCalais, which can do a rather sophisticated analysis of your text.

Update:

Okay, the link in my previous answer pointed to implementations, but you asked for one, so here is a simple one:

function stopWords($text, $stopwords) {

  // Trim line breaks and spaces from the stop words and lowercase them
  $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);

  // Replace all digits and non-word chars with a comma
  $pattern = '/[0-9\W]/';
  $text = preg_replace($pattern, ',', $text);

  // Create an array from $text
  $text_array = explode(",", $text);

  // Remove whitespace and lowercase words in $text
  $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);

  // Collect every term that is not a stop word
  $keywords = array();
  foreach ($text_array as $term) {
    if (!in_array($term, $stopwords)) {
      $keywords[] = $term;
    }
  }

  // Drop the empty strings left between adjacent separators (keys are preserved)
  return array_filter($keywords);
}

$stopwords = file('stop_words.txt');
$text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";

print_r(stopWords($text, $stopwords));

You can see this, and the contents of stop_words.txt, in this Gist.

Running the above on your example text produces the following array:

Array
(
    [0] => requirements
    [4] => linux
    [6] => apache
    [10] => mysql
    [13] => php
    [25] => json
    [28] => frameworks
    [30] => zend
    [34] => browser
    [35] => javascripting
    [37] => jquery
    [38] => etc
    [42] => software
    [43] => preferable
)

So, like I said, this is somewhat naive and could use more optimization (plus it's slow), but it does pull out the more relevant keywords from your text. You would need to do some fine-tuning of the stop words as well. Capturing terms like Web 2.0 will be very difficult, because the pattern above replaces digits and punctuation and so splits such terms apart. Again, I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.

Also, for a client-side implementation you could do pretty much the same thing with JavaScript, and probably much more cleanly (although it could be slow for the client).
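
As a rough illustration of the client-side variant, here is a minimal JavaScript sketch of the same stop-word idea. It is not a port of the PHP function above: the names `STOP_WORDS` and `extractKeywords` are made up for this example, the tiny stop-word list is only a placeholder for a real one, and the sketch additionally drops duplicate words.

```javascript
// Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "and", "as", "be", "etc", "of", "on",
  "such", "the", "using", "will", "with",
]);

// Return the lowercased words of `text` that are not stop words,
// in order of first appearance and without duplicates.
function extractKeywords(text, stopWords = STOP_WORDS) {
  const words = text
    .toLowerCase()
    .split(/[^a-z]+/)              // split on anything that is not an ASCII letter
    .filter((w) => w.length > 0);  // drop empty strings at the edges

  const seen = new Set();
  const keywords = [];
  for (const word of words) {
    if (!stopWords.has(word) && !seen.has(word)) {
      seen.add(word);
      keywords.push(word);
    }
  }
  return keywords;
}

console.log(extractKeywords("Comfortable with JSON and JQuery etc."));
// → [ 'comfortable', 'json', 'jquery' ]
```

Using a `Set` for the stop words keeps each lookup cheap, which helps with the speed concern mentioned above. Like the PHP version, this still loses terms such as Web 2.0, because digits and punctuation are treated as separators.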




Answer 2:


I did a quick review of these this morning and, to my surprise, the one that performed best with my test phrase was written in PHP:

  • http://code.fivefilters.org/term-extraction
  • demo: http://fivefilters.org/term-extraction/

What looked like the most professional one performed abysmally: viewer.opencalais.com

Others that were OK (not sure what language they're written in):

  • www.nactem.ac.uk/software/termine/#form
  • www.alchemyapi.com/api/keyword/



Answer 3:


This is not easy to do because it requires some type of fuzzy logic. You should use the Yahoo Term Extractor through YQL.

Check it out: link




Answer 4:


It depends on whether you only want to show the keywords/tags to the client, or whether you want to extract them from the block of text and then do further computation with them.

If you only need to show them, then client-side handling is fine. If you need them for further computation, then handle it on the server side.

I can recommend a client-side JavaScript implementation if you can supply some more details. If you want to generically "know" the keywords, then some kind of clever solution is necessary.

If you have a list of keywords, then you can use regular expressions to extract the data.
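
The known-keyword approach can be sketched in JavaScript roughly as follows. This is only an illustration under assumptions: the tag vocabulary in `TAG_PATTERNS` and the function name `matchTags` are hypothetical, and the patterns were written by hand for the sample text in the question.

```javascript
// Hypothetical tag vocabulary: each entry pairs a tag with a pattern that recognises it.
const TAG_PATTERNS = [
  ["mysql", /\bmysql\b/i],
  ["php", /\bphp\b/i],
  ["json", /\bjson\b/i],
  ["jquery", /\bjquery\b/i],
  ["javascript", /\bjavascript/i],            // also matches "Javascripting"
  ["oop", /\boops?\b/i],                      // matches "OOP" and "OOPs"
  ["web2.0", /\bweb\s*2\.0\b/i],              // matches "Web 2.0" and "Web2.0"
  ["version control", /\bversion\s+control\b/i],
];

// Return every tag whose pattern occurs somewhere in `text`,
// in the order the tags are listed in `patterns`.
function matchTags(text, patterns = TAG_PATTERNS) {
  return patterns
    .filter(([, pattern]) => pattern.test(text))
    .map(([tag]) => tag);
}

console.log(matchTags("Knowledge of Web 2.0 Standards - Comfortable with JSON"));
// → [ 'json', 'web2.0' ]
```

Because each pattern is matched against the whole text, multi-word and punctuated terms such as "version control" or "Web 2.0" survive, which plain word splitting cannot do. The trade-off is that only tags already in the list can ever be found.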



Source: https://stackoverflow.com/questions/4828154/extract-relevant-tag-keywords-from-text-block
