Extract Relevant Tags/Keywords from a Text Block

庸人自扰 2021-02-03 11:16

I want an implementation in which the user provides a block of text like:

"Requirements - Working knowledge, on LAMP Environment using Lin

4 Answers
  • 2021-02-03 11:58

    It depends on whether you only want to show the keywords/tags to the client, or want to extract them from the block of text and then do further computation with them.

    If you only need to show them, then client-side handling is fine. If you need them for further computation, handle it on the server side.

    I can recommend a JavaScript client-side implementation if you can supply some more details. If you want to generically "know" the keywords, then some kind of clever solution is necessary.

    If you have a list of keywords, then you can use regular expressions to extract the data.
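
    As a rough sketch of the regular-expression approach with a known keyword list: the helper name extractKnownKeywords and the sample keyword list below are assumptions made for illustration, not something from the question. It builds one case-insensitive alternation from the list and returns the distinct keywords found in the text.

    ```php
    <?php
    // Hypothetical helper: returns the known keywords that occur in $text.
    function extractKnownKeywords(string $text, array $keywords): array
    {
        $keywords = array_filter($keywords, 'strlen');
        if (count($keywords) === 0) {
            return [];
        }
        // Try longer keywords first so "Web 2.0" wins over "Web".
        usort($keywords, function ($a, $b) { return strlen($b) - strlen($a); });
        // Escape each keyword so characters like "." are matched literally.
        $escaped = array_map(function ($k) { return preg_quote($k, '/'); }, $keywords);
        // Lookarounds keep matches on word boundaries, even for keywords ending in a digit.
        $pattern = '/(?<!\w)(?:' . implode('|', $escaped) . ')(?!\w)/i';
        preg_match_all($pattern, $text, $matches);
        // Lowercase and de-duplicate, keeping first-seen order.
        return array_values(array_unique(array_map('strtolower', $matches[0])));
    }

    $text = "Working knowledge on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5. Comfortable with JSON.";
    $known = ['Linux', 'Apache', 'MySQL', 'PHP', 'JSON', 'Python'];
    echo implode(', ', extractKnownKeywords($text, $known)), "\n";
    // prints: linux, apache, mysql, php, json
    ```

    This only finds terms you already know about; it does nothing for discovering new keywords.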

  • 2021-02-03 12:10

    A very naive method is to remove common stop words from the text, leaving you with more meaningful words like 'Standards', 'JSON', etc. You will still get a lot of noise, however, so you may want to consider a service like OpenCalais, which can do a rather sophisticated analysis of your text.

    Update:

    Okay, the link in my previous answer pointed to implementations, but you asked for one so a simple one is here:

    function stopWords($text, $stopwords) {
    
      // Trim line breaks and spaces from the stop words and lowercase them
      $stopwords = array_map(function($x){return trim(strtolower($x));}, $stopwords);
    
      // Replace every digit and non-word character with a comma
      $pattern = '/[0-9\W]/';
      $text = preg_replace($pattern, ',', $text);
    
      // Create an array from $text
      $text_array = explode(",",$text);
    
      // Trim whitespace and lowercase each word in $text
      $text_array = array_map(function($x){return trim(strtolower($x));}, $text_array);
    
      // Initialize the result so array_filter() always receives an array
      $keywords = array();
      foreach ($text_array as $term) {
        if (!in_array($term, $stopwords)) {
          $keywords[] = $term;
        }
      }
    
      // Drop empty strings; the original keys are preserved, hence the gaps in the output
      return array_filter($keywords);
    }
    
    $stopwords = file('stop_words.txt');
    $text = "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable.";
    
    print_r(stopWords($text, $stopwords));
    

    You can see this, and the contents of stop_words.txt, in this Gist.

    Running the above on your example text produces the following array:

    Array
    (
        [0] => requirements
        [4] => linux
        [6] => apache
        [10] => mysql
        [13] => php
        [25] => json
        [28] => frameworks
        [30] => zend
        [34] => browser
        [35] => javascripting
        [37] => jquery
        [38] => etc
        [42] => software
        [43] => preferable
    )
    

    So, like I said, this is somewhat naive and could use more optimization (it's also slow), but it does pull out the more relevant keywords from your text. You would need to fine-tune the stop words as well. Capturing terms like Web 2.0 will be very difficult, because the pattern above splits on digits and punctuation, so again I think you would be better off using a serious service like OpenCalais, which can understand a text and return a list of entities and references. DocumentCloud relies on this very service to gather information from documents.
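
    One possible optimization, as a sketch only: most of the cost comes from in_array(), which scans the whole stop-word list for every term. Flipping the list into array keys makes each check a hash lookup. The name extractKeywordsFast and the inline stop-word list are assumptions for the demo and are not part of the Gist. Unlike stopWords() above, this variant returns a reindexed array with no gaps in the keys.

    ```php
    <?php
    // Hypothetical variant of stopWords(): same idea, but with hash-based stop-word lookups.
    function extractKeywordsFast(string $text, array $stopwords): array
    {
        // Normalize the stop words, drop empty entries, and flip them into array keys.
        $normalized = array_map(function ($w) { return strtolower(trim($w)); }, $stopwords);
        $lookup = array_flip(array_filter($normalized, 'strlen'));

        // Split on runs of digits and non-word characters, discarding empty pieces.
        $terms = preg_split('/[0-9\W]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        $keywords = [];
        foreach ($terms as $term) {
            // isset() on an array key is a constant-time check.
            if (!isset($lookup[$term])) {
                $keywords[] = $term;
            }
        }
        return $keywords;
    }

    // A tiny inline stop-word list, assumed here only for the demo.
    $demoStopwords = ['on', 'and', 'with', 'of', 'using'];
    echo implode(', ', extractKeywordsFast('Working on Linux, Apache 2 and MySQL 5 with JSON', $demoStopwords)), "\n";
    // prints: working, linux, apache, mysql, json
    ```

    The splitting rule is the same as before, so multi-word or numeric terms such as Web 2.0 are still lost.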

    Also, for a client-side implementation you could do pretty much the same thing with JavaScript, and probably much more cleanly (although it could be slow for the client).

  • 2021-02-03 12:14

    I did a quick review of these this morning and, to my surprise, the one that performed best with my test phrase is written in PHP:

    • http://code.fivefilters.org/term-extraction
    • demo: http://fivefilters.org/term-extraction/

    What looked like the most professional one performed abysmally: viewer.opencalais.com

    Others that were OK (I'm not sure what language they're written in):

    • www.nactem.ac.uk/software/termine/#form
    • www.alchemyapi.com/api/keyword/
  • 2021-02-03 12:14

    This is not easy to do, because it requires some type of fuzzy logic. You should use the Yahoo Term Extractor through YQL.

    Check it out: link
