Entity Extraction/Recognition with free tools while feeding Lucene Index

孤城傲影 2020-12-22 15:28

I'm currently investigating the options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene index.

4 Answers
  • 2020-12-22 15:58

    Lately I have been experimenting with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml

    The good thing is that you can train your own classifier. Follow this link for guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a Unfortunately, in my case, the named entities were not extracted reliably from the documents; most of the entities went undetected.
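    For reference, here is a minimal sketch of tagging text with one of the pre-trained classifiers (the model path and sample sentence are illustrative; the 3-class model ships with the Stanford NER download):

        import edu.stanford.nlp.ie.crf.CRFClassifier;
        import edu.stanford.nlp.ling.CoreLabel;
        import edu.stanford.nlp.util.Triple;

        public class StanfordNerDemo {
            public static void main(String[] args) throws Exception {
                // Pre-trained 3-class model (PERSON, LOCATION, ORGANIZATION) from the
                // Stanford NER download; adjust the path to your local copy.
                CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                        "classifiers/english.all.3class.distsim.crf.ser.gz");

                String text = "Tim Cook announced new products for Apple in Cupertino.";

                // Each triple is (label, beginOffset, endOffset) into the input string.
                for (Triple<String, Integer, Integer> t
                        : classifier.classifyToCharacterOffsets(text)) {
                    System.out.println(t.first() + ": "
                            + text.substring(t.second(), t.third()));
                }
            }
        }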

    Just in case you find it useful.

  • 2020-12-22 15:59

    You can use OpenNLP to extract names of people, places and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/

    For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
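    To give a feel for the API, here is a minimal sketch using the pre-trained person-name model (it assumes you have downloaded en-ner-person.bin from the models page above; the path and sample sentence are illustrative):

        import opennlp.tools.namefind.NameFinderME;
        import opennlp.tools.namefind.TokenNameFinderModel;
        import opennlp.tools.tokenize.SimpleTokenizer;
        import opennlp.tools.util.Span;

        import java.io.FileInputStream;
        import java.io.InputStream;

        public class OpenNlpNerDemo {
            public static void main(String[] args) throws Exception {
                try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
                    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
                    NameFinderME finder = new NameFinderME(model);

                    // The name finder works on tokens, not raw text.
                    String[] tokens = SimpleTokenizer.INSTANCE
                            .tokenize("Pierre Vinken is 61 years old .");

                    for (Span span : finder.find(tokens)) {
                        StringBuilder name = new StringBuilder();
                        for (int i = span.getStart(); i < span.getEnd(); i++) {
                            name.append(tokens[i]).append(' ');
                        }
                        System.out.println(span.getType() + ": " + name.toString().trim());
                    }

                    // Reset between documents so adaptive data does not leak across them.
                    finder.clearAdaptiveData();
                }
            }
        }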

  • 2020-12-22 16:06

    Rosoka is a commercial product that computes a "salience" value, which measures how important a term or entity is to the document. Salience is based on linguistic usage, not frequency. Using the salience values, you can determine the primary topic of the document as a whole. The output is your choice of XML or JSON, which makes it very easy to use with Lucene. It is written in Java. An Amazon Cloud version is available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0; the cost to try it out is $0.99/hour. The Rosoka Cloud version does not expose all of the Java API features that the full Rosoka product does. Both versions perform entity and term disambiguation based on linguistic usage.
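    Whatever extractor you choose, the Lucene side is simple: index the article body as an analyzed text field and each extracted entity as a keyword field. Here is a generic sketch (not Rosoka-specific; the field names are illustrative, and it assumes a Lucene version where FSDirectory.open takes a java.nio.file.Path, i.e. 5.x or later):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.FSDirectory;

        import java.nio.file.Paths;

        public class EntityIndexer {
            public static void main(String[] args) throws Exception {
                try (IndexWriter writer = new IndexWriter(
                        FSDirectory.open(Paths.get("entity-index")),
                        new IndexWriterConfig(new StandardAnalyzer()))) {

                    Document doc = new Document();
                    // Full article body, analyzed for free-text search.
                    doc.add(new TextField("body",
                            "Wicket Wystri Warrick was a male Ewok scout.", Field.Store.YES));
                    // One keyword field per extracted entity; a document may have many.
                    doc.add(new StringField("person", "Wicket Wystri Warrick", Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }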

    Disambiguation, whether done by a human or by software, requires enough contextual information to tell the senses apart. The context may come from within the document, from a corpus constraint, or from the user's own context; the first is the most specific, and the last carries the greatest potential ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, Apache software, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a software project's name, and so on.

  • The problem you are facing in the "wicket" example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organisation, location).

    For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBpedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:

    • Zemanta
    • Maui-indexer
    • DBpedia Spotlight
    • Extractiv (my company)

    These tools typically offer a language-independent REST API (a minimal call to DBpedia Spotlight's endpoint is sketched below), and I do not know whether any of them provides Lucene support directly, but I hope my answer has been helpful for the problem you are trying to solve.
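    As an illustration of the REST approach, here is a sketch of calling DBpedia Spotlight's public demo endpoint (it assumes Java 11+ for java.net.http; the endpoint URL and the confidence parameter reflect the public demo service and may change, so check the Spotlight documentation):

        import java.net.URI;
        import java.net.URLEncoder;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.charset.StandardCharsets;

        public class SpotlightDemo {
            public static void main(String[] args) throws Exception {
                String text = "The wicket is guarded by the batsman.";
                String url = "https://api.dbpedia-spotlight.org/en/annotate"
                        + "?text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                        + "&confidence=0.5";

                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .header("Accept", "application/json") // JSON instead of the default XML
                        .GET()
                        .build();

                HttpResponse<String> response = HttpClient.newHttpClient()
                        .send(request, HttpResponse.BodyHandlers.ofString());

                // Each surface form is resolved to a DBpedia URI in the "Resources" array.
                System.out.println(response.body());
            }
        }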
