How do I do Entity Extraction in Lucene

Question

I'm trying to do entity extraction (more like matching) in Lucene. Here is a sample workflow:

Given some text (from a URL) and a list of people's names, try to extract the names of people from the text.

Note:

Names of people are not completely normalized, e.g. some are Mr. X or Mrs. Y, and some are just John Doe, X and Y. Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't get me started on non-US names).

I am using Lucene's MemoryIndex to create an in-memory index of the text from each URL (after stripping HTML tags), and I am using StandardAnalyzer to query for each name in the list, one at a time. With 100k names, this takes about 8 seconds on the average text I have. Is there any other way to do this?
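One alternative to issuing one query per name is to invert the lookup: tokenize the text once and test every short run of consecutive tokens against a hash set built from the name list. The sketch below is plain Python rather than Lucene, and the helper names (`tokenize`, `build_name_index`, `find_names`) and the simple regex tokenizer are assumptions made for illustration, not part of any library.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_name_index(names):
    """Map each tokenized name to its original spelling and track the longest name."""
    index = {}
    max_len = 0
    for name in names:
        key = tuple(tokenize(name))
        if key:
            index[key] = name
            max_len = max(max_len, len(key))
    return index, max_len

def find_names(text, index, max_len):
    """Return the listed names that occur as a contiguous run of tokens in the text."""
    tokens = tokenize(text)
    found = set()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            key = tuple(tokens[start:start + length])
            if key in index:
                found.add(index[key])
    return found

names = ["John Doe", "Mr. John Doe", "Jane Roe"]
index, max_len = build_name_index(names)
text = "Yesterday Mr. John Doe met with Jane Roe in Boston."
print(sorted(find_names(text, index, max_len)))
# prints ['Jane Roe', 'John Doe', 'Mr. John Doe']
```

With this layout the work per document grows with the number of tokens times the longest name length, not with the size of the name list, so 100k names cost roughly the same as 100. It only finds exact token sequences, so it does nothing for scoring or fuzzy matches.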

A major problem is that, to eliminate noise, I'm using 0.01 as a minimum score. If the text contains "John Doe", a query like "Mr. John Doe" scores significantly lower than "John Doe" and in many cases misses the 0.01 threshold.

The other problem is that if I normalize all names by removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
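A possible middle ground between keeping titles verbatim and stripping them is to split each name into prefix, core, and suffix, match on the core, and then reject a candidate only when both sides carry a title and the titles disagree. The sketch below is plain Python; the `PREFIXES` and `SUFFIXES` lists and the `compatible` rule are illustrative assumptions, deliberately incomplete, and would need tuning against real data.

```python
import re

# Illustrative, incomplete lists. "I" is left out on purpose because it
# collides with the pronoun in running text.
PREFIXES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def split_name(name):
    """Split a name into (prefix tokens, core name, suffix tokens), all lowercased."""
    tokens = re.findall(r"[A-Za-z']+", name)
    prefix, suffix = [], []
    while tokens and tokens[0].lower() in PREFIXES:
        prefix.append(tokens.pop(0).lower())
    while tokens and tokens[-1].lower() in SUFFIXES:
        suffix.insert(0, tokens.pop().lower())
    core = " ".join(t.lower() for t in tokens)
    return tuple(prefix), core, tuple(suffix)

def compatible(listed, mention):
    """True if the cores are equal and any titles present on both sides agree."""
    l_prefix, l_core, l_suffix = split_name(listed)
    m_prefix, m_core, m_suffix = split_name(mention)
    if not l_core or l_core != m_core:
        return False
    if l_prefix and m_prefix and l_prefix != m_prefix:
        return False
    if l_suffix and m_suffix and l_suffix != m_suffix:
        return False
    return True

print(compatible("Mr. John Doe", "John Doe"))               # prints True
print(compatible("Dr. John Edward II", "Mr. John Edward"))  # prints False
print(compatible("Dr. John Edward II", "John Edward II"))   # prints True
```

Under this rule a bare "John Edward" in the text would still be accepted for "Dr. John Edward II", because a missing title is treated as unknown, not as a conflict. Whether that is acceptable depends on how much noise the name list can tolerate.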

I understand that Lucene might not be the right tool for the job, but so far it hasn't proved to be too bad. Any help is appreciated.

Answer 1:

Named entity extraction (NEE) is an NLP task that is not part of Lucene. For open source, you can look at LingPipe, GATE, and OpenNLP. There are various commercial alternatives.

GATE is entirely rule-based, and will be hard to use for high precision. You'll need a statistical engine for that; LingPipe has one, but you have to supply the training data. I'm not up to date on what OpenNLP offers in this area.

Answer 2:

Disambiguation of human names is notoriously difficult. If you have other information, such as locations or co-occurrence of names, it will be valuable. But a lot of work is still going into author disambiguation, and it cannot normally be solved from a list of names alone.

Here is a typical project http://code.google.com/p/bibapp/wiki/AuthorAuthorities . And a typical publication http://www.springerlink.com/content/lk07h1m311t130w4/.

Here is a project on record deduplication, which we find useful for author disambiguation: http://datamining.anu.edu.au/projects/linkage.html

Answer 3:

These projects could be useful for you:

http://nlp.stanford.edu/ner/index.shtml

http://cogcomp.cs.illinois.edu/page/software_view/4

Answer 4:

OpenNLP is useful: http://opennlp.apache.org/

The site has documentation and examples.

For the completely uninitiated, the book Taming Text (http://www.manning.com/ingersoll/) provides a good overview. You can also download the book's source code from the above link.

Answer 5:

You can try this: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

The documentation is clear. You can also use the DBpedia Spotlight web service:

http://spotlight.dbpedia.org/rest/spot/?text=
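The URL above ends in `text=`, so the text to be analyzed has to be appended in URL-encoded form. A minimal Python sketch that only builds the request URL follows; the endpoint is copied from this answer as-is and may no longer be live, and the helper name `build_spot_url` is an assumption for illustration.

```python
from urllib.parse import quote

# Endpoint as given in the answer; treat its availability as unverified.
SPOT_ENDPOINT = "http://spotlight.dbpedia.org/rest/spot/?text="

def build_spot_url(text):
    """Append the percent-encoded text to the spotting endpoint."""
    return SPOT_ENDPOINT + quote(text, safe="")

print(build_spot_url("Dr. John Edward II spoke in Boston"))
# prints http://spotlight.dbpedia.org/rest/spot/?text=Dr.%20John%20Edward%20II%20spoke%20in%20Boston
```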

Source: https://stackoverflow.com/questions/4308132/how-do-i-do-entity-extraction-in-lucene

Tags

lucene

named-entity-extraction