Named entity recognition with Java


Question


I would like to use named entity recognition (NER) to find adequate tags for texts in a database. Instead of using tools like NLTK or LingPipe, I want to build my own tool.

So my questions are:

  • Which algorithm should I use?

  • How hard is it to build this tool?


Answer 1:


I did this some time ago when I studied Markov chains.

Anyway, the answers are:

Which algorithm should I use?

Stanford NLP, for example, uses Conditional Random Fields (CRF). If you don't approach this rigorously, you will be fighting a losing battle. There is no simple way to parse human language: its structure is complex, and it has tons of exceptions.

How hard is it to build this tool?

Well, if you know what you are doing, it's not that hard at all. Entering the rules and logic can be tedious and time-consuming, and fixing bugs can be nontrivial. But after 20 years, you can make something almost useful (for yourself).




Answer 2:


  1. There is a vast number of information extraction algorithms; to name a few: regular expressions, statistical methods, machine-learning-based approaches, dictionaries, etc. You can find a complete overview of these methods in this survey.
  2. Yes, it is hard to build a tool that finds tags with high precision, because it requires a lot of testing and tuning.
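Of the methods listed above, regular expressions are the quickest to try. A minimal sketch in Java — the two patterns here are illustrative assumptions, not production-grade rules:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal regex-based extractor: each named pattern yields (type, match) pairs. */
public class RegexExtractor {
    // Illustrative patterns only; real IE systems use far more robust rules.
    private static final Pattern EMAIL =
        Pattern.compile("\\b[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b");
    private static final Pattern DATE =
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    public static List<String[]> extract(String text) {
        List<String[]> entities = new ArrayList<>();
        for (Matcher m = EMAIL.matcher(text); m.find(); )
            entities.add(new String[] {"EMAIL", m.group()});
        for (Matcher m = DATE.matcher(text); m.find(); )
            entities.add(new String[] {"DATE", m.group()});
        return entities;
    }

    public static void main(String[] args) {
        String text = "Mail john@example.com before 2011-04-06.";
        for (String[] e : extract(text))
            System.out.println(e[0] + ": " + e[1]);
    }
}
```

This works well for entities with a rigid surface form (emails, dates, IDs); for names of people or organizations it breaks down quickly, which is why the statistical methods mentioned above exist.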

The easiest-to-implement algorithm for finding tags consists of two steps:

  • Extract candidates for tags
  • Find the most significant (most distinctive) tags

In the first step you can take one of two approaches:

  • Use entity names as tag candidates (this requires an information extraction framework)
  • Use nouns or noun groups as tag candidates (this requires a part-of-speech tagger)
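Absent an IE framework or a POS tagger, the candidate-extraction step can be roughly approximated in plain Java by taking runs of capitalized words as candidates. This heuristic is an assumption made for illustration only — it also picks up ordinary sentence-initial words — but it shows the shape of the step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Heuristic candidate extraction: runs of capitalized words (e.g. "New York City"). */
public class CandidateExtractor {
    // One or more consecutive capitalized words; sentence-initial words are
    // not filtered out here, so this over-generates candidates.
    private static final Pattern CAPITALIZED_RUN =
        Pattern.compile("\\b[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\b");

    public static List<String> candidates(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CAPITALIZED_RUN.matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }
}
```

For `"We met Alan Turing in Manchester yesterday."` this returns "Alan Turing" and "Manchester" as candidates (along with the spurious "We"), which the second, weighting step is then responsible for pruning.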

In the second step, you should use tf-idf to weight the tags across the document corpus and discard all tags whose tf-idf weight falls below a given threshold.
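The tf-idf weighting and threshold filter described above can be sketched in plain Java. The idf smoothing term and the choice of threshold are assumptions; real systems tune both:

```java
import java.util.ArrayList;
import java.util.List;

/** Tf-idf weighting of candidate tags across a corpus, discarding low-scoring ones. */
public class TfIdfFilter {

    /** Term frequency of a term in one (pre-tokenized) document. */
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Inverse document frequency across the corpus (with +1 smoothing, an assumption). */
    static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + docsWithTerm));
    }

    /** Keep only candidate tags whose tf-idf in `doc` exceeds `threshold`. */
    static List<String> filter(List<String> candidates, List<String> doc,
                               List<List<String>> corpus, double threshold) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates)
            if (tf(c, doc) * idf(c, corpus) > threshold) kept.add(c);
        return kept;
    }
}
```

With the smoothed idf, a term appearing in every document scores below zero and is dropped even at a threshold of 0, which is exactly the behavior you want for stop-word-like candidates such as "the".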

If you need a more powerful algorithm, look into topic-detection frameworks or research papers on the topic. Also check LSA; quoting Wikipedia:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
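As a concrete starting point for LSA, here is a sketch that builds the term-document count matrix; the actual SVD factorization would typically come from a linear-algebra library (Apache Commons Math is one assumed option, not shown here):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds the term-document count matrix that LSA would then factor with SVD. */
public class TermDocumentMatrix {

    /** Returns a terms-by-documents count matrix; the vocabulary, in first-seen
     *  order, is appended to {@code vocabOut}. */
    public static double[][] build(List<List<String>> docs, List<String> vocabOut) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String term : doc)
                index.putIfAbsent(term, index.size());
        vocabOut.addAll(index.keySet());

        double[][] matrix = new double[index.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++)
            for (String term : docs.get(d))
                matrix[index.get(term)][d]++;
        // A truncated (rank-k) SVD of `matrix` yields the latent "concept" space
        // the Wikipedia definition above refers to.
        return matrix;
    }
}
```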




Answer 3:


NLTK is an open-source project. You might want to explore it a bit: see how it is done, and maybe get involved in the community, rather than trying to solve the whole problem by yourself from scratch...




Answer 4:


Look for a copy of this paper:

Name Tagging with Word Clusters and Discriminative Training

Scott Miller, Jethran Guinness, Alex Zamanian




Answer 5:


This may not be a satisfactory answer to your question, but you might want to evaluate existing service providers for the task and either include their product or integrate one via web services.

My experience is that for certain well-defined and very domain-specific tasks (for example: recognizing names of medicaments within Wikipedia web pages) you can manually build NER solutions. LingPipe, OpenNLP, etc. are good tools for this.

But for generic tasks (for example: find person names in any web page on the internet), you need a lot of experience, tools, and man-power to get satisfactory results. It might therefore be more effective to use an external provider. OpenCalais is a free service, for example; many commercial ones exist.



Source: https://stackoverflow.com/questions/5571519/named-entity-recognition-with-java
