How to search for a person's name in a text? (heuristic)

Posted by 感情迁移 on 2019-12-05 12:19:05

You said it's about 200 pages.

Divide it into 200 one-page PDFs.

Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.

Split the text on spaces and strip special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. If you need to search a lot of documents, you could go with something like Lucene instead.
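To make the tokenize-then-Soundex idea concrete, here is a minimal sketch in Python. It assumes the standard American Soundex rules and plain ASCII names; the helper names `soundex` and `find_similar` are made up for this illustration and do not come from any library.

```python
import re

# Letter-to-digit table for American Soundex.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of an ASCII word ("" if no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def find_similar(text, names):
    """Return (token, matched name part) pairs whose Soundex codes agree."""
    wanted = {soundex(part): part for name in names for part in name.split()}
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        key = soundex(token)
        if key in wanted:
            hits.append((token, wanted[key]))
    return hits
```

For example, `find_similar("We met Jon Smyth, then left.", ["John Smith"])` pairs "Jon" with "John" and "Smyth" with "Smith". Soundex is coarse, so expect false positives on short words that happen to share a code.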

What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, then it will be easy; if there are a decent number of other proper nouns mixed in, then it will be more difficult. If you are writing in Java, look at OpenNLP; for C#, look at SharpNLP. After extracting all the proper nouns you could probably use WordNet to remove most non-name proper nouns. You may also be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up the other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.

Using an NLP solution is the only really robust technique I have seen for similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
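The "find a known first name, then absorb the neighboring tokens" idea can be sketched without any NLP library. This is only a rough heuristic in Python, not what OpenNLP or WordNet would do: the seed list `FIRST_NAMES`, the `ORG_SUFFIXES` filter and the function `extract_names` are all hypothetical and would have to be tuned to your data.

```python
import re

FIRST_NAMES = {"John", "Mary", "Alice"}                # hypothetical seed list
ORG_SUFFIXES = {"Industries", "Inc", "Ltd", "Corp"}    # hypothetical filter


def extract_names(text, first_names=FIRST_NAMES):
    """Grow a candidate name rightwards from each known first name."""
    # Words and single punctuation marks are separate tokens, so a period
    # or comma ends a run of capitalized words.
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*|[^\sA-Za-z]", text)
    names = []
    i = 0
    while i < len(tokens):
        if tokens[i] in first_names:
            j = i + 1
            while j < len(tokens) and tokens[j][0].isupper():
                j += 1
            candidate = tokens[i:j]
            # Drop runs that look like a company, e.g. "John Smith Industries".
            if candidate[-1] not in ORG_SUFFIXES:
                names.append(" ".join(candidate))
            i = j
        else:
            i += 1
    return names
```

On "Yesterday John Smith met Mary. John Smith Industries declined." this returns "John Smith" and "Mary" and skips the company. It will still be fooled by a capitalized word that directly follows a name with no punctuation in between.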

At first blush I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.

I would use C# and LINQ. I'd tokenize all the words on spaces and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could find the text again in the original document, if that's a requirement.
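For readers not using C#, the same tokenize, sort, de-duplicate and keep-the-offsets approach looks roughly like this in Python. It is a sketch only: `index_tokens` is a made-up helper, and it matches word characters with a regular expression instead of splitting on spaces so that the recorded offsets point at the word itself rather than at attached punctuation.

```python
import re
from collections import defaultdict


def index_tokens(text):
    """Map each distinct word to the character offsets where it occurs."""
    positions = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        positions[match.group()].append(match.start())
    return positions


text = "Smith met Jones. Jones left."
positions = index_tokens(text)

# Sorted, distinct words (the equivalent of OrderBy plus Distinct).
distinct_words = sorted(positions)

# Offsets let you jump back to every occurrence in the original text.
jones_offsets = positions["Jones"]
```

Here `jones_offsets` holds the start of each "Jones" in `text`, so `text[offset:offset + len("Jones")]` recovers the original occurrence.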

The best way I can think of would be to define grammars in Python's NLTK. However, it can get quite complicated for what you want.

I'd personally go for regular expressions, generating the list of name permutations programmatically.
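A minimal sketch of that permutation approach in Python: build one regular expression per name that covers a few common written forms. The set of variants and the helper `name_pattern` are assumptions chosen for illustration; extend them to whatever forms actually occur in your text.

```python
import re


def name_pattern(full_name):
    """Compile a regex matching a few common written forms of a name."""
    parts = full_name.split()
    first, last = re.escape(parts[0]), re.escape(parts[-1])
    initial = re.escape(parts[0][0])
    variants = [
        rf"{first}\s+(?:[A-Z]\.\s+)?{last}",  # John Smith, John Q. Smith
        rf"{last},\s*{first}",                # Smith, John
        rf"{initial}\.\s*{last}",             # J. Smith
    ]
    return re.compile(r"\b(?:" + "|".join(variants) + r")\b", re.IGNORECASE)


pattern = name_pattern("John Smith")
sample = "John Smith, Smith, John and J. Smith, plus John Q. Smith."
matches = pattern.findall(sample)
```

With the sample above, `matches` contains all four spellings, while a string such as "Johnson Smithers" does not match because of the word boundaries.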

Both SQL Server and Oracle have built-in SOUNDEX functions.

Additionally, SQL Server has a built-in function called DIFFERENCE, which compares the SOUNDEX values of two strings and returns a similarity score from 0 to 4.

Plain old regular expression scripting will do the job.

Use Ruby; it's quite fast. Read the lines and match words.

cheers
