问题
I have a huge list of person's full names that I must search in a huge text.
Only part of the name may appear in the text. And it is possible to be misspelled, misstyped or abreviated. The text has no tokens, so I don't know where a person name starts in the text. And I don't if know if the name will appear or not in the text.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
- ...The candidate Barack Obama was elected the president of the United States... (incomplete)
- ...The candidate Barack Hussein was elected the president of the United States... (incomplete)
- ...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
- ...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
- ...The candidate Barack OVama was elected the president of the United States... (misstyped, B is next to V)
- ...The candidate John McCain lost the the election... (no occurrences of Obama name)
Certanily there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
回答1:
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
回答2:
Split everything on spaces removing special characters (commas, periods, etc). Then use something like soundex to handle misspellings. Or you could go with something like lucene if you need to search a lot of documents.
回答3:
What you want is a Natural Lanuage Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns than it will be easy if there are a decent number of other proper nouns mixed in than it will be more difficult. If you are writing in JAVA look at OpenNLP or C# SharpNLP. After extracting all the proper nouns you could probably use Wordnet to remove most non-name proper nouns. You may be able to use wordnet to identify subparts of names like "John" and then search the neighboring tokens to suck up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.
Using an NLP solution is the only real robust technique I have seen to similar problems. You may still have issues since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non names.
回答4:
At first blush I'm going for an indexing server. lucene, FAST or Microsoft Indexing Server.
回答5:
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
回答6:
The best way I can think of would be to define grammars in python NLTK. However it can get quite complicated for what you want.
I'd personnaly go for regular expressions while generating a list of permutations with some programming.
回答7:
Both SQL Server and Oracle have built-in SOUNDEX Functions.
Additionally there is a built-in function for SQL Server called DIFFERENCE, that can be used.
回答8:
pure old regular expression scripting will do the job.
use Ruby, it's quite fast. read lines and match words.
cheers
来源:https://stackoverflow.com/questions/338661/how-to-search-for-a-persons-name-in-a-text-heuristic