Identifying a person's name vs. a dictionary word

再見小時候 2021-02-08 02:39

Is there some way to recognize that a word is likely, or not likely, to be a person's name?

So if I see the word "understanding" I would get a probability of 0.01

3 Answers
  • 2021-02-08 02:51

    Based on just the word (or series of words that does not form a sentence), I'd say no, or at least not one that would be able to provide any more information than a "known words dictionary" lookup.

    Different locales would have different probabilities as well, and it's very much the position of the word in a sentence and the other words that signal whether it's a name or some other noun/verb.

    For example, "Word" might be a:

    1. noun - "The word on the page is blurry"
    2. verb - "I word my sentences carefully"
    3. adjective - "I like word games"
    4. proper name - "My friend Word is nice to me"

    It all depends on context and position in a sentence, and the rules for this change from language to language. Also, new names get invented regularly: next year's most popular baby name may be "Galapagos" instead of "Liam".

  • 2021-02-08 02:53

    A related task in natural language processing is known as Named Entity Recognition and deals with names of people, organizations, locations, etc.

    Most models designed to solve this problem are statistical in nature and use both context and prior knowledge in their predictions. There are a number of open-source implementations one can use, e.g. Stanford NER, which also has an online demo.

  • 2021-02-08 02:58

    My quick hack would be this:

    Get the list of names in order of popularity from the census bureau; it's freely available. Give each name a normalized popularity score (1.0 = most popular, 0.0 = least).

    Then get an open-source dictionary and do some research to pull together a frequency score for every word. You can find a frequency list at Wiktionary. Assign every word a popularity score, from 1.0 down to 0.0. The convenient thing is that if you can't find a word on the frequency list, you get to assume it's a pretty uncommon word.

    Look for a word on both lists. If it's on just one or the other, you're done. If it's on both, use a formula to compute a weighted probability... something like (Name Popularity) / (Name Popularity + Other Popularity). If it's not on either list, it's probably a name.
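
    The lookup above can be sketched roughly as follows. This is only an illustration of the idea, not a tested classifier: the function names, the toy lists, and the 0.9 returned for words on neither list are all assumptions made for the example (the answer only says such a word is "probably a name"). Real input would come from the census name list and a word-frequency list.

```python
from typing import Dict, Sequence


def normalize_ranks(ranked: Sequence[str]) -> Dict[str, float]:
    """Map a popularity-ordered list to scores from 1.0 (first) to 0.0 (last)."""
    n = len(ranked)
    if n == 1:
        return {ranked[0].lower(): 1.0}
    return {w.lower(): 1.0 - i / (n - 1) for i, w in enumerate(ranked)}


def name_probability(
    word: str,
    name_scores: Dict[str, float],
    word_scores: Dict[str, float],
    unknown: float = 0.9,  # assumption: "probably a name" for unseen words
) -> float:
    """Estimate how likely `word` is a person's name from two score tables."""
    key = word.lower()
    name = name_scores.get(key)
    other = word_scores.get(key)

    if name is None and other is None:
        return unknown   # on neither list
    if other is None:
        return 1.0       # only on the name list
    if name is None:
        return 0.0       # only on the dictionary list

    total = name + other
    if total == 0:
        return 0.5       # both scores are 0.0, so there is no evidence either way
    # (Name Popularity) / (Name Popularity + Other Popularity)
    return name / total


# Toy data, most popular first (hypothetical, for illustration only).
names = normalize_ranks(["Liam", "Rose", "Mark"])
words = normalize_ranks(["the", "mark", "rose", "understanding"])

for w in ["Liam", "Rose", "understanding", "Galapagos"]:
    print(w, round(name_probability(w, names, words), 2))
```

    Note that with this normalization the least popular entry on each list scores exactly 0.0, so a rare name that is also a dictionary word comes out as 0.0; a small floor on the scores would soften that.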
