Human Name parsing

后端 未结 5 1035
醉话见心
醉话见心 2021-01-05 13:32

I have a bunch of human names. They are all \"Western\" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the pe

5条回答
  •  时光说笑
    2021-01-05 14:18

    Since you're limited to Western-style names, I think a few rules will get you most of the way there:

    1. If a comma appears, delete the leftmost one and everything after.
    2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
    3. Continue removing words from the end while they belong to the set { jr phd } or are Roman numerals of value roughly 50 or less (/[XVI]+/ is probably a good enough regex).
    4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
    5. As the surname, use the last word.

    It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname? Or is "Baxter" a middle name? I think it's safe to assume that middle names are relatively more common than double-barrelled-but-unhyphenated surnames, meaning it's better to default to reporting the last word as the surname. You might want to also compile a list of common double-barrelled surnames and check against this, however.

提交回复
热议问题