Human Name parsing

醉话见心 2021-01-05 13:32

I have a bunch of human names. They are all "Western" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the pe

5 answers
  • 2021-01-05 14:03

    humanparser

    Parse a human name string into salutation, first name, middle name, last name, suffix.

    Install

    npm install humanparser
    

    Usage

    var human = require('humanparser');

    // Parse a full name into its component parts.
    var fullName = 'Mr. William R. Jenkins, III';
    var attrs = human.parseName(fullName);

    console.log(attrs);

    // Produces the following output:
    
    { salutation: 'Mr.',
      firstName: 'William',
      suffix: 'III',
      lastName: 'Jenkins',
      middleName: 'R.',
      fullName: 'Mr. William R. Jenkins, III' }
    
  • 2021-01-05 14:05

    There is a Perl-based parser, Lingua::EN::NameParse, that does this type of extraction: http://search.cpan.org/~kimryan/Lingua-EN-NameParse/

    I ran your examples through it and got the results below. It only handles ordinal suffixes up to 12 (XII) and does not recognise the full stop in Ph.D, so I had to change these in your input data.

    JOHN SMITH                                John                             Smith                       
    JOHN SMITH, JR.                           John                             Smith                Jr     
    JOHN SMITH JR.                            John                             Smith                Jr     
    JOHN SMITH XII                            John                             Smith                XII    
    DR. JOHN SMITH, PHD              Dr.      John                             Smith                Phd    
    
  • 2021-01-05 14:15

    Have you tried the Ruby gem Namae?

    It should deal with most western names well and comes with a couple of configuration options for tricky scenarios (multiple last names, comma used both to separate names in a list and name parts). Having said that, it's a deterministic parser (using this grammar) and there are some cases it won't cover.

    Here is your example:

    require('namae')
    
    Namae.parse 'John Smith and John Smith, Jr. and John Smith Jr and John Smith XIV'
    #=> [
      #<Name family="Smith" given="John">,
      #<Name family="Smith" given="John" suffix="Jr.">,
      #<Name family="Smith" given="John" suffix="Jr">,
      #<Name family="Smith" given="John" suffix="XIV">
    ]
    

    It struggles with the doctor's title, but that's something we might be able to fix.

  • 2021-01-05 14:18

    Since you're limited to Western-style names, I think a few rules will get you most of the way there:

    1. If a comma appears, delete the leftmost one and everything after.
    2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
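    3. Continue removing words from the end while they belong to the set { jr phd } or are Roman numerals of value roughly 50 or less (/^[XVI]+$/ is probably a good enough regex, though it only reaches XXXIX; the anchors matter, because an unanchored /[XVI]+/ would also match a word such as SMITH).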
    4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
    5. As the surname, use the last word.
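
The five rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `parseSimpleName`, the `gender` values, the title labels, and the score table are all hypothetical, and the guard that keeps at least one word is an addition not stated in the rules.

```javascript
// Hypothetical score table; the ordering is arbitrary, as rule 2 allows.
const TITLE_SCORES = new Map([
  ['mr', 1], ['mrs', 1], ['miss', 1], ['ms', 1], ['rev', 2], ['dr', 3], ['prof', 4],
]);
// Hypothetical display forms for the titles above.
const TITLE_LABELS = new Map([
  ['mr', 'Mr.'], ['mrs', 'Mrs.'], ['miss', 'Miss'], ['ms', 'Ms.'],
  ['rev', 'Rev.'], ['dr', 'Dr.'], ['prof', 'Prof.'],
]);
const SUFFIXES = new Set(['jr', 'phd']);
const ROMAN = /^[xvi]+$/; // anchored, applied to the normalized (lowercase) word

// Lowercase a word and drop full stops, e.g. "Ph.D." becomes "phd".
function normalize(word) {
  return word.toLowerCase().replace(/\./g, '');
}

function parseSimpleName(raw, gender) {
  // Rule 1: drop the leftmost comma and everything after it.
  const commaAt = raw.indexOf(',');
  const head = commaAt === -1 ? raw : raw.slice(0, commaAt);
  const words = head.trim().split(/\s+/).filter(Boolean);

  // Rule 2: strip leading titles, remembering the highest-scoring one.
  // Assumption: always keep at least one word so a surname remains.
  let best = null;
  while (words.length > 1 && TITLE_SCORES.has(normalize(words[0]))) {
    const key = normalize(words.shift());
    if (best === null || TITLE_SCORES.get(key) > TITLE_SCORES.get(best)) {
      best = key;
    }
  }

  // Rule 3: strip trailing suffixes and Roman numerals.
  while (words.length > 1) {
    const last = normalize(words[words.length - 1]);
    if (!SUFFIXES.has(last) && !ROMAN.test(last)) break;
    words.pop();
  }

  // Rule 4: prefer the recorded title, otherwise fall back on gender.
  const fallback = gender === 'female' ? 'Mrs.' : 'Mr.';
  const title = best !== null ? TITLE_LABELS.get(best) : fallback;

  // Rule 5: the last remaining word is the surname.
  const surname = words.length > 0 ? words[words.length - 1] : '';
  return { title, surname };
}
```

For instance, `parseSimpleName('DR. JOHN SMITH, PHD', 'male')` loses ", PHD" to rule 1 and "DR." to rule 2, leaving "SMITH" as the surname with the title "Dr.".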

    It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname? Or is "Baxter" a middle name? I think it's safe to assume that middle names are relatively more common than double-barrelled-but-unhyphenated surnames, meaning it's better to default to reporting the last word as the surname. You might want to also compile a list of common double-barrelled surnames and check against this, however.
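
A minimal sketch of that lookup, assuming the name has already been split into words with titles and suffixes removed. The function name `pickSurname` and the three-entry list are hypothetical placeholders; a usable list would have to be compiled separately.

```javascript
// Hypothetical, deliberately tiny list of unhyphenated double-barrelled surnames.
const DOUBLE_BARRELLED = new Set(['lloyd george', 'bonham carter', 'vaughan williams']);

// Return the last two words if they form a known double-barrelled surname,
// otherwise default to the last word alone.
function pickSurname(words) {
  const lastTwo = words.slice(-2).join(' ');
  if (DOUBLE_BARRELLED.has(lastTwo.toLowerCase())) {
    return lastTwo;
  }
  return words.length > 0 ? words[words.length - 1] : '';
}
```

With this list, `pickSurname(['Helena', 'Bonham', 'Carter'])` gives "Bonham Carter", while `pickSurname(['John', 'Baxter', 'Smith'])` falls back to "Smith".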

  • 2021-01-05 14:19

    Look at the Lufthansa page. They ask users which title they want to use. I have never seen a better idea than that.

    I don't recommend using a gem or the like in this case, because English, Spanish, French and other languages handle gender differently, so trying to work the title out yourself is unlikely to succeed.

    I hope this helps.
