Human Name parsing

I have a bunch of human names. They are all "Western" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the pe

5 Answers

逝去的感伤

2021-01-05 14:03

humanparser

Parse a human name string into salutation, first name, middle name, last name, suffix.

Install

```
npm install humanparser
```

Usage

```
var human = require('humanparser');

// Parse a full name string into its component parts.
var fullName = 'Mr. William R. Jenkins, III';
var attrs = human.parseName(fullName);

console.log(attrs);
```

This produces the following output:

```
{ salutation: 'Mr.',
  firstName: 'William',
  suffix: 'III',
  lastName: 'Jenkins',
  middleName: 'R.',
  fullName: 'Mr. William R. Jenkins, III' }
```

野的像风

2021-01-05 14:05

There is a Perl-based parser, Lingua::EN::NameParse, that does this type of extraction: http://search.cpan.org/~kimryan/Lingua-EN-NameParse/

I ran your examples through it and got the following results. It only handles ordinal suffixes up to 12 (XII) and does not recognise the "." in "Ph.D", so I had to change this in your input data.

```
Input                  Title  Given name  Surname  Suffix
JOHN SMITH                    John        Smith
JOHN SMITH, JR.               John        Smith    Jr
JOHN SMITH JR.                John        Smith    Jr
JOHN SMITH XII                John        Smith    XII
DR. JOHN SMITH, PHD    Dr.    John        Smith    Phd
```
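
The manual change to the input mentioned above (removing the dot from "Ph.D") could also be done in a small pre-processing step before handing names to the parser. The following is only a sketch in JavaScript; `normalisePhd` is a hypothetical helper, not part of Lingua::EN::NameParse, and it assumes that the only problematic dotted suffix is "Ph.D" in its various capitalisations.

```javascript
// Hypothetical helper: rewrite dotted spellings of "Ph.D" ("PH.D.", "Ph.D",
// "Ph. D.") to an undotted form, keeping the original capitalisation.
function normalisePhd(name) {
  // \bph  : "ph" at the start of a word (case-insensitive)
  // \.\s? : the dot after "Ph", optionally followed by one space
  // d\b   : the "D", ending the word
  // \.?   : an optional trailing dot
  return name.replace(/\bph\.\s?d\b\.?/gi, (match) => match.replace(/[.\s]/g, ''));
}

console.log(normalisePhd('DR. JOHN SMITH, PH.D.'));
console.log(normalisePhd('John Smith Ph.D'));
```

Other dotted suffixes would need their own patterns; this only covers the case raised in this answer.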

悲哀的现实

2021-01-05 14:15
Have you tried the Ruby gem Namae?

It should handle most Western names well and comes with a couple of configuration options for tricky scenarios (multiple last names, or a comma used both to separate names in a list and to separate name parts). Having said that, it's a deterministic parser (using this grammar), so there are some cases it won't cover.

Here is your example:
```
require 'namae'

# Parse several names joined by "and" in a single string.
Namae.parse 'John Smith and John Smith, Jr. and John Smith Jr and John Smith XIV'
#=> [
#     #<Name family="Smith" given="John">,
#     #<Name family="Smith" given="John" suffix="Jr.">,
#     #<Name family="Smith" given="John" suffix="Jr">,
#     #<Name family="Smith" given="John" suffix="XIV">
#   ]
```
It struggles with the doctor's title, but that's something we might be able to fix.
时光说笑

2021-01-05 14:18
Since you're limited to Western-style names, I think a few rules will get you most of the way there:
1. If a comma appears, delete the leftmost one and everything after.
2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
3. Continue removing words from the end while they belong to the set { jr phd } or are Roman numerals of value roughly 50 or less (/^[XVI]+$/ is probably a good enough regex).
4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
5. As the surname, use the last word.
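
The five rules above can be sketched roughly as follows. This is a minimal illustration rather than a tested implementation: the names `TITLE_SCORES`, `SUFFIXES`, `ROMAN`, `stripDots` and `parseTitleAndSurname` are made up for the example, the scores for "miss" and "ms" are assumed, and the code always leaves at least one word behind to act as the surname. The fallback title from rule 4 is passed in by the caller.

```javascript
// Hypothetical names throughout; the scores for "miss" and "ms" are assumed.
const TITLE_SCORES = new Map([
  ['mr', 1], ['mrs', 1], ['miss', 1], ['ms', 1],
  ['rev', 2], ['dr', 3], ['prof', 4],
]);
const SUFFIXES = new Set(['jr', 'phd']);
const ROMAN = /^[XVI]+$/; // whole-word match on Roman-numeral letters (rule 3)

// Remove full stops so that "Dr." and "Dr" compare equal.
const stripDots = (word) => word.replace(/\./g, '');

function parseTitleAndSurname(fullName, defaultTitle) {
  // Rule 1: delete the leftmost comma and everything after it.
  const comma = fullName.indexOf(',');
  const head = comma === -1 ? fullName : fullName.slice(0, comma);
  const words = head.split(/\s+/).filter(Boolean);

  // Rule 2: strip leading titles, recording the highest-scoring one.
  // Assumption: always leave at least one word to serve as the surname.
  let title = null;
  let bestScore = 0;
  while (words.length > 1) {
    const key = stripDots(words[0]).toLowerCase();
    if (!TITLE_SCORES.has(key)) break;
    words.shift();
    if (TITLE_SCORES.get(key) > bestScore) {
      bestScore = TITLE_SCORES.get(key);
      title = key;
    }
  }

  // Rule 3: strip trailing suffixes and Roman numerals.
  while (words.length > 1) {
    const last = stripDots(words[words.length - 1]);
    if (!SUFFIXES.has(last.toLowerCase()) && !ROMAN.test(last)) break;
    words.pop();
  }

  // Rule 4: fall back to the caller-supplied title if none was found.
  // Rule 5: the surname is the last remaining word.
  return {
    title: title ?? defaultTitle,
    surname: words[words.length - 1] ?? '',
  };
}

console.log(parseTitleAndSurname('DR. JOHN SMITH, PHD', 'mr'));
console.log(parseTitleAndSurname('John Smith Jr.', 'mr'));
```

The Roman-numeral check is deliberately uppercase-only, so a surname such as "Xi" written in mixed case is not mistaken for a suffix. All-caps input remains ambiguous.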
It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname? Or is "Baxter" a middle name? I think it's safe to assume that middle names are relatively more common than double-barrelled-but-unhyphenated surnames, meaning it's better to default to reporting the last word as the surname. You might want to also compile a list of common double-barrelled surnames and check against this, however.
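
The lookup against a list of double-barrelled surnames might look something like the sketch below. `DOUBLE_BARRELLED` and `pickSurname` are hypothetical names, the two list entries are illustrative placeholders only, and the function assumes it receives the words left over once titles and suffixes have been stripped. It also assumes that at least one given name must remain in front of a two-word surname.

```javascript
// Hypothetical lookup list, stored in lowercase; the entries are placeholders.
const DOUBLE_BARRELLED = new Set(['lloyd webber', 'bonham carter']);

// Return the surname from the remaining name words: the last two words if
// they form a listed double-barrelled surname, otherwise just the last word.
function pickSurname(words) {
  // Assumption: require three or more words so a given name is left over.
  if (words.length >= 3) {
    const lastTwo = words.slice(-2).join(' ');
    if (DOUBLE_BARRELLED.has(lastTwo.toLowerCase())) return lastTwo;
  }
  return words[words.length - 1] ?? '';
}

console.log(pickSurname(['Andrew', 'Lloyd', 'Webber']));
console.log(pickSurname(['John', 'Baxter', 'Smith']));
```

Anything not on the list falls back to the default of treating the last word as the surname, which matches the reasoning above that middle names are the more common case.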
佛祖请我去吃肉

2021-01-05 14:19

Take a look at the Lufthansa website: they ask customers which title they want to use. I have never seen a better idea than that.

I don't recommend using a gem or anything similar in this case, because English, Spanish, French and other languages differ in how they handle gender, so if you try to work the title out yourself, you won't succeed.

I hope this helps.
