I have a bunch of human names. They are all "Western" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the pe
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III';
var attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ saluation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
There is a Perl-based parser available for this type of extraction: http://search.cpan.org/~kimryan/Lingua-EN-NameParse/
I ran your examples through it and got the following results. It only handles ordinal suffixes up to 12 (XII), and it does not recognise the . in Ph.D, so I had to change both of these in your input data.
JOHN SMITH            ->  John Smith
JOHN SMITH, JR.       ->  John Smith Jr
JOHN SMITH JR.        ->  John Smith Jr
JOHN SMITH XII        ->  John Smith XII
DR. JOHN SMITH, PHD   ->  Dr. John Smith Phd
Have you tried the Ruby gem Namae?
It should deal with most western names well and comes with a couple of configuration options for tricky scenarios (multiple last names, comma used both to separate names in a list and name parts). Having said that, it's a deterministic parser (using this grammar) and there are some cases it won't cover.
Here is your example:
require('namae')
Namae.parse 'John Smith and John Smith, Jr. and John Smith Jr and John Smith XIV'
#=> [
#<Name family="Smith" given="John">,
#<Name family="Smith" given="John" suffix="Jr.">,
#<Name family="Smith" given="John" suffix="Jr">,
#<Name family="Smith" given="John" suffix="XIV">
]
It struggles with the doctor's title, but that's something we might be able to fix.
Since you're limited to Western-style names, I think a few rules will get you most of the way there:
Delete any leading words that appear in a list of titles, such as
{ mr mrs miss ms rev dr prof }
and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4]
-- order them however you want), record the highest-scoring title that was deleted.
Delete any trailing words that appear in a list of suffixes, such as
{ jr phd }
or are Roman numerals of value roughly 50 or less (/[XVI]+/
is probably a good enough regex).
It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname, or is "Baxter" a middle name? I think it's safe to assume that middle names are more common than unhyphenated double-barrelled surnames, so it's better to default to reporting the last word as the surname. You might also want to compile a list of common double-barrelled surnames and check against it, however.
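The rules above can be sketched in a few lines of JavaScript. This is only a rough illustration, not a tested parser: the function name parseName, the helper normalize, the field names, and the two sample double-barrelled surnames are all assumptions made for the example, and the title and suffix lists are just the ones mentioned above.

```javascript
// Illustrative lists only; the names and scores here are assumptions to extend.
const TITLE_SCORES = new Map([
  ['mr', 1], ['mrs', 1], ['miss', 1], ['ms', 1],
  ['rev', 2], ['dr', 3], ['prof', 4],
]);
const SUFFIXES = new Set(['jr', 'phd']);
const ROMAN = /^[xvi]+$/; // applied to the normalized (lowercased) word
const DOUBLE_BARRELLED = new Set(['bonham carter', 'lloyd webber']);

// Lowercase a word and drop periods and commas, so "Ph.D.," matches "phd".
function normalize(word) {
  return word.toLowerCase().replace(/[.,]/g, '');
}

function parseName(raw) {
  const words = raw.trim().split(/\s+/).filter(Boolean);

  // Strip leading titles, remembering the highest-scoring one.
  let title = null;
  let bestScore = 0;
  while (words.length > 1 && TITLE_SCORES.has(normalize(words[0]))) {
    const word = words.shift();
    const score = TITLE_SCORES.get(normalize(word));
    if (score > bestScore) {
      bestScore = score;
      title = word;
    }
  }

  // Strip trailing suffixes and short Roman numerals.
  const suffixes = [];
  while (words.length > 1) {
    const last = normalize(words[words.length - 1]);
    if (!SUFFIXES.has(last) && !ROMAN.test(last)) break;
    suffixes.unshift(words.pop().replace(/,$/, ''));
  }

  // Drop the comma that separated the surname from a suffix.
  const rest = words.map((w) => w.replace(/,$/, ''));

  // Default to a one-word surname unless the last two words are a known pair.
  let surnameLen = 1;
  const lastTwo = rest.slice(-2).join(' ').toLowerCase();
  if (rest.length > 2 && DOUBLE_BARRELLED.has(lastTwo)) surnameLen = 2;

  return {
    title,
    first: rest.length > surnameLen ? rest[0] : '',
    middle: rest.slice(1, rest.length - surnameLen).join(' '),
    last: rest.slice(-surnameLen).join(' '),
    suffix: suffixes.join(' '),
  };
}

console.log(parseName('DR. JOHN SMITH, PHD'));
console.log(parseName('John Baxter Smith'));
```

Note that the Roman-numeral check is case-insensitive here because the sample input is all upper case, so a trailing surname that happens to look like a numeral (say "Vi") would be treated as a suffix.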
Look at the Lufthansa page. They ask the customer which 'title' they want to use. I have never seen a better idea than that.
I don't recommend using a gem or similar in this case, because English, Spanish, French, ... differ in how they handle gender, so if you try to work it out yourself you won't succeed.
I hope this helps.