How do I parse the first, middle, and last name out of a fullname field with SQL?
I need to try to match up names that are not a direct match on the full name. I'd
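I'm not sure about SQL Server, but in PostgreSQL you could do something like this:
-- Single backslashes assume standard_conforming_strings = on (the default
-- since PostgreSQL 9.1); on older setups write the patterns as E'\\w+'.
SELECT
    -- first word
    SUBSTRING(fullname, '(\w+)') AS firstname,
    -- second word, only when there are at least three words
    SUBSTRING(fullname, '\w+\s(\w+)\s\w+') AS middle,
    -- third word if there is one, otherwise the second
    COALESCE(SUBSTRING(fullname, '\w+\s\w+\s(\w+)'), SUBSTRING(fullname, '\w+\s(\w+)')) AS lastname
FROM
    public.person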
The regular expressions could probably be a bit more concise, but you get the point. Note that this does not work for people with multi-word surnames (in the Netherlands we have this a lot, e.g. 'Jan van der Ploeg'), so I'd be very careful with the results.
The biggest problem I ran into doing this was cases like "Bob R. Smith, Jr.". The algorithm I used is posted at http://www.blackbeltcoder.com/Articles/strings/splitting-a-name-into-first-and-last-names. My code is in C#, but you could port it if you must have it in SQL.
Like #1 said, it's not trivial. Hyphenated last names, initials, double names, inverse name sequence and a variety of other anomalies can ruin your carefully crafted function.
You could use a 3rd party library (plug/disclaimer - I worked on this product):
http://www.melissadata.com/nameobject/nameobject.htm
I once made a 500 character regular expression to parse first, last and middle names from an arbitrary string. Even with that honking regex, it only got around 97% accuracy due to the complete inconsistency of the input. Still, better than nothing.
Here's a stored procedure that will put the first word found into First Name, the last word into Last Name and everything in between into Middle Name.
create procedure [dbo].[import_ParseName]
(
    @FullName nvarchar(max),
    @FirstName nvarchar(255) output,
    @MiddleName nvarchar(255) output,
    @LastName nvarchar(255) output
)
as
begin
    set @FirstName = ''
    set @MiddleName = ''
    set @LastName = ''
    set @FullName = ltrim(rtrim(isnull(@FullName, '')))

    declare @lengthOfFullName int
    declare @endOfFirstName int
    declare @beginningOfLastName int

    set @lengthOfFullName = len(@FullName)
    -- position of the first space, 0 if the name is a single word
    set @endOfFirstName = charindex(' ', @FullName)

    -- a single word (or an empty string) has nothing to split; without this
    -- guard the last-name substring below would get a negative length
    if @endOfFirstName = 0
    begin
        set @FirstName = @FullName
        return
    end

    -- position of the last space, found by searching the reversed string
    set @beginningOfLastName = @lengthOfFullName - charindex(' ', reverse(@FullName)) + 1

    set @FirstName = substring(@FullName, 1, @endOfFirstName - 1)
    -- everything between the first and the last space; empty when they are the same space
    set @MiddleName = case when @beginningOfLastName > @endOfFirstName
        then ltrim(rtrim(substring(@FullName, @endOfFirstName, @beginningOfLastName - @endOfFirstName)))
        else ''
        end
    set @LastName = substring(@FullName, @beginningOfLastName + 1, @lengthOfFullName - @beginningOfLastName)
    return
end
And here's me calling it.
DECLARE @FirstName nvarchar(255),
@MiddleName nvarchar(255),
@LastName nvarchar(255)
EXEC [dbo].[import_ParseName]
@FullName = N'Scott The Other Scott Kowalczyk',
@FirstName = @FirstName OUTPUT,
@MiddleName = @MiddleName OUTPUT,
@LastName = @LastName OUTPUT
print @FirstName
print @MiddleName
print @LastName
output:
Scott
The Other Scott
Kowalczyk
We of course all understand that there's no perfect way to solve this problem, but some solutions can get you farther than others.
In particular, it's pretty easy to go beyond simple whitespace-splitters if you just have some lists of common prefixes (Mr, Dr, Mrs, etc.), infixes (von, de, del, etc.), suffixes (Jr, III, Sr, etc.) and so on. It's also helpful if you have some lists of common first names (in various languages/cultures, if your names are diverse) so that you can guess whether a word in the middle is likely to be part of the last name or not.
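To make the list-based idea concrete, here is a minimal sketch. It is in Python rather than SQL purely to keep it short, and everything in it is an assumption for illustration: the function name parse_name, the PREFIXES/INFIXES/SUFFIXES sets (which are nowhere near complete), and the rule that the last name starts at the first infix. It does not attempt the "list of common first names" part.

```python
# Illustrative word lists only (assumptions, far from complete).
PREFIXES = {"mr", "mrs", "ms", "miss", "dr", "prof"}
INFIXES = {"von", "van", "de", "der", "den", "del", "di", "la", "le"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def _key(token):
    """Normalise a token for list lookups: lower-case, no dots or commas."""
    return token.strip(".,").lower()


def parse_name(full_name):
    """Split 'Prefix First [Middle ...] [infix ...] Last Suffix' into parts."""
    tokens = full_name.replace(",", " ").split()
    parts = {"prefix": "", "first": "", "middle": "", "last": "", "suffix": ""}

    # Peel one title off the front and one suffix off the back.
    if tokens and _key(tokens[0]) in PREFIXES:
        parts["prefix"] = tokens.pop(0)
    if len(tokens) > 1 and _key(tokens[-1]) in SUFFIXES:
        parts["suffix"] = tokens.pop()

    if not tokens:
        return parts
    parts["first"] = tokens.pop(0)
    if not tokens:
        return parts

    # The last name starts at the first infix ("van", "de", ...) if there is
    # one; otherwise it is just the final token.
    start = len(tokens) - 1
    for i, tok in enumerate(tokens[:-1]):
        if _key(tok) in INFIXES:
            start = i
            break
    parts["middle"] = " ".join(tokens[:start])
    parts["last"] = " ".join(tokens[start:])
    return parts


if __name__ == "__main__":
    for name in ("Dr. Mario Luis de Luigi Jr.", "Jan van der Ploeg", "Bob R. Smith, Jr."):
        print(name, "->", parse_name(name))
```

With these lists, 'Jan van der Ploeg' keeps 'van der Ploeg' together as the last name and "Bob R. Smith, Jr." gets 'Jr.' as a suffix rather than a last name. It will still guess wrong on plenty of real data (a first name that happens to be in the infix list, for instance), so treat it as a starting point.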
BibTeX also implements some heuristics that get you part of the way there; they're encapsulated in the Text::BibTeX::Name Perl module. Here's a quick code sample that does a reasonable job.
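use strict;
use Text::BibTeX;
use Text::BibTeX::Name;

my $name = "Dr. Mario Luis de Luigi Jr.";
# Strip a leading title (Dr, Mr, Mrs, Drs, Miss), with or without a trailing dot.
my $dr = ($name =~ s/^\s*([dm]rs?\.?|miss)\s+//i) ? $1 : '';
# Let BibTeX's name heuristics split the rest into first/von/last/jr parts.
my $n = Text::BibTeX::Name->new($name);
print join("\t", $dr, map "@{[ $n->part($_) ]}", qw(first von last jr)), "\n";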