How do I parse the first, middle, and last name out of a fullname field with SQL?
I need to try to match up on names that are not a direct match on full name. I'd
I recommend Expresso for learning, building, and testing regular expressions. There is an old free version and a newer commercial version.
Are you sure the Full Legal Name will always include First, Middle and Last? I know people who have only one name as their Full Legal Name, and honestly I am not sure if that's their First or Last Name. :-) I also know people who have more than one First name in their legal name but no Middle name. And there are some people who have multiple Middle names.
Then there's also the order of the names in the Full Legal Name. As far as I know, in some Asian cultures the Last Name comes first in the Full Legal Name.
On a more practical note, you could split the Full Name on whitespace and treat the first token as the First name and the last token (or the only token, if there is just one name) as the Last name. This assumes that the order of the names is always the same.
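A minimal sketch of that split, written in Python rather than SQL purely for illustration. The function name `split_first_last` is made up for this example, and returning `None` for a missing part is an assumption, not something the answer specifies.

```python
def split_first_last(full_name):
    """Return (first, last) from the first and last whitespace-separated tokens."""
    # str.split() with no argument trims the ends and collapses runs of whitespace.
    tokens = (full_name or "").split()
    if not tokens:
        return (None, None)
    if len(tokens) == 1:
        # A lone token is treated as the last name, as suggested above.
        return (None, tokens[0])
    return (tokens[0], tokens[-1])


if __name__ == "__main__":
    for name in ["Neil Degrasse Tyson", "Will Smith", "Madonna"]:
        print(name, "->", split_first_last(name))
```

Anything between the first and last token is simply ignored here, so middle names and multi-word surnames still need separate handling.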
I would do this as an iterative process.
1) Dump the table to a flat file to work with.
2) Write a simple program to break up your names using a space as the separator, where the first token is the first name. If there are 3 tokens, then token 2 is the middle name and token 3 is the last name. If there are 2 tokens, then the second token is the last name. (Perl, Java, or C/C++; the language doesn't matter.)
3) Eyeball the results. Look for names that don't fit this rule.
4) Using that example, create a new rule to handle that exception...
5) Rinse and Repeat
Eventually you will get a program that fixes all your data.
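Steps 2 to 4 could look roughly like the following Python sketch. The names `parse_name` and `partition` are hypothetical, and the input is an in-memory list rather than the flat file from step 1. Any name that no rule covers goes into a separate list so it can be eyeballed and turned into the next rule.

```python
def parse_name(full_name):
    """Apply the 1/2/3-token rules from step 2; return None when no rule fits."""
    tokens = full_name.split()
    if len(tokens) == 1:
        return {"first": tokens[0], "middle": None, "last": None}
    if len(tokens) == 2:
        return {"first": tokens[0], "middle": None, "last": tokens[1]}
    if len(tokens) == 3:
        return {"first": tokens[0], "middle": tokens[1], "last": tokens[2]}
    return None  # no rule yet: leave this name for manual review


def partition(names):
    """Split names into parsed rows and exceptions that need a new rule."""
    parsed, exceptions = [], []
    for name in names:
        result = parse_name(name)
        if result is None:
            exceptions.append(name)
        else:
            parsed.append(result)
    return parsed, exceptions


if __name__ == "__main__":
    sample = ["GEORGE W BUSH", "ALEXANDER HAMILTON", "TOMMY",
              "MARTIN J VAN BUREN SENIOR III"]
    parsed, exceptions = partition(sample)
    print("parsed:", parsed)
    print("needs a new rule:", exceptions)
```

Each pass through the loop in step 5 would add one more branch to `parse_name` and shrink the exceptions list.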
Here is a self-contained example, with easily manipulated test data.
With this example, if you have a name with more than three parts, then all the "extra" stuff will get put in the LAST_NAME field. An exception is made for specific strings that are identified as "titles", such as "DR", "MRS", and "MR".
If the middle name is missing, then you just get FIRST_NAME and LAST_NAME (MIDDLE_NAME will be NULL).
You could smash it into a giant nested blob of SUBSTRINGs, but readability is hard enough as it is when you do this in SQL.
Edit-- Handle the following special cases:
1 - The NAME field is NULL
2 - The NAME field contains leading / trailing spaces
3 - The NAME field has > 1 consecutive space within the name
4 - The NAME field contains ONLY the first name
5 - Include the original full name in the final output as a separate column, for readability
6 - Handle a specific list of prefixes as a separate "title" column
SELECT
   FIRST_NAME.ORIGINAL_INPUT_DATA
  ,FIRST_NAME.TITLE
  ,FIRST_NAME.FIRST_NAME
  ,CASE WHEN 0 = CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)
        THEN NULL --no more spaces? assume rest is the last name
        ELSE SUBSTRING(
                FIRST_NAME.REST_OF_NAME
               ,1
               ,CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)-1
             )
   END AS MIDDLE_NAME
  ,SUBSTRING(
      FIRST_NAME.REST_OF_NAME
     ,1 + CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)
     ,LEN(FIRST_NAME.REST_OF_NAME)
   ) AS LAST_NAME
FROM
  (
  SELECT
     TITLE.TITLE
    ,CASE WHEN 0 = CHARINDEX(' ',TITLE.REST_OF_NAME)
          THEN TITLE.REST_OF_NAME --No space? return the whole thing
          ELSE SUBSTRING(
                  TITLE.REST_OF_NAME
                 ,1
                 ,CHARINDEX(' ',TITLE.REST_OF_NAME)-1
               )
     END AS FIRST_NAME
    ,CASE WHEN 0 = CHARINDEX(' ',TITLE.REST_OF_NAME)
          THEN NULL --no spaces @ all? then 1st name is all we have
          ELSE SUBSTRING(
                  TITLE.REST_OF_NAME
                 ,CHARINDEX(' ',TITLE.REST_OF_NAME)+1
                 ,LEN(TITLE.REST_OF_NAME)
               )
     END AS REST_OF_NAME
    ,TITLE.ORIGINAL_INPUT_DATA
  FROM
    (
    SELECT
      --if the first three characters are in this list,
      --then pull it as a "title". otherwise return NULL for title.
       CASE WHEN SUBSTRING(TEST_DATA.FULL_NAME,1,3) IN ('MR ','MS ','DR ','MRS')
            THEN LTRIM(RTRIM(SUBSTRING(TEST_DATA.FULL_NAME,1,3)))
            ELSE NULL
       END AS TITLE
      --if you change the list, don't forget to change it here, too.
      --so much for the DRY principle...
      ,CASE WHEN SUBSTRING(TEST_DATA.FULL_NAME,1,3) IN ('MR ','MS ','DR ','MRS')
            THEN LTRIM(RTRIM(SUBSTRING(TEST_DATA.FULL_NAME,4,LEN(TEST_DATA.FULL_NAME))))
            ELSE LTRIM(RTRIM(TEST_DATA.FULL_NAME))
       END AS REST_OF_NAME
      ,TEST_DATA.ORIGINAL_INPUT_DATA
    FROM
      (
      SELECT
        --trim leading & trailing spaces before trying to process
        --disallow extra spaces *within* the name
         REPLACE(REPLACE(LTRIM(RTRIM(FULL_NAME)),'  ',' '),'  ',' ') AS FULL_NAME
        ,FULL_NAME AS ORIGINAL_INPUT_DATA
      FROM
        (
        --if you use this, then replace the following
        --block with your actual table
              SELECT 'GEORGE W BUSH' AS FULL_NAME
        UNION SELECT 'SUSAN B ANTHONY' AS FULL_NAME
        UNION SELECT 'ALEXANDER HAMILTON' AS FULL_NAME
        UNION SELECT 'OSAMA BIN LADEN JR' AS FULL_NAME
        UNION SELECT 'MARTIN J VAN BUREN SENIOR III' AS FULL_NAME
        UNION SELECT 'TOMMY' AS FULL_NAME
        UNION SELECT 'BILLY' AS FULL_NAME
        UNION SELECT NULL AS FULL_NAME
        UNION SELECT ' ' AS FULL_NAME
        UNION SELECT ' JOHN JACOB SMITH' AS FULL_NAME
        UNION SELECT ' DR SANJAY GUPTA' AS FULL_NAME
        UNION SELECT 'DR JOHN S HOPKINS' AS FULL_NAME
        UNION SELECT ' MRS SUSAN ADAMS' AS FULL_NAME
        UNION SELECT ' MS AUGUSTA ADA KING ' AS FULL_NAME
        ) RAW_DATA
      ) TEST_DATA
    ) TITLE
  ) FIRST_NAME
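One caveat about special case 3: the query above cleans up inner whitespace with two nested REPLACE calls, which appear intended to swap each double space for a single one. Two passes like that only fully collapse runs of up to four spaces. The Python sketch below imitates the behavior with `str.replace` (an assumption that it matches T-SQL REPLACE here, since both substitute non-overlapping matches from left to right) and compares it with a split-and-join that handles any run length. The function names are hypothetical.

```python
def collapse_twice(s):
    """Imitate LTRIM/RTRIM followed by two nested double-space replacements."""
    return s.strip(" ").replace("  ", " ").replace("  ", " ")


def collapse_all(s):
    """Collapse every run of whitespace to one space, whatever its length."""
    return " ".join(s.split())


if __name__ == "__main__":
    four = "JOHN    SMITH"    # 4 spaces: 4 -> 2 -> 1
    five = "JOHN     SMITH"   # 5 spaces: 5 -> 3 -> 2
    print(repr(collapse_twice(four)))  # 'JOHN SMITH'
    print(repr(collapse_twice(five)))  # 'JOHN  SMITH' (still a double space)
    print(repr(collapse_all(five)))    # 'JOHN SMITH'
```

If the real data can contain longer runs, adding more nested REPLACE calls or cleaning the column before parsing would be needed.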
The work by @JosephStyons and @Digs is great! I used parts of their work to create a new function for SQL Server 2016 and newer. This one also handles suffixes, as well as prefixes.
CREATE FUNCTION [dbo].[NameParser]
(
    @name nvarchar(100)
)
RETURNS TABLE
AS
RETURN (
    WITH prep AS (
        SELECT
            original = @name,
            cleanName = REPLACE(REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(@name)),'  ',' '),'  ',' '), '.', ''), ',', '')
    )
    SELECT
        prep.original,
        aux.prefix,
        firstName.firstName,
        middleName.middleName,
        lastName.lastName,
        aux.suffix
    FROM
        prep
        CROSS APPLY (
            SELECT
                prefix =
                    CASE
                        WHEN LEFT(prep.cleanName, 3) IN ('MR ', 'MS ', 'DR ', 'FR ')
                            THEN LEFT(prep.cleanName, 2)
                        WHEN LEFT(prep.cleanName, 4) IN ('MRS ', 'LRD ', 'SIR ')
                            THEN LEFT(prep.cleanName, 3)
                        WHEN LEFT(prep.cleanName, 5) IN ('LORD ', 'LADY ', 'MISS ', 'PROF ')
                            THEN LEFT(prep.cleanName, 4)
                        ELSE ''
                    END,
                suffix =
                    CASE
                        WHEN RIGHT(prep.cleanName, 3) IN (' JR', ' SR', ' II', ' IV')
                            THEN RIGHT(prep.cleanName, 2)
                        WHEN RIGHT(prep.cleanName, 4) IN (' III', ' ESQ')
                            THEN RIGHT(prep.cleanName, 3)
                        ELSE ''
                    END
        ) aux
        CROSS APPLY (
            SELECT
                baseName = LTRIM(RTRIM(SUBSTRING(prep.cleanName, LEN(aux.prefix) + 1, LEN(prep.cleanName) - LEN(aux.prefix) - LEN(aux.suffix)))),
                numParts = (SELECT COUNT(1) FROM STRING_SPLIT(LTRIM(RTRIM(SUBSTRING(prep.cleanName, LEN(aux.prefix) + 1, LEN(prep.cleanName) - LEN(aux.prefix) - LEN(aux.suffix)))), ' '))
        ) core
        CROSS APPLY (
            SELECT
                firstName =
                    CASE
                        WHEN core.numParts <= 1 THEN core.baseName
                        ELSE LEFT(core.baseName, CHARINDEX(' ', core.baseName, 1) - 1)
                    END
        ) firstName
        CROSS APPLY (
            SELECT
                remainder =
                    CASE
                        WHEN core.numParts <= 1 THEN ''
                        ELSE LTRIM(SUBSTRING(core.baseName, LEN(firstName.firstName) + 1, 999999))
                    END
        ) work1
        CROSS APPLY (
            SELECT
                middleName =
                    CASE
                        WHEN core.numParts <= 2 THEN ''
                        ELSE LEFT(work1.remainder, CHARINDEX(' ', work1.remainder, 1) - 1)
                    END
        ) middleName
        CROSS APPLY (
            SELECT
                lastName =
                    CASE
                        WHEN core.numParts <= 1 THEN ''
                        ELSE LTRIM(SUBSTRING(work1.remainder, LEN(middleName.middleName) + 1, 999999))
                    END
        ) lastName
)
GO
SELECT * FROM dbo.NameParser('Madonna')
SELECT * FROM dbo.NameParser('Will Smith')
SELECT * FROM dbo.NameParser('Neil Degrasse Tyson')
SELECT * FROM dbo.NameParser('Dr. Neil Degrasse Tyson')
SELECT * FROM dbo.NameParser('Mr. Hyde')
SELECT * FROM dbo.NameParser('Mrs. Thurston Howell, III')
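To reason about what the sample calls above should return, the same prefix and suffix rules can be mirrored outside the database. This is a rough Python sketch, not a guaranteed line-for-line equivalent of the T-SQL function: it works on whitespace tokens instead of LEFT/RIGHT, and the name `parse_name_parts` is hypothetical. The prefix and suffix lists are copied from the function.

```python
PREFIXES = {"MR", "MS", "DR", "FR", "MRS", "LRD", "SIR",
            "LORD", "LADY", "MISS", "PROF"}
SUFFIXES = {"JR", "SR", "II", "IV", "III", "ESQ"}


def parse_name_parts(name):
    """Split a name into prefix, first, middle, last and suffix."""
    # Drop periods and commas, then split on any run of whitespace.
    tokens = name.replace(".", "").replace(",", "").split()
    prefix = suffix = ""
    # Only treat a token as prefix/suffix when something else remains.
    if len(tokens) > 1 and tokens[0].upper() in PREFIXES:
        prefix = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1].upper() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    middle = tokens[1] if len(tokens) >= 3 else ""
    if len(tokens) >= 3:
        last = " ".join(tokens[2:])  # extra parts all go to the last name
    elif len(tokens) == 2:
        last = tokens[1]
    else:
        last = ""
    return {"prefix": prefix, "first": first, "middle": middle,
            "last": last, "suffix": suffix}


if __name__ == "__main__":
    for sample in ["Madonna", "Will Smith", "Dr. Neil Degrasse Tyson",
                   "Mr. Hyde", "Mrs. Thurston Howell, III"]:
        print(sample, "->", parse_name_parts(sample))
```

Note that a single remaining token, as in "Mr. Hyde", ends up as the first name rather than the last name. The T-SQL function appears to behave the same way, since it returns the whole base name as firstName when there is only one part.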