Unable to extract date of birth from a given format

Question

I have a set of text files from which I have to extract dates of birth. The code below works for most of the files but fails when the data is in the format shown below. How can I extract the DOB? The data is far from uniform.

Data:

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

Code:

import re
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)

Expected output:

12/23/1955

Answer 1:

import re    

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)

Output:

12/23/1955
9/15/1963
10/30/1970

Explanation:

.*?             : 0 or more of any character except newline, as few as possible (lazy)
\b              : word boundary
(?:             : start non capture group
  DOB           : literally
 |              : OR
  Date of birth : literally
)               : end group
\b              : word boundary
.*?             : 0 or more of any character except newline, as few as possible (lazy)
(               : start group 1
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    (?:         : start non capture group
        \d\d    : 2 digits
    ){1,2}      : end group; may appear once or twice (i.e. 2 or 4 digits)
)               : end capture group 1
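The same breakdown can be kept inside the pattern itself with re.VERBOSE. This is only a sketch of that idea, not part of the original answer: the names DOB_PATTERN, extract_dobs and sample are made up for illustration, and the leading .*? is dropped on the assumption that findall already scans forward on its own. Spaces in "Date of birth" are written as [ ] because verbose mode ignores bare whitespace.

```python
import re

# Hypothetical name; same logic as the pattern above, with comments inline.
DOB_PATTERN = re.compile(
    r"""
    \b(?:DOB|Date[ ]of[ ]birth)\b   # label: "DOB" or "Date of birth"
    .*?                             # lazily skip to the first date on the line
    (                               # group 1: the date
        \d{1,2} [/-]                # 1 or 2 digits, then slash or dash
        \d{1,2} [/-]                # 1 or 2 digits, then slash or dash
        (?:\d\d){1,2}               # 2 or 4 digits
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_dobs(text):
    """Return every date that follows a DOB label on the same line."""
    return DOB_PATTERN.findall(text)


# Sample input copied from the question.
sample = """
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

for dob in extract_dobs(sample):
    print(dob)
```

Because the skip after the label is lazy, the second date on the first line (11/15/2014) is never captured; only the date closest to the label is.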

Answer 2:

import re
string = "DOB/Sex:    12/23/1955            11/15/2014   11:53 AM"
re.findall(r'.*?DOB.*?:\s+([\d/]+)', string)

Output:

['12/23/1955']
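Note that this pattern appears to be tailored to the first line only: it is case-sensitive, needs a colon after "DOB", and needs at least one whitespace character before the date. A quick check against the full sample from the question (the variable names here are just for illustration) suggests it skips the other two lines:

```python
import re

# Full sample input copied from the question.
data = """
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

# Pattern from this answer, unchanged.
found = re.findall(r'.*?DOB.*?:\s+([\d/]+)', data)
print(found)
```

The second line has no literal "DOB", and the third has no whitespace between the colon and the date, so neither matches. That is fine if only the first format matters, which is what the expected output in the question shows.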

Source: https://stackoverflow.com/questions/51887141/unable-to-extract-date-of-birth-from-a-given-format

Tags

python

regex

python-3.x

data-extraction