Question
I have a set of text files from which I need to extract the date of birth. The code below extracts it from most of the files, but fails when the text is in the format shown below. How can I extract the DOB? The data is very inconsistent.
Data:
data="""
Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""
Code:
import re
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*',re.I)
matches=pattern.findall(data)
for match in matches:
    print(match)
Expected output:
12/23/1955
Answer 1:
import re
data="""
Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""
pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',re.I)
matches=pattern.findall(data)
for match in matches:
    print(match)
Output:
12/23/1955
9/15/1963
10/30/1970
Explanation:
.*? : 0 or more of any character except newline, as few as possible (non-greedy)
\b : word boundary
(?: : start of non-capturing group
DOB : literal "DOB"
| : OR
Date of birth : literal "Date of birth"
) : end of group
\b : word boundary
.*? : 0 or more of any character except newline, as few as possible (non-greedy)
( : start of capture group 1
\d{1,2} : 1 or 2 digits
[/-] : slash or dash
\d{1,2} : 1 or 2 digits
[/-] : slash or dash
(?: : start of non-capturing group
\d\d : 2 digits
){1,2} : end of group; the group may appear once or twice (i.e. 2 or 4 digits)
) : end of capture group 1
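The non-greedy .*? is what makes the first date win. As a minimal sketch of the difference (the variable names line, date, greedy and lazy are only illustrative, and the date sub-pattern is the one from this answer), compare a greedy and a non-greedy gap on the problematic line:

```python
import re

# The line from the question that contains two dates after the DOB label.
line = 'Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"'

# Same date sub-pattern as in the answer above.
date = r'(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})'

# Greedy: .* runs to the end of the line and backtracks, so the
# right-most possible date match is captured.
greedy = re.search(r'DOB.*' + date, line, re.I)

# Non-greedy: .*? expands one character at a time, so the first
# date after the label is captured.
lazy = re.search(r'DOB.*?' + date, line, re.I)

print(greedy.group(1))  # 1/15/2014
print(lazy.group(1))    # 12/23/1955
```

With the greedy gap the capture is not even the whole second date: backtracking stops at the first position from the right where the date pattern fits, which cuts off the leading digit.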
Answer 2:
import re
string = "DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
print(re.findall(r'.*?DOB.*?:\s+([\d/]+)', string))
Output:
['12/23/1955']
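Note that this pattern needs whitespace after the colon and a literal "DOB", so on the question's full sample it would skip "DOB:10/30/1970" and the "Date of birth is ..." line. A possible per-line variant is sketched below. It assumes the dates are always month/day/year, and the names DOB_RE and extract_dob are hypothetical helpers, not part of either answer. It also runs the captured text through datetime.strptime so that impossible dates are rejected.

```python
import re
from datetime import datetime

# Label, then the first date-like token after it (lazy gap).
DOB_RE = re.compile(
    r'\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',
    re.I,
)

def extract_dob(line):
    """Return the first date after a DOB label as a date, or None."""
    m = DOB_RE.search(line)
    if m is None:
        return None
    text = m.group(1).replace('-', '/')
    # Assumption: month/day/year order; pick the year format by its length.
    year = text.rsplit('/', 1)[1]
    fmt = '%m/%d/%Y' if len(year) == 4 else '%m/%d/%y'
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        # Looks like a date but is not a valid one, e.g. month 13.
        return None

for sample in ["Name:Annie; DOB:10/30/1970",
               "Jacob's Date of birth is 9/15/1963"]:
    print(extract_dob(sample))
```

Returning None for both "no label found" and "invalid date" keeps the caller simple; if the two cases need to be told apart, the function could raise or return a tuple instead.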
Source: https://stackoverflow.com/questions/51887141/unable-to-extract-date-of-birth-from-a-given-format