Question
Reposting this question with specifics (because the last one was flagged).
I am parsing messy Tesseract OCR output from archive cards and want to recover at least 50% of the information (date1). The data rows contain dates in different forms, as shown in the data sample below.
Raw_Text
1 "15957-8 . 3n v g - vw, 1 ekresta . bowker, william e tley n0 .qu v- l. c.
s. peteris, forestville, n. y. .mafae date1 june 17,1942 by davis, c. j6
l. g. b. jonnis, buffalo, n. y. ngsted decl 17, 1949.3y 7 davis, c. j.
date3 by j date4 - by date5 by 6 -.5/, 7/19/l date6 17 jul 1916 salamanca.
hf date7 31 dec 1986 buffalo, new york "
2 ".1o2o83n5ddn.. -i ekresta i bowles, albert edwin i made date1 june 9p1909
by parker, elm. date2 dec . 18 w date3 . by dep osed by date5 by date7mqm
9 ivvld wm 4144, mac, .75 076 eaqlwli "
3 "i naime bowles, charles edward made date1 may 31. 1892 by mclaren, wneoi
date2 may 18. 1895 by mclaren, w.e. date3 . i by date4 may 10. 1908 by
bip. of chicago. date5 by date7 "
4 "101 557 am l i ekrestaibowles, donald manson ..46 ohio trlnlty cathedral,
cleveland, ohio made date1 6/19/76 by burt, ji. h. grace , cleveland, ohio
date2 11 jun 77 by bp j h burt date3 . 1 .. by date4 by date5 bv m cuyahoga
heights, ohio date6 4/29/27 date7 240000 "
5 "227354 101 575 m68, frederick augustus st. paujjs cathedral, buffalo,
n.y. made date1 6/15/63 by scaife. l.i... st. thomas. modia, bath, n.y.
date2 1/11/611 by scaife. l.eo date3 by date4 by date5 by bradford, n.y. i
. 130m 6/1/18 date7 17 jun 1996 foratvme new york z4uc-xl "
6 "1 95812d ll. il ekresta bowles, harry oscar lmade date14 july 17, 190433,
lepnard, w.a. date2 july 25 , 1905 by leonard, w.a. i date3 by date4 by
date5 by g- m. /(,,/mr date7 jay /z/,. /357i l /mwi yk/maj. "
7 "5025 ,.. 2.57631 il . - . .. .1 i ekresta bowles , jedwiah hibbafd made
deac0n 8., i5-0i1862i13y potter, iih. date2 10. 280 1864 1 biy stevens, w.
b. date3 by date4 7 .30 l 1875 by date5 by date7 "
8 "30.611126 ekhq il ekresta bowles, ralph hart made date1 12. 210 i1883 by
iwiiiliams, i36 date2 7.. 1. 1885 by williams , j. date3 by i date4 by
date5 by g .97) l/am 9- date7 10. 4. 1900 (78) if x/ma 3.4, 154.47.11.73.
4,... mya-ix "
9 "2.25678 . 1o14593 ekresta bowles, robert brigham, jr. st. matthew s
cathedra1,da11quexas made date1 6/18/65 by mason, c. a. 57 mmzws camp
dr7///9s tams date2 12 21 cs by 14.45.42 c a date3 i by date4 by date5 ,
by houston, texas date6 4/11/30 date7 12 dec 2000 dallas texas 2400-xi "
10 "101 619 34hq woe ekresta bowlin1 howard bruce cathedral modia of saint
peter 61 st. paul, washin ton, dc made date1 13 jun 92 bybp r h haines
(wdc st. alban1s modia, annandale, vir inia . pdumd 16 jan 93 by r h halnes
(wdc) date3 by atas by date4 v by date5 by date6 31 aug 1946 e st. louis.
il date7 2400-i "
11 "w k8 8km tm boiling jack dnnmwm q- f grace ch , made dat j 11201). salem
mares. stverrett. f. ,w a x st. johms modia. memphis, tenh. date1 apr. 25.
1955 - bv barth, t.in.. date3 4 by date4 by date5 by date7 wq iw r 1 w .n
. 4.1- 1 date6z1l7i1c. "
I parse date1 in two steps:
1. Extract the text between "date1" and "by".
2. Use a date parser to extract the actual date.
import re
import dateutil.parser as dparser

for lines in Raw_Text:
    lines = lines.lower()            # make lower case
    lines = lines.strip()            # remove leading and trailing spaces
    lines = " ".join(lines.split())  # collapse repeated whitespace
    # Step 1
    # Extract the text between "date1" and "by"
    deacondt = re.findall(r'date1(.*?)by', lines)
    deacondt = ''.join(deacondt)     # convert list to a string
    # Step 2
    # Use dateutil to parse dates in the extracted text
    try:
        deacondt1 = dparser.parse(deacondt)
    except:
        deacondt1 = 'NA'
    print deacondt1
The output of step 1 is:
[' june 17,1942 ']
[' june 9p1909 ']
[' may 31. 1892 ']
[' 6/19/76 ']
[' 6/15/63 ']
['4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ']
[]
[' 12. 210 i1883 ']
[' 6/18/65 ']
[' 13 jun 92 ']
[]
Step 2 returns the following output:
2018-06-17 00:00:00
1909-06-17 21:00:00
1892-05-31 00:00:00
1976-06-19 00:00:00
2063-06-15 00:00:00
NA
NA
NA
2065-06-18 00:00:00
1992-06-13 00:00:00
NA
Step 2 fails to give all dates. Is there a better date parser for Python 2.7 than "dateutil.parser"?
Answer 1:
You can try this:
deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
fuzzy allows strings that contain non-date words, like "Today is January 1, 2047 at 8:21:00AM". dayfirst=False means the input uses a month-first date format, like your strings.
On its own this is not enough for the dateutil parser to produce the output you want, so the string passed to the parser has to be brought closer to a date format first.
Regex to extract the string that follows date1:
(?s)date1\d?((?:(?!by|date2|date3).)*)
Here not only 'by' but also 'date2' and 'date3' are used as separators, and date10 through date19 are treated as date1.
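As a quick check of what this pattern captures, here is a minimal sketch that runs it on two fragments copied from the question's sample rows. The names DATE1_RE, samples and extracted are only illustrative.

```python
import re

# Same pattern as above: capture everything after "date1" (plus an optional
# stray digit) up to the next "by", "date2" or "date3".
DATE1_RE = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

# Two fragments taken from the question's sample data.
samples = [
    "made date1 june 17,1942 by davis, c. j6",
    "made date14 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 by leonard, w.a.",
]

extracted = [DATE1_RE.findall(s) for s in samples]
for s, found in zip(samples, extracted):
    print(repr(s), "->", found)
```

In the second fragment the stray "4" after "date1" is swallowed by \d?, and the capture stops at "date2" instead of running on to the next "by".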
The extracted string is then cleaned up (leading and trailing spaces removed, stray characters between digits replaced, over-long digit runs trimmed) so that it is acceptable input for the dateutil parser.
regx = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')
# Raw_Text is the whole input as a single string here
raw_date = [re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ', re.sub(r'^\s+|\s+$|\n+', '', m))) for m in regx.findall(Raw_Text)]
for deacondt in raw_date:
    try:
        deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
    except:
        deacondt1 = 'NA'
    print(deacondt + "\n" + str(deacondt1))
Output
june 17 1942
1942-06-17 00:00:00
june 9 1909
1909-06-09 00:00:00
may 31. 1892
1892-05-31 00:00:00
6/19/76
1976-06-19 00:00:00
6/15/63
2063-06-15 00:00:00
july 17 1904 lepnard w.a.
1904-07-17 00:00:00
12. 21 1883
1883-12-21 00:00:00
6/18/65
2065-06-18 00:00:00
13 jun 92
1992-06-13 00:00:00
apr. 25. 1955 - bv barth t.in..
1955-04-25 00:00:00
Answer 2:
There is no parsing module to give you the complete solution for every OCR squiggle you might encounter.
You would have to build some evaluation/correction framework in place to discover and fix what you can fix.
I suggest the following workflow:
1. Try to parse the date sequences.
2. Save the sequences that could not be parsed into a special file.
3. Edit the file and add regex substitution rules that rewrite each sequence into a salvageable form.
4. Apply the rules from the file and try to parse again.
5. Repeat from 2 until everything is handled.
Here is some example code:
parser.py
import re
import csv
import glob, os
from datetime import datetime
import dateutil.parser as dparser
def load_patterns():
    ''' load patterns from existing pat_*.csv
    return a dict of the form { sequence: (pattern, replace) }
    sequence is an example of the string that should be handled by this pattern
    pattern and replace have the same meaning as for re.sub
    '''
    patterns = {}
    for pattern_file in glob.glob("pat_*.csv"):
        with open(pattern_file, 'r') as fh:
            reader = csv.DictReader(fh, delimiter=',', quotechar='"', skipinitialspace=True)
            reader.fieldnames = [f.strip() for f in reader.fieldnames]
            for row in reader:
                # skip an empty pattern if a non-empty one was already loaded for this sequence
                if row['sequence'] in patterns and not row['pattern']:
                    continue
                patterns[row['sequence']] = (row['pattern'], row['replace'])
    return patterns
def save_nonmatched(patterns, nonmatched):
    ''' saves a new pattern file with the empty pattern field
    supposed to be edited manually afterwards
    '''
    items_to_save = [key for key in nonmatched if key not in patterns]
    if not items_to_save:
        return
    new_file = datetime.now().strftime('pat_%Y%m%d_%H%M%S.csv')
    with open(new_file, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=['sequence', 'pattern', 'replace'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for key in items_to_save:
            writer.writerow({'sequence': key, 'pattern': '', 'replace': ''})
def sub_with_patterns(s, patterns):
    ''' try to match each pattern in the patterns dict
    return the expanded replacement for the first pattern that matches, else None
    '''
    for sequence, (pattern, replace) in patterns.items():
        if not pattern:
            continue
        match = re.search(pattern, s, re.X)
        if match:
            return match.expand(replace)
    return None
nomatch = {}
patterns = load_patterns()
Raw_Text = re.sub(r'\s+', ' ', open('in.txt', 'r').read().lower()).strip()
for dt in re.findall(r'date1(.*?)by', Raw_Text, re.S):
    corrected = sub_with_patterns(dt, patterns)
    try:
        parsed = dparser.parse(corrected or dt)
        print("input:%s parsed:%s" % (dt, parsed))
    except:
        nomatch[dt] = 1
        print("input:%s ** not parsed" % (dt))
save_nonmatched(patterns, nomatch)
Now if we run the script on your input, we get the first correction CSV:
"sequence","pattern","replace"
"4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","",""
" 12. 210 i1883 ","",""
" apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""
and the output:
input: june 17,1942 parsed:2018-06-17 00:00:00
...
input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ** not parsed
...
We edit the file like below:
"sequence","pattern","replace"
"4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","^
\s*(?P<day>\d+)
\s+(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*)
\s+(?P<year>\d{2})
","\g<day> \g<month> 19\g<year>"
" 12. 210 i1883 ","",""
" apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""
And run the parser again:
input: june 17,1942 parsed:2018-06-17 00:00:00
...
input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 parsed:1917-07-04 00:00:00
...
Of course this is very far from addressing all the OCR parsing problems you are going to have, but it might be a good start.
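To see how a single rule from the edited CSV is applied, here is a small sketch that mirrors what sub_with_patterns does for one row: a verbose-mode (re.X) search followed by match.expand. The variable names are only illustrative, and the pattern and template are copied from the CSV row shown earlier.

```python
import re

# Rule copied from the edited CSV: with re.X the line breaks and indentation
# inside the pattern are ignored, so it can be spread over several lines.
pattern = r"""^
\s*(?P<day>\d+)
\s+(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*)
\s+(?P<year>\d{2})
"""
replace = r"\g<day> \g<month> 19\g<year>"

sequence = "4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 "

match = re.search(pattern, sequence, re.X)
# expand() fills the named groups into the template, as re.sub would.
corrected = match.expand(replace) if match else None
print(corrected)
```

This rule reads the stray "4" as the day and "17" as a two-digit year, which is why the parser output above shows 1917-07-04. Answer 1 reads the same text as 17 July 1904, so each rule should be checked against the actual card before it is trusted.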
Answer 3:
Many of your dates have different formats, which is going to make things difficult.
You can use the datetime library to parse dates. Since your data has several formats, you're going to need several different format strings.
datetime has two useful functions: datetime.strptime (string PARSE time, returns datetime.datetime) and datetime.strftime (string FORMAT time, returns str).
Here's an example of how you can parse, provided you have enough format strings:
import datetime

for lines in Raw_Text:
    ## Do the regex stuff above.
    ## Keep each returned result as a separate string.
    regex_results = get_your_regex_results()
    # Step 2
    # use datetime.strptime to parse dates in the extracted data
    date_formats = [  ## You will need several formats to try.
        '%m/%d/%Y',
    ]
    for date_str in regex_results:
        date_str = date_str.strip()
        for fmt in date_formats:
            try:
                deacondt1 = datetime.datetime.strptime(date_str, fmt)
                print(deacondt1)
                break
            except ValueError:
                continue
https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
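As a rough sketch of the several-formats idea using only the standard library: the format list, the LATEST_YEAR cutoff and the parse_date helper below are assumptions made for illustration, not part of the original answer. The cutoff assumes no card is dated after the year 2000, so a two-digit year that strptime maps to a later year (for example 63 becoming 2063) is moved back one century. Month-name matching with %B and %b assumes an English or C locale.

```python
from datetime import datetime

# Hypothetical format list; extend it as new layouts turn up in the OCR output.
DATE_FORMATS = ["%B %d,%Y", "%B %d. %Y", "%m/%d/%y", "%d %b %y", "%d %b %Y"]

# Assumption: no card is dated after this year.
LATEST_YEAR = 2000


def parse_date(text, formats=DATE_FORMATS, latest_year=LATEST_YEAR):
    """Return the first successful strptime result, or None if no format fits."""
    text = text.strip()
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        # Two-digit years that land after the cutoff belong to the previous century.
        if "%y" in fmt and parsed.year > latest_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    return None


for raw in [" june 17,1942 ", " 6/15/63 ", " 13 jun 92 ", " june 9p1909 "]:
    print(repr(raw), "->", parse_date(raw))
```

Strings that match none of the formats, such as " june 9p1909 ", come back as None and still need the kind of cleanup or correction rules described in the other answers.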
Source: https://stackoverflow.com/questions/49888234/parse-different-date-formats-regex