Parse Different Date formats: Regex

Submitted by ぐ巨炮叔叔 on 2019-12-04 05:29:15

Question


Reposting this question with specifics (because the last one was flagged).

I am working on parsing messy Tesseract OCR output from archive cards, aiming to get at least 50% of the info (date1). The data rows contain dates in different forms, as in the data sample below.

Raw_Text
1   "15957-8 . 3n v g - vw, 1 ekresta . bowker, william e tley n0 .qu v- l. c. 
    s. peteris, forestville, n. y. .mafae date1 june 17,1942 by davis, c. j6 
    l. g. b. jonnis, buffalo, n. y. ngsted decl 17, 1949.3y 7 davis, c. j. 
    date3 by j date4 - by date5 by 6 -.5/, 7/19/l date6 17 jul 1916 salamanca. 
    hf date7 31 dec 1986 buffalo, new york "
2   ".1o2o83n5ddn.. -i ekresta i bowles, albert edwin i made date1 june 9p1909 
    by parker, elm. date2 dec . 18 w date3 . by dep osed by date5 by date7mqm 
    9 ivvld wm 4144, mac, .75 076 eaqlwli "
3   "i naime bowles, charles edward made date1 may 31. 1892 by mclaren, wneoi 
    date2 may 18. 1895 by mclaren, w.e. date3 . i by date4 may 10. 1908 by 
    bip. of chicago. date5 by date7 "
4   "101 557 am l i ekrestaibowles, donald manson ..46 ohio trlnlty cathedral, 
    cleveland, ohio made date1 6/19/76 by burt, ji. h. grace , cleveland, ohio 
   date2 11 jun 77 by bp j h burt date3 . 1 .. by date4 by date5 bv m cuyahoga 
   heights, ohio date6 4/29/27 date7 240000 "
5   "227354 101 575 m68, frederick augustus st. paujjs cathedral, buffalo, 
   n.y. made date1 6/15/63 by scaife. l.i... st. thomas. modia, bath, n.y. 
   date2 1/11/611 by scaife. l.eo date3 by date4 by date5 by bradford, n.y. i 
   . 130m 6/1/18 date7 17 jun 1996 foratvme new york z4uc-xl "
6   "1 95812d ll. il ekresta bowles, harry oscar lmade date14 july 17, 190433, 
    lepnard, w.a. date2 july 25 , 1905 by leonard, w.a. i date3 by date4 by 
   date5 by g- m. /(,,/mr date7 jay /z/,. /357i l /mwi yk/maj. "
7   "5025 ,.. 2.57631 il . - . .. .1 i ekresta bowles , jedwiah hibbafd made 
    deac0n 8., i5-0i1862i13y potter, iih. date2 10. 280 1864 1 biy stevens, w. 
    b. date3 by date4 7 .30 l 1875 by date5 by date7 "
8   "30.611126 ekhq il ekresta bowles, ralph hart made date1 12. 210 i1883 by 
    iwiiiliams, i36 date2 7.. 1. 1885 by williams , j. date3 by i date4 by 
    date5 by g .97) l/am 9- date7 10. 4. 1900 (78) if x/ma 3.4, 154.47.11.73. 
    4,... mya-ix "
9   "2.25678 . 1o14593 ekresta bowles, robert brigham, jr. st. matthew s 
    cathedra1,da11quexas made date1 6/18/65 by mason, c. a. 57 mmzws camp 
    dr7///9s tams date2 12 21 cs by 14.45.42 c a date3 i by date4 by date5 , 
    by houston, texas date6 4/11/30 date7 12 dec 2000 dallas texas 2400-xi "
10  "101 619 34hq woe ekresta bowlin1 howard bruce cathedral modia of saint 
    peter 61 st. paul, washin ton, dc made date1 13 jun 92 bybp r h haines 
   (wdc st. alban1s modia, annandale, vir inia . pdumd 16 jan 93 by r h halnes 
    (wdc) date3 by atas by date4 v by date5 by date6 31 aug 1946 e st. louis. 
   il date7 2400-i "
11  "w k8 8km tm boiling jack dnnmwm q- f grace ch , made dat j 11201). salem 
    mares. stverrett. f. ,w a x st. johms modia. memphis, tenh. date1 apr. 25. 
    1955 - bv barth, t.in.. date3 4 by date4 by date5 by date7 wq iw r 1 w .n 
    . 4.1- 1 date6z1l7i1c. "

I parse date1 in a two-step process:

  1. Extract the text between "date1" and "by".
  2. Use a date parser to extract the actual date from that text.

import re
import dateutil.parser as dparser
for lines in Raw_Text:
    lines = lines.lower() #make lower case
    lines = lines.strip() #remove leading and ending spaces
    lines = " ".join(lines.split()) #remove duplicated spaces



    # Step 1
    #Extract data between "date1" and "by"
    deacondt = re.findall(r'date1(.*?)by',lines)

    deacondt = ''.join(deacondt)  #Convert list to a string


    # Step 2
    # use dateutil to parse dates in extracted data

    try:
        deacondt1 = dparser.parse(deacondt)
    except (ValueError, OverflowError):  # raised by dateutil when the string cannot be parsed
        deacondt1 = 'NA'

    print(deacondt1)

The output of step 1 is:

[' june 17,1942 ']
[' june 9p1909 ']
[' may 31. 1892 ']
[' 6/19/76 ']
[' 6/15/63 ']
['4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ']
[]
[' 12. 210 i1883 ']
[' 6/18/65 ']
[' 13 jun 92 ']
[]

Step 2 returns the following output:

2018-06-17 00:00:00
1909-06-17 21:00:00
1892-05-31 00:00:00
1976-06-19 00:00:00
2063-06-15 00:00:00
NA
NA
NA
2065-06-18 00:00:00
1992-06-13 00:00:00
NA

Step 2 fails to return all of the dates, and some of the dates it does return are wrong. Is there a better date parser for Python 2.7 than "dateutil.parser"?


Answer 1:


You can try this:

deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
  • fuzzy=True lets the parser skip tokens that are not part of a date, as in “Today is January 1, 2047 at 8:21:00AM”.
  • dayfirst=False tells the parser to read ambiguous numeric dates month-first, which matches your input.
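As a rough illustration of the two options, the sketch below calls dateutil on the sentence quoted above and on an ambiguous numeric date. The string "4/5/1976" is a made-up example, not taken from the cards.

```python
import dateutil.parser as dparser

# fuzzy=True: words that are not part of a date are skipped.
fuzzy_result = dparser.parse("Today is January 1, 2047 at 8:21:00AM", fuzzy=True)
print(fuzzy_result)

# dayfirst only matters when both leading numbers could be a day or a month.
# "4/5/1976" is a hypothetical input chosen to be ambiguous.
month_first = dparser.parse("4/5/1976", dayfirst=False)  # read as April 5
day_first = dparser.parse("4/5/1976", dayfirst=True)     # read as 4 May
print(month_first.date(), day_first.date())
```

For unambiguous input such as "6/19/76" the dayfirst setting makes no difference, because 19 cannot be a month.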

But these options alone are not enough for dateutil to return what you want; the string passed to the parser first has to be cleaned up so that it looks more like a date.

Regex to extract the text that follows date1:

(?s)date1\d?((?:(?!by|date2|date3).)*)

In this regex, not only 'by' but also 'date2' and 'date3' act as terminators, and date10 through date19 are treated as date1.
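To make the behaviour of that pattern concrete, here is a small sketch that runs it on two fragments copied from the sample rows. The helper name extract_date1 is only illustrative.

```python
import re

# Same pattern as above: capture everything after "date1" (plus an optional
# stray digit) up to the first "by", "date2" or "date3".
DATE1_RE = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

def extract_date1(text):
    """Return the stripped text captured after each date1 marker."""
    return [m.strip() for m in DATE1_RE.findall(text)]

row1 = "forestville, n. y. .mafae date1 june 17,1942 by davis, c. j6"
row6 = "lmade date14 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 by leonard, w.a."

print(extract_date1(row1))  # ['june 17,1942']
print(extract_date1(row6))  # ['july 17, 190433, lepnard, w.a.']
```

In the second row the capture stops at "date2" rather than running on to the next "by", which is what keeps the date2 value out of the date1 string. Note that the lookahead also stops at "by" inside a longer word, so a name containing those letters would cut the capture short.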

The extracted string is then cleaned up (leading and trailing whitespace removed, stray characters replaced, and so on) so that the dateutil parser can accept it.

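regx = re.compile(r'(?s)date1\d?((?:(?!by|date2|date3).)*)')

# Innermost sub: trim leading/trailing whitespace and drop newlines.
# Middle sub: turn whitespace runs, commas, and a stray non-digit between two digits into a space.
# Outermost sub: after whitespace, keep a 4- or 2-digit number; drop an OCR letter before it and extra digits after it.
raw_date = [
    re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1',
           re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ',
                  re.sub(r'^\s+|\s+$|\n+', '', m)))
    for m in regx.findall(Raw_Text)  # Raw_Text is assumed to be one string holding all rows
]

for deacondt in raw_date:
    try:
        deacondt1 = dparser.parse(deacondt, dayfirst=False, fuzzy=True)
    except (ValueError, OverflowError):
        deacondt1 = 'NA'

    print(deacondt + "\n" + str(deacondt1))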

Output

june 17 1942
1942-06-17 00:00:00
june 9 1909
1909-06-09 00:00:00
may 31. 1892
1892-05-31 00:00:00
6/19/76
1976-06-19 00:00:00
6/15/63
2063-06-15 00:00:00
july 17  1904  lepnard  w.a.
1904-07-17 00:00:00
12. 21 1883
1883-12-21 00:00:00
6/18/65
2065-06-18 00:00:00
13 jun 92
1992-06-13 00:00:00
apr. 25. 1955 - bv barth  t.in..
1955-04-25 00:00:00
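The nested re.sub calls in this answer are easier to follow when the intermediate strings are shown. The sketch below applies the same three substitutions one at a time to two of the date1 strings from the question; the function name clean_date_text is only illustrative.

```python
import re

def clean_date_text(raw):
    """Apply the answer's three substitutions in order and return each stage."""
    # 1. Trim leading/trailing whitespace and drop newlines.
    trimmed = re.sub(r'^\s+|\s+$|\n+', '', raw)
    # 2. Whitespace runs, commas, and a stray character between two digits become a space.
    spaced = re.sub(r'\s+|,|(?<=\d)[^\d\s\/](?=\d)', ' ', trimmed)
    # 3. After whitespace, keep 4 (or else 2) digits; drop a leading letter and extra digits.
    shortened = re.sub(r'(?i)(?<=\s)[a-z]?(\d{4}|\d{2})\d*', r'\1', spaced)
    return trimmed, spaced, shortened

print(clean_date_text(" june 9p1909 "))
# ('june 9p1909', 'june 9 1909', 'june 9 1909')
print(clean_date_text(" 12. 210 i1883 "))
# ('12. 210 i1883', '12. 210 i1883', '12. 21 1883')
```

The second substitution is what repairs "9p1909", and the third is what repairs "210 i1883". Both rest on guesses about typical OCR damage (an extra trailing digit, a stray letter glued to the year), so they can also alter strings that were read correctly.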



Answer 2:


There is no parsing module that will give you a complete solution for every OCR squiggle you might encounter.
You will have to put some evaluation and correction framework in place to discover and fix what can be fixed.

I suggest the following workflow:

  1. Try to parse date sequences.
  2. Save the sequences that could not be parsed into a separate file.
  3. Edit the file, adding regex substitution rules that rewrite each sequence into a salvageable form.
  4. Apply the rules from the file and try to parse again.
  5. Repeat from step 2 until everything is handled.

Here is some example code:

parser.py

import re
import csv
import glob, os
from datetime import datetime
import dateutil.parser as dparser

def load_patterns():
    ''' load patterns from existing pat_*.csv files
        return a dict of the form { sequence: (pattern, replace) }
        sequence is an example of the string that should be handled by this pattern
        pattern and replace have the same meaning as for re.sub
    '''
    patterns = {}
    for pattern_file in glob.glob("pat_*.csv"):
        with open(pattern_file, 'r') as fh:
            reader = csv.DictReader(fh, delimiter=',', quotechar='"', skipinitialspace=True)
            reader.fieldnames=[f.strip() for f in reader.fieldnames]
            for row in reader:
                # skipping empty patterns if there was non-empty one for this sequence
                if row['sequence'] in patterns and  not row['pattern']:
                    continue
                patterns[row['sequence']]=(row['pattern'],row['replace'])
    return patterns

def save_nonmatched(patterns, nonmatched):
    ''' saves a new pattern file with the empty pattern field
        supposed to be edited manually afterwards
    '''
    items_to_save = [ key for key in nonmatched if key not in patterns ]
    if not items_to_save:
        return

    new_file=datetime.now().strftime('pat_%Y%m%d_%H%M%S.csv')
    with open(new_file, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=['sequence', 'pattern', 'replace'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for key in items_to_save:
            writer.writerow({'sequence':key, 'pattern':'', 'replace':''})

def sub_with_patterns(s, patterns):
    ''' try each non-empty pattern from the patterns dict in turn
        return the expanded replacement for the first one that matches, else None
    '''
    for sequence, (pattern, replace) in patterns.items():
        if not pattern:
            continue
        match = re.search(pattern, s, re.X)
        if match:
            return match.expand(replace)
    return None


nomatch = {}
patterns = load_patterns()
with open('in.txt', 'r') as fh:
    # lower-case the input and collapse all whitespace runs into single spaces
    Raw_Text = re.sub(r'\s+', ' ', fh.read().lower()).strip()

for dt in re.findall(r'date1(.*?)by', Raw_Text, re.S):
    corrected = sub_with_patterns(dt, patterns)
    try:
        parsed = dparser.parse(corrected or dt)
        print("input:%s parsed:%s" % (dt, parsed))
    except (ValueError, OverflowError):
        nomatch[dt] = 1
        print("input:%s ** not parsed" % dt)

save_nonmatched(patterns, nomatch)

Now if we try the script on your input, we get the first correction CSV:

"sequence","pattern","replace"
"4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","",""
" 12. 210 i1883 ","",""
" apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""

and the output:

input: june 17,1942  parsed:2018-06-17 00:00:00
...
input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905  ** not parsed
...

We edit the file like below:

"sequence","pattern","replace"                                                    
"4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 ","^
     \s*(?P<day>\d+)
     \s+(?P<month>(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*)
     \s+(?P<year>\d{2})
    ","\g<day> \g<month> 19\g<year>"
" 12. 210 i1883 ","",""
" apr. 25. 1955 - bv barth, t.in.. date3 4 ","",""

And run the parser again:

input: june 17,1942  parsed:2018-06-17 00:00:00
...
input:4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905  parsed:1917-07-04 00:00:00
...

Of course this is very far from addressing all the OCR parsing problems you are going to have, but it might be a good start.
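The rule shown above reads the leading "4" as the day, which yields 4 July 1917. If that digit is instead OCR residue from a label such as "date14" (Answer 1 treats date10 through date19 as date1 and arrives at 1904-07-17), a different rule is needed. The sketch below shows two hypothetical rules in the same (pattern, replace) form as the CSV rows; the names MONTH, RULES and rewrite are illustrative, and the rules are guesses about what the cards intend rather than verified readings.

```python
import re

MONTH = r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*'

# Hypothetical (pattern, replace) rules; patterns are matched with re.X,
# so whitespace inside them is ignored, as in the answer's parser.py.
RULES = [
    # "4 july 17, 190433, ..." : skip one stray leading digit, keep 4 year digits
    (r'''^\s*\d?\s*(?P<month>%s)
         \s+(?P<day>\d{1,2})\s*,
         \s*(?P<year>\d{4})''' % MONTH,
     r'\g<day> \g<month> \g<year>'),
    # " 12. 210 i1883 " : month, 2-digit day with extra digits, optional letter before year
    (r'''^\s*(?P<month>\d{1,2})\.
         \s*(?P<day>\d{2})\d*
         \s+[a-z]?(?P<year>\d{4})''',
     r'\g<month>/\g<day>/\g<year>'),
]

def rewrite(sequence, rules=RULES):
    """Return the expansion of the first matching rule, or None."""
    for pattern, replace in rules:
        match = re.search(pattern, sequence, re.X)
        if match:
            return match.expand(replace)
    return None

print(rewrite("4 july 17, 190433, lepnard, w.a. date2 july 25 , 1905 "))  # 17 july 1904
print(rewrite(" 12. 210 i1883 "))                                          # 12/21/1883
```

Which reading is right can only be settled by looking at the card image, which is why the workflow keeps a human in the loop for every new rule.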




Answer 3:


Many of your dates have different formats: that's going to make things difficult.

You can use the datetime library to parse dates. Since your data has several formats, you're going to need several different format strings.

datetime has two useful functions: datetime.strptime (string PARSE time, returns datetime.datetime) and datetime.strftime (string FORMAT time, returns str).
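A minimal sketch of the two functions, using a made-up four-digit-year string and one of the date1 values from the question:

```python
from datetime import datetime

# strptime: string -> datetime, using an explicit format string.
parsed = datetime.strptime("6/19/1976", "%m/%d/%Y")
print(parsed)                        # 1976-06-19 00:00:00

# strftime: datetime -> string.
iso_text = parsed.strftime("%Y-%m-%d")
print(iso_text)                      # 1976-06-19

# A string that does not fit the format raises ValueError:
# %Y expects four digits, but the card has a two-digit year.
try:
    datetime.strptime("6/19/76", "%m/%d/%Y")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)               # True
```

That ValueError is the reason a list of candidate formats is needed: each format either matches the whole string or fails outright.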

Here's an example of how you can parse, provided you have enough format strings:

import datetime

for lines in Raw_Text:

    ## Do the regex stuff above.
    ## Keep each returned result as a separate string.
    regex_results = get_your_regex_results()  # placeholder for your own extraction step


    # Step 2
    # try each strptime format in turn on the extracted strings

    date_formats = [ ## You will need several formats to try.
        '%m/%d/%Y',
        '%m/%d/%y',   # e.g. 6/19/76
        '%d %b %y',   # e.g. 13 jun 92
    ]

    for date_str in regex_results:
        date_str = date_str.strip()

        for fmt in date_formats:
            try:
                deacondt1 = datetime.datetime.strptime(date_str, fmt)
                print(deacondt1)
                break
            except ValueError:
                continue

https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
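With two-digit years, any parser has to guess the century, and the outputs earlier in this thread show 6/15/63 coming back as 2063. strptime's %y behaves the same way for values 0 to 68. One possible workaround, sketched below, is to shift any date later than a cutoff back by 100 years. The helper fix_century and the cutoff 2019 are assumptions for illustration; the cutoff should be whatever year no card can postdate.

```python
from datetime import datetime

def fix_century(dt, latest_year):
    """Move dt back 100 years if its year is later than latest_year."""
    if dt.year > latest_year:
        return dt.replace(year=dt.year - 100)
    return dt

parsed = datetime.strptime("6/15/63", "%m/%d/%y")   # %y maps 63 to 2063
fixed = fix_century(parsed, latest_year=2019)        # 2019 is an assumed cutoff
print(parsed.year, fixed.year)                       # 2063 1963

unchanged = fix_century(datetime(1976, 6, 19), latest_year=2019)
print(unchanged.year)                                # 1976
```

This does not help with cards from the 1800s that carry two-digit years; those need context from the other dates on the same card.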



Source: https://stackoverflow.com/questions/49888234/parse-different-date-formats-regex
