How can I parse multiple (unknown) date formats in Python?


I have a bunch of excel documents I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can th

Edit 1

Edit 2: taking into account JBernardo's information about '{0:0>2}'.format(day), I added a fourth solution, which appears to be the fastest.

import re
from time import perf_counter  # time.clock() was removed in Python 3.8
from datetime import datetime

iterat = 100  # number of timing iterations

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
r"""\s*  # optional whitespace
(\d+)    # Month
[-/]     # separator
(\d+)    # Day
[-/]     # separator
(?:20)?  # century (optional)
(\d+)    # years (YY)
\s*      # optional whitespace""",
re.VERBOSE)

# 1. Tim's method: normalize with the regex, then round-trip through datetime
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strptime(date, "%m/%d/%Y").strftime("%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", perf_counter() - te, 'seconds')


regx = re.compile('[-/]')

# 2. Mixing solution: Tim's regex to extract the fields, zfill to pad them
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", perf_counter() - te, 'seconds')

# 3. eyquem's method: split on the separator, pad, and prefix 2-digit years
te = perf_counter()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2) if len(y) == 2 else y)))
               for m, d, y in ndates]
print("eyquem's method", perf_counter() - te, 'seconds')

# 4. Tim's regex combined with str.format padding
te = perf_counter()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", perf_counter() - te, 'seconds')

# All four methods must produce identical output
print(fdates1 == fdates2 == fdates3 == fdates4)

Result:

number of iteration's turns : 100
Tim's method    0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format    0.0153756971906 seconds 
True

The "mixing solution" is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.

This is even more true of the solution that combines Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own method, because regx.split(date.strip()) produces a year with either 2 or 4 digits.
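The 2-or-4-digit year is the only thing standing in the way of using {:0>2} with the split-based method. One possible workaround, which is not part of the original answers, is to keep only the last two digits of the year before formatting. This is a minimal sketch that assumes every date falls in the 2000s, like the sample data; the helper name `normalize` is hypothetical.

```python
import re

regx = re.compile('[-/]')  # same separator pattern as in the benchmark


def normalize(date):
    """Hypothetical helper: return MM/DD/20YY, assuming a 21st-century date."""
    m, d, y = regx.split(date.strip())
    # y[-2:] maps both '09' and '2009' to '09', so one format string suffices
    return '{:0>2}/{:0>2}/20{:0>2}'.format(m, d, y[-2:])


dates = ['10/02/09', '9/9/2008', ' 03-07-2009']
print([normalize(d) for d in dates])
```

Slicing the year discards the century, so this would silently turn a 19xx date into 20xx; it is only appropriate when that assumption holds for the data.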


醉酒成梦

2020-12-10 13:33
The third-party module dateutil has a function, parse, that operates similarly to PHP's strtotime: you don't need to specify a particular date format, because it tries a number of its own.
```
>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0)  # month-first (American) order is the default
```
It also allows you to specify different assumptions:
- dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
- yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
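The effect of these two options can be seen by parsing the same ambiguous string three ways. This is a small sketch that assumes python-dateutil is installed; it uses the example string from the option descriptions above.

```python
from dateutil.parser import parse  # third-party: pip install python-dateutil

ambiguous = "01/05/09"

# Default: month first, year last
print(parse(ambiguous))                  # 2009-01-05 00:00:00
# dayfirst=True: the first number is the day
print(parse(ambiguous, dayfirst=True))   # 2009-05-01 00:00:00
# yearfirst=True: the first number is the year
print(parse(ambiguous, yearfirst=True))  # 2001-05-09 00:00:00
```

If the spreadsheets come from a single locale, fixing dayfirst once for the whole import is likely safer than relying on the default.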

甜味超标

2020-12-10 13:38

If you don't want to install a third-party module like dateutil:

import re
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""", 
    re.VERBOSE)
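# Normalize the separators and force a four-digit year (assumes 20xx dates)
ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
# Round-trip through datetime to validate each date and zero-pad month and day
fdates = [datetime.strptime(date, "%m/%d/%Y").strftime("%m/%d/%Y")
          for date in ndates]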

Result:

['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']
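Another standard-library approach, not used in the answer above, is to try a list of candidate strptime formats in order and keep the first one that fits. This is a minimal sketch; the `FORMATS` list and the helper name `parse_date` are assumptions to be adapted to the formats that actually occur in the data.

```python
from datetime import datetime

# Candidate formats, tried in order (an assumed list, month-first throughout)
FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%m/%d/%y", "%m-%d-%y")


def parse_date(text):
    """Hypothetical helper: return a datetime for the first format that fits."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError("no known format matches {!r}".format(text))


dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', ' 03-07-2009']
print([parse_date(d).strftime("%m/%d/%Y") for d in dates])
```

Unlike the regex, which hard-codes the century as 20, %y follows the strptime convention for two-digit years (00 to 68 become 20xx, 69 to 99 become 19xx), and unparseable input raises an error instead of passing through unchanged.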
