How can I parse multiple (unknown) date formats in python?

死守一世寂寞 2020-12-10 12:50

I have a bunch of excel documents I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can th

4 Answers
  • 2020-12-10 13:31

    You can use a regex like r'(\d+)\D(\d+)\D(\d+)' with the re.findall function to get the month, day and year as a tuple.

    Then just prefix the 2-digit years with 20 or 19 and use the separator you want to join them back:

    '/'.join(the_list)

    As pointed out by Tim:

    To normalize days, just do '{0:0>2}'.format(day), and the same for months.
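Put together, the steps above might look like the following minimal sketch. The function name normalize_dates and the pivot rule for choosing between 19 and 20 are assumptions made for illustration; the answer does not say how to pick the century.

```python
import re

# Three runs of digits separated by any single non-digit character.
DATE_RE = re.compile(r'(\d+)\D(\d+)\D(\d+)')


def normalize_dates(text, pivot=50):
    """Return every month/day/year match in text as an MM/DD/YYYY string.

    Assumption: 2-digit years below `pivot` belong to the 2000s and the
    rest to the 1900s. Adjust the pivot to suit your data.
    """
    results = []
    for month, day, year in DATE_RE.findall(text):
        if len(year) == 2:
            century = '20' if int(year) < pivot else '19'
            year = century + year
        # '{0:0>2}' left-pads single-digit months and days with a zero.
        results.append('/'.join(('{0:0>2}'.format(month),
                                 '{0:0>2}'.format(day),
                                 year)))
    return results


print(normalize_dates('9/9/08 and 11-4-99'))
```

Because \D accepts any non-digit separator, this also matches strings that are not dates, so it is worth validating the result before storing it.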

  • 2020-12-10 13:33
    import re
    
    ss = '''10/02/09
    07/22/09
    09-08-2008
    9/9/2008
    11/4/2010
    03-07-2009
    09/01/2010'''
    
    
    regx = re.compile('[-/]')
    for xd in ss.splitlines():
        m,d,y = regx.split(xd)
        print(xd, '   ', '/'.join((m.zfill(2), d.zfill(2), '20' + y.zfill(2) if len(y) == 2 else y)))
    

    result

    10/02/09     10/02/2009
    07/22/09     07/22/2009
    09-08-2008     09/08/2008
    9/9/2008     09/09/2008
    11/4/2010     11/04/2010
    03-07-2009     03/07/2009
    09/01/2010     09/01/2010
    

    Edit 1

    And Edit 2: taking into account the '{0:0>2}'.format(day) tip from JBernardo, I added a fourth solution, which appears to be the fastest.

    import re
    from time import perf_counter  # time.clock was removed in Python 3.8
    from datetime import datetime

    iterat = 100

    dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
             ' 03-07-2009', '09/01/2010']

    reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""",
    re.VERBOSE)

    # 1. Tim's method: regex substitution, then a strptime/strftime round trip
    te = perf_counter()
    for i in range(iterat):
        ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
        fdates1 = [datetime.strftime(datetime.strptime(date, "%m/%d/%Y"), "%m/%d/%Y")
                   for date in ndates]
    print("Tim's method   ", perf_counter() - te, 'seconds')

    regx = re.compile('[-/]')

    # 2. Mixing solution: Tim's regex for matching, zfill for padding
    te = perf_counter()
    for i in range(iterat):
        ndates = (reobj.match(date).groups() for date in dates)
        fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
    print("mixing solution", perf_counter() - te, 'seconds')

    # 3. eyquem's method: split on the separator, zfill for padding
    te = perf_counter()
    for i in range(iterat):
        ndates = (regx.split(date.strip()) for date in dates)
        fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2) if len(y) == 2 else y)))
                   for m, d, y in ndates]
    print("eyquem's method", perf_counter() - te, 'seconds')

    # 4. Tim's regex for matching, str.format for padding
    te = perf_counter()
    for i in range(iterat):
        fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
    print("Tim + format   ", perf_counter() - te, 'seconds')

    # All four methods should produce identical lists
    print(fdates1 == fdates2 == fdates3 == fdates4)
    

    result

    number of iteration's turns : 100
    Tim's method    0.295053700959 seconds
    mixing solution 0.0459111423379 seconds
    eyquem's method 0.0192239516475 seconds
    Tim + format    0.0153756971906 seconds 
    True
    

    The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.

    That is even more true of the solution combining Tim's regex with the {:0>2} formatting. I can't combine {:0>2} with my own method because regx.split(date.strip()) produces a year with 2 OR 4 digits.
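One possible workaround for the 2-or-4-digit year is to expand the year before formatting, as in the sketch below. The helper name pad_split_date is hypothetical, and the sketch keeps this thread's assumption that every 2-digit year belongs to the 2000s. It has not been timed against the four methods above.

```python
import re

regx = re.compile('[-/]')


def pad_split_date(date):
    """Split on '-' or '/', then pad month and day with str.format."""
    m, d, y = regx.split(date.strip())
    if len(y) == 2:
        # Assumption carried over from the thread: 2-digit years are 20xx.
        y = '20' + y
    return '{:0>2}/{:0>2}/{}'.format(m, d, y)


dates = ['10/02/09', '9/9/2008', ' 03-07-2009']
print([pad_split_date(date) for date in dates])
```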

  • 2020-12-10 13:33

    The third-party module dateutil has a function parse that operates similarly to PHP's strtotime: you don't need to specify a particular date format, it just tries a bunch of its own.

    >>> from dateutil.parser import parse
    >>> parse("10/02/09", fuzzy=True)
    datetime.datetime(2009, 10, 2, 0, 0)  # defaults to the American (month-first) order
    

    It also allows you to specify different assumptions:

    • dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
    • yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
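As a short illustration of the two options described above, the sketch below parses the same ambiguous string three ways. It assumes python-dateutil is installed; the variable names are only for illustration.

```python
from datetime import datetime

from dateutil.parser import parse  # third-party: pip install python-dateutil

ambiguous = "01/05/09"

# Default parserinfo settings: month first, year last.
month_first = parse(ambiguous)
# dayfirst=True reads the first number as the day.
day_first = parse(ambiguous, dayfirst=True)
# yearfirst=True reads the first number as the (2-digit) year.
year_first = parse(ambiguous, yearfirst=True)

print(month_first.date(), day_first.date(), year_first.date())
```

For spreadsheets exported from a single locale, it is usually safer to set dayfirst explicitly than to rely on the default.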
  • 2020-12-10 13:38

    If you don't want to install a third-party module like dateutil:

    import re
    from datetime import datetime
    dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
    reobj = re.compile(
        r"""\s*  # optional whitespace
        (\d+)    # Month
        [-/]     # separator
        (\d+)    # Day
        [-/]     # separator
        (?:20)?  # century (optional)
        (\d+)    # years (YY)
        \s*      # optional whitespace""", 
        re.VERBOSE)
    ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
    fdates = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
              for date in ndates]
    

    Result:

    ['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']
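Note that the regex above hard-codes the 20 century prefix, so a year such as 1998 would not normalize correctly. A different stdlib-only approach, not the one used in this answer, is to try a short list of strptime formats in turn. The sketch below assumes month-first dates; FORMATS and normalize_date are hypothetical names.

```python
from datetime import datetime

# Assumed candidate formats, tried in order: 4-digit year, then 2-digit year.
FORMATS = ("%m/%d/%Y", "%m/%d/%y")


def normalize_date(text):
    """Return text as MM/DD/YYYY, or raise ValueError if no format fits."""
    cleaned = text.strip().replace("-", "/")
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError("unrecognized date: %r" % text)


dates = ['10/02/09', '09-08-2008', '9/9/2008', ' 03-07-2009']
print([normalize_date(date) for date in dates])
```

With %y, strptime maps 69 to 99 onto the 1900s and 00 to 68 onto the 2000s, so the century is chosen for you. Invalid dates such as month 13 raise ValueError instead of passing through silently.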
    