Python - finding date in a string

误落风尘 2021-02-20 12:24

I want to be able to read a string and return the first date that appears in it. Is there a ready-made module that I can use? I tried to write regexes for all possible date formats, but it

5 answers
  • 2021-02-20 12:33

    As far as I can tell, there is no such module in the Python standard library. There are so many different date formats that it's hard to catch them all. If I were you, I would turn to regular expressions. Refer to this page.
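    A minimal sketch of the regex route, assuming only three formats matter (ISO `1999-05-12`, `12 May 1999`, and `May 12, 1999`, with English month names). The pattern set and the helper name `first_date` are illustrative assumptions, not something from a library; extend the alternation for the formats you actually expect.

```python
import re
from datetime import date

# English month names mapped to their numbers (assumption: English text only).
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
MONTH_RE = "|".join(MONTHS)

# Hypothetical pattern set covering three common formats.
DATE_RE = re.compile(
    r"\b(?:"
    r"(?P<y1>\d{4})-(?P<m1>\d{2})-(?P<d1>\d{2})"                 # 1999-05-12
    r"|(?P<d2>\d{1,2})(?:st|nd|rd|th)?\s+(?P<m2>" + MONTH_RE +
    r")\s+(?P<y2>\d{4})"                                         # 12 May 1999
    r"|(?P<m3>" + MONTH_RE +
    r")\s+(?P<d3>\d{1,2})(?:st|nd|rd|th)?,?\s+(?P<y3>\d{4})"     # May 12, 1999
    r")\b",
    re.IGNORECASE,
)


def first_date(text):
    """Return the first recognised date in text as a datetime.date, or None."""
    for match in DATE_RE.finditer(text):
        g = match.groupdict()
        try:
            if g["y1"]:
                return date(int(g["y1"]), int(g["m1"]), int(g["d1"]))
            if g["y2"]:
                return date(int(g["y2"]), MONTHS[g["m2"].lower()], int(g["d2"]))
            return date(int(g["y3"]), MONTHS[g["m3"].lower()], int(g["d3"]))
        except ValueError:
            # Looks like a date but is not one (e.g. 31 February); keep scanning.
            continue
    return None


print(first_date("born on 12 May 1999. Invasion in 2009. On 1st July 2058"))  # 1999-05-12
```

    Because the pattern requires a day, a month and a year together, a bare year such as "2009" is skipped rather than merged with the words that follow it.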

  • 2021-02-20 12:37

    You can run a date parser on all subtexts of your text and pick the first date. Of course, such a solution will either catch things that are not dates or miss things that are, or most likely both.

    Let me provide an example that uses dateutil.parser to catch anything that looks like a date:

    import re
    from itertools import chain

    import dateutil.parser

    # Tokens the parser should not be handed on their own.
    # Add more strings that confuse the parser to this set.
    UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                              dateutil.parser.parserinfo.PERTAIN,
                              ['a']))

    def _get_date(tokens):
        # Try the longest prefix first and shrink it until one parses.
        for end in range(len(tokens), 0, -1):
            region = tokens[:end]
            if all(token.isspace() or token in UNINTERESTING
                   for token in region):
                continue
            text = ''.join(region)
            try:
                date = dateutil.parser.parse(text)
                return end, date
            except (ValueError, OverflowError):
                pass

    def find_dates(text, max_tokens=50, allow_overlapping=False):
        # filter() is lazy in Python 3, so build a list for len() and slicing.
        tokens = list(filter(None, re.split(r'(\S+|\W+)', text)))
        skip_dates_ending_before = 0
        for start in range(len(tokens)):
            region = tokens[start:start + max_tokens]
            result = _get_date(region)
            if result is not None:
                # Note: end is an index into region, i.e. relative to start.
                end, date = result
                if allow_overlapping or end > skip_dates_ending_before:
                    skip_dates_ending_before = end
                    yield date


    test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a 
    child during the Daleks' abduction and invasion of Earth in 2009. 
    On 1st July 2058, Bowie Base One became the first Human colony on Mars. It 
    was commanded by Captain Adelaide Brooke, and initially seemed to prove that 
    it was possible for Humans to live long term on Mars."""

    print("With no overlapping:")
    for date in find_dates(test, allow_overlapping=False):
        print(date)


    print("With overlapping:")
    for date in find_dates(test, allow_overlapping=True):
        print(date)
    

    The result from the code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that are nowhere to be seen in the text, and if it is not allowed, you miss the important date in the text.

    With no overlapping:
    1999-05-12 00:00:00
    2009-07-01 20:58:00
    With overlapping:
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-03 00:00:00
    1999-05-03 00:00:00
    1999-07-03 00:00:00
    1999-07-03 00:00:00
    2009-07-01 20:58:00
    2009-07-01 20:58:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    

    Essentially, if overlapping is allowed:

    1. "12 May 1999" is parsed to 1999-05-12 00:00:00
    2. "May 1999" is parsed to 1999-05-03 00:00:00 (dateutil fills the missing day from the current date, and today is the 3rd day of the month)

    If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00 and no attempt is made to parse the date after the period.
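    The "3rd day of the month" effect comes from dateutil filling missing fields from today's date. A small sketch of how to make that deterministic with the `default` argument of `dateutil.parser.parse`; the constant `FIXED_DEFAULT` and the helper name are illustrative assumptions, and python-dateutil must be installed.

```python
from datetime import datetime

from dateutil import parser  # third-party: python-dateutil

# Any field missing from the input is copied from this datetime
# instead of from the current date.
FIXED_DEFAULT = datetime(1900, 1, 1)


def parse_with_fixed_default(text):
    """Parse text, taking missing fields from FIXED_DEFAULT rather than today."""
    return parser.parse(text, default=FIXED_DEFAULT)


print(parse_with_fixed_default("May 1999"))     # 1999-05-01 00:00:00
print(parse_with_fixed_default("12 May 1999"))  # 1999-05-12 00:00:00
```

    This does not stop partial matches such as "May 1999" from being reported, but it makes them reproducible from one day to the next.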

  • 2021-02-20 12:38

    Here, I suppose that you want to parse dates in different formats (and perhaps even languages). If you just need the datestring out of some text, use dateutil like the other commenters recommend...

    I had this task some time ago as well, and I used pyParsing to create a parser based on my requirements, although any decent parser should do. It is far easier to read, test, and debug than regular expressions.

    I do have some (although crappy) example code on my blog that aims to find date expressions in USA format and German format alike. It may not be what you need, but it's pretty well adjustable.

  • 2021-02-20 12:48

    You can also try dateutil.parser. I have not tried it myself, but I have heard some good comments about it: python-dateutil
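    For reference, a minimal sketch of what that might look like, assuming python-dateutil is installed. With `fuzzy=True`, `dateutil.parser.parse` skips tokens it does not understand; note that it builds one datetime from the whole string, so it is not guaranteed to pick the first date when the text contains several. The helper name `parse_fuzzy` is an illustrative assumption.

```python
from dateutil import parser  # third-party: python-dateutil


def parse_fuzzy(text):
    """Return a datetime built from the date-like tokens in text, or None."""
    try:
        # fuzzy=True tells the parser to ignore tokens it cannot interpret.
        return parser.parse(text, fuzzy=True)
    except (ValueError, OverflowError):
        return None


print(parse_fuzzy("Today is January 1, 2047 at 8:21:00AM"))  # 2047-01-01 08:21:00
```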

  • 2021-02-20 12:54

    I found the following very useful for converting the time to a uniform format and then searching for this format pattern:

    from datetime import datetime

    date_object = datetime.strptime('March-1-05', '%B-%d-%y')
    print(date_object.strftime("%B %d, %Y"))
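    The same idea can be extended to several input formats. This is a sketch under the assumption that you already have a candidate substring and a short list of formats to try; `CANDIDATE_FORMATS` and `normalize` are hypothetical names, and month names in `strptime` depend on the current locale.

```python
from datetime import datetime

# Hypothetical list of formats; order matters when formats are ambiguous.
CANDIDATE_FORMATS = ["%B-%d-%y", "%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]


def normalize(candidate, formats=CANDIDATE_FORMATS):
    """Return candidate as an ISO 8601 date string, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(candidate, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


print(normalize("March-1-05"))   # 2005-03-01
print(normalize("12 May 1999"))  # 1999-05-12
```

    Once every candidate is in one uniform format, searching or comparing reduces to plain string operations.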
